Automating bulk file processing
Thou shalt not modify 30 files by hand.
—Joshua Chen
The problem began with an attempt to migrate all the articles from my Weixin official account to this site. Copying and pasting the text and re-typesetting it was a pretty standard task, but things got tricky when it came to the pictures. Take this article, for example. (The localized version is here.)
So I was trying to embed the first picture. Initially, to save bandwidth and some GH pages storage space (I try to be as nice to the server as possible, although later I realized that the 200 pictures took up less than half a gigabyte), I planned to link the URL directly, something like:
export const Figure = ({children, src}) => (
  <div style={{textAlign: 'center'}}>
    <img src={src} />
    <p style={{color: 'gray', fontSize: 'small'}}>{children}</p>
  </div>
);
...
<Figure src="https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1"></Figure>
But it turned up as a broken image on the published site. (The tricky part is that it doesn't fail on all occasions; for example, it shows up fine on my localhost. I hope it fails on GH Pages so I don't look like a nut.)
It turns out that Weixin's image host doesn't like outside sites hotlinking its pictures. After some struggling, I made up my mind to host all the images locally.
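(For the curious: this smells like standard anti-hotlinking, where the CDN presumably inspects the Referer header and rejects requests arriving from other sites. A rough, non-authoritative way to poke at that from the command line is to compare the response with and without a spoofed Referer:

url='https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
# Status line without any Referer, then with a foreign one; if the host
# really filters by Referer, the two responses should differ.
curl -sI "$url" | head -n 1
curl -sI -H 'Referer: https://example.com/' "$url" | head -n 1

Whatever the exact mechanism, hosting the images myself sidesteps it entirely.)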
At this point, I already had all the images inserted in the documents as <Figure> tags. The next steps were:
- Download all the images from their URLs and put them in the correct folders (I wanted to keep some sort of structure, even though everything is looked after by the script);
- Change the references of each image to a local URL.
The first thing that came to mind was a Bash script. Sadly, my knowledge of Bash was limited to invoking command-line tools like yarn or python: no conditionals, no loops, no variables. So writing each line was a 5-minute StackOverflow search. (A huge thank-you to StackOverflow and all of the amazing contributors!)
Downloading images
The core of the script is really just one line:
wget --output-document="correct/path/file.png" 'https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
So, two problems:
- How do I generate the correct/path/file.png?
- How do I locate all the URLs?
To sketch the logic in pseudocode:
FILE[] files = enumerateFilesUnderPath("./docs/")
for file in files:
    imagePath = "./static/img/" + removeExtension(file.path)
    makeDirectory(imagePath)
    string[] links = file.findInFile(/(?<=<Figure src=").*?(?=">)/g)
    for link in links:
        imageName = makeSomeMeaningfulName(link)
        downloadImage(link, imageName, imagePath)
To begin with, I will need to enumerate all the files under the /docs folder. This is done with the find command, and the result is stored in a list (a Bash array).
doc_list=( $(find ./docs -mindepth 2) )
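Note that -mindepth 2 skips ./docs itself and its direct children, so only entries nested inside a category folder are picked up; also, this naive command-substitution capture splits on whitespace, so it quietly assumes paths without spaces. A quick way to inspect what landed in the array:

printf '%s\n' "${doc_list[@]}"   # one entry per line, e.g. ./docs/Science/cavalieri.md
echo "${#doc_list[@]}"           # how many documents were found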
Next, we traverse doc_list, which contains the path to each document. Note the weird syntax ${doc_list[@]} instead of the more intuitive ${doc_list} (referencing the doc_list variable itself), as one would expect with knowledge of JS or Python's for-each loop; in Bash, a bare ${doc_list} expands to only the first element.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    # TODO
done
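The difference is easy to see on a toy array (the names below are made up purely for illustration):

fruits=( apple banana cherry )
echo "${fruits}"      # apple  (same as ${fruits[0]})
echo "${fruits[@]}"   # apple banana cherry
for f in "${fruits[@]}"; do echo "$f"; done   # iterates three times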
Now, given the path to a doc, we need to create its corresponding assets folder inside the static directory. A doc path looks like docs/Science/cavalieri.md. We strip the .md extension with the parameter expansion ${doc%.*} (which removes the shortest suffix matching .*), prepend ./static/img/, and get the directory path ./static/img/docs/Science/cavalieri to put the images in. The folder itself is created with mkdir.
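A quick sanity check of that expansion in an interactive shell:

doc="docs/Science/cavalieri.md"
echo "${doc%.*}"               # docs/Science/cavalieri
echo "./static/img/${doc%.*}"  # ./static/img/docs/Science/cavalieri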
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    # TODO
done
Running the script at this point gives the correct folder hierarchy, just without any contents yet.
Next, we extract all the URLs from each file. Searching a file with a regex is done with the grep command. Every URL is enclosed in the pattern <Figure src="...">, so the most natural approach is regex lookbehind and lookahead. Unluckily, the grep that ships with macOS doesn't support Perl-compatible regexes, so to use the -P flag I had to install GNU grep (Homebrew's grep formula), which is exposed as ggrep. Now we can grep out all the links.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    # TODO
done
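To see the lookaround pattern in isolation, here it is applied to a single made-up line (example.com stands in for a real article URL):

echo '<Figure src="https://example.com/pic/640?wx_fmt=png"></Figure>' | ggrep -o -P '(?<=<Figure src=").*?(?=">)'
# prints: https://example.com/pic/640?wx_fmt=png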
To further extract the identifier for each image (the base-64 string in the URL path) and the extension, we run grep again on each link.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    for link in "${links[@]}"
    do
        name=$(echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)")
        ext=$(echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*")
        # Almost there!
    done
done
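Running the two patterns over the sample URL from earlier shows what ends up in name and ext; the [^&]* in the second pattern stops at the first &, so only the format code survives rather than the whole trailing query string:

link='https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)"   # the long base-64 identifier
echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*"           # png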
Lastly, it's just a wget that downloads the image from the link and saves it to the path assembled from the directory, the image name (the base-64 identifier), and the extension.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    for link in "${links[@]}"
    do
        name=$(echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)")
        ext=$(echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*")
        wget --output-document="./static/img/${doc%.*}/$name.$ext" "$link"
    done
done
And that's it! Run it, and see the cascade of outputs.
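Concretely, assuming the script is saved as download_images.sh at the site root (a filename of my own choosing):

# Run from the repository root so the ./docs and ./static paths resolve.
bash download_images.sh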
Changing URLs
After we've downloaded the images, we will change the references to local URLs. For example,
<Figure src="https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1"></Figure>
becomes
<Figure src="/img/docs/Science/cavalieri/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ.png"></Figure>
Well, given the commands introduced above, this task is pretty trivial. Modifying the content of a text file in place (a search-and-replace, essentially) is done with the command sed. Because I'm getting tired, this part is left as an exercise. You can cheat and look at the code here.
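If you don't feel like cheating, here is roughly the shape such a replacement could take inside the inner loop of the download script. This is only a sketch under my own assumptions (BSD sed as shipped with macOS, and | as the s-command delimiter so the slashes in the URL don't collide with it); the linked code may well do it differently.

# Right after the wget call.  Strip a leading "./" that find may have added,
# so the URL comes out as /img/docs/..., then swap the remote link for it.
doc_rel="${doc#./}"
local_path="/img/${doc_rel%.*}/$name.$ext"
sed -i '' "s|$link|$local_path|g" "$doc"   # BSD sed; GNU sed takes plain -i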