Automating bulk file processing
Thou shalt not modify 30 files by hand.
—Joshua Chen
The problem began with an attempt to migrate all the articles from my Weixin official account to this site. Copying and pasting the text and re-typesetting it was a pretty standard task, but things got tricky when it came to the pictures. Take this article, for example. (The localized version is here.)
So I was trying to embed the first picture. Initially, to save bandwidth and some GH pages storage space (I try to be as nice to the server as possible, although later I realized that the 200 pictures took up less than half a gigabyte), I planned to link the URL directly, something like:
export const Figure = ({children, src}) => (
  <div style={{textAlign: 'center'}}>
    <img src={src} />
    <p style={{color: 'gray', fontSize: 'small'}}>{children}</p>
  </div>
);
...
<Figure src="https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1"></Figure>
But it turned up as a broken image on the published site. (The tricky part is that it doesn't fail on all occasions; for example, it shows up fine on my localhost. I hope it fails on GH Pages so I don't look like a nut.)
It turns out that Weixin's image host doesn't like outside sites hotlinking its pictures. After some struggling, I made up my mind to host all the images locally.
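(For the curious: this smells like standard anti-hotlinking, where the CDN presumably inspects the Referer header and rejects requests arriving from other sites. A rough, non-authoritative way to poke at that from the command line is to compare the response with and without a spoofed Referer:

url='https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
# Status line without any Referer, then with a foreign one; if the host
# really filters by Referer, the two responses should differ.
curl -sI "$url" | head -n 1
curl -sI -H 'Referer: https://example.com/' "$url" | head -n 1

Whatever the exact mechanism, hosting the images myself sidesteps it entirely.)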
At this point, I already had all the images inserted in the documents as <Figure> tags. The next steps were:
- Download all the images from their URLs and put them in the correct folders (I wanted to keep some sort of structure, even though everything is looked after by the script);
- Change the references of each image to a local URL.
The first thing that came to mind was a Bash script. Sadly, my knowledge of Bash was limited to invoking command-line tools like yarn or python: no conditionals, no loops, no variables. So writing each line was a 5-minute StackOverflow search. (A huge thank-you to StackOverflow and all of the amazing contributors!)
Downloading images
The core of the script is really just one line:
wget --output-document="correct/path/file.png" 'https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
So, two problems:
- How do I generate the correct/path/file.png?
- How do I locate all the URLs?
To sketch the logic in pseudocode:
FILE[] files = enumerateFilesUnderPath("./docs/")
for file in files:
    imagePath = "./static/img/" + removeExtension(file.path)
    makeDirectory(imagePath)
    string[] links = file.findInFile(/(?<=<Figure src=").*?(?=">)/g)
    for link in links:
        imageName = makeSomeMeaningfulName(link)
        downloadImage(link, imageName, imagePath)
To begin with, I will need to enumerate all the files under the /docs folder. This is done with the find command, and the result is stored in a list (a Bash array).
doc_list=( $(find ./docs -mindepth 2) )
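Note that -mindepth 2 skips ./docs itself and its direct children, so only entries nested inside a category folder are picked up; also, this naive command-substitution capture splits on whitespace, so it quietly assumes paths without spaces. A quick way to inspect what landed in the array:

printf '%s\n' "${doc_list[@]}"   # one entry per line, e.g. ./docs/Science/cavalieri.md
echo "${#doc_list[@]}"           # how many documents were found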
Next, we traverse doc_list, which contains the path to each document. Note the weird syntax ${doc_list[@]} instead of the more intuitive ${doc_list} (referencing the doc_list variable itself), as one would expect with knowledge of JS or Python's for-each loop; in Bash, a bare ${doc_list} expands to only the first element.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    # TODO
done
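The difference is easy to see on a toy array (the names below are made up purely for illustration):

fruits=( apple banana cherry )
echo "${fruits}"      # apple  (same as ${fruits[0]})
echo "${fruits[@]}"   # apple banana cherry
for f in "${fruits[@]}"; do echo "$f"; done   # iterates three times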
Now, given the path to a doc, we need to create its corresponding assets folder inside the static directory. A doc path looks like docs/Science/cavalieri.md. We strip the .md extension with the parameter expansion ${doc%.*} (which removes the shortest suffix matching .*), prepend ./static/img/, and get the directory path ./static/img/docs/Science/cavalieri to put the images in. The folder itself is created with mkdir.
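A quick sanity check of that expansion in an interactive shell:

doc="docs/Science/cavalieri.md"
echo "${doc%.*}"               # docs/Science/cavalieri
echo "./static/img/${doc%.*}"  # ./static/img/docs/Science/cavalieri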
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    # TODO
done
Running the script at this point gives the correct folder hierarchy, just without any contents yet.
Next, we extract all the URLs from each file. Searching a file with a regex is done with the grep command. Every URL is enclosed in the pattern <Figure src="...">, so the most natural approach is regex lookbehind and lookahead. Unluckily, the grep that ships with macOS doesn't support Perl-compatible regexes, so to use the -P flag I had to install GNU grep (Homebrew's grep formula), which is exposed as ggrep. Now we can grep out all the links.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    # TODO
done
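To see the lookaround pattern in isolation, here it is applied to a single made-up line (example.com stands in for a real article URL):

echo '<Figure src="https://example.com/pic/640?wx_fmt=png"></Figure>' | ggrep -o -P '(?<=<Figure src=").*?(?=">)'
# prints: https://example.com/pic/640?wx_fmt=png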
To further extract the identifier for each image (the base-64 string in the URL path) and the extension, we run grep again on each link.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    for link in "${links[@]}"
    do
        name=$(echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)")
        ext=$(echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*")
        # Almost there!
    done
done
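Running the two patterns over the sample URL from earlier shows what ends up in name and ext; the [^&]* in the second pattern stops at the first &, so only the format code survives rather than the whole trailing query string:

link='https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1'
echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)"   # the long base-64 identifier
echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*"           # png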
Lastly, it's just a wget that downloads the image from the link and saves it to the path assembled from the directory, the image name (the base-64 identifier), and the extension.
doc_list=( $(find ./docs -mindepth 2) )
for doc in "${doc_list[@]}"
do
    mkdir -p "./static/img/${doc%.*}"
    links=( $(ggrep -o -P "(?<=<Figure src=\").*?(?=\">)" "$doc") )
    for link in "${links[@]}"
    do
        name=$(echo "$link" | ggrep -o -P "(?<=(jpg|png)/).*(?=/640)")
        ext=$(echo "$link" | ggrep -o -P "(?<=wx_fmt=)[^&]*")
        wget --output-document="./static/img/${doc%.*}/$name.$ext" "$link"
    done
done
And that's it! Run it, and see the cascade of outputs.
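Concretely, assuming the script is saved as download_images.sh at the site root (a filename of my own choosing):

# Run from the repository root so the ./docs and ./static paths resolve.
bash download_images.sh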
Changing URLs
After we've downloaded the images, we will change the references to local URLs. For example,
<Figure src="https://mmbiz.qpic.cn/mmbiz_png/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ/640?wx_fmt=png&tp=webp&wxfrom=5&wx_lazy=1&wx_co=1"></Figure>
becomes
<Figure src="/img/docs/Science/cavalieri/JGibibkelET68EfhySWuOboVia7FJX8ehwIAicTz2be2JDN7HIibwibjrpYPP1bTCr1TrjDicauU0P6BLCgFIibZK42GCQ.png"></Figure>
Well, given the commands introduced above, this task is pretty trivial. Modifying the content of a text file in place (a search-and-replace, essentially) is done with the command sed. Because I'm getting tired, this part is left as an exercise. You can cheat and look at the code here.
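If you don't feel like cheating, here is roughly the shape such a replacement could take inside the inner loop of the download script. This is only a sketch under my own assumptions (BSD sed as shipped with macOS, and | as the s-command delimiter so the slashes in the URL don't collide with it); the linked code may well do it differently.

# Right after the wget call.  Strip a leading "./" that find may have added,
# so the URL comes out as /img/docs/..., then swap the remote link for it.
doc_rel="${doc#./}"
local_path="/img/${doc_rel%.*}/$name.$ext"
sed -i '' "s|$link|$local_path|g" "$doc"   # BSD sed; GNU sed takes plain -i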