Downloading 500 web pages to translate

Written by Bruno Fontes, 29 Aug 2018


Last month I received a different kind of project. Actually, the client just needed an estimate of the cost and time to translate 500 web pages.

It would have been a normal project, but the client didn't have the files. Just a long list of URLs.

My first thought was: I am going to install a download manager, import the list and be done. Profit! Instead, I remembered I already had just the right tool for that at the tips of my fingers: curl or wget.

curl and wget are two of those terminal commands that are still absolutely necessary. Some download programs out there use the curl library to do the actual downloading. So why not use the father instead of installing one of its implementations? While I usually prefer curl over wget, the latter was simpler to use here, as all my URLs pointed to well-defined .html files with different names. So wget was the perfect tool for me.

There are lots of ways to download files using these tools, but I am going to stick with the simplest one here.

So, to do that, I just saved the big list of URLs as a text file, one URL per line. Any file name will do, but I chose urls to make the list easy to identify. Then I opened the terminal and typed:

wget -i urls
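
For reference, the urls file is nothing fancy. It looks something like this (the example.com addresses are just placeholders, not the client's real pages):

https://www.example.com/en/about.html
https://www.example.com/en/products.html
https://www.example.com/en/contact.html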

A few seconds later, the .html files started to appear in the same folder where I had typed the command.
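
If you would rather stick with curl, one simple way is to feed it the list with xargs. A rough sketch, using the same urls file:

xargs -n 1 curl -O < urls    # -O saves each page under its remote file name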

After downloading everything, a quick check is important to make sure all the files were downloaded. So I took the list of downloaded files and compared it with the URL list in a spreadsheet. It was no surprise to find that two files were missing. Taking a look at the spreadsheet, it was easy to identify the missing files and the reason they were missing: I had duplicated .html file names in the list.
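
If you prefer to stay in the terminal instead of going through a spreadsheet, here is a quick sketch of the same check (it assumes every URL ends in the file name, as mine did):

sed 's|.*/||' urls | sort > expected    # file names the URL list should produce
ls *.html | sort > downloaded           # file names actually on disk
comm -23 expected downloaded            # anything printed here is missing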

As there were just 2 missing files, I downloaded all 4 again (yes, both the first and the later versions; it was faster than verifying which one I had and which one had been replaced), being careful to rename them.
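
wget's -O option makes renaming part of the download itself. Something like this, with made-up URLs standing in for the real duplicated pages:

wget -O about-1.html https://www.example.com/en/about.html
wget -O about-2.html https://www.example.com/en/legal/about.html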

With all the files at hand, all that was left was to open them in the CAT tool. But there is a tricky part here:

The repetitions

When you download multiple web pages, the headers, footers, and maybe some sidebars are repeated across all the downloaded pages. This generates a large number of repetitions that do not really exist. To deal with them, you must know your CAT tool very well.

But if you are in a rush and your log file shows thousands of repetitions that do not exist, just discount them. Pretend part of them was never there in the first place and estimate from that. Just make sure the client knows it is an estimate, not the actual quote.