Using the command line to OCR a PDF file. Done in Cygwin. First, converted pages of the PDF to PPM files, which tesseract can read. Chose 300 dpi.
$ pdftoppm -r 300 pdf-filename.pdf page
The PDF is ‘pdf-filename.pdf’ and the PPM files will have names of the form ‘page-??.ppm’ since the conversion will add ‘-??.ppm’ to the given stem, where ?? is the page number.
Then run tesseract:
$ for f in *.ppm ; do tesseract "$f" "$f" ; done
So this loops over all files with ppm as the extension and runs tesseract, and just gives the file name itself as the stem of the output. That means we’ll end up with a bunch of files with names like ‘page-03.ppm.txt’. I could have used basename to chop off the ppm, but there’s just no need.
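For anyone who does want the shorter names, the basename variant might look something like the sketch below. The helper name ocr_stem is made up for illustration, and the loop is guarded so it does nothing if there are no page files or tesseract is not installed.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original commands):
# strip the .ppm extension to get the output stem.
ocr_stem() {
    basename "$1" .ppm
}

# Run tesseract on each page image, if any exist.
for f in page-*.ppm; do
    [ -e "$f" ] || continue                          # pattern matched nothing
    command -v tesseract >/dev/null 2>&1 || break    # tesseract not installed
    tesseract "$f" "$(ocr_stem "$f")"                # writes page-03.txt, not page-03.ppm.txt
done
```

The trade-off is that the output files are then plain .txt files, so they are harder to pick out from any other text files in the same folder.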
Next, combine the txt files.
$ cat *.ppm.txt > pdf-filename.txt
This shows another reason for keeping the .ppm in the file name — if I have other .txt files in the subdirectory, they will not get caught up in the cat.
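A small demonstration of that point, run in a scratch directory so nothing real is touched. All the file names and contents here are made up. The shell expands the pattern in sorted order, so zero-padded page numbers come out in page order.

```shell
#!/bin/bash
# Scratch directory with two fake OCR outputs and one unrelated text file.
demo_dir=$(mktemp -d)
printf 'first page\n'  > "$demo_dir/page-01.ppm.txt"
printf 'second page\n' > "$demo_dir/page-02.ppm.txt"
printf 'unrelated\n'   > "$demo_dir/notes.txt"

# Only the OCR output matches *.ppm.txt; notes.txt is left out.
combined=$(cd "$demo_dir" && cat *.ppm.txt)
echo "$combined"

rm -r "$demo_dir"
```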
If you are a very thorough person, you might call your final text file something like ‘pdf-filename-tesseractOCR.txt’, to preserve some information about provenance.
This OCR engine is pretty good.
Once the text file has been examined, the .ppm and .ppm.txt files are no longer needed, so
$ rm page-??.ppm*
Of course, if you are surer of what’s in the folder, you might go
$ rm *ppm*
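If you are less sure of what is in the folder, one cautious habit is to list what the pattern matches before handing the same pattern to rm. The sketch below does this in a scratch directory with made-up files, so it is safe to try.

```shell
#!/bin/bash
# Scratch directory with fake intermediate files plus the final text file.
demo_dir=$(mktemp -d)
touch "$demo_dir/page-01.ppm" "$demo_dir/page-01.ppm.txt" "$demo_dir/pdf-filename.txt"

# Preview: what would the pattern delete?
matched=$(cd "$demo_dir" && ls page-??.ppm*)
echo "$matched"

# Happy with the list, so delete using the same pattern.
(cd "$demo_dir" && rm page-??.ppm*)

# The final text file survives.
remaining=$(ls "$demo_dir")
echo "$remaining"
```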
Now, clearly this could all be wrapped up in a very simple script, something like this (script has some improvements over commands noted above):
$ cat ~/bin/OCR-pdf.sh
#!/bin/bash
# Usage: OCR-pdf.sh file.pdf
echo "1. Converting to png (limit 9999 pages or your disk space)"
gs -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -dUseCropBox -sOutputFile=ZZZZpage-%04d.png "$1" 2> /dev/null > /dev/null
# One dot per page, so the number of pages is visible
echo -n " "
for f in ZZZZpage-????.png ; do echo -n "." ; done
echo
# One star per page as the OCR proceeds
echo -n "2. Performing OCR "
for f in ZZZZpage-????.png ; do echo -n "*" ; tesseract "$f" "$f" --dpi 600 2> /dev/null > /dev/null ; done
echo " done."
prename=$(basename "$1" .pdf)
newname="$prename.txt"
echo 3. Creating text file "$newname"
cat ZZZZpage-????.png.txt > "$newname"
echo 4. Cleaning up
rm ZZZZpage-????.png ZZZZpage-????.png.txt
I’ve added a few bells and whistles. Note that all output from gs and tesseract is discarded (‘2> /dev/null’ means ‘send output stream 2 (stderr) to /dev/null’, which makes it disappear, and ‘> /dev/null’ does the same for stdout), so if something does not work these redirections should be removed.
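A halfway house between discarding everything and flooding the terminal is to send both streams to a log file that can be read if something goes wrong. This is only a sketch. The function name run_logged and the log location are my own inventions, and the sh -c command just stands in for a real gs or tesseract call.

```shell
#!/bin/bash
# Hypothetical log file; a fixed path such as ./OCR-pdf.log would do as well.
logfile=$(mktemp)

# Run any command with stdout and stderr both appended to the log.
run_logged() {
    "$@" >> "$logfile" 2>&1
}

# Stand-in for a real command that writes to both streams.
run_logged sh -c 'echo to-stdout; echo to-stderr >&2'

echo "Messages were saved in $logfile"
```

Nothing appears on screen while the command runs, but the messages are still there afterwards.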
‘echo -n’ means ‘echo but do not add a linefeed’. I have read that this does not always work, since not every shell’s echo honours -n, though in bash it should be fine; printf is the portable alternative.
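If portability matters, printf does the same job, since it never adds a linefeed unless asked to. A minimal sketch of the progress-dots idea, with a fixed list of three items standing in for the page files:

```shell
#!/bin/bash
# printf prints exactly what it is given: no trailing newline unless \n is written.
progress=$(for i in 1 2 3; do printf '.'; done; printf ' done.')
echo "$progress"
```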
This requires a working GhostScript interpreter. Other conversion paths are possible. The standard tesseract uses Leptonica, which can read ppm, png and other formats, so pdftoppm as used above works. But ppm files are big and not compressed, which is why I changed to png. I note that the gs-based version picked up some text that the pdftoppm version did not, possibly because I went up to 600 dpi, but there may be some other factor at work; I can’t say for sure.
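Another possible path, which I have not compared against the gs route: the poppler version of pdftoppm can write png directly with -png, and greyscale with -gray, which avoids the big ppm files without needing GhostScript. The wrapper name pdf_to_png and the default of 600 dpi are my own choices for illustration, and other builds of pdftoppm may not have these options.

```shell
#!/bin/bash
# Hypothetical wrapper around poppler's pdftoppm.
# $1 = PDF file, $2 = output stem, $3 = resolution in dpi (defaults to 600)
pdf_to_png() {
    pdftoppm -r "${3:-600}" -gray -png "$1" "$2"
}

# Example call (commented out so that pasting this does nothing by itself):
# pdf_to_png pdf-filename.pdf page
```

The output files should then come out with names like page-01.png, and the tesseract loop would use *.png in place of *.ppm.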

I really like these tutorials of yours. Most tech tutorials on WordPress are shit because they’re just mindlessly regurgitating knowledge that anyone can look up on W3Schools or some other site, but your tutorials actually have substance, seem well thought-out, and actually introduce me to topics that I’m not familiar with. Keep up the good work. And I’m going to be emulating some of your style in my own blog because I really think it works.
Thanks for the positive feedback. I must admit I often do these posts really as notes to self. If I get something working on a home computer and want to do it at work, or on a new machine or whatever, I can just pull up the blog. The style comes from making notes while I’m working it out. I’ve just got in the habit of making notes whenever I have a problem to solve. I’m not an expert — just a user who likes to learn. Thanks for the words of encouragement.
Interesting. So it’s not even particularly deliberate, then. I guess that’s what makes the tutorials interesting – the fact that they’re flowing naturally from the learning process instead of being specifically tailored for mass consumption.
In trying to make my own blog grow (because I’m not going to lie, my ultimate intention is to monetize it and hopefully bring in some decent ad revenue), I’ve found it difficult to tailor my tutorials to what the market wants while at the same time keeping my content original, because generally any question that a sizable number of people are searching has already been answered hundreds of times by hundreds of different people. I don’t want to just relay information that’s already all over the Internet, because that would be redundant and I wouldn’t be contributing anything new.
The best solution I’ve come up with is to just look through old code files and screenshots on my computer to see if there’s anything I’ve figured out how to do myself (as opposed to learning from a second-hand source) that other people could possibly find a use for. It’s harder to predict whether others will in fact find that knowledge useful, but at least I’ll be creating original content that isn’t redundant.
I guess the tone comes partly because I don’t care too much. Like I said, it’s partly so I can find my own solutions wherever I am. And I’m not worrying about monetising. I have not bothered to make it thematic — I post on a range of stuff, which reduces the appeal of the blog because no one reader will find all the posts relevant. But I figure most readers come via search rather than subscription anyway. Of course, whether users find what they want when they land on a page, I have no idea. I don’t get massive stats. Around 100 reads a day.
FYI, the most popular posts on the blog are listed below (most-read at the top). Hints on computing are definitely most popular.
An easy way to get files in and out of a Win98 VM in VirtualBox
Science Spammers
A little trick with tlmgr: Unknown directive …containerchecksum error
Cambridge Scholars Publishing: Spammers but not dodgy
Step by Step Install of Fedora on VirtualBox
Word madness: Can’t save, won’t save. ‘A file error has occurred’
Word: Dot leader sporadically missing from table of contents: Fixed
TeX Live on cygwin: A few tricks
ReactOS on VirtualBox: No need for step-by-step instructions
Using Mate desktop by default in cygwin
Install of EndNote X7 freezes: A few notes
My HP 200LX: More Retrotech…
Cygwin without Windows administrator access
Adding the DOI to the unsrt bibliography style in LaTeX