Simple use of tesseract OCR on a multipage PDF

These are notes on using the command line to OCR a PDF file, done in Cygwin. First, convert the pages of the PDF to PPM files, which tesseract can read. I chose 300 dpi.

$ pdftoppm -r 300 pdf-filename.pdf page

The PDF is ‘pdf-filename.pdf’ and the PPM files will have names of the form ‘page-??.ppm’ since the conversion will add ‘-??.ppm’ to the given stem, where ?? is the page number.

Then, run tesseract

$ for f in *.ppm ; do tesseract "$f" "$f" ; done

So this loops over all files with ppm as the extension and runs tesseract, and just gives the file name itself as the stem of the output. That means we’ll end up with a bunch of files with names like ‘page-03.ppm.txt’. I could have used basename to chop off the ppm, but there’s just no need.
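For anyone who does want the shorter names, here is a minimal sketch of the basename variant. The function names stem_of and ocr_pages are made up for illustration, and the loop assumes tesseract is on the PATH.

```shell
# Hypothetical helper: turn 'page-03.ppm' into the stem 'page-03'
stem_of() {
    basename "$1" .ppm
}

# Hypothetical wrapper: OCR every .ppm file in the current directory,
# writing page-03.txt rather than page-03.ppm.txt
ocr_pages() {
    for f in *.ppm ; do
        [ -e "$f" ] || continue      # the glob matched nothing
        tesseract "$f" "$(stem_of "$f")"
    done
}
```

The catch is that the output files then look like any other .txt file, so a later step that gathers them up needs a pattern such as page-??.txt.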

Next, combine the txt files.

$ cat *.ppm.txt > pdf-filename.txt

This shows another reason for keeping the .ppm in the file name — if I have other .txt files in the subdirectory, they will not get caught up in the cat.
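A quick way to convince yourself of this is to try it in a throwaway directory. The sketch below uses made-up file contents in place of real tesseract output; glob_demo is a hypothetical name.

```shell
# Hypothetical demo: show that *.ppm.txt ignores other .txt files.
# The ( ... ) body runs in a subshell, so the cd does not leak out.
glob_demo() (
    demo=$(mktemp -d) || exit 1
    cd "$demo" || exit 1
    echo "text of page 1" > page-01.ppm.txt   # stand-ins for OCR output
    echo "text of page 2" > page-02.ppm.txt
    echo "shopping list"  > notes.txt         # unrelated file
    cat *.ppm.txt > combined.txt              # glob expands in sorted order
    cat combined.txt
    cd / && rm -r "$demo"
)

glob_demo
```

This prints the two page lines in page order and nothing from notes.txt.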

If you are a very thorough person, you might call your final text file something like ‘pdf-filename-tesseractOCR.txt’ or something, to preserve some information about provenance.

This OCR engine is pretty good.

Once the text file has been examined, the .ppm and .ppm.txt files are no longer needed, so

$ rm page-??.ppm*

Of course, if you are surer of what’s in the folder, you might go

$ rm *ppm*
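A middle path is to name what gets deleted as it goes. This is only a sketch: clean_pages is a hypothetical helper, and the looser page-*.ppm pattern is an assumption to cover page numbers that are not exactly two digits wide, so check it against the names in your own folder.

```shell
# Hypothetical helper: report each OCR leftover, then delete it.
# Only files named page-<something>.ppm or .ppm.txt are touched.
clean_pages() {
    for f in page-*.ppm page-*.ppm.txt ; do
        [ -e "$f" ] || continue      # pattern matched nothing
        echo "removing $f"
        rm -- "$f"
    done
}
```

Run clean_pages in the folder holding the page files; the combined text file and anything else are left alone.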

Now, clearly this could all be wrapped up in a very simple script, something like this (script has some improvements over commands noted above):

$ cat ~/bin/OCR-pdf.sh

#!/bin/bash
# Usage: OCR-pdf.sh file.pdf
echo "1. Converting to png (limit 9999 pages or your disk space)"
# Render each page to a 600 dpi greyscale PNG: ZZZZpage-0001.png and so on
gs -dBATCH -dNOPAUSE -sDEVICE=pnggray -r600 -dUseCropBox -sOutputFile=ZZZZpage-%04d.png "$1" 2> /dev/null > /dev/null
# Print one dot per page as a scale for the progress stars below
echo -n "                  "
for f in ZZZZpage-????.png ; do echo -n "." ; done
echo
echo -n "2. Performing OCR "
for f in ZZZZpage-????.png ; do echo -n "*" ; tesseract "$f" "$f" --dpi 600 2> /dev/null > /dev/null ; done
echo " done."
# Name the text file after the PDF: file.pdf gives file.txt
prename=$(basename "$1" .pdf)
newname="$prename.txt"
echo "3. Creating text file $newname"
cat ZZZZpage-????.png.txt > "$newname"
echo "4. Cleaning up"
rm ZZZZpage-????.png ZZZZpage-????.png.txt

Here I’ve added a few bells and whistles. Note that the error output is all discarded (‘2> /dev/null’ means ‘send output stream 2 (stderr) to /dev/null’, which makes it disappear), so if something does not work these redirections should be removed to see the error messages.
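Rather than editing the redirections in and out, one option is a small wrapper that is silent by default. This is a sketch, not part of the script above; run_quiet and the VERBOSE variable are made-up names.

```shell
# Hypothetical wrapper: run a command with stdout and stderr discarded,
# unless VERBOSE is set to a non-empty value
run_quiet() {
    if [ -n "${VERBOSE:-}" ] ; then
        "$@"
    else
        "$@" > /dev/null 2> /dev/null
    fi
}

# In the OCR loop it would be used like this:
#   run_quiet tesseract "$f" "$f" --dpi 600
# and a noisy run for debugging would be started with VERBOSE=1 set.
```

The wrapper passes back the exit status of the command it ran, so a failure can still be detected even when the messages are hidden.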

‘echo -n’ means ‘echo but do not add a linefeed’. I have read that this does not always work, though in a bash implementation it should be fine.
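If portability matters, printf is the usual substitute, since it never adds a linefeed unless asked to. A minimal sketch, with mark as a made-up helper name:

```shell
# Hypothetical helper: print a progress mark with no trailing linefeed
mark() {
    printf '%s' "$1"
}

# Three marks on one line, then finish the line
mark "*" ; mark "*" ; mark "*" ; printf '\n'
```

In the script, each ‘echo -n "*"’ could then become ‘mark "*"’.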

This requires a working Ghostscript interpreter. Other conversion paths are possible: the standard tesseract uses Leptonica, which can read ppm, png and other formats, so pdftoppm as used above works. However, ppm files are big and uncompressed, which is why I changed to png. I note that the gs-based version picked up some text that the pdftoppm version did not, possibly because I went up to 600 dpi, but there may be some other factor at work; I can’t say for sure.
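One way to tease the two possible causes apart would be to keep pdftoppm but have it write PNG at 600 dpi, then compare the result with the gs output. The sketch below assumes the poppler build of pdftoppm, where the -png and -gray options together give greyscale PNG files; pdf_to_png and its defaults are made up for illustration, and this has not been checked against the PDF discussed here.

```shell
# Hypothetical helper: render a PDF to greyscale PNGs with pdftoppm.
#   $1 = PDF file
#   $2 = resolution in dpi (default 600)
#   $3 = output stem (default 'page')
pdf_to_png() {
    if ! command -v pdftoppm > /dev/null 2>&1 ; then
        echo "pdftoppm not found" >&2
        return 127
    fi
    pdftoppm -png -gray -r "${2:-600}" "$1" "${3:-page}"
}

# Usage:
#   pdf_to_png pdf-filename.pdf 600 page
#   for f in page-*.png ; do tesseract "$f" "$f" --dpi 600 ; done
```

If the text missed at 300 dpi turns up at 600 dpi here too, resolution is the likely explanation; if not, the difference probably lies in how the two programs render the page.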

FWIW (for what it’s worth)

 

4 thoughts on “Simple use of tesseract OCR on a multipage PDF”

  1. I really like these tutorials of yours. Most tech tutorials on WordPress are shit because they’re just mindlessly regurgitating knowledge that anyone can look up on W3Schools or some other site, but your tutorials actually have substance, seem well thought-out, and actually introduce me to topics that I’m not familiar with. Keep up the good work. And I’m going to be emulating some of your style in my own blog because I really think it works.

    1. Thanks for the positive feedback. I must admit I often do these posts really as notes to self. If I get something working on a home computer and want to do it at work, or on a new machine or whatever, I can just pull up the blog. The style comes from making notes while I’m working it out. I’ve just got in the habit of making notes whenever I have a problem to solve. I’m not an expert — just a user who likes to learn. Thanks for the words of encouragement.

      1. Interesting. So it’s not even particularly deliberate, then. I guess that’s what makes the tutorials interesting – the fact that they’re flowing naturally from the learning process instead of being specifically tailored for mass consumption.

        In trying to make my own blog grow (because I’m not going to lie, my ultimate intention is to monetize it and hopefully bring in some decent ad revenue), I’ve found it difficult to tailor my tutorials to what the market wants while at the same time keeping my content original, because generally any question that a sizable number of people are searching has already been answered hundreds of times by hundreds of different people. I don’t want to just relay information that’s already all over the Internet, because that would be redundant and I wouldn’t be contributing anything new.

        The best solution I’ve come up with is to just look through old code files and screenshots on my computer to see if there’s anything I’ve figured out how to do myself (as opposed to learning from a second-hand source) that other people could possibly find a use for. It’s harder to predict whether others will in fact find that knowledge useful, but at least I’ll be creating original content that isn’t redundant.

  2. I guess the tone comes partly because I don’t care too much. Like I said, it’s partly so I can find my own solutions wherever I am. And I’m not worrying about monetising. I have not bothered to make it thematic — I post on a range of stuff, which reduces the appeal of the blog because no one reader will find all the posts relevant. But I figure most readers come via search rather than subscription anyway. Of course, whether users find what they want when they land on a page, I have no idea. I don’t get massive stats. Around 100 reads a day.

    FYI, the most popular posts on the blog are listed below (most-read at the top). Hints on computing are definitely the most popular.

    An easy way to get files in and out of a Win98 VM in VirtualBox
    Science Spammers
    A little trick with tlmgr: Unknown directive …containerchecksum error
    Cambridge Scholars Publishing: Spammers but not dodgy
    Step by Step Install of Fedora on VirtualBox
    Word madness: Can’t save, won’t save. ‘A file error has occurred’
    Word: Dot leader sporadically missing from table of contents: Fixed
    TeX Live on cygwin: A few tricks
    ReactOS on VirtualBox: No need for step-by-step instructions
    Using Mate desktop by default in cygwin
    Install of EndNote X7 freezes: A few notes
    My HP 200LX: More Retrotech…
    Cygwin without Windows administrator access
    Adding the DOI to the unsrt bibliography style in LaTeX
