A little script to scan and OCR a bunch of pages

This little script uses scanimage, tesseract and vim to scan and OCR pages from my typewriter. It tries to produce sensible paragraphs, and appends the text from all the pages to a single text file, which can then be opened and formatted in a word processor such as LibreOffice.

It is an interactive script because I do not have a scanner fitted with a sheet feeder. To make it non-interactive, modify the scanimage line after reading the scanimage man page (the --batch-prompt option is what makes it wait for a keypress before each page), and remove the line read Response. Nothing fancy: no error checking, no clean-up afterwards, no niceties. But it works pretty well, so far. If you want to use it, install whatever packages you need to get scanimage, tesseract and vim working, cut and paste the script below (everything after the cat line) into a file somewhere in your path, and make the file executable.
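
Since the script itself does no error checking, it can be worth confirming that the three tools are installed before the first run. Here is a minimal sketch of such a check; the function name missing_tools is made up for this example and is not part of type_ocr.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of type_ocr.sh): print the name of each
# command given as an argument that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# List whichever of the three tools the script relies on are missing.
missing=$(missing_tools scanimage tesseract vim)
if [ -n "$missing" ]; then
    echo "Not found, install these first:" $missing
else
    echo "scanimage, tesseract and vim are all available."
fi
```

Package names differ between distributions, so the sketch only reports what is missing and leaves the installing to you.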

cat type_ocr.sh
#!/bin/bash
#
# type_ocr.sh v. 1.0
#
# Script to scan, ocr, process and concatenate pages, e.g. from a
# typewriter.
#
# D.J.Goossens, 14 July 2016. darren.goossens@gmail.com
#
# Start at 1001 so we can be (pretty!) sure all filenames have 4 digit
# numbers
#
# Print some instructions, then create the output file.
echo This is type_ocr.sh v. 1.0
echo
echo Make sure you give it the output filename as a command line argument.
echo Ctrl-D escapes from the scanning, Ctrl-C quits elsewhere.
echo The resulting images and text files are not deleted.
echo They are of the form outXXXX.pnm and outXXXX.pnm.txt and
echo may be quite big.
echo
echo Hit Ctrl-C to exit now or Enter to continue.
read Response
echo 'Text file from type_ocr.sh v. 1.0' > "$1"
echo "Processed $(date) to $1" >> "$1"
echo 'Note: When it says "document 1001", treat it as document (page) 1'
scanimage --batch --batch-prompt --batch-start 1001 -p --mode=Gray --resolution=600
# Outputs are of the form out????.pnm. Loop over them
for f in out????.pnm
do
    tesseract "$f" "$f"
    # The above produces out????.pnm.txt, which we can process.
    # First we replace double occurrences of newline with a placeholder
    # string, then replace single occurrences with a space, then replace
    # the placeholder with a return character (it is a quirk of vim that
    # we search for \n (newline) but write \r (return) to insert a line
    # break). The e flag stops vim complaining if a pattern is not found.
    vim -c '%s/\n\n/pLaCeHoLdErStRiNg/ge' -c 'wq' "$f.txt"
    vim -c '%s/\n/ /ge' -c 'wq' "$f.txt"
    vim -c '%s/pLaCeHoLdErStRiNg/\r/ge' -c 'wq' "$f.txt"
    cat "$f.txt" >> "$1"
done
echo "Try typing libreoffice $1 to see what you have got."
echo Setting paragraph formatting to indented with one-and-a-half
echo line spacing is a good start.
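
The newline juggling inside the loop can also be done without opening vim at all. The following is a minimal sketch of one alternative, not the method the script uses: it relies on awk's paragraph mode, and the function name reflow is made up for this example. Unlike the vim version, it collapses a run of several blank lines into a single paragraph break.

```shell
#!/bin/bash
# Hypothetical alternative to the three vim calls (not what type_ocr.sh does).
# With RS set to the empty string, awk reads blank-line-separated paragraphs
# as single records; we then turn the newlines inside each one into spaces.
reflow() {
    awk 'BEGIN { RS = "" }
         { gsub(/\n/, " "); sub(/ +$/, ""); print }' "$@"
}

# Example: two short paragraphs, each typed over two lines.
printf 'The quick brown\nfox jumps.\n\nOver the\nlazy dog.\n' | reflow
```

Each paragraph comes out as one long line, which is the form a word processor wants. Inside the loop, something like reflow "$f.txt" >> "$1" would stand in for the vim and cat lines, assuming the awk behaviour suits your pages.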

Your mileage may vary. Buyer beware. You get what you pay for. No guarantees implied or given. No warranty as far as possible. (Add here any other escape clauses you can think of.)

Because.

About Darren

I'm a scientist by training, based in Australia.
