OCR via PDF and Acrobat Professional, and using OneNote
I have looked at some of the open source options for OCR, with a particular focus on text coming off my typewriter, a 1969 Olivetti Dora (though not with the ‘techno pica’ typeface that some of them have). Here I use the same test image as I used on the open source test — a 600dpi scan of some pretty scrappy typing on some pretty scrappy paper — and check out a couple of options on Windows. This is not even pretending to be exhaustive; I am looking at things that are commonly available (though, unlike the Linux options, not free).
I used a 600dpi version of the image here and extracted text using two methods, Adobe Acrobat Pro XI and Microsoft OneNote 2010, since they are bits of software I have access to. For the Acrobat approach I used CutePDF to print the original TIFF image to a PDF file from the default Windows viewing application, Photo Viewer. Whenever I had a ‘resolution’ box to click in, I chose the biggest value available. For the OneNote approach, I opened the TIFF in Windows Paint and saved it as a PNG.
(A) Acrobat Pro XI
(1) Opened the PDF in Acrobat XI Pro.
(2) View –> Tools –> Text recognition
(3) Selected ‘In This File’
(4) Followed the instructions in this tutorial.
(5) ‘Select All’, ‘Copy’ and then pasted into this blog, and here it is:
·ha t i s the typeface? Olivetti Dora~ qwert_yuiop The- quick brown f ox --_ju... m ped over the l·~:iz.Y dog. i ii ·iii. iv v . . . . I" 12 34567896 •• • , semi ; full: pound£ Who£? 2/3 4+3~1 6+ 3=2 7-5=2 ~+~ =l ~~ ! ~ ~- ( no) ' yes" 56% d.goo ssens@adfa. edu a,u $2.34 Underline Days & Nigbt.s 8 or ' eight'. (Parens) . ABCD~FG1fI J KLMNO P RSTUVWXYZ abcdefg.b,i jkl, mnopqrst, uv xyz Well, I .think that's all the c hBrac ters . Make an exclamation - - - mark by holdi!lg down space bar ,and typ 1 .ng aJpostroplie and a s t " p . - like th.is ! and use lovver case 111' f or un ity. C~olo on to .1:- of ·hyphen give s u s a divide sign. Equa.l and slash /= doe s f. give a not equals. Can ·slash a zero. ¢ + ~ · x ~ == l · ,ti. j.. . - It -does have a zero (0) and a capital oh (0) and they are pretty similar ~- 00 (superimposed 0 -- identical, F'd say. No greater than/ less than. No caret. 3l x and - gi v e s a . sort of ~sterisk. No ver tical bar beyond l. Nd hash. No curly brace . o ~- square bra-cks. · No b a.ckslash. '1• ·, ... _. l :· .bi.t s ticky. What's t .he right oil ? f r£9uihjghbnvhgg·fcvcbf grt :it ... ,..;. .. · .. ~- 7 · ((;;7 ;.:1 T f.1 v o 1 J m v m 1 j -;.n ·m v 1 J l J i J 1 J 1] 1 ] l 11 J 1 J 1 l 1 J 1 J 1 J 1 J 1 J i J i J 1 J 1 l 1 J 1 J i ) 1 ] 1 J i J 1 J 1 J i J i 11 J i J l ] 1 J 1 J i J i ] , ) 1l1L1
(6) Unimpressed. Then rather than cut and paste I saved the PDF as an RTF file and opened the RTF in Word. Here is a screen grab:
(7) It does not come close to what tesseract could do, and tesseract did it with a single command, which makes it highly scriptable and flexible.
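To give a feel for what ‘scriptable’ means here, this is a minimal sketch of wrapping that single tesseract command in Python. It assumes tesseract is installed and on the PATH, and that the basic `tesseract <image> <output base> -l <language>` form is all that is needed. The function names and the file name `scan.tif` are made up for illustration; they are not from my original test.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic tesseract run.

    tesseract adds the ".txt" extension to output_base itself.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on the PATH")
    image_path = Path(image_path)
    output_base = image_path.with_suffix("")  # scan.tif -> scan
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)  # raises if tesseract fails
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


# Hypothetical usage, assuming a file called scan.tif exists:
#     print(ocr_image("scan.tif"))
```

Keeping the command-building separate from the running means the same few lines can be looped over a whole folder of scans, which is the part the point-and-click tools do not make easy.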
(B) OneNote 2010
(1) Pretty simple. Dragged and dropped the PNG version of the scanned image onto OneNote.
(2) Right clicked on the image and selected ‘Copy Text from Picture’ from the popup menu.
(3) Went to Micro$oft Word and hit ‘Paste’.
(4) Was impressed. Here is the screen grab, again from Word:
For what it is worth, here is the OneNote screen, after right-clicking on the image:
OneNote and tesseract made similar, and small, numbers of errors, and both produced extracted text that was quite usable. None of the other options considered here produced results that I would consider usable.
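I judged ‘numbers of errors’ by eye. One way to put a number on it, which I did not do for this test, is a character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the length of the reference. The sketch below is plain Python with no extra packages; the function names are mine, and the reference sentence in the usage comment is my assumption about what was typed.

```python
def levenshtein(a, b):
    """Smallest number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical usage, with one line of the Acrobat output above:
#     reference = "The quick brown fox jumped over the lazy dog."
#     acrobat = "The- quick brown f ox --_ju... m ped over the l·~:iz.Y dog."
#     print(character_error_rate(reference, acrobat))
```

A lower figure is better, and zero means the OCR text matches the reference exactly. Running each program's output through the same function would turn ‘similar, and small’ into something that can be compared directly.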
Caveats: I made no effort to optimise the scanned image, or to use particularly clean copy on pristine paper with a dark ribbon — intentionally, since I want an option that is robust. I know there are many more software packages out there I could have tried. I did not take the time to fiddle with any parameters in the OCR programs I did use. I did not explore dedicated OCR options on Windows.
For me, since it is free, scriptable and open source, tesseract wins.