OCR on Linux, and my old Olivetti Dora

I am exploring Linux OCR options. The first step was to type up some text on my old Olivetti Dora, using all the available characters plus a few typewriter tricks (for example, an exclamation mark made by superimposing a full stop and a single quote). Then I scanned the sheet with xsane, saving the result as a high-resolution (600 dpi) greyscale TIFF (dora_page.tiff).

Here is an image (downsampled for the web) of the text:

Text from my Olivetti Dora, scanned in greyscale. The image looks blurrier, and its murky background more uneven, than the original. The paper had already been used on one side and there is some show-through; a pretty tough test!

Then I ran the scan through three of the most widely available Linux-based solutions: the open source tesseract and gocr, and Cuneiform, which Debian classifies as ‘non-free’.

In all cases but one (the last, below) I let the program use its default behaviour.

(A) Tesseract

Command line:

tesseract dora_page.tiff dora_page.tesseract

Result: Below is the output. Above each line is an evaluation of the OCR, where:

y = correct (‘yes’)
n = wrong (‘no’)
c = close
e = wrong because of typing error (faint or overlapping characters, for example)

I would note that tesseract gave some characters, namely single quotes (apostrophes), as a series of byte values outside the ASCII range rather than as an ASCII quote. They do correspond to quote characters: it is using context to give an opening or closing quote mark rather than a plain apostrophe. Note the difference between the characters around ‘eight’ in the text below and those around the lower case ‘l’ a few lines further down, which erroneously has a double quote on one side and a dagger on the other. On the same line it has rather mysteriously given ‘3’ for a full stop beneath a dagger, the combination used to make an exclamation mark.
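
If the plain-ASCII form is wanted, the quote characters can be mapped back after the fact. The following is a minimal Python sketch, assuming the output file is UTF-8 and that the non-ASCII values are the usual typographic quotes (the post does not confirm either); the function names are my own, not part of tesseract.

```python
# Map the typographic quotes that an OCR engine may emit to plain ASCII.
# Assumption: the non-ASCII values in the output are these four characters.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalise_quotes(text: str) -> str:
    """Return text with curly quotes replaced by their ASCII equivalents."""
    return text.translate(QUOTE_MAP)


def non_ascii_chars(text: str) -> list[str]:
    """List the distinct non-ASCII characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return seen


if __name__ == "__main__":
    # One line of the tesseract output, with the quotes written as escapes.
    line = "$2.34 Underline Days & Nights 8 or \u2018eight\u2019. (Parens)."
    print(non_ascii_chars(line))
    print("\u2018".encode("utf-8"))   # three bytes, none of them ASCII
    print(normalise_quotes(line))
```

Listing the non-ASCII characters first is a cheap way to check what the engine actually produced before replacing anything.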

yynynyn
+-— + X
yyyyyyyyyyyyyyyyyyyyy
What is the typeface?
yyyyyyyyyyyyyyyyyyyyyyyy
Olivetti Dora qwertyuiop
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
The quick brown fox jumped over the lazy dog. i ii iii iv v
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyynyyyynyyyyyyyyyyyyyycyny
1234567890 ... , semi; full: pound a Who£? 2/3 4+3=7 6+3:2
yyynyynynyyynynynynyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
7-5:2 %+%=l % i % % (no) "yes" 56% d.goossens@adfa.edu.au
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
$2.34 Underline Days & Nights 8 or ‘eight’. (Parens).
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijkl, mnopqrst, uvwxyz
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Well, I think that's all the characters. Make an exclamation
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyenyyyyyyyyyyyyyyyyyyyyyey
mark by holding down space bar and typ hg apostrophe and a stpp
yyyyyyyyyynyyyyyyyyyyyyyyyyyyyynnnyyyyyyyyyyyyyyyyyyyyyyyyyyy
like this 3 and use lower case “l' for unity. Colon on top of
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyenyyyyynyyyyyyy
hyphen.gives us a divide sign. Equal and slash fi=does £ give a
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyynnnnnncnnnnnnnnnnnnnnn
not equals. Can slash a zero. ¢_: % X % # is M Z A d
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyynyyyyyyyyyyyyyy
It does have a zero (0) and a capital oh (0) and they are
yyyyyyyyyyyyyyynynynyyyyyyyyyyyyyyynyynyyyyyyyyyyyyyyyyyyyy
pretty similar r-.00 (superimposed C)—~ identical, I'd say.
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyynyyyyyyynyyyyyyyyyyyyy
No greater than / less than. No caret. X x and ~ gives a sort
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyynyyyyyyyyyyyyyyyyyyyyyyyy
of asterisk. No vertical bar beyond 1. No hash. No curly brace
neyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
03 square bracks. No backslash.
eyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyeeeeeyyyyyyyyyyyyyyyyyyyye
E bit sticky. What's the right oil?frfi9uihjghbnvhggfcvcbfgrt fi
yyeeyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
7 fifl are you my mummy uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu
enyeyyyyyyyyeeeeeeeeeeeeeeeeeeeeeyyyyyyyyyyyeeyyyyyyyyyyyeyyyy
Jagaggggggdgaggearkgkfimaxnmyamxxyxxxxxxxxxxxamxxxxxxxxxxxmxxxx
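
The evaluation lines above can be turned into rough counts. Here is a minimal Python sketch that tallies the y/n/c/e marks; leaving the ‘e’ marks out of the accuracy figure is my own choice (they are typing faults rather than OCR faults), and the function names are hypothetical.

```python
from collections import Counter

# Marks used in the evaluation lines: y = correct, n = wrong,
# c = close, e = wrong because of a typing error.
MARKS = "ynce"


def tally_marks(eval_lines):
    """Count each evaluation mark across the given lines, ignoring other characters."""
    counts = Counter()
    for line in eval_lines:
        counts.update(ch for ch in line if ch in MARKS)
    return counts


def ocr_accuracy(counts):
    """Fraction of 'y' marks among cleanly typed characters (y, n and c only)."""
    judged = counts["y"] + counts["n"] + counts["c"]
    return counts["y"] / judged if judged else 0.0


if __name__ == "__main__":
    # The first evaluation line from the tesseract output above.
    counts = tally_marks(["yynynyn"])
    print(dict(counts))
    print(f"accuracy excluding typing errors: {ocr_accuracy(counts):.2f}")
```

Feeding it only the evaluation lines (every second line of the listing) keeps letters in the OCR text itself from being counted as marks.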

Summary: Very good for regular, sensible, context-rich text. Poor for some of the more unusual characters (for example fractions). Poor for the exclamation mark. Context causes an isolated ‘l’ and ‘O’ (little el, big oh) to be read as ‘1’ and ‘0’ (one and zero) respectively; not unreasonable. Some issues with the apostrophe/single quote mark. On the other hand, very good (using the greyscale image) at working out faded letters. The last few lines were typed by mad children, so it is no wonder they proved difficult. The ‘03’ at the beginning of one line is the result of the superimposed characters in the second column being read as a ‘3’ instead of an ‘r’; context, I suspect, then led the ‘o’ to be classified as a zero.

Conclusion: I’m no expert but this looks excellent to me. Some of the problems come from the typeface, which has no specific exclamation mark or digit for unity, and undifferentiated zero and capital oh. Exclamation mark might work better if it had context (i.e., was at the end of a word); can’t expect to test all possible combinations.

(B) gocr

Command line(s):

convert dora_page.tiff dora_page.pnm

gocr -i dora_page.pnm -o dora_page.gocr -f ASCII

Result: Utter nonsense. Perhaps there are some flags I need to set?

_ 7_5_2_ (?)_+___ll _*t_____ M__ht M_____ __ (n0) ye5_ 56_ d g0o
se_s ad_a d u _

__ + X
_h_t i5 the t,y_p. e__._ce?
U1(0xee)vet_i Do_a., q_er_yu(0xec)_p
The quick '_r__J._n _0x J_,___:m_.__ __v_ c__ve_ t__e 1a,2y _0g, i ii
iii iv v
1_3_567_90 _ _ _ _ 5em__ij _ul__ p_und _ _h.o_? 2/3 _+3_7 6_3___
1 1 __ _J. ___ _ t_ __ _ _ _
_ 2 K_- ____ _ i_ _ O _ _' _ _ _____i U_ _'
__.34 Underl(0xec)-ne Da_.y_ & _,_ig_;__tn! _ __ _eigh_'. (Pa__ens).
_BCD___-''__J___JGP_'____STUV n_YZ a'_Cde__hiJ1cl t _0(0xdf)q__t t
UV_Xy2
W?e1_ _ th_inh thatl_ ___._.,_.___ _he ch.;_M..J._a..c_e.__. __a.ice
__.n o__c___ati,Jn
_a_b by h-0l_ing dc!_n sp__._,ce 1:!a,r __.n__ ty_w, _ _?_e_
a,_0s_r_p_e __nd a st__ p,
li,_e t__.is _ 8nd use l____v_!e_ c__e ''l' _'o_ _ni_y. C,__l_,n,
_,_,n. t___ __
hy__hen _iv__ us a divide sign. _qu&1 (0xed),._nd __l5_sh ._- daecu_ t
g_:__ve a,
n0_ equa_s_ Can 5_a5,h. a, 2e_o, (0xd8) _t M '2 x _2 _ _ _ __' gf _
T_ does have a 2e_o (0) (?)a.nd a ca,pi_a,_ 0h (O) ___d they __e
p_e__y _i_i_a,_ -.- 0_ (supe_(0xee)_po_ed O -- (0xee)dentical, __,'d
say.
No g_ea.,_er _han / less _.han, N0 ca_e__ _ x a,nd - g_ves a, sort
o_ as_e_is_, No ve_tica._ ba_ bey_vnd l. _o .has.h, N0 cu_1_ b_a_.__
o_ 5qua_e b_c_,_bs. No b_-_'_c_5_ash_,
_ b(0xee)_ __ticJc_y _ VY.ha_ ' s __e _igh.t _i_?f
.r._9u(0xee)h,jghb.il_hg_+_-n-?,__cvcb_grt _
7 _ (0x0107);_'_'_"_,. _ e _..,_/' _ 'u :_--;_!--:', ,__, " Ti_u. - ,
_r_ y _u u.uuuuu-__" uuuuuuu uuuuuuuuuuuuuuuu _;'4 __ u uuu_
_,___._,_:;,?_.....' ____..=;:____,gg_._n,;___..g_n_.... '._. _____ ___
___________x_.___x,x_._ xxxx_xxxxxx _xxXX. xxxx x_____ _g0 _x_
~

Conclusion: Not for me. Outputting to a more powerful format, like HTML, did not help.

(C) Cuneiform

Command line(s):

sudo apt-get install cuneiform

cuneiform -o dora_page.cuneiform dora_page.tiff

Result: Better than gocr, but not up with tesseract.

The QU3.ck brown f ox pumped ovex' tte 1BKy dog i i3. 3.3.1, 3.v v
1234/6 j8$0 ..., semi; full: pound. g Who'~ 2/3 4+3=7 6+3=2
7 -)=2 ~+2 — 1 —; 4, g g (no) ' yes $6$ ci~ goossensOadf B
BDU BU.
$2.gg Underline IIBys 8c Nights 8 or 'eight'. (Paxens).
A3C33ZPt"HIJKIMOPQRHTUVWXYZ abcdefghijkl, mnopcLrst, uvwxyz
Well, I think that'8 all the characters. Make Bn exclamation
mark by hold.ing do@In space bar Bnd typ'ng apostrophe and a stpp
like this ! and use lover case "1' for unity. Coj.on on top of
hyphen g3 Ves .US a 43v3.de 83.gn ~ Equal and slash. $ does g give B
not eqUB18 Can slash. a Kex'o g + p x g = 4 pf
It does have a Kexo (0) and, a capital oh (0) and they Bxe
pretty similar -- GO (superimpose@ 0 -- identical, I: d. say.
No gxeater than / lese than,. No cax'et. x x an6 — gives a sort
of asterisk. No vertical bax beyond 1. No hash. 5o cuxly brace
oxen square bracks. No backslash.
g bit sticky. %)hat 8 the xight oil.f r89uihjghbnvbggfcvcbfgrt ®

Here is the result of running with the single-column option:

cuneiform --singlecolumn dora_page.tiff -o dora_page.cuneiform.single

Ithat is the typeface'2
Ulivetti 3ora qwertyuiop
The QU3.ck brown fox pumped ovex' tte 1BKy dog i ii 3.ii iv v
1234/6 j8$0 ..., semi; full: pound. g Who'~ 2/3 4+3=7 6+3=2
7 -)=2 ~+2 — 1 —; 4, g g (no) ' yes $6$ ci~ goossensOadf B eDU BU.
$2.gg Underline IIays 8c Nights 8 or 'eight'. (Paxens).
A3C33ZPt"HIJKIMOPQRHTUVWXYZ abcdefghijkl, mnopcLrst, uvwxyz
Well, I think that'8 all the characters. Make an exclamation
mark by hold.ing do@In space bar and typ'ng apostrophe and a stpp
like this ! and use lover case "1' for unity. Coj.on on top of
hyphen g3 Ves .US a 43v3.de 83.gn ~ Equal Bnd slash. $ does g give B
not eqUB18 Can slash. a Kex'o g + p x g = 4 pf
It does have a Kexo (0) and, a capital oh (0) and they Bxe
pretty similar -- GO (superimpose@ 0 -- identical, I: d. say.
No gxeater than / lese than,. No cax'et. x x an6 — gives a sort
of asterisk. No vertical bax beyond 1. No hash. 5o cuxly brace
ox' square bracks. No backslash.
g bit sticky. %)hat 8 the xight oil.f r89uihjghbnvbggfcvcbfgrt ®
f ®f !=tx'e )'ou Tj Fu.'::.lzly UuuuuuuUUuuuuuUUuuuuuUuuUuuuuuuUuuuuulgk
,-F,,'"::::::gggggg<gsggs

Conclusion: There are quite a few options to play with, so perhaps it could do better, but it would have to do a lot better to get near tesseract. Options ‘dotmatrix’ and ‘fax’ made no difference.
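
To repeat the comparison on another scan, the command lines used above can be collected into one small script. This is only a sketch: the commands are copied from this post, while the Python wrapper and its function names are my own, and it simply skips any tool that is not installed.

```python
import shutil
import subprocess


def build_commands(tiff="dora_page.tiff", stem="dora_page"):
    """Return the command lines used in the post, keyed by engine name."""
    pnm = f"{stem}.pnm"
    return {
        "tesseract": [["tesseract", tiff, f"{stem}.tesseract"]],
        # gocr was given a PNM file, so the TIFF is converted first.
        "gocr": [
            ["convert", tiff, pnm],
            ["gocr", "-i", pnm, "-o", f"{stem}.gocr", "-f", "ASCII"],
        ],
        "cuneiform": [["cuneiform", "-o", f"{stem}.cuneiform", tiff]],
    }


def run_engine(name, commands):
    """Run one engine's commands in order; return False if a tool is missing or fails."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            print(f"{name}: '{cmd[0]}' not found, skipping")
            return False
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"{name}: '{cmd[0]}' exited with status {result.returncode}")
            return False
    return True


if __name__ == "__main__":
    for engine, cmds in build_commands().items():
        run_engine(engine, cmds)
```

Keeping the commands as lists rather than shell strings avoids quoting problems if the scan's file name contains spaces.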

 

Overall conclusion:

Olivetti Dora.

Tesseract is very impressive and for now is my preference.
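
One way to put a rough number on that preference is to score each engine's output against the text as typed. The sketch below uses Python's difflib for this; it is not a proper character error rate, only a 0 to 1 similarity score, and the sample strings are one line taken from each engine's output above (the gocr line is rejoined where it wrapped).

```python
from difflib import SequenceMatcher


def similarity(reference: str, ocr_text: str) -> float:
    """Return a 0..1 similarity score between the typed text and an OCR result."""
    # autojunk=False turns off a heuristic that can distort scores on long inputs.
    return SequenceMatcher(None, reference, ocr_text, autojunk=False).ratio()


if __name__ == "__main__":
    # Reference: the line as typed (tesseract read it without error).
    reference = "The quick brown fox jumped over the lazy dog. i ii iii iv v"
    results = {
        "tesseract": "The quick brown fox jumped over the lazy dog. i ii iii iv v",
        "cuneiform": "The QU3.ck brown f ox pumped ovex' tte 1BKy dog i i3. 3.3.1, 3.v v",
        "gocr": "The quick '_r__J._n _0x J_,___:m_.__ __v_ c__ve_ t__e 1a,2y _0g, i ii iii iv v",
    }
    for engine, text in results.items():
        print(f"{engine}: {similarity(reference, text):.2f}")
```

Scoring a whole page this way would need a hand-typed reference file, which I have not made.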

What I have not done is try any of the more Windows-y methods, like scanning to PDF and getting some Adobe product to extract the text.

 

More retrotech.


About Darren

I'm a scientist by training, based in Australia.
