4 March 2017: A session with Tesseract: Difference between revisions

Latest revision as of 17:54, 5 March 2017

Experiments

Installed tesseract-ocr via Homebrew on the Mac mini attached to the book scanner.
Took a book page image from the scanner (using scan.py, which still works) and ran it through tesseract to see what it would produce.
We made three attempts:

1. Original uncropped page image (img00001.jpg)	Tesseract produced total gibberish (see out0.txt)	”H 3L7“ V U! “it“ ]\.13311\ XVIII JMHII ‘I’HLMH ‘~I14\|!H'\ .)111 JA) \),)llU[ILIU} -.1:ni J\[1 annlmny pm‘ \qnwl \llll ”WINK” '~JEHH[<V.) Jill 1mm ‘ulfl 116M" '\U\‘EIHHPIV1)‘. \.[\Il1‘.\\[l\§ ll“!,[ [[JV) ”:1“ \|‘\ ‘ llt‘ltl .)\.>I[l UN \[ll\!IllTB-J\) )lllﬁw-‘HL'UH‘IHI l!'!ll\‘1lll V{1I\\ \lllllﬂ ) tlv'lll I'MHV PUP SQUIV {‘(I 'JUERH H'IJ’I 1H \y.)‘IlI\I_ ) )\TI\ 1 'I'IHHXH’I ll‘l )111:113)UP111’ ‘HHIHJJHI [«»1111~,».i,1mu \EH'xll‘awII‘ll’1H1\|JI’)~I11’E«II“.1[\I q 11’ [MW EIIxUIL‘ﬁIH [I my \|"]‘711U‘:)’1'(l\-’l up 3n mwitl [myn‘hn n‘.\ r; way! :1: m1;»";‘:.' "'l‘C H {r w.’:[ (seyuown (up; uem a): sawmoa 51 mqemnoaup mm , 3L '1}n'7/n;[ [mu ”ﬁlmy-'11“ “W" lxrx v‘. 1‘ “ ”MN "'y‘g' .‘xn \' )(1 uliuval Hmpx ﬁqvlvmw.» ,xrx‘ 2'1me 11 ‘Illl Ml ILHVHIIH! .HI I" H“ KNIT; “1'33 'I>?Il‘.\ '» H»!II!'T\IH -\'\IHIV‘\ J" ‘ ‘IHIIH‘ 1“ :UI HI ;{\|[ ZUI'IIIW.) In; """‘!“'I!E’ M‘WI'I V * Mt] "Y”“IE‘W J1“ 11} “up.” "’1 111:.va valr“;1v‘lv'1 ~ w H 1!“ JH \':)([ A711 VHM'H WWI“ \Unni ,IIZVllﬂi [w mu] qml I/uw .1 m; 5‘ 'Iruv'1.:{;11\| H 5’ III (In .YIHN minim-II .llHlx pm! ‘lvmttlnm! m: I]! {7'} l m, mulr- illvl‘ﬂn» Hwy] “(I HW'M l{ .1 III IVII ',[ ZI WWI" “‘"(I 14111.: 1,1 Jillllll h) J’fnl .1!“ I" VlHlmH -‘l[1 HULL"! H1 vl'mmlm ,H'] ,{luu [HI .n[1 .hEl'l \lll‘IlI ‘mxunmi ”11"“ ‘ ['[-/ r/nm‘l‘ 'IJIJHH'H) ll! [Hill ’ .10” [)0] [WIJVH .11)“, [U [[lUHJI U I‘VXEV‘r—J? VII-Nil WHEIH‘JI <{l SHUHJIIN 'I MN“ )'I I. (IN S'I\ INC-I I.\'[\' \‘1 1A\ )[YddV 1W ‘ u‘ “*2“ V . ﬂ. r-—--——~.. m‘
2. Cropped and rotated image	Produced mix of words and gibberish (out1.txt)	Al‘l’AILYI‘lfS, MA'I [CRIAIS AND 'I'IitllNlﬂAL METHODS by lrguling llll'lll zxguimt u lt‘llﬂlll ()f glass or metal rm] f} 0r 7 mm in (lléllllt‘H'l. (l‘iiglur‘ 1']. 1],. \\'l1mi pmu'mq plan's misr [1“, lirl uuly llu‘ enough In permit the lllrlllll] 0] [hr tulwur lmtllv tux-1mm l’mu' zilmut l2 13 ml in (-ach plulc. Dry plan's slightly HIM'H Figynr [1.2 in an imulmtor. and \tmr mmlium \i(lt' up in u rvlligriulm‘. N ‘ IUun".1/l;/I j Ilg'm' Ii Twill/lg (In/Hm: .lli «1m. ‘ltf/Hfun} n/ I’M/Mg" [3'01" K ‘v. lmu hm ml; (‘ullmn- mwliu. Inulirulzuly mmlia likv D(l;\ 0r \Vilwu uml lllgm, mu) Hui (nu wlll(’l'£ll)l§' (ind \lmultl lJf' erml in the : lnllmxiu}: \uxyi I’m-pun: Mllnl u-ulnld (lilluiuus. loi‘ (‘xzimplny 10" to 107 of lulnum u]~ \‘guiuus mgzmisms \x'lm’h \\ill gm“ un 01‘ I)? lIll)ll)ltf’d l))' [111- uuwliuuL l'ru' (xumplw, \xlu-n Imtiug I)(I;\ ust- Sthmm', ﬁlly/Mm Stu/Ilium)Mm. sm'x-i'ul ()[lH'I \leixmnvlluv and [Z‘M'IIAY/[II‘ L'se , 10‘5 , 2 COlOnleS 10' No COlOnleS 10“ Uncountable 19 colonies (more than 200 colonies) l‘ig‘un’ Iii. illiin mill Alum (mm! zit lam luui' \\rll-(lriwl plan's nl‘tlits {mt medium {m (‘ﬂ('l1 nrganism zuinl at l(‘£l\‘[ tun plan‘s (‘;1(‘ll(\lk;ll\'11\)\\'ll Sulixllu’tm‘y mumul medium, and hill gmu‘ml mmlium, ugh .\Iun'(7w»11l\'(‘y or erH) agar. Do Klilt's and .\li<i';1 (ll‘nl) (mums \\lIll lllt‘ sx‘rial (lilulinus nlilln' organix'ms on Ilu'sc plzm-s w (11:11 (‘11(‘ll plutv ix ll\\‘(l fur wvrml (lilutiunx‘. ‘Figure 14.3} 'SH' p. llll.) Count Ilu‘ (nlnuim, mlml‘m' the rv‘xulm and rmupan' the per- furmam‘cs‘ of. tho \m‘ium media. 'I'livsv may suggt‘st that in a new llh
3. Cropped, rotated, desaturated and contrasted image.	Produced mostly correct text (out2.txt)	APPARATUS, MATERIALS AND TECHNICAL METHODS by leaning them against a length of glass or metal rod 6 or 7 mm in diameter. (Figure 14.1). - When pouring plates raise the lid only far enough to permit the mouth of the tube or bottle to enter. Pour about 12—15 ml in each plate. Dry plates slightly open (Figure 14.2) in an incubator, and store medium side up in a refrigerator. $ Figure 14.2. Drying a plate Testing Culture Media. ‘Efficienga of Plating’ (EOP) New batches of culture media, particularly media like DCA or Wilson and Blair, may vary considerably and should be tested in the following way. Prepare serial tenfold dilutions, for example, 10'2 to 10‘7 of cultures of various organisms which will grow on or be inhibited by the medium. For example, when testing DCA use Sh.:onnei, Syphi, SJyphimurium, several other salmonellae and Esch.wli. Use 10-5 2 colonies 19 colonies (more than 200 colonies) Figure 14.3. Mile: and Mimi am: at least four well-dried plates of the test medium for each organism and at least two plates each of a known satisfactory control medium, and of a general medium, e.g. MacConkey or Lemco agar. Do Miles and Misra drop counts with the serial dilutions of the organisms on these plates so that each plate is used for several dilutions. (Figure 14.3) (See p. 180.) . Count the colonies, tabulate the results and compare the per- formances of the various media. These may suggest that in a new L116‘
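
The notes do not record which tool was used to crop, rotate, desaturate and contrast the image for attempts 2 and 3. As a rough sketch only, the same steps could be scripted with the Pillow library. Everything named here (preprocess_page, save_for_ocr, the crop box, the rotation angle, the autocontrast cutoff and the 300 dpi value) is an assumption for illustration, not what was actually run.

```python
from PIL import Image, ImageOps


def preprocess_page(img, crop_box=None, rotate_degrees=0.0, cutoff=2):
    """Return a grayscale, cropped, rotated, contrast-stretched copy of img.

    crop_box is (left, upper, right, lower) in pixels; all defaults are
    illustrative guesses, not values taken from the session.
    """
    gray = ImageOps.grayscale(img)  # desaturate first
    if crop_box is not None:
        gray = gray.crop(crop_box)
    if rotate_degrees:
        # expand=True grows the canvas so the page corners are not clipped;
        # the newly exposed corners are filled with white (255).
        gray = gray.rotate(rotate_degrees, expand=True, fillcolor=255)
    # Stretch the histogram so the darkest pixels become black and the
    # lightest become white, ignoring `cutoff` percent at each end.
    return ImageOps.autocontrast(gray, cutoff=cutoff)


def save_for_ocr(img, path, dpi=300):
    """Save with resolution metadata; 300 dpi is an assumed value."""
    img.save(path, dpi=(dpi, dpi))
```

A call might look like save_for_ocr(preprocess_page(Image.open("img00001.jpg"), rotate_degrees=90), "page.jpg"), where the angle and output name are placeholders. Writing a resolution into the file may also avoid the "Invalid resolution 0 dpi" warning, though we have not verified that on this setup.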

The command-line incantation used was:

Warrens-MBP:diybookscanner hdpe$ time tesseract ~/build/samples/img00001.jpg out0 -l eng
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

real	0m15.946s
user	0m14.255s
sys	0m0.340s
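
For the automated pipeline, the same command could be driven from Python rather than typed at the shell. This is a minimal sketch using only the standard library. The function names are made up for illustration, and it assumes the tesseract binary is on the PATH.

```python
import subprocess
import time
from pathlib import Path


def tesseract_command(image_path, output_base, lang="eng"):
    """Build the same argument list as the shell command above."""
    # The shell expands "~" for us; subprocess does not, so do it here.
    image = Path(image_path).expanduser()
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run tesseract and return (elapsed_seconds, console_output).

    Tesseract writes the recognised text to <output_base>.txt itself;
    console_output holds whatever it printed (banner, warnings).
    """
    start = time.perf_counter()
    result = subprocess.run(
        tesseract_command(image_path, output_base, lang),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge warnings into one stream
        universal_newlines=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return time.perf_counter() - start, result.stdout
```

Calling run_tesseract("~/build/samples/img00001.jpg", "out0") would mirror the session above. The elapsed time it returns is wall-clock time, comparable to the "real" figure from time.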

Lessons from this experiment:
1. We're going to need to automate image processing in the pipeline, transforming images to high contrast, to get tesseract functioning reasonably well.
2. The text was shot at a slight angle, which may or may not have affected tesseract. We're unsure.
3. The places where tesseract did poorly on the third image are partly down to math symbols, non-English terminology, binomial nomenclature, tables of figures, etc.
4. Net result: no matter how clean and perfect the input images, tesseract will encounter things it just can't handle. Manual editing may be necessary.
5. Trent has some app UI ideas about how to improve the correction workflow.
6. I found an Angular 1.x web app project on GitHub which accepts image uploads, passes them through tesseract, and returns the processed text. The backend is a Node.js Express server which invokes tesseract through an npm wrapper library.
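
One possible aid for the manual correction step, offered purely as an illustration and not something we tried: flag tokens in the OCR output that are mostly punctuation, so a person can jump straight to them. The function name and the 0.7 threshold are arbitrary assumptions.

```python
def flag_suspect_tokens(text, min_alnum_ratio=0.7):
    """Return whitespace-separated tokens that look like OCR noise.

    A token is flagged when, after trimming ordinary sentence punctuation
    from its ends, fewer than min_alnum_ratio of its characters are
    letters or digits.
    """
    flagged = []
    for token in text.split():
        core = token.strip(".,;:!?()\"'")
        if not core:
            # Nothing but punctuation, e.g. a stray "." or "(".
            flagged.append(token)
            continue
        alnum = sum(ch.isalnum() for ch in core)
        if alnum / len(core) < min_alnum_ratio:
            flagged.append(token)
    return flagged
```

This only catches symbol-heavy garbage of the kind seen in out0.txt. Misread words made entirely of letters, such as "lrguling" in out1.txt, pass straight through, so it would complement a human proofreading pass rather than replace it.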

The working files and folders are installed under ~/build. We also installed nvm and Node 7. The workstation was backed up before and after we worked on it.
