4 March 2017: A session with Tesseract
=== Experiments ===
* Installed tesseract-ocr via Homebrew onto the Mac mini attached to the book scanner.
* Took a book page image from the scanner (using scan.py, which still works) and ran it through tesseract to see what it would produce.
* We made three attempts:
{| class="wikitable" style="text-align: left;"
|- style="vertical-align: top;"
| [[File:Img00001 25pct.jpg|100px]] <br /> 1. Original uncropped page image (img00001.jpg)
| Tesseract produced total gibberish (see out0.txt)
| <blockquote>
”H
3L7“ V U! “it“ ]\.13311\ XVIII JMHII ‘I’HLMH ‘~I14|!H'\ .)111 JA) \),)llU[ILIU}
-.1:ni J\[1 annlmny pm‘ \qnwl \llll ”WINK” '~JEHH[<V.) Jill 1mm
:‘ulfl 116M"
'\U\‘EIHHPIV1)‘. \.[\Il1‘.\*\[l\§ ll“!,[ [[JV) ”:1“ |‘\ ‘ llt‘ltl .)\.>I[l
UN \[ll\!IllTB-J\) )lllfiw-‘HL'UH‘IHI l!'!ll\‘1lll V{1I\\ \lllllfl ) tlv'lll I'MHV PUP
SQUIV {‘(I 'JUERH H'IJ’I 1H \y.)‘IlI\I_ ) )\TI\ 1 'I'IHHXH’I ll‘l )111:113)UP111’
‘HHIHJJHI [«»1111~,».i,1mu \EH'xll‘awII‘ll’1H1|JI’)~I11’E«II*“.1[\I q 11’ [MW
EIIxUIL‘fiIH [I my |"]‘711U‘:)’1'(l\-’l up 3n mwitl [myn‘hn n‘.\ r; way! :1:
m1;»";‘:.' "'l‘C H {r w.’:[
(seyuown (up; uem a):
sawmoa 51 mqemnoaup
mm , 3L
'1}n'7/n;[ [mu ”filmy-'11“ “W" lxrx v‘. 1‘ “
”MN "'y‘g' .‘xn \' )(1 uliuval Hmpx fiqvlvmw.» ,xrx‘ 2'1me 11 ‘Illl Ml
ILHVHIIH! .HI I" H“ KNIT; “1'33 'I>?Il‘.\ '» H»!II!'T\IH -\'\IHIV‘\ J" ‘ ‘IHIIH‘
1“ :UI HI ;{|[ ZUI'IIIW.) In; """‘!“'I!E’ M‘WI'I V * Mt]
"Y”“IE‘W
J1“ 11} “up.” "’1 111:.va valr“;1v‘lv'1 ~ w H 1!“
JH \':)([ A711 VHM'H WWI“ \Unni ,IIZVllfli [w mu] qml
I/uw .1 m; 5‘
'Iruv'1.:{;11| H 5’ III (In .YIHN minim-II .llHlx
pm! ‘lvmttlnm! m: I]! {7'} l m, mulr- illvl‘fln» Hwy] “(I HW'M
l{ .1 III IVII ',[ ZI WWI" “‘"(I 14111.: 1,1 Jillllll h) J’fnl .1!“ I" VlHlmH
-‘l[1 HULL"! H1 vl'mmlm ,H'] ,{luu [HI .n[1 .hEl'l \lll‘IlI ‘mxunmi ”11"“
‘ ['[-/ r/nm‘l‘ 'IJIJHH'H)
ll! [Hill ’ .10” [)0] [WIJVH .11)“, [U [[lUHJI U I‘VXEV‘r—J? VII-Nil WHEIH‘JI <{l
SHUHJIIN 'I MN“ )'I I. (IN S'I\ INC-I I.\'[\' \‘1 1A\ )[YddV
1W
‘ u‘
“*2“ V . fl.
r-—--——~..
m‘
</blockquote>
|- style="vertical-align: top;"
| [[File:Img00001-crop-25pct.jpg|100px]] <br /> 2. Cropped and rotated image
| Produced a mix of words and gibberish (see out1.txt)
| <blockquote>
Al‘l’AILYI‘lfS, MA'I [CRIAIS AND 'I'IitllNlflAL METHODS
by lrguling llll'lll zxguimt u lt‘llfllll ()f glass or metal rm] f} 0r 7 mm in
(lléllllt‘H'l. (l‘iiglur‘ 1']. 1],.
\\'l1mi pmu'mq plan's misr [1“, lirl uuly llu‘ enough In permit the
lllrlllll] 0] [hr tulwur lmtllv tux-1mm l’mu' zilmut l2 13 ml in (-ach
plulc. Dry plan's slightly HIM'H Figynr [1.2 in an imulmtor. and
\tmr mmlium \i(lt' up in u rvlligriulm‘.
:N ‘
IUun".1/l;/I j
Ilg'm' Ii
Twill/lg (In/Hm: .lli «1m. ‘ltf/Hfun} n/ I’M/Mg" [3'01"
K ‘v. lmu hm ml; (‘ullmn- mwliu. Inulirulzuly mmlia likv D(l;\ 0r
\Vilwu uml lllgm, mu) Hui (nu wlll(’l'£ll)l§' (ind \lmultl lJf' erml in the :
lnllmxiu}: \uxyi
I’m-pun: Mllnl u-ulnld (lilluiuus. loi‘ (‘xzimplny 10" to 107 of
lulnum u]~ \‘guiuus mgzmisms \x'lm’h \\ill gm“ un 01‘ I)? lIll)ll)ltf’d
l))' [111- uuwliuuL l'ru' (*xumplw, \xlu-n Imtiug I)(I;\ ust- Sthmm',
filly/Mm Stu/Ilium)Mm. sm'x-i'ul ()[lH'I \leixmnvlluv and [Z‘M'IIAY/[II‘ L'se ,
10‘5 ,
2 COlOnleS
10'
No COlOnleS
10“
Uncountable 19 colonies
(more than 200 colonies)
l‘ig‘un’ Iii. illiin mill Alum (mm!
zit lam luui' \\rll-(lriwl plan's nl‘tlits {mt medium {m (‘fl('l1 nrganism
zuinl at l(‘£l\‘[ tun plan‘s (‘;1(‘ll(\lk;ll\'11\)\\'ll Sulixllu’tm‘y mumul medium,
and hill gmu‘ml mmlium, ugh .\Iun'(7w»11l\'(‘y or erH) agar. Do Klilt's
and .\li<i';1 (ll‘nl) (mums \\lIll lllt‘ sx‘rial (lilulinus nlilln' organix'ms on
Ilu'sc plzm-s w (11:11 (‘11(‘ll plutv ix ll\\‘(l fur wvrml (lilutiunx‘. ‘Figure
14.3} 'SH' p. llll.)
Count Ilu‘ (*nlnuim, mlml‘m' the rv‘xulm and rmupan' the per-
furmam‘cs‘ of. tho \m‘ium media. 'I'livsv may suggt‘st that in a new
llh
</blockquote>
|- style="vertical-align: top;"
| [[File:Img00001-crop-unsat-25pct.jpg|100px]] <br /> 3. Cropped, rotated, desaturated and contrast-enhanced image
| Produced mostly correct text (see out2.txt)
| <blockquote>
APPARATUS, MATERIALS AND TECHNICAL METHODS
by leaning them against a length of glass or metal rod 6 or 7 mm in
diameter. (Figure 14.1). -
When pouring plates raise the lid only far enough to permit the
mouth of the tube or bottle to enter. Pour about 12—15 ml in each
plate. Dry plates slightly open (Figure 14.2) in an incubator, and
store medium side up in a refrigerator.
$
Figure 14.2. Drying a plate
Testing Culture Media. ‘Efficienga of Plating’ (EOP)
New batches of culture media, particularly media like DCA or
Wilson and Blair, may vary considerably and should be tested in the
following way.
Prepare serial tenfold dilutions, for example, 10'2 to 10‘7 of
cultures of various organisms which will grow on or be inhibited
by the medium. For example, when testing DCA use Sh.:onnei,
Syphi, SJyphimurium, several other salmonellae and Esch.wli. Use
10-5
2 colonies
19 colonies
(more than 200 colonies)
Figure 14.3. Mile: and Mimi am:
at least four well-dried plates of the test medium for each organism
and at least two plates each of a known satisfactory control medium,
and of a general medium, e.g. MacConkey or Lemco agar. Do Miles
and Misra drop counts with the serial dilutions of the organisms on
these plates so that each plate is used for several dilutions. (Figure
14.3) (See p. 180.) .
Count the colonies, tabulate the results and compare the per-
formances of the various media. These may suggest that in a new
L116‘
</blockquote>
|}
* The command-line incantation used was:
<pre>
Warrens-MBP:diybookscanner hdpe$ time tesseract ~/build/samples/img00001.jpg out0 -l eng
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
real 0m15.946s
user 0m14.255s
sys 0m0.340s
</pre>
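For repeated runs, the invocation above could be wrapped in a small script. The following is a minimal Python sketch, not something used in the session: the function names are hypothetical, and it relies only on the "tesseract IMAGE OUTPUTBASE -l LANG" form shown above, where tesseract itself appends .txt to the output base (hence out0.txt).

```python
import subprocess
import time
from pathlib import Path


def build_tesseract_cmd(image, out_base, lang="eng"):
    """Return the argv list for: tesseract <image> <out_base> -l <lang>."""
    return ["tesseract", str(image), str(out_base), "-l", lang]


def run_ocr(image, out_base, lang="eng"):
    """Run tesseract on one image; return (path of the text output, seconds taken)."""
    cmd = build_tesseract_cmd(image, out_base, lang)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError if tesseract fails
    elapsed = time.perf_counter() - start
    # tesseract adds the ".txt" suffix to the output base on its own
    return Path(str(out_base) + ".txt"), elapsed


# Hypothetical usage, mirroring the command in the session log:
#   out_path, secs = run_ocr(Path.home() / "build/samples/img00001.jpg", "out0")
#   print(out_path, f"{secs:.1f}s")
```

Keeping the argv construction separate from the subprocess call makes it easy to check the command without having tesseract installed.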
* Lessons from this experiment:
*# We will need to automate image processing in the pipeline, transforming images to high contrast, to get tesseract working reasonably well.
*# The text was shot at a slight angle, which may or may not have affected tesseract; we are unsure.
*# Where tesseract did poorly on the third image, it was partly due to math symbols, non-English terminology, binomial nomenclature, tables of figures, etc.
*# The net result: no matter how clean the input images are, tesseract will encounter things it can't handle, so manual editing may be necessary.
*# Trent has some app UI ideas for improving the correction workflow.
*# I found an Angular 1.x web app project on GitHub that accepts image uploads, passes them through tesseract, and returns the processed text. The backend is a Node.js Express server that invokes tesseract through an npm wrapper library.
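The first two lessons (automated contrast enhancement, possible skew) could be prototyped by shelling out to an image tool before calling tesseract. The sketch below assumes ImageMagick's convert command is installed, which is not something the session used or verified; the function names and the 40% deskew threshold are illustrative assumptions. Cropping, as applied in attempts 2 and 3, is not covered.

```python
import subprocess
from pathlib import Path


def build_preprocess_cmd(src, dst, deskew_threshold="40%"):
    """Argv for an ImageMagick call that desaturates, stretches contrast and deskews."""
    return [
        "convert", str(src),
        "-colorspace", "Gray",        # desaturate to grayscale
        "-normalize",                 # stretch contrast across the full range
        "-deskew", deskew_threshold,  # try to straighten slightly rotated text
        str(dst),
    ]


def preprocess(src, dst):
    """Write a cleaned copy of src to dst; raises CalledProcessError on failure."""
    subprocess.run(build_preprocess_cmd(src, dst), check=True)
    return Path(dst)


# Hypothetical usage (file names are placeholders):
#   preprocess("img00001-crop.jpg", "img00001-clean.png")
```

Whether deskewing actually helps is exactly the open question from lesson 2, so it would be worth comparing tesseract output with and without that option.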
The working files and folders are installed under ~/build. nvm and Node 7 were also installed. The workstation was backed up before and after working on it.
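Since some manual editing looks unavoidable, one way to focus it is to flag OCR lines that are mostly punctuation. This is a rough heuristic sketch with hypothetical names and an arbitrary 0.8 threshold, not part of the session; it is aimed at symbol-heavy lines and will miss errors that still look like words.

```python
def alpha_ratio(line):
    """Fraction of non-space characters that are letters or digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 1.0  # blank lines need no correction
    return sum(c.isalnum() for c in chars) / len(chars)


def flag_for_review(lines, threshold=0.8):
    """Return (line_number, line) pairs whose alphanumeric ratio is below threshold."""
    return [
        (n, line)
        for n, line in enumerate(lines, start=1)
        if alpha_ratio(line) < threshold
    ]


# Hypothetical usage on one of the output files from the session:
#   with open("out2.txt", encoding="utf-8") as f:
#       for n, line in flag_for_review(f.read().splitlines()):
#           print(n, line)
```

A correction UI could present the flagged lines first, next to the matching region of the page image.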
Latest revision as of 17:54, 5 March 2017