4 March 2017: A session with Tesseract
Latest revision as of 17:54, 5 March 2017
Experiments
- Installed tesseract-ocr via Homebrew onto the Mac mini attached to the book scanner.
- Took a book page image from the scanner (using scan.py, which still works), and ran it through tesseract to see what it would produce.
- We made three attempts.
- The command-line incantation used was:
<pre>
Warrens-MBP:diybookscanner hdpe$ time tesseract ~/build/samples/img00001.jpg out0 -l eng
Tesseract Open Source OCR Engine v3.05.00 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
real 0m15.946s
user 0m14.255s
sys 0m0.340s
</pre>
- Lessons from this experiment:
- We'll need to automate image preprocessing in the pipeline, converting images to high contrast, to get tesseract working reasonably well.
- The page was shot at a slight angle, which may or may not have affected tesseract. We're unsure.
- Tesseract's poor results on the third image are partly due to math symbols, non-English terminology, binomial nomenclature, tables of figures, etc.
- Net result: no matter how clean and perfect the input images, tesseract will encounter things it just can't handle, so manual editing may be necessary.
- Trent has some app UI ideas about how to improve the correction workflow.
- I found an Angular 1.x web app project on GitHub which accepts image uploads, passes them through tesseract, and returns the processed text. The backend is a Node.js Express server which invokes tesseract through an npm wrapper library.
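As a rough illustration of the high-contrast step mentioned in the lessons above, the sketch below binarizes greyscale pixel values with Otsu's method. The session notes do not name a thresholding technique, so this is only one plausible choice, swapped in for illustration. The function names `otsu_threshold` and `binarize` are hypothetical, and a real pipeline would more likely lean on an imaging library than on pure Python.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Pick the grey level (0-255) that best separates dark from light pixels.

    Otsu's method: try every candidate threshold and keep the one that
    maximizes the between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # nothing in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break     # everything is in the dark class; no further split possible
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels: List[int], threshold: int) -> List[int]:
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

Here `pixels` is assumed to be a flat list of 8-bit grey values for one page image; loading the scanner's JPEG into that form is left out, since the notes do not say which imaging tools are available on the workstation.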
The working files and folders are installed under ~/build. We also installed nvm and Node 7. The workstation was backed up before and after working on it.
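The command-line run recorded above could be scripted so that each scanned page is processed and timed the same way. The following is a minimal sketch, assuming `tesseract` is on the `PATH` and writes its result to `<output_base>.txt`, as it does for plain-text output. The helper names `build_tesseract_command` and `run_ocr` are hypothetical and not part of any existing script in ~/build.

```python
import subprocess
import time
from pathlib import Path
from typing import List, Tuple


def build_tesseract_command(image: Path, output_base: str, lang: str = "eng") -> List[str]:
    """Build the same argument list as the manual run: tesseract IMAGE OUTBASE -l LANG."""
    return ["tesseract", str(image), output_base, "-l", lang]


def run_ocr(image: Path, output_base: str, lang: str = "eng") -> Tuple[str, float]:
    """Run tesseract on one image and return (recognized text, wall-clock seconds)."""
    cmd = build_tesseract_command(image, output_base, lang)
    start = time.perf_counter()
    # check=True raises CalledProcessError if tesseract exits with a non-zero status.
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # Tesseract appends ".txt" to the output base for plain-text output.
    text = Path(output_base + ".txt").read_text(encoding="utf-8")
    return text, elapsed
```

Calling `run_ocr(Path.home() / "build" / "samples" / "img00001.jpg", "out0")` would mirror the manual invocation; the elapsed time it reports corresponds to the `real` figure from `time`, not to the `user` or `sys` figures.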