4 March 2017: A session with Tesseract
Revision as of 17:23, 5 March 2017
Experiments
- Installed tesseract-ocr via Homebrew on the Mac mini attached to the book scanner.
- Took a book page image from the scanner (using scan.py, which still works) and ran it through tesseract to see what it would produce.
- We made three attempts.
- Lessons from this experiment:
- We will need to automate image processing in the pipeline, converting images to high contrast, before tesseract works reasonably well.
- The text was shot at a slight angle. This may have affected tesseract, but we are unsure.
- Tesseract did poorly on the third image partly because of math symbols, non-English terminology, binomial nomenclature, tables of figures, etc.
- The net result: no matter how clean the input images are, tesseract will encounter things it can't handle, so manual editing may be necessary.
- Trent has some app UI ideas for improving the correction workflow.
- I found an Angular 1.x web app project on GitHub that accepts image uploads, passes them through tesseract, and returns the processed text. The backend is a Node.js Express server that invokes tesseract through an npm wrapper library.
The working files and folders are installed under ~/build. nvm and Node 7 were also installed. The workstation was backed up before and after working on it.
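The high-contrast step from the lessons above could look roughly like the sketch below. It is not what we ran in this session. It assumes the page image has already been loaded as a flat list of 8-bit grayscale values (0 = black, 255 = white) by some imaging library, which these notes do not specify, and it uses Otsu's method as one possible way to pick the black/white cutoff. The function names `otsu_threshold`, `binarize`, and `tesseract_command` are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick the 0-255 cutoff that best separates dark from light pixels.

    Otsu's method: try every cutoff and keep the one that maximizes the
    between-class variance of the two resulting groups.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate cutoff
    sum_dark = 0  # sum of their gray values
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue          # no dark pixels yet, nothing to compare
        w_light = total - w_dark
        if w_light == 0:
            break             # every pixel is already in the dark group
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels):
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


def tesseract_command(image_path, output_base):
    """Build the basic CLI call; tesseract writes <output_base>.txt."""
    return ["tesseract", image_path, output_base]


if __name__ == "__main__":
    # Hypothetical tiny "page": dark ink values mixed with light paper values.
    page = [40, 50, 60, 200, 210, 220]
    print(binarize(page))
    print(" ".join(tesseract_command("page_bw.png", "page_bw")))
```

The command list could be handed to `subprocess.run` once the binarized image has been written back to disk. A single global cutoff like this would probably struggle with uneven lighting from the scanner rig, so treat it as a starting point only.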