<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.extremist.software/index.php?action=history&amp;feed=atom&amp;title=30_May_2017%3A_Test_a_copy_of_PDFScanner</id>
	<title>30 May 2017: Test a copy of PDFScanner - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.extremist.software/index.php?action=history&amp;feed=atom&amp;title=30_May_2017%3A_Test_a_copy_of_PDFScanner"/>
	<link rel="alternate" type="text/html" href="https://wiki.extremist.software/index.php?title=30_May_2017:_Test_a_copy_of_PDFScanner&amp;action=history"/>
	<updated>2026-04-11T04:17:11Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.39.13</generator>
	<entry>
		<id>https://wiki.extremist.software/index.php?title=30_May_2017:_Test_a_copy_of_PDFScanner&amp;diff=58703&amp;oldid=prev</id>
		<title>Plausible deniability: added link to spammy file hosting service</title>
		<link rel="alternate" type="text/html" href="https://wiki.extremist.software/index.php?title=30_May_2017:_Test_a_copy_of_PDFScanner&amp;diff=58703&amp;oldid=prev"/>
		<updated>2017-05-31T10:30:50Z</updated>

		<summary type="html">&lt;p&gt;added link to spammy file hosting service&lt;/p&gt;
&lt;table style=&quot;background-color: #fff; color: #202122;&quot; data-mw=&quot;interface&quot;&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;col class=&quot;diff-marker&quot; /&gt;
				&lt;col class=&quot;diff-content&quot; /&gt;
				&lt;tr class=&quot;diff-title&quot; lang=&quot;en&quot;&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;← Older revision&lt;/td&gt;
				&lt;td colspan=&quot;2&quot; style=&quot;background-color: #fff; color: #202122; text-align: center;&quot;&gt;Revision as of 03:30, 31 May 2017&lt;/td&gt;
				&lt;/tr&gt;&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot; id=&quot;mw-diff-left-l1&quot;&gt;Line 1:&lt;/td&gt;
&lt;td colspan=&quot;2&quot; class=&quot;diff-lineno&quot;&gt;Line 1:&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Experiments ===&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;=== Experiments ===&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;br/&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td colspan=&quot;2&quot; class=&quot;diff-side-deleted&quot;&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot; data-marker=&quot;+&quot;&gt;&lt;/td&gt;&lt;td style=&quot;color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;&lt;ins style=&quot;font-weight: bold; text-decoration: none;&quot;&gt;* PDF resulting from this testing session is here: [http://s000.tinyupload.com/index.php?file_id=93967682158344247289 pdfscanner_test_MQ.pdf]&lt;/ins&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* We installed PDFScanner onto the newly rebuilt BookScanner Mac Mini&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* We installed PDFScanner onto the newly rebuilt BookScanner Mac Mini&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Asked it to convert the images from the [[4_March_2017:_A_session_with_Tesseract|previous]] experiment with Tesseract&lt;/div&gt;&lt;/td&gt;&lt;td class=&quot;diff-marker&quot;&gt;&lt;/td&gt;&lt;td style=&quot;background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;&quot;&gt;&lt;div&gt;* Asked it to convert the images from the [[4_March_2017:_A_session_with_Tesseract|previous]] experiment with Tesseract&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;</summary>
		<author><name>Plausible deniability</name></author>
	</entry>
	<entry>
		<id>https://wiki.extremist.software/index.php?title=30_May_2017:_Test_a_copy_of_PDFScanner&amp;diff=58702&amp;oldid=prev</id>
		<title>Plausible deniability: Add review page for PDFScanner</title>
		<link rel="alternate" type="text/html" href="https://wiki.extremist.software/index.php?title=30_May_2017:_Test_a_copy_of_PDFScanner&amp;diff=58702&amp;oldid=prev"/>
		<updated>2017-05-31T10:17:09Z</updated>

		<summary type="html">&lt;p&gt;Add review page for PDFScanner&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;=== Experiments ===&lt;br /&gt;
&lt;br /&gt;
* We installed PDFScanner onto the newly rebuilt BookScanner Mac Mini&lt;br /&gt;
* Asked it to convert the images from the [[4_March_2017:_A_session_with_Tesseract|previous]] experiment with Tesseract&lt;br /&gt;
* PDFScanner yielded a .pdf file containing the fourteen pages from that session&lt;br /&gt;
** Positive&lt;br /&gt;
*** It produced an indexed, searchable PDF&lt;br /&gt;
*** OCR accuracy was fairly good, though not 100%, recognizing most English words and correctly spelling many scientific terms&lt;br /&gt;
*** It recognizes tables and non-standard indents&lt;br /&gt;
*** Appears to safely ignore images and diagrams, preserving page layout around images&lt;br /&gt;
*** Tables and such were paginated to match original layout in the resulting PDF, lining up exactly with original scanned images&lt;br /&gt;
*** It indexes scientific terminology and unrecognizable (non-dictionary) terms, within the limits of its OCR abilities&lt;br /&gt;
*** PDFScanner is faster than ABBYY FineReader&lt;br /&gt;
** Neutral&lt;br /&gt;
*** The resulting PDF pages contain not just text, diagrams and images, but also the original scanner images, which adds to file size but retains the book&amp;#039;s original look&lt;br /&gt;
** Negative&lt;br /&gt;
*** Text is not reproduced with full accuracy&lt;br /&gt;
*** Spacing errors: spaces are added between letters within a single word, and lost between words and between words and punctuation&lt;br /&gt;
*** Pages were not automatically oriented for English (left-to-right, top-to-bottom) reading&lt;br /&gt;
*** Pages were not automatically straightened&lt;br /&gt;
*** None of the pages was automatically cropped (the reviewer cropped all images beforehand)&lt;br /&gt;
* PDFScanner vs FineReader&lt;br /&gt;
** FineReader approaches 100% accuracy&lt;br /&gt;
** PDFScanner appears to reach into the 80-90% accuracy range, but since page structure and background are retained, this would only impact copying text as Unicode/ASCII to another program, and indexing by search engines&lt;br /&gt;
** Both PDFScanner and FineReader grok a page&amp;#039;s structure and reproduce it in the resulting PDF.&lt;br /&gt;
** Both PDFScanner and FineReader include the background image&lt;br /&gt;
** As with the FineReader test, the image is fairly low contrast, and again may be distracting for some. Contrast could be bumped up during capture.&lt;br /&gt;
** PDFScanner is about US$16 and is based on FOSS libraries, including Tesseract.  ABBYY&amp;#039;s library licensing terms are unknown&lt;br /&gt;
** PDFScanner requires manual manipulation of image files, but provides built-in tools for rotation and cropping.&lt;br /&gt;
** FineReader is hands-off, for what it does.  After it receives images, it works without prompt or interrupt.&lt;br /&gt;
** PDFScanner doesn&amp;#039;t provide contrast or saturation filters - these would have to be added to the post-capture pipeline to help increase OCR accuracy (a low contrast scan image adversely affects Tesseract&amp;#039;s results)&lt;br /&gt;
&lt;br /&gt;
=== Summary ===&lt;br /&gt;
* PDFScanner is faster than FineReader.  It looks to be a very good product: not as good as FineReader, but worthwhile considering the price and results&lt;br /&gt;
* The price for a single station is US$16&lt;br /&gt;
* Orientation/rotation and color management should be handled prior to PDFScanner&lt;br /&gt;
* A user should be able to run a script to capture images from cameras, have them be automatically oriented and their color processed, then output to a &amp;quot;ready-for-PDFScanner&amp;quot; folder&lt;br /&gt;
* If all goes well, that folder can be dragged and dropped onto the PDFScanner app icon and OCRed automatically&lt;br /&gt;
* The user will then need to press Cmd-S to export the result as a PDF&lt;br /&gt;
* This is a reasonably usable and useful product.  It&amp;#039;s a contender against ABBYY FineReader.  Either would do well.&lt;/div&gt;</summary>
		<author><name>Plausible deniability</name></author>
	</entry>
</feed>