11 April 2017 - A Bit More Than Mostly Searchable: Scanned Paper You Can Find with Stewart Russell
Pre-Meeting Announcements
- Upcoming Meetings:
- 9 May - Lessons learnt as a maintainer with Dhaval Giani
- 13 June - Intelligent Availability; The evolution of High Availability with Madison Kelly
Notes
- Slide deck: ABitMoreThanMostlySearchable.odp
- Good-enough-for-me scan → OCR → PDF script: dwim-ocr.sh. Uses ghostscript, tesseract, pdfbeads (Ruby) and GNU parallel. The original idea came from an outline script on the DIY Book Scanner forum circa 2010.
- sample multi-page duplex document: https://scruss.com/talks/02017/EPSON012.PDF
Moar Notes
- Scan Tailor does a very decent job of slicing up scanned books.
- gscan2pdf, if you like graphical things, probably does what you want with scanned images …
- PDF specs and refs:
- PDF 32000-1:2008 standard and accompanying Public Patent License.
- PDF/A archival format: Creating a PDF/A document from PostScript in ghostscript, converting a PDF to PDF/A:
gs -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output_filename.pdf input_filename.pdf
If you need to work out whether the document is colour or not, ghostscript's inkcov device can tell you if colour was used at all.
- Cryptographic signing: Rough process here, using an X.509 certificate issued by ARRL: Creating secure digital QSL cards with your LoTW certificate. I wonder if you could use P12 certs issued by Let’s Encrypt?
- other OCR software: Cuneiform used to be a contender and may still be useful. Not so sure about GNU Ocrad, gocr/jocr, …
- tesseract:
tesseract -psm 0 …
will output orientation and language heuristics, useful if your scanner spits out duplex pages flipped. It's a bit slow, but faster than manually digging through a PDF object stream looking at the individual transformation matrices.
- training tesseract is definitely worth it if you have a limited domain, say line-printed numerals. (You might have less success with the completely illegible old German Sütterlin cursive, though.)
- JPEG 2000: The least painful way I found of making JPEG 2000 files from colour images was with ImageMagick/GraphicsMagick's convert command:
convert -define 'jp2:rate=0.008' in.png out.jp2
More (possibly dated) ways of mucking about with JP2s are here: JPEG 2000 on Ubuntu — without anyone getting stabbed
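For the inkcov note above: the device prints one line of C, M, Y and K coverage per page, so a few lines of Python can turn that into a colour/not-colour answer. This is only a sketch and not part of the talk. The function names, the sample numbers and the "any C, M or Y above a threshold means colour" rule are assumptions made for illustration, and the gs invocation in the comment is one common way to run the device.

```python
# Sketch: decide whether a document used colour from ghostscript's inkcov output.
# Assumed way to capture the text (run separately, not by this code):
#   gs -q -o - -sDEVICE=inkcov input_filename.pdf
# Each page is assumed to produce a line like:
#   " 0.00000  0.00000  0.00000  0.02230 CMYK OK"


def parse_inkcov(text):
    """Return a list of (c, m, y, k) coverage tuples, one per page line."""
    pages = []
    for line in text.splitlines():
        fields = line.split()
        # A page line has four numbers followed by the literal word CMYK.
        if len(fields) >= 5 and fields[4] == "CMYK":
            try:
                pages.append(tuple(float(f) for f in fields[:4]))
            except ValueError:
                continue  # not a coverage line after all; skip it
    return pages


def uses_colour(pages, threshold=0.0):
    """True if any page has cyan, magenta or yellow coverage above threshold."""
    return any(
        c > threshold or m > threshold or y > threshold
        for c, m, y, _k in pages
    )


if __name__ == "__main__":
    # Hypothetical two-page output: page 1 is black only, page 2 has colour.
    sample = (
        " 0.00000  0.00000  0.00000  0.02230 CMYK OK\n"
        " 0.10022  0.09563  0.10071  0.06259 CMYK OK\n"
    )
    pages = parse_inkcov(sample)
    print("colour" if uses_colour(pages) else "black and white")
```

A small nonzero threshold may be worth trying if scans of greyscale pages report traces of C, M or Y; whether that is needed depends on the input.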
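For the tesseract orientation check above: the report is a handful of "key: value" lines, so it is easy to read from a script that decides which pages need flipping. This is a rough sketch, not something shown at the meeting. The helper names are made up, the sample text is illustrative, and the exact field names vary between tesseract versions; "Orientation in degrees" is assumed to be present.

```python
# Sketch: read the orientation report printed by tesseract's OSD-only mode.
# The text is assumed to have been captured already; field names differ
# between tesseract versions, so treat the key used below as an assumption.


def parse_osd(text):
    """Turn 'key: value' lines into a dict, ignoring anything else."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # only keep lines that actually contained a colon
            info[key.strip()] = value.strip()
    return info


def detected_orientation(info):
    """Return the reported orientation in degrees, or None if absent."""
    value = info.get("Orientation in degrees")
    if value is None:
        return None
    return int(float(value))


if __name__ == "__main__":
    # Illustrative report for an upside-down page (not real captured output).
    sample = (
        "Page number: 0\n"
        "Orientation in degrees: 180\n"
        "Rotate: 180\n"
        "Orientation confidence: 5.23\n"
        "Script: Latin\n"
        "Script confidence: 1.11\n"
    )
    info = parse_osd(sample)
    print(detected_orientation(info), info.get("Orientation confidence"))
```

Checking the confidence field before acting on the orientation is probably wise, since sparse or mostly blank pages can give unreliable guesses.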
Post meeting observations
- WeasyPrint is the rather clever Python-based web→PDF printer application I mentioned
- If you're running a Gnome-based system and the Tracker desktop search engine is running amuck, it's probably getting stuck indexing a particular file. I posted a workaround here: no progress updates from gnome tracker - Ask Ubuntu
- Master PDF Editor for Linux — though the website looks dodgy af, this closed-source program seems to be quite a capable PDF editor.
Meta
- Dinner: Doner Kebab House, 391 Yonge Street
- Attendance: 15