11 April 2017 - A Bit More Than Mostly Searchable: Scanned Paper You Can Find with Stewart Russell


Upcoming Meetings:
- 9 May - Lessons learnt as a maintainer with Dhaval Giani
- 13 June - Intelligent Availability; The evolution of High Availability with Madison Kelly

Good-enough-for-me scan → OCR → PDF script: dwim-ocr.sh. Uses ghostscript, tesseract, pdfbeads (Ruby) and GNU parallel. The original idea came from an outline script on the DIY Book Scanner forum circa 2010.
Sample multi-page duplex document: https://scruss.com/talks/02017/EPSON012.PDF

Alternative to the CBZ format that's readable as a PDF yet keeps the JPEG files intact: collate with img2pdf, burst apart with pdfimages -j … from Poppler.

PDF specs and refs:
- PDF 32000-1:2008 standard and accompanying Public Patent License.
- PDF/A archival format: Creating a PDF/A document from PostScript in ghostscript, converting a PDF to PDF/A: gs -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output_filename.pdf input_filename.pdf. If you need to work out whether a document is colour or not, ghostscript's inkcov device reports the CMYK ink coverage of each page, so it can tell you if colour was used at all.
- Cryptographic signing: Rough process here, using an X.509 certificate issued by ARRL: Creating secure digital QSL cards with your LoTW certificate. I wonder if you could use P12 certs issued by Let’s Encrypt?
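
The inkcov check mentioned under PDF/A above can be sketched as a small filter. This is only a rough sketch: classify_inkcov is a made-up name, and it assumes inkcov prints one line per page of the form "C M Y K CMYK OK" (four coverage fractions followed by the literal CMYK), which is worth confirming against your own ghostscript's output.

```shell
#!/bin/sh
# classify_inkcov (hypothetical helper): read inkcov output on stdin and
# print "colour" if any page has cyan, magenta or yellow coverage,
# "mono" if only black was used, "unknown" if no page lines were seen.
# Assumed line format, one per page:  C M Y K CMYK OK
classify_inkcov() {
    awk '
        $5 == "CMYK" {
            pages++
            # Non-zero C, M or Y on any page means colour was used.
            if ($1 + 0 > 0 || $2 + 0 > 0 || $3 + 0 > 0) colour = 1
        }
        END {
            if (pages == 0) { print "unknown"; exit 1 }
            print (colour ? "colour" : "mono")
        }
    '
}

# Usage (needs ghostscript installed; not run here):
#   gs -q -o - -sDEVICE=inkcov input_filename.pdf | classify_inkcov
```

The answer could then decide whether the colour-related options in the PDF/A command are worth bothering with for a given scan.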

Other OCR software: Cuneiform used to be a contender and may still be useful. Not so sure about GNU Ocrad, gocr/jocr, …

tesseract:
- tesseract -psm 0 … (spelled --psm in tesseract 4 and later) will output orientation and script heuristics, useful if your scanner spits out duplex pages flipped. It's a bit slow, but faster than manually digging through a PDF object stream looking at the individual transformation matrices.
- training tesseract is definitely worth it if you have a limited domain, say line-printed numerals. (You might have less success with the completely illegible old German Sütterlin cursive, though.)
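
For the flipped-duplex case above, the orientation can be pulled out of the page segmentation mode 0 report with a one-line filter. This is a sketch under assumptions: osd_degrees is a made-up name, and it assumes the report contains a line of the form "Orientation in degrees: N". Where that report ends up (stderr, stdout or a separate file) differs between tesseract versions, so capturing it is left to you.

```shell
#!/bin/sh
# osd_degrees (hypothetical helper): read a tesseract orientation report
# on stdin and print the number from the first
# "Orientation in degrees: N" line. Prints nothing if no such line exists.
osd_degrees() {
    sed -n 's/^Orientation in degrees: *\([0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Example with a captured report (file name is illustrative only):
#   osd_degrees < page-0002.osd.txt
```

Any page that reports something other than 0 is a candidate for rotating before the main OCR pass.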

JPEG 2000: The least painful way I found of making JPEG 2000 files from colour images was with ImageMagick/GraphicsMagick's convert command: convert -define 'jp2:rate=0.008' in.png out.jp2. More (possibly dated) ways of mucking about with JP2s are here: JPEG 2000 on Ubuntu — without anyone getting stabbed.

Pre-Meeting Announcements