11 April 2017 - A Bit More Than Mostly Searchable: Scanned Paper You Can Find with Stewart Russell

  • Upcoming Meetings:
    • 9 May - Lessons learnt as a maintainer with Dhaval Giani
    • 13 June - Intelligent Availability; The evolution of High Availability with Madison Kelly

Moar Notes

  • gscan2pdf, if you like graphical things, probably does what you want …
  • alternative to cbz format that's readable as a PDF yet retains JPEG files intact: collate with img2pdf, burst apart with pdfimages -j … from Poppler.
  • other OCR software: Cuneiform used to be a contender and may still be useful. Not so sure about GNU Ocrad, GOCR/JOCR, …
  • tesseract:
    • tesseract -psm 0 … will output orientation and script detection results, useful if your scanner spits out duplex pages flipped. It's a bit slow, but faster than manually digging through a PDF object stream looking at the individual transformation matrices.
    • training tesseract is definitely worth it if you have a limited domain, say line-printed numerals. (You might have less success with the completely illegible old German Sütterlin cursive, though.)
  • JPEG 2000: The least painful way I found of making JPEG 2000 files from colour images was with ImageMagick/GraphicsMagick's convert command: convert -define 'jp2:rate=0.008' in.png out.jp2. More (possibly dated) ways of mucking about with JP2s are here: JPEG 2000 on Ubuntu — without anyone getting stabbed
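
For anyone who would rather script the commands in these notes, here is a rough Python sketch that only builds and prints the command lines; it does not run anything. The file names, the "osd" output base and the helper function names are made up for illustration. The flags are the ones mentioned in the notes. Newer tesseract releases spell the option --psm rather than -psm, so the flag is left as a parameter.

```python
import shlex


def collate_cmd(jpegs, pdf):
    """img2pdf wraps the JPEG files in a PDF without re-encoding them."""
    return ["img2pdf", *jpegs, "-o", pdf]


def burst_cmd(pdf, prefix):
    """pdfimages -j (from Poppler) writes embedded JPEGs back out as files."""
    return ["pdfimages", "-j", pdf, prefix]


def osd_cmd(image, outbase="osd", psm_flag="-psm"):
    """Page segmentation mode 0: orientation and script detection only.

    outbase is a hypothetical output base name; pass psm_flag="--psm"
    for newer tesseract releases.
    """
    return ["tesseract", psm_flag, "0", image, outbase]


def jp2_cmd(src, dst, rate=0.008):
    """ImageMagick convert with the jp2:rate define quoted in the notes."""
    return ["convert", "-define", f"jp2:rate={rate}", src, dst]


def show(cmd):
    """Render an argument list as a shell-safe command line."""
    return shlex.join(cmd)


if __name__ == "__main__":
    # Hypothetical file names, purely for illustration.
    pages = ["page-001.jpg", "page-002.jpg"]
    for cmd in (
        collate_cmd(pages, "book.pdf"),
        burst_cmd("book.pdf", "page"),
        osd_cmd("page-001.jpg"),
        jp2_cmd("in.png", "out.jp2"),
    ):
        print(show(cmd))
```

To actually run one of these, hand the list to subprocess.run(cmd, check=True). Keeping the arguments as a list rather than a single string means file names with spaces need no extra quoting.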

Post meeting observations

  • Dinner: Doner Kebab House, 391 Yonge Street
  • Attendance: 15