Differences

This shows you the differences between two versions of the page.

meeting:2017-04 [2017/04/11 12:55]
myles [Meta]
meeting:2017-04 [2018/01/15 08:25]
scruss [Post meeting observations]
Line 10: Line 10:
  
 ===== Notes ===== ===== Notes =====
 +
 +  * Video! [[https://www.youtube.com/watch?v=EH_txB_hJWw|A Bit More Than Mostly Searchable: Scanned Paper You Can Find with Stewart Russell - YouTube]]
 +  * Slide deck: [[https://scruss.com/talks/02017/gtalug201704-ABitMoreThanMostlySearchable.odp|ABitMoreThanMostlySearchable.odp]]
 +
 +  * Good-enough-for-me //scan → OCR → PDF// script: [[https://scruss.com/talks/02017/dwim-ocr.sh|dwim-ocr.sh]]. Uses ghostscript, tesseract, pdfbeads (Ruby) and GNU parallel. The original idea came from an outline script on the [[https://forum.diybookscanner.org/index.php?sid=0be98f164e262176b6689f9892aa5a64|DIY Book Scanner]] forum circa 2010.
 +  * sample multi-page duplex document: https://scruss.com/talks/02017/EPSON012.PDF
 +
 +==== Moar Notes ====
 +
 +  * [[http://scantailor.org/|Scan Tailor]] does a decent job of slicing up scanned books.
 +
 +  * [[http://gscan2pdf.sourceforge.net/|gscan2pdf]], if you like graphical things, //probably// does what you want with scanned images …
 +
 +  * alternative to [[https://en.wikipedia.org/wiki/Comic_book_archive|cbz]] format that's readable as a PDF yet retains JPEG files intact: collate with [[https://gitlab.mister-muffin.de/josch/img2pdf|img2pdf]], burst apart with ''pdfimages -j …'' from [[https://poppler.freedesktop.org/|Poppler]].
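
A minimal sketch of that round trip, assuming ''img2pdf'' and Poppler's ''pdfimages'' are installed. The function names ''collate'' and ''burst'' and all file names are hypothetical, chosen only for illustration.

```shell
# Wrap JPEGs in a PDF without re-encoding them.
# Usage: collate book.pdf page-001.jpg page-002.jpg ...
collate() {
    out=$1
    shift
    img2pdf "$@" -o "$out"
}

# Pull the embedded JPEGs back out; pdfimages names them PREFIX-NNN.jpg.
# Usage: burst book.pdf page
burst() {
    pdfimages -j "$1" "$2"
}
```

Because the JPEG data is embedded rather than recompressed, the burst files should match the originals in content, though the file names will follow the ''pdfimages'' numbering rather than the original names.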
 +
 +  * PDF specs and refs:
 +    * [[https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf|PDF 32000-1:2008]] standard and accompanying [[https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf|Public Patent License]].
 +    * PDF/A archival format: [[https://ghostscript.com/doc/current/Ps2pdf.htm#PDFA|Creating a PDF/A document from PostScript in ghostscript]], [[https://stackoverflow.com/questions/1659147/how-to-use-ghostscript-to-convert-pdf-to-pdf-a-or-pdf-x/9343820#9343820|converting a PDF to PDF/A]]: ''gs -dPDFA -dBATCH -dNOPAUSE -dUseCIEColor -sProcessColorModel=DeviceCMYK -sDEVICE=pdfwrite -sPDFACompatibilityPolicy=1 -sOutputFile=output_filename.pdf input_filename.pdf''. If you need to work out whether a document is colour or not, ghostscript's [[https://ghostscript.com/doc/current/Devices.htm|inkcov]] device reports per-page CMYK ink coverage, which tells you if colour was used at all.
 +    * Cryptographic signing: Rough process here, using an X.509 certificate issued by ARRL: [[http://scruss.com/blog/2011/10/09/creating-secure-digital-qsl-cards-with-your-lotw-certificate/|Creating secure digital QSL cards with your LoTW certificate]]. I wonder if you could use P12 certs issued by [[https://letsencrypt.org/|Let’s Encrypt]]?
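
The inkcov colour check mentioned above could be scripted roughly as follows. This is a sketch, not a tested recipe: the function name ''has_colour'' is hypothetical, and it assumes inkcov prints one line per page consisting of four C, M, Y, K coverage fractions followed by ''CMYK OK''.

```shell
# Reads inkcov output on stdin. Exits 0 if any page reports non-zero
# cyan, magenta or yellow coverage, and 1 otherwise.
has_colour() {
    awk '$5 == "CMYK" && ($1 + $2 + $3) > 0 { found = 1 }
         END { exit !found }'
}

# Typical use (requires ghostscript):
#   gs -q -o - -sDEVICE=inkcov input_filename.pdf | has_colour && echo colour
```

Note that a grey mixed from equal parts C, M and Y would still count as colour under this test, so treat the answer as a hint rather than a verdict.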
 +
 +  * other OCR software: [[https://launchpad.net/cuneiform-linux|Cuneiform]] used to be a contender and may still be useful. Not so sure about GNU Ocrad, gocr/jocr, …
 +
 +  * tesseract:
 +    * ''tesseract -psm 0 …'' runs orientation and script detection only, useful if your scanner spits out duplex pages flipped. It's a bit slow, but faster than manually digging through a PDF object stream looking at the individual transformation matrices.
 +    * training tesseract is definitely worth it if you have a limited domain, say line-printed numerals. (You might have less success with the completely illegible old German [[https://en.wikipedia.org/wiki/S%C3%BCtterlin|Sütterlin]] cursive, though.)
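
To act on the orientation result in a script, something like the following could pull the angle out of tesseract's report. The function name ''osd_degrees'' is hypothetical, and the sketch assumes the report contains a line of the form ''Orientation in degrees: N''; the exact wording, and whether it lands on stdout, stderr or in a file, varies between tesseract versions.

```shell
# Reads tesseract orientation/script detection output on stdin and
# prints the reported page orientation in degrees (first match only).
osd_degrees() {
    awk -F': *' '/^Orientation in degrees:/ { print $2; exit }'
}

# Possible use (newer tesseract releases spell the flag --psm):
#   tesseract page.png stdout -psm 0 2>&1 | osd_degrees
```

Check on a known-flipped page which way the reported angle needs to be applied before rotating a whole batch.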
 +
 +  * JPEG 2000: The least painful way I found of making JPEG 2000 files from colour images was with ImageMagick/GraphicsMagick's **convert** command: ''convert -define 'jp2:rate=0.008' in.png out.jp2''. More (possibly dated) ways of mucking about with JP2s are here: [[http://scruss.com/blog/2014/06/16/jpeg-2000-on-ubuntu-without-anyone-getting-stabbed/|JPEG 2000 on Ubuntu — without anyone getting stabbed]]
 +
 +==== Post meeting observations ====
 +
 +  * Chris B. suggested [[https://github.com/oniony/TMSU|oniony/TMSU: TMSU lets you tag your files and then access them through a nifty virtual filesystem from any other application.]]
 +  * [[http://weasyprint.org/|WeasyPrint]] is the rather clever Python-based web→PDF printer application I mentioned
 +  * If you're running a Gnome-based system and the [[https://wiki.gnome.org/Projects/Tracker/|Tracker]] desktop search engine is running amuck, it's probably getting stuck indexing a particular file. I posted a workaround here: [[https://askubuntu.com/questions/914581/no-progress-updates-from-gnome-tracker/914602#914602|no progress updates from gnome tracker - Ask Ubuntu]]
 +  * [[https://code-industry.net/free-pdf-editor/|Master PDF Editor for Linux]]: though the website looks dodgy af, this closed-source program seems to be quite a capable PDF editor.
  
 ===== Meta ===== ===== Meta =====
  
-  * **Dinner**: +  * **Dinner**: Doner Kebab House, 391 Yonge Street
   * **Attendance**: 15   * **Attendance**: 15
   * [[operations:meeting:2017-04|Ops Notes]]   * [[operations:meeting:2017-04|Ops Notes]]
  • meeting/2017-04.txt
  • Last modified: 3 years ago
  • by scruss