Creating Searchable PDFs

OSCAR stores most incoming documents as PDFs. These are usually generated by scanning or faxing software (see Hylafax) and are plain image PDFs whose text cannot be searched or selected. Open source OCR software can add a searchable, selectable text layer to such a PDF.


Creating Searchable PDFs from Regular Image PDFs

A searchable but hidden "text" layer can be added to a scanned or faxed image.

Selecting Text in a Searchable PDF

Figure 1: Example of a PDF from a scanned document to which a text layer has been added and selected

Document Version History

The document is copyright by Peter Hutten-Czapski © 2013 under the Creative Commons Attribution-Share Alike 3.0 Unported License

Contents

1. Installing a Script

Installation Instructions

Here we will use Ghostscript, Cuneiform and hocr2pdf. Acceptable results can be had even on Ubuntu 10.04 LTS (Lucid Lynx) with:

  • GPL Ghostscript 8.71 (2010) 
  • Cuneiform for Linux 0.7.0 (2010)
  • hocr2pdf version 0.7.4 (2009)

I suggest using the latest stable versions of your preferred image conversion, OCR and PDF creation software, and testing your settings before putting them into production. If the results are suboptimal, use an alternative OCR engine such as Tesseract 3.0 or newer.
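If Cuneiform gives poor results, the OCR step in the loop of the script given later in this section could be swapped for Tesseract. The following is only a minimal sketch, assuming Tesseract 3.x is installed (package tesseract-ocr on Ubuntu) and that hocr2pdf accepts its hOCR output for your documents. The function name ocr_page_tesseract is hypothetical. Depending on the Tesseract release, the hOCR file is written as either .hocr or .html, so the sketch handles both.

```shell
#!/bin/bash
# Sketch: OCR a single tif page with tesseract instead of cuneiform.
# ocr_page_tesseract is a hypothetical helper name, not part of any package.
ocr_page_tesseract() {
    local page="$1"
    local base="${page%.tif}"

    # the "hocr" config makes tesseract write hOCR (annotated HTML)
    tesseract "$page" "$base" hocr

    # newer releases write $base.hocr, older ones write $base.html
    if [ -f "$base.hocr" ]
    then
        mv "$base.hocr" "$base.html"
    fi

    # overlay the recognised text as a hidden layer on the page image
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
}
```

Inside the loop you would then call ocr_page_tesseract "$page" in place of the cuneiform and hocr2pdf lines. Compare the output of both engines on a few typical faxes before choosing.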

Open a terminal and type the following to install the versions available for your release of Ubuntu (hocr2pdf is part of the exactimage package):

sudo apt-get install cuneiform ghostscript exactimage
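Before going further it may be worth confirming that the three commands the script relies on (gs, cuneiform and hocr2pdf) are actually on your PATH. A minimal sketch follows; the function name missing_tools is my own and purely illustrative.

```shell
#!/bin/bash
# Print the name of every command given as an argument that cannot be
# found on PATH, one per line. Prints nothing if all are present.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool
    for tool in "$@"
    do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools gs cuneiform hocr2pdf)"
if [ -n "$missing" ]
then
    echo "Missing tools:" $missing
else
    echo "All OCR tools found"
fi
```

If anything is reported missing, rerun the apt-get command above and check its output for errors.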

Open a text editor (such as vi, nano or gedit) and paste the following into it.

#!/bin/bash
# Run OCR on a multi-page PDF file and create a new PDF with
# the extracted text (if any) in a hidden layer.
# Requires cuneiform, hocr2pdf, gs.
# Usage: ./ocrpdf.sh input.pdf output.pdf

set -e

input="$1"
output="$2"

tmpdir="$(mktemp -d)"

# extract images of the pages as tif files 
# note: resolution hard-coded, do not go below fax resolution of 150dpi
gs -sDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tif" -dNOPAUSE -dBATCH -- "$input"

# OCR each tif image into annotated HTML (hOCR) and then convert it into a PDF
for page in "$tmpdir"/page-*.tif
do
    base="${page%.tif}"
    cuneiform -f hocr -o "$base.html" "$page"
    hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
done

# combine each of the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf

# cleanup
rm -rf -- "$tmpdir"

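Save the file (for example as ocrpdf.sh, the name used in the usage comment) and make it executable. Mode 755 is sufficient and, unlike 777, does not leave the script writable by every user:

chmod 755 ocrpdf.sh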
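To try the script by hand and confirm that a text layer was really added, you can extract the text from the result. The sketch below assumes pdftotext is installed (it comes with the poppler-utils package, which is not among the tools installed above). The function name has_text_layer and the file names in the comments are hypothetical.

```shell
#!/bin/bash
# Succeed if the PDF given as $1 contains any extractable letters or digits.
# Assumes pdftotext (poppler-utils) is installed; has_text_layer is a
# hypothetical helper name.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}

# Example session (file names are placeholders):
#   ./ocrpdf.sh scan.pdf scan-searchable.pdf
#   has_text_layer scan-searchable.pdf && echo "text layer found"
```

An untouched scan should fail this check and the processed file should pass it. A pass only shows that some text was recognised, not that it is accurate, so open a sample in a PDF viewer as well.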

Set up a cron job that takes the scanned files from the directory where they arrive, processes them, and writes the results into the directory from which you load files into the Inbox.
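One way to arrange this is a small wrapper script that cron runs every few minutes. The following is a sketch under stated assumptions: all three paths are hypothetical and must be changed to match your installation, and the function name ocr_incoming is my own. Each original is moved to a "done" subdirectory after a successful run so that it is not processed twice.

```shell
#!/bin/bash
# Batch wrapper intended to be run from cron.
# All paths below are hypothetical examples; adjust them to your setup.
OCRPDF="${OCRPDF:-/usr/local/bin/ocrpdf.sh}"
INCOMING="${INCOMING:-/var/spool/scans/incoming}"
SEARCHABLE="${SEARCHABLE:-/var/spool/scans/searchable}"

# OCR every PDF in directory $1 into directory $2, then move the
# original into $1/done so it is skipped on the next run.
ocr_incoming() {
    local src="$1" dest="$2" pdf name
    mkdir -p "$dest" "$src/done"
    for pdf in "$src"/*.pdf
    do
        [ -e "$pdf" ] || continue      # nothing waiting
        name="$(basename "$pdf")"
        # write under a temporary name first so a half-written file
        # is never picked up for loading into the Inbox
        if "$OCRPDF" "$pdf" "$dest/.$name.part"
        then
            mv "$dest/.$name.part" "$dest/$name"
            mv "$pdf" "$src/done/$name"
        else
            echo "OCR failed for $pdf" >&2
            rm -f "$dest/.$name.part"
        fi
    done
}

# Only act when the incoming directory exists.
if [ -d "$INCOMING" ]
then
    ocr_incoming "$INCOMING" "$SEARCHABLE"
fi
```

Saved as, say, /usr/local/bin/ocr-incoming.sh (again a hypothetical name), it could be scheduled with a crontab entry such as:

*/5 * * * * /usr/local/bin/ocr-incoming.sh

Files that fail OCR stay in the incoming directory and are retried on the next run, so watch the cron mail for repeated failures.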