Creating Searchable PDFs
OSCAR stores most incoming documents as PDF's. These are usually generated by scanning or faxing software (see Hylafax) and are plain pdf's that you cannot search. The ability to select text can be added to a pdf with the appropriate use of open source OCR software.
Creating Searchable PDF's from regular image PDF's
A searchable but hidden "text" layer can be added to an scanned or faxed image
Figure 1: Example of a PDF from a scanned document to which a text layer has been added and selected
Document Version History
- v1.0 – initial public release to oscarmanual.org on July 5, 2013
The document is copyright by Peter Hutten-Czapski © 2013 under the Creative Commons Attribution-Share Alike 3.0 Unported License
Contents 1. Installing a Script |
Installation Instructions
Here we will be using ghost script, cuneiform, hocr2pdf. Acceptable results are available even in Ubuntu Lucid Linux 10.04 LTS
- GPL Ghostscript 8.71 (2010)
- Cuneiform for Linux 0.7.0 (2010)
- hocr2pdf version 0.7.4 (2009)
I suggest using the latest stable versions of your preferred image conversion, OCR and PDF creation software and testing settings before putting into production. If suboptimal use an alternate OCR library such as tesseract v 3.0 or newer.
Open a terminal and type the following to install the set available for your version of Ubuntu
sudo apt-get install cuneiform gs exactimage
Open a text editor (such as vi, nano, gedit)
and paste the following into it.
#!/bin/bash
# Run OCR on a multi-page PDF file and create a new PDF with
# the extracted text (if any) in hidden layer.
# Requires cuneiform, hocr2pdf, gs.
# Usage: ./ocrpdf.sh input.pdf output.pdf
set -e
input="$1"
output="$2"
tmpdir="$(mktemp -d)"
# extract images of the pages as tif files
# note: resolution hard-coded, do not go below fax resolution of 150dpi
gs -SDEVICE=tiffg4 -r300x300 -sOutputFile="$tmpdir/page-%04d.tif" -dNOPAUSE -dBATCH -- "$input"
# OCR each tif image into an anointed html and then convert into PDF
for page in "$tmpdir"/page-*.tif
do
base="${page%.tif}"
cuneiform -f hocr -o "$base.html" "$page"
hocr2pdf -i "$page" -o "$base.pdf" < "$base.html"
done
# combine each of the pages into one PDF
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$tmpdir"/page-*.pdf
# cleanup
rm -rf -- "$tmpdir"
Save and chomd 777 the file
setup a cron job to take the scanned files from where they come in and process them into the directory from which you take files to load into the Inbox from.