ScanningToPDF

From code/src wiki

How to improve the quality of scanned documents and recombine them into a PDF.

Step 0: Generate PNM files

If you already have a set of PNM files from scanadf(1) you can skip this step.

If you have a PDF document, split it into one PNM file per page with:

pdftoppm -r 300 scanned.pdf page
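Step 1 reads its input as page-%02d.ppm, but pdftoppm pads the page number according to the number of pages in the document (for example page-1.ppm for a short document, page-001.ppm for a long one), so the names may not line up. The following is a minimal sketch of a renaming helper; normalize_pages is a hypothetical name, and the optional second argument is the prefix that was given to pdftoppm.

```shell
#!/bin/sh
# normalize_pages DIR [PREFIX]
# Hypothetical helper: rename PREFIX-<n>.ppm in DIR to page-NN.ppm
# (two-digit, zero-padded) so the names match page-%02d.ppm.
normalize_pages() {
    dir=$1
    prefix=${2:-page}
    for f in "$dir/$prefix"-*.ppm; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base=${f##*/}                      # drop the directory part
        n=${base#"$prefix"-}               # drop the prefix and hyphen
        n=${n%.ppm}                        # drop the extension
        case $n in
            ''|*[!0-9]*) continue ;;       # skip names without a plain number
        esac
        # Strip leading zeros so printf does not parse the number as octal.
        while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
            n=${n#0}
        done
        new=$(printf '%s/page-%02d.ppm' "$dir" "$n")
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

For example, normalize_pages . page renames the files in the current directory. Pages numbered 100 and above keep three digits, which still matches %02d.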

Step 1: Deskew the pages

Run unpaper to correct page rotation and remove noise at the edges of each page. Pages fed through an ADF are often rotated by a surprising amount.

unpaper -v -ni 2 -ms 100,100 -dn bottom,right --no-border-align --no-grayfilter page-%02d.ppm fixed-%03d.ppm

Options:

  • -ni 2: Dot your i's. The default noise-filter intensity of 4 deletes clusters of up to 4 lonely pixels, which may remove the dots from all the 'i' characters.
  • -ms 100,100: Try not to truncate headers and footers where text extends beyond the rest of the body text.
  • -dn bottom,right: Detect skew angles from the right-hand side and the bottom. The values given here should depend on the scanned content. Choose the side(s) where the content starts at a clearly defined straight edge.
  • --no-border-align: Don't reposition content to the centre of the image. Without this option, pages with content only at the top of the page (e.g. a single paragraph) can have that content moved to the centre.
  • --no-grayfilter: Don't remove large blocks of gray, such as those used for table headings.
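Because the best -dn value depends on the scanned content, it can help to try several directions on one sample page and compare the results by eye. The following is a minimal sketch, assuming the other options stay as above; try_directions and the try-*.ppm output names are hypothetical. By default it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# try_directions SAMPLE.ppm
# Hypothetical helper, not part of unpaper: emit one unpaper command per
# candidate -dn value for a single sample page. Commands are printed
# unless RUN=1 is set in the environment, in which case they are run.
try_directions() {
    sample=$1
    for dn in left top right bottom bottom,right; do
        # e.g. try-bottom_right.ppm (the naming is an assumption)
        out="try-$(printf '%s' "$dn" | tr ',' '_').ppm"
        if [ "${RUN:-0}" = 1 ]; then
            unpaper -ni 2 -ms 100,100 -dn "$dn" \
                --no-border-align --no-grayfilter "$sample" "$out"
        else
            echo unpaper -ni 2 -ms 100,100 -dn "$dn" \
                --no-border-align --no-grayfilter "$sample" "$out"
        fi
    done
}
```

For example, try_directions page-01.ppm lists five commands, and RUN=1 try_directions page-01.ppm writes five candidate images to compare.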

Step 2: Recombine images into PDF

The ImageMagick convert utility can be used to re-create a PDF document from a set of images.

convert fixed-0*.ppm -colorspace Gray -unsharp 0.5x0.5+0.5+0.008 -page A4 -quality 100 "Output.pdf"
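On ImageMagick 7 the same operation is normally invoked through the magick command, with convert kept as a legacy name. The following is a minimal sketch of a wrapper that uses whichever is installed and passes the same options as the command above; make_pdf is a hypothetical name.

```shell
#!/bin/sh
# make_pdf OUTPUT.pdf PAGE...
# Hypothetical wrapper: combine the given page images into one PDF using
# "magick" when available and falling back to "convert" otherwise.
make_pdf() {
    out=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "usage: make_pdf OUTPUT.pdf PAGE..." >&2
        return 2
    fi
    if command -v magick >/dev/null 2>&1; then
        im=magick
    elif command -v convert >/dev/null 2>&1; then
        im=convert
    else
        echo "make_pdf: ImageMagick not found" >&2
        return 127
    fi
    # Same options as the convert command in this step.
    "$im" "$@" -colorspace Gray -unsharp 0.5x0.5+0.5+0.008 \
        -page A4 -quality 100 "$out"
}
```

For example, make_pdf Output.pdf fixed-*.ppm. The shell expands the glob in lexical order, which matches page order as long as the numbers are zero-padded.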