Digitize books: Searchable OCR PDF with text overlay from scanned or photographed books on Linux
This blog post was published 9 years ago and may or may not have aged well. While reading
please keep in mind that it may no longer be accurate or even relevant.
Here is my method to digitize books.
It is a tutorial about how to produce searchable, OCR (Optical Character Recognition) PDFs from a hardcopy book using free software tools on Linux distributions.
You probably can find more convenient proprietary software, but that’s not the objective of this post.
Digitize books
To scan a book, you basically have 2 choices:
Scan each double page with a flatbed scanner
Take a good photo camera, mount it on a tripod, have it point vertically down on the book, and then take photos of each double page. Professional digitizers use this method due to less strain on the originals.
No matter which method, the accuracy of OCR increases with the resolution and contrast of the images. The resolution should be high enough so that each letter is at least 25 pixels tall.
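As a rough sketch, here is what the 25-pixel rule of thumb means in numbers. The letter height of 2 mm and the double-page size of 420 x 297 mm (two A4 pages side by side) are assumptions for illustration only; measure your own book. The function names are made up for this example.

```python
import math

MM_PER_INCH = 25.4


def pixels_per_mm(letter_height_mm, min_letter_px=25):
    """Pixels per millimetre needed so that a letter of the given
    physical height spans at least min_letter_px pixels."""
    return min_letter_px / letter_height_mm


def required_ppi(letter_height_mm, min_letter_px=25):
    """The same requirement expressed in pixels per inch (scanner setting)."""
    return pixels_per_mm(letter_height_mm, min_letter_px) * MM_PER_INCH


def required_image_size(width_mm, height_mm, letter_height_mm, min_letter_px=25):
    """Minimum image size in pixels for an area of width_mm x height_mm."""
    scale = pixels_per_mm(letter_height_mm, min_letter_px)
    return math.ceil(width_mm * scale), math.ceil(height_mm * scale)


if __name__ == "__main__":
    letter_mm = 2.0                  # assumed letter height, measure your book
    spread_mm = (420.0, 297.0)       # assumed double page: two A4 pages

    width_px, height_px = required_image_size(*spread_mm, letter_mm)
    print(f"scanner setting: at least {required_ppi(letter_mm):.0f} ppi")
    print(f"camera image: at least {width_px} x {height_px} pixels "
          f"({width_px * height_px / 1e6:.1f} megapixels)")
```

With these assumed values, the result is about 318 ppi for a scanner, or roughly 5250 x 3713 pixels (about 19.5 megapixels) for a photo of the whole double page. Smaller print raises the requirement accordingly.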
Since taking a photo is almost instant, you can be much faster with the photographing method than with a flatbed scanner. This is especially true for voluminous books, which are hard to repeatedly take on and off a scanner. However, getting sharp high-resolution images with a camera is more difficult than with a flatbed scanner. So it’s a tradeoff that depends on your situation, your equipment and your skills.
Using a flatbed scanner doesn’t need explanation, so I’ll only explain the photographic method next.
Photographing each page
If you use a camera, and you don’t have some kind of remote trigger or interval-trigger at hand, you would need 2 people: someone who operates the camera, and another one who flips the pages. You can easily scan 1 double page every 2 seconds once you get more skilled in the process.
Here are the steps:
Set the camera on a tripod and have it point vertically down. The distance between camera and book should be at least 1 meter to approximate orthogonal projection (this imitates a flatbed scanner). Too much perspective projection would skew the text lines.
Place the book directly under the camera. Avoid pointing the camera at any angle other than 90 degrees to the page, because that causes perspective skewing of the contents. Later we will unskew the images, but the less skewing you get at this point, the better.
Set up uniform lighting, as bright as you are able. Optimize lighting directions to minimize possible shadows (especially in the book fold). Don’t place the lights near the camera or it will cause reflections on paper or ink.
Set the camera to manual mode. Use JPG format. Turn the camera flash off. All pictures need to have uniform exposure characteristics to make later digital processing easier.
Maximize zoom so that a margin of about 1 cm around the book is still visible. This way, aligning of the book will take less time. The margin will be cropped later.
Once zoom and camera position are finalized, mark the position of the book on the table with tape. After moving the book, place it back onto the original position with the help of these marks.
Take test pictures. Inspect and optimize the results by finding a balance between the following camera parameters:
Minimize aperture size (high f-number) to increase the depth of field and get sharper images.
Increase the ISO value to shorten the exposure time, so that camera shake has less of an effect. Bright lighting lets you lower the ISO, which reduces noise.
Maximize resolution so that the letter size in the photos is at least 25 pixels tall. This will be important to increase the quality of the OCR step below, and you’ll need a good camera for this.
Take one picture of each double page.
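To put rough numbers on two of the trade-offs in the steps above, here is a minimal sketch. It assumes constant lighting, the usual rule that the required exposure time grows with the square of the f-number and shrinks in proportion to the ISO value, and a simple pinhole camera model. The reference exposure (f/4, ISO 100, 1/30 s) and the 20 mm page bulge near the fold are invented example values, and the function names are made up for this example.

```python
def equivalent_shutter_time(ref_time_s, ref_f_number, ref_iso, f_number, iso):
    """Exposure time that gives the same image brightness as a known-good
    reference exposure, under the same lighting."""
    return ref_time_s * (f_number / ref_f_number) ** 2 * (ref_iso / iso)


def relative_scale_error(camera_distance_mm, page_rise_mm):
    """How much larger a part of the page appears when it bulges
    page_rise_mm towards the camera (pinhole model). 0.02 means 2 %."""
    if not 0 <= page_rise_mm < camera_distance_mm:
        raise ValueError("page rise must be between 0 and the camera distance")
    return camera_distance_mm / (camera_distance_mm - page_rise_mm) - 1.0


if __name__ == "__main__":
    # Assumed reference: a test picture at f/4, ISO 100, 1/30 s looked fine.
    ref = dict(ref_time_s=1 / 30, ref_f_number=4.0, ref_iso=100)

    for f_number, iso in [(8.0, 100), (8.0, 400), (11.0, 800)]:
        t = equivalent_shutter_time(f_number=f_number, iso=iso, **ref)
        print(f"f/{f_number:g}, ISO {iso}: about {t:.3f} s")

    # Assumed: the pages bulge 20 mm above the table near the fold.
    for distance_mm in (500, 1000, 1500):
        err = relative_scale_error(distance_mm, 20)
        print(f"camera at {distance_mm} mm: text scale varies by {err:.1%}")
```

Stopping down from f/4 to f/8 quadruples the exposure time, and raising the ISO from 100 to 400 brings it back to the reference value. In this model, doubling the camera distance roughly halves the scale variation across a bulging page, which is one way to see why a larger distance gets closer to what a flatbed scanner produces.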
If you found a mistake in this blog post, or would like to suggest an improvement to this blog post,
please send me an e-mail to michael@franzl.name; as subject
please use the prefix "Comment to blog post" and append the post title.