Convert a scanned PDF to text on the Linux command line using OCRmyPDF

Sep 26, 2019

OCRmyPDF is a free utility that makes the text in a scanned PDF usable through OCR (optical character recognition). More precisely, it adds an OCR text layer to scanned PDF files while keeping the original page images, so the documents can be searched and their text copied and pasted.

Main features

Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy / paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a “lossless” operation without disrupting any other content
Optimizes PDF images, often producing files smaller than the input file
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Distributes work across all available CPU cores
Uses Tesseract OCR engine to recognize more than 100 languages
Scales properly to handle files with thousands of pages
Battle-tested on millions of PDFs
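
Several of the features above are switched on through command-line options. The sketch below wraps three of them in a small shell function: --deskew, --clean and -l (language selection). The function name ocr_clean and its argument order are made up for this example, and --clean assumes that the unpaper program is installed.

```shell
# Hypothetical helper (not part of OCRmyPDF): OCR one file with cleanup enabled.
# Usage: ocr_clean INPUT.pdf OUTPUT.pdf [LANGUAGES]
ocr_clean() {
    in=$1
    out=$2
    langs=${3:-eng}    # Tesseract language codes joined with "+", e.g. eng+deu
    # --deskew straightens crooked scans; --clean runs unpaper before OCR
    ocrmypdf --deskew --clean -l "$langs" "$in" "$out"
}
```

For example, ocr_clean scan.pdf scan_ocr.pdf eng+deu would process a document that mixes English and German, provided both Tesseract language packs are installed.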

Installation

Linux, UNIX, and macOS are supported. Windows is not directly supported, but a Docker image is available that runs on Windows.

Users of Debian 9 or later or Ubuntu 16.10 or later may simply

apt-get install ocrmypdf

and users of Fedora 29 or later may simply

dnf install ocrmypdf

and macOS users with Homebrew may simply

brew install ocrmypdf
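
After installing, it can be worth checking that the programs are actually on your PATH. Here is a minimal sketch, assuming a POSIX shell; the function name check_ocr_tools is made up for this example.

```shell
# Hypothetical helper: report whether ocrmypdf and tesseract are on PATH.
# Returns 0 when both are found, 1 otherwise.
check_ocr_tools() {
    missing=0
    for tool in ocrmypdf tesseract; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Once both are found, ocrmypdf --version prints the installed version and tesseract --list-langs shows which language packs are available.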

Usage

Quickly convert a single PDF document

ocrmypdf input.pdf output.pdf
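
The command above produces a searchable PDF, not a plain-text file. If you also want the recognized text on its own, the --sidecar option writes it to a separate file. The sketch below is one way to use it; the function name ocr_to_text and the output naming scheme are assumptions made for this example.

```shell
# Hypothetical helper: OCR a PDF and also save the recognized text as .txt.
# Usage: ocr_to_text scan.pdf   (writes scan_ocr.pdf and scan.txt)
ocr_to_text() {
    in=$1
    base=${in%.pdf}    # strip the .pdf extension
    # --sidecar FILE writes the OCR text to FILE alongside the output PDF
    ocrmypdf --sidecar "$base.txt" "$in" "${base}_ocr.pdf"
}
```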

Convert all PDF files in a folder

for f in ./*.pdf; do ocrmypdf "$f" "$(basename "$f" ".pdf")_ocr.pdf"; done
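
If you run that loop a second time, it also picks up the *_ocr.pdf files it created earlier. A slightly more defensive variant might look like the sketch below. The function name ocr_dir is made up for this example, and unlike the loop above it writes each result next to its input file instead of into the current directory.

```shell
# Hypothetical helper: OCR every PDF in a directory, skipping earlier results.
# Usage: ocr_dir [DIRECTORY]   (defaults to the current directory)
ocr_dir() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                  # the glob matched nothing
        case $f in *_ocr.pdf) continue ;; esac   # this is already an output file
        out="${f%.pdf}_ocr.pdf"
        [ -e "$out" ] && continue                # converted on a previous run
        # keep going when one file fails, but report it
        ocrmypdf "$f" "$out" || echo "failed: $f" >&2
    done
}
```

OCRmyPDF stops with an error when a PDF already contains text. Adding --skip-text to the ocrmypdf call leaves such pages as they are, and --force-ocr rasterizes them and runs OCR again.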

References

jbarlow83/OCRmyPDF (github.com)

OCRmyPDF documentation - ocrmypdf 9.0.3.post6+g68c852a documentation (ocrmypdf.readthedocs.io)