Convert a scanned PDF to text on the Linux command line using OCRmyPDF
Sep 26, 2019
OCRmyPDF is a free utility that converts a scanned PDF to text using OCR (optical character recognition). More precisely, OCRmyPDF adds an OCR text layer to a scanned PDF while keeping the original page images, so the file can be searched and its text copied and pasted.
Main features
- Generates a searchable PDF/A file from a regular PDF
- Places OCR text accurately below the image to ease copy / paste
- Keeps the exact resolution of the original embedded images
- When possible, inserts OCR information as a “lossless” operation without disrupting any other content
- Optimizes PDF images, often producing files smaller than the input file
- If requested, deskews and/or cleans the image before performing OCR
- Validates input and output files
- Distributes work across all available CPU cores
- Uses the Tesseract OCR engine to recognize more than 100 languages
- Scales properly to handle files with thousands of pages
- Battle-tested on millions of PDFs
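As a rough illustration of how a few of these features are switched on, the sketch below wraps a single ocrmypdf call in a shell function. The function name ocr_clean_scan and the chosen values (English, two jobs) are assumptions made for this example, and --clean relies on the separate unpaper program being installed.

```shell
# Hypothetical helper: OCR one scan with deskewing and cleaning enabled.
# $1 = input PDF, $2 = output PDF
ocr_clean_scan() {
  # --deskew    straighten pages that were scanned at an angle
  # --clean     clean the page image before OCR (requires unpaper)
  # --language  Tesseract language code; the language pack must be installed
  # --jobs      cap the number of CPU cores used (default: all of them)
  ocrmypdf --deskew --clean --language eng --jobs 2 "$1" "$2"
}
```

It could then be called as ocr_clean_scan scan.pdf scan_ocr.pdf, where both file names are placeholders.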
Installation
Linux, UNIX, and macOS are supported. Windows is not directly supported but there is a Docker image available that runs on Windows.
Users of Debian 9 or later or Ubuntu 16.10 or later may simply run (with root privileges)
apt-get install ocrmypdf
and users of Fedora 29 or later may simply run (with root privileges)
dnf install ocrmypdf
and macOS users with Homebrew may simply run
brew install ocrmypdf
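After installing, it can be worth checking that both ocrmypdf and the Tesseract engine it depends on are on the PATH. The following is a minimal sketch; the function name check_ocr_tools is made up for this example.

```shell
# Hypothetical helper: report whether the required programs can be found.
check_ocr_tools() {
  for tool in ocrmypdf tesseract; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}

check_ocr_tools
# If both are found, these show the installed version and language packs:
#   ocrmypdf --version
#   tesseract --list-langs
```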
Usage
Quickly convert a single PDF document
ocrmypdf input.pdf output.pdf
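The command above produces a searchable PDF. If a plain text file is wanted as well, ocrmypdf can write the recognized text to a "sidecar" file. The sketch below assumes that is the goal; the function name ocr_with_text and the output naming scheme are illustrative choices, not something the tool prescribes. It also passes --skip-text, because by default ocrmypdf stops with an error when a page already contains text.

```shell
# Hypothetical helper: create <name>_ocr.pdf and <name>.txt in the current directory.
# $1 = input PDF
ocr_with_text() {
  src=$1
  base=$(basename "$src" .pdf)
  # --skip-text  leave pages that already have a text layer untouched
  # --sidecar    also write the recognized text to a plain text file
  ocrmypdf --skip-text --sidecar "${base}.txt" "$src" "${base}_ocr.pdf"
}
```

For example, ocr_with_text input.pdf would be expected to leave input_ocr.pdf and input.txt in the current directory.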
Convert all PDF files in a folder
for f in ./*.pdf; do ocrmypdf "$f" "$(basename "$f" ".pdf")_ocr.pdf"; done
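One thing to watch with the loop above: running it a second time also picks up the *_ocr.pdf files created by the first run. A slightly more defensive variant might look like the sketch below. The function name ocr_folder is invented for this example, and unlike the one-liner it writes each result next to its source file rather than into the current directory.

```shell
# Hypothetical helper: OCR every PDF in a folder, skipping earlier results.
# $1 = folder with scanned PDFs (defaults to the current directory)
ocr_folder() {
  dir=${1:-.}
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue        # no PDFs at all: the glob stays literal
    case "$f" in
      *_ocr.pdf) continue ;;       # output of a previous run, not a scan
    esac
    out="$dir/$(basename "$f" .pdf)_ocr.pdf"
    [ -e "$out" ] && continue      # already converted, do not redo it
    # Keep going if one file fails, but report it on stderr.
    ocrmypdf "$f" "$out" || echo "failed: $f" >&2
  done
}
```

Calling ocr_folder ~/scans (a placeholder path) would process that folder; calling it again should only touch files added since the last run.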