Convert a scanned PDF to text on the Linux command line using OCRmyPDF

OCRmyPDF is a free utility that makes the text in a scanned PDF usable through OCR (optical character recognition). Rather than producing a separate plain-text file, it adds an OCR text layer to the scanned PDF while keeping the original page images, so the document can be searched and its text copied and pasted.

Main features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a “lossless” operation without disrupting any other content
  • Optimizes PDF images, often producing files smaller than the input file
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Distributes work across all available CPU cores
  • Uses Tesseract OCR engine to recognize more than 100 languages
  • Scales properly to handle files with thousands of pages
  • Battle-tested on millions of PDFs

Installation

Linux, UNIX, and macOS are supported. Windows is not directly supported, but a Docker image is available that runs on Windows.

Users of Debian 9 or later or Ubuntu 16.10 or later can install it with:

apt-get install ocrmypdf

Users of Fedora 29 or later can use:

dnf install ocrmypdf

macOS users with Homebrew can use:

brew install ocrmypdf
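
After installing, it can help to confirm that both OCRmyPDF and the Tesseract engine it relies on are on the PATH. The snippet below is only a minimal sketch: check_ocr_tools is a hypothetical helper name, not something OCRmyPDF provides.

```shell
# check_ocr_tools is a hypothetical helper, not part of OCRmyPDF.
# It reports whether the ocrmypdf and tesseract commands can be found.
check_ocr_tools() {
    status=0
    for tool in ocrmypdf tesseract; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool" >&2
            status=1
        fi
    done
    # Return non-zero if at least one tool was missing.
    return "$status"
}
```

If both tools are found, running ocrmypdf --version and tesseract --list-langs shows the installed version and the language packs available for recognition.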

Usage

Quickly convert a single PDF document

ocrmypdf input.pdf output.pdf
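
The deskewing and language features listed above are switched on with command-line options. As a rough sketch, the wrapper below combines a few of them; ocr_scan is a hypothetical function name used only for illustration, and the default language "eng" is an assumption. Any language other than English needs the matching Tesseract language pack installed.

```shell
# ocr_scan is a hypothetical helper, not part of OCRmyPDF.
# Usage: ocr_scan INPUT.pdf OUTPUT.pdf [LANGS]
ocr_scan() {
    in=$1
    out=$2
    langs=${3:-eng}   # Tesseract language code(s), e.g. "eng" or "eng+deu"

    # --deskew     straighten pages that were scanned at a slight angle
    # --skip-text  leave pages that already contain text untouched
    # -l           language(s) Tesseract should recognize
    ocrmypdf --deskew --skip-text -l "$langs" "$in" "$out"
}
```

For example, ocr_scan letter.pdf letter_ocr.pdf eng+deu would process a scan that mixes English and German text.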

Convert all PDF files in a folder

for f in ./*.pdf; do ocrmypdf "$f" "$(basename "$f" ".pdf")_ocr.pdf"; done
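
One thing to watch with the loop above: if it is run a second time in the same folder, it also picks up the *_ocr.pdf files from the first run. A slightly more defensive variant might look like the sketch below. The function name ocr_folder and the _ocr suffix convention are assumptions for illustration.

```shell
# ocr_folder is a hypothetical helper, not part of OCRmyPDF.
# Usage: ocr_folder [DIR]   (defaults to the current directory)
ocr_folder() {
    dir=${1:-.}
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue        # the glob matched nothing
        case $f in
            *_ocr.pdf) continue ;;     # output of an earlier run, skip it
        esac
        out="${f%.pdf}_ocr.pdf"
        [ -e "$out" ] && continue      # this file was already converted
        # Keep going if one file fails, but report it on stderr.
        ocrmypdf "$f" "$out" || echo "failed: $f" >&2
    done
}
```

Calling ocr_folder ~/scans would then write each result next to its source file and leave existing results alone.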

Chi Thuc Nguyen
