How to OCR a PDF file and get the text stored within the PDF?

First, apologies if this has been asked before – I searched through the existing posts for a while, but could not find an answer.

I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?

This seems to describe a solution – but unfortunately I am already lost when retrieving exact-image.

Answers:


Method 1

ocrmypdf does a good job and can be used like this:

ocrmypdf in.pdf out.pdf

To install:

pip install ocrmypdf

or

sudo apt install ocrmypdf     # ubuntu
sudo dnf -y install ocrmypdf  # fedora
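
For scans in another language, or pages that were scanned slightly crooked, a small wrapper like the one below may help. It is only a sketch: the function name ocr_pdf and the _ocr.pdf output naming are my own choices, not part of ocrmypdf, and it assumes the Tesseract language data for the chosen language is installed.

```shell
# Hypothetical wrapper: ocr_pdf IN.pdf [LANG] writes IN_ocr.pdf beside the input.
ocr_pdf() {
    local input="$1"
    local lang="${2:-eng}"
    local output="${input%.pdf}_ocr.pdf"

    # -l selects the Tesseract language; --deskew straightens crooked pages.
    # The output path is printed only if ocrmypdf reports success.
    ocrmypdf -l "$lang" --deskew "$input" "$output" && printf '%s\n' "$output"
}
```

Usage would look like `ocr_pdf scan.pdf deu`, which should leave the original untouched and write scan_ocr.pdf next to it.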

Method 2

After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/

After installing the dependencies (this might not be the complete list):

sudo dnf install svn ocaml unpaper tesseract

I followed the script’s guide for compiling from source

Compile from sources

pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:

svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich

If OCaml is installed on your system, you can compile and install as follows:

cd pdfsandwich
./configure
make
sudo make install

and this now allows me to run

pdfsandwich multipaged-non-searchable.pdf

resulting in a searchable PDF.

Here is a list of repositories (e.g., Debian Stable, AUR, Homebrew) containing pdfsandwich.
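
Whichever tool produced the output, it is worth checking that the result really contains a text layer. A minimal sketch, assuming pdftotext from poppler-utils is installed; the function names has_text and pdf_is_searchable are made up for this example:

```shell
# Succeeds if the text read from stdin contains at least one letter or digit.
has_text() {
    grep -q '[[:alnum:]]'
}

# Succeeds if pdftotext can extract any text from the given PDF.
# "-" tells pdftotext to write the extracted text to stdout.
pdf_is_searchable() {
    pdftotext "$1" - | has_text
}
```

For example, `pdf_is_searchable out.pdf && echo searchable`. To actually get the text out of the PDF, as asked in the title, `pdftotext out.pdf out.txt` writes it to a plain text file.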

Method 3

An easy tool available in Ubuntu is ‘ocrfeeder’. It generates PDFs with the OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and also provides image rotation, ‘unpaper’ cleanup, etc.

Method 4

I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.

Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF

Instructions to install & use pdf2searchablepdf:

Tested on Ubuntu 18.04 on 11 Nov. 2019 and on Ubuntu 20.04 in Nov. 2020.

Install:

git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh

sudo apt update
sudo apt install tesseract-ocr

Use:

# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]

# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf

# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs

You’ll now have a pdf called mypdf_searchable.pdf, which contains searchable text!

Done. It has no Python dependencies, as it’s currently written entirely in bash.

See pdf2searchablepdf -h for the help menu and more options and examples.
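
For anyone curious how the pdftoppm-then-tesseract approach described above fits together, here is a rough sketch. It is not the actual pdf2searchablepdf source; the function name make_searchable, the 300 DPI setting, and the pg- file prefix are assumptions chosen for illustration.

```shell
# Sketch only: make_searchable IN.pdf [LANG] writes IN_searchable.pdf.
make_searchable() {
    local input="$1"
    local lang="${2:-eng}"
    local out_base="${input%.pdf}_searchable"
    local tmp status

    tmp="$(mktemp -d)" || return 1

    # 1. Rasterize every page to a TIFF file named pg-<page number>.tif.
    if ! pdftoppm -tiff -r 300 "$input" "$tmp/pg"; then
        rm -rf "$tmp"
        return 1
    fi

    # 2. Write the page images to a list file, one path per line.
    #    The glob expands in sorted order.
    printf '%s\n' "$tmp"/pg-*.tif > "$tmp/pages.txt"

    # 3. OCR all listed images into a single PDF with a text layer.
    #    tesseract adds the ".pdf" extension to the output base itself.
    tesseract "$tmp/pages.txt" "$out_base" -l "$lang" pdf
    status=$?

    # 4. Delete the intermediate files, then report tesseract's exit status.
    rm -rf "$tmp"
    return "$status"
}
```

Calling `make_searchable mypdf.pdf` should then produce mypdf_searchable.pdf. Handing tesseract a list file rather than one image at a time is what yields a single multi-page PDF instead of one PDF per page.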

References or Related Resources:

  1. PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
  2. https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
  3. https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
  4. https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
  5. pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
