First, apologies if this has been asked before – I searched for a while through the existing posts, but could not find support.
I am interested in a solution for Fedora to OCR a multipage non-searchable PDF and to turn this PDF into a new PDF file that contains the text layer on top of the image. On Mac OSX or Windows we could use Adobe Acrobat, but is there a solution on Linux, specifically on Fedora?
This seems to describe a solution – but unfortunately I am already lost when retrieving exact-image.
Answers:
Method 1
ocrmypdf does a good job and can be used like this:
ocrmypdf in.pdf out.pdf
To install:
pip install ocrmypdf
or
sudo apt install ocrmypdf     # ubuntu
sudo dnf -y install ocrmypdf  # fedora
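For more than one file, the ocrmypdf call above can be wrapped in a small loop. The following is a minimal sketch, not part of the original answer: the function names ocr_out_name and ocr_dir and the _ocr suffix are made up for illustration, and -l eng assumes the English Tesseract language data is installed. The --skip-text option asks ocrmypdf to leave pages that already have a text layer alone.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive an output name, e.g. scan.pdf -> scan_ocr.pdf
ocr_out_name() {
    local src="$1"
    printf '%s_ocr.pdf\n' "${src%.pdf}"
}

# Hypothetical helper: OCR every PDF in a directory (default: current one).
ocr_dir() {
    local dir="${1:-.}" f
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                      # glob matched nothing
        case "$f" in *_ocr.pdf) continue ;; esac     # skip earlier results
        ocrmypdf -l eng --skip-text "$f" "$(ocr_out_name "$f")"
    done
}

# Usage (assumes ocrmypdf is installed):
#   ocr_dir ~/scans
```

Writing to a separate file per input keeps the original scans untouched, so a bad OCR run can simply be deleted and repeated.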
Method 2
After learning that Tesseract can now also produce searchable PDFs, I found the script pdfsandwich: http://www.tobias-elze.de/pdfsandwich/
After installing the dependencies (this might not be the complete list)
sudo dnf install subversion ocaml unpaper tesseract
I followed the script’s guide for compiling from source
Compile from sources
pdfsandwich is open source software (license: GPL). You can download the sources either as .tar.bz2 package from the download area on the project website or check them out by subversion:
svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich
If OCaml is installed on your system, you can compile and install as follows:
cd pdfsandwich
./configure
make
sudo make install
and this now allows me to run
pdfsandwich multipaged-non-searchable.pdf
resulting in a searchable PDF.
Here is a list of repositories (e.g., Debian Stable, AUR, Homebrew) containing pdfsandwich.
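Whichever tool produced the file, one way to check that the result really is searchable is to extract its text layer with pdftotext (part of poppler-utils) and count the words. This is only a rough sketch: the function names count_words and has_text_layer are hypothetical, and a non-zero word count says nothing about how accurate the OCR is.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count whitespace-separated words read from stdin.
count_words() {
    wc -w | tr -d '[:space:]'
}

# Hypothetical helper: succeed if the PDF contains any extractable text.
has_text_layer() {
    local n
    n=$(pdftotext "$1" - 2>/dev/null | count_words)
    [ "${n:-0}" -gt 0 ]
}

# Usage (assumes pdftotext is installed):
#   has_text_layer out.pdf && echo "searchable" || echo "no text layer found"
```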
Method 3
An easy tool available in Ubuntu is ‘ocrfeeder’. It generates PDFs with the OCR text overlaid on the original documents. It makes use of Tesseract plus other OCR engines (not sure which) and provides image rotation, ‘unpaper’ cleanup, etc., as well.
Method 4
I had this same problem so I wrote this over the weekend. Give it a shot; it works great! It is a simple wrapper around tesseract. It uses pdftoppm to convert a PDF into a bunch of TIFF files, then it uses tesseract to perform OCR (Optical Character Recognition) on them and produce a searchable PDF as output. All intermediate temporary files are automatically deleted when the script completes.
Source code: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
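The pipeline described above (pdftoppm to TIFF images, then tesseract to a searchable PDF) can be sketched in a few lines of bash. This is an illustration only and is not the actual pdf2searchablepdf source: the function names, the 300 DPI resolution, and the default language eng are assumptions. It relies on tesseract accepting a text file that lists one image path per line as its input.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mypdf.pdf -> mypdf_searchable (tesseract appends .pdf)
searchable_base() {
    local src="$1"
    printf '%s_searchable\n' "${src%.pdf}"
}

# Hypothetical sketch of the pdftoppm -> tesseract pipeline.
pdf_to_searchable() {
    local src="$1" lang="${2:-eng}" tmp status
    tmp=$(mktemp -d) || return 1

    # 1. Render each page to a TIFF image (300 DPI is an assumed setting).
    pdftoppm -tiff -r 300 "$src" "$tmp/pg" || { rm -rf "$tmp"; return 1; }

    # 2. List the page images, one path per line. pdftoppm zero-pads the
    #    page numbers, so the sorted glob is already in page order.
    printf '%s\n' "$tmp"/pg-* > "$tmp/pages.txt"

    # 3. OCR all pages into a single PDF with a text layer.
    tesseract "$tmp/pages.txt" "$(searchable_base "$src")" -l "$lang" pdf
    status=$?

    # 4. Remove the intermediate images.
    rm -rf "$tmp"
    return "$status"
}

# Usage (assumes pdftoppm and tesseract are installed):
#   pdf_to_searchable mypdf.pdf        # writes mypdf_searchable.pdf
```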
Instructions to install & use pdf2searchablepdf:
Tested on Ubuntu 18.04 on 11 Nov. 2019 and on Ubuntu 20.04 in Nov. 2020.
Install:
git clone https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.git
./PDF2SearchablePDF/install.sh
sudo apt update
sudo apt install tesseract-ocr
Use:
# General:
pdf2searchablepdf [options] <input.pdf|dir_of_imgs> [lang]
# Make a PDF searchable:
pdf2searchablepdf mypdf.pdf
# Make an entire directory of images into a single searchable PDF:
pdf2searchablepdf directory_of_imgs
You’ll now have a pdf called mypdf_searchable.pdf, which contains searchable text!
Done. It has no python dependencies, as it’s currently written entirely in bash.
See pdf2searchablepdf -h for the help menu and more options and examples.
References or Related Resources:
- PDF2SearchablePDF: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF
- https://askubuntu.com/questions/473843/how-to-turn-a-pdf-into-a-text-searchable-pdf/1187881#1187881
- https://askubuntu.com/questions/16268/whats-the-best-simplest-ocr-solution
- https://askubuntu.com/questions/150100/extracting-embedded-images-from-a-pdf/1187844#1187844
- pdfsandwich: Alternative software wrapper I just discovered, that is worth checking out too! http://www.tobias-elze.de/pdfsandwich/
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.