How to view and edit the code of a PDF file

I was wondering how to view and edit the code of a PDF file?

  1. By viewing, I mean that I don’t want to see the raw binary format, so I think hexdump may not be what I want. I tried gedit,
    but none of its encodings could decode the PDF content.
  2. By editing, I mean that I would like to search for /Fit and change each occurrence to
    /XYZ using, for example, sed. But my command sed 's/\/Fit/\/XYZ/' < 1.pdf > 2.pdf does not seem to change the appearance of my PDF as I expected,
    although it doesn’t report any error. I was wondering if sed can
    actually work on PDF files as if they were plain text?

The context of my questions can be found from this question. My OS is Ubuntu 10.10.

Answers:

Method 1

Regarding your 1st question (“viewing source code, but no binary”): there are a few options for de-compressing the internal binary streams that are attached to many objects.

My favorite tool for this is QPDF, available on all major OS platforms. The following command de-compresses all streams and all object streams:

 qpdf --qdf --object-streams=disable orig.pdf expanded.pdf

Now you can open your PDF in any text editor. (There may still be some binary blobs in there, for example font files and ICC profiles, which it wouldn’t make sense for QPDF to expand.)

To re-compress expanded.pdf after editing, you can run:

 qpdf expanded.pdf orig2.pdf

(Be careful when manually editing PDFs! You need to know a lot about their internal syntax in order to do this right. As soon as you add or delete a single byte, PDF readers may report errors or refuse to open the file, because the PDF’s internal ToC, which is based on byte-offset calculations, is then corrupted. Just replacing Fit by XYZ should go fine, though, since both strings have the same length…)
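The expand, edit, re-compress cycle described above could be scripted roughly as follows. This is only a sketch: the function name edit_fit_to_xyz and the temporary file names are made up for illustration, and it assumes that qpdf, perl and mktemp are installed. The replacement keeps the byte count unchanged, which is why no offsets need to be repaired in between.

```shell
# Sketch only: edit_fit_to_xyz is a hypothetical helper name, not part of qpdf.
# Usage: edit_fit_to_xyz input.pdf output.pdf
edit_fit_to_xyz() {
    in=$1
    out=$2
    tmp=$(mktemp -d) || return 1

    # Step 1: expand all streams and object streams so that names such as
    # /Fit show up as plain text.
    qpdf --qdf --object-streams=disable "$in" "$tmp/expanded.pdf" \
        || { rm -rf "$tmp"; return 1; }

    # Step 2: replace /Fit (but not /FitH, /FitV, ...) with /XYZ.
    # Both names have the same length, so byte offsets stay valid.
    perl -pe 's!/Fit\b!/XYZ!g' "$tmp/expanded.pdf" > "$tmp/edited.pdf" \
        || { rm -rf "$tmp"; return 1; }

    # Step 3: re-compress the edited file.
    qpdf "$tmp/edited.pdf" "$out"
    status=$?

    rm -rf "$tmp"
    return $status
}
```

It would be called as edit_fit_to_xyz orig.pdf orig2.pdf. The original file is left untouched, so the result can be compared against it in a PDF viewer.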

Method 2

You can use sed with binary files (at least GNU sed; some implementations may have trouble with files containing null characters or not ending with a newline character). But the command you used only replaces the first occurrence of /Fit on each line, and lines are pretty much meaningless in a PDF file. You need to replace all occurrences:

 sed 's/\/Fit/\/XYZ/g'
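To see what the g flag changes, here is a small test on a made-up line of text (the sample string is purely illustrative, not taken from a real PDF):

```shell
# Hypothetical sample line with three occurrences of /Fit.
line='/Fit /Fit /Fit'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/\/Fit/\/XYZ/')

# With g: every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/\/Fit/\/XYZ/g')

printf '%s\n' "$first_only"   # prints: /XYZ /Fit /Fit
printf '%s\n' "$all"          # prints: /XYZ /XYZ /XYZ
```

Since a PDF can pack many objects between two newline bytes, the version without g is likely to miss most occurrences.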

It would be more robust to replace /Fit only if it’s not followed by a word constituent (e.g. not replacing /Fitness; I don’t know if your file contains occurrences of /Fit that would cause trouble). Here’s one way:

 perl -pe 's!/Fit\b!/XYZ!g'

Method 3

sed is line-oriented, which makes it poorly suited to binary files: those are structured as blocks, not lines.
Try using bbe (bbe-.sourceforge.net) instead.

Alternatively, both Emacs (GNU and XEmacs) and vim open PDF files seamlessly. The result is not pretty-printed of course, as it is a mix of text and binary, but it’s sufficient for your editing purposes.
There is a Pdftk plugin for vim that makes everything easier (distributed as a zip file).
As you probably know, both of the above editors have powerful search-and-replace capabilities.

Also, converting the PDF to QDF mode first makes editing PDF files really easy.

Method 4

Use LibreOffice or OpenOffice to open the PDF, view it, replace things, write a new PDF, etc. I think that you can even use it from the command line or programmatically if there are a lot of documents to process.

Note that PDFs from some sources, e.g. scanners, often contain the pages as images rather than as text, so search and replace will not work on them.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.