Wednesday, October 6, 2010

PDF Document Manipulation

With the advent of mobile reading devices that support full graphic displays, such as the iPad and the soon-to-be-released Android-based tablets, we in the network security space find ourselves wanting to carry our entire reference library of books, ISO standards, RFCs, and diagrams with us everywhere we go.  We do not want to worry about whether we have Internet connectivity, or to remember where every document lives and dig around the Internet for it.

The problem, however, is that a lot of the reference documents that we use are published on-line in text or PDF format and are only available in pieces (chapter by chapter).  This was probably done to save bandwidth and for ease of downloading and viewing.

This type of on-line publishing works well on a full-sized computer that has disk storage, a real file system, and a keyboard/mouse.  It does not work well on mobile readers, which have no user-visible file system for storing hundreds of separate files and may be used in places where you do not have Internet access (on a plane over the Atlantic, or in that basement corner conference room that is just out of wifi range).

We are going to look at how to take on-line PDF documents that are published in parts and turn them into one PDF “book” for ease of use on a mobile reader.  In Example 2 we will also discuss how to fix bookmarks that no longer point to the right place after pages have been added or removed.

Example 1:
We have found a book online that is split into 4 separate chapters, and we would like to turn it into a single PDF “book” for our mobile reader.  Each chapter is a separate download called chapter1.pdf, chapter2.pdf, etc.  The first page of each PDF is a blank page and the second page is a note/summary/title page; we do not want to keep either of them.

Step 1.1:
Download all related files to your computer.  You will need local copies of all the files for the book or standard document in question.  Let's download them to, say, /home/pdf/.

Step 1.2:
Remove any leading or artifact pages that are not relevant to the combined book.  Sometimes you will find that PDF documents have a leading blank page, a leading “summary” page, or a master title page for every chapter.  When this is not desirable, we can easily remove these pages before stitching the chapters together.  (Be mindful of any copyright or usage rules when making changes.)  The tool we will be using is called “pdftk”.

[jordan]:/home/pdf-> pdftk chapter1.pdf cat 3-end output new-chap1.pdf

This command takes the current PDF, “chapter1.pdf”, and creates a new PDF document called “new-chap1.pdf” that skips the first two pages: it starts on page 3 and runs to the end of the document.  Let's do this for the other 3 chapters as well.


[jordan]:/home/pdf-> pdftk chapter2.pdf cat 3-end output new-chap2.pdf
[jordan]:/home/pdf-> pdftk chapter3.pdf cat 3-end output new-chap3.pdf
[jordan]:/home/pdf-> pdftk chapter4.pdf cat 3-end output new-chap4.pdf
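
For a book with many chapters, typing one command per chapter gets tedious.  Here is a minimal sketch of the same step as a shell loop.  It assumes the files follow the chapterN.pdf naming used in this example; "trim_chapters" is just a helper name made up for this post, not something that ships with pdftk.

```shell
#!/bin/sh
# Sketch: drop the first two pages from chapter1.pdf .. chapterN.pdf.
# trim_chapters is a hypothetical helper name, not part of pdftk.
trim_chapters() {
    count="$1"
    i=1
    while [ "$i" -le "$count" ]; do
        # Same pdftk invocation as in Step 1.2, with the chapter number filled in.
        pdftk "chapter$i.pdf" cat 3-end output "new-chap$i.pdf" || return 1
        i=$((i + 1))
    done
}
```

Calling "trim_chapters 4" from /home/pdf would run the four commands shown in this step, stopping at the first chapter that pdftk fails on.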


Step 1.3:
Now that we have removed all of the leading blank pages and title pages, let us combine the 4 chapters into one PDF document called “book.pdf”.


[jordan]:/home/pdf-> pdftk new-chap1.pdf new-chap2.pdf new-chap3.pdf new-chap4.pdf cat output book.pdf
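
Before moving on, it can be worth checking that no pages were lost while stitching.  The sketch below leans on pdftk's "dump_data" operation, which prints a "NumberOfPages:" line for a PDF.  The helper names "pdf_page_count" and "pages_add_up" are made up for this post.

```shell
#!/bin/sh
# Sketch: print the page count that pdftk reports for one PDF.
# pdf_page_count is a hypothetical helper name.
pdf_page_count() {
    pdftk "$1" dump_data | awk '/^NumberOfPages:/ { print $2 }'
}

# Sketch: succeed only if the combined file (first argument) has as many
# pages as all of the parts (remaining arguments) added together.
pages_add_up() {
    combined="$1"
    shift
    total=0
    for part in "$@"; do
        n=$(pdf_page_count "$part")
        [ -n "$n" ] || return 1      # pdftk gave us no page count
        total=$((total + n))
    done
    c=$(pdf_page_count "$combined")
    [ -n "$c" ] && [ "$c" -eq "$total" ]
}
```

For this example you would run "pages_add_up book.pdf new-chap1.pdf new-chap2.pdf new-chap3.pdf new-chap4.pdf" and look at the exit status.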


Step 1.4:
Now that we have a single PDF document with all the chapters stitched together, we can look into adding PDF bookmarks to the file to make jumping around easier on our mobile reader.  This can easily be done with a tool called “jpdfbookmarks” that you can download here: http://sourceforge.net/projects/jpdfbookmarks/.  As of this writing I am using version 2.4.1.  With this tool you can add bookmarks and sub-bookmarks very easily from its graphical interface.  We will create a bookmark for Chapter 1, Chapter 2, etc., and we will also create sub-bookmarks called Chapter 1.1, Chapter 1.2, etc. for all of the sub-elements in each chapter.  Once you are done, save your changes.


Example 2:
We have a PDF document, either created in Example 1 or downloaded from the Internet, and we would like to add a title page or a picture page (a picture of the book's cover) to the front of it.  Ideally, if this is the book from Example 1, we would have done this during Step 1.3 so as not to mess up all of our bookmarks.  If we just use the pdftk command by itself, all of our bookmarks will be off by the number of pages that we insert.  The same method also works if we need to remove a page and want to keep all of our bookmarks.  So what we can do is the following:

Step 2.1:
Create our title page as a PDF document or convert the PNG/JPG picture of the book to a PDF document and call it “cover.pdf”.

Step 2.2:
Let's add “cover.pdf” to the front of the “book.pdf” document.  NOTE: when we do this, all of the bookmarks will be off by the number of pages that we insert, but I will show you how to fix this.  For the sake of explanation, let's assume that “cover.pdf” is only 1 page long.


[jordan]:/home/pdf-> pdftk cover.pdf book.pdf cat output bookwithcover.pdf


This will create a new PDF document called “bookwithcover.pdf” that will have the new title page or picture cover added to the front of the book.  For our example, all of the bookmarks will now be off by one page.

Step 2.3:
First we need to export our current bookmarks so we can fix all of them en masse.  Using the “jpdfbookmarks” tool from Step 1.4, we can “dump”/export all of the current bookmarks for the “bookwithcover.pdf” document to a text file.  This file will look like:


Chapter 1 - BookmarkNameAAA/1,Black,notBold,notItalic,closed,FitPage
    1.1 BookmarkNameBBB/1,Black,notBold,notItalic,open,FitPage
    1.2 BookmarkNameCCC/2,Black,notBold,notItalic,open,FitPage
Chapter 2 - BookmarkNameDDD/3,Black,notBold,notItalic,closed,FitPage
    2.1 BookmarkNameEEE/3,Black,notBold,notItalic,open,FitPage
    2.2 BookmarkNameFFF/4,Black,notBold,notItalic,open,FitPage
etc….


The “BookmarkName???” is just the name that you gave the bookmark when you created it, and it is what shows up in the bookmarks pane in your PDF viewer.  The number that follows the slash “/” after the name is the page in the PDF document that the bookmark points to.
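
To make the format concrete, here is a small sketch that pulls the page number out of one dumped line.  It assumes the dump looks exactly like the sample in this step (name, slash, page, then comma-separated attributes); I have not checked it against other jpdfbookmarks versions.  "bookmark_page" is a helper name made up for this post.

```shell
#!/bin/sh
# Sketch: print the page number from one bookmark line of the dump.
# bookmark_page is a hypothetical helper name.
bookmark_page() {
    # The greedy ".*" means a "/" inside the bookmark name is skipped and
    # only the last "/<digits>," on the line is treated as the page.
    printf '%s\n' "$1" | sed -n 's|^.*/\([0-9][0-9]*\),.*$|\1|p'
}
```

A line without a "/<digits>," part, such as a stray comment, prints nothing.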

Step 2.4:
We can now use a short piece of Perl to go in and fix all of the bookmarks en masse.  This assumes that you saved the bookmarks as “dump.txt”; the changes will be saved to “dump1.txt”.  We will shift every bookmark by one page, since we only added one page.  NOTE: this is written in longhand Perl for readability.


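#!/usr/bin/perl

# Read every line of the exported bookmark file into memory.
$file="dump.txt";
open (DATAIN, $file) or die "Cannot open $file: $!";
@fileline = <DATAIN>;
close (DATAIN);

$file1="dump1.txt";
open (DATAOUT, ">$file1") or die "Cannot open $file1: $!";

foreach (@fileline)
{
    # Capture the page number that sits between the "/" and the next comma.
    if (m/\/(\d+)\,/)
    {
        # This is where we increase it by 1.
        $newpage = $1 + 1;
        s/\/\d+\,/\/$newpage\,/;
    }
    print DATAOUT "$_";
}
close (DATAOUT);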
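
If you would rather not edit the script every time the number of inserted pages changes, the same fix can be sketched as a shell function around awk that takes the offset as an argument.  This is an alternative to the Perl above rather than a replacement for it; it assumes the dump format shown in Step 2.3, and "shift_bookmarks" is a helper name made up for this post.  A negative offset covers the case where pages were removed.

```shell
#!/bin/sh
# Sketch: add an offset to the page number on every bookmark line.
# Reads the dump on standard input, writes the fixed dump to standard output.
# shift_bookmarks is a hypothetical helper name.
shift_bookmarks() {
    awk -v offset="$1" '
        # Find the first "/<digits>," on the line; RSTART points at the "/".
        match($0, /\/[0-9]+,/) {
            page = substr($0, RSTART + 1, RLENGTH - 2) + offset
            # Rebuild the line: text up to "/", new page, then the rest from ",".
            $0 = substr($0, 1, RSTART) page substr($0, RSTART + RLENGTH - 1)
        }
        { print }
    '
}
```

For this example the call would be "shift_bookmarks 1 < dump.txt > dump1.txt".  Lines with no page number pass through unchanged.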


Step 2.5:
Now that we have a file called “dump1.txt” that has all of the corrected bookmarks, we need to import that back into our “bookwithcover.pdf” document.  Let’s use the “jpdfbookmarks” tool once again to do this for us.  Open the PDF document with jpdfbookmarks and then use the “Load” function in the “Tools” menu to load “dump1.txt” which is the new bookmarks.  Save the PDF and you are done.
