January 21, 2011

Extracting a subset of pages from a PDF document

For an upcoming plane trip, I wanted to extract one chapter from an Intel IA-32 Architecture Software Developer’s manual to read on my Kindle. In the past I would have used pdftk for this, but it is an old, unsupported tool that cannot handle the AESV2 encryption used in Intel’s manuals. I then tried Multivalent, which supposedly includes various PDF manipulation tools, but the package seems to have suffered from software rot: the supplied .jar file no longer contains the necessary classes.

Finally, I stumbled on this article in Linux Journal, which shows how to use gs (Ghostscript) to extract pages. This works well enough with Intel’s manuals, though it does change various Microsoft fonts to their more standard equivalents. Here’s a slightly modified version of the article’s script:


# this script takes 3 arguments:
#     $1 is the first page of the range to extract
#     $2 is the last page of the range to extract
#     $3 is the input file
#     output file will be named "inputfile_pXX-pYY.pdf"

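# map the positional arguments to the names used below
first="$1"
last="$2"
input="$3"

# all three arguments are required
if [ -z "$first" ] || [ -z "$last" ] || [ -z "$input" ] ; then
   echo "usage: pdfextract firstpage lastpage inputpdffile" >&2
   exit 1
fi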
gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
       -dFirstPage=$first \
       -dLastPage=$last \
       -sOutputFile="${input%.pdf}_p$first-p$last.pdf" \