pdftk – DSPACE

Windows without admin rights…

Yay, new work computer. Boo, Windows. Boo, no admin rights.

But

Cygwin

Install Cygwin by running the setup program from the CMD prompt:

setup-x86_64.exe -B -d

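Then use the same command (maybe wrapped in a shortcut) to update it, and the whole thing works without admin rights. The -B switch tells setup not to insist on running as Administrator, and -d skips the desktop shortcut. That gives access to all the software within Cygwin, which is a lot, including powerful GUI tools (Inkscape, for example) as well as the command line.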

Java

The OpenJDK zip file can just be unzipped into any folder (I have it in C:\Users\darren\installs\jdk-19.0.2); then you can add its bin folder to your path.

crgrep

crgrep needs Java, so once OpenJDK is in place it can be installed in much the same way, with JAVA_HOME set to point to the folder where you unzipped the JDK.

pdftk

pdftk was a bit trickier, but if you have Java it is easy enough: put the jar file from https://gitlab.com/pdftk-java/pdftk into some folder, then put a small batch file on your path:

java -jar c:\Users\darren\installs\pdftk\pdftk-all.jar %*

%* means ‘all command line args’, so this basically just works like pdftk.
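Inside the Cygwin shell itself, a shell function can do the same job as the batch file. This is only a sketch: PDFTK_JAR and PDFTK_JAVA are names invented for the example, the jar path is the one from the batch file written with forward slashes, and it assumes the Windows java.exe is on the path. Since that Java is a Windows program, relative file names should be fine, but absolute Cygwin paths would probably need converting with cygpath -w first.

```shell
# Hypothetical bash counterpart of the batch file (names are assumptions).
PDFTK_JAR="${PDFTK_JAR:-c:/Users/darren/installs/pdftk/pdftk-all.jar}"
PDFTK_JAVA="${PDFTK_JAVA:-java}"

pdftk() {
  # "$@" hands every argument through unchanged, like %* in the batch file,
  # and keeps arguments that contain spaces in one piece.
  "$PDFTK_JAVA" -jar "$PDFTK_JAR" "$@"
}
```

Dropped into ~/.bashrc, this would let pdftk be typed at the Cygwin prompt exactly as at the Windows one.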

I put the Cygwin /bin folder on the END of my windows path, always on the end so if I am running in Windows nothing gets gazumped. I also go into the Cygwin bin folder and copy find to gfind, so I can use Windows native find and ‘gfind’ for find at the Windows prompt.

Of course I could install a bunch of portable Apps, and I will if I need them.

PDF — make smaller, flatten, make harder to copy

Say you want to reduce the size of a PDF, flatten it and make it harder to extract the text from.

$ pdftk original.pdf output final.pdf compress flatten allow printing owner_pw lorumipsum encrypt_128bit

Huh?

pdftk — this is the program we use
original.pdf — the, uh, original PDF file
output — tells pdftk to write the result to final.pdf
final.pdf — see above
compress — save some space at the expense of making the PDF code less readable
flatten — if there are any forms or other editable fields in the PDF, take that functionality out
allow — after this comes a list of words defining what users are allowed to do with the PDF besides view it
printing — this is what they are allowed to do
owner_pw — the next bit is the password of the PDF’s owner
lorumipsum — the password
encrypt_128bit — specify the encryption strength

Some flags depend on others. For example, if you don’t use owner_pw then you don’t need a password or an encryption strength. Similarly, you cannot restrict users to certain operations without distinguishing them from the owner (which is what owner_pw does) and without encrypting the file, so the allow section makes no sense without those other flags.
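One way to keep those dependencies straight is to let a small function assemble the command, adding the allow, owner_pw and encrypt_128bit words only as a group. A minimal sketch: pdftk_secure_cmd is a made-up name, it only prints the command rather than running it, and it assumes file names and passwords without spaces.

```shell
pdftk_secure_cmd() {
  # $1 = input PDF, $2 = output PDF, $3 = optional owner password
  if [ -n "${3:-}" ]; then
    # A password was given, so the permission and encryption words go in together.
    printf 'pdftk %s output %s compress flatten allow printing owner_pw %s encrypt_128bit\n' "$1" "$2" "$3"
  else
    # No password: leave out allow, owner_pw and the encryption strength.
    printf 'pdftk %s output %s compress flatten\n' "$1" "$2"
  fi
}

pdftk_secure_cmd original.pdf final.pdf lorumipsum
pdftk_secure_cmd original.pdf final.pdf
```

The first call prints the full command from this post; the second prints the plain compress-and-flatten version.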

Wherever you go, there you are.

Processing Word files with internal crosslinks to PDF

The problems:

If you print to PDF, you can get high resolution but no internal crosslinks. If you save as PDF, you keep the crosslinks, but the quality is poor even if you set it to be high in all the available menus.
Sometimes (in the documents I have received), the crosslinks look fine in the Word document, but when you save to PDF, the document gets reformatted. Big chunks of text are moved around.
Empty pages are inserted! (Yes, really!)
Images that look fine in Word are cropped in the PDF.

Clearly, the Word save to PDF option is very buggy. The workarounds are annoying and time-consuming. It is a pity more businesses don’t use LibreOffice, because none of these issues is a problem in LibreOffice. But the client insists on Word. I have tried importing the Word documents into LibreOffice, but the documents are complex and the import is incomplete.

OK, first, the low resolution: We ended up getting around that by dropping in high-resolution images using Acrobat after making the PDFs in Word. You need the full-on Pro version of Acrobat to do that, but at work we have that.

Second, the crosslinks. What happens is that sometimes a crosslink to, say, Figure 1 will grab not only the text ‘Figure 1’ and dump it in where you want to reference the figure, it will also grab some of the surrounding material. This might include the figure itself (so the figure appears wherever you refer to it!) or some of the text before or after the figure. It is impossible to predict. It seems to occur when Word’s built-in numbering tools that automatically number figures etc. are used.

Solution: Remake the bookmarks manually (always manually with Word). If I highlight ‘Figure 1’ and click Insert > Bookmark I can manually make a bookmark that grabs the text ‘Figure 1’. If I then use Insert > Cross-reference to replace the old cross-reference, it just picks up the right bit of text. Note that the old one looked fine in Word; the error showed up when saved to PDF.

Empty pages — yes, random new empty pages appear. Luckily, these documents do not have recto/verso pages nor folios (printed page numbers) so I can just excise the unwanted pages. Say I want to remove page 4:

$ pdftk infile.pdf cat 1-3 5-end output outfile.pdf
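Removing one page amounts to keeping everything before it and everything after it. The hypothetical helper below just builds those cat ranges for a given page number. It does not know how long the document is, so it assumes the page being dropped is not the last one.

```shell
drop_page_ranges() {
  # $1 = number of the page to remove (1-based, assumed not to be the last page)
  case $1 in
    1) echo "2-end" ;;                                # nothing before page 1
    2) echo "1 3-end" ;;                              # only page 1 comes before
    *) echo "1-$(($1 - 1)) $(($1 + 1))-end" ;;        # before and after
  esac
}

# Prints the ranges for dropping page 4. They would be used as:
#   pdftk infile.pdf cat $(drop_page_ranges 4) output outfile.pdf
drop_page_ranges 4
```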

Cropped images: Right-click on the image, select Format Picture and adjust the crop box manually.

Too much badly coded semi-functionality requiring manual fixes.

But we get there

Custom watermark PDF using Inkscape and pdftk

The path to making a watermark and applying it is actually pretty simple. I did it using Cygwin, but these tools are available for plain Windows and of course Linux and Mac as well.

Open Inkscape and make the watermark.
In Inkscape: Filters → Fill and Transparency → Opacity and set it to something suitable; I tried about 0.05 to 0.1.
Save as a PDF, using default settings (make sure Rasterise filter effects is checked). It’s good to make sure Inkscape is using the same paper size as the PDF to be watermarked.
Use pdftk as below, using the stamp and output keywords.

$ pdftk.exe original.pdf stamp watermark.pdf output original-watermarked.pdf

Figure: Screen capture of the watermarked PDF viewed in Acrobat, with ‘Not for retail sale’ watermarked onto the page.

Caveats: I have not tested how removable said watermark may be. Possibly one should use a PDF flattening tool after application.

$ pdftk input.pdf output output.pdf flatten

I can verify that the flattened PDF is a little smaller than the watermarked one. I do not know if the flattening prevents watermark removal.

Experiments

First, I created the watermark — just used xFig and made a PDF page the same size and orientation as my document with ‘Not for distribution’ in very light grey Helvetica on the diagonal. Put an invisible rectangle around it to prevent cropping by fig2dev.

$ pdftk.exe original.pdf stamp watermark.pdf output original-watermarked.pdf

OK, that works — now tune the watermark. The watermark is in the foreground, of course. So it would be good to make it transparentish. xFig won’t do that as far as I know, so I created the PDF watermark in PowerPoint where it is easy to make the text mostly transparent. Except save as or print to PDF gives a white background … maybe Inkscape?

First, made a simple PNG in a paint program with grey rather than black text and used Gimp to make it semitransparent using Ctrl+L and the opacity slider. Export as PNG and PDF.

As above, put the exported PDF over the test file. Nope, does not work.

In Inkscape, made a simple one-page PDF using the Filters → Fill and Transparency → Opacity menu to set the text box opacity to 0.05. Then Save As PDF, without rasterisation.

Nope, still did not work. Try changing text colour to grey (maybe even semiopaque black is still opaque since black is kind of binary). Nope — but maybe it is the PDF viewer? Try a few others… Not Acrobat. OK, don’t turn off rasterise.

Looks promising! Now tune darkness of watermark — combination of grey vs black and opacity.

Yep, that works.

PDF booklet-y stuff using pdfjam and pdftk

Say you scan an A5 booklet by opening it flat and scanning each pair of pages. You can then print it out in landscape, stapled on the left and you can read the whole booklet in order. But the pages are out of order if you want to make a new saddle-stapled booklet.

So, let’s say I have a PDF like this one: http://site.xavier.edu/polt/typewriters/quietriter.pdf

Figure: Screenshot of a spread from the PDF of the booklet, showing the arrangement of pages: pages 4 and 5 scanned onto a single landscape A4 or letter paper page.

And I want to rearrange it so that I can make it into a proper (roughly A5-sized) booklet, stapled in the middle rather than along the edge. Well, there might be a tool for this, but …

(1) We open it in gv and find out that it’s 788 points wide, half of which is 394. It’s also 598 points high.

(2) Use pdfcrop.sh to make 2 PDFs, one of the left half and one of the right half.

$ pdfcrop.sh -t "0 0 394 0" quietriter.pdf quietriter1.pdf
$ pdfcrop.sh -t "394 0 0 0" quietriter.pdf quietriter2.pdf

Looks good. (Hint: Some PDF viewers don’t view the cropped files correctly. If it looks wrong, try a different viewer before messing with the cropping commands.)
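The two trim strings follow directly from the spread width measured in step (1). As a sketch, with function names made up for the purpose, the arithmetic for any spread width looks like this (trims are in the order left, bottom, right, top, and whole numbers are assumed):

```shell
left_half_trim() {
  # $1 = spread width; keep the left half by trimming the rest off the right
  echo "0 0 $(($1 - $1 / 2)) 0"
}

right_half_trim() {
  # $1 = spread width; keep the right half by trimming half off the left
  echo "$(($1 / 2)) 0 0 0"
}

left_half_trim 788    # value for the first pdfcrop.sh -t call
right_half_trim 788   # value for the second pdfcrop.sh -t call
```

For the 788-wide scan this reproduces the "0 0 394 0" and "394 0 0 0" used above.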

Page order in quietriter1.pdf is: back cover (24), inside front cover (2), 4, 6, 8, 10, 12, 14, 16, 18, 20, 22
Page order in quietriter2.pdf is: front cover (1), 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23

(3) Now, for booklet order, the simplest thing to do would be to put these in order (1, 2, 3, …, 24) then use pdfbook (part of pdfjam).

Sounds like a job for pdftk …

First, we’ll put the first page of quietriter1.pdf to the back. From what I can see, this should work:

$ pdftk quietriter1.pdf cat 2-12 1 output quietriter1a.pdf

(This conCATenates the selected page ranges in the order given.)

(4) Then we interleave 1a and 2 using shuffle, which is designed for just this sort of job:

$ pdftk A=quietriter1a.pdf B=quietriter2.pdf shuffle B A output quietriter_inorder.pdf
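Before touching the real files, the reordering can be checked on paper, or in the shell, using just the page numbers. The sketch below imitates steps (3) and (4) on plain lists of numbers. The function and variable names are invented, no PDF tool is involved, and both lists are assumed to be the same length.

```shell
left_pages="24 2 4 6 8 10 12 14 16 18 20 22"    # page order in quietriter1.pdf
right_pages="1 3 5 7 9 11 13 15 17 19 21 23"    # page order in quietriter2.pdf

first_to_back() {
  # Imitates "cat 2-12 1": move the first item of the list to the end.
  set -- $1
  local first=$1
  shift
  echo "$* $first"
}

interleave() {
  # Imitates "shuffle B A": take one page from each list in turn.
  local rest=$2 out= p q
  for p in $1; do
    q=${rest%% *}      # first remaining item of the second list
    rest=${rest#* }    # drop it from the second list
    out="$out $p $q"
  done
  echo "${out# }"
}

interleave "$right_pages" "$(first_to_back "$left_pages")"
```

If the plan is right, the last line prints the numbers 1 to 24 in order.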

(5) Then we use pdfbook to reorder into booklet order.

$ pdfbook quietriter_inorder.pdf

That gives quietriter_inorder-book.pdf.
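pdfbook does the imposition itself, so nothing more is needed, but it helps to know what order to expect when checking the result. The sketch below lists the usual saddle-stitch order: each group of four numbers is one sheet, front pair then back pair. booklet_order is a made-up name, the page count is assumed to be a multiple of 4, and this is the textbook order rather than something confirmed against pdfbook’s actual output.

```shell
booklet_order() {
  # $1 = page count (assumed to be a multiple of 4)
  local lo=1 hi=$1 out=
  while [ "$lo" -lt "$hi" ]; do
    # Front of the sheet: last and first remaining pages.
    # Back of the sheet: the next page in from each end.
    out="$out $hi $lo $((lo + 1)) $((hi - 1))"
    lo=$((lo + 2))
    hi=$((hi - 2))
  done
  echo "${out# }"
}

booklet_order 24
```

For the 24-page manual the first sheet should carry pages 24 and 1 on one side and 2 and 23 on the other.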

(6) We print the file double-sided with flip on long edge. (I just printed it from Acrobat, having done the command line manipulation running the PDF tools within Cygwin.)

(7) Looks good! Of course, there are no bleeds, but a quick saddle staple and then trimming with a guillotine and it looks very nifty, and a lot like an original booklet.

Figure: Photo of the typewriter and manual. The Remington Rand (Sperry Rand) Letter-Riter (what a terrible name for a typewriter!) with a facsimile of the manual, produced as outlined here.

Booklet.

Crop every page in a multipage PDF file

Apply the same crop to every page in a multipage PDF file. Requires pdftk, Ghostscript, and sometimes pdf2ps/ps2pdf.

See end of post for copy of bash script. Found it at:

https://askubuntu.com/questions/270493/how-to-crop-a-multi-page-image-scanned-pdf-file-which-wont-crop-with-pdfcrop

I found it best to use gv (or GSview) to read off how much to crop from the left, bottom, right and top, then specify those amounts explicitly using the -t switch.

$ ./pdfcrop.sh -t "2 182 3 183" cw16.pdf cw16_crop.pdf

Note the quote marks around the crop values. They are in the order left, bottom, right, top.
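What the -t switch does to each page boils down to simple arithmetic on the page’s MediaBox (lower-left x and y, then upper-right x and y): the left and bottom trims are added to the first pair, the right and top trims are subtracted from the second. A minimal sketch of that arithmetic, where trim_mediabox is a made-up name, only whole numbers are handled, and the example MediaBox of 0 0 788 598 is an assumption based on the booklet scan measured earlier:

```shell
trim_mediabox() {
  # $1 = MediaBox as "llx lly urx ury", $2 = trims as "left bottom right top"
  set -- $1 $2
  # Move the lower-left corner in by left/bottom, the upper-right in by right/top.
  echo "$(($1 + $5)) $(($2 + $6)) $(($3 - $7)) $(($4 - $8))"
}

trim_mediabox "0 0 788 598" "0 0 394 0"   # left half of the spread
trim_mediabox "0 0 788 598" "394 0 0 0"   # right half of the spread
```

Getting the order wrong just shaves the wrong edge, which is easy to spot in a viewer.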

Try on Casiowriter manual from https://www.manualslib.com/download/777108/Casio-Cw-16.html. (See this bit of nonsense.)

Before:

Figure: Front page of the manual before using pdfcrop.sh, showing large bands of white space at top and bottom.

After. One page (the second) came out wrong at first, and I don’t know why. I fixed that by first cleaning up the original PDF, running it through pdf2ps and then ps2pdf; when I ran the cropper on the cleaned-up PDF, all was perfect:

Figure: Front page of the manual after pdfcrop.sh, with no large bands of white space at top and bottom.

The script:

$ cat pdfcrop.sh
#!/bin/bash

function usage () {
  echo "Usage: `basename $0` [Options] <input.pdf> [<output.pdf>]"
  echo
  echo " * Removes white margins from each page in the file. (Default operation)"
  echo " * Trims page edges by given amounts. (Alternative operation)"
  echo
  echo "If only <input.pdf> is given, it is overwritten with the cropped output."
  echo
  echo "Options:"
  echo
  echo " -m \"<left> [<bottom> [<right> <top>]]\""
  echo "    adds extra margins in default operation mode. Unit is bp. A single number"
  echo "    is used for all margins, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top margins alike."
  echo
  echo " -t \"<left> [<bottom> [<right> <top>]]\""
  echo "    trims outer page edges by the given amounts. Unit is bp. A single number"
  echo "    is used for all trims, two numbers \"<left> <bottom>\" are applied to the"
  echo "    right and top trims alike."
  echo
  echo " -hires"
  echo "    %%HiResBoundingBox is used in default operation mode."
  echo
  echo " -help"
  echo "    prints this message."
}

c=0
mar=(0 0 0 0); tri=(0 0 0 0)
bbtype=BoundingBox

while getopts m:t:h: opt
do
  case $opt
  in
    m)
    eval mar=($OPTARG)
    [[ -z "${mar[1]}" ]] && mar[1]=${mar[0]}
    [[ -z "${mar[2]}" || -z "${mar[3]}" ]] && mar[2]=${mar[0]} && mar[3]=${mar[1]}
    c=0
    ;;
    t)
    eval tri=($OPTARG)
    [[ -z "${tri[1]}" ]] && tri[1]=${tri[0]}
    [[ -z "${tri[2]}" || -z "${tri[3]}" ]] && tri[2]=${tri[0]} && tri[3]=${tri[1]}
    c=1
    ;;
    h)
    if [[ "$OPTARG" == "ires" ]]
    then
      bbtype=HiResBoundingBox
    else
      usage 1>&2; exit 0
    fi
    ;;
    \?)
    usage 1>&2; exit 1
    ;;
  esac
done
shift $((OPTIND-1))

[[ -z "$1" ]] && echo "`basename $0`: missing filename" 1>&2 && usage 1>&2 && exit 1
input=$1;output=$1;shift;
[[ -n "$1" ]] && output=$1 && shift;

(
    [[ "$c" -eq 0 ]] && gs -dNOPAUSE -q -dBATCH -sDEVICE=bbox "$input" 2>&1 | grep "%%$bbtype"
    pdftk "$input" output - uncompress
) | perl -w -n -s -e '
  BEGIN {@m=split /\s+/, $mar; @t=split /\s+/, $tri;}
  if (/BoundingBox:\s+([\d\.\s]+\d)/) { push @bbox, $1; next;}
  elsif (/\/MediaBox\s+\[([\d\.\s]+\d)\]/) { @mb=split /\s+/, $1; next; }
  elsif (/pdftk_PageNum\s+(\d+)/) {
    $p=$1-1;
    if($c){
      $mb[0]+=$t[0];$mb[1]+=$t[1];$mb[2]-=$t[2];$mb[3]-=$t[3];
      print "/MediaBox [", join(" ", @mb), "]\n";
    } else {
      @bb=split /\s+/, $bbox[$p];
      $bb[0]+=$mb[0];$bb[1]+=$mb[1];$bb[2]+=$mb[0];$bb[3]+=$mb[1];
      $bb[0]-=$m[0];$bb[1]-=$m[1];$bb[2]+=$m[2];$bb[3]+=$m[3];
      print "/MediaBox [", join(" ", @bb), "]\n";
    }
  }
  print;
' -- -mar="${mar[*]}" -tri="${tri[*]}" -c=$c | pdftk - output "$output" compress

Thanks to the inventor!

PDFishness.
