Checking the fonts on a page in a PDF

Xpdf provides pdffonts.

pdfinfo will tell you how many pages the file has. Say it is 20. Then, at the Linux/Cygwin prompt, suppose you want to check for Times (which, in this case, should not be there!):

$ for f in {1..20} ; do echo Page $f ; pdffonts -f $f -l $f mypdffile.pdf 2> /dev/null | grep Times ; done

The 2> /dev/null gets rid of warnings, such as a missing font weight. -f is the first page and -l is the last page; they are the same because we are stepping through one page at a time.

My file should not have any Times New Roman in it, but often when there’s a wrong font, this is the one (because it is so often the default).

I can run this, and immediately see that I have TNR on pages 19, 8 and 7.

Now, the text may not be visible (PDFs are replete with invisible text, especially if they have images in them), but it helps me find out where to look.

It would be easy enough to write a script to work out the number of pages and feed it into this line; you could have something that takes the PDF name and the font you are looking for.

List the page count of each file:

$ for f in *.pdf ; do echo -n "$f"...." " ; pdfinfo "$f" | grep Pages ; done

For a quick and very dirty solution, if your longest file is 20 pages, then:

$ for g in *.pdf ; do echo "$g" ; for f in {1..20} ; do echo Page $f ; pdffonts -f $f -l $f "$g" 2> /dev/null | grep Times ; done ; done
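A slightly less dirty version would look up each file's real page count instead of hard-coding 20. What follows is only a sketch: scan_pdfs is a made-up function name, and it assumes pdfinfo prints a line beginning with "Pages:" (as the Xpdf and Poppler versions I have seen do). It uses awk and tr to pull out the number, which is my choice here rather than anything the tools require.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is invented for this sketch):
# scan every page of each PDF given, using that file's own page count.
scan_pdfs() {
  local pattern=$1 pdf pages page
  shift
  for pdf in "$@"; do
    echo "$pdf"
    # awk prints the number from the "Pages:" line; tr strips any stray carriage return
    pages=$(pdfinfo "$pdf" 2> /dev/null | awk '/^Pages:/ {print $2}' | tr -d '\r')
    # If pdfinfo fails, pages is empty, counts as 0, and the loop is skipped
    for ((page = 1; page <= pages; page++)); do
      echo "Page $page"
      pdffonts -f "$page" -l "$page" "$pdf" 2> /dev/null | grep "$pattern"
    done
  done
}
```

With that defined in the shell, something like scan_pdfs Times *.pdf should step through every file in the directory.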

If you have the needed programs installed, you might use a script:

#! /usr/bin/bash

## Check fonts in a pdf file.
## v 1 22 Oct 22

## filefont -h gives help, but so does just looking at the script.

while getopts ":h" option ; do
  case $option in
  h)
    echo
    echo Usage is simple and limited:
    echo
    echo filefont filename.pdf fontpattern
    echo
    echo where fontpattern might be Times, say.
    echo
    echo Search is case sensitive unless you put -i in front of grep within this script.
    echo "(That is, you edit the script.)"
    echo "(Or you could just search for imes or oman or talic and skip the first letter...)"
    echo
    exit;;
  esac
done

FILE=$1
echo "$FILE"

## I use dos2unix because I am running Cygwin and I grab binaries from all over the place.

PAGES=$(pdfinfo "$FILE" | grep Pages | cut -d ':' -f 2 | dos2unix)
PAGE=1
while [[ $PAGE -le $PAGES ]]
do
  echo Page $PAGE
  pdffonts -f $PAGE -l $PAGE "$FILE" 2> /dev/null | grep "$2"
  ((PAGE=PAGE+1))
done

The script also shows a few things you can do in bash.

The getopts bit just sees if the user has passed a -h option to the script, and prints out some help. More for my own amusement than anything else.
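If editing the script to get a case-insensitive search feels clumsy, getopts can handle that too. This is a sketch of one way it might be done, not part of the script above: parse_args, IGNORE_CASE and PATTERN are names I have invented for illustration, and the -i option is an assumed addition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parse an optional -i flag as well as -h.
parse_args() {
  local OPTIND=1 option
  IGNORE_CASE=""
  while getopts ":hi" option; do
    case $option in
      h) echo "Usage: filefont [-i] filename.pdf fontpattern"; return 1 ;;
      i) IGNORE_CASE="-i" ;;                       # remember to pass -i to grep
      *) echo "Unknown option: -$OPTARG" >&2; return 2 ;;
    esac
  done
  # Drop the options that getopts consumed, leaving the file and the pattern
  shift $((OPTIND - 1))
  FILE=$1
  PATTERN=$2
}

parse_args -i mypdffile.pdf times
echo "file=$FILE pattern=$PATTERN flag=$IGNORE_CASE"
```

The grep line in the loop would then become grep $IGNORE_CASE "$PATTERN", with $IGNORE_CASE left unquoted so that it vanishes when empty.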

PAGES=$(...) is command substitution: it takes the output produced by the commands inside the parentheses and puts it into the variable PAGES.

We run pdfinfo, use grep and cut to isolate the page count, then run that through dos2unix in case the string has the wrong line ending.
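To see what that pipeline is doing without a PDF to hand, here is a small sketch run on made-up text. The sample string is invented to look like pdfinfo output with Windows line endings, and I have used tr in place of dos2unix so it runs where dos2unix is not installed; tr also strips the padding spaces, which the script above leaves in.

```shell
#!/usr/bin/env bash
# Invented sample that imitates pdfinfo output with CRLF line endings
sample=$'Title:          Example\r\nPages:          20\r\nPage size:      595 x 842 pts\r\n'

# grep keeps the Pages line, cut takes what follows the colon,
# tr deletes the padding spaces and the carriage return
pages=$(printf '%s' "$sample" | grep '^Pages' | cut -d ':' -f 2 | tr -d ' \r')
echo "$pages"
```

Without the clean-up step, a trailing carriage return left in the variable would likely upset the numeric comparison in the while loop.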

We then loop over the file page by page, checking for the bit of text in the second command line argument.

Crude, I know. But handy for checking hundreds of pages or many, many documents. So far it has been more than useful. Even if such a thing takes time to write and debug, it finds instances that I had missed by eye.

Editors away!

Author: Darren

I'm a scientist by training, currently working as a writer, trainer and editor.
