Word count protected PDF

I don’t know if this is a good way to do this, but it worked for me. I could not run pdftotext on the file; it just would not work! (It was protected.) Instead, I used pdfinfo to get the number of pages, then wrote out each page to a PPM file:

$ for f in {1..84} ; do echo $f ; pdftoppm -f $f -l $f -r 150 myfile.pdf myfile ; done
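
If you would rather not type the page count by hand, something like the following sketch might do. It assumes pdfinfo prints a line starting with "Pages:" (the Poppler version does); the function names pages_from_pdfinfo and pdf_to_ppm are made up for this example and are not part of any tool.

```shell
# Print the page count from pdfinfo-style output on standard input.
# (Hypothetical helper; relies on a line of the form "Pages:   84".)
pages_from_pdfinfo() {
    awk '/^Pages:/ { print $2 }'
}

# Render each page of a PDF to its own PPM, one page at a time,
# echoing the page number as it goes (same idea as the loop above).
# Usage: pdf_to_ppm myfile.pdf [dpi]
pdf_to_ppm() {
    pdf=$1
    dpi=${2:-150}
    n=$(pdfinfo "$pdf" | pages_from_pdfinfo)
    i=1
    while [ "$i" -le "$n" ]; do
        echo "$i"
        pdftoppm -f "$i" -l "$i" -r "$dpi" "$pdf" "${pdf%.pdf}"
        i=$((i + 1))
    done
}
```

With that defined, `pdf_to_ppm myfile.pdf` should do the same job as the hand-typed loop.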

Now I had lots of PPM files:

myfile-000001.ppm         myfile-000016.ppm     myfile-000031.ppm      myfile-000046.ppm  ... etc

I could OCR them (note: Cygwin, hence the “.exe” on Tesseract):

$ for f in *.ppm ; do tesseract.exe "$f" "$f" --dpi 150 ; done

Then concatenate and word count:

$ cat *.txt | wc

You could turn this into a script or some kind of one-liner, but I like to spot-check the outputs for errors, so I create the intermediate files, do a few quick checks, then clean up when done. See also here.
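
One cheap spot-check is to list the pages whose OCR output is suspiciously short. This is only a sketch: the function name flag_short_pages and the default threshold of 20 words are assumptions for illustration, not anything Tesseract provides.

```shell
# List every .txt file in a directory that has fewer than MIN words,
# as a rough hint that OCR may have failed on that page.
# Usage: flag_short_pages DIR [MIN]
flag_short_pages() {
    dir=$1
    min=${2:-20}
    for t in "$dir"/*.txt; do
        [ -e "$t" ] || continue            # no .txt files at all
        n=$(( $(wc -w < "$t") ))           # arithmetic strips any padding
        if [ "$n" -lt "$min" ]; then
            printf '%s (%s words)\n' "$t" "$n"
        fi
    done
    return 0
}
```

Running `flag_short_pages . 20` in the working folder gives a short list of files to open and check by eye. A blank page will show up here too, so a hit is not necessarily an error.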

Tesserwhat?

Scanner with document feeder to OCR

I want to scan a bunch of typewritten pages. I have an HP OfficeJet Pro 8600, which has a sheet feeder on the scanner (that is why I got it). It works well with xsane. So … run xsane.

Screenshot: setting up xsane to run how I want it to

  • Choose ADF instead of Flatbed (ADF = automatic document feeder).
  • Choose PNM (Tesseract reads these).
  • Choose 300 DPI.
  • Set the correct number of pages (this is important — leave this as ‘1’ and it will scan all pages but you’ll only get a file for the first one).
  • Where the boot is, set the increment in file names (usually this will be 1, but if you are scanning fronts then backs, you can set this to 2, then start at 1 for the fronts and 2 for the backs, and interleave them).
  • Select Gray (grey), Full colour range.
  • Hit ‘Scan’.

This will/should result in a whole series of PNM files with graduated names. Something like:

north0001.pnm
north0002.pnm
north0003.pnm

Here’s a bit of one.

Extract from a PNG file

Then we can try Tesseract on one of them:

tesseract --dpi 300 north0001.pnm n1
vim n1.txt

OK, works, so automate:

for f in *.pnm ; do tesseract --dpi 300 "$f" "$f" ; echo "$f" done! ; done

Detected 155 diacritics
north0001.pnm done!
north0002.pnm done!
Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box
north0003.pnm done!

Lots of diacritics usually means dirty paper. As for the boxClip errors, I don’t know what causes them, but they do not seem to affect the results.
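
In the output above it is not obvious which page a given message belongs to. A possible variation, sketched below, saves the messages for each page in their own log file. It assumes Tesseract writes its messages to standard error; ocr_pages is a made-up name for this example.

```shell
# OCR every PNM file in the current directory.
# Text goes to FILE.pnm.txt (Tesseract adds the .txt), and whatever
# Tesseract prints on standard error goes to FILE.pnm.log.
# Usage: ocr_pages [dpi]
ocr_pages() {
    dpi=${1:-300}
    for f in *.pnm; do
        [ -e "$f" ] || continue            # no PNM files: nothing to do
        tesseract --dpi "$dpi" "$f" "$f" 2> "$f.log"
        printf '%s done!\n' "$f"
    done
}
```

Afterwards, `grep -l Error *.log` lists the pages that produced an error message, which narrows down what to spot-check.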

Assuming your files are in the right order, you can just cat them all together and happily edit.

cat north*.txt > North.txt
vim North.txt

Then edit as normal.

This works really well if you have a decent ribbon and clean paper. For example, the snippet above came out as:

Just because you have to try, thought Jake, doesn’t mean you have to Care.

So you can see that despite some grottiness on the paper and a general greyness to the scan, and the fact that this is a typical typewriter font, not a font designed for OCR, the only error is that Tesseract has put a capital ‘C’ on ‘care’. That’s pretty good!

I have tried a few OCR tools, and these days I go straight to Tesseract.

OCR

The Australian manual of style is now free

The Australian manual of style (AMOS) (stylemanual.com.au) is a ‘practical guide to help you produce clear, accurate, engaging content’. In other words, it is a style guide to help you with grammar, formatting, document structure, presenting data, designing graphs, writing about science, doing user research, and many other aspects of writing, editing and layout.

It has 6 main sections:

  1. Engaging — who are you writing for, how can you best reach them?
  2. Writing — how to structure your content, decide what is in and what is out, and write it effectively.
  3. Editing — how to refine your work, present it consistently, reference it properly, and so on.
  4. Showing — how to present information using graphics, diagrams and infographics, and incorporate them into the text.
  5. Subject areas — tricks and conventions specific to various fields and subjects.
  6. Resources — quick guides (these are really handy), terms to watch out for, and some other useful information.

Until the end of 2023, AMOS was a subscription-based product, but now it is free and open. Take a look!

It has considerable breadth. It covers the core of English but also gives specifics like how to present chemical equations and mathematics, when to italicise species names, what the current conventions are for presenting graphs, and so on.

My favourite Windows CMD editing environment (for now)

If I must edit a text file on Windows, I prefer using the CMD window and a text-mode editor like Vim or TDE.

I’ve been working in that kind of environment a long time, and I find it comfortable — old-fashioned, yes, but comfortable.

But part of that is getting the right font. And the one I like best is Nouveau IBM Stretch (https://www.dafont.com/nouveau-ibm.font). Install it like any other TTF font on Windows, and then in the CMD window, right-click on the top bar and choose Defaults, then choose your font and size.

TDE is wonderfully businesslike. The status lines give you plenty of information, the menus are easy to customise — so are key bindings — and it has a very useful array of text blocking commands — by stream, line, block and so on. Plus syntax highlighting. All in a megabyte or so!

Nouveau IBM Stretch is designed for terminal use, and it works a treat.

Screenshot of TDE in action with this font -- looks very retro

TDEasy

Joining PDF pages into a multipage PDF on the command line

This is kind of odd but it works. I did it because I could not be bothered looking up the command line syntax of a proper tool, like pdftk or pdfjam. The files are called 1.pdf, 2.pdf … up to 4.pdf.

$ for f in {1..4} ; do pdf2ps $f.pdf ; done

$ cat ?.ps > intermediate.ps
$ ps2ps intermediate.ps final.ps
$ ps2pdf final.ps

In other words:

  1. Convert them all to PostScript files.
  2. Put the PostScript files nose to tail in the desired order, treating them as text files.
  3. Use ps2ps to clean up the result.
  4. Convert back to PDF.

Works for me!
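
For comparison, Ghostscript can also join the PDFs directly with its pdfwrite device, skipping the PostScript round trip. The wrapper below is only a sketch (join_pdfs is a made-up name), and I have not compared its output with the method above.

```shell
# Join several PDFs into one using Ghostscript's pdfwrite device.
# The first argument is the output file; the rest are inputs, in order.
# Usage: join_pdfs final.pdf 1.pdf 2.pdf 3.pdf 4.pdf
join_pdfs() {
    out=$1
    shift
    gs -q -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}
```

If Poppler is installed (it is where pdftotext and pdftoppm come from), `pdfunite 1.pdf 2.pdf 3.pdf 4.pdf final.pdf` is another option.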

ImageMagick convert security policy

Try to convert a file with ImageMagick convert, and:

convert-im6.q16: attempt to perform an operation not allowed by the security policy `PDF' @ error/constitute.c/IsCoderAuthorized/421.
convert-im6.q16: no images defined `i.png' @ error/convert.c/ConvertImageCommand/3229.

WTF? There is some security policy that stops it processing PDFs in specific ways.

https://imagemagick.org/script/security-policy.php

It’s a configuration file, so it’s in /etc:

$ find /etc/ -name policy.xml
/etc/ImageMagick-6/policy.xml

And let’s have a look for PDF:

$ grep -i pdf /etc/ImageMagick-6/policy.xml 
<!-- <policy domain="module" rights="none" pattern="{PS,PDF,XPS}" /> -->
<policy domain="coder" rights="none" pattern="PDF" />

What’s that all about? Well, what else can appear in the “rights=” place?

$ grep -i rights /etc/ImageMagick-6/policy.xml 

Rights include none, read, write, execute and all. Use | to combine them

OK, well I want to convert from PDF to something else, so what if I make that ‘read’?

Note that the first PDF line is commented out anyway. It ought to have no effect — except I don’t know what the defaults are. Maybe ‘none’ is the default. What if I change ‘none’ on the first line (module) to ‘read’? (And uncomment it.)

Nope.

OK, re-comment it. Try the second line (coder), changing ‘none’ to ‘read’:

$ convert 1.pdf 1.png
convert-im6.q16: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `1.png' @ warning/png.c/MagickPNGWarningHandler/1668

Well, it’s a new message. And sounds like just a technical snafu. What if I make it ‘all’:

 <policy domain="coder" rights="all" pattern="PDF" />

Yes, that works. Whose great idea the annoying default is, I do not know.
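
If you would rather script the change than edit the file by hand, a sed filter along these lines might do. It is a sketch: pdf_policy_rights is a made-up name, and the default of "read|write" is my assumption of a narrower setting than ‘all’ (the only setting confirmed to work above is ‘all’). The warning shown earlier comes from the PNG writer rather than the policy check, so ‘read’ alone may already have been enough, but that is an inference, not something tested here.

```shell
# Rewrite the rights on the PDF coder line of a policy.xml read from
# standard input, and print the result on standard output.
# The commented-out "module" line is left alone because it does not match.
# Usage: pdf_policy_rights [rights] < policy.xml > policy.new
pdf_policy_rights() {
    rights=${1:-"read|write"}
    sed 's#\(<policy domain="coder" rights="\)[^"]*\(" pattern="PDF" */>\)#\1'"$rights"'\2#'
}
```

Typical use would be to run it on /etc/ImageMagick-6/policy.xml, check the new file, keep a backup of the original, and then copy the new file into place as root.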

Simple PostScript viewer on Windows 11

So — no admin rights; I want a PostScript viewer; my old go-to tool, Gsview, is no longer maintained.

Evince, the viewer from the Linux Gnome desktop, to the rescue!

https://evince.en.uptodown.com/windows/download

Simply run the installer, ask for ‘advanced’ and choose to install for user. Then it does not need admin rights. It installs the binary to something like:

C:\Users\USERNAME\AppData\Local\Apps\Evince-2.32.0.145\bin>

and this folder can be added to your path or you can put a batch file that calls evince in a folder that is already in your path.

I have a folder c:\Users\USERNAME\installs\bin in my path, so I put evince.bat in there:

C:\>type \Users\USERNAME\installs\bin\evince.bat

"C:\Users\USERNAME\AppData\Local\Apps\Evince-2.32.0.145\bin\evince.exe" %*

And all good! Very nice viewer for several file types, including PostScript and encapsulated PostScript.

Seems like you might want to put the installer file somewhere permanent before running it. Seems odd, but when I ran the shortcut after deleting the installer, it complained that it could not find the installer…!

Inkscape cannot find Ghostscript

Wanted to open a PS/EPS in Inkscape, but got an error message saying that Inkscape could not find Ghostscript.

Now, I don’t have admin rights (work computer), so I downloaded the GS installer (version 10.x for 64-bit Windows) and rather than running it, I unzipped it into:

C:\Users\USERNAME\installs\gs10

and added

C:\Users\USERNAME\installs\gs10\bin

and

 C:\Users\USERNAME\installs\gs10\lib

to my personal path. Then I created 2 new environment variables (personal ones, no need for admin):

GS=gswin64

and

GS_LIB=C:\Users\USERNAME\installs\gs10\lib

Then ran Inkscape and now it can find the GS binary… I can choose built-in or Poppler import and all is well!

I can also now use GS from the command line if I want to.

Emergency PDF crop

PDF with a really big, weird paper size. Instructions in 6 languages. I wanted to crop out the English bit since it was about the size of an A4 and would print at reasonable size.

In the end, I used pdf2ps to make a PostScript file, viewed it in gv to get the coordinates of the bottom left (x1, y1) and top right (x2, y2) of the box I wanted, used ps2epsi to turn the PS into EPS, manually put the numbers from gv into the EPS bounding box (x1 y1 x2 y2) and hi-res bounding box, and then used epspdf to convert to PDF.

The main thing about this is that it is very flexible and bespoke in the page sizes it can deliver.

And it worked.
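
The manual edit of the bounding box could be done with sed instead. The sketch below assumes the EPS has one %%BoundingBox and one %%HiResBoundingBox comment, each at the start of a line, and that the PDF has a single page; set_bbox and crop_pdf are made-up names for this example.

```shell
# Replace the bounding box comments of an EPS read from standard input
# with the given coordinates (bottom left x1 y1, top right x2 y2).
# Usage: set_bbox x1 y1 x2 y2 < in.epsi > cropped.eps
set_bbox() {
    box="$1 $2 $3 $4"
    sed -e "s/^%%BoundingBox:.*/%%BoundingBox: $box/" \
        -e "s/^%%HiResBoundingBox:.*/%%HiResBoundingBox: $box/"
}

# The whole sequence in one go, once the coordinates are known from gv.
# Usage: crop_pdf big.pdf x1 y1 x2 y2    (writes big-crop.pdf)
crop_pdf() {
    base=${1%.pdf}
    pdf2ps "$1" "$base.ps" &&
    ps2epsi "$base.ps" "$base.epsi" &&
    set_bbox "$2" "$3" "$4" "$5" < "$base.epsi" > "$base-crop.eps" &&
    epspdf "$base-crop.eps" "$base-crop.pdf"
}
```

The coordinates still have to be read off in gv first, so in practice you would run pdf2ps, look at the result, and only then call crop_pdf with the four numbers.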

Notes:

$ epspdf -h
Epspdf version 0.6.5.1
Copyright (c) 2006-2023 Siep Kroonenberg

Convert between [e]ps and pdf formats
Usage: epspdf[.tlu] [options] infile [outfile]
Default for outfile is file.pdf if infile is file.eps or file.ps
Default for outfile is file.eps if infile is file.pdf

-p, --page, --pagenumber PNUM
Page number; must be a positive integer
-g, --grey, --gray, -G, --GREY, --GRAY
Convert to grayscale
-b, --bbox, --BoundingBox
Compute tight boundingbox
-T, --target TARGET
One of screen, ebook, printer, prepress or default
-N, --pdfversion VERSION
One of 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 or default
-U Use pdftops if available
-I Reverses the above
-s, --save Save some settings to configuration file
-i, --info Info: display detected filetype and exit
-d Debug: do not remove temp files
-v, --version
Display version info and exit
-h, --help Display this help message and exit

$ man ps2epsi (part of Ghostscript)

PS2EPSI(1) Ghostscript Tools PS2EPSI(1)

NAME
ps2epsi - generate conforming Encapsulated PostScript

SYNOPSIS
ps2epsi infile.ps [ outfile.epsi ] (Unix)
ps2epsi infile.ps [ outfile.epi ] (DOS)

DESCRIPTION
ps2epsi uses gs(1) to process a PostScript(tm) file and generate as output a new file which conforms to Adobe's Encapsulated PostScript Interchange (EPSI) format. EPSI is a special form of encapsulated PostScript (EPS) which adds to the beginning of the file in the form of PostScript comments a bitmapped version of the final displayed page. Programs which understand EPSI (usually word processors or DTP programs) can use this bitmap to give a preview version on screen of the PostScript. The displayed quality is often not very good (e.g., low resolution, no colours), but the final printed version uses the real PostScript, and thus has the normal PostScript quality.


[etc]

$ man pdf2ps

PDF2PS(1) Ghostscript Tools PDF2PS(1)

NAME
pdf2ps - Ghostscript PDF to PostScript translator

SYNOPSIS
pdf2ps [ options ] input.pdf [output.ps]

DESCRIPTION
pdf2ps uses gs(1) to convert the Adobe Portable Document Format (PDF) file "input.pdf" to PostScript(tm) in "output.ps". Normally the output is allowed to use PostScript Level 2 (but not PostScript LanguageLevel 3) constructs; the -dLanguageLevel=1 option restricts the output to Level 1, while -dLanguageLevel=3 allows using LanguageLevel 3 in the output.

FILES
Run "gs -h" to find the location of Ghostscript documentation on your system, from which you can get more details.

VERSION
This document was last revised for Ghostscript version 10.00.0.

AUTHOR
Artifex Software, Inc. are the primary maintainers of Ghostscript.

10.00.0 21 September 2022 PDF2PS(1)

Halibut syntax highlighting on Vim on FreeDOS

Adding Halibut syntax highlighting to Vim on FreeDOS (because everybody needs this, right?).

First, found and downloaded Halibut 1.2 from BTTR Software, Ports & Builds.

Vim was installed using fdnpkg and is in c:\apps\vim.

Copied the halibut.vim into c:\apps\vim\syntax.

This makes the new syntax highlighting available to Vim — but how does it know when to use it?

Added these lines to c:\apps\vim\filetype.vim (there’s a big alphabetic list of types, so it is easy to see where to best put it):

"Halibut
au BufNewFile,BufRead *.but setf halibut

And now, with _vimrc (in my %HOME% folder) containing

syntax on

I can use Vim to edit Halibut files with syntax highlighting.

syntax