I have been using crgrep to search for bits of text in Word files, but the project has not changed since 2016 (for some reason it’s not all that widely used) and, while I find it very useful, I am beginning to worry. It is throwing errors it did not use to, probably because of changes in Java. It still works, and well, and you should try it, but I began to think about other options.
But I am not a programmer, as the contents of this blodge make perfectly clear. I am not going to put together a like-for-like replacement.
But what is it I need to do?
I basically need to grep through Word files, looking for phrases.
So I thought about the LibreOffice command line.
Regular grep is good for text files, so I figured a good way to work, for a simple brain like mine, would be to convert the files to text and then grep. I often want to do lots of tests on my set of files, so rather than convert and grep on the fly, it makes more sense to convert to text, keep the text files somewhere, and grep them as needed, then delete when the project is over.
So the mandate then became — go find all the relevant files, and dump plain text versions of them in some place. (I don’t have to worry about duplicate file names.)
Let’s assume I am working in the directory of interest, in a Cygwin terminal window that gives me all the Unix tools. For example, this find command is not the Windows find.exe command. If unsure, I can see which I am running:
$ which find
/usr/bin/find
And see it is not from the Windows file tree. I can find the ones in the Windows tree using Unix tools (/cygdrive/c is the C: in Cygwin language, and 2> /dev/null just dumps any error messages [mostly directories I am not allowed to search] to the null device, that is, discards that output):
$ find /cygdrive/c/Windows/ -name find.exe 2> /dev/null
/cygdrive/c/Windows/System32/find.exe
/cygdrive/c/Windows/SysWOW64/find.exe
Anyway, I don’t want to clutter up existing directories, so I create a place to put the text files:
$ mkdir txtfiles
I then find the DOCX files and process them (WordPress might put line breaks in this, but there should really be none; it’s one line):
$ find . -iname "*ting*.docx" -exec swriter --convert-to txt:Text --outdir txtfiles {} \;
And now I can use text grep to hunt through them. That is it.
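As a rough sketch of that hunting step: the two helper functions below are made up for illustration (hunt and hunt_lines are not part of grep or of anything above), and they assume the text files live in txtfiles unless another directory is given. The grep options themselves are standard: -r searches the whole directory, -i ignores case, -l lists only file names, -n shows line numbers, and -F treats the phrase as plain text rather than a regular expression.

```shell
# hunt PHRASE [DIR]: list the text files that contain PHRASE, ignoring case.
# (Hypothetical helper name; DIR defaults to txtfiles.)
hunt() {
    grep -r -i -l -F -- "$1" "${2:-txtfiles}"
}

# hunt_lines PHRASE [DIR]: show every matching line as file:line-number:text.
hunt_lines() {
    grep -r -i -n -F -- "$1" "${2:-txtfiles}"
}

# Usage: hunt "some phrase"
#        hunt_lines "some phrase" txtfiles
```

The first form is handy for a quick "which documents mention this?"; the second shows the matching text in context.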
To parse that find command:
find - the command
. (a full stop) - the current directory (this process seems to work best when this is a . and not some other folder)
-iname - search by name, ignore case (-name does not ignore case)
"*ting*.docx" - find files with ting in the name and ending in .docx (change the search to suit the occasion)
-exec - treat the next bit as a command to apply to the file you found
swriter - the LibreOffice Writer binary
--convert-to - tells Writer to convert the input to some other format
txt:Text - says to convert it to a text file; note that Text must be capped
--outdir txtfiles - write the text files to outdir (txtfiles in this case, created previously)
{} - this stands for the file name found in the search
\; - end (of the -exec command)
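One way to see how -exec and {} fit together without converting anything is a dry run: put echo in front of the command, so that find prints each command line it would have run. This is only a sketch; the function name preview_conversion is invented here, and the search pattern is passed in as its argument.

```shell
# Dry run: echo prints the swriter command that find would build for each
# matching file, instead of running it. Nothing is converted.
# (preview_conversion is a hypothetical helper name.)
preview_conversion() {
    find . -iname "$1" -exec echo swriter --convert-to txt:Text --outdir txtfiles {} \;
}

# Usage: preview_conversion "*ting*.docx"
```

If the printed lines name the files you expected, remove the echo and run it for real.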
Works very nicely. First run can take a while, but then plain text greps are nice and fast.
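Since the slow part is the conversion, a later run could skip any file that already has an up-to-date text version. The bash sketch below is one possible way to do that, not a tested recipe for every setup. It assumes that the converter names its output by swapping the extension (Foo.docx becomes Foo.txt in the output directory), and both the function name convert_new and the CONVERTER variable are made up for illustration (CONVERTER defaults to swriter and exists only so something else can be swapped in).

```shell
#!/usr/bin/env bash
# Sketch: convert matching Word files to text, skipping ones already converted.
# convert_new and CONVERTER are hypothetical names; CONVERTER defaults to swriter.

convert_new() {
    local pattern=$1 outdir=${2:-txtfiles}
    local doc base txt
    mkdir -p "$outdir"
    # -print0 and read -d '' keep file names with spaces intact.
    find . -iname "$pattern" -type f -print0 |
    while IFS= read -r -d '' doc; do
        base=$(basename "$doc")
        txt="$outdir/${base%.*}.txt"   # assumed output name: Foo.docx -> Foo.txt
        # Skip if a text version exists and the Word file is not newer than it.
        if [ -e "$txt" ] && ! [ "$doc" -nt "$txt" ]; then
            continue
        fi
        # < /dev/null stops the converter from swallowing the list of file names.
        ${CONVERTER:-swriter} --convert-to txt:Text --outdir "$outdir" "$doc" < /dev/null
    done
}

# Usage: convert_new "*ting*.docx" txtfiles
```

A Word file that has been edited since its last conversion is newer than its text file, so it gets converted again; everything else is left alone.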