Search inside Word, PDF, XML and other files: installing and using crgrep

I am an editor in a business that uses Micro$oft products, but I want to be able to use the Linux CLI tools with which I am moderately familiar. In particular, I want to be able to grep Word documents, and that’s a problem because the new Word file format chops the text up and zips it up and hides it away. I googled and read a bit about crgrep (‘common resource grep’). Here is my experience so far.

Downloaded from

https://bitbucket.org/cryanfuse/crgrep

or

https://sourceforge.net/projects/crgrep/

Created a subdirectory c:\Users\username\installs\crgrep and downloaded the zip file into it. Worked in Cygwin, hence the forward slashes and dollar signs in the following. This could also be done through the GUI or in a PowerShell or CMD window. Choice is a wonderful thing.

$ unzip crgrep-1.0.5.zip
$ cd crgrep-1.0.5/
$ vim INSTALL.txt

OK, so it needs Java. Does it need the compiler? Probably not, but check. In the crgrep folder, I typed:

$ grep -ir javac

Nothing in the tree calls the javac compiler, so it looks like the program needs only the runtime (JRE), not the development kit (JDK). That’s good, and it’s what you’d expect. Now, I have the wonderful ImageJ installed (works effortlessly in userspace), and it bundles a Java runtime environment. Maybe I can use that.

Now, according to the INSTALL.txt file, the JAVA_HOME variable that crgrep wants points at something like

JAVA_HOME=C:\Program Files\Java\jdk1.8.0_xx

and my grepping told me that java.exe should be at %JAVA_HOME%\bin\java.exe

In Cygwin, my ImageJ tree looks like:

/cygdrive/c/Users/username/installs/ij/ImageJ/jre/bin

That meant I needed to set JAVA_HOME to C:\Users\username\installs\ij\ImageJ\jre (a Windows-style path). That is, the variable points at the directory that contains the bin directory, not at the bin directory or the binary file itself.
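
The relationship between the java binary and JAVA_HOME can be sketched in the shell. This is only an illustration, not part of crgrep: the function name java_home_from_binary is made up for this post, and the path is the ImageJ one from above. On Cygwin, cygpath -w then turns the result into the Windows-style form.

```shell
# Hypothetical helper: given the full path to a java binary, print the
# directory that JAVA_HOME should point at (the parent of bin/).
java_home_from_binary() {
    bin_dir=$(dirname "$1")   # .../jre/bin
    dirname "$bin_dir"        # .../jre
}

java_bin=/cygdrive/c/Users/username/installs/ij/ImageJ/jre/bin/java.exe
java_home_from_binary "$java_bin"

# On Cygwin only, convert the result to a Windows-style path:
#   cygpath -w "$(java_home_from_binary "$java_bin")"
```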

But first I checked the version; crgrep needs 1.8.

$ cd ../../ij/ImageJ/jre/bin/

$ ./java.exe -version
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

OK.
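
A check like this can also be scripted. The sketch below is my own illustration (the function name is_java_8 is hypothetical). It only looks at the first line of the version banner. Note that java prints the banner to standard error, which is why a real call needs 2>&1.

```shell
# Hypothetical helper: succeed if a version banner line reports Java 1.8.
is_java_8() {
    case "$1" in
        *'version "1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample banner line copied from the output above.
banner='java version "1.8.0_112"'
if is_java_8 "$banner"; then
    echo "Java 8 runtime found"
else
    echo "Not Java 8"
fi

# With a real runtime, the banner would come from:
#   banner=$(./java.exe -version 2>&1 | head -n 1)
```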

I installed in userspace (work computer, no root/admin access), so the variables had to be set for my account only. I went to my Windows account settings (given the various versions of Windows, I’ll assume a user can find their own account settings page). In Windows 10 (Windows 7 is no doubt different), I typed ‘env’ for ‘environment variables’ in the ‘Find a Setting’ box and chose ‘edit the variables for your account’. Note that searching for ‘path’ turns up nothing. It’s a little trick!

Added an entry to the path:

(Path → Edit → New)

c:\Users\username\installs\crgrep\crgrep-1.0.5\bin

and created a new environment variable:

JAVA_HOME=C:\Users\username\installs\ij\ImageJ\jre

Then I closed everything, especially the command-line window (a window that is already open does not pick up new variables), opened it again, typed SET in a CMD (‘DOS’) window to check that the new variables were present, and tried the command:

H:\>crgrep --help
usage: crgrep [options]  []
crgrep: Common Resource Grep.
 -a,--text               Process binary files or database columns as if
                         they were text
    --color        Alias for 'colour'.
    --colour       Colour-highlight matched text ('always', 'auto'
                         or 'never'). Default colour is red, see USAGE.txt
                         for other colour settings.
 -d,--database           Database grep (disables file search)
 -h,--help               Help
 -i,--ignore-case        Ignore case distinctions in matched text
 -l,--list               List resources which produce a match by name. No
                         content is searched.
 -m,--maven              Include Maven POM file dependencies in search
    --mood    Only include matching content expressing a
                         specific sentiment; values include 'positive',
                         'negative' or 'neutral'. Ignored if -l specified.
                         Requires model data; see INSTALL.txt
    --ocr                Enable OCR text extraction from images; requires
                         tesseract libraries. See INSTALL.txt
 -p,--password      Password required to access a resource,
                         optionally used with -u
 -P,--proxy         Proxy settings for http access, specified as
                         [:]
 -r,--recurse            Recursive search into resources
 -u,--user          User ID or username required to access a resource
 -U,--uri           URI to specify a JDBC database resource
 -V,--version            Print the version number of CRGREP to the
                         standard output stream
    --warn               Display all warnings to standard output
 -X,--extensions    Enable one or more extensions; comma sep. list
                         such as -Xdebug,trace
If  is not specified, or is '-', read from stdin
Please report issues at https://bitbucket.org/cryanfuse/crgrep/issues

OK, promising.
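
The two settings can also be checked from a Cygwin shell rather than with SET. This is a minimal sketch, assuming Cygwin has imported the Windows variables into its own environment. The helper name on_path is made up, and it is an assumption that crgrep-1.0.5/bin holds a launcher the shell can run directly.

```shell
# Hypothetical helper: succeed if a command name resolves on the PATH.
on_path() {
    command -v "$1" >/dev/null 2>&1
}

# Report on JAVA_HOME (it may be unset in a shell opened before the change).
if [ -n "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is set to: $JAVA_HOME"
else
    echo "JAVA_HOME is not set in this shell"
fi

# Report on crgrep (assumes the launcher in crgrep-1.0.5/bin is runnable here).
if on_path crgrep; then
    echo "crgrep found on PATH"
else
    echo "crgrep not found on PATH"
fi
```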

I want it for grepping Word files, so let’s see… yes, it finds ‘data’ in the test file, and outputs a nice clean stream:

H:\>crgrep data text.docx
text.docx:T:A key part of his research was the analysis of large
datasets. As part of this he developed a software suite that included
data modelling, reduction and correction techniques, and made of use the
National Computing Infrastructure and other supercomputers. He enjoys
the challenge of analysing and explaining complex data using words and
carefully designed graphics. He likes Linux and the LATEXtypesetting
system.

How about PDF? Converted the Word doc to PDF using the ‘Save as’ dialogue in Word. Then…

H:\>crgrep data text.pdf
text.pdf:1:36:datasets. As part of this he developed a software suite
text.pdf:1:37:that included data modelling, reduction and correction
text.pdf:1:40:challenge of analysing and explaining complex data using

Different output because of how PDF and Word chop up the text, but instances found in both cases. No need to specify a file type or anything. I have not explored the command line options, but I am already finding the program useful — for example, when I want to find multiple instances of multiple expressions (say acronyms or references) in multifile projects.
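
For the multiple-expressions, multiple-files case, a small loop is enough. This is a sketch, not something shipped with crgrep: search_all and the file names are hypothetical, and it relies only on the plain ‘crgrep pattern file’ form used above.

```shell
# Hypothetical helper: run a search command once per pattern over a set of files.
#   $1 = search command (e.g. crgrep)
#   $2 = file with one pattern per line
#   remaining arguments = files to search
search_all() {
    cmd=$1
    patterns=$2
    shift 2
    # The "|| [ -n ... ]" part keeps a last line that lacks a trailing newline.
    while IFS= read -r pat || [ -n "$pat" ]; do
        [ -n "$pat" ] || continue          # skip blank lines
        printf '== %s ==\n' "$pat"
        # stdin comes from /dev/null so the command cannot consume the pattern list
        "$cmd" "$pat" "$@" </dev/null || echo "(no match, or the command failed)"
    done <"$patterns"
}

# Example call (file names are placeholders):
#   search_all crgrep acronyms.txt chapter1.docx chapter2.docx report.pdf
```

Passing the search command as the first argument means the same loop can be tried out with ordinary grep on plain-text files before pointing it at crgrep.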

Kudos.

Just grepping around.

About Darren

I'm a scientist by training, based in Australia.
