Text search a whole lot of Word DOCX files

I have been using crgrep to search for bits of text in Word files, but the project has not changed since 2016 (for some reason it’s not all that widely used) and, while I find it very useful, I am beginning to worry. It is throwing errors it did not use to, probably because of changes in Java. It still works, and well, and you should try it, but I began to think about other options.

But I am not a programmer, as the contents of this blodge make perfectly clear. I am not going to put together a like-for-like replacement.

But what is it I need to do?

I basically need to grep through Word files, looking for phrases.

So I thought about the LibreOffice command line.

Regular grep is good for text files, so I figured a good way to work, for a simple brain like mine, would be to convert the files to text and then grep. I often want to do lots of tests on my set of files, so rather than convert and grep on the fly, it makes more sense to convert to text, keep the text files somewhere, and grep them as needed, then delete when the project is over.

So the mandate then became — go find all the relevant files, and dump plain text versions of them in some place. (I don’t have to worry about duplicate file names.)

Let’s assume I am working in the directory of interest, in a Cygwin terminal window that gives me all the Unix tools — for example, this find command is not the Windows find.exe command. If unsure, I can see which I am running:

$ which find
/usr/bin/find

And see it is not from the Windows file tree. I can find the ones in the Windows tree using Unix tools (/cygdrive/c is the C: in Cygwin language, and 2> /dev/null just dumps any error messages [mostly directories I am not allowed to search] to the null device — that is, discards that output):

$ find /cygdrive/c/Windows/ -name find.exe 2> /dev/null
/cygdrive/c/Windows/System32/find.exe
/cygdrive/c/Windows/SysWOW64/find.exe

Anyway, I don’t want to clutter up existing directories, so I create a place to put the text files:

$ mkdir txtfiles

I then find the DOCX files and process them (WordPress might put line breaks in this, but there should really be none; it’s one line):

$ find . -iname "*ting*.docx" -exec swriter --convert-to txt:Text --outdir txtfiles {} \;

And now I can use text grep to hunt through them. That is it.

To parse that find command:

  1. find — the command
  2. . (a full stop) — the current directory (this process seems to work best when this is a . and not some other folder)
  3. -iname — search by name, ignore case (-name does not ignore case)
  4. "*ting*.docx" — find files with ting in the name and ending in .docx (change the search to suit the occasion)
  5. -exec — treat the next bit as a command to apply to the file you found
  6. swriter — the LibreOffice Writer binary
  7. --convert-to — tells Writer to convert the input to some other format
  8. txt:Text — says to convert it to a text file; note that Text must be capped
  9. --outdir txtfiles — write the text files to outdir (txtfiles in this case, created previously)
  10. {} — this stands for the file name found in the search
  11. \; — end.

Works very nicely. First run can take a while, but then plain text greps are nice and fast.
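
Once the text files exist, the searching itself can be wrapped in a tiny shell function. This is only a sketch: search_txt is a name I made up, and it assumes the converted files sit in txtfiles (or whatever directory is given as the second argument).

```shell
# Hypothetical helper: list the converted text files that contain a phrase.
# Usage: search_txt "some phrase" [directory]
search_txt() {
    # -r  recurse into the directory
    # -i  ignore case
    # -l  print only the names of matching files
    # -F  treat the phrase as a fixed string, not a regular expression
    grep -rilF -- "$1" "${2:-txtfiles}"
}
```

So search_txt "some phrase" lists every text file in txtfiles that mentions the phrase. Drop the -l to see the matching lines themselves.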

greps away!

USB to serial (DE9, often called DB9) — what chip is that?

Just my notes on a small fraction of what’s out there.

Any cheap unit (under about $10) is likely to have a WinChipHead CH340/341 (or whatever it is called), and might work but might not. Mixed results at best. One of these almost bricked one of my machines. Also, even though they are notionally a Prolific copy — in that on Linux you’ll use the pl2303 kernel module as a driver — Linux will often not be able to drive them. Windows seems to do better, because the manufacturer-supplied drivers are better for Win; but still not quite right.

It’s not always easy to know whether a cable will have an original chip in it or a cheap knockoff. I guess if you buy an FTDI-made cable you can be sure it will contain an FTDI chip. I’ve seen a lot of cables advertised as FTDI but the outer mouldings are the same as on the really cheap WinChip cables, and I have my doubts about what’s inside.

Some experts recommend FTDI — http://www.usconverters.com/index.php?main_page=page&id=62&chapter=0.

Here is what I found out about some of the infinite number of options.

  1. The really cheap (often blue) no-name ones that are on eBay for like $5. Good luck! It might work, and you’ll save $30+ compared to some options, but it might not; I’ve had limited luck. They have a knock-off of the Prolific PL2303 — at least, that’s the kernel module that Linux loads (though that does not mean the device will work as you expect).
  2. ATEN UC232A or A1. What’s the chip? Linux driver (see https://assets.aten.com/product/manual/uc232a-uc232a1_um_w_2020-10-13.pdf) uses the pl2303 module. So that is … Prolific.
  3. Alogic UD29A PL2303 — Prolific (see https://www.alogic.co/pub/media/mageworx/downloads/attachment/file/a/p/ap1103_20140103.zip).
  4. CHIPI-X10 — FTDI (https://ftdichip.com/products/chipi-x10/) — actual FTDI product, so probably a good bet! (FT231X); other options too (eg US232R-10-BULK).
  5. Astrotek 205153-D00000 — manufacturer website down. Probably PL2303 or clone.
  6. Klik KU2DB9015 — shops say FTDI, but cannot find manufacturer page. ‘CD’ at https://www.comsol.com.au/Products-by-Category/USB-Converters/KU2DB9015 has PL2303 and FTDI drivers on it … so probably FTDI.
  7. MCT U232-P9 — this is advertised in some places as FTDI, but a little bit of digging suggests it is Prolific. Some older ones use a Philips chip. Seems like an old device.
  8. Sunix UTD1009DF — manufacturer says FTDI: https://www.sunix.com/en/product_detail.php?cid=1&kid=3&gid=18&pid=1924.
  9. Sunix UTS1009GC is Prolific. So is UTS1009D — varies by model.
  10. Unnamed one from Jaycar — (https://www.jaycar.com.au/usb-to-db9m-rs-232-converter-1-5m/p/XC4834) Prolific (based on drivers supplied).
  11. CableCreation — seems to do both. Seems too cheap … sold on Amazon, which I try to avoid.
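
If the cable is already in hand, Linux can give a hint about what is inside: lsusb prints a vendor:product ID for each device, and the vendor half points at the chip maker. A rough sketch follows. The function names are my own, only three vendor IDs are covered, and clones can report the genuine maker's ID, so treat the answer as a hint and not proof.

```shell
# Map a USB vendor ID (as printed by lsusb) to a likely chip maker.
# Clones may report the genuine maker's ID, so this is only a hint.
chip_vendor() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        0403) echo "FTDI" ;;
        067b) echo "Prolific" ;;
        1a86) echo "QinHeng (WCH CH340/CH341)" ;;
        *)    echo "unknown" ;;
    esac
}

# List attached USB devices with a guess at the serial chip maker.
# lsusb lines look like: Bus 001 Device 002: ID 0403:6001 Description...
list_usb_chips() {
    lsusb | while read -r bus busnum dev devnum label id rest; do
        printf '%s  %s  %s\n' "$id" "$(chip_vendor "${id%%:*}")" "$rest"
    done
}
```

Plug the cable in and run list_usb_chips; most lines will say unknown, which just means the device is not one of the three makers listed.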

Tourmaline by Randolph Stow

I really liked this book. I also think it is a good book, which is of course not the same thing. It’s lyrical, dreamlike, almost mythic at times, evocative and free of cliche. Stow’s prose is at times poetic (he was a poet), and his characters speak with directness and honesty real people don’t often use, and they do so to address issues — love, purpose, hope — that real people often avoid confronting. So while the town of Tourmaline is evoked with clarity and power, the book is not, on the whole, realist. Every town is full of people struggling to make sense of their life and their choices; in Tourmaline, they articulate that.

front cover

The plot? A man, almost dead from exposure in the desert, is brought into town by the monthly lorry that brings supplies. The town — a dozen people, perhaps, who hang on while the town decays — gather round to gawk, help, wonder or just observe. He heals. He says his name is Michael Random, which no-one believes but no-one questions. He says he is a water diviner — in a town that is desiccating day by day, year by year, that looks across a salt lake, whose oldest inhabitants (like our narrator) are the only ones who can remember it raining. He fascinates them. Each reacts to him, and to his effect on the others, in their own way.

The narration is unusual; first person, but… The story is told by the town policeman — but he narrates episodes for which he has only hearsay, invention or later reports. He admits as much. On the whole, it works very well.

The wider context — there is none, or very little. A somewhat odd author’s note tells us the story takes place in the future. That, together with a few little hints in the text that all is not well in the wider world, makes it, by some definitions, possibly a post-apocalyptic novel, though you would not call it science fiction. It may well be climate fiction — we can imagine the arid Tourmaline as what is left after the rain patterns move south (as they will in Australia) in a warming world.

In feel, the closest you might get is J G Ballard’s disaster novels, like The Drowned World or The Crystal World, where the protagonists embrace the strange new world and plough on into it (instead of running away) at the behest of some unexplained but oddly believable internal need.

I think this is a terrific Australian novel. Subject to its depiction of Indigenous Australians being acceptable to Indigenous Australians, which I cannot answer, I’d like to see it being much more widely known; at least as much as the works of Patrick White or Tim Winton.

Highly recommended.

short read of pkg_summary truncated bzip2 input

NetBSD 9.1 on RPi 1B (earmv6hf). Boot it up, su to root and:

# pkgin update

gives

short read of pkg_summary truncated bzip2 input

Tried changing the path in /etc/pkg_install.conf from ftp://ftp... to https://cdn....

No good.

Now, root is currently logged into the default shell. Try using bash and running this:

PATH="/usr/pkg/sbin:$PATH:/usr/sbin:/sbin"
PKG_PATH=https://cdn.NetBSD.org/pub/pkgsrc/packages/NetBSD/earmv6hf/9.1/All/
export PATH PKG_PATH
pkg_add -v lintpkgsrc

That all works, but # pkgin update gives the same error.

To where does it unzip pkg_summary.bz2?

Let’s get it manually:

# wget ftp://ftp.NetBSD.org/pub/pkgsrc/packages/NetBSD/earmv6hf/9.1/All/pkg_summary.bz2

Unpack it locally to see if it seems intact.

Compressed file ends unexpectedly!

What if we try the gzipped pkg_summary?

Seems OK. So it looks like a corrupt file on the mirror. Send in a bug report. Is there a way to force pkgin to use the gz file?
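
The intact-or-not check can be done without actually unpacking anything, since both bzip2 and gzip have a -t (test) option. A small sketch, where check_archive is a name I made up and the file is assumed to have been downloaded already:

```shell
# Hypothetical helper: test a downloaded pkg_summary archive for damage.
# Returns 0 if the file is intact, non-zero if it is truncated or corrupt,
# and 2 if the extension is not one it knows about.
check_archive() {
    case "$1" in
        *.bz2) bzip2 -t "$1" ;;
        *.gz)  gzip -t "$1" ;;
        *)     echo "unknown archive type: $1" >&2; return 2 ;;
    esac
}
```

For example, check_archive pkg_summary.bz2 && echo intact prints intact only if the whole file made it down the wire.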

After a few days the non-truncated bz2 file showed up on the server and all was well.

Whereas some bug reports go ignored for 24 years…

SiPix A6 on Linux — works a treat

This is relatively simple. We set the serial port settings as per the SiPix instructions (see link below), we make a binary file according to the recipe at OpenPrinting (link below) and we send that binary file to the requisite device. That shows that it works. But then … then it turns out we can use CUPS, like a real printer.

We have a hardware serial port, and the user is in the dialout group so:

$ stty -F /dev/ttyS0 -a
$ stty -F /dev/ttyS0 115200 cs8 -cstopb -parenb
$ stty -F /dev/ttyS0 -ixon
$ stty -F /dev/ttyS0 crtscts

Then we follow the steps here to make a binary file then we cat it to the printer. Make a new file then print it. (Writer is my script that creates an empty ODT file then opens it.)

$ Writer newfile.odt

Make an A4 page of text, export to PDF from LibreOffice, then …

$ pdftops newfile.pdf
$ psresize -h14.8cm -w10.5cm -PA4 newfile.ps newfilea6.ps
$ gs @sipixa6.upp -sPAPERSIZE=a6 -sOutputFile=file-to-print.bin newfilea6.ps -c quit
$ cat file-to-print.bin > /dev/ttyS0

Note: psresize does not have an A6 size, so used the actual dimensions. The UPP file comes from http://openprinting.org/driver/sipixa6.upp and is used unmodified. We copy it to somewhere gs will find it.
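
The four steps above could be rolled into one small shell function. This is only a sketch: print_a6 is a name I made up, and it assumes pdftops, psresize and gs are installed, that sipixa6.upp is somewhere gs will find it, and that the serial port has already been set up with stty as shown earlier.

```shell
# Hypothetical wrapper: convert an A4 PDF to the SiPix A6 format and print it.
# Usage: print_a6 FILE.pdf [DEVICE]
print_a6() {
    pdf=$1
    dev=${2:-/dev/ttyS0}     # serial port the printer is on
    base=${pdf%.pdf}         # file name without the .pdf ending

    # PDF to PostScript, shrink A4 to A6 dimensions, render with the
    # SiPix uniprint settings, then send the result to the printer.
    pdftops "$pdf" "$base.ps" &&
    psresize -h14.8cm -w10.5cm -PA4 "$base.ps" "${base}a6.ps" &&
    gs @sipixa6.upp -sPAPERSIZE=a6 -sOutputFile="$base.bin" "${base}a6.ps" -c quit &&
    cat "$base.bin" > "$dev"
}
```

Then print_a6 newfile.pdf does the lot. The && chaining means nothing gets sent to the printer if an earlier step fails.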

In other words, it ‘just works’ with a real hardware serial port. Now, to set it up as a CUPS printer … we can ask openprinting.org to generate us a PPD file, and we have the serial port settings, because you can find a manual online (eg https://images-na.ssl-images-amazon.com/images/I/81kQgwUCBXL.pdf) so …

  1. CUPS is installed and the daemon is running, so … point Firefox at http://localhost:631/admin
  2. Click Add Printer and log in
  3. Choose Local Printers: Serial Port #1
  4. Choose settings: 115200 baud, no parity, 8-bit data, DTR/DSR hardware control (this was a guess, and turned out to be wrong)
  5. Name it (sipix)
  6. Choose the make and model — SiPix is in there! And this is the only model (can browse to the PPD from http://openprinting.org/driver/sipixa6.upp if it is not in the list of printers already)
  7. Do not change anything else. Leave input paper size as A4, to be scaled by the driver not the application
  8. Add printer
  9. Print test page
  10. Nope, printer just turns off
  11. Try RTS/CTS flow control
  12. Success!
The CUPS test page — looks fine (real size is on 1/4 of an A4 page).

The printer itself:

The printer -- image captured on the flatbed scanner

Can we print from the command line? Plain text:

$ lp -d sipix a-file-with-some-text.txt

Yep, fine, though it feeds some wasted paper through first. PDF file:

$ lp -d sipix newfile.pdf

Yep. All right.

Now, if I use a good-quality USB to serial dongle, I can print from a USB port. That means the SiPix is still a viable portable printer.

Now, what next…?

The llama not eaten (yet), by T1

anotherkatewilson

This is another of Twin 1’s poems for his English anthology assignment, again on the llama theme.  It was inspired by Henry Hogge’s The Pig Poets, one of my favourite books of poetry, and comes with apologies to Frost – although as Frost originally intended The Road Not Taken to be humorous, perhaps he wouldn’t mind too much:

Two llamas diverged in a yellow wood,
And sorry I could not eat them both
In a single sitting, long I stood
Then chased one llama as far as I could
To where it fell in the undergrowth;

Then caught the other, just as round,
And skinned and quartered it right there,
And built a fire upon the ground;
With sticks and twigs that I had found
And cooked it all (except the hair),

And later that morning, filled with llama
I sneaked quietly past the llama farmer.
Oh, I kept…


Monarch by Remington II: Where will it end?

Well …

Monarch by Remington, purchased for $11.65 (of which $1.65 was auctioneer’s commission) in late 2020. Things that were/are wrong with it include:

  • quite dirty
  • many sticky keys
  • paint a little blistered
  • no case.

Good points include sound basic condition, completeness, and low price. Normally, I would not bother with a copy of a machine I already have, especially when I am not overfond of my other Monarch. But I find it hard to pass up a $10 (+ commission) typewriter, so I put in the lowest possible bid and if I get it, great, and if not, then at least someone bought it. I hate to see a typewriter passed in!

Case

I built what I think is quite a nice little case for it, which sort of fixed one problem (I mean, not if you insist on original equipment, I guess). I had to add a second spring to help return the ‘l’ (el), which I just could not get to retract. And of course it needed a ribbon (but at least it had spools; if not original, at least ones with the big Remington holes in the middle…).

It is one of the Made in Holland ones (Ser no. AY 44 29 80; I’m guessing 1964…) and nothing flash at all. The line spacing is not completely consistent, which is probably my only gripe now, but the platen is nice and grippy, which makes up for a lot. On the whole, a useful, basic machine that one could carry around on the understanding that if something happened to it, that would be unfortunate but not heartbreaking.

Sample:

Sample text and character set for this typewriter

And gently down

Websites that look like this are scams

Little pop-up internet shops crop up here and there, and a lot of them are fake. This is just one example. You could go here:

https://www.fpalotinto.com/contact_us.htm

But here is a screenshot because this website will fold up soon and be replaced by others with pseudorandom names — but the pages will all look similar, and the contact page, which I have posted here, especially.

  • Key features — format of contact email address (name@somewords.com).
  • If you email, no-one will answer.
  • Domain name (fpalotinto) is not a word.
  • General look of the page is consistent from scam to scam, but fonts will change from scam site to scam site.
  • Offer free shipping (they won’t ship anything anyway).

You can look them up on https://www.whois.com/whois/fpalotinto.com, say, and get some ideas — for example, the domain name is registered through some cheap domain name supplier, and was registered less than a year ago. I mean, some businesses are new, but that is a hint. There’s other stuff in the output you could look up — like on Google Maps (3571 James St, Buffalo, NY 14219, USA). There’s no coherence between the registered details and the website. All names, emails etc are different. One problem is that websites that pretend to tell you what to trust (scamadviser, trustpilot) may not be trustworthy themselves.
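
The same lookup works from a terminal if the whois command is installed. A small sketch for pulling out just the registration date: creation_date is a name I made up, and it assumes the registry labels the field "Creation Date:", which many do but not all.

```shell
# Hypothetical helper: pull the registration date out of whois output.
# Reads whois text on standard input and prints the first creation date found.
creation_date() {
    grep -i -m 1 'creation date:' | sed 's/^[^:]*:[[:space:]]*//'
}

# Example use (needs the whois command and a network connection):
#   whois fpalotinto.com | creation_date
```

A date only a few months back does not prove anything on its own, but it is one more hint to add to the pile.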

It can be hard. You just have to look for hints, and you’ll soon get a sense not to spend money there.

Just a few words