Note to self.
When you have to do something 1 or 2 times, a GUI is fine. When you have to repeat it 100 times, a GUI is less fine.
I have a very specific use-case. I want to compare many pairs of Word files. I only want to compare the text, and the changes will be very small. I don’t want to run Word’s Compare Documents tool by hand on all hundred pairs of files.
First, I converted them all to text, using a command like this for each one:
soffice --convert-to "txt:Text (encoded):UTF8" "filename(17Jun20).docx"
I could not use
soffice --convert-to "txt:Text (encoded):UTF8" *.docx
because the filenames have special characters in them. But a script can be made.
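A minimal sketch of such a script, assuming the .docx files sit in one directory. It calls soffice once per file with the name quoted, so spaces and parentheses should pass through untouched. The function name convert_all and the CONVERTER variable are made up for this sketch; CONVERTER exists only so the loop can be dry-run with echo.

```shell
#!/bin/sh
# convert_all: convert each .docx given as an argument to UTF-8 text,
# one converter call per file.
# CONVERTER is a hypothetical override (defaults to soffice).
convert_all() {
    for f in "$@"; do
        [ -e "$f" ] || continue    # skip names that do not exist
        "${CONVERTER:-soffice}" --convert-to "txt:Text (encoded):UTF8" "$f"
    done
}
```

Run it as convert_all ./*.docx, or as CONVERTER=echo convert_all ./*.docx to preview the commands without converting anything.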
The trick is that one file in each pair was back-converted from PDF, so hard returns and soft returns are scattered all over the place. The texts look quite similar to the eye, but the line breaking and other details can be quite different. diff works line by line, and I don’t want it to produce lots of output just because one doc has long wrapped lines and the other has a return at the end of each visible line. I decided to turn each text file into a single column of words.
In this script, $1 is a five-digit number given on the command line of the script. It identifies the file. Some of the character codes might render funny in the blog.
# Replace all white space with newline
sed -E -e 's/[[:blank:]]+/\n/g' input-$1-file.txt > $1_f1.txt
# Replace all ^M with newline
sed -i "s/^M/\n/g" $1_f1.txt
# various hyphens; breaking, nonbreaking, en rules, etc; I am not looking for them, so
# Replace all - with newline, etc
sed -i "s/-/\n/g" $1_f1.txt
sed -i "s/â?"/\n/g" $1_f1.txt
sed -i "s/â?"/\n/g" $1_f1.txt
sed -i "s/A-/\n/g" $1_f1.txt
sed -i "s/â?`/\n/g" $1_f1.txt
sed -i "s/â?O/\n/g" $1_f1.txt
sed -i "s/â??/\n/g" $1_f1.txt
# Remove all empty lines
sed -i '/^\s*$/d' $1_f1.txt
Odd characters can be put into sed or tr commands in several ways. Characters that render OK (modern terminal emulators can cope with en dashes, for example) can be middle-button pasted into the script. Others (things that look like <200b> in Vim, for example) are entered in Vim with Ctrl+V, then u, then the code shown in the angle brackets (Ctrl+V u200b in this case). ^M is inserted by holding down Ctrl, hitting V then M, then releasing Ctrl.
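Another option, sketched here rather than taken from my script, is to write the characters as UTF-8 byte escapes so the script file stays plain ASCII and nothing can be mangled by an editor or a blog. The variable names, the function name split_on_odd_chars and the particular list of characters are assumptions for illustration. The \n in the replacement relies on GNU sed, the same as the sed commands above.

```shell
#!/bin/sh
# UTF-8 byte sequences written as octal escapes for printf.
# Which characters to include is an assumption; tune it to the project.
CR=$(printf '\r')                     # carriage return, shown as ^M in Vim
EN_DASH=$(printf '\342\200\223')      # U+2013
EM_DASH=$(printf '\342\200\224')      # U+2014
NB_HYPHEN=$(printf '\342\200\221')    # U+2011 non-breaking hyphen
SOFT_HYPHEN=$(printf '\302\255')      # U+00AD
ZWSP=$(printf '\342\200\213')         # U+200B, shown as <200b> in Vim

# split_on_odd_chars: read stdin, replace each character above with a newline.
split_on_odd_chars() {
    sed -E -e "s/($CR|$EN_DASH|$EM_DASH|$NB_HYPHEN|$SOFT_HYPHEN|$ZWSP)/\n/g"
}
```

It reads standard input, so it can be used as split_on_odd_chars < in.txt > out.txt.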
The script turns the file into a single long column, one word/character group per line. The various sed commands just put in line breaks in place of various characters that I want removed from the comparison. This must be tuned to the project in question. I then remove all the empty lines.
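To see why the word column helps, here is a small self-contained sketch. It uses tr instead of the sed commands above, which is a simpler stand-in for the same idea, and the sample sentences are made up. The function name to_words is hypothetical.

```shell
#!/bin/sh
# to_words: read text on stdin, write one word per line.
# tr -s turns every whitespace character (space, tab, CR, LF) into a newline
# and squeezes runs of them into one; sed drops any remaining empty line.
to_words() {
    tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Same words, different line wrapping and line endings.
a=$(printf 'the quick brown fox\njumps over\n' | to_words)
b=$(printf 'the quick\r\nbrown fox jumps\r\nover\r\n' | to_words)

# Once both are word columns, the wrapping no longer matters.
[ "$a" = "$b" ] && echo "same words"
```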
I then do the same to the other file in the pair and run diff.
diff $1_?1.txt > $1_diff.txt
paste -d '\t' $1_?1.txt > $1_paste.txt
wc $1_diff.txt
The paste command puts the two columns side by side in a single file, and wc counts the diff output. If wc reports zero, the files are identical. If not, I look at the diff and paste files and isolate the differences. Of the 100-odd files, I was able to eliminate more than half immediately.
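The same check can be run over many pairs in one go. This is a rough sketch, assuming the word-column files already exist and that exactly two files match ID_?1.txt for each ID, as with the diff command above. The function name report_pairs is made up.

```shell
#!/bin/sh
# report_pairs: for each ID given, diff the two word-column files and
# say whether the pair is identical.
report_pairs() {
    for id in "$@"; do
        # diff exits 0 when the files are identical, non-zero otherwise
        if diff "$id"_?1.txt > "${id}_diff.txt"; then
            echo "$id identical"
        else
            # lines starting with < or > are the words that differ
            echo "$id differs ($(grep -c '^[<>]' "${id}_diff.txt") changed words)"
        fi
    done
}
```

Called as report_pairs 10001 10002, it prints one line per ID (the numbers here are only examples), so the identical pairs can be crossed off without opening any files.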