Wednesday, January 6, 2016

It's not You, it's Me


Clearly that has to be the problem. I mean millions of people use computers every day and have little to no problems...right?  I must be doing something wrong.

I'll tell you my pathetic story.

So I wanted to combine some content from an Excel spreadsheet into a Word document as comments. The first approach was to just extract the text and add the comments in a fancy "html" page. Did this, but tables and the like didn't export to plain text well. So I thought I could come up with something better.
[Image: raccoon accidentally dissolves cotton candy in water]

Next shot: let's try VBA and just extract some comments from a CSV to insert in a Word doc. (Seems like it should be straightforward: open the CSV, grab the data in columns 1 and 2, search the Word doc for the data in column 1, add the comment from column 2.) Nope. Couldn't figure out how on earth to extract column data from an Excel application within a Word macro (using VBA).
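For what it's worth, the "grab columns 1 and 2, then search" half of that plan is easy to sketch outside of VBA. This is a rough Python sketch, not what I actually ran: the function names are made up for illustration, the paragraphs are assumed to already be plain strings, and it stops short of the hard part (actually attaching the comment in Word).

```python
import csv
import io

def load_comment_map(csv_file):
    """Map column 1 (text to find) to column 2 (comment to attach)."""
    comments = {}
    for row in csv.reader(csv_file):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or short rows
        comments[row[0].strip()] = row[1].strip()
    return comments

def match_paragraphs(paragraphs, comments):
    """Yield (paragraph_index, search_text, comment) for every hit."""
    for index, text in enumerate(paragraphs):
        for needle, comment in comments.items():
            if needle in text:
                yield index, needle, comment

# Tiny in-memory example; a real run would open the CSV file instead.
sample_csv = io.StringIO("alpha,first note\nbeta,second note\n")
sample_map = load_comment_map(sample_csv)
sample_hits = list(match_paragraphs(["has alpha in it", "nothing", "beta here"], sample_map))
```

A plain substring match like this will happily hit the same text in several paragraphs, so a real version would need some rule for which hit gets the comment.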

Next, well, a Word file is a Zip file, so lemme try unzipping it and seeing how that works. Tried adding a single comment and saved. Clearly they're not going to make this easy for me. Comments are in a comments.xml file with some unique identifiers, and then in the document.xml file is the.... you guessed it, document, with the comment reference. This is doable, but painful, so I thought I'd avoid this for the time being. (Oh, by the way, if you haven't looked at the Office Open XML spec, yea, it's roughly 7k pages... AIN'T NOBODY GOT TIME FOR THAT)
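Just reading that structure is at least manageable. Here's a minimal sketch, assuming the usual part names (word/comments.xml and word/document.xml) and the standard WordprocessingML namespace; the function names are mine, and it only reads comments, it doesn't write them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def read_comments(docx):
    """Return {comment_id: comment_text} from word/comments.xml."""
    with zipfile.ZipFile(docx) as archive:
        if "word/comments.xml" not in archive.namelist():
            return {}  # a doc with no comments has no comments part
        root = ET.fromstring(archive.read("word/comments.xml"))
    result = {}
    for comment in root.findall("w:comment", NS):
        comment_id = comment.get("{%s}id" % W)
        # Comment text is spread over w:t runs; glue them back together.
        text = "".join(t.text or "" for t in comment.iter("{%s}t" % W))
        result[comment_id] = text
    return result

def read_comment_anchors(docx):
    """Return the comment ids referenced from word/document.xml, in order."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [ref.get("{%s}id" % W) for ref in root.iter("{%s}commentReference" % W)]
```

The ids returned by the second function are the link between the two files: each one should show up as a key in the dictionary from the first.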

Oh wait, I nearly forgot: I tried adding a comment, very blindly, and that failed. So I tried just unzipping the contents with Python, tossed the output into a folder, rezipped it, then renamed it to .docx. Yea, Word thinks the file is corrupted. No clue what the appropriate "re-zipping" is, but good luck figuring that out too.
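One likely culprit (a guess, I never confirmed it for my file) is zipping the folder itself rather than its contents, so every part ends up nested under an extra directory and [Content_Types].xml is no longer at the root of the archive. A sketch of rezipping with paths relative to the folder, where rezip_docx is just a name I made up:

```python
import os
import zipfile

def rezip_docx(folder, output_path):
    """Zip the contents of `folder` so entry names are relative to its root.

    Keep output_path outside of `folder`, or the archive will try to
    include itself.
    """
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(folder):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                # Drop the folder prefix, e.g. "word/document.xml",
                # and always use forward slashes inside the zip.
                arcname = os.path.relpath(full_path, folder).replace(os.sep, "/")
                archive.write(full_path, arcname)
    return output_path
```

If Word still complains after this, the damage is probably in the XML itself and not in the zipping.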

Next I tried out Mammoth, something that attempts to convert a docx to a passable HTML file using Python.... yea, tried printing the resulting "HTML" and got Unicode errors. Tried a few Google searches and approaches, no luck. Finally decided on looping through all the characters, doing a try/catch around printing each one to the command line (stderr), storing the "successful" prints in an array for later printing, and otherwise just passing in the exception handler. Only problem: with a large doc, it takes FOREVER to print everything to the command line (due to delays from printing to the command line, in this case not file IO). Oh wait..... Mammoth died, about 2 hours later..... yay for me.
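In hindsight, the character-by-character try/catch can be swapped for one encode call that replaces whatever the console can't show, or skipped entirely by writing the HTML to a UTF-8 file. A small sketch of both, assuming the converted HTML is already sitting in a string (the helper names are hypothetical, and this is a different technique from the loop I actually used):

```python
import io

def printable(text, encoding="ascii"):
    """Replace characters the target encoding can't represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)

def write_html(html, path):
    """Write the HTML straight to a UTF-8 file, bypassing the console."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
```

Something like print(printable(html)) does the whole string in one go, so there's no per-character printing to wait on.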

It seems like anytime I need to fight with a Word doc to extract data, I can't seem to find the right incantation in Google to solve my problem. Case in point again: find the document line number of a comment. Not the section offset, but the ABSOLUTE LINE NUMBER IN THE DOCUMENT. Good luck. I dare you. I flipping dare you to try. If it takes you less than 8 hours before you give up, you haven't tried hard enough. If you've been going on more than 8 hours..... good luck, but you'll likely miss out on the rest of your life if you keep going.

You see, you can't just go through a Word document line by line, you have to go section by section. But then lines aren't really lines in there, they're paragraphs, so how can you ever figure out what line in a paragraph something is on? You can't. You have to somehow figure out how many characters are on a line and possibly do some kind of mod operation ON EVERY PARAGRAPH to get EACH PARAGRAPH's line count, and then FINALLY you can sum them up.... but oh wait, you have to do that for all preceding sections. I won't get into the pain too much more. Suffice to say, I finally finished with something that just gives you a section number and the line of the paragraph on that PAGE. No flipping clue how you can even figure that out, but it was the first solution I found and I finally had to give up and use that.
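To make the arithmetic concrete, here's a crude sketch of the "characters per line" idea. It is not what I ended up using, and it leans on a big assumption: every line holds a fixed number of characters (chars_per_line, a number I made up), which ignores fonts, word wrapping, tables and page breaks. It also turns out to be integer division and a ceiling rather than a mod.

```python
import math

def paragraph_lines(text, chars_per_line):
    """Estimated number of lines a paragraph occupies (at least 1)."""
    return max(1, math.ceil(len(text) / chars_per_line))

def estimate_line_number(paragraphs, para_index, char_offset, chars_per_line=80):
    """Estimate the 1-based absolute line of a character position.

    paragraphs   -- every paragraph in the document, in order, as strings
    para_index   -- index of the paragraph holding the comment
    char_offset  -- 0-based offset of the comment inside that paragraph
    """
    lines_before = sum(
        paragraph_lines(p, chars_per_line) for p in paragraphs[:para_index]
    )
    return lines_before + char_offset // chars_per_line + 1
```

With real proportional fonts this will drift further off the longer the document gets, so treat the result as a ballpark and nothing more.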

I probably could come up with a dozen Word horror stories if I was really pressed to it, but I'm fairly certain my PTSD would attack and I'd just blog.

Did I mention that 2015 just wasn't my year for computers? I gotta be honest, 2016 isn't looking any better.