I'm on board the "shit on Microsoft for hard-to-use formats" train, but as someone who did a LOT of .docx parsing, it turned into zen when I realized that I can just convert my docs into easily parsed HTML5 using something like pandoc.
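For anyone who wants to try it, here is a rough sketch of that workflow in Python. It assumes pandoc is installed and on your PATH; the helper names (`pandoc_command`, `extract_paragraphs`, `docx_to_paragraphs`) are made up for illustration, and pulling out `<p>` text is just one example of what "easily parsed" can look like.

```python
import shutil
import subprocess
from html.parser import HTMLParser


def pandoc_command(docx_path, html_path):
    # -f docx and -t html5 name pandoc's reader and writer; -o sets the output file.
    return ["pandoc", "-f", "docx", "-t", "html5", "-o", html_path, docx_path]


class ParagraphCollector(HTMLParser):
    """Collects the text content of each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <em> is still appended to the open paragraph.
        if self._in_p:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def docx_to_paragraphs(docx_path, html_path):
    # Assumption: pandoc is available locally; fail with a clear message if not.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(docx_path, html_path), check=True)
    with open(html_path, encoding="utf-8") as fh:
        return extract_paragraphs(fh.read())
```

Calling `docx_to_paragraphs("report.docx", "report.html")` (hypothetical file names) would run pandoc and hand back the paragraph text, with no .docx internals involved.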
This is a good blog post and Xan is a really neat terminal tool.