Finding styled paragraphs in a Word document


When working with Word documents, you'd often like to be able to extract relevant content automatically thanks to some scripts. That's exactly what I wanted to do this afternoon. I have a Word document which contains some scripts written with a specific style, and I wanted to extract those snippets, so that I may be able to check that they compile or that the unit tests they represent be asserted successfully.

As usual, you need to choose the right tool for the job. And for me, the right tool was my lovely Scriptom module for Groovy. Scriptom allows you to fill in the gap between Groovy scripts and your native ActiveX/COM components (on Windows obvioulsy).

I've found an interesting article explaining how one can do to find Word styles efficiently. The code was in C#, so the small burden was to convert C# code into Groovy, but that wasn't very complicated. I also put together all adjacent styled text so that I may build entire scripts (and not one line each time). And I was done!

The little program takes two arguments: the file name, and the style you wish to find in that document. You use simple and double quotes to surround the style name. I've regrouped together all contiguous styles together. And what else... well, the other comments are in the source code! So let's see what the code looks like!

import org.codehaus.groovy.scriptom.ActiveXProxy

def style = ""
def fileName = ""

if (args.length != 2) {
    println "Usage: groovy extractStyle yourFile.doc yourStyle"
    System.exit(1)
} else {
    fileName = args[0]
    style = removeQuotes(args[1])
}

// creating the Word application
def word = new ActiveXProxy("Word.Application")
try {
    // opening your document
    word.Documents.Open(new File(fileName).canonicalPath)

    // getting the content of the document
    def doc = word.ActiveDocument
    def range = doc.Content
    // prepare the search of styled text
    range.Find.ClearFormatting()
    range.Find.Style = style

    def start, end = 0
    def text = ""
    def found = true
    // while there is styled text to find...
    while (found) {
        // execute the find command

        found = range.Find.Execute().value
        start = range.Start.value
        if (end != start && end != 0) {
            // seperate found occurencies with a separator
            println("=" * 72)
            println text
            text = ""
        }
        // get the text content of the styled text found
        // without the last newline
        text += removeNewLine(range.Text.value) + "\n"
        end = range.End.value
    }
} finally {
    // quitting the Word application
    // without saving the changes (wdDoNotSaveChanges = 0)
    word.Quit(0)
}

/** Remove potential simple or double quotes around the name of the style to find */
String removeQuotes(String s) {
    if (s.startsWith(/'/) && s.endsWith(/'/) || s.startsWith(/"/) && s.endsWith(/"/))
        return s.substring(1, s.length() - 1)
    else
        return s
}

/** Remove the last new line character which doesn't render well under DOS */
String removeNewLine(String s) {
    switch (s.length()) {
        case 0: return ""
        case 1: return ""
        default: return s.substring(0, s.length() - 1)
    }
}

 

 
© 2012 Guillaume Laforge | The views and opinions expressed here are mine and don't reflect the ones from my employer.