Converting a Word document to HTML
If you're on Windows and need to do some shell scripting with ActiveX/COM components, Scriptom will certainly help you. Today, my boss asked me whether we could use Scriptom to convert a Word document into its HTML equivalent, so I decided to see if that was possible. To my delight, my little Scriptom module, backed by Jacob, helped me solve this integration problem with only 6 lines of Groovy code!
import org.codehaus.groovy.scriptom.ActiveXProxy
import java.io.File
word = new ActiveXProxy("Word.Application")
word.Documents.Open(new File(args[0]).canonicalPath)
word.ActiveDocument.SaveAs(new File(args[1]).canonicalPath, 8)
word.Quit()
Now, I just need to launch:
groovy word2html.groovy specification.doc specification.html
And I've got a nice Word to HTML converter! Well... I know, not that nice. First of all, it's a Windows-only solution, though that fits my requirements for the platform I'm running on. The other negative aspect is that the generated HTML is really, really ugly. I really wonder why Microsoft can't produce cleaner output. For the moment, I'm happy with that solution.
You probably noticed the magic number 8. It's the HTML format option. The available formats are:
- 0: wdFormatDocument (no conversion)
- 1: wdFormatTemplate
- 2: wdFormatText
- 3: wdFormatTextLineBreaks
- 4: wdFormatDOSText
- 5: wdFormatDOSTextLineBreaks
- 6: wdFormatRTF
- 7: wdFormatUnicodeText
- 8: wdFormatHTML
I haven't yet figured out how to use these constants directly in Groovy. I'll have to make Scriptom grok M$'s constants.
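Until Scriptom can expose those constants, a plain Groovy map could stand in for them. The sketch below is only an illustration: the wdSaveFormat name is my own invention, not part of Scriptom or of Word's API, and the values are simply the ones listed above.

```groovy
// Hypothetical lookup table (not part of Scriptom): maps the names of
// Word's save-format constants to the numeric values listed above.
def wdSaveFormat = [
    wdFormatDocument         : 0,
    wdFormatTemplate         : 1,
    wdFormatText             : 2,
    wdFormatTextLineBreaks   : 3,
    wdFormatDOSText          : 4,
    wdFormatDOSTextLineBreaks: 5,
    wdFormatRTF              : 6,
    wdFormatUnicodeText      : 7,
    wdFormatHTML             : 8
]

// Look up the HTML format code by name instead of hard-coding the number.
def htmlFormat = wdSaveFormat.wdFormatHTML
println "HTML format code: ${htmlFormat}"
```

With such a map in the script, the second argument of SaveAs could be written as wdSaveFormat.wdFormatHTML rather than a bare 8, which at least makes the intent readable.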
The example above was tested with groovy-beta-9, Word 2000 and my additional Scriptom module for Groovy (don't forget to install it if you want to try the sample).