
Using Java to Crack Office 2007

Posted by Ted Neward on Jun 04, 2007

In the last installment, "Office Rich Client Applications", we discussed using the Office 2007 platform as a baseline from which to build rich client applications that interoperate with Java technologies in a variety of ways. One area of Office/Java interoperability that wasn't covered, however, is probably the oldest means by which Office and Java work together: Java applications manipulating Office documents by creating them, editing them, gathering data out of them, and so on.

Historically, this has always been something of a problem because Office documents (principally Word, Excel, and PowerPoint) were stored in a binary format known to COM developers worldwide as the structured storage format, an intrinsically hierarchical binary format accessed using COM interfaces. While this was a genuinely useful approach for COM developers (and developers working in COM-aware languages, like Visual Basic, Delphi and C++/ATL), the resulting file contents were inaccessible to any language that didn't "speak COM". Workarounds to give Java access to the contents of these files varied wildly from one application to another. For example, it was well known that Excel could read comma-separated value (CSV) files, so Java applications that wanted to export data in an Excel-friendly format would export it as CSV (an ugly format if ever there was one). Word, of course, could also read Rich Text Format (RTF) files, and the RTF specification was open and somewhat well-documented. Later versions of Office, namely Office 2003, introduced new XML formats of their own (such as WordML), which Java developers could use to read or write Office documents, but the formats were not well-documented, and Java developers frequently found themselves learning the WordML format through trial and error. Various open-source projects stepped in to try to mitigate the situation, such as the POI framework from Apache, for reading and writing Excel documents. Various Java-COM solutions, meanwhile, suggested that a Java developer could read and/or write Office files using the same structured storage APIs that Office itself used, but that was hardly sufficient, since the developer then had to figure out the internal structure of the Office document format, itself fairly complex and, of course, completely undocumented.

All in all, the Java/Office story was, to put it mildly, a pretty ugly situation. Java developers either put up with it, consoled themselves in a manner highly reminiscent of one of Aesop's Fables by saying that "Office sucks, anyway, why would anyone want to use it?", or simply told their Office-using customers that Java couldn't understand Office because of the Microsoft/Sun lawsuit.

With Office 2007, Microsoft has, without a doubt, made a significant part of these problems "go away". With nothing more complicated than the native JDK itself (in other words, no third-party libraries are necessary), a Java application can now read and write any Office 2007 document, because Office 2007 documents are now nothing more than ZIP files of XML documents. Known as the "OpenXML" specification and submitted to the European Computer Manufacturers Association (ECMA), the same home under which the C# language and CLI runtime specifications live, the OpenXML specs are available for any and all to download for free from ECMA. Armed with these, an installation of Office 2007 (for verification and some testing), and a standard Java 6 JDK installation, Java can now crack open an Office 2007 document, scoop out the juicy middle, manipulate the contents, and re-save the data.

In this article, unlike previous articles, rather than building a simple application, the code will instead use a technique first pioneered by Stuart Halloway, called exploration testing. In an exploration test, developers write unit tests that let them explore the API, verifying the results using the traditional test-assertion semantics of the unit test world. The side benefit of exploration testing is that when a new version of the API becomes available (in this case, a new version of Office), the tests can be run against that new version to verify that nothing has changed in the API's usage model.

For starters, let's just have a quick look at an Office 2007 document. Given a simple Word 2007 document that contains just some plain text, like so:


When saved, Word 2007 saves this as "Hello.docx" unless told to save in a "backwards-compatible" format, namely either WordML from Office 2003, or the older binary structured storage format from Word97. The ".docx" format is the OpenXML format, and, according to Microsoft's own documentation, is a ZIP file of XML documents containing the data and formatting of the document in a fashion analogous to the way the binary structured storage APIs stored data in previous versions of Office. If this is the case, then the Java "jar" utility, which understands the ZIP format, should be able to display the contents, and sure enough, it does:


The basic structure of the Word 2007 document format is already fairly clear, just from looking at the resulting output. (And the fact that the "jar" utility understands this is, in and of itself, exciting, as it means that the java.util.jar and/or java.util.zip packages will also be able to access the contents pretty easily.) Without even cracking any of the specification documents open, it's a fair guess that the core document content will be stored in "document.xml", and the remainder of the XML files will be various supplementary parts, such as the fonts used in the document (fontTable.xml), the Office theme used (theme/theme1.xml), and so on.

'Tis time to write some exploration tests. (Interested readers are encouraged to fire up the text editor or IDE, and add these to a JUnit 4 test class, and extend the tests as their imagination takes them.) Using JUnit 4, the first test will simply ensure the file is present in the expected location (an obvious requirement if the remainder of the tests are going to work):

@Test public void verifyFileIsThere() {
    assertTrue(new File("hello.docx").exists());
    assertTrue(new File("hello.docx").canRead());
    assertTrue(new File("hello.docx").canWrite());
}

The next test simply verifies that we can open the file using the Java library's most obvious candidate, java.util.zip.ZipFile:

@Test public void openFile()
    throws IOException, ZipException
{
    ZipFile docxFile =
        new ZipFile(new File("hello.docx"));
    // JUnit's assertEquals takes the expected value first
    assertEquals("hello.docx", docxFile.getName());
    docxFile.close();
}

So far, so good. Java's ZipFile recognizes that the file is, in fact, a zip file, and, if luck holds, will let us iterate through the contents and discover the data inside. Let's write a quick test that iterates through the contents, looking for the "document.xml" file:

@Test public void listContents()
    throws IOException, ZipException
{
    boolean documentFound = false;

    ZipFile docxFile =
        new ZipFile(new File("hello.docx"));
    // Typed Enumeration, so nextElement() needs no cast
    Enumeration<? extends ZipEntry> entriesIter =
        docxFile.entries();
    while (entriesIter.hasMoreElements())
    {
        ZipEntry entry = entriesIter.nextElement();

        if (entry.getName().equals("document.xml"))
            documentFound = true;
    }
    docxFile.close();
    assertTrue(documentFound);
}

Strangely enough, though, this test fails when run; "document.xml" doesn't seem to be found. This is because the ZipFile/ZipEntry API requires the complete directory/file name inside the archive to match. Change the test above to read "word/document.xml", and it passes.
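The full-path behaviour can be checked without Word at all. The sketch below is an illustration rather than part of the article's test suite: it assumes Java 7 or later (for try-with-resources), and the class name ZipEntryNames and its two helper methods are hypothetical. It builds a tiny in-memory archive as a stand-in for a .docx file and shows that only the complete path matches an entry name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
final class ZipEntryNames {

    /** Builds a small in-memory archive containing one entry per name. */
    static byte[] buildArchive(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    /** Returns the entry names exactly as the archive stores them. */
    static List<String> entryNames(byte[] archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        List<String> names =
            entryNames(buildArchive("[Content_Types].xml", "word/document.xml"));
        System.out.println(names.contains("document.xml"));      // prints false
        System.out.println(names.contains("word/document.xml")); // prints true
    }
}
```

Entry names are stored as plain strings that include the folder prefix, which is why the bare file name never matches.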

Finding the document is fine; next, let's crack the thing open and look at the XML inside. Doing this is fairly simple, as ZipFile has an API to return the ZipEntry itself by name:

@Test public void getDocument()
    throws IOException, ZipException
{
    ZipFile docxFile =
        new ZipFile(new File("hello.docx"));
    ZipEntry documentXML =
        docxFile.getEntry("word/document.xml");
    assertNotNull(documentXML);
}

ZipFile can also return the contents of a contained entry, via the getInputStream() call that returns, not surprisingly, an InputStream. Feeding the InputStream into a DocumentBuilder (obtained from a DocumentBuilderFactory) will create a DOM of the document itself:

@Test public void fromDocumentIntoDOM()
    throws IOException, ZipException, SAXException,
        ParserConfigurationException
{
    ZipFile docxFile =
        new ZipFile(new File("hello.docx"));
    ZipEntry documentXML =
        docxFile.getEntry("word/document.xml");
    InputStream documentXMLIS =
        docxFile.getInputStream(documentXML);
    DocumentBuilderFactory dbf =
        DocumentBuilderFactory.newInstance();
    Document doc =
        dbf.newDocumentBuilder().parse(documentXMLIS);

    // Compare the tag name rather than toString(), whose
    // output depends on the DOM implementation in use
    assertEquals("w:document",
        doc.getDocumentElement().getTagName());
    docxFile.close();
}

In fact, the document.xml contents (below, with the namespace declarations removed for clarity) look pretty tame compared to some other XML document formats that support the wide-ranging kinds of formatting that Word needs to support:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document ...>
  <w:body>
    <w:p w:rsidR="00DE36E5" w:rsidRDefault="00DE36E5">
      <w:r>
        <w:t>Hello, from Office 2007!</w:t>
      </w:r>
    </w:p>
    <w:sectPr w:rsidR="00DE36E5">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

The full details of what each of these elements means are beyond the scope of this article, and readers are referred to the OpenXML documentation set for the complete reference, but it's pretty clear that the core piece of the document is the "document" element, whose "body" contains "p" (paragraph) elements, which contain "r" (run) elements, in turn made up of "t" (text) elements, in which is found the body of the hello.docx document itself: in this case, the single sentence "Hello, from Office 2007!".
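To make the p/r/t nesting concrete, here is a minimal sketch that pulls the text out of each paragraph. It is an illustration under stated assumptions, not the article's own code: the class DocxText and its method are hypothetical names, the parser is switched to namespace-aware mode (the article's tests use the default, non-namespace-aware mode), and the namespace URI is the one Word 2007 binds to the "w" prefix. Paragraphs nested inside other paragraphs' content (text boxes, for example) are not handled specially.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper class, for illustration only.
final class DocxText {

    /** Namespace URI bound to the "w" prefix in WordprocessingML. */
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the w:t text inside each w:p, one paragraph per line. */
    static String paragraphText(byte[] documentXml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // match on namespace URI, not on prefix
        Document doc = dbf.newDocumentBuilder()
                          .parse(new ByteArrayInputStream(documentXml));

        StringBuilder out = new StringBuilder();
        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            for (int j = 0; j < texts.getLength(); j++) {
                out.append(texts.item(j).getTextContent());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p><w:r>"
            + "<w:t>Hello, from Office 2007!</w:t>"
            + "</w:r></w:p></w:body></w:document>";
        // prints the single sentence followed by a newline
        System.out.print(paragraphText(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Matching on the namespace URI rather than the literal "w:" prefix keeps the lookup working even if a producer chooses a different prefix.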

Having read the file's contents, now it'd be nice to be able to modify those contents, write the file back, and open it in Word 2007. A quick glance at the ZipFile and ZipEntry API reveals a problem, however: while these classes can be used to read a zip file, they have no facility for writing one.

A variety of mechanisms are available to redress this particular lack. One approach would be to simply write out the XML text to a String, store the String into the document.xml file, and re-zip the entire contents using the ZipOutputStream class. Another approach could be to use a third-party tool (or build one) that can edit the zip contents in place, but this is outside the core JDK itself, so for this article, the ZipOutputStream approach will be the one taken.
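As a rough sketch of that re-zip approach in isolation, the helper below copies an archive entry by entry and substitutes new bytes for one named entry. It is not the article's code: it swaps in ZipInputStream for streaming reads instead of ZipFile, assumes Java 7 or later, and the names ZipRewrite, copyReplacing and readEntry are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
final class ZipRewrite {

    /** Copies every entry to target, writing newContent for replaceName. */
    static void copyReplacing(InputStream source, OutputStream target,
                              String replaceName, byte[] newContent)
            throws IOException {
        try (ZipInputStream in = new ZipInputStream(source);
             ZipOutputStream out = new ZipOutputStream(target)) {
            byte[] buffer = new byte[16 * 1024];
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                out.putNextEntry(new ZipEntry(e.getName()));
                if (e.getName().equals(replaceName)) {
                    out.write(newContent);
                } else {
                    int n; // read until end of entry, not just one buffer
                    while ((n = in.read(buffer)) != -1) out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
        }
    }

    /** Returns the bytes of the named entry, or null if it is absent. */
    static byte[] readEntry(byte[] archive, String name) throws IOException {
        try (ZipInputStream in =
                 new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = in.getNextEntry(); e != null; e = in.getNextEntry()) {
                if (!e.getName().equals(name)) continue;
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = in.read(buffer)) != -1) out.write(buffer, 0, n);
                return out.toByteArray();
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Build a one-entry archive in memory as a stand-in for a .docx
        ByteArrayOutputStream original = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(original)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<old/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        ByteArrayOutputStream rewritten = new ByteArrayOutputStream();
        copyReplacing(new ByteArrayInputStream(original.toByteArray()), rewritten,
                      "word/document.xml", "<new/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(
            readEntry(rewritten.toByteArray(), "word/document.xml"),
            StandardCharsets.UTF_8)); // prints <new/>
    }
}
```

Working on streams rather than file names keeps the helper usable both with files on disk and with in-memory archives in a test.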

A couple of things have to happen in order for all this to take place. First, the Java application must navigate through the DOM hierarchy, find the "t" node, and replace its text content with the message to be sent back to the Word document. ("Hello, Office 2007, from Java6!" seems somehow to be appropriate.) The resulting DOM instance must then be serialized back to bytes, a task that's not easy to do using the Java XML APIs. (In a nutshell, the developer must create a Transformer from the javax.xml.transform package, and do an XML identity transform to a StreamResult wrapped around a ByteArrayOutputStream.)
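The identity-transform step can be sketched on its own as follows. This is a minimal illustration using only JDK classes; the class name DomBytes and its two methods are hypothetical, and the exact XML declaration the transformer emits may vary between implementations.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class DomBytes {

    /** Parses an XML string into a DOM Document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                                     .newDocumentBuilder()
                                     .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM Document to bytes with an identity transform. */
    static byte[] toBytes(Document doc) throws TransformerException {
        // A Transformer created without a stylesheet copies input to output
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(bytes));
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<greeting>Hello</greeting>");
        doc.getDocumentElement().setTextContent("Hello, from Java!");
        System.out.println(new String(toBytes(doc), "UTF-8"));
    }
}
```

The byte array is convenient here because it can be written directly into a ZIP entry without touching the file system.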

Once that's all done, the code must then create a new ZIP file, this time using the ZipOutputStream; but because only the contents of the file need to be changed, not the styles, fonts, or formatting, the other components of the original file need to be copied over from the source. A simple loop will suffice: it iterates through each ZipEntry in the source file, creates a ZipEntry of the same name in the new file, and writes the original contents to it (except for "word/document.xml", in which case the contents are the modified byte array). When all is said and done, the code looks like the following:

@Test public void modifyDocumentAndSave()
    throws IOException, ZipException, SAXException,
        ParserConfigurationException,
        TransformerException,
        TransformerConfigurationException
{
    ZipFile docxFile =
        new ZipFile(new File("hello.docx"));
    ZipEntry documentXML =
        docxFile.getEntry("word/document.xml");
    InputStream documentXMLIS =
        docxFile.getInputStream(documentXML);
    DocumentBuilderFactory dbf =
        DocumentBuilderFactory.newInstance();
    Document doc =
        dbf.newDocumentBuilder().parse(documentXMLIS);

    // Walk document -> body -> p -> r -> t, asserting at each step
    Element docElement = doc.getDocumentElement();
    assertEquals("w:document", docElement.getTagName());

    Element bodyElement = (Element)
        docElement.getElementsByTagName("w:body").item(0);
    assertEquals("w:body", bodyElement.getTagName());

    Element pElement = (Element)
        bodyElement.getElementsByTagName("w:p").item(0);
    assertEquals("w:p", pElement.getTagName());

    Element rElement = (Element)
        pElement.getElementsByTagName("w:r").item(0);
    assertEquals("w:r", rElement.getTagName());

    Element tElement = (Element)
        rElement.getElementsByTagName("w:t").item(0);
    assertEquals("w:t", tElement.getTagName());

    assertEquals("Hello, from Office 2007!",
        tElement.getTextContent());

    tElement.setTextContent(
        "Hello, Office 2007, from Java6!");

    // Serialize the modified DOM with an identity transform
    Transformer t =
        TransformerFactory.newInstance().newTransformer();
    ByteArrayOutputStream baos =
        new ByteArrayOutputStream();
    t.transform(new DOMSource(doc),
        new StreamResult(baos));

    // Copy every entry into the new archive, swapping in
    // the modified word/document.xml
    ZipOutputStream docxOutFile = new ZipOutputStream(
        new FileOutputStream("response.docx"));
    Enumeration<? extends ZipEntry> entriesIter =
        docxFile.entries();
    while (entriesIter.hasMoreElements())
    {
        ZipEntry entry = entriesIter.nextElement();

        if (entry.getName().equals("word/document.xml"))
        {
            byte[] data = baos.toByteArray();
            docxOutFile.putNextEntry(
                new ZipEntry(entry.getName()));
            docxOutFile.write(data, 0, data.length);
            docxOutFile.closeEntry();
        }
        else
        {
            InputStream incoming =
                docxFile.getInputStream(entry);
            byte[] data = new byte[1024 * 16];
            docxOutFile.putNextEntry(
                new ZipEntry(entry.getName()));
            // Loop until end of stream: a single read() may
            // return only part of the entry, or -1 if it is empty
            int readCount;
            while ((readCount =
                incoming.read(data, 0, data.length)) != -1)
            {
                docxOutFile.write(data, 0, readCount);
            }
            incoming.close();
            docxOutFile.closeEntry();
        }
    }
    docxOutFile.close();
    docxFile.close();
}

My apologies for the amount of code displayed there, but in truth, this is one of Java's weakest areas compared to other languages or libraries. Fortunately, the effort pays off; the resulting document looks like so:


Obviously a number of things could be done to improve this scenario.

First, a better XML-manipulation library, one that supports XPath out of the box and natively serializing XML DOM structures back to disk, would be a great help in reducing the amount of code here. JDOM, the open-source Java/XML library (available at jdom.org) is an obvious choice. So would XMLBeans from Apache. A corollary to that would be to obtain the schema documents that describe the OpenXML format, and use those to generate a set of Java classes that more closely mirror the OpenXML document format. Then, developers could work with native Java classes, rather than "Document" and "Element" classes.

Second, either of those approaches could then be combined into a more Office-specific API, one that elevates the abstraction layer of working with Word (or Excel, or PowerPoint) documents away from the actual storage of XML and instead focuses on the fact that these are documents that have paragraphs, fonts, and so on. In essence, libraries like POI can and should be updated to reflect the new Office XML format, and ideally will be written to support both the legacy binary structured storage format and the new OpenXML format.

Third, Java could do with a small overhaul to its ZIP file support, though again this could be done through a third-party library.
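On the first point, it may be worth noting that the JDK has shipped the javax.xml.xpath package since Java 5, which can already shorten the element-by-element navigation somewhat without any third-party library. The following is a minimal sketch, not a replacement for the libraries suggested above: the class name XPathLookup and its method are hypothetical, and the expression uses local-name() as a shortcut to avoid registering a NamespaceContext for the "w" prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class XPathLookup {

    /** Returns the text of the first p/r/t element found in the XML. */
    static String firstText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // so local-name() sees "t", not "w:t"
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate(String, Object) yields the string value of the first match
        return xpath.evaluate(
            "//*[local-name()='p']/*[local-name()='r']/*[local-name()='t']",
            doc);
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<w:document xmlns:w=\"http://schemas.openxmlformats.org/"
            + "wordprocessingml/2006/main\"><w:body><w:p><w:r>"
            + "<w:t>Hello, from Office 2007!</w:t>"
            + "</w:r></w:p></w:body></w:document>";
        System.out.println(firstText(xml)); // prints Hello, from Office 2007!
    }
}
```

A single expression replaces the four getElementsByTagName() calls, at the cost of losing the step-by-step assertions that the exploration test relies on.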

Despite all the cumbersome API calls, however, it's exciting and inspiring to consider how open the Office platform is to the Java programmer. Between the interoperability of using Java within an Office application, using Office within a Java application, and being able to read and write Office file formats from Java, the Office platform is more open to the Java programming community than ever before.

The sample code accompanying this article can be downloaded HERE.


Nice!! by karan malhi

This was a really nice article. I also wish java had support for modifying an existing zip file, instead of reading the whole zip, adding an entry, writing the whole thing into a new zip with the new entry.

Re: Nice!! by Dariusz Borowski

Yeah, nice but a little bit too late. Speaking of an 'OpenDocument' format reminds me of UNO. UNO is an API that lets Java work with documents. Instead of zipping/unzipping the document, it gives you the chance to work with OBJECTS. Objects are tables, paragraphs, pictures, lines, words or even letters. It can manipulate, save, delete or do whatever somebody wants with the document, and at the end you can convert it into a different format.

It's not necessary to unzip the document to retrieve the XML file and then iterate through it. There is a much better way to do it, and you get rid of reading XML tags by specific iteration or enumeration.

Check this out: api.openoffice.org/docs/DevelopersGuide/FirstSt...

Cheers,
Dariusz

Accessing Word from Groovy is really easy by Peter Kelley

Yesterday I wrote some code to access a Word document from Groovy using the Scriptom extension that uses the JACOB Java-COM bridge. Quite frankly it beats the hell out of writing XML manipulation code in Java. 36 Lines to get the text out of all comments in a Word document including constant declarations and code comments.

Naturally Java can parse XML, but... by Colm Smyth

Clearly Java's ability to parse valid XML documents is not in question!

The issue with "cracking Office 2007" is not the fact that it doesn't have an XML format, it is the fact that its format is a) a successor to an existing standard format, b) incomplete (missing required information), and c) not optimised for comprehension by third-party applications. The Wikipedia article includes a partial list of criticisms.

BTW, in that section it's hardly surprising that governments quoted from existing sources; why rewrite a criticism if it is valid?

And on that basis, I do not rewrite them here but simply refer you to the link above, and to the Comparison of OpenDocument and Office Open XML formats.

Re: Naturally Java can parse XML, but... by Stefan Wenig

Colm, I guess we are all aware of the criticisms. It's just getting boring to be lectured about it whenever someone speaks about OOXML. For a lot of users, the single OOXML advantage of 100% compatibility with the old binary format is just so much more important than anything on the downside (a large part of which is only affecting developers of competing office suites). Can we please just get on with it and not get all political each and every time?

Re: Naturally Java can parse XML, but... by Doug Mahugh

Yes, the lectures have become quite boring and repetitive. There is a lot of opportunity for developers around the new formats simply because they're the defaults in Office, regardless of anyone's feelings on the politics of standards.

One thing worth noting is that all of the parts shown above aren't required, because Word writes out many optional parts every time. You just need the start part (document.xml), content types, and a relationships part to have a minimal hello-world document, and the start part only needs to have the body/p/r/t child elements, as shown in section 2.2 of Part 3 of the spec.

On a related note, there's a Java library for Open XML over on SourceForge (sourceforge.net/projects/openxml4j) that Julien Chable started. It's a good way for Java developers to get productive quickly.

existing open source project for java/open XML by c dubettier

Hello,

If you wish to use Java to change or create Word documents, there are already some tools. There is an open source project at SourceForge: www.openxml4j.org/
You can merge docs, create paragraphs, TOCs …
Currently there are 2 branches. As the main branch is under development, I would recommend you look at the LEGACY branch, which is used on my project without any problem. A merge of both branches is planned, but it will not happen soon.

Slight typo by Ted Neward

Sorry, folks, I just noticed a typo in the article; when it says, "it's pretty clear that the core piece of the document is the "" element..." it should (pretty obviously) read "it's pretty clear that the core piece of the document is the "document" element...".

Sorry about that.

Oh, and to the others who were pointing out the OpenOffice formats and that it's easy to access those file formats as well, I'd like to say I agree with you 100%,


Microsoft actually released a CTP version of OpenXML SDK by YewMing Chen

But unfortunately it is more .NET.
So perhaps some of that .NET/Java interoperability would come in handy.
channel9.msdn.com/ShowPost.aspx?PostID=313246

Java and .NET interoperability by Arun Gupta

If you are interested in .NET interoperability, then you can use GlassFish V2 that provides Metro - the Web services stack! Check out a screencast at:

blogs.sun.com/arungupta/entry/excel_using_wsit_...

This shows how Excel 2007 can invoke a Secure and Reliable Web service endpoint deployed on GlassFish V2.

Great Article ! But ... by Vikram Roopchand

full power of the Office suite is in the APIs. Instead of manipulating XML, why not use the IDispatch interface directly..., I have created some samples using j-Interop (j-interop.sourceforge.net) , it's a pure java implementation of DCOM protocol, hence truly platform independent and the same thing is accomplished in less than 50 lines.

Thanks,
best regards,
Vikram

Re: Nice!! by Jenny Thompson

Hi Dariusz,
I need to extract the content of a Word document using Java code. Can this be achieved by using Open Office? I'm new to Open Office. If you are familiar with Open Office, can you please give me an idea of where to start? Can I get any sample code?

Thanks in advance,
Jenny

Using JAXB to unmarshall WordprocessingML to Java by Jason Harrop

The article envisions:
obtain the schema documents that describe the OpenXML format, and use those to generate a set of Java classes that more closely mirror the OpenXML document format. Then, developers could work with native Java classes, rather than "Document" and "Element" classes.

I've since done this successfully for a subset of WordML using JAXB - see my blog post for details. This will make it into the trunk of docx4j in due course.

Re: Nice!! by Dariusz Borowski

Hi Jenny! Sorry I just read your post. ;)

Let me know if you are still interested in extracting content out of a document.
Dariusz

Re: Nice!! by Krishna moorthy

Hi Dariusz,

I am interested in reading a Microsoft Word document. I tried with OpenOffice UNO, but couldn't find the right documentation.

The use case (for generating reports) is to read the document, replace some parts with dynamic data and convert it into PDF format. We are using Jasper for reporting but want to replace it with a Word document, as the report contents are heavily modified by end users.

Thanks

Re: Nice!! by Dariusz Borowski

Hi Krishna,

What does your document look like? Is the content wrapped in paragraphs or in tables as well?

It's very straightforward. You just need to iterate through your pages and pull out the content as you wish.

Here is a good reference on how to get the content:
Using OpenOffice.org from Java Applications

If you need more information, just let me know.

Cheers!

java.util.zip by dinesh prabakaran

I have used java.util.zip to unzip a Word 2007 document, added a new .xml file into the docProps folder, and then zipped it again into a .docx document. I can now open the document in Word with no problem, but when I retrieve it using .NET, the specified part is not there in the package. What could be the problem?
Thank you.

Great by Fahim A.

Thanks for this article! Really great!

One Question by tapan mokha

Firstly great article.
Is there any way I can verify that this is an MS Word document (.docx)? What would be the name of the .xml file I should look into to get this information? (I assume this information would be a schema element defined in one of the .xml files.) I need to verify it is in the correct format.
Thanks
Tapan

Excel 2007 data in XML was not proper by Usman Bat

The Excel 2007 data in XML was split across two files (when we unzipped the file).
The numeric values were in the Sheet1.xml file and the text data was in xl\sharedStrings.xml, so I wasn't able to put them back together. Another problem was that the data was not in sequence order either. I have to read both the numeric and the text data from Excel 2007. Can anyone help with this?

just help me by arvind dalal

can you please send me the complete code which is written above
because i am getting few errors while implementing it

give me complete code with included packages.

Re: Nice!! by Slim Makni

Hi,
I'm a beginner in Java development and I need your help to run this source.
thanks

Embed Excel/Powerpoint files in Word document by andr dk

Is it possible to embed excel(xls) or powerpoint(ppt) files using the above approach? I tried it but the preview image for OLE content is the same as in the original content of the docx file. Is there a way to preview emf image in Linux/java environment?

javadocx by Eduardo Ramos

You may find javadocx interesting: a Java library to generate MS Word documents from any data source. It has an LGPL version that you may download from www.javadocx.com

Best regards,

Eduardo

InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.