Article: Using Java to Crack Office 2007

Posted by Floyd Marinescu on Jun 05, 2007 10:12 AM

Community: Java
Topics: Java plus .NET Integration, Data Access
Tags: OpenXML, Microsoft Office

Historically, opening Office files from within Java has been something of a problem, because Office documents (principally Word, Excel, and PowerPoint) were stored in a binary format known to COM developers worldwide as the structured storage format. Office 2003 introduced new XML formats unique to itself (such as WordML), which Java developers could use to read or write Office documents, but the formats were not well documented, and Java developers frequently found themselves learning WordML through trial and error. Various open-source projects stepped in to try to mitigate the situation, such as the POI framework from Apache for reading and writing Excel documents, and various Java-COM solutions. With Office 2007, Microsoft has made a significant part of these problems "go away": with nothing more complicated than the native JDK itself, Java apps can now read and write any Office 2007 document, because Office 2007 documents are now nothing more than ZIP files of XML documents.

In "Using Java to Crack Office 2007", Ted Neward goes in-depth on the new Office file format and demonstrates reading it from Java.
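
To make the "ZIP files of XML documents" point concrete, here is a minimal sketch that uses only the JDK. It is not the code from Ted Neward's article; the class and method names (`DocxPeek`, `listParts`, `readText`) are made up for illustration, and it assumes the main part lives at `word/document.xml`, which is where Word normally puts it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class DocxPeek {
    // Namespace of the WordprocessingML main document vocabulary.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the names of all ZIP entries (parts) in the package. */
    static List<String> listParts(Path docx) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    /** Concatenates the text of every w:t element in word/document.xml. */
    static String readText(Path docx) throws Exception {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumption: the main part is at the conventional location.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("no word/document.xml part found");
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            try (InputStream in = zip.getInputStream(entry)) {
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList texts = doc.getElementsByTagNameNS(W_NS, "t");
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < texts.getLength(); i++) {
                    sb.append(texts.item(i).getTextContent());
                }
                return sb.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path docx = Paths.get(args[0]);
        listParts(docx).forEach(System.out::println);
        System.out.println(readText(docx));
    }
}
```

Run it with the path to a `.docx` file as the only argument. Paragraph breaks are ignored here, so the text comes out as one run; a real reader would walk the `w:p` elements instead.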

23 comments


Nice!! by karan malhi Posted Jun 4, 2007 6:18 AM
Re: Nice!! by Dariusz Borowski Posted Jun 5, 2007 8:58 AM
Re: Nice!! by Jenny Thompson Posted Aug 7, 2007 4:38 AM
Re: Nice!! by Dariusz Borowski Posted May 16, 2008 8:23 AM
Re: Nice!! by Krishna moorthy Posted May 23, 2008 5:53 AM
Re: Nice!! by Dariusz Borowski Posted Jun 1, 2008 7:07 AM
just help me by arvind dalal Posted Nov 19, 2008 2:23 PM
Re: Nice!! by Slim Makni Posted Apr 13, 2009 5:19 AM
Accessing Word from Groovy is really easy by Peter Kelley Posted Jun 5, 2007 6:39 PM
Naturally Java can parse XML, but... by Colm Smyth Posted Jun 5, 2007 7:52 PM
Re: Naturally Java can parse XML, but... by Stefan Wenig Posted Jun 6, 2007 2:47 AM
Re: Naturally Java can parse XML, but... by Doug Mahugh Posted Jun 6, 2007 11:06 AM
Re: Naturally Java can parse XML, but... by Dash Bitla Posted Jun 12, 2007 10:50 PM
existing open source project for java/open XML by c dubettier Posted Jun 11, 2007 3:44 AM
Slight typo by Ted Neward Posted Jun 12, 2007 3:39 PM
Microsoft actually released a CTP version of OpenXML SDK by YewMing Chen Posted Jun 14, 2007 3:08 AM
Java and .NET interoperability by Arun Gupta Posted Jun 19, 2007 5:38 PM
Great Article ! But ... by Vikram Roopchand Posted Jul 1, 2007 2:30 AM
Using JAXB to unmarshall WordprocessingML to Java by Jason Harrop Posted Dec 14, 2007 2:04 AM
java.util.zip by dinesh prabakaran Posted Jun 17, 2008 3:45 AM
Great by Fahim A. Posted Aug 8, 2008 3:59 AM
One Question by tapan mokha Posted Aug 28, 2008 3:54 PM
Excel 2007 data in XML was not proper by Usman Bat Posted Nov 6, 2008 8:21 AM
  1. Nice!!

    Jun 4, 2007 6:18 AM by karan malhi

    This was a really nice article. I also wish Java had support for modifying an existing zip file, instead of reading the whole zip, adding an entry, and writing the whole thing into a new zip with the new entry.
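
The copy-and-append workaround described above can be sketched roughly as follows. This is an illustration, not code from the commenter or the article; `ZipAppend` and `copyWithExtraEntry` are hypothetical names, and the sketch assumes that an existing entry with the same name should be replaced by the new one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class ZipAppend {
    /**
     * Copies every entry of source into target, then writes one extra entry.
     * An existing entry called name is skipped so the new data replaces it.
     */
    static void copyWithExtraEntry(Path source, Path target, String name, byte[] data)
            throws IOException {
        try (ZipFile in = new ZipFile(source.toFile());
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(target))) {
            byte[] buffer = new byte[8192];
            Enumeration<? extends ZipEntry> entries = in.entries();
            while (entries.hasMoreElements()) {
                ZipEntry old = entries.nextElement();
                if (old.getName().equals(name)) {
                    continue; // replaced by the new entry below
                }
                // A fresh ZipEntry lets the output stream recompute sizes and CRC.
                out.putNextEntry(new ZipEntry(old.getName()));
                try (InputStream is = in.getInputStream(old)) {
                    int n;
                    while ((n = is.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
                out.closeEntry();
            }
            out.putNextEntry(new ZipEntry(name));
            out.write(data);
            out.closeEntry();
        }
    }
}
```

The target must be a different file from the source, because the source is still being read while the target is written.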

  2. Re: Nice!!

    Jun 5, 2007 8:58 AM by Dariusz Borowski

    Yeah, nice, but a little bit too late. Speaking of an 'OpenDocument' format reminds me of UNO. UNO is an API that lets Java work with documents. Instead of zipping and unzipping the document, it gives you the chance to work with OBJECTS. Objects are tables, paragraphs, pictures, lines, words, or even letters. You can manipulate, save, or delete them, or do whatever else you want with the document, and at the end you can convert it into a different format. It's not necessary to unzip the document to retrieve the XML file and then iterate through it. There is a much better way to do it, and you get rid of reading XML tags by specific iteration or enumeration. Check this out: http://api.openoffice.org/docs/DevelopersGuide/FirstSteps/FirstSteps.xhtml Cheers, Dariusz

  3. Accessing Word from Groovy is really easy

    Jun 5, 2007 6:39 PM by Peter Kelley

    Yesterday I wrote some code to access a Word document from Groovy using the Scriptom extension, which uses the JACOB Java-COM bridge. Quite frankly, it beats the hell out of writing XML manipulation code in Java: 36 lines to get the text out of all comments in a Word document, including constant declarations and code comments.

  4. Naturally Java can parse XML, but...

    Jun 5, 2007 7:52 PM by Colm Smyth

    Clearly Java's ability to parse valid XML documents is not in question! The issue with "cracking Office 2007" is not the fact that it doesn't have an XML format, it is the fact that its format is a) a successor to an existing standard format, b) incomplete (missing required information), and c) not optimised for comprehension by third-party applications. The Wikipedia article includes a partial list of criticisms. BTW, in that section it's hardly surprising that governments quoted from existing sources; why rewrite a criticism if it is valid? And on that basis, I do not rewrite them here but simply refer you to the link above, and to the Comparison of OpenDocument and Office Open XML formats.

  5. Re: Naturally Java can parse XML, but...

    Jun 6, 2007 2:47 AM by Stefan Wenig

    Colm, I guess we are all aware of the criticisms. It's just getting boring to be lectured about it whenever someone speaks about OOXML. For a lot of users, the single OOXML advantage of 100% compatibility with the old binary format is just so much more important than anything on the downside (a large part of which is only affecting developers of competing office suites). Can we please just get on with it and not get all political each and every time?

  6. Re: Naturally Java can parse XML, but...

    Jun 6, 2007 11:06 AM by Doug Mahugh

    Yes, the lectures have become quite boring and repetitive. There is a lot of opportunity for developers around the new formats simply because they're the defaults in Office, regardless of anyone's feelings on the politics of standards. One thing worth noting is that not all of the parts shown above are required; Word simply writes out many optional parts every time. You just need the start part (document.xml), content types, and a relationships part to have a minimal hello-world document, and the start part only needs to have the body/p/r/t child elements, as shown in section 2.2 of Part 3 of the spec. On a related note, there's a Java library for Open XML over on SourceForge (http://sourceforge.net/projects/openxml4j) that Julien Chable started. It's a good way for Java developers to get productive quickly.
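
The three-part minimal package described above can be sketched with nothing but `java.util.zip`. This is an illustration under stated assumptions, not code from the commenter or the spec: `MinimalDocx` and its methods are hypothetical names, the entry names follow the layout Word conventionally uses, and the result has not been checked here against every Office version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
class MinimalDocx {
    static final String XML_DECL =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>";

    // Content types part: maps .rels files and the start part to their types.
    static final String CONTENT_TYPES = XML_DECL
        + "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
        + "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
        + "<Override PartName=\"/word/document.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>"
        + "</Types>";

    // Package relationships part: points at the start part.
    static final String RELS = XML_DECL
        + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
        + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\" Target=\"word/document.xml\"/>"
        + "</Relationships>";

    /** Builds the start part with a single body/p/r/t chain. */
    static String documentXml(String text) {
        String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        return XML_DECL
            + "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
            + "<w:body><w:p><w:r><w:t>" + escaped + "</w:t></w:r></w:p></w:body>"
            + "</w:document>";
    }

    private static void put(ZipOutputStream zip, String name, String xml) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(xml.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    /** Writes a three-part package containing one paragraph of text. */
    static void write(Path target, String text) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(target))) {
            put(zip, "[Content_Types].xml", CONTENT_TYPES);
            put(zip, "_rels/.rels", RELS);
            put(zip, "word/document.xml", documentXml(text));
        }
    }

    public static void main(String[] args) throws IOException {
        write(Paths.get("hello.docx"), "Hello, world");
    }
}
```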

  7. existing open source project for java/open XML

    Jun 11, 2007 3:44 AM by c dubettier

    Hello, if you wish to use Java to change or create Word documents, there are already some tools. There is an open source project at SourceForge: http://www.openxml4j.org/ You can merge docs, create paragraphs, TOCs … Currently there are 2 branches. As the main branch is under development, I would recommend you look at the LEGACY branch, which is used on my project without any problem. A merge of both branches is planned, but it will not happen soon.

  8. Slight typo

    Jun 12, 2007 3:39 PM by Ted Neward

    Sorry, folks, I just noticed a typo in the article; when it says, "it's pretty clear that the core piece of the document is the "" element..." it should (pretty obviously) read "it's pretty clear that the core piece of the document is the "document" element...". Sorry about that. Oh, and to the others who were pointing out the OpenOffice formats and that it's easy to access those file formats as well, I'd like to say I agree with you 100%,

  9. Re: Naturally Java can parse XML, but...

    Jun 12, 2007 10:50 PM by Dash Bitla

    Clearly Java's ability to parse valid XML documents is not in question! The issue with "cracking Office 2007" is not the fact that it doesn't have an XML format, it is the fact that its format is a) a successor to an existing standard format, b) incomplete (missing required information), and c) not optimised for comprehension by third-party applications. The Wikipedia article includes a partial list of criticisms. BTW, in that section it's hardly surprising that governments quoted from existing sources; why rewrite a criticism if it is valid? And on that basis, I do not rewrite them here but simply refer you to the link above, and to the Comparison of OpenDocument and Office Open XML formats.

  10. Microsoft actually released a CTP version of OpenXML SDK

    Jun 14, 2007 3:08 AM by YewMing Chen

    But unfortunately it is more .NET. So perhaps some of that .NET/Java interoperability would come in handy. http://channel9.msdn.com/ShowPost.aspx?PostID=313246

  11. Java and .NET interoperability

    Jun 19, 2007 5:38 PM by Arun Gupta

    If you are interested in .NET interoperability, then you can use GlassFish V2 that provides Metro - the Web services stack! Check out a screencast at: http://blogs.sun.com/arungupta/entry/excel_using_wsit_javaone_2007 This shows how Excel 2007 can invoke a Secure and Reliable Web service endpoint deployed on GlassFish V2.

  12. Great Article ! But ...

    Jul 1, 2007 2:30 AM by Vikram Roopchand

    full power of the Office suite is in the APIs. Instead of manipulating XML, why not use the IDispatch interface directly? I have created some samples using j-Interop (http://j-interop.sourceforge.net). It's a pure Java implementation of the DCOM protocol, hence truly platform independent, and the same thing is accomplished in less than 50 lines. Thanks, best regards, Vikram

  13. Re: Nice!!

    Aug 7, 2007 4:38 AM by Jenny Thompson

    Hi Dariusz, I need to extract the content of a Word document using Java code. Can this be achieved using OpenOffice? I'm new to OpenOffice. If you are familiar with it, can you please give me an idea of where to start? Can I get any sample code? Thanks in advance, Jenny

  14. Using JAXB to unmarshall WordprocessingML to Java

    Dec 14, 2007 2:04 AM by Jason Harrop

    The article envisions:

    obtain the schema documents that describe the OpenXML format, and use those to generate a set of Java classes that more closely mirror the OpenXML document format. Then, developers could work with native Java classes, rather than "Document" and "Element" classes.
    I've since done this successfully for a subset of WordML using JAXB - see my blog post for details. This will make it into the trunk of docx4j in due course.

  15. Re: Nice!!

    May 16, 2008 8:23 AM by Dariusz Borowski

    Hi Jenny! Sorry I just read your post. ;) Let me know if you are still interested in extracting content out of a document. Dariusz

  16. Re: Nice!!

    May 23, 2008 5:53 AM by Krishna moorthy

    Hi Dariusz,

    I am interested in reading a Microsoft Word document. I tried OpenOffice UNO, but couldn't find the right documentation.

    The use case (for generating reports) is to read the document, replace some parts with dynamic data, and convert it into PDF format. We are using Jasper for reporting but want to replace it with a Word document, as the report contents are heavily modified by end users.

    thanks

  17. Re: Nice!!

    Jun 1, 2008 7:07 AM by Dariusz Borowski

    Hi Krishna, what does your document look like? Is the content wrapped in paragraphs or in tables as well?

    It's very straightforward. You just need to iterate through your pages and pull out the content as you wish.

    Here is a good reference on how to get the content:
    Using OpenOffice.org from Java Applications

    If you need more information, just let me know.
    Cheers!

  18. java.util.zip

    Jun 17, 2008 3:45 AM by dinesh prabakaran

    I have used java.util.zip to unzip a Word 2007 document, added a new .xml file to the docprops folder, and then zipped it again into a .docx document. I can open the document in Word with no problem, but when I retrieve it using .NET, it reports that the specified part is not in the package. What could be the problem? Thank you.

  19. Great

    Aug 8, 2008 3:59 AM by Fahim A.

    Thanks for this article! Really great!

  20. One Question

    Aug 28, 2008 3:54 PM by tapan mokha

    Firstly, great article. Is there any way I can verify that this is an MS Word document (.docx)? What would be the name of the .xml file I should look into to get this information? (I assume this information would be a schema element defined in one of the .xml files.) I need to verify it is in the correct format. Thanks, Tapan

  21. Excel 2007 data in XML was not proper

    Nov 6, 2008 8:21 AM by Usman Bat

    The Excel 2007 data was split across two XML files once we unzipped the workbook: the numeric values were in Sheet1.xml and the text data was in xl\sharedStrings.xml, so I couldn't line them up. Another problem was that the data was not in sequence order. I have to read both the numeric and the text data from Excel 2007. Can anyone help with this?
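
The split described above is how SpreadsheetML stores text: a cell marked `t="s"` holds an index into the shared strings part rather than the text itself, and each cell carries its own reference in the `r` attribute (for example `A1`). The sketch below shows one way to join the two parts. It is an illustration, not code from the article; `SharedStringsDemo` and its methods are hypothetical names, and it handles only plain numeric cells and shared-string cells (no inline strings, dates, or formulas).

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class SharedStringsDemo {
    static final String NS =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    /** Parses an XML part with namespace support switched on. */
    static Document parse(InputStream in) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(in);
    }

    /** Returns the text of each si element, in index order. */
    static List<String> readSharedStrings(Document sharedStrings) {
        List<String> result = new ArrayList<>();
        NodeList items = sharedStrings.getElementsByTagNameNS(NS, "si");
        for (int i = 0; i < items.getLength(); i++) {
            // Simplification: takes all text under si, including rich-text runs.
            result.add(items.item(i).getTextContent());
        }
        return result;
    }

    /** Maps each cell reference (e.g. "A1") to its resolved value, in document order. */
    static Map<String, String> readCells(Document sheet, List<String> shared) {
        Map<String, String> cells = new LinkedHashMap<>();
        NodeList cellNodes = sheet.getElementsByTagNameNS(NS, "c");
        for (int i = 0; i < cellNodes.getLength(); i++) {
            Element cell = (Element) cellNodes.item(i);
            NodeList values = cell.getElementsByTagNameNS(NS, "v");
            if (values.getLength() == 0) {
                continue; // empty cell
            }
            String raw = values.item(0).getTextContent().trim();
            boolean isSharedString = "s".equals(cell.getAttribute("t"));
            cells.put(cell.getAttribute("r"),
                      isSharedString ? shared.get(Integer.parseInt(raw)) : raw);
        }
        return cells;
    }
}
```

To use it on a real workbook, open the `.xlsx` with `java.util.zip.ZipFile` and pass the streams for `xl/sharedStrings.xml` and the worksheet part (typically `xl/worksheets/sheet1.xml`, an assumption about the default layout) to `parse`.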

  22. just help me

    Nov 19, 2008 2:23 PM by arvind dalal

    Can you please send me the complete code for what is written above? I am getting a few errors while implementing it. Please give me the complete code, including the packages.

  23. Re: Nice!!

    Apr 13, 2009 5:19 AM by Slim Makni

    Hi, I'm a beginner at Java development and I need your help to run this source. Thanks.
