Apache POI 3.5 Released with OOXML Support
Apache have released version 3.5 of POI, the Java library for working with Microsoft's document formats. POI previously supported Microsoft's OLE2 compound document formats as used in Office 97-2003 (versions 8.0 - 11). With POI 3.5 Apache have added support for Microsoft's Office Open XML (OOXML) document standard, the default file format for Microsoft Office 2007.
Existing code using the HSSF user model will continue to work correctly with POI 3.5, although HSSF itself cannot read or write the OOXML format. To work with both OOXML and OLE2 formats, POI has a new SS user model (org.apache.poi.ss.usermodel) which provides a common API. The new SS model is heavily based on the old HSSF model (org.apache.poi.hssf.usermodel) and so retains a familiar feel for developers used to working with HSSF. Updating from one to the other is not a particularly arduous task: the main difference is that the package name and class names have been tweaked to remove HSSF from them. It should be noted, however, that some features are not yet covered by the SS user model. Apache's Yegor Kozlov, release manager for POI 3.5, told us:
Currently the SS usermodel covers about 90% of the functionality provided by HSSF. The goal is to achieve full coverage. In particular, conditional formatting and data validation are not yet supported by XSSF. There is also some work to do with graphics and rich text (both in HSSF and XSSF).
POI's OOXML support was developed from scratch and is independent of the OpenXML SDK. Along with Julien Chable's OpenXML4J project which merged with POI, a significant contribution to the OOXML implementation in version 3.5 came from Sourcesense, an Open Source company which was commissioned by Microsoft to do the work. Microsoft's Robert Duffner told us that Microsoft themselves provide the funding, architectural guidance, program & project management, feature prioritisation & release strategy for the OOXML work with Sourcesense as the development partner.
Microsoft's involvement inevitably caused some controversy, with some POI contributors questioning the validity of Microsoft's Open Specification Promise (OSP) patent license. Writing on his blog at the time, project founder Andrew Oliver stated:
Although the OSP does not address some of the edge cases where work may be required for compatibility but not for implementing the specification, Microsoft has agreed to go further and sign a specific agreement with Apache which will address this concern for the work they have funded with POI. Furthermore, the OSP will be managed as a legal product much like the way that an Open Source project is, with revisions as they are needed.
When we spoke to Oliver, however, he made it clear that he still harbours some concerns about potential patent issues in the specification:
The patent promise only goes to the extent of conformance to the OOXML specification, it does not go as far as compatibility with the products using the specification. Microsoft has key patents on things that could fall outside of this (i.e. methods of reading an OOXML document that does not conform to its schema) that are not in the scope of conformance to the specification but are in the scope of POI's mission of compatibility with popular proprietary office software.
To be clear, Microsoft employees Robert Duffner and former employee Sam Ramji were very willing to work on this issue, but currently there is not sufficient focus on resolving these and other IP issues within the POI developer community itself. There is a belief in trusting that Microsoft means well and that its interests will prevent it from doing something wrong. I disagree and think that corporate strategies/interest change. POI also needs more IP auditing and governance overall, I think the project has been doing a bad job of this lately.
Additionally, no efforts were made to ensure that any other inadvertently or purposefully implemented patents held by Microsoft which are not required for conformance to the OOXML specifications and were done by Sourcesense on Microsoft's dime were covered (outside of this "promise"). Sourcesense signed a CLA with Apache, but there is no similar agreement between Apache and Microsoft. This means that Sourcesense would be theoretically liable to Apache should there be a problem, but I do not know that is sufficient. I would like to see the developer community take an active role in resolving all of these issues.
Yegor Kozlov, release manager for POI 3.5, disagrees with Oliver's assessment: "There were hot debates on this subject some time ago", he told us, and "details can be found here." He went on:
Actually it was an Andy Oliver vs Apache/Microsoft/Sourcesense struggle, as Andy was the only person vetoing OOXML development until Microsoft rewrites its Open Specification Promise (OSP) to his satisfaction. Sadly, it later resulted in his resignation from Apache POI.
I (and I believe the whole Apache POI community) have every reason to believe that Microsoft is interested in promoting the ISO/IEC 29500 and ECMA 376, and wants to ensure that it's easy to adopt. There are several other open source projects that don't seem to be worried about the patent and compatibility issues, including OpenOffice.
These thoughts are echoed by Microsoft's technical lead for the POI project, Vijay Rajagopalan, who told us:
Enabling developers to accomplish their common tasks with OpenXML file formats is our highest order bit. It does not matter whether they use .NET or Java. We want to enable Java developers who need to manipulate calculations in spreadsheets with a simple & pragmatic API. The POI project was the answer to this. This is one of the reasons why we are funding this open source project and working with the larger community to demonstrate practical interoperability.
As well as the patent concerns, there is some difficulty with the OOXML standard as it currently stands. POI officially supports the ISO/IEC 29500:2008 version, approved by the Ecma International Technical Committee and published in November 2008. However, since this version contains some changes that were introduced after Office 2007 shipped, Office 2007 is not itself fully compliant with it. Microsoft have stated that Office 2010 will be the first version of their office suite to fully support the standard. As things stand, therefore, where the two differ POI aims to be compatible with the Office 2007 version. Vijay Rajagopalan told us:
The POI OpenXML Java APIs are created for practical developer use cases and not really driven based on standards politics. So we have kept the user/developer in mind when we have built these APIs.
In general, he told us:
We are becoming more transparent in our approach to how we implement standards. IE 8 is a great example of this. We contributed the CSS test cases & specifications to W3C. Similarly, the Office team is documenting implementation notes illustrating how the standard is implemented.
We went on to talk about how the POI implementation will be managed when Office 2010 ships. Rajagopalan stated:
Office 2010 will embrace ISO/IEC 29500:2008 and we plan to use the implementer notes & community validator project (details here) spearheaded by Fraunhofer institute to validate the conformance of the specification for interoperability scenarios.
We also discussed future plans for POI with Apache's Kozlov. He highlighted four other areas he hoped to focus on:
- Support for Excel Charts both in HSSF and XSSF. This is a long-standing plan and I hope we will complete it by POI-3.6.
- Improve memory usage. Both HSSF and XSSF are memory hogs. Both keep the whole model in memory resulting in OutOfMemory when generating large grids - 10K of records or so. The actual limit depends on the amount of available heap memory.
There are workarounds for applications that need only read access (text extraction and indexing, etc.) - for those we provide event-based APIs with low memory footprint, but generating large grids using user model API is an issue.
- Scratchpad projects. HWPF (DOC format) and HSLF (PPT format) still live in the scratchpad area of POI. That means they are not "mature" enough to be included in the main jar. Unfortunately there are no active developers working on these modules and we only apply sporadic patches and fixes. This part of POI definitely needs some attention.
- Continue the work on DOCX and PPTX modules.
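The read-only workaround mentioned in the memory item above rests on event-driven parsing: the reader reacts to each element as it streams past instead of building the whole model in memory. The sketch below is not POI's event API. It uses the JDK's own SAX parser on a simplified, hypothetical worksheet fragment (no namespaces, three rows) purely to illustrate the idea; the class and method names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: plain JDK SAX, not POI's event-based API.
public class StreamingRowCounter {

    /** Counts <row> elements as they stream past; nothing is retained between events. */
    static int countRows(InputStream sheetXml) {
        final int[] rows = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("row".equals(qName)) {
                    rows[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(sheetXml, handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sheet XML", e);
        }
        return rows[0];
    }

    /** Runs the counter over a tiny, hypothetical worksheet fragment. */
    static int demo() {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\"><v>1</v></c></row>"
                + "<row r=\"2\"><c r=\"A2\"><v>2</v></c></row>"
                + "<row r=\"3\"><c r=\"A3\"><v>3</v></c></row>"
                + "</sheetData></worksheet>";
        return countRows(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 3
    }
}
```

Because the handler keeps only a counter, memory use stays flat however many rows the sheet contains. The same property is what makes this style unsuitable for generating large grids, which is the open problem Kozlov describes.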
Project founder Andrew Oliver also raised the memory issue as a future target, telling us:
I think that the OLE2FS code should be revisited to use RandomAccessFile and memory mapping to avoid using 5-10x as much memory/heap as the file is large when doing a modification. When we originally wrote this code Java NIO didn't exist.
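As a rough illustration of the approach Oliver describes, the sketch below modifies a few bytes in the middle of a file by memory-mapping only the affected region through RandomAccessFile and FileChannel, so the rest of the file never has to be loaded onto the Java heap. It is not POI code and says nothing about how the OLE2 filesystem classes are structured; the class name, file size, and offsets are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative only: a generic in-place patch, not part of POI.
public class MappedPatchSketch {

    /** Overwrites bytes at the given offset by mapping only that region of the file. */
    static void patch(Path file, long offset, byte[] replacement) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            // Only replacement.length bytes are mapped; the rest of the file is untouched.
            MappedByteBuffer region =
                    channel.map(FileChannel.MapMode.READ_WRITE, offset, replacement.length);
            region.put(replacement);
            region.force(); // ask the OS to write the mapped change to storage
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a 1 MiB temp file of zeros, patches 4 bytes mid-file, returns them read back. */
    static byte[] demo() {
        try {
            Path file = Files.createTempFile("mapped-patch", ".bin");
            file.toFile().deleteOnExit();
            long offset = 512 * 1024;
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
                raf.setLength(1 << 20); // size the file without allocating a heap buffer
            }
            patch(file, offset, new byte[] {1, 2, 3, 4});
            try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
                byte[] readBack = new byte[4];
                raf.seek(offset);
                raf.readFully(readBack);
                return readBack;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [1, 2, 3, 4]
    }
}
```

The heap cost here is proportional to the size of the change rather than the size of the file, which is the contrast with the "5-10x" overhead Oliver mentions for the existing code.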