Bridging Microsoft Word and the Browser

There are plenty of WYSIWYG HTML editors available, most of them built with JavaScript and JavaScript-based libraries. These editors work well for HTML formatting and for generating HTML source, but they lack capabilities that business reports require. Creating graphics and diagrams, tracking changes, and inserting comments are all useful when publishing under a typical review and approval life cycle, and all of them come out of the box in Microsoft Word. Word also works offline: no Internet connection is needed to edit existing documents or create new ones, whereas the offline capabilities of browsers and JavaScript are still limited. Given these strengths, using a native application built for the purpose looks like a good solution.

For business-oriented reporting and preview purposes, Apache POI (Poor Obfuscation Implementation) is an excellent Java framework for reading Microsoft documents such as MS Word and MS Excel files. Converting Word documents read this way into HTML makes their content easy to read in a browser.

Design

The POI library supports the Office Open XML (OOXML) file formats used for word processing documents, spreadsheets and so on. It provides an API for reading the various sections of a document. Once POI has loaded a document into memory, all of its metadata and content are available, and we can read them by traversing the sections (paragraphs, tables, table cells and so on). However, POI alone cannot generate the equivalent HTML elements.

For example, a section of text with a background color should be rendered as an HTML span element with formatting styles such as font-family and background-color. We should be able to create that HTML element through the Simple API for XML (SAX), carrying all the properties and styles of the document section read with POI. To read document sections at this low level, we need a Visitor-pattern-style API that reads the content and properties (formatting styles and so on) of each section of the document in sequence.
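To make the SAX side of this concrete, here is a minimal sketch that uses only the JDK's own SAX serializer, not xdocreport. The class name SpanSketch and its span method are hypothetical names chosen for illustration; the point is only that a styled span can be produced from startElement, characters and endElement events, with escaping handled by the serializer.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of POI or xdocreport.
class SpanSketch {

    /** Emits a span element with the given style through SAX events and returns the markup. */
    static String span(String style, String text) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // The formatting read from the document becomes a style attribute.
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "style", "style", "CDATA", style);

            handler.startDocument();
            handler.startElement("", "span", "span", attrs);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "span", "span");
            handler.endDocument();
            return out.toString();
        } catch (TransformerConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not serialize span", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(span("font-family:Arial;background-color:#ffff00", "Profit & loss"));
    }
}
```

Because the text goes through characters(), the serializer escapes the ampersand itself, which is one reason to emit SAX events rather than concatenate strings.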

Xdocreport is built on POI core and poi-ooxml, together with the generic ooxml-schemas library. It loads the document with the help of POI core and reads the content and metadata with the help of poi-ooxml and ooxml-schemas. Because it uses the schema library, navigating the elements of the document is easy. Xdocreport provides a visitor-style API that reads each section of the document and generates the corresponding HTML. The library can be extended to handle all HTML formatting styles and to overcome its limitations in rendering table structures, bullets and so on. It can also be extended to customize the rendering of external images. The code snippets below walk through how this can be achieved, with MS Word acting as the WYSIWYG editor.

The architecture diagram shows how the extended xdocreport can be used to control the rendered HTML.

Now let’s start with the actual implementation of the extended docx to HTML conversion, covering components such as paragraphs, tables, bullets and images.

Docx to HTML conversion Implementation

Load docx file stream to create XWPFDocument

FileInputStream fstream = new FileInputStream("Example.docx");
XWPFDocument document = new XWPFDocument(fstream);

// Create options. Options control image rendering and similar behavior.
XHTMLOptions options = XHTMLOptions.create();

// Create an output stream to hold the generated HTML source
ByteArrayOutputStream out = new ByteArrayOutputStream();
IContentHandlerFactory factory = DefaultContentHandlerFactory.INSTANCE;
options.setIgnoreStylesIfUnused(false);
XHTMLMapper mapper = new XHTMLMapper(document, factory.create(out, null, options), options);
mapper.start();
out.close();

We can convert the output stream to a String and create an HTML file.

String html = new String(out.toByteArray(), "UTF-8");
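As a rough sketch of the "create an HTML file" step, the generated string can be written with an explicit UTF-8 charset so that characters such as bullets survive the round trip. The class name HtmlFileSketch, its methods and the target file name are assumptions made for this example, and only JDK classes are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for persisting the converted HTML.
class HtmlFileSketch {

    /** Writes the HTML source to the target file as UTF-8, replacing any existing file. */
    static void save(Path target, String html) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a UTF-8 HTML file back into a String. */
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The file name is illustrative; any writable location works.
        Path target = Paths.get(System.getProperty("java.io.tmpdir"), "Example.html");
        save(target, "<p>Converted from Example.docx \u2022 done</p>");
        System.out.println(load(target));
    }
}
```

Naming the charset on both the write and the read avoids depending on the platform default, which differs between operating systems and JDK versions.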

We can easily attach this HTML source data to a servlet response output stream.

Customize formatting styles and components

We can extend the XHTMLMapper class in xdocreport to customize the default component styles converted from MS Word. We can also customize the rendering behavior of HTML components.

For example, superscripts and subscripts are generated as CSS styles attached to a span element. But what if a legacy browser doesn’t understand these CSS styles and only understands the sup and sub tags? In that case the renderer must generate sup / sub HTML tags instead of CSS styles. Below is an example of overriding the visitRun method to generate them:

@Override
protected void visitRun(XWPFRun run, boolean pageNumber, String url, Object paragraphContainer) throws Exception {
	boolean isSuper = false;
	boolean isSub = false;
	// Run properties of the current run (assumed source of rPr; null for unstyled runs)
	CTRPr rPr = run.getCTR().getRPr();
	if (rPr != null && rPr.getVertAlign() != null) {
		int align = rPr.getVertAlign().getVal().intValue();
		if (STVerticalAlignRun.INT_SUPERSCRIPT == align) {
			isSuper = true;
		} else if (STVerticalAlignRun.INT_SUBSCRIPT == align) {
			isSub = true;
		}
	}
	.
	.
	.
	// Open the sup / sub element before the run content is written
	if (isSuper || isSub) {
		startElement(isSuper ? "SUP" : "SUB", null);
	}
	.
	.
	.
	// Close the element once the run content has been written
	if (isSuper || isSub) {
		endElement(isSuper ? "SUP" : "SUB");
	}
}

We can also customize components. By default, every hyperlink is generated as a plain link with no additional attributes. Even though we cannot translate every Word setting into HTML, suppose we need all hyperlinks to open in a new window. We can achieve this by overriding the visitRun method:

@Override
protected void visitRun(XWPFRun run, boolean pageNumber, String url, Object paragraphContainer) throws Exception {
	boolean isUrl = url != null;
	if (isUrl) {
		// Open the anchor with a target attribute so the link opens in a new window
		AttributesImpl hyperlinkAttributes = new AttributesImpl();
		SAXHelper.addAttrValue(hyperlinkAttributes, "href", url);
		SAXHelper.addAttrValue(hyperlinkAttributes, "target", "_blank");
		startElement("a", hyperlinkAttributes);
	}
	.
	.
	.
	if (isUrl) {
		characters(" ");
		endElement("a");
	}
}
[Figure: the document in MS Word (left) and the generated HTML in the browser (right)]

Bullets to HTML lists

For bullet points, the generated HTML could use ul and li elements. However, not every MS Word bullet type or style can be translated to an equivalent HTML list item, because MS Word has its own rendering capability and its own image and clip art library. Simple ul and li elements will not serve the purpose, so we use span elements instead: the first span contains the actual bullet character and the second contains the data. To render basic bullets, we override the startVisitParagraph method.

The code snippet below shows how to render a basic bullet (a small filled circle), which is the Unicode character U+2022.

startElement(SPAN_ELEMENT, attributes);
String text = itemContext.getText();
if (StringUtils.isNotEmpty(text)) {
	text = text.replace('\u2020', '\u2022'); // replaces every occurrence
	text = text + " ";
	SAXHelper.characters(contentHandler, StringEscapeUtils.escapeHtml(text));
}
endElement(SPAN_ELEMENT);
[Figure: bullets in MS Word (left) and the generated HTML in the browser (right)]
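The two-span idea can be sketched without xdocreport as plain string building. BulletMarkup, its methods and the "bullet" class name are hypothetical, and the U+2020 to U+2022 mapping is simply carried over from the snippet above rather than taken from any library.

```java
// Hypothetical illustration of the two-span bullet layout.
class BulletMarkup {

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one item as two spans: the bullet glyph first, then the item text. */
    static String item(String bulletText, String body) {
        // Same character mapping as the snippet above; String.replace covers every occurrence.
        String glyph = bulletText.replace('\u2020', '\u2022');
        return "<span class=\"bullet\">" + escape(glyph) + "&nbsp;</span>"
                + "<span>" + escape(body) + "</span>";
    }

    public static void main(String[] args) {
        System.out.println(item("\u2020", "Review & approve <draft>"));
    }
}
```

Keeping the glyph in its own span means it can be styled or swapped independently of the text, which is what a plain li element does not allow.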

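There is no direct HTML representation for the tab character. Instead, we can output a fixed number of space characters to approximate a tab. Keep in mind that browsers collapse runs of ordinary spaces into one unless the element is styled with white-space: pre or non-breaking spaces are used. Here’s the code: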

@Override
protected void visitTabs(CTTabs o, Object paragraphContainer) throws Exception {
	if (paragraph != null && o == null) {
		startElement(SPAN_ELEMENT, null);
		characters("    "); // adjust the number of space characters as needed
		endElement(SPAN_ELEMENT);
		return;
	}
	super.visitTabs(o, paragraphContainer);
}
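As a variation on the approach above, the sketch below expands each tab into non-breaking spaces (U+00A0), which browsers do not collapse. TabExpander and the width parameter are assumptions for illustration, not part of xdocreport.

```java
// Hypothetical helper that approximates tabs for HTML output.
class TabExpander {

    /** Replaces every tab in the text with the given number of non-breaking spaces. */
    static String expand(String text, int width) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (c == '\t') {
                for (int i = 0; i < width; i++) {
                    sb.append('\u00A0'); // non-breaking space survives whitespace collapsing
                }
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String expanded = expand("Name\tValue", 4);
        System.out.println(expanded.length()); // prints 13
    }
}
```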

Extract Images

So far, we’ve converted text data, along with any formatting styles, to HTML source. However, the document may also contain images, either embedded or externally linked. These images should be extracted and shown in the browser.

Embedded Images

For embedded images, XHTMLOptions can use the default URIResolver. The default resolver looks for images in the docx zip archive under “word/media/”, where all embedded images are stored. The rendered HTML for an image would be:

<img src="word/media/image1.jpeg" width="189pt" height="141pt"/>
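Because a .docx file is a zip archive, the word/media/ entries mentioned above can be listed with the JDK alone. The sketch below is an illustration under that assumption; MediaLister and sampleArchive are hypothetical names, and the in-memory archive only stands in for a real document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper that inspects a docx package as a plain zip archive.
class MediaLister {

    /** Returns the names of all entries stored under word/media/ in the archive. */
    static List<String> listMedia(InputStream docx) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().startsWith("word/media/")) {
                    names.add(entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny in-memory zip that mimics the layout of a docx package (demo data only). */
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"word/document.xml", "word/media/image1.jpeg"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(new byte[] {0});
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(listMedia(new ByteArrayInputStream(sampleArchive())));
        // prints [word/media/image1.jpeg]
    }
}
```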

However, the browser cannot resolve a src (source path) such as word/media/image1.jpeg, which results in a broken image. We can fix this with a custom resolver that constructs a URL to a servlet that loads the image:

final String imgUrl = "/MyImageLoader?imageId=";
XHTMLOptions options = XHTMLOptions.create().URIResolver(new IURIResolver() {

	@Override
	public String resolve(String uri) {
		// Fall back to a placeholder when there is no image URI to resolve
		if (uri == null)
			return "/no_image.gif";
		// Keep only the file name, e.g. "image1.jpeg" from "word/media/image1.jpeg"
		int ls = uri.lastIndexOf('/');
		if (ls >= 0)
			uri = uri.substring(ls + 1);
		return imgUrl + uri;
	}
});

The option created above resolves the images and renders them as follows:

<img src="/MyImageLoader?imageId=image1.jpeg" width="189pt" height="141pt"/>
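The path-to-URL rewriting done by the resolver can be isolated into a small function that is easy to check on its own. This is a sketch, not xdocreport code: ImageUrlBuilder is a hypothetical name, the base URL follows the servlet path used in this article, and URL-encoding the file name is an added precaution that the original resolver does not take.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper mirroring the resolver logic with plain JDK classes.
class ImageUrlBuilder {

    /** Maps a package-relative image path to a servlet URL carrying the file name. */
    static String toServletUrl(String base, String uri) {
        if (uri == null || uri.isEmpty()) {
            return "/no_image.gif"; // placeholder image, as in the resolver above
        }
        // lastIndexOf returns -1 when there is no slash, so this keeps the whole string.
        String name = uri.substring(uri.lastIndexOf('/') + 1);
        try {
            return base + URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toServletUrl("/MyImageLoader?imageId=", "word/media/image1.jpeg"));
        // prints /MyImageLoader?imageId=image1.jpeg
    }
}
```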

A typical servlet implementation, with inline Java comments, would be:

resp.setContentType("image/jpeg"); // set the correct content type according to the file type
ServletOutputStream img = resp.getOutputStream();
InputStream fis = getFileInputStream(); // docx file input stream
if (fis != null) {
	String imageId = req.getParameter("imageId");
	XWPFDocument document = new XWPFDocument(fis); // load document
	XWPFPictureData pic = document.getPictureDataByID(imageId); // get picture
	if (pic != null)
		img.write(pic.getData());
}
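The comment about setting the content type according to the file type could be handled with a small lookup by file extension. The sketch below is one possible approach; ImageContentTypes is a hypothetical name and the list of extensions is an assumption, not something the servlet above prescribes.

```java
import java.util.Locale;

// Hypothetical helper for choosing the response content type.
class ImageContentTypes {

    /** Returns a MIME type for common image extensions, or a generic binary type otherwise. */
    static String forFileName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg")) {
            return "image/jpeg";
        }
        if (lower.endsWith(".png")) {
            return "image/png";
        }
        if (lower.endsWith(".gif")) {
            return "image/gif";
        }
        if (lower.endsWith(".bmp")) {
            return "image/bmp";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(forFileName("image1.jpeg")); // prints image/jpeg
    }
}
```

In the servlet, the result would be passed to resp.setContentType in place of the hard-coded "image/jpeg".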

One major advantage of implementing a custom image resolver is that we can control the format of the images we render.

[Figure: the document in MS Word (left) and the transformed HTML in the browser (right)]

Externally Linked images

External images (inserted in MS Word through the menu Insert → Picture → File name, giving the image URL and selecting the option “Link to file”) are most useful when the same image is reused across multiple documents. We keep only one copy of the image and save disk space by not embedding it in each document. These images are rendered as an HTML image tag whose src is the given URL. If we want to customize the rendering, for example to run an anti-virus scan before rendering or to check the image against a list of allowed formats, we override the visitPicture method of XHTMLMapper.

// Externally linked images: read the relationship ID of the link from the picture
String link = picture.getBlipFill().getBlip().getLink();
PackageRelationship rel = document.getPackagePart().getRelationships().getRelationshipByID(link);
if (rel != null) {
	String src = rel.getTargetURI().toString(); // image URL

Once we have the image URL, we can decide whether to render it after performing checks such as a virus scan, authentication and authorization, and an allowed image format check.
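An allowed-format check of this kind might look like the following sketch. ExternalImagePolicy, the permitted schemes and the extension list are all assumptions chosen for illustration; a real deployment would pick its own policy and would still need the scan and authorization steps.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical policy check for externally linked image URLs.
class ExternalImagePolicy {

    // Illustrative allow-list; adjust to the formats the application accepts.
    static final Set<String> ALLOWED_EXTENSIONS =
            new HashSet<>(Arrays.asList("png", "jpg", "jpeg", "gif"));

    /** Accepts only http(s) URLs whose path ends in an allowed image extension. */
    static boolean isAllowed(String src) {
        URI uri;
        try {
            uri = new URI(src);
        } catch (URISyntaxException e) {
            return false; // malformed URLs are rejected outright
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
            return false;
        }
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String extension = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED_EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("https://example.com/img/logo.PNG")); // prints true
        System.out.println(isAllowed("file:///etc/passwd"));               // prints false
    }
}
```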

Future Scope and Limitations

We can further extend Xdocreport to cover components or sections it doesn’t yet support. MS Word documents can be shown or saved in different layout views (e.g. web layout, print layout). For web layout, the HTML shouldn’t have page borders. Page background colors and page borders can also be transformed to equivalent HTML elements or formatting. Unfortunately, there are limitations where no HTML equivalent exists, such as page numbers and multi-column paragraphs.

Conclusion

We’ve seen how we can use MS Word as a WYSIWYG HTML generator. MS Word offers many capabilities, including track changes, which can be used in business review and approval cycles. We can also work offline, since we use a native application on the client for editing. The big advantage of this approach is that we can easily control the HTML source generation with Java. You can find the example source code on GitHub.

About the Author

Prasadu Babu Dandu is a technical lead in the Platform division of MetricStream Inc., a leading company in enterprise GRC software. He works primarily with J2EE technologies to build enterprise software. He is an avid contributor to the open source community.
