InfoQ Homepage Articles Implementing High Performance Parsers in Java

Implementing High Performance Parsers in Java

Nov 17, 2013 22 min read

Write for InfoQ

Feed your curiosity. Help 550k+ global
senior developers
each month stay ahead.Get in touch

This article de-linked and replaced by a new version

[Editors update "Dec 9, 2013: The author has since published a significantly revised article that you can read here: http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java-V2. InfoQ would like to thank Jakob for responding so well to our reader's feedback]

[Editors note "Nov 28, 2013: Due to reader complaints about the validity of the parser used,the article has been removed from active listing in InfoQ's archive and available by direct link only. The author will be given an opportunity to refactor and republish]

From time to time you may need to implement your own data or language parser in Java, for example if th

From time to time you may need to implement your own data or language parser in Java, for example if there is no standard Java or open source parser for that data format or language. Or, perhaps there are parsers, but they are too slow, consume too much memory, or just don’t have the features you need. Or there might be bugs in an open source parser, or the open source parser project was abandoned etc. The reason is not as important as the reality that you will be needing to implement your own parser.

When you have to implement your own parser, you want it to perform well, be flexible, feature rich, easy to use, and last but not least, easy to implement; after all, your name is on that code. In this article I will explain one way of implementing high performance parsers in Java. This method is not exclusive but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by the design of VTD-XML, the fastest XML parser for Java I have seen, being even faster than the StAX and SAX Java standard XML parsers.

Two Basic Parser Types

There are several ways to categorize parsers. Here I will just make the distinction between two basic parser types:

Sequential access parsers
Random access parsers

By sequential access I mean that the parser parses the data, turning over the parsed data to the data processor as the data is parsed. The data processor only has access to the data currently being parsed; it can’t look back to earlier data, and can’t look ahead. Such parsers are also known as event based parsers, like the SAX and StAX parsers.

A random access parser is a parser that enables the data processing code to move back and forth (access at random) over the data being parsed. Examples of such parsers are XML DOM parsers.

This diagram attempts to illustrate the difference between sequential and random access parsers:

Sequential access parsers let you access only the "window" or "event" that was just parsed, whereas random access parsers allow you to navigate the parsed data as you please.

Design Overview

The parser design I explain here is of the random access variety.

Random access parser implementations are often slower than sequential access parsers, because they generally build up some kind of object tree from the parsed data through which the data processing code can access that data. Creating this object tree is actually both slow in CPU time and can consume quite a bit of memory.

Instead of constructing an object tree from the parsed data, a more performant approach is to construct a buffer of indices into the original data buffer. The indices point to the start and end points of the elements found in the parsed data. Instead of accessing this data via an object tree, the data processing code accesses the parsed data directly in the buffer containing the original data. Here is an illustration of these two approaches:

For lack of a better name I call this an "Index Overlay Parser". The parser creates an index overlay on top of the original data. This is reminiscent of how a database indexes data stored on disk. It creates indices on top of the original, raw data to navigate and search through the data faster.

As I mentioned earlier, this design is inspired by VTD-XML (VTD for Virtual Token Descriptor.) Thus, you might also call this a Virtual Token Descriptor Parser. I prefer Index Overlay, because that is what the Virtual Token Descriptors represent - an index into the original data.

Parser Design in General

A normal design for parsers is to separate the parsing into two steps. The first step breaks the data into cohesive tokens, where a token is one or more bytes or characters occurring in the parsed data. The second step interprets the tokens and constructs larger elements based on these tokens. These two steps are illustrated here:

By elements I do not necessarily mean XML elements (though XML elements are also parser elements), but rather the larger "data elements" comprising the parsed data. In an XML document that would be XML elements, in a JSON document it would be JSON objects etc.

As an example, the string <myelement> would be broken into the following tokens:

<
myelement
>

Once the data is broken into tokens it is easier for the parser to make sense of them and thus determine the larger elements that these tokens comprise. The parser would then know that an XML element starts with a '<' token followed by a string token (the element name), then optionally a set of attributes, and finally a '>' token.

Index Overlay Parser Design

This is the two-step approach we will use in our parser design. The input data is first broken into tokens by a tokenizer component. The parser then parses those tokens to determine the larger element boundaries in the input data.

You can also add an optional third “element navigation step” to the parsing process. If the parser constructs an object tree from the parsed data, the object tree typically contains links to navigate the tree. When we construct an element index buffer instead of an object tree, we may need a separate component to help the data processing code navigate the element index buffer.

Here is an overview of our parser design:

First we read all the data into a data buffer. In order to enable random access to the original data via the index created during parsing, all of the original data has to be available in memory.

Second the tokenizer breaks the data into tokens. The start index, end index and token type of the tokens are kept internally in a token buffer in the tokenizer. Using a token buffer makes it possible to look forwards and backwards, in such cases that your parser needs that.

Third, the parser looks at the tokens obtained from the tokenizer, validates them against their context and determines what elements they represent. The parser then constructs the element index (the index overlay) based on the tokens received from the tokenizer. The parser obtains the tokens one by one from the tokenizer. Thus, the tokenizer does not actually need to break all the data into tokens immediately. It just needs to find one token at a time.

The data processing code can navigate the element buffer, and use that to access the original data. Optionally, you may wrap the element buffer in an element navigator component, making navigating the element buffer easier.

This design does not build up an object tree from the parsed data, but it does build up a navigable structure, the element buffer, consisting of indices (int arrays) into the data buffer containing the original data. Using these indices you can navigate the data in the original data buffer.

The following sections will explain the various parts of the design in more detail.

The Data Buffer

The data buffer is a byte or char buffer containing the original data. The token buffer and element buffer contain indices into the data buffer.

In order to allow random access to the parsed data, it is necessary to have some kind of in-memory representation of it all. Instead of an object tree we use the data buffer with the raw data itself.

Having all data in memory can consume a big chunk of memory. If your data contains elements that are independent of each other, such as log records, pulling the whole log file into memory might be overkill. Instead you can pull in a chunk of the log file that contains at least one full log record. Since every log record can be fully parsed and processed independently of the other log records, you don't need the full log file in memory at the same time. I have described how to iterate a stream of data in chunks in my article Iterating Streams Using Buffers.

The Tokenizer + Token Buffer

The tokenizer breaks the data buffer into tokens. The information about the tokens is stored in the token buffer, containing the following:

Token position (start index)
Token length
Token type (optional)

This information is stored in arrays. Here is an example of how the code could look:

   public class IndexBuffer {
       public int[]  position = null;
       public int[]  length   = null;
       public byte[] type     = null; /* assuming a max of 256 types (1 byte / type) */
   }

As the tokenizer finds tokens in the data buffer, it will insert the position (start index) in the position array, the token length in the length array, and the token type in the type array.

If you don’t use the optional token type array you can still determine the token type as needed by looking at the data in the token. This is a trade-off for performance vs. memory consumption.

The Parser

The parser is similar in nature to the tokenizer, except it takes tokens as input and outputs the element indices. Just like with tokens, an element is marked by its position (start index), length, and optionally its element type. These numbers are stored in the same structure used to store tokens.

Again, the type array is optional. If you can determine the element type easily based on the first bytes or characters of the element, you may not need to store the element types.

The precise granularity of the elements marked in the element buffer depends on the data being parsed as well as the code that needs to process the data afterwards. For instance, if you implement an XML parser you might mark each start tag, attribute and end tag as separate "parser elements".

The Element Buffer (The Index)

The parser produces an element buffer with indices into the original data. The indices mark the position (start index), length and type of the elements the parser has found in the data. You can use these indices to navigate the original data.

Looking at the IndexBuffer code above, you can see that the element buffer uses nine bytes per element; four bytes for the position, four bytes for the token length, and one byte for the token type.

You may be able to decrease the memory consumption of the IndexBuffer. For instance, if you know that the elements are never longer than 65,536 bytes, you may use an array of short instead of int to hold the token lengths. That will save you two bytes per element, bringing the memory consumption down to seven bytes per element.

Additionally, if know that the files you will parse are never longer than 16,777,216 bytes in length, you only need three bytes for the position (start index). The fourth byte of each int in the position array could then hold the element type, eliminating the need for a type array. If you have less than 128 token types, you can use seven bits for the token types instead of eight. This enables you to spend 25 bits on the position, which increases the maximum position to 33,554,432. If you have less than 64 token types, you can assign another bit to the position etc.

VTD-XML actually compacts all this information into a long to save space. You lose a bit of speed because of the extra bit manipulation needed to pack separate fields into a single int or long, but you save some memory. It's a trade off.

The Element Navigator

The element navigator helps the code that is processing the data navigate the element buffer. Remember, a semantic object or element (e.g. an XML element) might consist of multiple parser elements. To ease the navigation you can create an element navigator object that can navigate the parser elements on a semantic object level. For instance, an XML element navigator might navigate the element buffer by going from start tag to start tag.

The use of an element navigator component is your choice. If you are implementing a parser for a single use in a single project, you might want to skip it. But if you’re implementing a parser with the intention of reusing it across projects, or publishing it as open source, you might want to add an element navigator component, depending on how complexity of navigating the parsed data.

Case Study: A JSON Parser

To make the index overlay parser design more tangible, I have implemented a small JSON parser in Java, based on the index overlay parser design. You can find the complete code on GitHub.

JSON is short for JavaScript Object Notation. JSON is a popular data format to exchange data between web servers and browsers via AJAX, because web browsers have built-in native support for parsing JSON into JavaScript objects. Going forward I will assume that you are familiar with JSON.

Here is a simple JSON example:

  {  "key1"  :  "value1"  ,  "key2"  :  "value2"  ,  [  "valueA"  ,  "valueB"  ,  "valueC"  ]  }

The JSON tokenizer will break this JSON string into the following tokens:

The underlines are there to emphasize the length of each token.

The tokenizer also determines the basic types of each token. Here is the same JSON example with the token types added:

Notice how the token types are not semantic. They only say what the basic token type is and not what they represent.

The parser interprets the basic token types and replaces them with semantic types. Here is the same JSON example but with semantic types (the parser elements) instead:

Once the parser is finished parsing the above JSON you will have an index (element buffer) consisting of the positions, lengths and element types of the elements marked above. You can then navigate the index to extract the data you need from the JSON.

JsonTokenizer.parseToken()

In order to give you an idea about how the tokenization and parsing is implemented, I will show you the central code parts of the JsonTokenizer and the JsonParser. Remember, the full code is available on Github.

Here is the JsonTokenizer.parseToken() method which parses the next token in the data buffer:

public void parseToken() {
     skipWhiteSpace();
     this.tokenLength = 0;
	
     this.tokenBuffer.position[this.tokenIndex] = this.dataPosition;
     char nextChar = this.dataBuffer.data[this.dataPosition];
	
     switch(nextChar) {
	 case '{'   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_LEFT;
           break;
	 case '}'   :  
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_RIGHT;
           break;
	 case '['   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_LEFT ;
           break;
	 case ']'   : 
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_RIGHT;
           break;
	 case ','   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COMMA;
           break;
	 case ':'   :  
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COLON;
           break;
	 case '"'   :
           parseStringToken();
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_STRING_TOKEN;
           break;
	 default    : 
           parseStringToken();
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_STRING_TOKEN;
     }
     this.tokenBuffer.length[this.tokenIndex] = this.tokenLength;
  }

As you can see, the code is pretty simple. First the skipWhiteSpace() method is called which skips over any white space characters that might be in the data at the current location. Second, the token length is set to 0. Third, the position of the current token (the index into the data buffer) is stored in the TokenBuffer. Fourth the next character is examined, and depending on what character it is (what token it is), one of the statements in the switch-case construct is executed. Finally the token length for the current token is stored.

This is really all it takes to tokenize a data buffer. Notice that once the beginning of a string token is found, the tokenizer calls the parseStringToken() method which is scans through the data until the end of a string token is found. This will execute faster than trying to handle all cases inside the parseToken() method, and is easier to implement.

The rest of the methods in JsonTokenizer are just there to assist the parseToken() method, or to move the data position (index) to the next token (first position after the current token) etc.

JsonParser.parseObject()

The primary method of the JsonParser class is the parseObject() method which looks at the token types of the tokens obtained from the JsonTokenizer and tries to locate JSON objects in the input data based on that.

Here is the parseObject() method:

private void parseObject(JsonTokenizer tokenizer) {
     assertHasMoreTokens(tokenizer);
     tokenizer.parseToken();
     assertThisTokenType(tokenizer, TokenTypes.JSON_CURLY_BRACKET_LEFT);
     setElementData     (tokenizer, ElementTypes.JSON_OBJECT_START);
     tokenizer.nextToken();
     tokenizer.parseToken();
     while( tokenizer.tokenType() != TokenTypes.JSON_CURLY_BRACKET_RIGHT) {                 
         assertThisTokenType(tokenizer, TokenTypes.JSON_STRING_TOKEN);
	 setElementData(tokenizer, ElementTypes.JSON_PROPERTY_NAME);          
         tokenizer.nextToken();
	 tokenizer.parseToken();
	 assertThisTokenType(tokenizer, TokenTypes.JSON_COLON);
	 tokenizer.nextToken();
	 tokenizer.parseToken();
	 if(tokenizer.tokenType() == TokenTypes.JSON_STRING_TOKEN) {             
             setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE);
	 } else if(tokenizer.tokenType() == TokenTypes.JSON_SQUARE_BRACKET_LEFT) {
	     parseArray(tokenizer);
	 }
         tokenizer.nextToken();
	 tokenizer.parseToken();
	 if(tokenizer.tokenType() == TokenTypes.JSON_COMMA) {
	     tokenizer.nextToken();  //skip , tokens if found here.             
             tokenizer.parseToken();
	 }
      }
      setElementData(tokenizer, ElementTypes.JSON_OBJECT_END);
  }

  private void setElementData(JsonTokenizer tokenizer, byte elementType) {
     this.elementBuffer.position[this.elementIndex] = tokenizer.tokenPosition();     
     this.elementBuffer.length  [this.elementIndex] = tokenizer.tokenLength();         
     this.elementBuffer.type    [this.elementIndex] = elementType;     this.elementIndex++;
  }

The parseObject() method expects to see a left curly bracket ( { ) followed by a string token, a colon and another string token or the beginning of an array ( [ ) or another JSON object. As the JsonParser gets these tokens from the JsonTokenizer, it stores start, length and the semantic meaning of these tokens in its own elementBuffer. The data processing code can then navigate this elementBuffer afterwards, to extract any needed data from the input data.

Seeing the core parts of the JsonTokenizer and the JsonParser classes should provide some idea of how tokenization and parsing works. To fully understand how the code works you may need to look that the complete JsonTokenizer and JsonParser implementation. They are both less than 115 lines of code each, so they should be reasonably approachable.

Benchmarks

VTD-XML has already done extensive benchmarking of their XML parser against StAX, SAX and DOM parsers. VTD-XML wins over them all in raw performance.

In order to establish some credibility for index overlay parser performance in general, I have benchmarked my JSON parser implementation against GSON – Google’s JSON parser. GSON creates an object tree from a JSON input (String or stream).

Keep in mind that GSON is fairly mature production quality, tested, with good error reporting etc. My JSON parser is only at a proof-of-concept level. The benchmarking is only done to get an indication of the difference in performance. They are not final numbers. Remember to read the discussion of the benchmarks below too.

Here are some details about how the benchmarks are structured:

In order to warm up the JIT, minimize one-off overheads etc. the JSON input parsing is performed 10,000,000 times.
The benchmarks are repeated separately for three different files to see how the parsers do on small, medium and larger files. The file sizes are 64 bytes, 406 bytes and 1012 bytes. That means first do 10,000,000 iterations on a small file, and measure that. Then on a medium file, and measure that. And then on a larger file and measure that.
The file is being loaded fully into memory before parsing and measurement begins. That is done to exclude file-load time from the parsing time.
Each measurement of 10,000,000 iterations runs in its own process. That means that each file is parsed in separate processes. Only one process runs at a time.
Each file is measured 3 times. That is, the process parsing the file 10,000,000 times is started and stopped 3 times. The processes run sequentially, not in parallel.

The benchmark table contains three columns:

1. Simple iteration of the raw data buffer
2. JSON Parser
3. GSON

The first column is the simple iteration of all of the data in the raw data buffer. This number is only there to signal the lower limit; the minimum time theoretically possible to process all the data. Of course no parser will reach this speed, but the number is interesting to see how far off a parser is from the raw iteration speed. The second column is my JSON parser. The third column is Google's GSON parser.

Here are the times in milliseconds to perform the 10.000.000 parsings of the 3 files (64 bytes, 406 bytes, 1012 bytes):

File	Run	Iteration	JSON Parser	GSON
Small	1	2341	69708	91190
Small	2	2342	70705	91308
Small	3	2331	68278	92752
Medium	1	13954	122769	314266
Medium	2	13963	131708	316395
Medium	3	13954	132277	323585
Big	1	33494	239614	606194
Big	2	33541	231866	612193
Big	3	32462	232951	618212

As you can see, the index overlay implementation is significantly faster than GSON (an object tree JSON parser). Of course this was expected, but now you can get an idea of what the performance difference is.

Note that all benchmark processes were very stable in their memory consumption during execution. GSON did not steadily increase its memory consumption despite the many object trees created. The memory consumption of the index overlay parser was also stable, and about 1mb lower than that of the GSON benchmarks. Some of that might be due to the larger code base in GSON loaded into the JVM.

Benchmark Discussion

Now it wouldn’t be fair to compare a parser that creates an object tree from data (GSON) to a parser that just marks the index of the elements found in the data without a discussion of what is being compared.

Parsing a file inside a running application usually requires the following steps:

First the data is loaded either from disk or from the network. Second, the data is parsed. Third, the data is processed.

In order to measure just the raw parser speed I preloaded the files to be parsed into memory, and the benchmarked code will not process the data in any way. While this does benchmark just the raw parsing speeds, the performance difference does not translate one to one into increased performance in a running application. Here is why:

A streaming parser is often able to start parsing the incoming data before all of the data is loaded into memory. My JSON parser cannot do that the way it is implemented now. That means, that even though it is faster in raw parsing benchmarks, in a real life running application where my parser would have to wait for the data to load, it may not be as fast in total. This is illustrated in the diagram below:

You could probably modify my parser to be able to parser data as it is being loaded, in order to speed up the total parsing time. But that would probably decrease raw parser-performance a bit. The total speed might still be better, though.

Similarly, my JSON parser does not do anything with the parsed data. If you need to extract a lot of that data into Strings, then GSON will have done some of the work for you already, since it creates an object tree from the parsed data. This is illustrated in the diagram below:

When using GSON some of the data extraction needed during data processing might already have been done for you, meaning that the data processing step might be faster than if it also had to include data extraction (e.g. into String objects).

So, to really measure the impact on your application, you have to measure the use of different parsers in your application. I am sure you will see a speedup from using an index overlay parser, but exactly how much it will be is hard to say.

Discussion of Index Overlay Parsers in General

One argument I have heard against index overlay parsers is that to be able to point into the original data rather than extract it into an object tree, it is necessary to keep all of the data in memory while parsing it. This is can produce a memory-hit when processing big files.

In general the argument is that a streaming parser (like SAX or StAX) can parse huge files without ever having the whole file in memory. However this is only true if the data in the file can be parsed and processed in smaller chunks, where each chunk can be parsed and processed independently from other chunks. For instance, a big XML file contains a list of elements, each of which can be parsed and processed individually (like a list of log records). But, if your data can be parsed separately in independent chunks, you can implement an index overlay parser that is capable of that as well.

If the file cannot be parsed in independent chunks you will anyways have to extract the necessary information into some structure which can be accessed by the code processing later chunks. But if you can do that with a streaming parser, you can also do it with an index overlay parser.

Parsers that create object trees from input data often consume much larger amounts of memory with the object tree than the original data size. This is due to the memory overhead associated with an object instance, plus extra data needed to keep reference between objects.

Additionally since all data needs to be in memory at one time, you need to allocate a data buffer ahead of parsing that is big enough to hold all the data. But what if you don't know how big the files are when you start parsing them?

Imagine you have a web application (or web service, or other server application) where the users can upload files. You may not know how big the files are, so how can you allocate a suitable buffer for them before the parsing starts? Well, for security reasons you should always have a maximum allowed file size. Otherwise users may be able to crash your system by uploading very large files. Or they may even write a program that pretends to be a browser uploading a file, and have that program never stop sending data to your server. You can allocate a buffer fitting the maximum allowed file size. That way your buffer will not run out of space for valid files. If it does run out of space, your user has uploaded an excessively large file anyway.

About the Author

Jakob Jenkov is an entrepreneur, writer and software developer currently located in Barcelona, Spain. Jakob learned Java in 1997, and has worked with Java professionally since 1999. He is currently focused on the concept of efficiency across a variety of domains (software, processes, businesses etc.). He holds a master of science in IT from the IT University in Copenhagen. You can read more about his work on his website.

InfoQ Software Architects' Newsletter