InfoQ Homepage Articles Implementing High Performance Parsers in Java

Implementing High Performance Parsers in Java

Lire ce contenu en franÃ§ais

Dec 09, 2013 30 min read

Write for InfoQ

Feed your curiosity. Help 550k+ global
senior developers
each month stay ahead.Get in touch

Note: This article is an updated version of a previous article on the same topic. The original version of the article was intended to capture some of the main points of creating a high-performance parser, but it received some criticism from readers for the perception that other parts were overlooked. The original article was completely revised and a more complete version of the associated code was created. We hope you enjoy this revision.

From time to time you may need to implement your own data or language parser in Java, for example if there is no standard Java or open source parser for that data format or language. Or, perhaps there are parsers, but they are too slow, consume too much memory, or just don’t have the features you need. Or there might be bugs in an open source parser, or the open source parser project was abandoned etc. The reason is not as important as the reality that you will be needing to implement your own parser.

When you have to implement your own parser, you want it to perform well, be flexible, feature rich, easy to use, and last but not least, easy to implement; after all, your name is on that code. In this article I will explain one way of implementing high performance parsers in Java. This method is not exclusive but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by the design of VTD-XML, the fastest XML parser for Java I have seen, being even faster than the StAX and SAX Java standard XML parsers.

Two Basic Parser Types

There are several ways to categorize parsers. Here I will just make the distinction between two basic parser types:

Sequential access parsers
Random access parsers

By sequential access I mean that the parser parses the data, turning over the parsed data to the data processor as the data is parsed. The data processor only has access to the data currently being parsed; it can’t look back to earlier data, and can’t look ahead. Such parsers are also known as event based parsers, and the SAX and StAX parsers are the most well-known examples.

A random access parser is a parser that enables the data-processing code to move back and forth (access at random) over the data being parsed. Examples of such parsers are XML DOM parsers.

Sequential access parsers let you access only the "window" or "event" that was just parsed, in the document sequence, whereas random access parsers allow you to navigate through the parsed data as you please.

Design Overview

The parser design I explain here is of the random access variety.

Random access parser implementations are often slower than sequential access parsers, because they generally build up some kind of object tree from the parsed data through which the data processing code can access that data. Creating this object tree is actually both slow in CPU time and can consume quite a bit of memory.

Instead of constructing an object tree from the parsed data, a more performant approach is to construct a buffer of indices into the original data buffer. The indices point to the start and end points of the elements found in the parsed data. Instead of accessing this data via an object tree, the data processing code accesses the parsed data directly in the buffer containing the original data. Here is an illustration of these two approaches:

For lack of a better name I call this an "Index Overlay Parser". The parser creates an index overlay on top of the original data. This is reminiscent of how a database indexes data stored on disk. It creates indices on top of the original, raw data to navigate and search through the data faster.

As I mentioned earlier, this design is inspired by VTD-XML (VTD for Virtual Token Descriptor.) Thus, you might also call this a Virtual Token Descriptor Parser. I prefer the name Index Overlay, because that is what the Virtual Token Descriptors represent - an index into the original data.

Parser Design in General

A normal design for parsers is to separate the parsing into two steps. The first step breaks the data into cohesive tokens, where a token is one or more bytes or characters occurring in the parsed data. The second step interprets the tokens and constructs larger elements based on these tokens. These two steps are illustrated here:

By elements I do not necessarily mean XML elements (though XML elements are also parser elements), but rather the larger "data elements" comprising the parsed data. In an XML document that would be XML elements, in a JSON document it would be JSON objects etc.

As an example, the string would be broken into the following tokens:

· <
· myelement
· >

Once the data is broken into tokens it is easier for the parser to make sense of them and thus determine the larger elements that these tokens comprise. The parser would then know that an XML element starts with a '<' token followed by a string token (the element name), then optionally a set of attributes, and finally a '>' token.

Index Overlay Parser Design

This is the two-step approach we will use in our parser design. The input data is first broken into tokens by a tokenizer component. The parser then parses those tokens to determine the larger element boundaries in the input data.

You can also add an optional third ‘element navigation step’ to the parsing process. If the parser constructs an object tree from the parsed data, the object tree typically contains links to navigate the tree. When we construct an element index buffer instead of an object tree, we may need a separate component to help the data processing code navigate the element index buffer.

Here is an overview of our parser design:

First we read all the data into a data buffer. In order to enable random access to the original data via the index created during parsing, all of the original data has to be available in memory.

Second the tokenizer breaks the data into tokens. The start index, end index and token type of the tokens are kept internally in a token buffer in the tokenizer. Using a token buffer makes it possible to look forwards and backwards, in such cases that your parser needs that.

Third, the parser looks at the tokens obtained from the tokenizer, validates them against their context and determines what elements they represent. The parser then constructs the element index (the index overlay) based on the tokens received from the tokenizer. The parser obtains the tokens one by one from the tokenizer. Thus, the tokenizer does not actually need to break all the data into tokens immediately. It just needs to find one token at a time.

The data processing code can navigate the element buffer, and use that to access the original data. Optionally, you may wrap the element buffer in an element navigator component, making navigating the element buffer easier.

This design does not build up an object tree from the parsed data, but it does build up a navigable structure, the element buffer, consisting of indices (int arrays) into the data buffer containing the original data. Using these indices you can navigate the data in the original data buffer.

The following sections will explain the various parts of the design in more detail.

The Data Buffer

The data buffer is a byte or char buffer containing the original data. The token buffer and element buffer contain indices into the data buffer.

In order to allow random access to the parsed data, it is necessary to have some kind of in-memory representation of it all. Instead of an object tree we use the data buffer with the raw data itself.

Having all data in memory can consume a big chunk of memory. If your data contains elements that are independent of each other, such as log records, pulling the whole log file into memory might be overkill. Instead you can pull in a chunk of the log file that contains at least one full log record. Since every log record can be fully parsed and processed independently of the other log records, you don't need the full log file in memory at the same time. I have described how to iterate a stream of data in chunks in my article Iterating Streams Using Buffers.

The Tokenizer + Token Buffer

The tokenizer breaks the data buffer into tokens. The information about the tokens is stored in the token buffer, containing the following:

Token position (start index)
Token length
Token type (optional)

This information is stored in arrays. Here is an example of how the code could look:

public class IndexBuffer {
    public int[] position = null;
    public int[] length = null;
    public byte[] type = null; /* assuming a max of 256 types (1 byte / type) */
}

As the tokenizer finds tokens in the data buffer, it will insert the position (start index) in the position array, the token length in the length array, and the token type in the type array.

If you don't use the optional token type array you can still determine the token type as needed by looking at the data in the token. This is a trade-off for performance vs. memory consumption.

The Parser

The parser is similar in nature to the tokenizer, except it takes tokens as input and outputs the element indices. Just like with tokens, an element is marked by its position (start index), length, and optionally its element type. These numbers are stored in the same structure used to store tokens.

Again, the type array is optional. If you can determine the element type easily based on the first bytes or characters of the element, you may not need to store the element types.

The precise granularity of the elements marked in the element buffer depends on the data being parsed as well as the code that needs to process the data afterwards. For instance, if you implement an XML parser you might mark each start tag, attribute and end tag as separate "parser elements".

The Element Buffer (The Index)

The parser produces an element buffer with indices into the original data. The indices mark the position (start index), length and type of the elements the parser has found in the data. You can use these indices to navigate the original data.

Looking at the IndexBuffer code above, you can see that the element buffer uses nine bytes per element; four bytes for the position, four bytes for the token length, and one byte for the token type.

You may be able to decrease the memory consumption of the IndexBuffer. For instance, if you know that the elements are never longer than 65,536 bytes, you may use an array of short instead of int to hold the token lengths. That will save you two bytes per element, bringing the memory consumption down to seven bytes per element.

Additionally, if know that the files you will parse are never longer than 16,777,216 bytes in length, you only need three bytes for the position (start index). The fourth byte of each int in the position array could then hold the element type, eliminating the need for a type array. If you have less than 128 token types, you can use seven bits for the token types instead of eight. This enables you to spend 25 bits on the position, which increases the maximum position to 33,554,432. If you have less than 64 token types, you can assign another bit to the position etc.

VTD-XML actually compacts all this information into a long to save space. You lose a bit of speed because of the extra bit manipulation needed to pack separate fields into a single int or long, but you save some memory. Again, it's a trade off.

The Element Navigator

The element navigator helps the code that is processing the data navigate the element buffer. Remember, a semantic object or element (e.g. an XML element) might consist of multiple parser elements. To ease the navigation you can create an element navigator object that can navigate the parser elements on a semantic object level. For instance, an XML element navigator might navigate the element buffer by going from start tag to start tag.

The use of an element navigator component is your choice. If you are implementing a parser for a single use in a single project, you might want to skip it. But if you’re implementing a parser with the intention of reusing it across projects, or publishing it as open source, you might want to add an element navigator component, depending on how complexity of navigating the parsed data.

Case Study: A JSON Parser

To make the index overlay parser design more tangible, I have implemented a small JSON parser in Java, based on the index overlay parser design. You can find the complete code on GitHub.

JSON is short for JavaScript Object Notation. JSON is a popular data format to exchange data between web servers and browsers via AJAX, because web browsers have built-in native support for parsing JSON into JavaScript objects. Going forward I will assume that you are familiar with JSON.

Here is a simple JSON example:

{ "key1" : "value1" , "key2" : "value2" , [ "valueA" , "valueB" , "valueC" ] }

The JSON tokenizer will break this JSON string into the following tokens:

The underlines are there to emphasize the length of each token.

The tokenizer also determines the basic types of each token. Here is the same JSON example with the token types added:

Notice how the token types are not semantic. They only say what the basic token type is and not what they represent.

The parser interprets the basic token types and replaces them with semantic types. Here is the same JSON example but with semantic types (the parser elements) instead:

Once the parser is finished parsing the above JSON you will have an index (element buffer) consisting of the positions, lengths and element types of the elements marked above. You can then navigate the index to extract the data you need from the JSON.

The implementation in the GitHub repository contains two JSON parsers. One that splits the parsing between a JsonTokenizer and a JsonParser (as described earlier in this article), and a JsonParser2 which combines the tokenization and parsing into a single phase and class. The JsonParser2 is faster, but also harder to understand. Therefore I will take a quick look at the JsonTokenizer and JsonParser class in the following sections, but skip over JsonParser2.

(A reader of the first version of this article observed that the output from the index overlay parser was not very easy to use to extract data from the original data buffer. As mentioned earlier, that is why you might add an element navigation component. To show how such an element navigation component could look, I have added the JsonNavigator class. I will also take a quick look at that class later.)

JsonTokenizer.parseToken()

In order to give you an idea about how the tokenization and parsing is implemented, I will show you the central code parts of the JsonTokenizer and the JsonParser. Remember, the full code is available on GitHub.

Here is the JsonTokenizer.parseToken() method which parses the next token in the data buffer:

     public void parseToken() {
       skipWhiteSpace();
       this.tokenBuffer.position[this.tokenIndex] = this.dataPosition;
       switch (this.dataBuffer.data[this.dataPosition]) {
         case '{': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_LEFT;
         }
         break;
         case '}': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_RIGHT;
         }
         break;
         case '[': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_LEFT;
         }
         break;
         case ']': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_RIGHT;
         }
         break;
         case ',': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COMMA;
         }
         break;
         case ':': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COLON;
         }
         break;
         case '"': { parseStringToken(); } break;
         case '0':
         case '1':
         case '2':
         case '3':
         case '4':
         case '5':
         case '6':
         case '7':
         case '8':
         case '9': {
           parseNumberToken();
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_NUMBER_TOKEN;
         }
         break;
         case 'f': {
           if (parseFalse()) {
             this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_BOOLEAN_TOKEN;
           }
         }
         break;
         case 't': {
           if (parseTrue()) {
               this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_BOOLEAN_TOKEN;
           }
         }
         break;
         case 'n': {
           if (parseNull()) {
             this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_NULL_TOKEN;
           }
         }
         break;
        }
        this.tokenBuffer.length[this.tokenIndex] = this.tokenLength;
     }

As you can see, the code is pretty simple. First the skipWhiteSpace() method is called which skips over any white space characters that might be in the data at the current location. Second, the position of the current token (the index into the data buffer) is stored in the tokenBuffer. Third the next character is examined, and depending on what character it is (what token it is), one of the statements in the switch-case construct is executed. Finally the token length for the current token is stored.

This is really all it takes to tokenize a data buffer. Notice that once the beginning of a string token is found, the tokenizer calls the parseStringToken() method which scans through the data until the end of the string token is found. This will execute faster than trying to handle all cases inside the parseToken() method, and it is easier to implement.

The rest of the methods in JsonTokenizer are just there to assist the parseToken() method, or to move the data position (index) to the next token (first position after the current token) etc.

JsonParser.parseObject()

The primary method of the JsonParser class is the parseObject() method which looks at the token types of the tokens obtained from the JsonTokenizer and tries to locate JSON objects in the input data based on that.

Here is the parseObject() method:

     private void parseObject(JsonTokenizer tokenizer) {
       assertHasMoreTokens(tokenizer);
       tokenizer.parseToken();
       assertThisTokenType(tokenizer.tokenType(), TokenTypes.JSON_CURLY_BRACKET_LEFT);
       setElementData(tokenizer, ElementTypes.JSON_OBJECT_START);
       tokenizer.nextToken();
       tokenizer.parseToken();
       byte tokenType = tokenizer.tokenType();
       while (tokenType != TokenTypes.JSON_CURLY_BRACKET_RIGHT) {
          assertThisTokenType(tokenType, TokenTypes.JSON_STRING_TOKEN);
          setElementData(tokenizer, ElementTypes.JSON_PROPERTY_NAME);
          tokenizer.nextToken();
          tokenizer.parseToken();
          tokenType = tokenizer.tokenType();
          assertThisTokenType(tokenType, TokenTypes.JSON_COLON);
          tokenizer.nextToken();
          tokenizer.parseToken();
          tokenType = tokenizer.tokenType();
          switch (tokenType) {
             case TokenTypes.JSON_STRING_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_STRING);
             }
             break;
             case TokenTypes.JSON_STRING_ENC_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_STRING_ENC);
             }
             break;
             case TokenTypes.JSON_NUMBER_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_NUMBER);
             }
             break;
             case TokenTypes.JSON_BOOLEAN_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_BOOLEAN);
             }
             break;
             case TokenTypes.JSON_NULL_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_NULL);
             }
             break;
             case TokenTypes.JSON_CURLY_BRACKET_LEFT: {
               parseObject(tokenizer);
             }
             break;
             case TokenTypes.JSON_SQUARE_BRACKET_LEFT: {
               parseArray(tokenizer);
             }
             break;
          }
          tokenizer.nextToken();
          tokenizer.parseToken();
          tokenType = tokenizer.tokenType();
          if (tokenType == TokenTypes.JSON_COMMA) {
             tokenizer.nextToken(); //skip , tokens if found here.
             tokenizer.parseToken();
             tokenType = tokenizer.tokenType();
          }
       }
       setElementData(tokenizer, ElementTypes.JSON_OBJECT_END); } 
     }

The parseObject() method expects to see a left curly bracket ( { ) followed by a string token, a colon and another string token or the beginning of an array ( [ ) or another JSON object. As the JsonParser gets these tokens from the JsonTokenizer, it stores start, length and the semantic meaning of these tokens in its own elementBuffer. The data processing code can then navigate this elementBuffer afterwards, to extract any needed data from the input data.

Seeing the core parts of the JsonTokenizer and the JsonParser classes should provide some idea of how tokenization and parsing works. To fully understand how the code works you may need to look that the complete JsonTokenizer and JsonParser implementation. They are both less than 200 lines of code each, so they should be reasonably approachable.

JsonNavigator

The JsonNavigator is an element navigator component. It helps you navigate element index created by theJsonParser and JsonParser2. The indices produced by these two components are the same, so an index from either will do. Here is a code example:

JsonNavigator jsonNavigator = new JsonNavigator(dataBuffer, elementBuffer);

Once the JsonNavigator is created you can navigate element indexes using its navigation methods (next(),previous() etc. and you can extract data using the methods asString(), asInt() and asLong(). You can compare constant strings to the element in the data buffer using the isEqualUnencoded(String).

Using the JsonNavigator class looks pretty similar to using the Gson streaming API . Look at the gsonStreamBuildObject(Reader) method in the AllBenchmarks class, and in the parseJsonObject(JsonNavigator) method in the JsonObjectBuilder class.

They look pretty similar, do they not? Only, the parseJsonObject() method is capable of using some of the optimizations (discussed later in this article) of the JsonNavigator, like counting the primitive elements in an array, and faster string comparison for JSON field names.

Benchmarks

VTD-XML has already done extensive benchmarking of their XML parser against StAX, SAX and DOM parsers. VTD-XML wins over them all in raw performance.

In order to establish some credibility for index overlay parser performance in general, I have benchmarked my JSON parser implementation against Gson. The first version of this article only measured the speed of parsing a JSON file and building and object via reflection with Gson. Due to comments from readers I have now expanded the benchmarks to measure Gson in four different modes:

Stream through all elements in the JSON file without doing anything with the data.
Stream through all elements in the JSON file and build a JsonObject from it.
Parse the JSON file and build a Map from it.
Parse the JSON file and build a JsonObject from it using reflection.

Keep in mind that Gson is already of a mature production quality and is well tested, with good error reporting etc. My JSON parser is only at a proof-of-concept level. The benchmarking is only done to get an indication of the difference in performance. They are not final numbers. Remember to read the discussion of the benchmarks below too.

Here are some details about how the benchmarks are structured:

In order to warm up the JIT, minimize one-off overheads etc. the JSON input parsing is performed 10,000,000 times for the small file, and 1,000,000 times for the medium and big file.
The benchmarks are repeated separately for three different files to see how the parsers do on small, medium and larger files. The file sizes are 58 bytes, 783 bytes and 1854 bytes. That means first do 10,000,000 iterations on a small file and measure that. Then on a medium file, and then on a larger file. The files are located in the data directory in the GitHub repository.
The file is being loaded fully into memory before parsing and measurement begins. That is done to exclude the file load time from the parsing time.
Each measurement of 10,000,000 iterations (or 1,000,000 iterations) runs in its own process. That means that each file is parsed in separate processes. Only one process runs at a time. Each file is measured 3 times. That is, the process parsing the file 10,000,000 times is started and stopped 3 times. The processes run sequentially, not in parallel.

Here are the execution times in milliseconds:

Benchmarks
	Small 1	Small 2	Small 3	Medium 1	Medium 2	Medium 3	Large 1	Large 2	Large 3
JsonParser2	8.143	7.862	8.236	11.092	10.967	11.169	28.517	28.049	28.111
JsonParser2 + Builder	13.431	13.416	13.416	17.799	18.377	18.439	41.636	41.839	43.025
JsonParser	13.634	13.759	14.102	19.703	19.032	20.438	43.742	41.762	42.541
JsonParser + Builder	19.890	19.173	19.609	29.531	26.582	28.003	56.847	60.731	58.235
Gson Stream	44.788	45.006	44.679	33.727	34.008	33.821	69.545	69.111	68.578
Gson Stream + Builder	45.490	45.334	45.302	36.972	38.001	37.253	74.865	76.565	77.517
Gson Map	66.004	65.676	64.788	42.900	42.778	42.214	83.932	84.911	86.156
Gson Reflection	76.300	76.473	77.174	69.825	67.750	68.734	135.177	137.483	134.337

Here is an explanation of what the benchmarks do:

Description
JsonParser2	Parses file and locates indices using JsonParser2.
JsonParser2 + Builder	Parses file and builds a JsonObject from it using JsonParser2.
JsonParser	Parses file and locates indices using JsonParser.
JsonParser + Builder	Parses file and builds a JsonObject from it using JsonParser2.
Gson Stream	Parses file and iterates through all tokens via Gson streaming API.
Gson Stream + Builder	Parses file and builds a JsonObject from it.
Gson Map	Parses file and builds a Map from it.
Gson Reflection	Parses file and builds a JsonObject from it using reflection.

As you can see, the index overlay implementations (JsonParser and JsonParser2) are faster than Gson. Below I will discuss some of my speculations about why that is.

Performance Discussion

The main reason why Gson's streaming API is not even faster still, is because all data is extracted from the stream as it is being traversed, even if you don't need that data. Every token is turned into a String, int, double, etc., which is time consuming. That is also why parsing the JSON file and building a JsonOject from it with Gson's streaming API is almost as fast as just streaming through the elements themselves. The only addition time wise is the instantiation of the JsonObject and arrays inside the JsonObject.

The data extraction cannot account for it all, though. Building a JsonObject with JsonParser2 as almost twice as fast as building a JsonObject with Gson's streaming API. Here are a few of the performance advantages I see an index overlay parser have over a streaming parser:

First of all, if you look at the benchmarks for small and large files, it seems that Gson has some one-off initialization overhead associated with each parsing. The performance differences between JsonParser2 +JsonParser and the Gson benchmarks are bigger for smaller files. That could be due to the creation of theCharArrayReader or something similar. It could also be something internal in Gson.

Second, an index overlay parser allows you to choose exactly how much of the data you want to extract. That gives you are more fine-grained control over the performance of the parser.

Third, the JsonParser and JsonParser2 detects during tokenization if a string token has any escape characters (e.g. " \" \t \n \r ") that needs to be manually converted from UTF-8 to UTF-16. If a string token does not contain escape characters, the JsonNavigator can use a faster String creation mechanism than if they do.

Fourth, the JsonNavigator enables faster string comparisons against data in the data buffer. This is handy when you need to check if the name of a field is equal to some constant string. Using Gson's streaming API you would have to extract the field name into a String object, and compare the constant string to that String object. The JsonNavigator can compare the constant string to the chars in the data buffer directly, without creating a String object first. That saves the instantiation of a String object, and the copying of data from the data buffer into a String object which is only used for that comparison (e.g. checking if the JSON field name is equal to "key" or "name" or something else). Here is how that looks with the JsonNavigator:

if(jsonNavigator.isEqualUnencoded("fieldName")) { }

Fifth, the JsonNavigator can look ahead in its index and count the number of elements in arrays that contain primitive values (strings, numbers, booleans, nulls etc. but not objects or nested arrays). When you do not know how many elements an array contains, normally you extract the elements and put them into a List. Once you encounter the end-of-array marker, you convert the List into an array. This means that you create an unnecessary List object. Additionally, all data extracted has to be objects to be inserted in a List, even if the array contains primitive values like ints or booleans. This creates unnecessary object creation (unnecessary auto-boxing at the least) when inserting the extracted values into the List. Again, when creating the array of primitives, all the objects have to be converted into primitives again, and inserted into the array. Here is how it looks in code using the Gson streaming API:

List<Integer> elements = new ArrayList<Integer>();
reader.beginArray();
while (reader.hasNext()) {
   elements.add(reader.nextInt());
}
reader.endArray();
int[] ints = new int[elements.size()];
for (int i = 0; i < ints.length; i++) {
   ints[i] = elements.get(i);
}
jsonObject.numberArray = ints;

When you do know how many elements an array contains, you can create the final Java array immediately, and insert the primitive values directly into the array. This saves the List instantiation and building, the auto-boxing of primitive values, and the conversion from objects back to primitive values again when inserting the values into the array. Here is how the same code looks using the JsonNavigator:

int[] ints = new int[jsonNavigator.countPrimitiveArrayElements()];
for (int i = 0, n = ints.length; i < n; i++) {
   ints[i] = jsonNavigator.asInt();
   jsonNavigator.next();
}
jsonObject.numberArray = ints;

Even when you are just building List objects from JSON arrays, knowing the number of elements enables you to instantiate an ArrayList with the correct capacity from the beginning. That way you avoid penalties from the ArrayList dynamically resizing itself if you reach the limit of its default capacity. Here is how it looks in code:

List<String> strings = new ArrayList<String>(jsonNavigator.countPrimitiveArrayElements());
jsonNavigator.next(); // skip over array start. 
while (ElementTypes.JSON_ARRAY_END != jsonNavigator.type()) {
   strings.add(jsonNavigator.asString());
   jsonNavigator.next();
}
jsonNavigator.next(); //skip over array end.

Sixth, when you have access to the original data buffer you can use ropes instead of String objects in many places. A rope is a string token object that contains a reference to a char array, a start position and a length. You can make string comparisons, copy the rope etc. just like with a String. Some operations may just be faster with a rope than with a String object. They also take up less memory because they do not copy the original data.

Seventh, if you need to do a lot of back and forth navigation of the data you can create even more advanced indices . VTD-XML's index contains the indentation level of an element, as well as references to the index of the next element at the same level (next sibling), the first element with a higher indentation level (first ancestor) etc. These are all just int indices added on top of the linear parser element index. Such extra indices can make navigation of the parsed data extremely fast.

Performance vs. Error Reporting

If you look at the JsonParser and JsonParser2 code, you will see that the faster JsonParser2 also has worse error reporting than JsonParser. When the tokenization and parsing phase are split in two, good data validation and error reporting is easier to implement.

Normally, this difference would spur a discussion about whether you should prioritize performance or error reporting when deciding on your parser implementation. With index overlay parsers, however, this discussion is not necessary.

Because the original data is always available in its full form in memory, you can parse the same data with both a fast and slow parser. You can start with the fast parser, and if parsing fails, you can use a slower parser to detect where the errors in the input data are. Just give the original data to the slower parser when the faster parser fails. That way you can get the best of both worlds.

Benchmark Discussion

It wouldn't be fair to compare a parser that creates an object tree from data (Gson) to a parser that just marks the index of the elements found in the data without a discussion of what is being compared.

Parsing a file inside a running application usually requires the following steps:

First the data is loaded either from disk or from the network. Second, the data is decoded, for instance from UTF-8 to UTF-16. Third, the data is parsed. Fourth, the data is processed.

In order to measure just the raw parser speed I preloaded the files to be parsed into memory; the benchmarked code did not process the data in any way. While this does benchmark just the raw parsing speeds, the performance difference does not translate one to one into increased performance in a running application. Here is why:

A streaming parser is often able to start parsing the incoming data before all of the data is loaded into memory. My JSON parser cannot do that the way it is implemented now. That means that even though it is faster in raw parsing benchmarks, in a real-life running application where my parser would have to wait for the data to load, it may not be as fast in total. This is illustrated in the diagram below:

You could probably modify my parser to be able to parse data as it is being loaded, in order to speed up the total parsing time. But that would probably decrease raw parser performance a bit. The total speed might still be better though.

Furthermore, by preloading the data into memory before running the benchmarks, I also skip the data-decoding step. Data decoding from UTF-8 to UTF-16 is not free. In a real life application you could not get around this step. Each file to parse would have to be decoded. This is the same for all parsers though. A streaming parser may be able to decode the data as it is read. An index overlay parser would be able to decode while reading the data into a buffer too.

VTD-XML and Jackson (another JSON parser) uses another technique. They do not decode the raw data at all. Instead they tokenize directly on the raw data, consuming whatever format that data is in (ASCII, UTF-8 etc.). That saves a costly decoding step, at the price of a more complex tokenizer.

In general the only way to really know which parser will be fastest in your application is to benchmark all of them on the real job you have to do with the data you need to parse.

Discussion of Index Overlay Parsers in General

One argument I have heard against index overlay parsers is that to be able to point into the original data rather than extract it into an object tree, it is necessary to keep all of the data in memory while parsing it. This can produce a memory-hit when processing big files.

In general the argument is that a streaming parser (like SAX or StAX) can parse huge files without ever having the whole file in memory. However this is only true if the data in the file can be parsed and processed in smaller chunks, where each chunk can be parsed and processed independently from other chunks. For instance, a big XML file contains a list of elements, each of which can be parsed and processed individually (like a list of log records). But, if your data can be parsed separately in independent chunks, you can implement an index overlay parser that is capable of that as well.

If the file cannot be parsed in independent chunks you will nonetheless need to extract the necessary information into some structure which can be accessed by the code processing later chunks. But if you can do that with a streaming parser, you can also do it with an index overlay parser.

Parsers that create object trees from input data often consume much larger amounts of memory with the object tree than the original data size. This is due to the memory overhead associated with an object instance, plus extra data needed to keep references between objects.

Additionally since all data needs to be in memory at one time, you need to allocate a data buffer ahead of parsing that is big enough to hold all the data. But what if you don't know how big the files are when you start parsing them?

Imagine you have a web application (or web service, or other server application) where the users can upload files. You may not know how big the files are, so how can you allocate a suitable buffer for them before the parsing starts? Well, for security reasons you should always have a maximum allowed file size. Otherwise users may be able to crash your system by uploading very large files. Or they may even write a program that pretends to be a browser uploading a file, and have that program never stop sending data to your server. You can allocate a buffer fitting the maximum allowed file size. That way your buffer will not run out of space for valid files. If it does run out of space, your user has uploaded an excessively large file anyway.

About the Author

Jakob Jenkov is an entrepreneur, writer and software developer currently located in Barcelona, Spain. Jakob learned Java in 1997, and has worked with Java professionally since 1999. He is currently focused on the concept of efficiency across a variety of domains (software, processes, businesses etc.). He holds a master of science in IT from the IT University in Copenhagen. You can read more about his work on his website.

InfoQ Software Architects' Newsletter