Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ


Choose your language

InfoQ Homepage Articles Implementing High Performance Parsers in Java

Implementing High Performance Parsers in Java

Lire ce contenu en français


Note: This article is an updated version of a previous article on the same topic. The original version of the article was intended to capture some of the main points of creating a high-performance parser, but it received some criticism from readers for the perception that other parts were overlooked. The original article was completely revised and a more complete version of the associated code was created. We hope you enjoy this revision.

From time to time you may need to implement your own data or language parser in Java, for example if there is no standard Java or open source parser for that data format or language. Or, perhaps there are parsers, but they are too slow, consume too much memory, or just don’t have the features you need. Or there might be bugs in an open source parser, or the open source parser project was abandoned etc. The reason is not as important as the reality that you will be needing to implement your own parser.

When you have to implement your own parser, you want it to perform well, be flexible, feature rich, easy to use, and last but not least, easy to implement; after all, your name is on that code. In this article I will explain one way of implementing high performance parsers in Java. This method is not exclusive but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by the design of VTD-XML, the fastest XML parser for Java I have seen, being even faster than the StAX and SAX Java standard XML parsers.

Two Basic Parser Types

There are several ways to categorize parsers. Here I will just make the distinction between two basic parser types:

  • Sequential access parsers
  • Random access parsers

By sequential access I mean that the parser parses the data, turning over the parsed data to the data processor as the data is parsed. The data processor only has access to the data currently being parsed; it can’t look back to earlier data, and can’t look ahead. Such parsers are also known as event based parsers, and the SAX and StAX parsers are the most well-known examples.

A random access parser is a parser that enables the data-processing code to move back and forth (access at random) over the data being parsed. Examples of such parsers are XML DOM parsers.

Sequential access parsers let you access only the "window" or "event" that was just parsed, in the document sequence, whereas random access parsers allow you to navigate through the parsed data as you please.

Design Overview

The parser design I explain here is of the random access variety.

Random access parser implementations are often slower than sequential access parsers, because they generally build up some kind of object tree from the parsed data through which the data processing code can access that data. Creating this object tree is actually both slow in CPU time and can consume quite a bit of memory.

Instead of constructing an object tree from the parsed data, a more performant approach is to construct a buffer of indices into the original data buffer. The indices point to the start and end points of the elements found in the parsed data. Instead of accessing this data via an object tree, the data processing code accesses the parsed data directly in the buffer containing the original data. Here is an illustration of these two approaches:

For lack of a better name I call this an "Index Overlay Parser". The parser creates an index overlay on top of the original data. This is reminiscent of how a database indexes data stored on disk. It creates indices on top of the original, raw data to navigate and search through the data faster.

As I mentioned earlier, this design is inspired by VTD-XML (VTD for Virtual Token Descriptor.) Thus, you might also call this a Virtual Token Descriptor Parser. I prefer the name Index Overlay, because that is what the Virtual Token Descriptors represent - an index into the original data.

Parser Design in General

A normal design for parsers is to separate the parsing into two steps. The first step breaks the data into cohesive tokens, where a token is one or more bytes or characters occurring in the parsed data. The second step interprets the tokens and constructs larger elements based on these tokens. These two steps are illustrated here:

By elements I do not necessarily mean XML elements (though XML elements are also parser elements), but rather the larger "data elements" comprising the parsed data. In an XML document that would be XML elements, in a JSON document it would be JSON objects etc.

As an example, the string would be broken into the following tokens:

· <
· myelement
· >

Once the data is broken into tokens it is easier for the parser to make sense of them and thus determine the larger elements that these tokens comprise. The parser would then know that an XML element starts with a '<' token followed by a string token (the element name), then optionally a set of attributes, and finally a '>' token.

Index Overlay Parser Design

This is the two-step approach we will use in our parser design. The input data is first broken into tokens by a tokenizer component. The parser then parses those tokens to determine the larger element boundaries in the input data.

You can also add an optional third ‘element navigation step’ to the parsing process. If the parser constructs an object tree from the parsed data, the object tree typically contains links to navigate the tree. When we construct an element index buffer instead of an object tree, we may need a separate component to help the data processing code navigate the element index buffer.

Here is an overview of our parser design:

First we read all the data into a data buffer. In order to enable random access to the original data via the index created during parsing, all of the original data has to be available in memory.

Second the tokenizer breaks the data into tokens. The start index, end index and token type of the tokens are kept internally in a token buffer in the tokenizer. Using a token buffer makes it possible to look forwards and backwards, in such cases that your parser needs that.

Third, the parser looks at the tokens obtained from the tokenizer, validates them against their context and determines what elements they represent. The parser then constructs the element index (the index overlay) based on the tokens received from the tokenizer. The parser obtains the tokens one by one from the tokenizer. Thus, the tokenizer does not actually need to break all the data into tokens immediately. It just needs to find one token at a time.

The data processing code can navigate the element buffer, and use that to access the original data. Optionally, you may wrap the element buffer in an element navigator component, making navigating the element buffer easier.

This design does not build up an object tree from the parsed data, but it does build up a navigable structure, the element buffer, consisting of indices (int arrays) into the data buffer containing the original data. Using these indices you can navigate the data in the original data buffer.

The following sections will explain the various parts of the design in more detail.

The Data Buffer

The data buffer is a byte or char buffer containing the original data. The token buffer and element buffer contain indices into the data buffer.

In order to allow random access to the parsed data, it is necessary to have some kind of in-memory representation of it all. Instead of an object tree we use the data buffer with the raw data itself.

Having all data in memory can consume a big chunk of memory. If your data contains elements that are independent of each other, such as log records, pulling the whole log file into memory might be overkill. Instead you can pull in a chunk of the log file that contains at least one full log record. Since every log record can be fully parsed and processed independently of the other log records, you don't need the full log file in memory at the same time. I have described how to iterate a stream of data in chunks in my article Iterating Streams Using Buffers.

The Tokenizer + Token Buffer

The tokenizer breaks the data buffer into tokens. The information about the tokens is stored in the token buffer, containing the following:

  • Token position (start index)
  • Token length
  • Token type (optional)

This information is stored in arrays. Here is an example of how the code could look:

public class IndexBuffer {
    public int[] position = null;
    public int[] length = null;
    public byte[] type = null; /* assuming a max of 256 types (1 byte / type) */

As the tokenizer finds tokens in the data buffer, it will insert the position (start index) in the position array, the token length in the length array, and the token type in the type array.

If you don't use the optional token type array you can still determine the token type as needed by looking at the data in the token. This is a trade-off for performance vs. memory consumption.

The Parser

The parser is similar in nature to the tokenizer, except it takes tokens as input and outputs the element indices. Just like with tokens, an element is marked by its position (start index), length, and optionally its element type. These numbers are stored in the same structure used to store tokens.

Again, the type array is optional. If you can determine the element type easily based on the first bytes or characters of the element, you may not need to store the element types.

The precise granularity of the elements marked in the element buffer depends on the data being parsed as well as the code that needs to process the data afterwards. For instance, if you implement an XML parser you might mark each start tag, attribute and end tag as separate "parser elements".

The Element Buffer (The Index)

The parser produces an element buffer with indices into the original data. The indices mark the position (start index), length and type of the elements the parser has found in the data. You can use these indices to navigate the original data.

Looking at the IndexBuffer code above, you can see that the element buffer uses nine bytes per element; four bytes for the position, four bytes for the token length, and one byte for the token type.

You may be able to decrease the memory consumption of the IndexBuffer. For instance, if you know that the elements are never longer than 65,536 bytes, you may use an array of short instead of int to hold the token lengths. That will save you two bytes per element, bringing the memory consumption down to seven bytes per element.

Additionally, if know that the files you will parse are never longer than 16,777,216 bytes in length, you only need three bytes for the position (start index). The fourth byte of each int in the position array could then hold the element type, eliminating the need for a type array. If you have less than 128 token types, you can use seven bits for the token types instead of eight. This enables you to spend 25 bits on the position, which increases the maximum position to 33,554,432. If you have less than 64 token types, you can assign another bit to the position etc.

VTD-XML actually compacts all this information into a long to save space. You lose a bit of speed because of the extra bit manipulation needed to pack separate fields into a single int or long, but you save some memory. Again, it's a trade off.

The Element Navigator

The element navigator helps the code that is processing the data navigate the element buffer. Remember, a semantic object or element (e.g. an XML element) might consist of multiple parser elements. To ease the navigation you can create an element navigator object that can navigate the parser elements on a semantic object level. For instance, an XML element navigator might navigate the element buffer by going from start tag to start tag.

The use of an element navigator component is your choice. If you are implementing a parser for a single use in a single project, you might want to skip it. But if you’re implementing a parser with the intention of reusing it across projects, or publishing it as open source, you might want to add an element navigator component, depending on how complexity of navigating the parsed data.

Case Study: A JSON Parser

To make the index overlay parser design more tangible, I have implemented a small JSON parser in Java, based on the index overlay parser design. You can find the complete code on GitHub.

JSON is short for JavaScript Object Notation. JSON is a popular data format to exchange data between web servers and browsers via AJAX, because web browsers have built-in native support for parsing JSON into JavaScript objects. Going forward I will assume that you are familiar with JSON.

Here is a simple JSON example:

{ "key1" : "value1" , "key2" : "value2" , [ "valueA" , "valueB" , "valueC" ] } 

The JSON tokenizer will break this JSON string into the following tokens:

The underlines are there to emphasize the length of each token.

The tokenizer also determines the basic types of each token. Here is the same JSON example with the token types added:

Notice how the token types are not semantic. They only say what the basic token type is and not what they represent.

The parser interprets the basic token types and replaces them with semantic types. Here is the same JSON example but with semantic types (the parser elements) instead:

Once the parser is finished parsing the above JSON you will have an index (element buffer) consisting of the positions, lengths and element types of the elements marked above. You can then navigate the index to extract the data you need from the JSON.

The implementation in the GitHub repository contains two JSON parsers. One that splits the parsing between a JsonTokenizer and a JsonParser (as described earlier in this article), and a JsonParser2 which combines the tokenization and parsing into a single phase and class. The JsonParser2 is faster, but also harder to understand. Therefore I will take a quick look at the JsonTokenizer and JsonParser class in the following sections, but skip over JsonParser2.

(A reader of the first version of this article observed that the output from the index overlay parser was not very easy to use to extract data from the original data buffer. As mentioned earlier, that is why you might add an element navigation component. To show how such an element navigation component could look, I have added the JsonNavigator class. I will also take a quick look at that class later.)


In order to give you an idea about how the tokenization and parsing is implemented, I will show you the central code parts of the JsonTokenizer and the JsonParser. Remember, the full code is available on GitHub.

Here is the JsonTokenizer.parseToken() method which parses the next token in the data buffer:

     public void parseToken() {
       this.tokenBuffer.position[this.tokenIndex] = this.dataPosition;
       switch ([this.dataPosition]) {
         case '{': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_LEFT;
         case '}': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_RIGHT;
         case '[': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_LEFT;
         case ']': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_RIGHT;
         case ',': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COMMA;
         case ':': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COLON;
         case '"': { parseStringToken(); } break;
         case '0':
         case '1':
         case '2':
         case '3':
         case '4':
         case '5':
         case '6':
         case '7':
         case '8':
         case '9': {
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_NUMBER_TOKEN;
         case 'f': {
           if (parseFalse()) {
             this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_BOOLEAN_TOKEN;
         case 't': {
           if (parseTrue()) {
               this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_BOOLEAN_TOKEN;
         case 'n': {
           if (parseNull()) {
             this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_NULL_TOKEN;
        this.tokenBuffer.length[this.tokenIndex] = this.tokenLength;

As you can see, the code is pretty simple. First the skipWhiteSpace() method is called which skips over any white space characters that might be in the data at the current location. Second, the position of the current token (the index into the data buffer) is stored in the tokenBuffer. Third the next character is examined, and depending on what character it is (what token it is), one of the statements in the switch-case construct is executed. Finally the token length for the current token is stored.

This is really all it takes to tokenize a data buffer. Notice that once the beginning of a string token is found, the tokenizer calls the parseStringToken() method which scans through the data until the end of the string token is found. This will execute faster than trying to handle all cases inside the parseToken() method, and it is easier to implement.

The rest of the methods in JsonTokenizer are just there to assist the parseToken() method, or to move the data position (index) to the next token (first position after the current token) etc.


The primary method of the JsonParser class is the parseObject() method which looks at the token types of the tokens obtained from the JsonTokenizer and tries to locate JSON objects in the input data based on that.

Here is the parseObject() method:

     private void parseObject(JsonTokenizer tokenizer) {
       assertThisTokenType(tokenizer.tokenType(), TokenTypes.JSON_CURLY_BRACKET_LEFT);
       setElementData(tokenizer, ElementTypes.JSON_OBJECT_START);
       byte tokenType = tokenizer.tokenType();
       while (tokenType != TokenTypes.JSON_CURLY_BRACKET_RIGHT) {
          assertThisTokenType(tokenType, TokenTypes.JSON_STRING_TOKEN);
          setElementData(tokenizer, ElementTypes.JSON_PROPERTY_NAME);
          tokenType = tokenizer.tokenType();
          assertThisTokenType(tokenType, TokenTypes.JSON_COLON);
          tokenType = tokenizer.tokenType();
          switch (tokenType) {
             case TokenTypes.JSON_STRING_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_STRING);
             case TokenTypes.JSON_STRING_ENC_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_STRING_ENC);
             case TokenTypes.JSON_NUMBER_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_NUMBER);
             case TokenTypes.JSON_BOOLEAN_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_BOOLEAN);
             case TokenTypes.JSON_NULL_TOKEN: {
               setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE_NULL);
             case TokenTypes.JSON_CURLY_BRACKET_LEFT: {
             case TokenTypes.JSON_SQUARE_BRACKET_LEFT: {
          tokenType = tokenizer.tokenType();
          if (tokenType == TokenTypes.JSON_COMMA) {
             tokenizer.nextToken(); //skip , tokens if found here.
             tokenType = tokenizer.tokenType();
       setElementData(tokenizer, ElementTypes.JSON_OBJECT_END); } 

The parseObject() method expects to see a left curly bracket ( { ) followed by a string token, a colon and another string token or the beginning of an array ( [ ) or another JSON object. As the JsonParser gets these tokens from the JsonTokenizer, it stores start, length and the semantic meaning of these tokens in its own elementBuffer. The data processing code can then navigate this elementBuffer afterwards, to extract any needed data from the input data.

Seeing the core parts of the JsonTokenizer and the JsonParser classes should provide some idea of how tokenization and parsing works. To fully understand how the code works you may need to look that the complete JsonTokenizer and JsonParser implementation. They are both less than 200 lines of code each, so they should be reasonably approachable.


The JsonNavigator is an element navigator component. It helps you navigate element index created by theJsonParser and JsonParser2. The indices produced by these two components are the same, so an index from either will do. Here is a code example:

JsonNavigator jsonNavigator = new JsonNavigator(dataBuffer, elementBuffer); 

Once the JsonNavigator is created you can navigate element indexes using its navigation methods (next(),previous() etc. and you can extract data using the methods asString(), asInt() and asLong(). You can compare constant strings to the element in the data buffer using the isEqualUnencoded(String).

Using the JsonNavigator class looks pretty similar to using the Gson streaming API . Look at the gsonStreamBuildObject(Reader) method in the AllBenchmarks class, and in the parseJsonObject(JsonNavigator) method in the JsonObjectBuilder class.

They look pretty similar, do they not? Only, the parseJsonObject() method is capable of using some of the optimizations (discussed later in this article) of the JsonNavigator, like counting the primitive elements in an array, and faster string comparison for JSON field names.


VTD-XML has already done extensive benchmarking of their XML parser against StAX, SAX and DOM parsers. VTD-XML wins over them all in raw performance.

In order to establish some credibility for index overlay parser performance in general, I have benchmarked my JSON parser implementation against Gson. The first version of this article only measured the speed of parsing a JSON file and building and object via reflection with Gson. Due to comments from readers I have now expanded the benchmarks to measure Gson in four different modes:

  1. Stream through all elements in the JSON file without doing anything with the data.
  2. Stream through all elements in the JSON file and build a JsonObject from it.
  3. Parse the JSON file and build a Map from it.
  4. Parse the JSON file and build a JsonObject from it using reflection.

Keep in mind that Gson is already of a mature production quality and is well tested, with good error reporting etc. My JSON parser is only at a proof-of-concept level. The benchmarking is only done to get an indication of the difference in performance. They are not final numbers. Remember to read the discussion of the benchmarks below too.

Here are some details about how the benchmarks are structured:

  • In order to warm up the JIT, minimize one-off overheads etc. the JSON input parsing is performed 10,000,000 times for the small file, and 1,000,000 times for the medium and big file.
  • The benchmarks are repeated separately for three different files to see how the parsers do on small, medium and larger files. The file sizes are 58 bytes, 783 bytes and 1854 bytes. That means first do 10,000,000 iterations on a small file and measure that. Then on a medium file, and then on a larger file. The files are located in the data directory in the GitHub repository.
  • The file is being loaded fully into memory before parsing and measurement begins. That is done to exclude the file load time from the parsing time.
  • Each measurement of 10,000,000 iterations (or 1,000,000 iterations) runs in its own process. That means that each file is parsed in separate processes. Only one process runs at a time. Each file is measured 3 times. That is, the process parsing the file 10,000,000 times is started and stopped 3 times. The processes run sequentially, not in parallel.

Here are the execution times in milliseconds:



  Small 1

  Small 2

  Small 3

  Medium 1

  Medium 2

  Medium 3

  Large 1

  Large 2

  Large 3











JsonParser2 + Builder




















JsonParser + Builder










Gson Stream










Gson Stream + Builder










Gson Map










Gson Reflection










Here is an explanation of what the benchmarks do:



Parses file and locates indices using JsonParser2.

JsonParser2 + Builder

Parses file and builds a JsonObject from it using JsonParser2.


Parses file and locates indices using JsonParser.

JsonParser + Builder

Parses file and builds a JsonObject from it using JsonParser2.

Gson Stream

Parses file and iterates through all tokens via Gson streaming API.

Gson Stream + Builder

Parses file and builds a JsonObject from it.

Gson Map

Parses file and builds a Map from it.

Gson Reflection

Parses file and builds a JsonObject from it using reflection.

As you can see, the index overlay implementations (JsonParser and JsonParser2) are faster than Gson. Below I will discuss some of my speculations about why that is.

Performance Discussion

The main reason why Gson's streaming API is not even faster still, is because all data is extracted from the stream as it is being traversed, even if you don't need that data. Every token is turned into a String, int, double, etc., which is time consuming. That is also why parsing the JSON file and building a JsonOject from it with Gson's streaming API is almost as fast as just streaming through the elements themselves. The only addition time wise is the instantiation of the JsonObject and arrays inside the JsonObject.

The data extraction cannot account for it all, though. Building a JsonObject with JsonParser2 as almost twice as fast as building a JsonObject with Gson's streaming API. Here are a few of the performance advantages I see an index overlay parser have over a streaming parser:

First of all, if you look at the benchmarks for small and large files, it seems that Gson has some one-off initialization overhead associated with each parsing. The performance differences between JsonParser2 +JsonParser and the Gson benchmarks are bigger for smaller files. That could be due to the creation of theCharArrayReader or something similar. It could also be something internal in Gson.

Second, an index overlay parser allows you to choose exactly how much of the data you want to extract. That gives you are more fine-grained control over the performance of the parser.

Third, the JsonParser and JsonParser2 detects during tokenization if a string token has any escape characters (e.g. " \" \t \n \r ") that needs to be manually converted from UTF-8 to UTF-16. If a string token does not contain escape characters, the JsonNavigator can use a faster String creation mechanism than if they do.

Fourth, the JsonNavigator enables faster string comparisons against data in the data buffer. This is handy when you need to check if the name of a field is equal to some constant string. Using Gson's streaming API you would have to extract the field name into a String object, and compare the constant string to that String object. The JsonNavigator can compare the constant string to the chars in the data buffer directly, without creating a String object first. That saves the instantiation of a String object, and the copying of data from the data buffer into a String object which is only used for that comparison (e.g. checking if the JSON field name is equal to "key" or "name" or something else). Here is how that looks with the JsonNavigator:

if(jsonNavigator.isEqualUnencoded("fieldName")) { } 

Fifth, the JsonNavigator can look ahead in its index and count the number of elements in arrays that contain primitive values (strings, numbers, booleans, nulls etc. but not objects or nested arrays). When you do not know how many elements an array contains, normally you extract the elements and put them into a List. Once you encounter the end-of-array marker, you convert the List into an array. This means that you create an unnecessary List object. Additionally, all data extracted has to be objects to be inserted in a List, even if the array contains primitive values like ints or booleans. This creates unnecessary object creation (unnecessary auto-boxing at the least) when inserting the extracted values into the List. Again, when creating the array of primitives, all the objects have to be converted into primitives again, and inserted into the array. Here is how it looks in code using the Gson streaming API:

List<Integer> elements = new ArrayList<Integer>();
while (reader.hasNext()) {
int[] ints = new int[elements.size()];
for (int i = 0; i < ints.length; i++) {
   ints[i] = elements.get(i);
jsonObject.numberArray = ints;

When you do know how many elements an array contains, you can create the final Java array immediately, and insert the primitive values directly into the array. This saves the List instantiation and building, the auto-boxing of primitive values, and the conversion from objects back to primitive values again when inserting the values into the array. Here is how the same code looks using the JsonNavigator:

int[] ints = new int[jsonNavigator.countPrimitiveArrayElements()];
for (int i = 0, n = ints.length; i < n; i++) {
   ints[i] = jsonNavigator.asInt();;
jsonObject.numberArray = ints;

Even when you are just building List objects from JSON arrays, knowing the number of elements enables you to instantiate an ArrayList with the correct capacity from the beginning. That way you avoid penalties from the ArrayList dynamically resizing itself if you reach the limit of its default capacity. Here is how it looks in code:

List<String> strings = new ArrayList<String>(jsonNavigator.countPrimitiveArrayElements());; // skip over array start. 
while (ElementTypes.JSON_ARRAY_END != jsonNavigator.type()) {
}; //skip over array end.

Sixth, when you have access to the original data buffer you can use ropes instead of String objects in many places. A rope is a string token object that contains a reference to a char array, a start position and a length. You can make string comparisons, copy the rope etc. just like with a String. Some operations may just be faster with a rope than with a String object. They also take up less memory because they do not copy the original data.

Seventh, if you need to do a lot of back and forth navigation of the data you can create even more advanced indices . VTD-XML's index contains the indentation level of an element, as well as references to the index of the next element at the same level (next sibling), the first element with a higher indentation level (first ancestor) etc. These are all just int indices added on top of the linear parser element index. Such extra indices can make navigation of the parsed data  extremely fast.

Performance vs. Error Reporting

If you look at the JsonParser and JsonParser2 code, you will see that the faster JsonParser2 also has worse error reporting than JsonParser. When the tokenization and parsing phase are split in two, good data validation and error reporting is easier to implement.

Normally, this difference would spur a discussion about whether you should prioritize performance or error reporting when deciding on your parser implementation. With index overlay parsers, however, this discussion is not necessary.

Because the original data is always available in its full form in memory, you can parse the same data with both a fast and slow parser. You can start with the fast parser, and if parsing fails, you can use a slower parser to detect where the errors in the input data are. Just give the original data to the slower parser when the faster parser fails. That way you can get the best of both worlds.

Benchmark Discussion

It wouldn't be fair to compare a parser that creates an object tree from data (Gson) to a parser that just marks the index of the elements found in the data without a discussion of what is being compared.

Parsing a file inside a running application usually requires the following steps:

First the data is loaded either from disk or from the network. Second, the data is decoded, for instance from UTF-8 to UTF-16. Third, the data is parsed. Fourth, the data is processed.

In order to measure just the raw parser speed I preloaded the files to be parsed into memory; the benchmarked code did not process the data in any way. While this does benchmark just the raw parsing speeds, the performance difference does not translate one to one into increased performance in a running application. Here is why:

A streaming parser is often able to start parsing the incoming data before all of the data is loaded into memory. My JSON parser cannot do that the way it is implemented now. That means that even though it is faster in raw parsing benchmarks, in a real-life running application where my parser would have to wait for the data to load, it may not be as fast in total. This is illustrated in the diagram below:

You could probably modify my parser to be able to parse data as it is being loaded, in order to speed up the total parsing time. But that would probably decrease raw parser performance a bit. The total speed might still be better though.

Furthermore, by preloading the data into memory before running the benchmarks, I also skip the data-decoding step. Data decoding from UTF-8 to UTF-16 is not free. In a real life application you could not get around this step. Each file to parse would have to be decoded. This is the same for all parsers though. A streaming parser may be able to decode the data as it is read. An index overlay parser would be able to decode while reading the data into a buffer too.

VTD-XML and Jackson (another JSON parser) uses another technique. They do not decode the raw data at all. Instead they tokenize directly on the raw data, consuming whatever format that data is in (ASCII, UTF-8 etc.). That saves a costly decoding step, at the price of a more complex tokenizer.

In general the only way to really know which parser will be fastest in your application is to benchmark all of them on the real job you have to do with the data you need to parse.

Discussion of Index Overlay Parsers in General

One argument I have heard against index overlay parsers is that to be able to point into the original data rather than extract it into an object tree, it is necessary to keep all of the data in memory while parsing it. This can produce a memory-hit when processing big files.

In general the argument is that a streaming parser (like SAX or StAX) can parse huge files without ever having the whole file in memory. However this is only true if the data in the file can be parsed and processed in smaller chunks, where each chunk can be parsed and processed independently from other chunks. For instance, a big XML file contains a list of elements, each of which can be parsed and processed individually (like a list of log records). But, if your data can be parsed separately in independent chunks, you can implement an index overlay parser that is capable of that as well.

If the file cannot be parsed in independent chunks you will nonetheless need to extract the necessary information into some structure which can be accessed by the code processing later chunks. But if you can do that with a streaming parser, you can also do it with an index overlay parser.

Parsers that create object trees from input data often consume much larger amounts of memory with the object tree than the original data size. This is due to the memory overhead associated with an object instance, plus extra data needed to keep references between objects.

Additionally since all data needs to be in memory at one time, you need to allocate a data buffer ahead of parsing that is big enough to hold all the data. But what if you don't know how big the files are when you start parsing them?

Imagine you have a web application (or web service, or other server application) where the users can upload files. You may not know how big the files are, so how can you allocate a suitable buffer for them before the parsing starts? Well, for security reasons you should always have a maximum allowed file size. Otherwise users may be able to crash your system by uploading very large files. Or they may even write a program that pretends to be a browser uploading a file, and have that program never stop sending data to your server. You can allocate a buffer fitting the maximum allowed file size. That way your buffer will not run out of space for valid files. If it does run out of space, your user has uploaded an excessively large file anyway.

About the Author

Jakob Jenkov is an entrepreneur, writer and software developer currently located in Barcelona, Spain. Jakob learned Java in 1997, and has worked with Java professionally since 1999. He is currently focused on the concept of efficiency across a variety of domains (software, processes, businesses etc.). He holds a master of science in IT from the IT University in Copenhagen. You can read more about his work on his website.

Rate this Article


Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

  • Thanks for the update

    by Richard Hightower,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Well done!
    I really enjoy your article.
    This time even more.

    As we talked about I implemented a partial hierarchal index overlay. It is about 20% faster than the normal parser that I wrote except for really small files, it is quite a bit faster. I might have to try a true, full index overlay.

    Jackson is pretty darn fast. It is a faster than GSON at smaller files (or so it appears to me).

    If your numbers hold up, then this should be faster than Boon, GSON and Jackson.

    I was able to tune Boon to be faster (up to 4x, but mostly 10 to 20%) than GSON and Jackson.

    Jackson is really, really good at encoding direct byte[]. GSON is a bit better at large files and handling readers.

    I've tuned Boon to handle files form 42 bytes to 4MB faster than Jackson, and GSON for readers, input stream, byte[], char[], String and CharSequence (CharacaterBuffer, StirngBuidler, etc.). I spent some time making the I/O faster and getting it to reduce the number of buffers it creates. (GSON handles IO well.) I don't have a DOM like thing. It always parse things to a Map/List or something that looks like a Map/List which hides the partial index overlay bits or it does object serialization which was my main goal. At one point I added my own UTF-8 encoding, but have since shied away from this.

    At some point, I will pull your parser down again and try it out. I like the idea of a full index parser. Your first article inspired me to benchmark and improve Boon JSON parser, which was a rabbit hole of discovery.

    Thanks. Great article. I love the updates. Thanks for the information. Great work.

  • Re: Thanks for the update

    by Jakob Jenkov,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Thank you too! ... your input prompted me to do a whole lot more work / research / writing.

    In a real life application your code will also have to load and decode the data before parsing it. If you added the loading and decoding time to the above benchmark, the numbers would probably look less good for index overlay. Not necessarily bad, since VTD-XML is still performing quite well based on this technique, but not as impressive as when the data is fully preloaded in memory.

    Anyways, the purpose of the article was not to benchmark JSON parsers against each other, but to explain a simple method for implementing parsers which perform pretty well on average.

  • Re: Thanks for the update

    by Richard Hightower,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    I appreciate your feedback.

    The benchmark page is more like random notes than an article so it is hard to find this.... The last time I benchmarked Boon using InputStream, Reader, String, byte[], String and char[], it was faster than Jackson and GSON, and I already use Boon JSON in real life (long before I tuned it).

    I actually wrote a low level I/O lib and tuned the crap out of it so I could have a fast reader, inputStream, char[], String, etc. Anyway... Boon JSON is not ready for general consumption yet. It is used in a few places but it is no direct competitor to Jackson. I will be adding a read from file name mode, and it will be fast. Right now you give it a stream, reader, byte[], char[], String, or CharSequence and it goes to town.

    (Someone else benchmarked Boon on a 180 MB file and it was faster than any C# implementation.)

    3 warm-ups Optimization 2 2MB file

    java -jar target/microbenchmarks.jar "(.*string.*Catalog|.*byte.*Catalog|.*inputStream.*Catalog)" -wi 3 -i 5 -f 1 -t 8

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.bytes.BoonBenchmark.citmCatalog thrpt 8 5 1 567.957 28.914 ops/s
    i.g.j.bytes.GSONBenchmark.citmCatalog thrpt 8 5 1 454.070 37.305 ops/s
    i.g.j.bytes.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 444.917 37.230 ops/s
    i.g.j.bytes.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 369.067 79.432 ops/s
    i.g.j.bytes.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 310.707 30.617 ops/s
    i.g.j.inputStream.BoonBenchmark.citmCatalog thrpt 8 5 1 432.913 42.389 ops/s
    i.g.j.inputStream.GSONBenchmark.citmCatalog thrpt 8 5 1 348.263 29.406 ops/s
    i.g.j.inputStream.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 309.323 62.159 ops/s
    i.g.j.inputStream.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 350.983 21.199 ops/s
    i.g.j.string.BoonBenchmark.citmCatalog thrpt 8 5 1 589.603 32.688 ops/s
    i.g.j.string.BoonCharacterSequenceBenchMark.citmCatalog thrpt 8 5 1 590.683 22.074 ops/s
    i.g.j.string.GSONBenchmark.citmCatalog thrpt 8 5 1 419.830 40.693 ops/s
    i.g.j.string.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 306.390 15.029 ops/s
    i.g.j.string.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 318.053 16.880 ops/s
    i.g.j.string.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 355.843 22.255 ops/s

    1 warm-up Optimization 2

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.bytes.BoonBenchmark.medium thrpt 8 5 1 399243.743 62196.133 ops/s
    i.g.j.bytes.GSONBenchmark.medium thrpt 8 5 1 272056.607 81816.346 ops/s
    i.g.j.bytes.JacksonASTBenchmark.medium thrpt 8 5 1 374466.083 53820.539 ops/s
    i.g.j.bytes.JacksonObjectBenchmark.medium thrpt 8 5 1 301776.590 120239.705 ops/s
    i.g.j.bytes.JsonSmartBenchmark.medium thrpt 8 5 1 276518.383 45386.617 ops/s
    i.g.j.inputStream.BoonBenchmark.medium thrpt 8 5 1 160257.603 36129.635 ops/s
    i.g.j.inputStream.GSONBenchmark.medium thrpt 8 5 1 113724.713 56064.618 ops/s
    i.g.j.inputStream.JacksonASTBenchmark.medium thrpt 8 5 1 128157.543 134502.152 ops/s
    i.g.j.inputStream.JacksonObjectBenchmark.medium thrpt 8 5 1 126100.847 90562.800 ops/s
    i.g.j.string.BoonBenchmark.medium thrpt 8 5 1 325846.617 92375.586 ops/s
    i.g.j.string.BoonCharacterSequenceBenchMark.medium thrpt 8 5 1 466013.417 62494.732 ops/s
    i.g.j.string.GSONBenchmark.medium thrpt 8 5 1 268420.253 72867.882 ops/s
    i.g.j.string.JacksonASTBenchmark.medium thrpt 8 5 1 239763.487 119743.263 ops/s
    i.g.j.string.JacksonObjectBenchmark.medium thrpt 8 5 1 239958.903 55348.635 ops/s
    i.g.j.string.JsonSmartBenchmark.medium thrpt 8 5 1 312084.100 28782.277 ops/s
    Optmization 2 2MB

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.bytes.BoonBenchmark.citmCatalog thrpt 8 5 1 492.520 86.823 ops/s
    i.g.j.bytes.GSONBenchmark.citmCatalog thrpt 8 5 1 395.413 64.481 ops/s
    i.g.j.bytes.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 373.497 204.279 ops/s
    i.g.j.bytes.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 364.330 115.620 ops/s
    i.g.j.bytes.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 300.953 144.286 ops/s
    i.g.j.inputStream.BoonBenchmark.citmCatalog thrpt 8 5 1 398.873 85.204 ops/s
    i.g.j.inputStream.GSONBenchmark.citmCatalog thrpt 8 5 1 331.230 60.375 ops/s
    i.g.j.inputStream.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 271.857 210.069 ops/s
    i.g.j.inputStream.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 291.733 145.984 ops/s
    i.g.j.string.BoonBenchmark.citmCatalog thrpt 8 5 1 546.653 73.946 ops/s
    i.g.j.string.BoonCharacterSequenceBenchMark.citmCatalog thrpt 8 5 1 551.000 83.382 ops/s
    i.g.j.string.GSONBenchmark.citmCatalog thrpt 8 5 1 384.077 66.253 ops/s
    i.g.j.string.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 256.113 109.586 ops/s
    i.g.j.string.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 250.663 124.434 ops/s
    i.g.j.string.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 328.350 44.345 ops/s

    Optimization 1 2 MB file (From String)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.s.BoonBenchmark.citmCatalog thrpt 8 5 1 634.490 161.698 ops/s
    i.g.j.s.BoonCharacterSequenceBenchMark.citmCatalog thrpt 8 5 1 554.517 36.224 ops/s
    i.g.j.s.GSONBenchmark.citmCatalog thrpt 8 5 1 398.407 33.769 ops/s
    i.g.j.s.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 311.417 9.404 ops/s
    i.g.j.s.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 310.427 28.941 ops/s
    i.g.j.s.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 364.003 11.225 ops/s
    Optimization 1 2 MB file (from bytes)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.b.BoonBenchmark.citmCatalog thrpt 8 5 1 577.847 33.689 ops/s
    i.g.j.b.GSONBenchmark.citmCatalog thrpt 8 5 1 452.947 30.377 ops/s
    i.g.j.b.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 389.663 110.508 ops/s
    i.g.j.b.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 365.137 71.960 ops/s
    i.g.j.b.JsonSmartBenchmark.citmCatalog thrpt 8 5 1 272.030 107.993 ops/s
    Optimization 1 2 MB file (from input stream)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.inputStream.BoonBenchmark.citmCatalog thrpt 8 5 1 538.933 7.130 ops/s
    i.g.j.inputStream.GSONBenchmark.citmCatalog thrpt 8 5 1 407.733 42.164 ops/s
    i.g.j.inputStream.JacksonASTBenchmark.citmCatalog thrpt 8 5 1 379.237 34.670 ops/s
    i.g.j.inputStream.JacksonObjectBenchmark.citmCatalog thrpt 8 5 1 349.427 49.306 ops/s
    Optimization 1 webxml file (from input stream)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.inputStream.BoonBenchmark.webxml thrpt 8 5 1 153473.453 2432.691 ops/s
    i.g.j.inputStream.GSONBenchmark.webxml thrpt 8 5 1 99381.280 2997.789 ops/s
    i.g.j.inputStream.JacksonASTBenchmark.webxml thrpt 8 5 1 120798.840 1797.505 ops/s
    i.g.j.inputStream.JacksonObjectBenchmark.webxml thrpt 8 5 1 101469.583 26395.522 ops/s
    Optimization 1 webxml file (from string and bytes)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.bytes.BoonBenchmark.webxml thrpt 8 5 1 243897.070 2004.897 ops/s
    i.g.j.bytes.GSONBenchmark.webxml thrpt 8 5 1 145293.453 2244.048 ops/s
    i.g.j.bytes.JacksonASTBenchmark.webxml thrpt 8 5 1 195156.710 8402.572 ops/s
    i.g.j.bytes.JacksonObjectBenchmark.webxml thrpt 8 5 1 166488.157 21522.238 ops/s
    i.g.j.bytes.JsonSmartBenchmark.webxml thrpt 8 5 1 153111.840 9095.388 ops/s

    i.g.j.string.BoonBenchmark.webxml thrpt 8 5 1 249912.307 19361.926 ops/s
    i.g.j.string.BoonCharacterSequenceBenchMark.webxml thrpt 8 5 1 266623.943 10615.374 ops/s
    i.g.j.string.GSONBenchmark.webxml thrpt 8 5 1 132536.827 15160.406 ops/s
    i.g.j.string.JacksonASTBenchmark.webxml thrpt 8 5 1 148642.550 7133.436 ops/s
    i.g.j.string.JacksonObjectBenchmark.webxml thrpt 8 5 1 148273.823 6390.983 ops/s
    i.g.j.string.JsonSmartBenchmark.webxml thrpt 8 5 1 173956.203 3314.959 ops/s
    Optimization 1 medium

    Benchmark Mode Thr Count Sec Mean Mean error Units
    i.g.j.bytes.BoonBenchmark.medium thrpt 8 5 1 394104.150 71284.602 ops/s
    i.g.j.bytes.GSONBenchmark.medium thrpt 8 5 1 270688.050 77677.140 ops/s
    i.g.j.bytes.JacksonASTBenchmark.medium thrpt 8 5 1 380602.207 54750.541 ops/s
    i.g.j.bytes.JacksonObjectBenchmark.medium thrpt 8 5 1 343228.970 50054.425 ops/s
    i.g.j.bytes.JsonSmartBenchmark.medium thrpt 8 5 1 282828.397 31699.294 ops/s

    i.g.j.string.BoonBenchmark.medium thrpt 8 5 1 410202.460 47389.489 ops/s
    i.g.j.string.BoonCharacterSequenceBenchMark.medium thrpt 8 5 1 361550.043 25223.089 ops/s
    i.g.j.string.GSONBenchmark.medium thrpt 8 5 1 288093.323 77267.217 ops/s
    i.g.j.string.JacksonASTBenchmark.medium thrpt 8 5 1 267662.703 35289.327 ops/s
    i.g.j.string.JacksonObjectBenchmark.medium thrpt 8 5 1 251775.013 55411.458 ops/s
    i.g.j.string.JsonSmartBenchmark.medium thrpt 8 5 1 314237.510 34189.281 ops/s

    I have recently done a lot of refactoring, but it seems to get faster with each round. I am not focusing more on reuse. Somewhere along the way... I broke the pretty error messages so I need to fix that.

    Its speed has been verified but at least two other sources (one the gatling benchmark project).


    This just in from the Gatling guy from France (a.k.a. Stephane). So not just the fastest, but independently verified. :)

    Benchmark Mode Thr Count Sec Mean Mean error Units
    1 BoonCharArrayBenchmark.roundRobin thrpt 16 10 1 810895,127 102746,330 ops/s
    2 JsonSmartBytesBenchmark.roundRobin thrpt 16 10 1 582712,022 55424,984 ops/s
    3 BoonDirectBytesBenchmark.roundRobin thrpt 16 10 1 577522,193 38627,510 ops/s
    4 JsonSmartStringBenchmark.roundRobin thrpt 16 10 1 566789,030 42245,984 ops/s
    5 JacksonObjectBenchmark.roundRobin thrpt 16 10 1 554903,552 251024,921 ops/s
    6 GSONStringBenchmark.roundRobin thrpt 16 10 1 517546,880 68706,631 ops/s
    7 JacksonASTBenchmark.roundRobin thrpt 16 10 1 495469,035 299158,957 ops/s
    8 GSONReaderBenchmark.roundRobin thrpt 16 10 1 432468,228 51960,414 ops/s
    9 JsonSmartStreamBenchmark.roundRobin thrpt 16 10 1 284085,723 136069,599 ops/s
    10 JsonSmartReaderBenchmark.roundRobin thrpt 16 10 1 91780,987 24235,931 ops/s
    This from FastJson (sleepless in Hayward).

    So it looks like boon comes in 1st and 3rd on this test. But look at first again. It is way ahead of the pack. Also note that there is an Index Overlay option that is a bit faster that it is not using. :) Congrats to JsonSmart. Clearly, I need to tune the direct byte handling a bit. I think Boon can get 1st (convert to char[] and then parse), 2nd (direct bytes) and 3rd (index overlay), but no rush.

    From jsonfast...

    Rick's "Boon" small test

    Rick's "Boon" small test, slightly modified (deserializing x times the JSON contained in the boon-small.json.txt file = 79 bytes) - with POCO target (1 class):

    10,000,000 iterations: in ~ 28.1 seconds
    vs. Microsoft's JavaScriptSerializer in ~ 302.3 seconds (bad omen #1)
    vs. JSON.NET in ~ 56.5 seconds
    vs. ServiceStack in ~ 40.7 seconds
    (Which yields System.Text.Json.JsonParser's throughput : 28,090,886 bytes / second)
    the same Rick's "Boon" small test, this time with "loosely typed" deserialization (no POCO target, just dictionaries and lists - read above):

    10,000,000 iterations: in ~ 34.7 seconds
    vs. Microsoft's JavaScriptSerializer in ~ 232.2 seconds (bad omen #2)
    vs. JSON.NET in ~ 72.4 seconds
    vs. ServiceStack in... N / A
    (Which yields System.Text.Json.JsonParser's throughput : 22,739,702 bytes / second)
    Rick's original test can be found at:

    Note Rick is one of our fellows from the Java realm - and from his own comparative figures that I eventually noticed, I take it Rick's "Boon" is pretty darn fast among them guys' Java toolboxes for JSON... That'd almost make a .NET / CLR dude like me jealous of Java... ;)
    I never meant to make you jealous. :) What is this CLR thing that you speak of? :)

    The above From

    If you want to turn a JSON file into a java.util.Map, it appears that Boon is the fastest option. When I first read the article, this was not the case. I had some ideas for speeding up the JSON parsing, but never seemed to need to (it seemed fast enough). Then this article came out,

    Also Boon seems to do really fast Object Serialization (from JSON to Java only at the moment). It has been measured to be the fastest, but... the truth is all it takes is the GSON or Jackson guys an afternoon or two sitting by the warm glow of the jvisualvm window and they can perf tune them to be very fast. I mean.. there is only so many things you can do. IndexOverlay is very fast. I plan on doing a full index overlay at some point in the future.

    Like you said, avoiding buffer copies, encoding, I/O, etc. are all important components too.
    Combine a highly tuned parser with a full index overlay and you have a force to reckon with.

    Boon is only a partial index overlay parser.

  • Re: Thanks for the update

    by Richard Hightower,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Let me know if you want to collaborate on adding a full index overlay to boon.
    I have the fast encoding, I/O, reduce buffer copies, direct parsing of ints, longs, floats, doubles and dates from buffers, custom buffer classes, fast string interning, object serialization, high speed reflection lib, direct form byte on the fly encoding, lax mode, etc. written. Add the full index overlay and it should be even faster.

  • Re: Thanks for the update

    by Jakob Jenkov,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Collaborate? Why not? I can only learn a lot from that.

  • Re: Thanks for the update

    by Richard Hightower,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Check out this:

    It is a clone of this:

    (look at the number not the text because numbers say boon is faster).

    So index overlay is great for things like JsonPath, and JsonPath can use boon, jackson or json smart.

    I change the index overlay parser to not hold onto the char buffers once an expression has been applied. This allows for same damn fast test results without the worry about caching a map that is holding on to a giant buffer.

    There are two implementations. One goes chops lazily and one chops during parse. There is a 30 to 24 ratio in favor of being lazy. I imagine since this is only a partial overlay that there is another 30 to 36 or 30 to 40 ratio gain. So... your mission if you should choose it.. Help me port your full index overlay to Boon. I think there is another 25% or so gain.

  • Re: Thanks for the update

    by Richard Hightower,

    Your message is awaiting moderation. Thank you for participating in the discussion.

    Of course... it is our mission. I will help. :)

    How do you want to proceed... drop me an email. My email is on linkedin.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p