BT

Implementing High Performance Parsers in Java

Posted by Jakob Jenkov on Nov 17, 2013 |

***This article de-linked and replaced by a new version***

[Editors update "Dec 9, 2013:  The author has since published a significantly revised article that you can read here: http://www.infoq.com/articles/HIgh-Performance-Parsers-in-Java-V2. InfoQ would like to thank Jakob for responding so well to our reader's feedback

[Editors note "Nov 28, 2013:  Due to reader complaints about the validity of the parser used,the article has been removed from active listing in InfoQ's archive and available by direct link only. The author will be given an opportunity to refactor and republish]

 

From time to time you may need to implement your own data or language parser in Java, for example if th

From time to time you may need to implement your own data or language parser in Java, for example if there is no standard Java or open source parser for that data format or language. Or, perhaps there are parsers, but they are too slow, consume too much memory, or just don’t have the features you need. Or there might be bugs in an open source parser, or the open source parser project was abandoned etc. The reason is not as important as the reality that you will be needing to implement your own parser.

When you have to implement your own parser, you want it to perform well, be flexible, feature rich, easy to use, and last but not least, easy to implement; after all, your name is on that code. In this article I will explain one way of implementing high performance parsers in Java. This method is not exclusive but it is reasonably simple and achieves both high performance and a reasonably modular design. The design is inspired by the design of VTD-XML, the fastest XML parser for Java I have seen, being even faster than the StAX and SAX Java standard XML parsers.

Two Basic Parser Types

There are several ways to categorize parsers. Here I will just make the distinction between two basic parser types:

  • Sequential access parsers
  • Random access parsers

By sequential access I mean that the parser parses the data, turning over the parsed data to the data processor as the data is parsed. The data processor only has access to the data currently being parsed; it can’t look back to earlier data, and can’t look ahead. Such parsers are also known as event based parsers, like the SAX and StAX parsers.

A random access parser is a parser that enables the data processing code to move back and forth (access at random) over the data being parsed. Examples of such parsers are XML DOM parsers.

This diagram attempts to illustrate the difference between sequential and random access parsers:

Sequential access parsers let you access only the "window" or "event" that was just parsed, whereas random access parsers allow you to navigate the parsed data as you please.

Design Overview

The parser design I explain here is of the random access variety.

Random access parser implementations are often slower than sequential access parsers, because they generally build up some kind of object tree from the parsed data through which the data processing code can access that data. Creating this object tree is actually both slow in CPU time and can consume quite a bit of memory.

Instead of constructing an object tree from the parsed data, a more performant approach is to construct a buffer of indices into the original data buffer. The indices point to the start and end points of the elements found in the parsed data. Instead of accessing this data via an object tree, the data processing code accesses the parsed data directly in the buffer containing the original data. Here is an illustration of these two approaches:

For lack of a better name I call this an "Index Overlay Parser". The parser creates an index overlay on top of the original data. This is reminiscent of  how a database indexes data stored on disk. It creates indices on top of the original, raw data to navigate and search through the data faster.

As I mentioned earlier, this design is inspired by VTD-XML (VTD for Virtual Token Descriptor.) Thus, you might also call this a Virtual Token Descriptor Parser. I prefer Index Overlay, because that is what the Virtual Token Descriptors represent - an index into the original data.

Parser Design in General

A normal design for parsers is to separate the parsing into two steps. The first step breaks the data into cohesive tokens, where a token is one or more bytes or characters occurring in the parsed data. The second step interprets the tokens and constructs larger elements based on these tokens. These two steps are illustrated here:

By elements I do not necessarily mean XML elements (though XML elements are also parser elements), but rather the larger "data elements" comprising the parsed data. In an XML document that would be XML elements, in a JSON document it would be JSON objects etc.

As an example, the string <myelement> would be broken into the following tokens:

  • <
  • myelement
  • >

Once the data is broken into tokens it is easier for the parser to make sense of them and thus determine the larger elements that these tokens comprise. The parser would then know that an XML element starts with a '<' token followed by a string token (the element name), then optionally a set of attributes, and finally a '>' token.

Index Overlay Parser Design

This is the two-step approach we will use in our parser design. The input data is first broken into tokens by a tokenizer component. The parser then parses those tokens to determine the larger element boundaries in the input data.

You can also add an optional third “element navigation step” to the parsing process. If the parser constructs an object tree from the parsed data, the object tree typically contains links to navigate the tree. When we construct an element index buffer instead of an object tree, we may need a separate component to help the data processing code navigate the element index buffer.

Here is an overview of our parser design:

First we read all the data into a data buffer. In order to enable random access to the original data via the index created during parsing, all of the original data has to be available in memory.

Second the tokenizer breaks the data into tokens. The start index, end index and token type of the tokens are kept internally in a token buffer in the tokenizer. Using a token buffer makes it possible to look forwards and backwards, in such cases that your parser needs that.

Third, the parser looks at the tokens obtained from the tokenizer, validates them against their context and determines what elements they represent. The parser then constructs the element index (the index overlay) based on the tokens received from the tokenizer. The parser obtains the tokens one by one from the tokenizer. Thus, the tokenizer does not actually need to break all the data into tokens immediately. It just needs to find one token at a time.

The data processing code can navigate the element buffer, and use that to access the original data. Optionally, you may wrap the element buffer in an element navigator component, making navigating the element buffer easier.

This design does not build up an object tree from the parsed data, but it does build up a navigable structure, the element buffer, consisting of indices (int arrays) into the data buffer containing the original data. Using these indices you can navigate the data in the original data buffer.

The following sections will explain the various parts of the design in more detail.

The Data Buffer

The data buffer is a byte or char buffer containing the original data. The token buffer and element buffer contain indices into the data buffer.

In order to allow random access to the parsed data, it is necessary to have some kind of in-memory representation of it all. Instead of an object tree we use the data buffer with the raw data itself.

Having all data in memory can consume a big chunk of memory. If your data contains elements that are independent of each other, such as log records, pulling the whole log file into memory might be overkill. Instead you can pull in a chunk of the log file that contains at least one full log record. Since every log record can be fully parsed and processed independently of the other log records, you don't need the full log file in memory at the same time. I have described how to iterate a stream of data in chunks in my article Iterating Streams Using Buffers.

The Tokenizer + Token Buffer

The tokenizer breaks the data buffer into tokens. The information about the tokens is stored in the token buffer, containing the following:

  • Token position (start index)
  • Token length
  • Token type (optional)

      This information is stored in arrays. Here is an example of how the code could look:

   public class IndexBuffer {
       public int[]  position = null;
       public int[]  length   = null;
       public byte[] type     = null; /* assuming a max of 256 types (1 byte / type) */
   }

As the tokenizer finds tokens in the data buffer, it will insert the position (start index) in the position array, the token length in the length array, and the token type in the type array.

If you don’t use the optional token type array you can still determine the token type as needed by looking at the data in the token. This is a trade-off for performance vs. memory consumption.

The Parser

The parser is similar in nature to the tokenizer, except it takes tokens as input and outputs the element indices. Just like with tokens, an element is marked by its position (start index), length, and optionally its element type. These numbers are stored in the same structure used to store tokens.

Again, the type array is optional. If you can determine the element type easily based on the first bytes or characters of the element, you may not need to store the element types.

The precise granularity of the elements marked in the element buffer depends on the data being parsed as well as the code that needs to process the data afterwards. For instance, if you implement an XML parser you might mark each start tag, attribute and end tag as separate "parser elements".

The Element Buffer (The Index)

The parser produces an element buffer with indices into the original data. The indices mark the position (start index), length and type of the elements the parser has found in the data. You can use these indices to navigate the original data.

Looking at the IndexBuffer code above, you can see that the element buffer uses nine bytes per element; four bytes for the position, four bytes for the token length, and one byte for the token type.

You may be able to decrease the memory consumption of the IndexBuffer. For instance, if you know that the elements are never longer than 65,536 bytes, you may use an array of short instead of int to hold the token lengths. That will save you two bytes per element, bringing the memory consumption down to seven bytes per element.

Additionally, if know that the files you will parse are never longer than 16,777,216 bytes in length, you only need three bytes for the position (start index). The fourth byte of each int in the position array could then hold the element type, eliminating the need for a type array. If you have less than 128 token types, you can use seven bits for the token types instead of eight. This enables you to spend 25 bits on the position, which increases the maximum position to 33,554,432. If you have less than 64 token types, you can assign another bit to the position etc.

VTD-XML actually compacts all this information into a long to save space. You lose a bit of speed because of the extra bit manipulation needed to pack separate fields into a single int or long, but you save some memory. It's a trade off.

The Element Navigator

The element navigator helps the code that is processing the data navigate the element buffer. Remember, a semantic object or element (e.g. an XML element) might consist of multiple parser elements. To ease the navigation you can create an element navigator object that can navigate the parser elements on a semantic object level. For instance, an XML element navigator might navigate the element buffer by going from start tag to start tag.

The use of an element navigator component is your choice. If you are implementing a parser for a single use in a single project, you might want to skip it. But if you’re implementing a parser with the intention of reusing it across projects, or publishing it as open source, you might want to add an element navigator component, depending on how complexity of navigating the parsed data.

Case Study: A JSON Parser

To make the index overlay parser design more tangible, I have implemented a small JSON parser in Java, based on the index overlay parser design. You can find the complete code on GitHub.

JSON is short for JavaScript Object Notation. JSON is a popular data format to exchange data between web servers and browsers via AJAX, because web browsers have built-in native support for parsing JSON into JavaScript objects. Going forward I will assume that you are familiar with JSON.

Here is a simple JSON example:

  {  "key1"  :  "value1"  ,  "key2"  :  "value2"  ,  [  "valueA"  ,  "valueB"  ,  "valueC"  ]  }

The JSON tokenizer will break this JSON string into the following tokens:

The underlines are there to emphasize the length of each token.

The tokenizer also determines the basic types of each token. Here is the same JSON example with the token types added:

Notice how the token types are not semantic. They only say what the basic token type is and not what they represent.

The parser interprets the basic token types and replaces them with semantic types. Here is the same JSON example but with semantic types (the parser elements) instead:

Once the parser is finished parsing the above JSON you will have an index (element buffer) consisting of the positions, lengths and element types of the elements marked above. You can then navigate the index to extract the data you need from the JSON.

JsonTokenizer.parseToken()

In order to give you an idea about how the tokenization and parsing is implemented, I will show you the central code parts of the JsonTokenizer and the JsonParser. Remember, the full code is available on Github.

Here is the JsonTokenizer.parseToken() method which parses the next token in the data buffer:

public void parseToken() {
     skipWhiteSpace();
     this.tokenLength = 0;
	
     this.tokenBuffer.position[this.tokenIndex] = this.dataPosition;
     char nextChar = this.dataBuffer.data[this.dataPosition];
	
     switch(nextChar) {
	 case '{'   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_LEFT;
           break;
	 case '}'   :  
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_CURLY_BRACKET_RIGHT;
           break;
	 case '['   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_LEFT ;
           break;
	 case ']'   : 
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_SQUARE_BRACKET_RIGHT;
           break;
	 case ','   :
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COMMA;
           break;
	 case ':'   :  
           this.tokenLength = 1;
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_COLON;
           break;
	 case '"'   :
           parseStringToken();
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_STRING_TOKEN;
           break;
	 default    : 
           parseStringToken();
           this.tokenBuffer.type[this.tokenIndex] = TokenTypes.JSON_STRING_TOKEN;
     }
     this.tokenBuffer.length[this.tokenIndex] = this.tokenLength;
  }

As you can see, the code is pretty simple. First the skipWhiteSpace() method is called which skips over any white space characters that might be in the data at the current location. Second, the token length is set to 0. Third, the position of the current token (the index into the data buffer) is stored in the TokenBuffer. Fourth the next character is examined, and depending on what character it is (what token it is), one of the statements in the switch-case construct is executed. Finally the token length for the current token is stored.

This is really all it takes to tokenize a data buffer. Notice that once the beginning of a string token is found, the tokenizer calls the parseStringToken() method which is scans through the data until the end of a string token is found. This will execute faster than trying to handle all cases inside the parseToken() method, and is easier to implement.

The rest of the methods in JsonTokenizer are just there to assist the parseToken() method, or to move the data position (index) to the next token (first position after the current token) etc.

JsonParser.parseObject()

The primary method of the JsonParser class is the parseObject() method which looks at the token types of the tokens obtained from the JsonTokenizer and tries to locate JSON objects in the input data based on that.

Here is the parseObject() method:

private void parseObject(JsonTokenizer tokenizer) {
     assertHasMoreTokens(tokenizer);
     tokenizer.parseToken();
     assertThisTokenType(tokenizer, TokenTypes.JSON_CURLY_BRACKET_LEFT);
     setElementData     (tokenizer, ElementTypes.JSON_OBJECT_START);
     tokenizer.nextToken();
     tokenizer.parseToken();
     while( tokenizer.tokenType() != TokenTypes.JSON_CURLY_BRACKET_RIGHT) {                 
         assertThisTokenType(tokenizer, TokenTypes.JSON_STRING_TOKEN);
	 setElementData(tokenizer, ElementTypes.JSON_PROPERTY_NAME);          
         tokenizer.nextToken();
	 tokenizer.parseToken();
	 assertThisTokenType(tokenizer, TokenTypes.JSON_COLON);
	 tokenizer.nextToken();
	 tokenizer.parseToken();
	 if(tokenizer.tokenType() == TokenTypes.JSON_STRING_TOKEN) {             
             setElementData(tokenizer, ElementTypes.JSON_PROPERTY_VALUE);
	 } else if(tokenizer.tokenType() == TokenTypes.JSON_SQUARE_BRACKET_LEFT) {
	     parseArray(tokenizer);
	 }
         tokenizer.nextToken();
	 tokenizer.parseToken();
	 if(tokenizer.tokenType() == TokenTypes.JSON_COMMA) {
	     tokenizer.nextToken();  //skip , tokens if found here.             
             tokenizer.parseToken();
	 }
      }
      setElementData(tokenizer, ElementTypes.JSON_OBJECT_END);
  }

  private void setElementData(JsonTokenizer tokenizer, byte elementType) {
     this.elementBuffer.position[this.elementIndex] = tokenizer.tokenPosition();     
     this.elementBuffer.length  [this.elementIndex] = tokenizer.tokenLength();         
     this.elementBuffer.type    [this.elementIndex] = elementType;     this.elementIndex++;
  }

The parseObject() method expects to see a left curly bracket ( { ) followed by a string token, a colon and another string token or the beginning of an array ( [ ) or another JSON object. As the JsonParser gets these tokens from the JsonTokenizer, it stores start, length and the semantic meaning of these tokens in its own elementBuffer. The data processing code can then navigate this elementBuffer afterwards, to extract any needed data from the input data.

Seeing the core parts of the JsonTokenizer and the JsonParser classes should provide some idea of how tokenization and parsing works. To fully understand how the code works you may need to look that the complete JsonTokenizer and JsonParser implementation. They are both less than 115 lines of code each, so they should be reasonably approachable.

Benchmarks

VTD-XML has already done extensive benchmarking of their XML parser against StAX, SAX and DOM parsers. VTD-XML wins over them all in raw performance.

In order to establish some credibility for index overlay parser performance in general, I have benchmarked my JSON parser implementation against GSON – Google’s JSON parser. GSON creates an object tree from a JSON input (String or stream).

Keep in mind that GSON is fairly mature production quality, tested, with good error reporting etc. My JSON parser is only at a proof-of-concept level. The benchmarking is only done to get an indication of the difference in performance. They are not final numbers. Remember to read the discussion of the benchmarks below too.

Here are some details about how the benchmarks are structured:

  • In order to warm up the JIT, minimize one-off overheads etc. the JSON input parsing is performed 10,000,000 times.
  • The benchmarks are repeated separately for three different files to see how the parsers do on small, medium and larger files. The file sizes are 64 bytes, 406 bytes and 1012 bytes. That means first do 10,000,000 iterations on a small file, and measure that. Then on a medium file, and measure that. And then on a larger file and measure that.
  • The file is being loaded fully into memory before parsing and measurement begins. That is done to exclude file-load time from the parsing time.
  • Each measurement of 10,000,000 iterations runs in its own process. That means that each file is parsed in separate processes. Only one process runs at a time.
  • Each file is measured 3 times. That is, the process parsing the file 10,000,000 times is started and stopped 3 times. The processes run sequentially, not in parallel.

The benchmark table contains three columns:

1. Simple iteration of the raw data buffer
2. JSON Parser
3. GSON

The first column is the simple iteration of all of the data in the raw data buffer. This number is only there to signal the lower limit; the minimum time theoretically possible to process all the data. Of course no parser will reach this speed, but the number is interesting to see how far off a parser is from the raw iteration speed. The second column is my JSON parser. The third column is Google's GSON parser.

Here are the times in milliseconds to perform the 10.000.000 parsings of the 3 files (64 bytes, 406 bytes, 1012 bytes):

File

Run  

Iteration  

JSON Parser  

GSON

Small

1

2341

69708  

91190

Small

2

2342

70705  

91308

Small

3

2331

68278  

92752

Medium

1

13954

122769  

314266

Medium

2

13963

131708  

316395

Medium

3

13954

132277  

323585

Big

1

33494

239614  

606194

Big

2

33541

231866  

612193

Big

3

32462

232951  

618212

As you can see, the index overlay implementation is significantly faster than GSON (an object tree JSON parser). Of course this was expected, but now you can get an idea of what the performance difference is.

Note that all benchmark processes were very stable in their memory consumption during execution. GSON did not steadily increase its memory consumption despite the many object trees created. The memory consumption of the index overlay parser was also stable, and about 1mb lower than that of the GSON benchmarks. Some of that might be due to the larger code base in GSON loaded into the JVM.

Benchmark Discussion

Now it wouldn’t be fair to compare a parser that creates an object tree from data (GSON) to a parser that just marks the index of the elements found in the data without a discussion of what is being compared.

Parsing a file inside a running application usually requires the following steps:

First the data is loaded either from disk or from the network. Second, the data is parsed. Third, the data is processed.

In order to measure just the raw parser speed I preloaded the files to be parsed into memory, and the benchmarked code will not process the data in any way. While this does benchmark just the raw parsing speeds, the performance difference does not translate one to one into increased performance in a running application. Here is why:

A streaming parser is often able to start parsing the incoming data before all of the data is loaded into memory. My JSON parser cannot do that the way it is implemented now. That means, that even though it is faster in raw parsing benchmarks, in a real life running application where my parser would have to wait for the data to load, it may not be as fast in total. This is illustrated in the diagram below:

You could probably modify my parser to be able to parser data as it is being loaded, in order to speed up the total parsing time. But that would probably decrease raw parser-performance a bit. The total speed might still be better, though.

Similarly, my JSON parser does not do anything with the parsed data. If you need to extract a lot of that data into Strings, then GSON will have done some of the work for you already, since it creates an object tree from the parsed data. This is illustrated in the diagram below:

When using GSON some of the data extraction needed during data processing might already have been done for you, meaning that the data processing step might be faster than if it also had to include data extraction (e.g. into String objects).

So, to really measure the impact on your application, you have to measure the use of different parsers in your application. I am sure you will see a speedup from using an index overlay parser, but exactly how much it will be is hard to say.

Discussion of Index Overlay Parsers in General

One argument I have heard against index overlay parsers is that to be able to point into the original data rather than extract it into an object tree, it is necessary to keep all of the data in memory while parsing it. This is can produce a memory-hit when processing big files.

In general the argument is that a streaming parser (like SAX or StAX) can parse huge files without ever having the whole file in memory. However this is only true if the data in the file can be parsed and processed in smaller chunks, where each chunk can be parsed and processed independently from other chunks. For instance, a big XML file contains a list of elements, each of which  can be parsed and processed individually (like a list of log records). But, if your data can be parsed separately in independent chunks, you can implement an index overlay parser that is capable of that as well.

If the file cannot be parsed in independent chunks you will anyways have to extract the necessary information into some structure which can be accessed by the code processing later chunks. But if you can do that with a streaming parser, you can also do it with an index overlay parser.

Parsers that create object trees from input data often consume much larger amounts of memory with the object tree than the original data size. This is due to the memory overhead associated with an object instance, plus extra data needed to keep reference between objects.

Additionally since all data needs to be in memory at one time, you need to allocate a data buffer ahead of parsing that is big enough to hold all the data. But what if you don't know how big the files are when you start parsing them?

Imagine you have a web application (or web service, or other server application) where the users can upload files. You may not know how big the files are, so how can you allocate a suitable buffer for them before the parsing starts? Well, for security reasons you should always have a maximum allowed file size. Otherwise users may be able to crash your system by uploading very large files. Or they may even write a program that pretends to be a browser uploading a file, and have that program never stop sending data to your server. You can allocate a buffer fitting the maximum allowed file size. That way your buffer will not run out of space for valid files. If it does run out of space, your user has uploaded an excessively large file anyway.

About the Author

Jakob Jenkov is an entrepreneur, writer and software developer currently located in Barcelona, Spain. Jakob learned Java in 1997, and has worked with Java professionally since 1999. He is currently focused on the concept of efficiency across a variety of domains (software, processes, businesses etc.). He holds a master of science in IT from the IT University in Copenhagen. You can read more about his work on his website.

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Parser generators by Brian Oxley

Good particle. Would you follow up with a discussion of parser generators?

The last time I wrote a parser by hand was as an exercise in the early 90s. Since I have used parser generators for any regular syntax. Java has several fine generators such as JavaCC, bison/flex do in a pinch.

A reference:

en.wikipedia.org/wiki/Comparison_of_parser_gene...

Re: Parser generators by Russell Leggett

Yeah, one more vote for parser generators. I've used ANTLR for several small languages and found it to be pretty easy to do some powerful things. Combined with ANTLRworks, I have a hard time imagining a much better experience if you're just getting started. That said, I can see how the techniques here might get you some better performance in exchange for putting in a lot more work.

Reinventing the wheel! by Sila Kayo

Another vote for parser generators.
I mean, nice article but I still don't understand why someone needs to write a parser nowadays.
With existing parser generators like ANTLR (for example), writing your own parser is IMHO a waste of time.
In cases where speed is critical just use a combination of ANTLR and StringTemplate for offline parsing + code generation into target language.

Let's not forget that functional languages are ideally suited for writing parsers by Faisal Waris

Writing parsers for mini languages is par for the course in software development.

Haskell's Parsec (www.haskell.org/haskellwiki/Parsec) and it's derivatives allow one to model the grammar with functions in the host language itself (you don't need a separate language do describe the grammar).

F# has Active Patterns (in addition to a Parsec derivative) which is an innovative way of matching patterns and binding values. You can also 'see the grammar' in the functions.

Here is an HTML parser based on Active Patterns in only 140 lines of code: fshtml.codeplex.com/.

And a json module, including a parser based on Active Patterns: fsjson.codeplex.com/

Many more examples as F# snippets site: www.fssnip.net/categories/Parsing

parser generators by gong gara

This is a nice particle.I think that We dont need to develope the new parser gernerator,Because in the fast developing world,there are many high-performance flexable better parsers for example Javacc Antlr .

If you have a little more leeway by Prasan Samtani

I'd urge a look at parser combinators in Scala as well. It makes implementing a language dead simple.

Your parser is broken by Richard Hightower

Like you said. You don't extract the data or parse the numbers or..... You also do not properly encode the strings or the keys. You leave in \n, \t, \b etc. Also your parser fails on many of the sample files on json.org. If you just encoded the strings properly, you would loose to GSON. I know. I benchmarked it. Then if actually extract the data from that unusable API, your performance is 3x worse than GSON.

VTD-XML is brilliant for performance, less so for code readability by Chris Melikian

VTD is a great piece of software. We reduced a job that took 2.5 hrs using a SAX parser down to under 2 minutes using VTD on a 250MB XML file. It's very memory efficient too. The code isn't particularly nice to work with and unit testing can be hard. There were some final classes and variables in the code which we had to fork to get mocking to work correctly but all in all, very pleased with it!

Some more feedback by Richard Hightower

You did not include the JSON file on github that you use for the benchmark.
When I downloaded some sample json files from github, your parser could not parse them. But to be fair, neither did GSON. And to more than fair, my parser, which I wrote for fun was able to parse more JSON files than GSON and yours, but also failed on some of the JSON files. I would have been devastated by this, because unlike your parser, I have hundred of compliance tests that test for all sorts of edge cases, and since I have used mine in production, I have a few bug test cases. To put it plainly, it handles JSON files better than GSON which is much older and much more mature, yet I would never publish benchmarks until mine worked against all of the sample JSON files on json.org.

Your parser does not encode the JSON strings, which would immediately give your parser an unfair advantage. It is not about comparing apples to apple or apples to oranges it is like comparing a football stadium to a wood tick. It just does not even make sense.
You also used the part of GSON that uses reflection to populate an object so not only does it actually not turn the numbers into numbers but you compare yours which merely tracks the indexes of where stuff is to GSON which is taking a JSON stream and turning it into a Java object. This is a lot less than fair.

I then created a very, very simple JSON file, and your parser failed. I had to keep tweaking the JSON until your parser was able to handle something.

Then... should I stop now.. Then I actually tried to use your parser to access the data that it parsed... HOLY H3LL! Are you kidding me? You are comparing this to GSON. Really?

You make some good points in the article, but way to early to publish a benchmark.

Re: Some more feedback by Faisal Waris

I had the same experience - both this parser and GSON don't parse some common JSON files (which was surprising).

I looked at the code and the techniques used. This parser is indeed very fast. It can be corrected to handle the json better without really slowing it down much. There is value in understanding the techniques used.

However I will have to call it a "half parser" as much of the work of obtaining useful values from the 'parsed' data is still to be done. Inferring hierarchical structure and obtain usable values for strings and numbers is left for later.

Re: Some more feedback by Richard Hightower

Of the three parser (his, GSON, mine and mine after I refactored it to beat GSON), here are the scores.


Article Code: 11,489 miliseconds
GSON : 8,487 miliseconds
Boon v2: 7,910 miliseconds
Boon v1: 10,913 miliseconds


Now keep in mind that his parser is not actually doing real JSON parsing because it is not encoding the JSON string.

It is 10 percent slower than my Boon parser, and if you don't deliberately hamstring GSON by making it do reflection into an object, GSON is faster too.


I refactored my parser to run faster after this article. I was planning on refactoring it, but this article made me both refactor it and improve the tests. :)


JSON compliance. My earlier complaint about GSON was an error it seems. I created a compliance tests based on the examples at json.org.

Here are the results:

testing /Users/rick/github/parsers-in-java/data/actionLabel.json
BOON 1 PASSED /Users/rick/github/parsers-in-java/data/actionLabel.json
BOON 2 PASSED /Users/rick/github/parsers-in-java/data/actionLabel.json
GSON # PASSED /Users/rick/github/parsers-in-java/data/actionLabel.json
INFO Q FAILED /Users/rick/github/parsers-in-java/data/actionLabel.json

testing /Users/rick/github/parsers-in-java/data/menu.json
BOON 1 PASSED /Users/rick/github/parsers-in-java/data/menu.json
BOON 2 PASSED /Users/rick/github/parsers-in-java/data/menu.json
GSON # PASSED /Users/rick/github/parsers-in-java/data/menu.json
INFO Q FAILED /Users/rick/github/parsers-in-java/data/menu.json

testing /Users/rick/github/parsers-in-java/data/sgml.json
BOON 1 PASSED /Users/rick/github/parsers-in-java/data/sgml.json
BOON 2 PASSED /Users/rick/github/parsers-in-java/data/sgml.json
GSON # PASSED /Users/rick/github/parsers-in-java/data/sgml.json
INFO Q FAILED /Users/rick/github/parsers-in-java/data/sgml.json

testing /Users/rick/github/parsers-in-java/data/webxml.json
BOON 1 PASSED /Users/rick/github/parsers-in-java/data/webxml.json
BOON 2 PASSED /Users/rick/github/parsers-in-java/data/webxml.json
GSON # PASSED /Users/rick/github/parsers-in-java/data/webxml.json
INFO Q FAILED /Users/rick/github/parsers-in-java/data/webxml.json

testing /Users/rick/github/parsers-in-java/data/widget.json
BOON 1 PASSED /Users/rick/github/parsers-in-java/data/widget.json
BOON 2 PASSED /Users/rick/github/parsers-in-java/data/widget.json
GSON # PASSED /Users/rick/github/parsers-in-java/data/widget.json
INFO Q FAILED /Users/rick/github/parsers-in-java/data/widget.json

The code from this article could not parse a single JSON example form json.org.

Not one!

Anyway... If you create a JSON file simple enough, you can get it to parse something. The article has some good ideas, but it is a bit less than baked.



json.org/example

Re: Some more feedback by Richard Hightower

github.com/RichardHightower/parsers-in-java

I forked it. Added Jackson to the mix.

GSON: 8,334 mili second
JACKSON: 7,156
Boon 2: 2,645
Boon 1: 3,799
InfoQ : 11,431


Parse times for large json file 1,000,000 runs:

Boon 2: 15,543 mili second
Boon 1: 19,967
JACKSON: 18,985
InfoQ: ParserException
GSON: 25,870


Anyway.. It was a good exercise. I got very familiar with my profiler (jvisualvm). I improved the hell out of the Boon JSON compliance. I added some experimental things to the Boon 2 JSON parser so now it is 3x faster than Jackson and GSON (well for the sample JSON form json.org). It was much slower. I thought it was fast until this article inspired me to test its speed. Now it is pretty fast.

I thought the Boon parser would do much better than it initially did and this article inspired me to tune it. And there were some things in this article (some ideas) that i was already thinking about adding (but in a more useable fashion).

rick-hightower.blogspot.com/2013/11/benchmark-f...

Now I am left wondering why is his parser this slow. It shouldn't be. I almost want to tune it and fix it.

Sorry for the late response. by Jakob Jenkov

Hi everyone, I have been pretty occupied with moving from one apartment to another, and I have not received any notifications about the latest comments (until I got an email summary late friday), so I haven't had a chance to respond to them. I will do that over the weekend though (I am the author of the article).

json.org examples should work now. by Jakob Jenkov

The latest version of the parser on GitHub earlier should be able to parse all 5 example files from json.org/example . This version also supports parsing numbers and booleans ("null" is still not accepted). I have added a very simple JsonOrgExamplesTest that parses all 5 files without throwing any exceptions. The test does not verify that the parser also finds the correct tokens. It merely checks that the parsing does not throw exceptions.

I have also added the 5 examples from json.org/example to the GitHub repository in the "data/json-org" directory.

The 3 files used in the benchmark are added to the GitHub repository by Jakob Jenkov

I have now added the 3 files I used in the benchmark to the "data" directory in the GitHub repository. The 3 files only contain objects, arrays and string values. There are no number values or boolean values. Of course it would make sense to add that to the benchmark, but finding the start and end of numbers and booleans should not be significantly faster or slower than finding the start and end of a quoted string. Alas, is should be pretty easy to add some numbers and booleans to the benchmark to verify it.

I will respond to the rest of the comments tomorrow (there is still a lot of points to respond to).

How does Regex deteriorate the performance by Tarun Saha

Hi, I was wondering to use RegEx to parse the XML input.
Please let me know how using RegEx would be beneficial or loss.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

16 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT