Posted by Charles Humble on Oct 07, 2009
Much of the emphasis in Lucene 2.9 is on performance improvements, many of which result from low-level changes to the way Lucene manages its index. A Lucene index is composed of a number of separate "segments", each stored in its own file or set of files. When you add documents to the index, new segments may be created; these are added incrementally and can later be merged. Lucene caches the field values used for sorting in its FieldCache, but loading the field cache in Lucene 2.4 and earlier is an expensive operation, particularly since 2.4 regularly reloads the entire cache. While preparing release 2.9, the Lucene team noted that segments typically change infrequently: they change when you do a merge or a delete, for example, but older segments tend to stay static. The field cache has therefore been modified so that it only reloads the parts whose segment has changed.
Earlier versions were also inefficient when loading a FieldCache over multiple segments. Version 2.9 bypasses this by managing the internal FieldCache per segment. The results of these changes are impressive. Lucid Imagination's Mark Miller ran a simple benchmark using 5,000,000 unique strings and saw Lucene 2.9 run roughly 15 times faster than Lucene 2.4:
Lucene 2.4: 150.726s
Lucene 2.9: 9.695s
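The per-segment caching idea can be illustrated with a minimal sketch. This is not Lucene's FieldCache code: the Segment and PerSegmentFieldCache classes, the reopen() method and the loads counter are all hypothetical names invented for illustration, and the sketch assumes that a segment which changes is given a new name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an index segment; not a Lucene class.
final class Segment {
    final String name;
    final String[] fieldValues; // one sortable value per document

    Segment(String name, String... fieldValues) {
        this.name = name;
        this.fieldValues = fieldValues;
    }
}

// Hypothetical cache keyed by segment name rather than by the whole index.
final class PerSegmentFieldCache {
    private final Map<String, String[]> bySegment = new HashMap<>();
    int loads = 0; // counts the expensive loads, for illustration only

    // Returns the cached values for every live segment, loading only new ones.
    List<String[]> reopen(List<Segment> segments) {
        Set<String> live = new HashSet<>();
        for (Segment s : segments) {
            live.add(s.name);
        }
        // Drop entries for segments that have been merged away.
        bySegment.keySet().retainAll(live);

        List<String[]> result = new ArrayList<>();
        for (Segment s : segments) {
            String[] values = bySegment.get(s.name);
            if (values == null) {
                values = s.fieldValues.clone(); // stands in for the costly load
                bySegment.put(s.name, values);
                loads++;
            }
            result.add(values);
        }
        return result;
    }

    public static void main(String[] args) {
        PerSegmentFieldCache cache = new PerSegmentFieldCache();
        List<Segment> index = new ArrayList<>();
        index.add(new Segment("_0", "pear", "apple"));
        index.add(new Segment("_1", "fig"));
        cache.reopen(index);                  // loads both segments
        index.add(new Segment("_2", "kiwi")); // new documents arrive in a new segment
        cache.reopen(index);                  // loads only "_2"
        System.out.println("segment loads: " + cache.loads);
    }
}
```

In this sketch the second reopen() touches only the new segment, whereas a cache keyed on the whole index would reload all three.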
Other significant performance enhancements include faster reader re-opening. Lucene 2.9 introduces a new IndexWriter.getReader() method, which returns a reader that searches the full index, including any uncommitted changes in the current IndexWriter session, providing near real-time searching. Newly merged segments can also be "warmed" so that they are ready to be used immediately, using IndexWriter.setMergedSegmentWarmer().
Another area which has seen a significant overhaul is numeric handling, particularly in the context of range queries such as "find me all CDs between 0.5 and 9.99 GBP". Because Lucene's search is largely text based, numeric handling prior to 2.9 relied on string-based encodings which operated at full precision. This often produced a large number of unique terms, which Lucene would then have to iterate over to build up a complete set of results. Many developers wrote their own custom encoders to work around this, but with version 2.9 Lucene now supports native encoding for numbers. Field and Query classes use a precision step to determine how much precision to encode when indexing and searching. This can greatly reduce the number of terms that need to be searched and can have a significant impact on query response times.
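To see why a precision step reduces the number of terms, consider a simplified sketch. It is not Lucene's own algorithm or encoding; the PrecisionStepSketch class and its cover() method are hypothetical, and the sketch assumes that prices are stored as non-negative whole pence and that each value is also indexed at coarser precisions obtained by dropping `step` low-order bits at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of precision-step range matching; not Lucene code.
final class PrecisionStepSketch {

    /**
     * Covers [min, max] (non-negative values) with aligned blocks whose sizes
     * are powers of 2^step. Each result is {shift, prefix}: a single indexed
     * term that matches every value v with (v >>> shift) == prefix.
     */
    static List<long[]> cover(long min, long max, int step, int maxShift) {
        List<long[]> terms = new ArrayList<>();
        long cur = min;
        while (cur <= max) {
            int shift = 0;
            // Grow the block while it stays aligned and inside the range.
            while (shift + step <= maxShift) {
                long size = 1L << (shift + step);
                boolean aligned = (cur & (size - 1)) == 0;
                boolean fits = cur + size - 1 <= max;
                if (!aligned || !fits) {
                    break;
                }
                shift += step;
            }
            terms.add(new long[] {shift, cur >>> shift});
            cur += 1L << shift;
        }
        return terms;
    }

    public static void main(String[] args) {
        // Prices in pence for the 0.5 to 9.99 GBP range query.
        List<long[]> terms = cover(50, 999, 4, 28);
        System.out.println("full-precision terms: " + (999 - 50 + 1));
        System.out.println("precision-step terms: " + terms.size());
    }
}
```

With a step of 4 bits, the 950 possible full-precision values in this range are covered by 50 coarser terms. A smaller step adds more terms to the index for each value; a larger one leaves more terms to visit at query time.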
In addition, version 2.9 introduces new query types with more scalable multi-term queries (wildcard, prefix, etc.), and new analysers for Persian, Arabic and Chinese. There is also improved Unicode support, a new query parser framework, and support for geospatial searches, allowing filtering and sorting of documents based on distance (for example, find all dry cleaners within 5 miles of my house). There are many more changes and improvements. A complete list can be found here.
While Lucene generally maintains full backwards compatibility between minor releases within a major version, Lucene 2.9 has a variety of breaks that are spelled out in the 'Changes in backwards compatibility policy' section of CHANGES.txt. As such, migration to 2.9 is likely to require a re-compile, proper regression testing and other appropriate due diligence. Re-compiling against 2.9 will also highlight any calls to deprecated methods, allowing developers to update their applications in preparation for version 3.0. This is advisable since Lucene 3.0 will drop support for Java 1.4 and remove all functionality marked as deprecated in the 2.9 version.
Overview of Lucene 2.9 in Ukrainian is here - www.rozrobka.com/blog/java/270.html