Apache Tika 1.0 Allows Easy Text Extraction for Java
The Apache Tika project aims to provide a single API for extracting data and detecting language from arbitrary input formats, such as text documents, spreadsheets, PDFs or images. Even audio or video input formats are supported to a certain degree.
The project team released version 1.0 in early November, approximately three years after Tika left the Apache Incubator and became a subproject of Lucene. In April 2010 it became a top-level project in its own right. By providing an API and a metadata format, Tika allows users to reuse existing specialized parsers. By leveraging those third-party parsers, Tika avoids reinventing the wheel: it maps their results to its own API using a thin conversion layer, with custom parsing where needed. Developers can also provide custom parsers and thus integrate their own data formats into any application that uses Tika.
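The idea of a thin conversion layer over third-party parsers can be sketched in plain Java. This is only an illustration of the adapter pattern under stated assumptions, not Tika's actual API: `SimpleParser`, `LegacyCsvLibrary`, `CsvAdapter`, and the `cell-count` metadata key are all hypothetical names invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, NOT Tika's real API: every type and key below is invented for illustration.
public class ParserRegistrySketch {

    /** Common interface that every format-specific adapter implements. */
    public interface SimpleParser {
        /** Returns the extracted plain text and records metadata in the given map. */
        String parse(byte[] input, Map<String, String> metadata);
    }

    /** Stand-in for a third-party library with its own, incompatible API. */
    public static class LegacyCsvLibrary {
        public String[] readCells(String raw) {
            return raw.split(",");
        }
    }

    /** Thin conversion layer: maps the library's result onto SimpleParser. */
    public static class CsvAdapter implements SimpleParser {
        private final LegacyCsvLibrary library = new LegacyCsvLibrary();

        @Override
        public String parse(byte[] input, Map<String, String> metadata) {
            String raw = new String(input, StandardCharsets.UTF_8);
            String[] cells = library.readCells(raw);
            metadata.put("Content-Type", "text/csv");
            metadata.put("cell-count", String.valueOf(cells.length));
            return String.join(" ", cells);
        }
    }

    // Maps a media type such as "text/csv" to the adapter that handles it.
    private final Map<String, SimpleParser> parsersByType = new HashMap<>();

    /** Registers a (possibly custom) parser for one media type. */
    public void register(String mediaType, SimpleParser parser) {
        parsersByType.put(mediaType, parser);
    }

    /** Delegates to the registered parser, failing loudly for unknown types. */
    public String parse(String mediaType, byte[] input, Map<String, String> metadata) {
        SimpleParser parser = parsersByType.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.parse(input, metadata);
    }

    public static void main(String[] args) {
        ParserRegistrySketch registry = new ParserRegistrySketch();
        registry.register("text/csv", new CsvAdapter());

        Map<String, String> metadata = new HashMap<>();
        byte[] input = "a,b,c".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.parse("text/csv", input, metadata)); // a b c
        System.out.println(metadata.get("cell-count"));                  // 3
    }
}
```

In this sketch the calling application depends only on the common interface, so a custom format can be supported by registering one more adapter without touching the calling code.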
Parsing unknown content is simple and requires just a few lines of code:
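    // Detects the document type automatically and delegates to the matching parser
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    // byteArrayInputStream is any InputStream holding the raw document
    parser.parse(byteArrayInputStream, new WriteOutContentHandler(writer), metadata, new ParseContext());
    String content = writer.toString();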
The content is the full text of the source document and can easily be passed on to a storage engine of choice. Because full text often only makes sense together with the metadata of its source, Tika offers a simple API for that as well:
    String fileType = metadata.get(Metadata.CONTENT_TYPE);
    String author = metadata.get(Metadata.AUTHOR);
Tika can also detect the language of a given text. While this detection is still under development, it already works reasonably well for most European languages:
    LanguageIdentifier identifier = new LanguageIdentifier("Text content as String");
    String language = identifier.getLanguage();
InfoQ spoke to Chris Mattmann, Apache Tika Vice President, and author of the new Manning book "Tika in Action", about the project:
InfoQ: What was the main reason to create Tika?
Mattmann: Tika was created with the initial goal of collocating and separating out from Apache Nutch all of the relevant functionality related to content detection, analysis, language identification, metadata representation and so forth. The project idea originally came from Nutch committers Jerome Charron and myself. Jukka Zitting came along shortly thereafter and really took charge of leading the proposal through the Apache Incubator and has become Tika's champion and core contributor.
InfoQ: Do you see a trend to applications dealing with lots of different, even unknown, content?
Mattmann: I sure do. We see an increasing proliferation of files on the Internet and content types. By some counts (and I teach this to my students in my CSCI 572 Search Engines course at the University of Southern California), there are between 16k and 51k different types of files on the internet (simply by looking at file extensions and other basic information). In terms of richly curated content types, IANA (the Internet Assigned Numbers Authority) describes on the order of ~1200 content types out there. Either way, that's tons of different types that software systems and applications must understand.
InfoQ: How about integrating with an OCR library, like Tesseract OCR?
Mattmann: OCR libraries are something that Tika is certainly interested in. We've used Apache PDFBox for OCR type of functionality, but Tesseract OCR looks very interesting. And it's ALv2 licensed, which always makes it easier to integrate into Apache products.
InfoQ: Are you running compatibility tests with new parsers, or is it up to the user to upgrade and test parser libraries?
Mattmann: We get tons of questions from users on the email@example.com mailing lists regarding these types of questions. We have a fairly extensive suite of JUnit tests that test backwards compatibility for file formats and parsers, that are run by Apache's Continuous Integration and Testing system. When those don't catch regressions, we usually get user emails and we try our best to respond to them promptly.
InfoQ: We heard from a few users that error handling is still a bit tricky, for example runtime exceptions bubbling up from the parsers, or not really knowing how to react to parser-specific errors. Are there plans to improve this?
Mattmann: Sure, we're trying to work on improving Tika's error handling. As we see more use cases and so forth, developers get interested, code is developed, and the project moves forward.
InfoQ: What are your ideas for 2012? Any plans to cooperate with parser makers, especially on metadata?
Mattmann: That's a great question. We've tried our best to take broad strokes and collaborate with upstream Parser developers and communities wherever it made sense, while trying at the same time to insulate our project from detrimental changes and rapidly changing file formats and requirements. We're lucky in that many of the Tika PMC members are strong members of file format communities, e.g., Nick Burch with POI, Jukka Zitting with his work on Apache Jackrabbit and its content management, my own work with the scientific data file format community, as examples. So, we've got a pretty broad infection into these communities. Beyond making sure we stay closely integrated with the upstream Parser libraries, in terms of plans, things I'm looking forward to include:
- Tika packages for Linux systems (e.g., *.deb, *.rpm packages, etc.) and the potential to try and see if we can get Tika into a Linux distribution
- Improved support for scientific data file formats including integration with GDAL.
- Support for spatial data, and integration with GIS formats (again, e.g., those supported by GDAL)
- Support for Tika in PHP, in Python, including native bindings to other libraries. Jukka is leading the charge on proposing this.