Recently, Human Inference and the Apache Software Foundation (ASF) announced the donation and acceptance into incubation of the MetaModel project. Previously, MetaModel had been available under the LGPL license, and governed by Human Inference’s product development, but is now being moved to the ASF, getting a new license, community and governance. So what is this project all about, and what is it useful for?
MetaModel is a Java library which aims to provide a single interface for interacting with any data store; be it a relational database, a NoSQL database, a spreadsheet file or another file format. By interacting we mean exploring metadata, querying and writing/changing the data contained in the store. Certainly any abstraction like this will leave out details and thereby there’s a risk of over-generalizing and loosing important features. You would not want the functionality of your relational SQL database reduced to only full table scan (SELECT * FROM [table]) like queries. On the other hand you would also not want to expose functionality that is unlikely to work on any other data stores than just your specific SQL server brand and version. Lastly, you would want to build upon existing common skills for interacting with data, such as SQL.
Dealing with Metadata
So what is then the approach to data store abstraction chosen by the MetaModel project? The project exposes a query model through Java interfaces (or optionally parsed from a String) that is very similar to SQL. Since the query is defined as a regular Java object, it can be interpreted easily and – depending on the underlying technology –the best strategy for actually executing it will be chosen. This means that MetaModel not only provides an interface; it also includes a full query engine which can be fitted to take care of some or all of the tasks involved in handling a query. In the case of relational JDBC databases, 99% of the query execution will still be happening in the database’s native engine. But with MetaModel you can also fire the same queries on a CSV file or an Excel spreadsheet, and thereby utilize MetaModel’s query engine to properly slice and dice the data. You won’t have to change the query at all.
Of course, this is assuming the metadata and the structure of your data stores are compatible. Different data stores have different ways of exposing or inferring their metadata. JDBC databases typically expose their metadata through the JDBC metadata API. File formats like CSV and Excel sheets are a bit less well-defined; they have their metadata explored by reading the header lines of the files. And, as the extreme example, there are several NoSQL databases that explicitly do not have metadata. MetaModel provides the option for you to specify the metadata programmatically or to infer the metadata by inspecting the first N records of the data store.
MetaModel’s most central construct is the DataContext interface, which represents the data store and is used to explore and query it. Additionally, the UpdateableDataContext sub-interface is available for writeable data stores where updates to the data can be performed. The whole library can more or less be learned using basic code-completion, once you just ensure you have a DataContext instance. Here are a couple of examples of common DataContext implementations and how to instantiate them:
// a DataContext for a CSV file
UpdateableDataContext csv = new CsvDataContext(new File(“data.csv”));
// a DataContext for an Excel spreadsheet
UpdateableDataContext excel = new ExcelDataContext(new File(“spreadsheet.xlsx”));
// a DataContext for a JDBC database (can use either DataSource or Connection)
java.sql.DataSource dataSource = …
UpdateableDataContext jdbc = new JdbcDataContext(dataSource);
// a DataContext for an XML file (where metadata is automatically inferred)
DataContext xml = new XmlDomDataContext(new File(“data.xml”));
// a DataContext for connecting to Salesforce.com’s data web services
UpdateableDataContext salesforce =
new SalesforceDataContext(username, pw, securityToken);
// a in-memory DataContext for POJOs (useful for testing and mocking)
Person record1 = ...
Person record2 = ...
TableDataProvider tableDataProvider = new ObjectTableDataProvider(
“persons”, Person.class, Arrays.asList(record1, record2));
UpdateableDataContext pojos = new PojoDataContext(“schema”, tableDataProvider);
Metadata is important for MetaModel not only for exploring the data structure, but also for defining queries. A lot of effort went into making sure that queries are safe to fire, if you’re just working with proper metadata. So before querying, the first thing you need to do as a developer is to get hold of the metadata objects. For instance, if you know there’s a table called ORDER_LINE with a price column and a order_id column, then the metadata needed for querying it is resolvable in a typically hardcoded manner (which obviously only works when you know the data store):
DataContext dataContext = ... // the DataContext object represents the ‘connection’
Table orderLines = dataContext.getTableByQualifiedLabel(“ORDER_LINES”);
Column price = orderLines.getColumnByName(“price”); Column orderId = orderLines.getColumnByName(“order_id”);
However, the API also allows you to dynamically fetch the metadata based on discovery. This is useful for applications where you wish to present the user with the available tables, columns etc. and let the user himself make choices that affect the query:
Schema[] schema = dataContext.getSchemas();
Table[] tables = schemas[0].getTables();
Column[] columns = tables[0].getColumns();
Another important aspect of MetaModel is to treat metadata, queries and other entities around data interaction as objects. A query in MetaModel is a regular Java object that you can manipulate and pass around before execution. This enables applications to create complex workflows where different pieces of the code participate in the building and optimization of the query plan without having to turn to e.g. tedious SQL string manipulation. It also helps with type-safety since e.g. the Query model is based on type-safe constructions like columns, tables, etc., instead of vague String literals.
Querying a Data Store
So, let’s see how a query in MetaModel looks like.
There are three ways you can go about firing the same query:
1. Compose it from scratch:
This is the traditional POJO oriented way of doing it. It’s quite verbose but allows for all the flexibility you want.
Query q = new Query();
q.select(SUM, price);
q.select(orderId);
q.from(orderLines);
q.groupBy(orderId);
q.setMaxRows(100);
DataSet dataSet = dataContext.executeQuery(q);
2. Use the fluent Builder API:
The Builder API was added to allow another type-safe way of querying but with less verbosity. Also this API, through the builder-pattern, provides some direction to the developer about which clauses of the query to logically fill in next. This is the preferred way of defining the query, when there’s just a single component for building it:
Query q = dataContext.query().from(orderLines)
.select(SUM, price).and(orderId)
.groupBy(orderId).maxRows(100).toQuery();
DataSet dataSet = dataContext.executeQuery(q);
3. Have it parsed from a String:
Sometimes you might want to cut corners and fall back to a more traditional SQL string approach. MetaModel can also parse queries coming from plain strings, but it comes with the risk of less type-safety since String queries can be validated only at runtime.
Query q = dataContext.parseQuery(
“SELECT SUM(price), order_id FROM order_lines GROUP BY order_id LIMIT 100”);
DataSet dataSet = dataContext.executeQuery(q);
As you can see, the end-result of all three approaches is a DataSet, an object type representing the tabular query result. Without going into too much detail with all the DataSet features, you could simply iterate through it like this:
Try {
while (dataSet.next()) {
Row row = dataSet.getRow();
System.out.println(row.toString());
}
} finally {
dataset.close();
}
Updating a Data Store
Executing updates with MetaModel is performed in a similar type-safe and metadata-driven approach. As mentioned above, not all data stores are writeable, which is why you need a DataContext object that also implements the UpdateableDataContext interface. Given that, let’s try to update order data in our example:
dataContext.executeUpdate(new UpdateScript() {
@Override
public void run(UpdateCallback cb) {
// insert a new order line
cb.insertInto(orderLines).value(orderId, 123).value(price, 395).execute();
// update the price of orderlines where order_id = 122
cb.update(orderLines).where(orderId).eq(122).value(price, 295).execute();
}
});
Notice here that the UpdateScript is the construct that sets the logical transaction boundaries. Depending on the underlying data technology, an appropriate transactional strategy will be applied; JDBC databases will apply ACID transactions, most file formats will use synchronized writing and so on. The net result is that you have a single syntax for writing data in all your data stores.
The syntax here isn’t particularly beautiful because of the anonymous inner class. Obviously this will improve with Closures in Java 8. But also, if you’re only looking to do a single operation, a couple of convenient prebuilt UpdateScript classes are available out of the box:
dataContext.executeUpdate(
new InsertInto(orderLines).value(orderId, 123).value(price, 395));
Furthermore the executeUpdate method can be used to create and drop tables, as well as delete records.
Adding Support for New Data Stores
Finally, the expert user of MetaModel might ask ‘What if I want to connect to [XYZ]?’ (where XYZ is an exotic data store that we don’t support yet). Obviously we want MetaModel to be easy to extend; that was part of the reason to make the query engine pluggable. What you need to do is to build your own implementation of the DataContext interface; but that’s not really a trivial thing to do if you start from the ground up. So we provide an abstract implementation with a number of extension points. Here’s the walk-through:
- Let your class extend the abstract class QueryPostprocessDataContext. You will see that you need to implement a few abstract methods:
- getMainSchema()
This method should be implemented to provide the schema model that your DataContext exposes. - materializeMainSchemaTable(Table, Column[], int)
This method should be implemented to provide the equivalent of a full table scan for a particular table.
- getMainSchema()
- Your DataContext is now functional and you can start using MetaModel with your exotic data store!
- But let’s also optimize it a bit! Although our DataContext is now fully functional, it might not perform greatly for certain queries since MetaModel’s query engine has to rely on the materializeMainSchemaTable(...) method as the source for handling almost any query. Here’s a couple of methods that you might additionally want to override:
- executeCountQuery(...)
Many data stores have a simple way of determining the count of records for a particular table. Since this is also a common query type, overriding this method will often help. - materializeMainSchemaTable(Table, Column[], int, int)
Often times people scan through table data in pages (from record no. X to Y). The extra int-arguments of this method allows you to optimize the way you query, such as a single page of data instead of the full table. - executeQuery(Query)
If you want to do further optimization, using WHERE or GROUP BY clauses, override this method. Beware of many corner cases though, since the Query argument is a rich object type. Good examples for inspiration here include the SalesforceDataContext class and MongoDbDataContext class found in the source code.
- executeCountQuery(...)
Final Words
In this article we have introduced MetaModel, a library providing access to various data stores, explaining how to deal with metadata, how to interrogate the store, and how to perform updates.
In the future, we intend to develop MetaModel even further in it’s new home at Apache. We’ll be adding more built-in DataContext implementations for HBase, Cassandra and other modern databases. As well as further expanding the feature set of what you can do with metadata. Some of the ideas that we are working on are richer metadata about nested structures (maps and lists, as are available in many NoSQL databases), ability to create virtual tables (similar to VIEWs, but facilitated on the client instead of the server), support for mapping POJOs to DataSet rows and plugging in more functions to the query engine.
Apache MetaModel is currently undergoing incubation at The Apache Software Foundation. If you are interested in this project, please find the mailing lists, bug tracking, etc., available on the Apache Incubator MetaModel page.
About the Author
Kasper Sørensen works as Principal Software Engineer and Product Manager at Human Inference. His professional expertise and interests are in product development for data-intensive applications. While graduating from Copenhagen Business School, he went on to found the DataCleaner and MetaModel open source projects as part of his Master’s Thesis. You can read more on his blog: Kasper’s Source.