BT

Infinispan's GridFileSystem - An In-Memory Grid File System

Posted by Bela Ban and Manik Surtani on Jun 25, 2010 |

Introduction

Infinispan, which is a successor to JBoss Cache caching framework, is an open source data grid platform that makes use of distributing state across nodes in a cluster. GridFileSystem is a new, experimental API that exposes an Infinispan-backed data grid as a file system. The API works as an extension to the JDK's File, InputStream and OutputStream classes: specifically, GridFile, GridInputStream and GridOutputStream. A helper class, GridFilesystem, is also included in the framework. This API is available in Infinispan version 4.1.0 (from 4.1.0.ALPHA2 onwards).

GridFilesystem includes 2 Infinispan caches - one for metadata (typically replicated) and another for the actual data (typically distributed). The former is replicated so that each node has metadata information locally and would not need to make RPC calls for tasks like listing the files. The latter is a distributed cache since this is where the bulk of storage space is used up, and a scalable mechanism is needed to store the data. Files themselves are chunked and each chunk is stored as a cache entry.

The feature we're focusing on in the article is the distributed mode of Infinispan. This mode adds “distribution” based on consistent hashing. JBossCache framework only supports the “replication” mode (where every node in a cluster replicates everything to every other node).

Replication is best used in smaller clusters, or when the data stored on every node is relatively small; since every node replicates its data to every other node in the cluster, the average amount of data stored on each node is a function of the cluster size and data size, and therefore might become too large for a single node to store in memory. The advantage of replication is that reads are always local, as everybody has the data, and there is no re-balancing going on when a new node joins or an existing node leaves the cluster.

On the other hand, the in-memory grid file system is a better solution if you have a large data set that needs to be accessed quickly, and you cannot afford to hit the disk (e.g. database) to retrieve the data.

In a previous article, we discussed ReplCache, which is an implementation of a grid data container using consistent hash based distribution. In some ways, ReplCache served as a prototype for Infinispan's distributed mode.

In Infinispan, data can be stored in the grid with or without redundancy. For example, data D can be stored in the grid only once by using the distributed cache mode, and setting numOwners to 1. In this case, Infinispan picks a server to store D on, based on a consistent hash algorithm. If we set numOwners to 2, then Infinispan picks 2 servers to store D on, and so on.

The advantage of Infinispan is that it provides the aggregated memory of the grid. For example, if we have 5 hosts with 1GB of heap each, then we have an aggregated memory of 5GB if numOwners is set to 1 – discounting overhead of course. Even with some redundancy – say, setting numOwners to 2 – we have 2.5GB at our disposal.

There is a problem, however: if we have many data elements which are 1K in size, and only a few which are 200MB, then we have an uneven distribution. Some servers will have their heap almost used up because they're storing the 200MB data elements, and others might be almost empty.

An additional problem is if a data element is larger than the available heap on a given server: for example, if we want to store an element into the grid whose size is 2GB, then we'll fail because the heap is only 1GB and passing a value to Infinispan's Cache.put(K, V) API will fail with an OutOfMemoryException error.

To solve these problems, we can divide a data element into chunks and distribute the chunks across the cluster. Let's say a chunk is 8K: if we divide the 2GB data element into 8K chunks, we'll end up with 250,000 chunks of 8K in size that are stored in the grid. Storing 250,000 equally sized chunks in a grid is likely to result in a more even distribution than storing a few 200MB data elements.

However, we don't want to burden the application programmer with having to divide their data into chunks when writing data, and putting the chunks back together into entire data elements when reading data. Besides, this approach would still require the data elements to be in memory, which won't work.

To overcome this, we picked the concept of streams: a stream (input or output) only deals with a subset of the data of an element. For example, rather than having to write 2GB of data into the grid, we iterate through an input file (say in steps of 50K) and write the read data into the grid. This way, only 50K worth of data has to be kept in memory at any given time.

So now an application programmer can write the (pseudo) code to store a data element of 2GB into the grid as shown in Listing 1 below:

Listing 1: Code to store a data element into the grid.

InputStream in=new FileInputStream(“/home/bela/bond.iso”);
OutputStream out=getGridOutputStream(“/home/bela/bond.iso”);
byte[] buffer=new byte[50000];
int len;
while((len=in.read(buffer, 0, buffer.length)) != -1) {
    out.write(buffer, 0, len);
}
in.close();
out.close();

Similar code can be written to read data out of the grid.

The advantages of using a stream interface are:

  • Only a small buffer is needed to read or write data. This is better than having to create a byte[] buffer of 2GB for the default put() or get() of Infinispan.

  • The distribution of data over a grid is more even. With a good consistent hashing function, all servers should have more or less the same amount of data stored locally.

  • We are able to store files that are greater in size than any given node's JVM heap.

  • The application programmers don't have to write the code for chunking the data.

Architecture

The grid file system needs to not only store chunks, but also metadata information, such as directories, files, file sizes, last modification times and so on.

Since metadata is critical, but also small, we decided to store it on every server in a grid. Therefore, we need one Infinispan cache for all of the metadata, using replicated mode rather than distributed mode (copies available locally on every node), and one Infinispan cache configured in distributed mode with desired numOwners for the data chunks.

Note that Infinispan doesn't create two JGroups stacks and channels, but shares a single JGroups channel between both metadata and chunk instances, provided they were created using the same CacheManager.

Metadata cache

The metadata cache contains the full path names of directories and files as keys and metadata (e.g. file length, last modification date, and whether the key is a file or a directory) as values.

Whenever a new directory or file is created, a file is deleted, or whenever a file is written to, the metadata cache is updated. For example, when we write to a file, the new length and the last modification timestamp are updated in the metadata cache.

The metadata cache is also used for navigation, for example to list all the files and directories in directory “/home/bela/grid”. Since changes to the metadata cache are replicated to every server, a read is always local and doesn't require network communication.

Chunk cache

The chunk cache stores the individual chunks. Keys are the chunk names and value are byte[] buffers. A chunk name is constructed by appending “.#<chunk number>” to the full path name of the file, e.g. “/home/bela/bond.iso.#125”.

With a chunk size of 4000, writing a file “/home/bela/tmp.txt” of 14K would generate chunks

  1. /home/bela/tmp.txt.#0 (4000 bytes) [0 - 3999]

  2. /home/bela/tmp.txt.#1 (4000 bytes) [4000 - 7999]

  3. /home/bela/tmp.txt.#2 (4000 bytes) [8000 - 11999]

  4. /home/bela/tmp.txt.#3 (2000 bytes) [12000 - 13999]

The computation of the correct chunk is a function of the current read or write pointer and the chunk size. For example, reading 1000 bytes at position 7900 would read 99 bytes from chunk #1 and 901 bytes from chunk #2.

The chunk name (“/home/bela/tmp.txt.#2”) is used to pick the server to write the chunk to, or to read it from, using consistent hashing.

API

The GridFileSystem framework is implemented with 4 classes:

The entry point into the system is GridFilesystem: it is used to create instances of GridFile, GridOutputStream and GridInputStream. The main methods of this class are shown in Listing 2.

Listing 2: GridFileSystem class major methods

public GridFilesystem(Cache<String, byte[]> data, Cache<String, 
                      GridFile.Metadata> metadata, int default_chunk_size) {
    ...
}

public File getFile(String pathname) {
    ...
}

public OutputStream getOutput(String pathname) throws IOException {
    ...
}

public InputStream getInput(String pathname) throws FileNotFoundException {
    ...
}

The constructor takes two (fully created and running) Infinispan caches, the first for the data chunks and the second for the metadata. The default_chunk_size parameter sets the default for the default chunk size. Listing 3 below shows the code on how to create a GridFilesystem object.

Listing 3: Code to create a GridFileSystem object

Cache<String,byte[]> data;
Cache<String,GridFile.Metadata> metadata;
GridFilesystem fs;

data = cacheManager.getCache(“distributed”);
metadata = cacheManager.getCache(“replicated”);
data.start();
metadata.start();

fs = new GridFilesystem(data, metadata, 8000);

Method getFile() can be used to grab a handle to a file, which can be used to list files, create new files, or create directories. The getFile() method returns an instance of GridFile which extends java.io.File, which overrides most (but not all) of the methods of File class. The code snippet using the getFile() method is shown in Listing 4.

Listing 4. GridFileSystem getFile() method example.

// create directories
File file=fs.getFile(“/home/bela/grid/config”);
fs.mkdirs(); // creates directories /home/bela/grid/config

// List all files and directories under “/usr/local”
file=fs.getFile(“/usr/local”);
File[] files=file.listFiles();

// Create a new file 
file=fs.getFile(“/home/bela/grid/tmp.txt”);
file.createNewFile();

Method getOutput() returns an instance of GridOutputStream, and can be used to write data into the grid. The following code, in Listing 5, shows how to copy a file from the local file system to the grid (error handling omitted):

Listing 5: GridFileSystem getOutput() method example.

InputStream in=new FileInputStream(“/home/bela/bond.iso”);
OutputStream out=fs.getOutput(“/home/bela/bond.iso”); // same name in the grid
byte[] buffer=new byte[20000];
int len;
while((len=in.read(buffer, 0, buffer.length)) != -1) {
    out.write(buffer, 0, len);
}
in.close();
out.close();

And the method getInput() creates an instance of GridInputStream, which can be used to read data from the grid. Listing 6 below shows an example of getInput() method call.

Listing 6: GridFileSystem getInput() method example.

InputStream in=in.getInput(“/home/bela/bond.iso”);
OutputStream out=new FileOutputStream(“/home/bela/bond.iso”); // local name same as grid
byte[] buffer=new byte[20000];
int len;
while((len=in.read(buffer, 0, buffer.length)) != -1) {
    out.write(buffer, 0, len);
}
in.close();
out.close();

Exposing the Grid File System with WebDav

From version 4.1.0.ALPHA2 onwards, Infinispan ships with a GridFileSystem WebDAV demo. This demo, available from the downloads page on the website, is compiled as a self-contained WAR file, from the latest Infinispan distribution. Deploy the WAR file in your favorite servlet container, and you will be able to mount the file system using the WebDAV protocol, on http://YOUR_HOST/infinispan-gridfs-webdav/.

There is another demo, where 2 JBoss AS instances are started up running the infinispan-gridfs-webdav demo web application, and the WebDAV instances are mounted as remote drives. Files are copied to them, and high availability is demonstrated as one instance is shut down while files are still read from the second instance.

Monitoring

The monitoring support provided by Grid File System includes JMX and/or JOPR / JON, via the underlying Infinispan framework.

JMX

JMX reporting can be enabled at 2 different levels: CacheManager and Cache. CacheManager governs all the cache instances that have been created from it. The cache level management includes the information generated by individual cache instances.

JOPR

The other way to manage multiple Infinispan instances is to use JOPR which is JBoss' enterprise management solution. JOPR's agent and auto discovery capabilities help with the monitoring of both Cache Manager and Cache instances. With JOPR, administrators have access to graphical views of key run-time parameters or statistics and can also be notified if these exceed or go below certain limits.

Conclusions

The article discussed how to use GridFS to store large amounts of data in a grid via a new streaming API, which makes use of Infinispan. The value-add is the ability to chunk up data and store them in a grid, with varying degrees of redundancy, which allows for processing of very large data sets entirely in memory.

Grid File System is a prototype which doesn't have all of the java.io.* functionality implemented yet, but it currently includes the most important methods. For more information, refer to the Infinispan documentation on this API.

References

GridFileSystem: http://community.jboss.org/wiki/GridFileSystem

ReplCache: http://www.jgroups.org/javagroupsnew/docs/replcache.html

Infinspan: http://www.infinispan.org

Enabling distributed mode: http://docs.jboss.org/infinispan/4.0/apidocs/config.html#ce_default_clustering

About the Authors

Bela Ban completed his PhD at the University of Zurich, Switzerland. After some time at IBM Research, he did a post-doc at Cornell. Then he worked on NMS/EMS for Fujitsu Network Communications in San Jose, California. In 2003, he joined JBoss to work full-time on open source. Bela manages the Clustering Team at JBoss and created and leads the JGroups project. Bela's interests include network protocols, performance, group communication, trail running, biking and beerathlon. When not hacking code he spends time with his family.

Manik Surtani is the project lead of Infinispan caching framework.

Hello stranger!

You need to Register an InfoQ account or to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Use case? by Ashwin Jayaprakash

Looks interesting, but why would any one want to store such a large file on a grid and hold it in memory instead of using HDFS or Cassandra?

Do you support RandomAccess? That might be somewhat useful.

Re: Use case? by Bela Ban

Looks interesting, but why would any one want to store such a large file on a grid and hold it in memory instead of using HDFS or Cassandra?

Do you support RandomAccess? That might be somewhat useful.


The point would be to access the file faster, e.g. for map-reduce operations or data mining. Another scenario is to have applications (EARs, WARs) in the grid and being able to access them quickly.

No, random access is currently not supported, but of course we want to do this because we'd like to support most of the java.io classes...

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

2 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT