NoSQL, JSON, and Time Series Data Management: Interview with Anuj Sahni
Time series data is an ordered sequence of values of a variable at equally spaced time intervals. Time series data is coming at us from all directions: sensors, mobile devices, Web tracking, financial events, factory automation, and utilities—to name just a few.
Time series data management has been gaining more attention lately. InfoQ spoke with Anuj Sahni, Principal Product Manager of Oracle NoSQL Database at Oracle, about time series data and how to model this type of data.
InfoQ: What is Time Series data and how is it different from structured data?
Sahni: A time series is a sequence of data points captured at uniform time intervals. Examples include stock tick information from global stock exchanges, precious metals prices captured periodically, weather details at a specific longitude/latitude at periodic intervals, and continuous sensor feeds from manufacturing machines or oil rigs. Time series information is not necessarily different; it is a) the volume and velocity and b) the sparseness of the information that make it challenging to store in traditional stores designed for well-defined structured information.
InfoQ: What are the advantages of modeling a data set as Time Series data?
Sahni: Not all data needs to be represented as a time series, but if you have a continuous feed of information coming at you (let's say from sensors) and you want to analyze the data along the time dimension, then it is prudent to keep the arrival time of each reading and index it for faster queries.
InfoQ: What are the design considerations to keep in mind to store Time Series type of data in our databases?
Sahni: Time series data, generally speaking, is produced in large volumes, so one needs to give special consideration to how to persist it in the database and how best to access it. It is therefore very important to find out in advance how you will be accessing your data. For example, in the stock exchange demo, we know that stock tick information is generated once a second for each stock, i.e. about 86K ticks (86,400) per stock per day. If we store that many records as separate rows, the cost of accessing this information would be huge, so we can group 5 minutes, 1 hour, or one day's worth of records as a single vector record. The benefit of storing information in larger chunks is obvious: you do far fewer lookups into the NoSQL store to fetch the information for a specific period of time. The other point to remember is that if your window size is very small, you will be doing a lot of read/write operations, and if it is too big, durability becomes a concern, as you can lose the information in the event of a system failure. So you need to balance both forces.
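The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(symbol, window start)` key layout, the function names, and the in-memory dictionary standing in for the NoSQL store are all hypothetical and are not taken from the demo mentioned in the interview.

```python
from collections import defaultdict

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 one-second ticks per stock per day


def window_start(epoch_seconds: int, window_seconds: int) -> int:
    """Round a timestamp down to the start of its time window."""
    return epoch_seconds - (epoch_seconds % window_seconds)


def group_ticks(ticks, window_seconds):
    """Group (symbol, timestamp, price) ticks into one record per window.

    The returned dict stands in for the store (an assumption): each key is
    (symbol, window start) and each value is the vector of readings.
    """
    records = defaultdict(list)
    for symbol, ts, price in ticks:
        records[(symbol, window_start(ts, window_seconds))].append((ts, price))
    return dict(records)


# Hypothetical data: one day of per-second ticks for a made-up symbol.
one_day = [("DEMO", ts, 100.0) for ts in range(SECONDS_PER_DAY)]

for label, size in [("5 minutes", 300), ("1 hour", 3600), ("1 day", 86400)]:
    print(label, len(group_ticks(one_day, size)), "records")
```

With these inputs, the same 86,400 ticks collapse into 288, 24, or 1 stored records, depending on the window chosen.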
InfoQ: Are there any design or architecture patterns that developers should follow when using Time Series Data?
Sahni: What's most important while designing the data model for any time series application is to know the access pattern and the granularity of information required by the client application. Though we might need to persist all the time series data, more often than not we don't need to store each data point as a separate record in the database. If we can define a time window and store all the readings for that period of time as an array, then we can significantly cut the number of actual records persisted in the database, improving the performance of the system. From our previous example, if each stock generates 86K data points per day, then you would require that many random I/Os on the disk to access the information, which is way too many. But if you store the same information as arrays, where each array holds the information for a specific hour, then all you need is 24 lookups to fetch a day's worth of data.
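To make the lookup arithmetic concrete, here is a minimal sketch of the read side, assuming each stored record is keyed by the start timestamp of its window. The function name and the half-open range convention are assumptions chosen for illustration.

```python
def window_starts(start: int, end: int, window_seconds: int) -> list[int]:
    """Start timestamps of every window overlapping the half-open range [start, end)."""
    if end <= start:
        return []
    first = start - (start % window_seconds)  # window containing `start`
    return list(range(first, end, window_seconds))


SECONDS_PER_DAY = 86_400

# One record per second: one lookup per data point.
per_second_lookups = len(window_starts(0, SECONDS_PER_DAY, 1))

# One array per hour: one lookup per hourly window.
hourly_lookups = len(window_starts(0, SECONDS_PER_DAY, 3600))

print(per_second_lookups, hourly_lookups)
```

A client would issue one fetch per returned window start, so a full day costs 86,400 lookups with per-second records and 24 with hourly arrays.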
The other aspect to balance is the size of your array object (whether it is JSON, XML, or anything else). You wouldn't want the window size of your array to be so big that you end up returning far more information than you need, wasting precious bandwidth. And if you make the window too fine-grained, then you are potentially lowering the throughput and increasing the latency of your system. So balance both sides of the equation and come up with the right-sized time window for reading and writing collections of time series data to and from the database.
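One rough way to reason about this trade-off is to count, for a given query, how many lookups a window size requires and how many readings come back that the client did not ask for. The sketch below assumes one reading per second and whole windows returned on every fetch; the function and its parameters are hypothetical.

```python
def fetch_cost(query_start: int, query_end: int, window_seconds: int,
               readings_per_second: int = 1) -> tuple[int, int, int]:
    """Return (lookups, readings fetched, readings not needed) for [query_start, query_end)."""
    first = query_start - (query_start % window_seconds)
    lookups = len(range(first, query_end, window_seconds))
    fetched = lookups * window_seconds * readings_per_second
    needed = (query_end - query_start) * readings_per_second
    return lookups, fetched, fetched - needed


# Hypothetical query: ten minutes of data, 09:05 to 09:15, in seconds since midnight.
start, end = 9 * 3600 + 5 * 60, 9 * 3600 + 15 * 60

for label, size in [("1 minute", 60), ("1 hour", 3600), ("1 day", 86400)]:
    print(label, fetch_cost(start, end, size))
```

For this query, one-minute windows need 10 lookups with nothing wasted, while hourly and daily windows need a single lookup but return 3,000 and 85,800 unneeded readings respectively. The right window depends on which queries dominate the workload.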
InfoQ: What are the limitations of storing a data set as Time Series data?
Sahni: I would say the real challenge with time series data is to fine-tune the system based on the requirements and the access patterns. If the access pattern changes in the future, then you might have to re-index and recalculate the array size to optimize your queries. So each time series application is very much custom-made: you can apply the best practices, but you cannot simply import the data modeling templates into a different time series problem.
Anuj also spoke at last year’s JavaOne Conference on managing Time Series Data in applications.
About the Interviewee
Anuj Sahni is a Principal Product Manager at Oracle, where he is responsible for managing the company's leadership NoSQL database and big data products. Anuj has over twelve years of product management/development experience in Fortune 500 companies. He has led highly successful, globally distributed, cross-functional teams to develop new software and cloud-based services. Anuj obtained his Master’s degree in Computer Engineering from the University of Florida, and has published academic papers in bio-informatics fields as well. In his spare time, he enjoys cycling, tracking, and playing with his two daughters.