Embedded Analytics and Statistics for Big Data
This article first appeared in IEEE Software magazine and is brought to you by InfoQ & IEEE Computer Society.
Embedded analytics and statistics for big data have emerged as an important topic across industries. As data volumes have increased, software engineers are increasingly called on to support data analysis and to apply statistics to the data. This article provides an overview of tools and libraries for embedded data analytics and statistics, covering both stand-alone software packages and programming languages with statistical capabilities. I look forward to hearing from both readers and prospective column authors about this column and the technologies you want to know more about. -- Christof Ebert
Big data has emerged as a key concept both in the information technology and the embedded technology worlds.1 Such software systems are characterized by a multitude of heterogeneous connected software applications, middleware, and components such as sensors. The growing usage of cloud infrastructure makes available a wealth of data resources; smart grids, intelligent vehicle technology, and medicine are recent examples of such interconnected data sources. We’re producing approximately 1,200 exabytes of data annually, and that figure is only growing.2,3 Such a massive amount of unstructured data presents enormous and mounting challenges for business and IT executives.
Big data is defined by four dimensions: volume, source complexity, production rate, and potential number of users. The data needs to be organized to transform the countless bits and bytes into actionable information—the sheer abundance of data won’t be helpful unless we have ways to make sense out of it. Traditionally, programmers wrote code and statisticians did statistics. Programmers typically used a general-purpose programming language, whereas statisticians plied their trade using specialized programs such as IBM’s SPSS (Statistical Package for the Social Sciences). Statisticians pored over national statistics or market research usually only available to select groups of people, whereas programmers handled large amounts of data in databases or log files. Big data’s availability from the cloud to virtually everybody changed all that.
As the volumes and types of data have increased, software engineers are called on more and more often to perform different statistical analyses on them. Software engineers are active in gathering and analyzing data on an unprecedented scale to make it useful and grow new business models.1 As an example, consider proactive maintenance. We can continuously monitor machines, networks, and processes to immediately detect irregularities and failures, allowing us to correct them before damage occurs or the system comes to a standstill. This reduces maintenance costs, both in materials and in human intervention. Often, processing and making sense of data is just part of a bigger project or is embedded in some software, configuration, or hardware optimization problem. Luckily, the community has responded to this need by creating a set of tools that bring some of statisticians’ magic to programmers. In fact, these tools are often more powerful than traditional statistics tools because they can handle data volumes that are orders of magnitude larger than traditional statistical samples.
Technologies for Embedded Analytics and Statistics
There’s a wealth of software available for performing statistical analysis; Table 1 shows the most popular ones. They differ in the statistical sophistication required from their users, ease of use, and whether they’re primarily stand-alone software packages or programming languages with statistical capabilities.
- Because these are stand-alone programming languages, they’re relatively easy to interface with other systems via standard language mechanisms or by importing and exporting data in various formats.
- Scripts in Python and R can be embedded directly into larger analytical workflows.
- Python and R programs can be directly used to build applications that read data from various sources and interact directly with the user for analysis and visualization via the Web.
- Through D3, users can elevate analytics to a higher level by interactively manipulating statistical graphics via Web browsers.
- These are much closer to a programmer’s frame of mind than specialized statistical packages are.
With the exception of D3, all entries in the table provide facilities for carrying out advanced statistics, such as multivariate and time-series analysis, either by themselves or via libraries. Each one, though, has a particular focus that will better suit working on a given target problem. Python’s Pandas package, for instance, has good support for time-series analysis because part of it was written to cater to such analysis regarding financial data.
The Python Statistics Ecosystem
The most popular general-purpose programming language for doing statistics today is Python. It’s always been a favorite for scientific computation, and several excellent Python tools are available for doing even complex statistical tasks. The fundamental scientific library in Python is NumPy. Its main addition to Python is a homogeneous, multidimensional array that offers a host of methods for manipulating data. It can integrate with C/C++ and Fortran and comes with several functions for performing advanced mathematics and statistics. Internally, it primarily uses its own data structures, implemented in native code, so that matrix calculations in NumPy are much faster than equivalent calculations in pure Python. SciPy, which builds on top of NumPy, offers a number of higher-level mathematical and statistical functions. SciPy also works with NumPy’s arrays; these are fine for doing mathematics but a bit cumbersome for handling heterogeneous data with possibly missing values. Pandas solves that problem by offering a flexible data structure that allows easy indexing, slicing, and even merging and joining (similar to joins between SQL tables). One attractive setup involves using IPython, an interactive Python shell with command-line completion, nice history facilities, and many other features that are particularly useful when manipulating data. Matplotlib can then visualize the results.
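To make the division of labor between these libraries concrete, here is a minimal sketch. It assumes only that NumPy and Pandas are installed; the country codes, region names, and numbers are invented for illustration and don't come from any real dataset.

```python
import numpy as np
import pandas as pd

# NumPy: a homogeneous, multidimensional array with vectorized operations.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
column_means = values.mean(axis=0)   # mean of each column, computed in native code

# Pandas: labeled, heterogeneous data that may contain missing values.
# All codes and figures below are made up for illustration.
population = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC"],
    "population": [10.5, np.nan, 3.2],   # NaN marks a missing measurement
})
regions = pd.DataFrame({
    "code": ["AAA", "BBB", "CCC"],
    "region": ["North", "South", "North"],
})

# Merge the two tables on a shared key, much like an SQL join.
merged = population.merge(regions, on="code")

# Select rows by condition, then aggregate; missing values are skipped by default.
north_total = merged.loc[merged["region"] == "North", "population"].sum()

print(column_means)
print(merged)
print(north_total)
```

The array handles the pure number crunching, whereas the DataFrame carries labels, mixed column types, and gaps in the data, which is usually what real-world input looks like.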
The World Bank is a trove of information, and it makes a lot of its data available over the Web. For more sophisticated analysis, the public can download data from the World Bank’s Data Catalog or access it through an API. The most popular dataset is the World Development Indicators (WDI). WDI contains, according to the World Bank, “the most current and accurate global development data available, and includes national, regional and global estimates.” WDI comes in two downloadable forms: Microsoft Excel and comma-separated values (CSV) files. (Because Microsoft Excel files aren’t suitable for programmatic analysis, we deal with the CSV files here.)
Figure 1. A program for calculating World Development Indicators correlations using Python. The program collects the top 30 most measured indicators, calculates the Spearman pairwise correlations, and shows the results graphically.
The WDI CSV bundle is a 42.5-Mbyte zipped archive. After downloading and unzipping it, you’ll see that the main file is called WDI_Data.csv. A good way to get an overview of the file contents is to examine it interactively. Because we’ll be using Python, the most convenient way to work with these tools is to launch an IPython session and then load the data:
In : import pandas as pd
In : data = pd.read_csv("WDI_Data.csv")
The result, in data, is a DataFrame containing the data. Think of a DataFrame as a two-dimensional array with some extra features that allow for easy manipulation. In a DataFrame, data is organized in columns and an index (corresponding to the rows). If we enter
In : data.columns
we’ll get an output that shows the names of the columns: the country name, the code for that country, an indicator name, and an indicator code. These are followed by columns for each year from 1960 to 2012. Similarly, if we enter
In : data.index
we’ll see that the data contains 317,094 rows. Each row corresponds to the values of one particular indicator for one country for the years 1960 to 2012; years without values in a row indicate no measurement in that year for that indicator in that country. Let’s first see how many indicators there are:
In : len(data['Indicator Name'].unique())
and second, how many countries there are:
In : len(data['Country Name'].unique())
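The same exploration can be tried without downloading the archive. The sketch below uses a tiny, made-up stand-in for WDI_Data.csv with the same column layout; the country names, indicator codes, and values are hypothetical, and only two year columns are included instead of the full 1960 to 2012 range.

```python
import io
import pandas as pd

# A tiny, invented stand-in for WDI_Data.csv with the same column layout.
sample_csv = io.StringIO(
    "Country Name,Country Code,Indicator Name,Indicator Code,1960,1961\n"
    "Atlantis,ATL,Population,X.POP,100,\n"
    "Atlantis,ATL,Land area,X.LND,,50\n"
    "Utopia,UTO,Population,X.POP,200,210\n"
)
data = pd.read_csv(sample_csv)

print(list(data.columns))                  # column names
print(len(data.index))                     # number of rows
print(data["Indicator Name"].nunique())    # number of distinct indicators
print(data["Country Name"].nunique())      # number of distinct countries
print(data["1960"].isna().sum())           # rows with no measurement in 1960
```

Empty fields in the CSV become missing values (NaN) in the DataFrame, which is how years without measurements show up in the real file as well.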
Now we have a problem to solve: Are the indicators independent among themselves, or are some of them related to others?
Because we measure indicators by year and by country, we must define the problem more precisely by deciding which parameters to hold constant. In general, we get better statistical results as our samples increase. It makes sense then to rephrase the problem: For the year in which we have most measurements, are the most measured indicators independent among themselves, or are some of them related to others? By “most measured indicators,” we mean those that have been measured in more countries. It turns out that we can find the answer to the question in about 50 LOC. Figure 1 contains the full program.
Lines 1–10 are imports of the libraries that we’ll be using. Line 11 reads the data. In line 13, we give the number of most measured indicators that we would like to examine. In line 15, we find the zero-based position of the first column with yearly measurements. After that, we’re able in line 17 to find the column with the most measurements (the year 2005). We then remove all data for which measurements aren’t available. In lines 20–26, we get the most measured indicators.
The actual statistical calculations start from line 28, where we prepare a table of ones to hold the result of the correlation values between each pair of indicators. In the loop that follows, we calculate each pairwise correlation and store it in the table we prepared. Finally, in lines 41–52, we display the results on screen and save them to a PDF file (see Figure 2). We take care to reverse the vertical order of the correlation matrix so that the most important indicator comes on the top of the matrix (lines 41 and 49).
The diagonal shows perfect correlation, as it should, because each indicator there is compared with itself. Beyond the diagonal, we do see that some indicators correlate with each other: some positively, even strongly so, and some negatively or very negatively.
Figure 2. World Development Indicators correlations matrix with Python created from the program in Figure 1.
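The steps described above can be approximated in a condensed sketch. This is not the program from Figure 1: it runs on randomly generated data in the WDI layout (all names and values are invented), it uses the Pandas method DataFrame.corr instead of an explicit loop over indicator pairs, and it omits the plotting step.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented data in the WDI layout: 40 countries x 5 indicators, two year columns.
countries = [f"Country {i}" for i in range(40)]
indicators = [f"Indicator {j}" for j in range(5)]
rows = [(c, ind) for c in countries for ind in indicators]
data = pd.DataFrame(rows, columns=["Country Name", "Indicator Name"])
data["2004"] = rng.normal(size=len(data))
data["2005"] = rng.normal(size=len(data))
data.loc[data.index % 3 == 0, "2004"] = np.nan   # make 2004 sparser than 2005

top_n = 3                        # number of most measured indicators to keep
year_columns = ["2004", "2005"]

# Year with the most measurements.
best_year = data[year_columns].count().idxmax()

# Keep only the rows that have a measurement in that year.
measured = data.dropna(subset=[best_year])

# Indicators measured in the most countries.
top_indicators = (measured.groupby("Indicator Name")["Country Name"]
                  .nunique().nlargest(top_n).index)

# Reshape: one row per country, one column per indicator.
wide = (measured[measured["Indicator Name"].isin(top_indicators)]
        .pivot(index="Country Name", columns="Indicator Name", values=best_year))

# Pairwise Spearman rank correlations between the selected indicators.
correlations = wide.corr(method="spearman")
print(correlations.round(2))
```

Because the values are random, the off-diagonal correlations here will be close to zero; with the real WDI data, the same reshaping and correlation calls would reveal the kind of structure shown in Figure 2.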
More Advanced Components in the Python Ecosystem
As Python has attracted interest from the research community, several specialized tools have emerged. Among them, Scikit-learn builds on NumPy, SciPy, and matplotlib and offers a comprehensive machine-learning toolkit. For very big datasets that follow a hierarchical schema, Python offers PyTables, which is built on top of the HDF5 library. This is a hot topic, and DARPA awarded US$3 million in 2013 to Continuum Analytics as part of the XDATA program to develop further Python data analytics tools. You can expect the ecosystem to keep evolving steadily over the next few years.
The R Project for Statistical Computing
R is a language for doing statistics. You can think of Python bringing statistics to programmers and R bringing statisticians to programming. It’s a language centered on the efficient manipulation of objects representing statistical datasets. These objects are typically vectors, lists, and data frames that represent datasets organized in rows and columns. R has the usual control flow constructs and even uses ideas from object-oriented programming (although its implementation of object orientation differs considerably from the concepts we find in more traditional object-oriented languages). R excels in the variety of statistical libraries it offers. It’s unlikely that a statistical test or method isn’t already implemented in an R library (whereas in Python, you might find that you have to roll your own implementation). To get an idea of what it looks like, Figure 3 shows the same program as Figure 1, following the same logic but using R instead of Python. Figure 4 shows the results.
Figure 3. A program similar to that in Figure 1 that calculates World Development Indicators correlations using R.
Combining, Federating, and Integrating Embedded Analytics Technologies
The examples we give in this article are typical of the way different applications can be merged to handle big data. Data flows from the source (in some raw format) to a format acceptable to our statistical package. The package must have some means of manipulating and querying data so that we can get the data subsets that we want to examine. These are subject to statistical analysis. The results of the statistical analysis can be rendered in textual form or as a figure. We can perform this process on a local computer or via the Web (in which case data crunching and processing is performed by a server, and parameters, results, and figures go through a Web browser). This is a powerful concept, because a host of different settings, from an ERP framework to car diagnostic software, can export their data in simple formats like CSV. In fact, we should treat it as a warning sign whenever we encounter a piece of software that doesn’t allow exporting to anything but closed and proprietary data formats.
To analyze your data in any way you want, you must first have access to it. So you should by all means select technologies that facilitate the exchange of data, either by simple export mechanisms or via suitable calls, for instance through a REST (representational state transfer) API.
Data is getting bigger all the time, so you must investigate whether the tool you’re considering will be able to handle your data. It’s not necessary for you to be able to process all the data in main memory. For instance, R has the bigmemory library, which lets us handle huge datasets by using shared memory and memory-mapped files. Also, make sure that the software package can handle not only big input but also big data structures: if table sizes are limited to 32-bit integers, for instance, you won’t be able to handle tables with 5 billion entries.
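Both points can be illustrated with a short Python sketch. The first part checks whether a given number of entries can be counted with a signed 32-bit integer, whose maximum is 2^31 - 1, roughly 2.1 billion. The second part processes a CSV in fixed-size chunks through the chunksize parameter of Pandas’ read_csv, so that only one chunk is held in memory at a time. The helper name fits_in_int32 and the in-memory sample data are assumptions made for illustration.

```python
import io
import pandas as pd

# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1   # 2,147,483,647

def fits_in_int32(n_entries: int) -> bool:
    """Return True if n_entries can be counted with a signed 32-bit integer."""
    return n_entries <= INT32_MAX

print(fits_in_int32(5_000_000))       # five million entries
print(fits_in_int32(5_000_000_000))   # five billion entries

# Chunked processing: the in-memory text below stands in for a file
# that would be too large to load at once.
big_csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 1001)))

total = 0
count = 0
for chunk in pd.read_csv(big_csv, chunksize=100):
    total += chunk["value"].sum()   # aggregate one chunk at a time
    count += len(chunk)

mean_value = total / count
print(mean_value)
```

Running totals like these work for statistics that can be accumulated incrementally, such as sums, counts, and means; measures that need all the data at once, such as exact medians, call for other techniques such as memory-mapped files.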
In the examples above, the alert reader will have noticed that we’ve spent more code manipulating the data to bring it to the appropriate format for statistical analysis than on the statistical analysis per se, which was provided anyway by functions already written for us. Our examples were somewhat trite, so these ratios of preprocessing to actual processing might have been especially top-heavy, but the examples highlight the fact that data manipulation is usually as important (and demanding) as the analysis. In effect, real proficiency in R and NumPy/SciPy doesn’t come from mastery of statistics but from knowing how to work efficiently with the data structures they offer. And this is essentially work for programmers, not statisticians. Further reading is available elsewhere.4-7
Figure 4. World Development Indicators correlations matrix with R.
1. C. Ebert and R. Dumke, Software Measurement, Springer, 2007.
2. K. Michael and K.W. Miller, eds., Computer, vol. 46, no. 6, 2013.
3. T. Menzies and T. Zimmermann, eds., IEEE Software, vol. 30, no. 4, 2013.
About the Authors
Panos Louridas is a consultant with the Greek Research and Technology Network and a researcher at the Athens University of Economics and Business. Contact him at email@example.com or firstname.lastname@example.org.
Christof Ebert is managing director at Vector Consulting Services. He’s a senior member of IEEE and is the editor of the Software Technology department of IEEE Software. Contact him at email@example.com.