R for Everyone: Advanced Analytics and Graphics – Book Review and Interview

 

The book "R for Everyone: Advanced Analytics and Graphics" authored by Jared P. Lander covers the R programming language and how to use it for data analytics and visualizations.

The book starts with how to download and install R and then introduces the R environment, including tools such as the command line interface and the RStudio IDE.

Jared then covers data structures such as data.frames, lists, matrices, and arrays. He also discusses reading data into R from CSV files, Excel documents, database tables, and other statistical tools.

The author also discusses basic statistics and linear and nonlinear models. The generalized linear models covered include logistic regression and Poisson regression. The nonlinear models include nonlinear least squares, Decision Trees, and Random Forests.

Jared also discusses clustering methods such as K-means, PAM, and Hierarchical Clustering.

InfoQ spoke with Jared about the R programming language, the book, and big data analytics and visualization.

InfoQ: When should we use R for data exploration compared to other solutions like Hadoop and MapReduce?

Jared P. Lander: This is not an either-or situation. While R can work with data in memory, it can also be used as the language for programming Hadoop and MapReduce jobs. If the data can fit comfortably on one reasonably sized machine, it should be explored in R; otherwise, R can be used to do the exploration on Hadoop.

InfoQ: Can you talk about some popular Machine Learning algorithms and what use cases or business problems they can solve?

Jared: The most popular algorithms and models I have seen lately are the Elastic Net, Decision Trees and Random Forests. The Elastic Net is great when there are a very large number of predictors since it performs variable selection and regularization and still maintains the interpretability of a generalized linear model. Its most notable implementation is the glmnet package in R that was written in very efficient FORTRAN code by Jerome Friedman, Trevor Hastie and Rob Tibshirani. Decision Trees are great for when there is a nonlinear relationship between the response and predictors. Depending on where the tree is cut they can be very interpretable and have strong predictive power. Their natural extension is the Random Forest, which combines hundreds or thousands of trees to gain predictive power at the expense of interpretability. All three perform very well in situations requiring prediction such as targeted advertising, fraud detection and sports analysis.
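For readers who want to see how the Elastic Net combines variable selection with regularization, the objective below is a sketch of the Gaussian (least-squares) case as the glmnet documentation commonly states it. The symbols (N observations, response y_i, predictors x_i, mixing parameter α, penalty strength λ) are standard glmnet notation and do not come from the interview.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\;
\lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 \;+\; \alpha\,\lVert\beta\rVert_1\right]
```

With α = 1 the penalty reduces to the lasso, whose L1 term can set coefficients exactly to zero and so drops predictors. With α = 0 it reduces to ridge regression, which only shrinks coefficients. Values in between mix the two behaviors.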

InfoQ: How does the R language compare with other Machine Learning frameworks like Spark MLlib?

Jared: R is both a language and a collection of statistical packages whereas other frameworks have predefined functionality. If some method does not exist in R—which is rare—it is possible to use R as a language to build it.

InfoQ: You discussed the analysis of time series data. This type of data is generated from more and more devices every day. Can you talk about some best practices in analyzing time series data?

Jared: With time series it is very important to account for autocorrelation, which means standard methods that assume independent observations are no longer applicable. This even trickles down to ensuring the data stays in sequential order for cross-validation. There are a number of different ways to fit models, including autoregressive moving average (ARMA), generalized autoregressive conditional heteroskedasticity (GARCH) and hidden Markov models (HMM). Clustering also needs special attention, where dynamic time warping is used to measure the distance between series. Storing time series efficiently is another important step in the process, and InfluxDB is a great solution.
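To make the dynamic time warping idea concrete, here is a minimal sketch in Python of the textbook dynamic-programming formulation, assuming one-dimensional series and an absolute-difference cost. The function name `dtw_distance` and the sample series are illustrative assumptions; they do not come from the book or the interview, and a real analysis would more likely use an established package.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Illustrative helper (not from the book): classic O(len(a) * len(b))
    dynamic program with |x - y| as the local cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed moves:
            # advance in a only, advance in b only, or advance in both.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


series = [0, 1, 2, 1, 0]
delayed = [0, 0, 1, 2, 1, 0]  # same shape, delayed by one step
print(dtw_distance(series, delayed))  # prints 0.0
```

A point-by-point comparison cannot even be applied here because the two series differ in length. Dynamic time warping aligns the shapes first, so the delayed copy ends up at distance zero, which is why it suits clustering series that are similar but out of phase.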

InfoQ: Can you talk about data visualization in general and what role R plays in the visualization space?

Jared: The most sophisticated analysis in the world would be near useless if the information it reveals cannot be communicated effectively and visualization is, perhaps, the best way to share information. A graph will almost always provide a better explanation than a table of numbers. One of the biggest selling points of R is its visualization capabilities. For years the gold standard was Hadley Wickham’s ggplot2 which makes amazing graphics with surprisingly few lines of code. With the move toward web graphics Hadley built ggvis, which is essentially a version of ggplot2 that generates Vega graphs. Ramnath Vaidyanathan wrote rCharts for easy creation of D3 graphics from within R. With all these options R makes data visualization easy and professional.

InfoQ: What do you recommend that application developers who want to learn R start with, in terms of tutorials or tools and IDEs?

Jared: The most important tool when using R is the RStudio IDE which has made coding in R so much easier and accessible. It offers so much convenience and functionality that its helpfulness cannot be overstated. One of the great things about R is that all the code is open source so it is really possible to learn by looking at the work of others. And, of course, my book, R for Everyone, is a great place to get started.

InfoQ: What are the limitations of the R programming language?

Jared: For the most part, data must be stored in memory, so the size of the data is limited to the amount of RAM on a computer. Traditionally, getting around this required using a cluster of machines, which is made easier with the foreach and parallel packages. There are a number of packages for working with data on disk, such as bigmemory and dplyr, and that has helped a great deal. In the past R had some questionable memory management, but that has improved considerably in recent releases.

InfoQ: Are there any improvements or new features you would like to see in the R programming language?

Jared: Given the growing size of data, the ability to use R to manipulate and analyze data still sitting in databases is crucial. While this functionality exists, it would be great to see it for more databases.

InfoQ: Do you have any other comments or thoughts on the statistical programming and big data analytics landscape in general, or on R in particular?

Jared: Being able to program on data instead of relying on point-and-click tools can improve productivity so much that everybody should make an effort to become programming literate. The quantity and quality of what they produce will improve so much that the time investment will certainly be worthwhile. This goes for anyone who works with spreadsheets all day. The next step is statistics and machine learning to extract ever more information out of that data. These skills will benefit both the users and their audience.

About the Book Author

Jared P. Lander is a statistical consultant based in New York City. He is the organizer of the New York Open Statistical Programming Meetup and speaks at the New York R, Predictive Analytics and Machine Learning meetups. Jared specializes in data management, multilevel models, machine learning, generalized linear models, and statistical computing. His consulting ranges from music and fundraising to finance and humanitarian relief efforts. He also teaches an R course at Columbia University. With a master's in statistics from Columbia University and a B.A. in mathematics, Jared has experience in both academic research and industry.
