Big Data: Do Languages Really Matter?
Big Data is a field where the loss of even a single millisecond can be significant over billions of events. Yet languages often regarded as slow, such as Python, have gained a lot of popularity in the past year. Recent articles and discussions in the Big Data community have reignited the debate around the choice of programming language for data science and Big Data.
According to Ville Tuulos, principal engineer at AdRoll, the raw performance of a language doesn’t matter. Ville presented his findings at a meetup in San Francisco in September 2013, describing AdRoll’s backend stack, which is built around Python, and how it is able to outperform giants like Amazon’s Redshift. The key is that AdRoll built its system around its own very specific use case, which allowed the team to optimize for that one case. As Ville says:
You can use a high-level language to quickly implement domain-specific solutions that outperform generic solutions, regardless of the language they use.
This doesn’t mean that languages do not matter at all. There has been a lot of controversy recently over which language is best suited for data science and Big Data, with Python and R as recurring contenders. Some even speak of the data science wars. An interesting discussion on this topic was started on LinkedIn, and the general consensus seems to be that R is an academic language and that "what makes R better for the data scientist is the large amount of packages and the diversity of it as well."
But Python overall appeals more to programmers at heart and to people dealing with large amounts of data, as Tom Rampley, data scientist at Dish Network, says:
I use R extensively for the statistical functionality that comes with the various packages. I also use it for data manipulation with small data sets. However, for text parsing, large data set manipulation, and coding my own algorithms I much prefer Python in combination with the Numpy, Scipy, and Pandas packages.
In recent months, Python seems to be winning on most fronts, as described by Karissa McKelvey, who wrote in October 2013 that "My Data Is Big Because It Doesn't Load Into R", and by Matt Asay, who states that "R remains popular with the PhDs of data science, but as data moves mainstream, Python is taking over."
The performance of a given language is often an important factor when deciding which one to use in a complex architecture, but its importance is often overstated. What really matters is how you use the language; as Linus Torvalds put it: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
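To make the point about data structures concrete, here is a minimal Python sketch, not taken from any of the projects mentioned in this article. The function names and the sample events are hypothetical. Both functions find repeated events and return the same answer, but the first rescans a list for every membership test while the second relies on hash lookups in a set, so the gap between them grows with input size whatever language they are written in.

```python
def duplicates_with_list(events):
    """Find repeated events; each `in` test rescans a list (quadratic overall)."""
    seen, dupes = [], []
    for event in events:
        if event in seen:
            if event not in dupes:
                dupes.append(event)
        else:
            seen.append(event)
    return sorted(dupes)


def duplicates_with_set(events):
    """Same result; each `in` test is a hash lookup (linear overall on average)."""
    seen, dupes = set(), set()
    for event in events:
        if event in seen:
            dupes.add(event)
        else:
            seen.add(event)
    return sorted(dupes)


# Hypothetical sample data, for illustration only.
events = ["click", "view", "click", "buy", "view", "click"]
print(duplicates_with_list(events))
print(duplicates_with_set(events))
```

On six events the difference is invisible; on billions of events the choice of structure would likely dominate any constant-factor gain from switching languages.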
Take, for example, Impala, the project recently open-sourced by Cloudera, which is proposed as a replacement for Hive to make queries an order of magnitude faster. Cloudera chose to write it in C++ while the rest of the Hadoop stack is written in Java, and performance is often advertised as the reason for this switch. But the performance boost does not come simply from the switch from Java to C++. Impala doesn’t even use MapReduce, and it caches data in memory, so the performance is better primarily because it uses a completely different paradigm with different data structures and limitations. Swapping Java for C++ is a nice addition that goes even further, but it is likely not responsible for the bulk of the performance boost.
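The effect of changing the paradigm while keeping the language can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how Impala or Hive actually work: `ScanEngine`, `CachedEngine`, and the sample rows are all hypothetical. One engine re-reads every row for each query, the other reads the rows once and then answers from an aggregate kept in memory.

```python
from collections import defaultdict


class ScanEngine:
    """Hypothetical engine that re-reads every row for each query."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_read = 0

    def total_for(self, key):
        total = 0
        for row_key, value in self.rows:
            self.rows_read += 1
            if row_key == key:
                total += value
        return total


class CachedEngine:
    """Hypothetical engine that reads rows once and keeps totals in memory."""

    def __init__(self, rows):
        self.rows_read = 0
        self.totals = defaultdict(int)
        for row_key, value in rows:
            self.rows_read += 1
            self.totals[row_key] += value

    def total_for(self, key):
        return self.totals.get(key, 0)


# Hypothetical sample data: (country, count) pairs.
rows = [("us", 3), ("fr", 5), ("us", 4), ("de", 1)]
scan, cached = ScanEngine(rows), CachedEngine(rows)
for key in ["us", "fr", "de", "us"]:
    assert scan.total_for(key) == cached.total_for(key)

# Rows read after four queries by each engine.
print(scan.rows_read, cached.rows_read)  # prints "16 4"
```

Both engines are written in the same language and return the same answers; only the amount of work per query differs. The trade-off is also visible: the cached engine can only answer the one kind of query it was built for, which is the sort of limitation the paragraph above alludes to.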
Yes, languages do matter, as can be seen in the case of F#