
Big Data: Do Languages Really Matter?

by Charles Menguy on Jan 20, 2014

Big Data is a field where losing even a single millisecond per event can be significant: across a billion events, it adds up to more than eleven days of cumulative processing time. Yet languages often regarded as slow, such as Python, have gained a lot of popularity in the past year. Recent articles and discussions in the Big Data community have reignited the debate over the choice of programming language for data science and Big Data.

According to Ville Tuulos, principal engineer at AdRoll, the raw performance of a language doesn’t matter. Ville presented his findings at a meetup in San Francisco in September 2013, showing AdRoll’s backend stack, which is built around Python, and how it is able to outperform giants like Amazon’s Redshift. The key is that they built the system around their own very specific use case, which allowed them to optimize for that one use case. As Ville says:

You can use a high-level language to quickly implement domain-specific solutions that outperform generic solutions, regardless of the language they use.

This doesn’t mean that languages do not matter at all. There has been a lot of controversy recently over which language is best suited for data science and Big Data, and the two recurring contenders are Python and R. Some even speak of the data science wars. An interesting discussion on this topic was started on LinkedIn, and the general consensus seems to be that R is an academic language and that "what makes R better for the data scientist is the large amount of packages and the diversity of it as well."

But Python overall appeals more to programmers at heart and to those dealing with large amounts of data, as Tom Rampley, data scientist at Dish Network, says:

I use R extensively for the statistical functionality that comes with the various packages. I also use it for data manipulation with small data sets. However, for text parsing, large data set manipulation, and coding my own algorithms I much prefer Python in combination with the Numpy, Scipy, and Pandas packages.

In recent months, Python seems to be winning on most fronts, as described by Karissa McKelvey, who wrote in October 2013 that "My Data Is Big Because It Doesn't Load Into R", and by Matt Asay, who states that "R remains popular with the PhDs of data science, but as data moves mainstream, Python is taking over."

The performance of a given language is often treated as an important factor when deciding which one to use in a complex architecture, but its importance is often overstated. What really matters is how you use the language; as Linus Torvalds put it: "Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
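To make the point about data structures concrete, here is a minimal Python sketch. It does not come from AdRoll or from any of the systems discussed here; the event layout and the function names `count_by_scan` and `build_index` are hypothetical and chosen only for illustration. It contrasts a generic approach that rescans every event for each query with a use-case-specific one that aggregates once and then answers each query with a single dictionary lookup.

```python
from collections import defaultdict


def count_by_scan(events, user_id):
    """Generic approach: walk the full event list on every query (O(n) per lookup)."""
    return sum(1 for uid, _ in events if uid == user_id)


def build_index(events):
    """Use-case-specific approach: aggregate counts per user once (O(n) total),
    so that each later query is a single dictionary lookup (O(1) on average)."""
    counts = defaultdict(int)
    for uid, _ in events:
        counts[uid] += 1
    return counts


# Hypothetical data set: 100,000 click events spread evenly over 1,000 users.
events = [(i % 1000, "click") for i in range(100_000)]
index = build_index(events)

# Both approaches return the same answer; only the cost per query differs.
assert count_by_scan(events, 42) == index[42]
```

With many queries against the same data, the indexed version does far less work per query in any language, which is the sense in which the choice of data structure can outweigh the raw speed of the language.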

Take for example Cloudera’s recently open-sourced project Impala, which is proposed as a replacement for Hive to make queries an order of magnitude faster. Cloudera chose to write it in C++ while the rest of the Hadoop stack is written in Java, and performance is often advertised as the reason for this choice. But the performance boost does not come simply from the switch from Java to C++. Impala doesn’t even use MapReduce, and it caches data in memory, so the performance is better primarily because it uses a completely different paradigm with different data structures and limitations. Swapping Java for C++ is a nice addition that goes even further, but it is likely not responsible for the bulk of the performance boost.

Yes, languages do matter, as can be seen in the case of F# by Marc Sigrist

F# is the only language that features type providers, which let you access big, semi-structured data in a statically typed way, even if the data source contains thousands or millions of implied data types (e.g. the FreeBase type provider). See also here why it is specifically well suited for data science. F# enables various performance optimizations such as tail call elimination and structs (non-reference types), and it has excellent support for async and parallel processing. F# has quickly risen to rank 12 of the top programming languages in the TIOBE index. It is open source and supports all major desktop and device platforms.

InfoQ.com and all content copyright © 2006-2013 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.