In his new article “MapReduce Patterns, Algorithms, and Use Cases”, Ilya Katsov gives a systematic view of the different MapReduce patterns, algorithms and techniques that can be found on the web or in scientific articles along with several practical use case studies.
After six years of gestation, the big data framework Apache Hadoop 1.0.0 was recently released. Core features in the release include Kerberos authentication, support for Apache HBase, and a RESTful API to HDFS. InfoQ spoke with Arun Murthy, VP of Apache Hadoop, about the new release.
Corporations are increasingly using social media to learn more about what their customers are saying about their products. This presents unique challenges as unstructured content needs analytic techniques to interpret the sentiment embodied in the blog posts. InfoQ caught up with Subramanian Kartik to learn more about the blog sentiment analysis project his team worked on.
HPCC Systems, which is part of LexisNexis, is launching its Thor Data Refinery Cluster on Amazon EC2 this week. HPCC Systems is an enterprise-grade, open source Big Data analytics technology platform capable of ingesting vast amounts of data and transforming, linking, and indexing that data, with parallel processing power spread across the nodes.
Recently Steve Jones, from Cap Gemini, questioned whether NoSQL/Big Data is the panacea that some vendors would have us believe. He suggests that in some cases an in-memory RDBMS may well be the optimal solution and that approaches such as MapReduce could be too difficult for typical IT departments to understand. He concludes with a suggestion that sometimes Big Data may be a Big Con.
Hortonworks, a company created in June 2011 by Yahoo! and Benchmark Capital, has announced the Technical Preview Program of its Data Platform based on Hadoop. The company employs many of the core Hadoop contributors and intends to provide support and training.
The Amazon Web Services (AWS) team announced a set of resources targeting the high performance computing needs of the scientific community. AWS specifically highlights its “spot pricing” market as a way to do cost-effective, massive-scale computing in the Amazon cloud environment.
MapR Technologies released a big data toolkit based on Apache Hadoop, with its own distributed storage alternative to HDFS. The software is commercial, with both a free edition, M3, and a paid edition, M5. M5 includes snapshots and mirroring for data, Job Tracker recovery, and commercial support. MapR's M5 edition will form the basis of EMC Greenplum's upcoming HD Enterprise Edition.
Yahoo spun out its core Hadoop team, forming a new company, Hortonworks. CEO Eric Baldeschwieler presented the company's vision of easing adoption of Hadoop and making core engineering improvements for availability, performance, and manageability. Hortonworks will sell support, training, and certification, primarily indirectly through partners.
A prevalent trend in IT over the last twenty years has been scaling out rather than scaling up. But due to recent technological advances there is a new option: scaling out scaled-up servers based on GPUs.
Yahoo recently announced and presented a redesign of the core MapReduce architecture for Hadoop to allow for easier upgrades, larger clusters, and fast recovery, and to support programming paradigms in addition to MapReduce. The new design is quite similar to the open source Mesos cluster management project; both Yahoo and Mesos commented on the differences and opportunities.
Ricky Ho revisited his three-year-old post on that question and realized that a lot had changed since then.
Google's Daniel Peng and Frank Dabek published a paper, “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, explaining that databases do not meet the storage or throughput requirements of Google's indexing system, which stores tens of petabytes of data and processes billions of updates per day on thousands of machines.
Jay Kreps of LinkedIn presented some informative details of how the company processes data at the recent Hadoop Summit. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large-scale data computation with high-volume, low-latency site serving.