Hadoop Content on InfoQ

  • Big Data Analytics for Security

    In this article, the authors discuss the role of big data and Hadoop in the security analytics space, and how to use MapReduce to process data efficiently for use cases such as Security Information and Event Management (SIEM) and fraud detection.

  • Building Applications With Hadoop

    When building applications with Hadoop, input data commonly arrives from various sources in various formats. In his presentation, “New Tools for Building Applications on Apache Hadoop”, Eli Collins gives an overview of how to build better products with Hadoop and of the tools that can help, such as Apache Avro, Apache Crunch, Cloudera ML and the Cloudera Development Kit.

  • Building a Real-time, Personalized Recommendation System with Kiji

    In this article, Jon Natkins explains how to create a personalized recommendation system fed with large amounts of real-time data using Kiji, which leverages HBase, Avro, MapReduce and Scalding.

  • Costin Leau on Elasticsearch, BigData and Hadoop

    Elasticsearch is an open source, distributed, real-time search and analytics engine for the cloud. The first milestone of elasticsearch-hadoop, 1.3.M1, was released last month. InfoQ spoke with Costin Leau about Elasticsearch and how it integrates with Hadoop and other Big Data technologies.

  • Spoilt for Choice – How to choose the right Big Data / Hadoop Platform?

    In his new article, Kai Wähner compares several alternatives for installing a version of Hadoop and implementing big data processes. He compares distributions and tooling from Apache and many other vendors, including Cloudera, Hortonworks, MapR, Amazon, IBM, Oracle and Microsoft. He also describes the pros and cons of each distribution and provides a decision tree for choosing the most appropriate one.

  • Interview and Video Review: Working with Big Data: Infrastructure, Algorithms, and Visualizations

    Paul Dix leads a practical exploration of Big Data in this video training series. The first five lessons span multiple server systems, with a focus on the end-to-end processing of large quantities of XML data from real Stack Exchange posts. He completes the training with a lesson on developing visualizations for gaining insights from the macro-level analysis of Big Data.

  • Hadoop Virtual Panel

    In this virtual panel, InfoQ talks to several Hadoop vendors and users about their views on the current and future state of Hadoop and what matters most for Hadoop’s further adoption and success.

  • Interview with Arun Murthy on Apache YARN

    Apache Hadoop YARN, a new Hadoop resource manager, has just been promoted to a high-level Hadoop subproject. InfoQ had the chance to discuss YARN with Arun Murthy, founder and architect at Hortonworks.

  • Generating Avro Schemas from XML Schemas Using JAXB

    Apache Avro is an up-and-coming binary marshalling framework. In his new article, Benjamin Fagin explains how to leverage existing XSD tooling to create data definitions and then use an XJC plugin to generate Avro schemas and marshalling classes directly.

  • Exploring Hadoop OutputFormat

    As more companies adopt Hadoop, its integration with other applications is becoming more important. A key to such integration is using an appropriate OutputFormat, which makes it possible to produce output data in the form best suited to other applications.

  • Extending Oozie

    In this article, the authors show how to leverage Oozie's extensibility to implement custom language extensions. This approach can be viewed as specializing the workflow language for a given company or line of business.

  • An Open, Interoperable Cloud

    This article describes how interoperable clouds can be created today through the integration of open standards such as the Open Cloud Computing Interface (OCCI), the Open Virtualisation Format (OVF) and CDMI. Together they provide the means to package virtual infrastructure deployments, an API for the runtime management of storage infrastructure and an API for the runtime management of infrastructure as a service.