Hadoop Content on InfoQ
-
Pinterest Automates Hadoop Cluster Scaling and Migration with Internal Orchestration System
Recently, Pinterest disclosed its internal orchestration framework, called Hadoop Control Center (HCC), to automate the scaling and migration of its large-scale Hadoop clusters. This move addresses the operational complexity and limitations Pinterest previously faced when managing thousands of nodes across dozens of YARN clusters on AWS.
-
Google Releases Cloud Dataproc for Kubernetes in Alpha
Google Cloud Dataproc is a managed data and analytics processing service based on open-source Hadoop and Spark. Google has recently announced the alpha availability of Cloud Dataproc for Kubernetes, which provides customers with a more efficient method to process data across platforms.
-
Google Introduces Cloud Storage Connector for Hadoop Big Data Workloads
In a recent blog post, Google announced a new Cloud Storage connector for Hadoop. This new capability allows organizations to substitute their traditional HDFS with Google Cloud Storage. Columnar file formats such as Parquet and ORC may realize increased throughput, and customers will benefit from Cloud Storage directory isolation, lower latency, increased parallelization, and intelligent defaults.
-
Data Engineering in Badoo: Handling 20 Billion Events Per Day
Badoo is a dating social network that currently handles billions of events per day, explains Vladimir Kazanov, data platform engineering lead. At Skills Matter, he talked through some of the challenges of operating at this scale, and what tooling Badoo uses in order to process and report on this data.
-
Microsoft Announces New Azure Analytics Services ADLS, ADX and More
Microsoft has announced the general availability of two new Azure analytics services: Azure Data Lake Storage Gen2 (ADLS) and Azure Data Explorer (ADX). Microsoft also announced the preview of Azure Data Factory Mapping Data Flow.
-
Cloudera and Hortonworks Merge with Goal to Increase Competition with Cloud Offerings
Earlier this month, Cloudera and Hortonworks announced an all-stock merger at a combined value of around $5.2 billion. Analysts have argued that this merger is a response to the increased competition that both companies face from cloud vendors like Amazon, Google and Microsoft. In this article we log reactions from analysts and the industry, and the implications for current customers.
-
Dataiku's Latest Release Integrates Deep-Learning for Computer Vision
Collaborative data science platform Dataiku's latest release of its Data Science Studio includes pre-trained deep learning models for image processing. The DSS platform implements each step of a data-science project from data-sourcing and visualization to production deployment. Its machine-learning module supports standard libraries and it integrates with Hadoop and multiple Spark engines.
-
DevOps Workbench Launched by ZeroStack
Private cloud provider ZeroStack has announced a self-service capability that lets developers create their own workbenches. Forty developer tools from a mix of open source and commercial providers are available to users of the DevOps Workbench through ZeroStack’s Intelligent Cloud Platform.
-
Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow
Julien Le Dem, the PMC chair of the Apache Arrow project, presented at Data Eng Conf NY on the future of column-oriented data processing. Apache Arrow is an open-source standard for columnar in-memory execution. InfoQ interviewed Le Dem to find out the differences between Arrow and Parquet.
-
Combine SQL Server with Hadoop Using PolyBase
With the recently released SQL Server 2016, you can now use SQL queries against Hadoop and Azure blob storage. Not only do you no longer need to write map/reduce operations, you can also join relational and non-relational data with a single query.
-
Elephant in the Cloud - Hadoop as a Service
Hadoop and other big data technologies have revolutionized the way organizations run data analytics, but organizations still face challenges with the operating costs of using these technologies for on-premise data processing. Ashish Thusoo recently spoke at the Enterprise Data World Conference about a Hadoop-as-a-service offering that helps organizations bridge these gaps.
-
Google Cloud Machine Learning and TensorFlow Alpha Release
Late last month, Google released an alpha version of its TensorFlow (TF) integrated cloud machine learning service in response to a growing need to run the TensorFlow library at scale on the Google Cloud Platform (GCP). Google describes several new feature sets for scaling TF usage by integrating several pieces of GCP, such as Dataproc, a managed Hadoop and Spark service.
-
Apache Flink 1.0.0 is Released
InfoQ's Rags Srinivas caught up with Stephan Ewen, a project committer for Apache Flink, about the 1.0.0 release and the roadmap.
-
Hunk/Hadoop: Performance Best Practices
When working with Hadoop, with or without Hunk, there are a number of ways you can accidentally kill performance. While some of the fixes require more hardware, sometimes the problems can be solved simply by changing the way you name your files.
-
Using Hunk+Hadoop as a Backend for Splunk
Splunk can now store archived indexes on Hadoop. At the cost of performance, this offers a 75% reduction in storage costs without losing the ability to search the data. And with the new adapters, Hadoop tools such as Hive and Pig can process the Splunk-formatted data.