InfoQ Homepage Hadoop Content on InfoQ
-
Q&A with Saumitra Buragohain on Hortonworks Data Platform 3.0
InfoQ caught up with Saumitra Buragohain, senior director of Product Management at Hortonworks, regarding Hadoop in general and HDP 3.0 in particular.
-
Dataiku's Latest Release Integrates Deep-Learning for Computer Vision
Collaborative data science platform Dataiku's latest release of its Data Science Studio includes pre-trained deep learning models for image processing. The DSS platform implements each step of a data-science project from data-sourcing and visualization to production deployment. Its machine-learning module supports standard libraries and it integrates with Hadoop and multiple Spark engines.
-
DevOps Workbench Launched by ZeroStack
Private cloud provider, ZeroStack, has announced a self-service capability from which developers can create their own workbenches. Forty developer tools from a mix of open source and commercial providers are available to users of the DevOps Workbench through Zerostack’s Intelligent Cloud Platform.
-
Apache HBase 1.3 Ships with Multiple Performance Improvements
Apache HBase 1.3.0 was released mid-January 2017 and ships with support for date-based tiered compaction and improvements in multiple areas, like write-ahead log (WAL), and a new RPC scheduler, among others. The release includes almost 1,700 resolved issues in total.
-
Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow
Julien Le Dem, the PMC chair of the Apache Arrow project, presented on Data Eng Conf NY on the future of column-oriented data processing. Apache Arrow is an open-source standard for columnar in-memory execution. InfoQ interviewed Le Dem to find out the differences between Arrow and Parquet.
-
Combine SQL Server with Hadoop Using PolyBase
With the recently released SQL Server 2016, you can now use SQL queries against Hadoop and Azure blob storage. Not only do you no longer need to write map/reduce operations, you can also join relational and non-relational data with a single query.
-
Elephant in the Cloud - Hadoop as a Service
Hadoop and other big data technologies revolutionized the way organizations run data analytics but the organizations are still facing challenges with operating costs of using these technologies for on-premise data processing. Ashish Thusoo recently spoke at Enterprise Data World Conference about Hadoop as a service offering that helps organizations bridge the gaps with these capabilities.
-
Google Cloud Machine Learning and Tensor Flow Alpha Release
Late last month Google released an alpha version of their TensorFlow (TF) integrated cloud machine learning service as a response to a growing need to make their Tensor Flow library to run at scale on the Google Cloud Platform (GCP). Google describes several new feature sets around making TF usage scale by integrating several pieces of the GCP like Dataproc, a managed Hadoop and Spark service.
-
Apache Flink 1.0.0 is Released
InfoQ's Rags Srinivas caught up with Stephan Ewen, a project committer for Apache Flink about the 1.0.0 Release and the roadmap
-
Hunk/Hadoop: Performance Best Practices
When working with Hadoop, with or without Hunk, there are a number of ways you can accidentally kill performance. While some of the fixes require more hardware, sometimes the problems can be solved simply by changing the way you name your files.
-
Using Hunk+Hadoop as a Backend for Splunk
Splunk can now store archived indexes on Hadoop. At the cost of performance, this offers a 75% reduction in storage costs without losing the ability to search the data. And with the new adapters, Hadoop tools such as Hive and Pig can process the Splunk-formatted data.
-
Splunk .conf 2015 Keynote
Splunk opened their big data conference with an emphasis on “making machine data accessible, usable, and valuable to everyone”. This is a shift from their original focus: indexing arbitrary big data sources. Reasonably happy with their ability to process data, they want to ensure that developers, IT staff, and normal people have a way to actually use all of the data their company is collecting.
-
Parquet Becomes Top-Level Apache Project
Apache Parquet, the open-source columnar storage format for Hadoop, recently graduated from the Apache Software Foundation Incubator and became a top-level project. Initially created by Cloudera and Twitter in 2012 to speed up analytical processing, Parquet is now openly available for Apache Spark, Apache Hive, Apache Pig, Impala, native MapReduce, and other key components of the Hadoop ecosystem.
-
MemSQL 4 Database Supports Community Edition, Geospatial Intelligence and Spark Integration
Latest version of MemSQL, in-memory database with support for transactions and analytics, includes a new Community Edition for free use by organizations. MemSQL 4, released last week, also supports integration with Apache Spark, Hadoop Distributed File System (HDFS), and Amazon S3.
-
Glenn Tamkin on Applying Apache Hadoop to NASA's Big Climate Data
NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance data analytics. Glenn Tamkin from NASA team, recently spoke at ApacheCon Conference and shared the details of the platform they built for climate data analysis with Hadoop.