On August 12, Google announced that Cloud Dataflow, its big data processing service, has reached general availability. This managed service allows customers to build pipelines that manipulate data before it is processed by big data solutions. Cloud Dataflow supports both streaming and batch programming in a unified model.
Airbnb recently open-sourced Airflow, its own data workflow management framework. Airflow is being used internally at Airbnb to build, monitor and adjust data pipelines. Airflow’s creator, Maxime Beauchemin, and Siddharth Anand, Agari’s Data Architect and one of the framework’s early adopters, discuss Airflow, where it can be of use, and future plans.
Any cloud provider that believes in data gravity is trying to make it easier to collect and store data in its facilities. To make data movement between cloud and on-premises endpoints easier, Microsoft recently announced the general availability of Azure Data Factory (ADF).
At QCon San Francisco, we offer two days of workshops (Nov 19-20). Workshops focus on developing the technical skills needed to use the technologies you heard about from our expert practitioners during the conference sessions. Here is a glimpse at some of the experts you can learn from at the QCon SF ’15 workshops.
For an organization to be data-driven, it's not enough to just dump mountains of data. That data needs to be accurate and meaningful. Julianna Göbölös-Szabó, data engineer at Prezi, shared how the company improved the quality of its log data. The solution involved moving from unstructured to structured data with a lightweight, contract-based approach to nudge all teams in the right direction.
Basho Data Platform supports integration with NoSQL databases like Redis, in-memory analytics, caching, and search. Basho Technologies, the company behind the Riak NoSQL database, announced in May the availability of the data platform, which can be used to deploy and manage Big Data, IoT and hybrid cloud applications.
At the recent devopsdays Amsterdam 2015, Patrick Roelke contended that monitoring still has lots of issues. Roelke believes that data science can help by eliminating static thresholds and coalescing information from various data sources into a single metric. The talk included a quick overview of monitoring tools that leverage data science: Kale, Bosun and AnomalyDetection.
Metanautix recently announced the newest version of its product, Quest. Quest allows enterprises to build software-defined data marts that can run on virtualized servers. Designed from the ground up with security and auditability in mind, Quest can deal with Big Data workloads and encapsulate them in different logical views, ready for consumption by different departments in the enterprise.
The demand for IT project managers is increasing. Agile methodologies support collaboration with distributed teams for creative problem solving. The Internet of Things, cloud, big data, and cyber security will continue to dominate the IT landscape. Project managers have to pioneer IoT initiatives, be prepared for the influx of data, and ensure that deliverables from their projects are secure.
Twitter has replaced Storm with Heron, which provides up to 14 times more throughput and up to 10 times lower latency on a word count topology, and has helped the company reduce the hardware it needs to a third.
Apache Parquet, the open-source columnar storage format for Hadoop, recently graduated from the Apache Software Foundation Incubator and became a top-level project. Initially created by Cloudera and Twitter in 2012 to speed up analytical processing, Parquet is now openly available for Apache Spark, Apache Hive, Apache Pig, Impala, native MapReduce, and other key components of the Hadoop ecosystem.
Capgemini is currently working on Apollo, an open source application platform built on top of the Apache Mesos cluster manager and Docker, which is designed to power next-generation web services, microservices and big data platforms running at scale.
The latest version of MemSQL, an in-memory database with support for transactions and analytics, includes a new Community Edition for free use by organizations. MemSQL 4, released last week, also supports integration with Apache Spark, Hadoop Distributed File System (HDFS), and Amazon S3.
Flipboard recently reported on an in-house application of deep learning to scale up low-resolution images that illustrates the power and flexibility of this class of learning algorithms.
The NASA Center for Climate Simulation (NCCS) is using Apache Hadoop for high-performance data analytics. Glenn Tamkin from the NASA team recently spoke at the ApacheCon conference and shared the details of the platform they built for climate data analysis with Hadoop.