Microsoft increased its foothold in the data science community last winter by acquiring Revolution Analytics, a major provider of software and services based on the open-source R project for computational statistics. The deal is expected to bring R capabilities to the Microsoft suite of products and facilitate the adoption of R-based solutions in the enterprise environment.
Apache Spark 1.3 has been released. The main improvements are the new DataFrames API, a more mature Spark SQL, a number of new methods in the machine learning library MLlib, and better integration of Spark Streaming with Apache Kafka.
Google announced last week the release of MR4C, an open source MapReduce framework for C that allows developers to run native code in the Hadoop framework. MR4C brings together the performance and flexibility of natively developed algorithms with the scalability and throughput provided by the Hadoop execution framework.
Some time ago, when MongoDB 2.6 was released, Kelly Stirman, Director of Products at MongoDB, answered our questions regarding that release. Now, with MongoDB 3.0 announced for March and MongoDB 3.0 RC-8 already available, it’s time to see in more detail what the WiredTiger storage engine, the new and improved MMS, and storage compression can bring to NoSQL users.
Pivotal has decided to open source core components of their Big Data Suite and has announced the Open Data Platform, an initiative promoting open source and standardization for Big Data.
Project Pachyderm aims to build a "modern" Hadoop using Docker and CoreOS.
Apache Hive 1.0 was released on February 6th, 2015. The release was originally planned as version 0.14.1, but the community voted to change the version number to 1.0.0 to reflect the maturity the project has reached.
Amazon recently announced EMRFS, an implementation of HDFS that allows EMR clusters to use S3 with a stronger consistency model. When enabled, this new feature keeps track of operations performed on S3 and provides list consistency, delete consistency, and read-after-write consistency for any cluster created with Amazon Machine Image (AMI) version 3.2.1 or greater.
Apache Flink 0.8.0 has been released. Besides the usual performance, compatibility, and stability improvements, the release adds a streaming Scala API, where streaming capabilities had so far been missing. Flink was also recently promoted to an Apache top-level project, after joining the incubator roughly nine months earlier.
A number of Google researchers and engineers presented their view on the technical debt of using machine learning at a NIPS workshop. They identified different aspects of technical debt and concluded that, without proper care, using machine learning or complex data analysis in your company can induce new kinds of technical debt that differ from those of classical software engineering.
Apache Spark 1.2.0 was released with a Netty-based implementation, high availability, and machine learning APIs. It represents the work of 172 contributors from over 60 institutions and comprises more than 1000 patches. InfoQ talks with Patrick Wendell, a Spark committer and PMC member.
The latest versions of the big data analytics tools Splunk Enterprise and Hunk support instant pivot, enhanced event pattern detection, and prebuilt dashboard panels. Splunk Inc., provider of the software platform for operational intelligence, recently announced the general availability (GA) of version 6.2 of Splunk Enterprise and Hunk: Splunk Analytics for Hadoop and NoSQL Data Stores.
ThoughtWorks has published a digital preview of the January 2015 radar, providing opinion on techniques, tools, platforms and languages and taking a snapshot of the current trends in software technology.
Splice Machine version 1.0 supports analytic window functions and integration with the Hadoop ecosystem. The Splice Machine team recently released its Hadoop-based RDBMS, a data management solution that can be used for transactional workloads on Hadoop.
Earlier this year, Google announced Cloud Dataflow, a service and SDK for processing large amounts of data in batches or in real time. Now the company has open sourced the Dataflow Java SDK, enabling developers to see how it works and possibly use the SDK for services running on-premises or in other clouds.