Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Spark Content on InfoQ

  • AWS Announces a Data Management and Analytics Solution Called Amazon FinSpace

    Recently, AWS announced a data management and analytics solution purpose-built for the Financial Services Industry (FSI) called Amazon FinSpace. The service aims to reduce the time it takes for financial analysts to find and access all types of financial data for analysis.

  • Simplifying ETL in the Cloud, Microsoft Releases Azure Data Factory Mapping Data Flows

    In a recent blog post, Microsoft announced the general availability (GA) of their serverless, code-free Extract-Transform-Load (ETL) capability inside of Azure Data Factory called Mapping Data Flows. This tool allows organizations to embrace a data-driven culture without the need to manage large infrastructure footprints while having the ability to dynamically scale data processing workloads.

  • Google Introduces Cloud Storage Connector for Hadoop Big Data Workloads

    In a recent blog post, Google announced a new Cloud Storage connector for Hadoop. This new capability allows organizations to substitute Google Cloud Storage for their traditional HDFS. Columnar file formats such as Parquet and ORC may realize increased throughput, and customers will benefit from Cloud Storage directory isolation, lower latency, increased parallelization and intelligent defaults.

  • Dataiku's Latest Release Integrates Deep-Learning for Computer Vision

    Collaborative data science platform Dataiku's latest release of its Data Science Studio includes pre-trained deep learning models for image processing. The DSS platform implements each step of a data-science project from data-sourcing and visualization to production deployment. Its machine-learning module supports standard libraries and it integrates with Hadoop and multiple Spark engines.

  • Yahoo Open Sources TensorFlowOnSpark

    Yahoo has open sourced TensorFlowOnSpark, which provides a Spark-native TensorFlow runtime and integration for distributed training and serving on Spark or Hadoop.

  • Google Cloud Machine Learning and TensorFlow Alpha Release

    Late last month Google released an alpha version of their TensorFlow (TF) integrated cloud machine learning service in response to a growing need to make their TensorFlow library run at scale on the Google Cloud Platform (GCP). Google describes several new feature sets that help TF usage scale by integrating several pieces of the GCP, such as Dataproc, a managed Hadoop and Spark service.

  • IBM to Open Source 50 Projects

    IBM has announced a new web portal called developerWorks Open, bringing together various projects they are open sourcing. The projects cover many domains including Analytics, Cloud, IoT, Mobile, Security, Social, Watson and others. So far, IBM has open sourced about 30 projects and plans to increase the number to 50 by the end of the year, with more possibly to follow.

  • MemSQL 4 Database Supports Community Edition, Geospatial Intelligence and Spark Integration

    The latest version of MemSQL, an in-memory database with support for transactions and analytics, includes a new Community Edition for free use by organizations. MemSQL 4, released last week, also supports integration with Apache Spark, Hadoop Distributed File System (HDFS), and Amazon S3.

  • LinkedIn Open Sources Cubert With an Eye To Big Data Analytics

    LinkedIn recently open sourced Cubert, its High Performance Computation Engine for Complex Big Data Analytics. Cubert is a framework written with analysts and data scientists in mind. Developed completely in Java and expressed as a scripting language, Cubert is designed for the complex joins and aggregations that frequently arise in the reporting world.

  • Mahout to Get Self-Optimizing Matrix Algebra Interface with Pluggable Backends for Spark and Flink

    At the recent GOTO conference in Berlin, Mahout committer Sebastian Schelter outlined recent advances in Mahout's ongoing effort to create a scalable foundation for data analysis that is as easy to use as R or Python.

  • Apache Drill Included in MapR Latest Distribution Release

    MapR recently announced that Apache Drill is included in the latest release of the MapR distribution. Apache Drill is the open source version of Google’s Dremel, the infrastructure on which BigQuery is based. Drill offers a low latency SQL-on-Hadoop interface. While this puts it in the same space as several other technologies around Hadoop, Drill has some unique characteristics setting it

  • DataBricks Announces Spark SQL for Manipulating Structured Data Using Spark

    DataBricks, the company behind Apache Spark, has announced a new addition to the Spark ecosystem called Spark SQL. Spark SQL is separate from Shark, and does not use Hive under the hood. InfoQ reached out to Reynold Xin and Michael Armbrust, software engineers at DataBricks, to learn more about Spark SQL.

  • A Roundup of Cloudera Distribution Containing Apache Hadoop 5

    Cloudera recently released the latest version of its software distribution, CDH5. Almost 20 months have passed since the last major version, CDH4, which seems like ages in the Big Data world. We take a look at the new features this release brings and at the future direction of Cloudera after the latest round of investment from Intel and Google Ventures.

  • Spark Gets a Dedicated Big Data Platform

    Spark users can now use a new Big Data platform provided by intelligence company Atigeo, which bundles most of the UC Berkeley stack into a unified framework optimized for low-latency data processing that can provide significant improvements over more traditional Hadoop-based platforms.

  • Spark Officially Graduates From Apache Incubator

    Recently, Spark graduated from the Apache incubator. Spark claims up to 100x speed improvements over Apache Hadoop on in-memory datasets, gracefully falling back to a 10x speed improvement for on-disk performance. Based on Scala, it can run SQL queries and be used directly in R. It provides machine learning, graph database capabilities and other features discussed further in the article.