Big Data Content on InfoQ
-
Introducing Interoperable Blockchain Identity Solutions with Hyperledger Aries
In a recent blog post, the Hyperledger project announced its 13th project, Hyperledger Aries, an interoperable identity management toolkit for creating, transmitting and storing verifiable digital certificates. Using this toolkit, organizations can support secure, interoperable peer-to-peer messaging across different distributed ledger technologies (DLT).
-
Expo: Real Time A/B Testing and Monitoring with Spark Streaming and Kafka at Walmart Labs
The WalmartLabs engineering team developed a real-time A/B testing tool called Expo that collects and analyzes user engagement metrics. It uses Spark Structured Streaming to process the incoming data and stores the metrics in KairosDB.
-
Databricks Open Sources Delta Lake to Make Data Lakes More Reliable
Databricks recently announced that it is open sourcing Delta Lake, its proprietary storage layer, to bring ACID transactions to Apache Spark and big data workloads. Databricks was founded by the creators of Apache Spark, and Delta Lake is already in use at several companies, including McAfee and Upwork. Delta Lake addresses the heterogeneous data problem that data lakes often have...
-
Microsoft Releases High-Performance C# and F# Support for Apache Spark
Microsoft announced the release of .NET for Apache Spark, adding new high-performance C# and F# bindings to the big-data computation engine.
-
A Framework for High-Value Big Data
Asha Saxena recently spoke at the Enterprise Data World 2019 Conference about the value that big data analytics initiatives bring to organizations. Saxena proposed a big data framework that can help with organizational maturity and internal competencies.
-
Introducing TensorFlow Privacy, a New Machine Learning Library for Protecting Sensitive Data
In a recent blog post, TensorFlow announced TensorFlow Privacy, an open source library that allows researchers and developers to build machine learning models with strong privacy guarantees. The library relies on strong mathematical guarantees to ensure that user data are not memorized during the training process.
-
Microsoft Announces New Azure Analytics Services ADLS, ADX and More
Microsoft has announced the general availability of two new Azure analytics services: Azure Data Lake Storage Gen2 (ADLS) and Azure Data Explorer (ADX). Microsoft also announced the preview of Azure Data Factory Mapping Data Flow.
-
Microsoft Announces the General Availability of Azure Data Box Disk
In a recent blog post, Microsoft announced the general availability of Azure Data Box Disk, an SSD-based solution for offline data transfer to Azure. Microsoft also announced the public preview of Azure Data Box Blob Storage, a feature that allows customers to copy data to Blob Storage on a Data Box.
-
Q&A with Christoph Windheuser on AI Applications in the Industry
Increased hardware power and huge amounts of data are making existing machine learning approaches like pattern recognition, natural language processing, and reinforcement learning possible. Artificial Intelligence is impacting the development process; it’s increasing the complexity of things like version control, CI/CD and testing.
-
Amazon Announces Managed Streaming for Kafka in Public Preview
At the recent AWS re:Invent 2018 event, Amazon announced a new fully managed service that makes it easy for customers to build and run applications that use Apache Kafka to process streaming data. This new service is called Amazon Managed Streaming for Kafka, Amazon MSK for short, and is now in public preview.
-
Google Cloud Announces Transfer Appliance in Beta for Cloud Data Migrations in the EU
Google announced that Transfer Appliance, a high-capacity server that lets customers move large amounts of data to Google Cloud Platform (GCP) quickly and securely, is available in beta in the European Union (EU). Google will handle Transfer Appliance data transfers to GCP within the EU, and data will not leave the EU.
-
The Evolution of Uber’s 100+ Petabyte Big Data Platform
Uber’s engineering team wrote about how their big data platform evolved from traditional ETL jobs with relational databases to one based on Hadoop and Spark. A scalable ingestion model, standard transfer format and a custom library for incremental updates are the key components of the platform.
-
Data Lakes and Modern Data Architecture in Clinical Research and Healthcare
Dr. Prakriteswar Santikary, chief data officer at ERT, spoke at the Data Architecture Summit 2018 Conference last month about the data lake architecture his team developed at their clinical research organization. He discussed the data platform deployed in the cloud to streamline data collection, aggregation, clinical reporting and analytics, using concepts like serverless computing and data services.
-
Event Sourcing to the Cloud at HomeAway
Adam Haines, Data Architect at HomeAway, recently spoke at the Data Architecture Summit 2018 Conference about how his team leverages the event sourcing cloud design pattern to accelerate big data initiatives in their organization.
-
Cloudera and Hortonworks Merge with Goal to Increase Competition with Cloud Offerings
Earlier this month, Cloudera and Hortonworks announced an all-stock merger at a combined value of around $5.2 billion. Analysts have argued that the merger is aimed at addressing the increased competition that both companies face from cloud vendors like Amazon, Google and Microsoft. In this article we log reactions from analysts and the industry, and the implications for current customers.