Apache Spark Content on InfoQ

News

Cloud

Amazon Athena Now Supports Apache Spark Engine

Amazon Athena now supports the open-source distributed processing system Apache Spark to run fast analytics workloads. Data analysts and engineers can use Jupyter Notebook in Athena to perform data processing and programmatically interact with Spark applications.

Renato Losio
on Jan 22, 2023
Architecture & Design

Uber Reduces Logging Costs by 169x Using Compressed Log Processor (CLP)

Uber recently published how it dramatically reduced its logging costs using Compressed Log Processor (CLP). CLP is a tool capable of losslessly compressing text logs and searching them without decompression. It achieved a 169x compression ratio on Uber's log data, saving storage, memory, and disk/network bandwidth.

Eran Stiller
on Nov 28, 2022
.NET

Microsoft Releases SynapseML 0.1.0 with .NET and Cognitive Services Support

Microsoft announced the first .NET-compatible version of SynapseML, a new machine learning (ML) library for the Apache Spark distributed processing platform. Version 0.1.0 of the SynapseML library adds support for .NET bindings, allowing .NET developers to write ML pipelines in their preferred language.

Edin Kapić
on Sep 06, 2022
AI, ML & Data Engineering

Uber Open-Sourced Its Highly Scalable and Reliable Shuffle as a Service for Apache Spark

Uber engineering has recently open-sourced its highly scalable and reliable shuffle as a service for Apache Spark. Spark is one of the most important tools and platforms in data engineering and analytics. By default, it shuffles data on local machines, which causes challenges at very large scale. Shuffle as a service is a solution developed at Uber for this problem.

Reza Rahimi
on Aug 14, 2022
Cloud

Amazon Elastic MapReduce Now Generally Available as a Serverless Offering

AWS recently announced that Amazon Elastic MapReduce (EMR) Serverless is generally available (GA). The offering is a serverless deployment option for customers to run big data analytics applications using open-source frameworks like Apache Spark and Hive without configuring, managing, and scaling clusters or servers.

Steef-Jan Wiggers
on Jun 07, 2022
AI, ML & Data Engineering

Microsoft Open-Sources Distributed Machine Learning Library SynapseML

Microsoft announced the release of SynapseML, an open-source library for creating and managing distributed machine learning (ML) pipelines. SynapseML runs on Apache Spark, provides a language-agnostic API abstraction over several datastores, and integrates with several existing ML technologies, including Open Neural Network Exchange (ONNX).

Anthony Alford
on Dec 28, 2021
AI, ML & Data Engineering

Apache Spark Brings Pandas API with Version 3.2

The Apache Spark team has integrated the Pandas API in the product's latest 3.2 release. With this change, dataframe processing can be scaled across multiple nodes in a cluster or multiple processors in a single machine using the PySpark execution engine.

Sabri Bolkar
on Nov 04, 2021
Cloud

AWS Announces Customizable Image Support for Amazon EMR on EKS

Recently, AWS announced customizable image support for Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS) that allows customers to modify the Docker runtime image that runs their analytics application using Apache Spark on their EKS cluster.

Steef-Jan Wiggers
on Jul 28, 2021
Architecture & Design

Airbnb Builds Himeji - a Scalable Centralized Authorization System

Airbnb recently described how it built Himeji, a scalable centralized authorization system. Himeji stores permissions data and performs permission checks as a central source of truth. It uses a sharded and replicated in-memory cache to improve performance and lower latencies and has served checks in production for about a year.

Eran Stiller
on May 12, 2021
Architecture & Design

Designing for Failure in the BBC's Analytics Platform

Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure - "known unknowns" and "unknown unknowns."

Eran Stiller
on Feb 24, 2021
Cloud

Google Brings Databricks to Its Cloud Platform

Recently Google announced a partnership with Databricks to bring their fully-managed Apache Spark offering and data lake capabilities to Google Cloud. The offering will become available as Databricks on Google Cloud.

Steef-Jan Wiggers
on Feb 23, 2021
.NET

Microsoft Releases .NET for Apache Spark 1.0

Last month, Microsoft released the first major version of .NET for Apache Spark, an open-source package that brings .NET development to the Apache Spark platform. The new release allows .NET developers to write Apache Spark applications using .NET user-defined functions, Spark SQL, and additional libraries such as Microsoft Hyperspace and ML.NET.

Arthur Casals
on Nov 28, 2020
AI, ML & Data Engineering

Spark AI Summit 2020 Highlights: Innovations to Improve Spark 3.0 Performance

The highlights of the recent Spark AI Summit 2020, held online for the first time, were innovations to improve Apache Spark 3.0 performance, including optimizations for Spark SQL and GPU acceleration.

Carol McDonald
on Jul 03, 2020
AI, ML & Data Engineering

Boosting Apache Spark with GPUs and the RAPIDS Library

At the 2019 Spark AI Summit Europe conference, NVIDIA software engineers Thomas Graves and Miguel Martinez hosted a session on Accelerating Apache Spark by Several Orders of Magnitude with GPUs and RAPIDS Library. InfoQ recently talked with Jim Scott, head of developer relations at NVIDIA, to learn more about accelerating Apache Spark with GPUs and the RAPIDS library.

Carol McDonald
on Feb 25, 2020
AI, ML & Data Engineering

Databricks' Unified Analytics Platform Supports AutoML Toolkit

Databricks recently announced the Unified Data Analytics Platform, including an automated machine learning tool called AutoML Toolkit. The toolkit can be used to automate various steps of the data science workflow.

Srini Penchikala
on Oct 08, 2019
