Big Data Content on InfoQ
-
Amazon Redshift Serverless Generally Available to Automatically Scale Data Warehouse
Amazon recently announced the general availability of Redshift Serverless, an elastic option to scale data warehouse capacity. The new service allows data analysts, developers and data scientists to run and scale analytics without provisioning and managing data warehouse clusters.
-
Shopify’s Practical Guidelines from Running Airflow for ML and Data Workflows at Scale
In a company blog post, Shopify engineering shared its experience scaling and optimizing Apache Airflow for running ML and data workflows. The team described practical solutions to the challenges it faced, such as slow file access, insufficient control over DAGs, irregular traffic levels, and resource contention among workloads.
-
Fitting Presto to Large-Scale Apache Kafka at Uber
The need for ad-hoc real-time data analysis has been growing at Uber. They run a large Apache Kafka deployment and need to analyse data going through the many workflows it supports. Solutions like stream processing and OLAP datastores were deemed unsuitable. An article was published recently detailing why Uber chose Presto for this purpose and what it had to do to make it performant at scale.
-
Amazon Elastic MapReduce Now Generally Available as a Serverless Offering
AWS recently announced that Amazon Elastic MapReduce (EMR) Serverless is generally available (GA). The offering is a serverless deployment option for customers to run big data analytics applications using open-source frameworks like Apache Spark and Hive without configuring, managing, and scaling clusters or servers.
-
Google Introduces Autoscaling for Cloud Bigtable for Optimizing Costs
Cloud Bigtable is a fully-managed, scalable NoSQL database service for large operational and analytical workloads on the Google Cloud Platform (GCP). Google recently announced the general availability of Bigtable Autoscaling, which automatically adds or removes capacity in response to changing application demand, helping to optimize costs.
-
Amazon OpenSearch Adds Anomaly Detection for Historical Data
Amazon OpenSearch recently introduced support for anomaly detection on historical data. The machine learning-based feature helps identify trends, patterns, and seasonality in OpenSearch data.
-
AWS Announces the Public Preview of AWS Data Exchange for Amazon Redshift
AWS recently announced the public preview of AWS Data Exchange for Amazon Redshift. This new feature enables customers to find and subscribe to third-party data in AWS Data Exchange and query it in an Amazon Redshift data warehouse.
-
AWS Announces the General Availability and Open Sourcing of the Amazon Genomics CLI
Amazon Genomics CLI is a tool that makes it easier to process genomics data at petabyte scale on AWS. Earlier this year, the public cloud vendor shared a preview of the tool; it is now open source and generally available.
-
Hazelcast Jet 4.4 Released - the Four-Year Anniversary Release as Seen by Scott McMahon
Hazelcast Jet recently celebrated its four-year anniversary with the release of version 4.4. Besides the normal bug fixes and performance enhancements, this new version ships with new features such as the unified file connector and the first beta version of the SQL interface. InfoQ spoke to Scott McMahon, technical director of field engineering at Hazelcast, about this new release.
-
Using Machine Learning in Testing and Maintenance
With machine learning, we can reduce maintenance efforts and improve the quality of products. It can be used in various stages of the software testing life-cycle, including bug management, which is an important part of the chain. We can analyze large amounts of data for classifying, triaging, and prioritizing bugs in a more efficient way by means of machine learning algorithms.
-
DataStax Announces Astra Serverless Database-as-a-Service
DataStax, the company behind the Cassandra database, announced last week the general availability of Astra serverless, the open, multi-cloud serverless database-as-a-service (DBaaS).
-
Designing for Failure in the BBC's Analytics Platform
Last week at InfoQ Live, Blanca Garcia-Gil, principal systems engineer at BBC, gave a session on Evolving Analytics in the Data Platform. During this session, Garcia-Gil focused on how her team prepared and designed for two types of failure - "known unknowns" and "unknown unknowns."
-
Google Brings Databricks to Its Cloud Platform
Google recently announced a partnership with Databricks to bring Databricks' fully-managed Apache Spark offering and data lake capabilities to Google Cloud. The offering will become available as Databricks on Google Cloud.
-
PayPal Standardizes on Apache Airflow and Apache Gobblin for Its Next-Gen Data Movement Platform
PayPal recently described how it standardized on Apache Airflow and Apache Gobblin for implementing its next-gen data movement platform. In a blog post, PayPal engineers detail how the existing data movement platform had grown into a complex, unmanageable ecosystem of many tools and platforms, and how they shifted towards a new implementation.
-
Analyzing Large Amounts of Feedback to Learn from Users
Making it easy for users to give feedback and automating the collection of feedback helps to get more feedback faster. Using artificial intelligence, you can analyze large amounts of feedback to get insights and visualize trends. Sharing this information widely supports taking action to enhance your product and solve issues that users are having.