Machine Learning Content on InfoQ
-
Karrot Improves Conversion Rates by 70% with New Scalable Feature Platform on AWS
Karrot replaced its legacy recommendation system with a scalable architecture that leverages various AWS services. The company sought to address challenges related to tight coupling, limited scalability, and poor reliability in its previous solution, opting instead for a distributed, event-driven architecture built on top of scalable cloud services.
-
How Discord Scaled its ML Platform from Single-GPU Workflows to a Shared Ray Cluster
Discord has detailed how it rebuilt its machine learning platform after hitting the limits of single-GPU training. The changes enabled daily retrains for large models and contributed to a 200% uplift in a key ads ranking metric.
-
KubeCon NA 2025 - Robert Nishihara on Open Source AI Compute with Kubernetes, Ray, PyTorch, and vLLM
AI workloads are growing more complex in terms of compute and data, and technologies like Kubernetes and PyTorch can help build production-ready AI systems to support them. Robert Nishihara from Anyscale recently spoke at KubeCon + CloudNativeCon North America 2025 about how an AI compute stack comprising Kubernetes, PyTorch, vLLM, and Ray can support these new AI workloads.
-
LinkedIn’s Migration Journey to Serve Billions of Users by Nishant Lakshmikanth at QCon SF
Engineering Manager Nishant Lakshmikanth showcased LinkedIn's transformation at QCon SF 2025, detailing a shift from legacy batch-based systems to a real-time architecture. By decoupling recommendations and leveraging dynamic scoring techniques, LinkedIn achieved a 90% reduction in offline costs, enhanced session-level freshness, and improved member engagement while future-proofing its platform.
-
Google Announces Gemini 3
Google unveiled Gemini 3 on November 18, 2025, positioning it as a new standard for multimodal AI and integrating it across products such as Search and Vertex AI. With capabilities spanning text, code, and rich media, the model targets both consumer and enterprise applications, and Gemini 3 Pro's Deep Think mode aims to improve reasoning and multi-step task execution.
-
New IBM Granite 4 Models to Reduce AI Costs with Inference-Efficient Hybrid Mamba-2 Architecture
IBM recently announced the Granite 4.0 family of small language models. The family aims to deliver faster inference and significantly lower operational costs while maintaining acceptable accuracy compared with larger models. Granite 4.0 features a new hybrid Mamba-2/transformer architecture that substantially reduces memory requirements, enabling the models to run on cheaper GPUs at lower cost.
-
KubeCon NA 2025 - Erica Hughberg and Alexa Griffith on Tools for the Age of GenAI
Generative AI brings new workloads, traffic patterns, and infrastructure demands, and with them the need for a new set of platform tools. Erica Hughberg from Tetrate and Alexa Griffith from Bloomberg spoke at KubeCon + CloudNativeCon North America 2025 about what it takes to build GenAI platforms capable of serving model inference at scale.
-
Anthropic Adds Sandboxing and Web Access to Claude Code for Safer AI-Powered Coding
Anthropic released sandboxing capabilities for Claude Code and launched a web-based version of the tool that runs in isolated cloud environments. The company introduced these features to address security risks that arise when Claude Code writes, tests, and debugs code with broad access to developer codebases and files.
-
New Claude Haiku 4.5 Model Promises Faster Performance at One-Third the Cost
Anthropic released Claude Haiku 4.5, making the model available to all users as its latest entry in the small, fast model category. The company positions the new model as delivering performance levels comparable to Claude Sonnet 4, which launched five months ago as a state-of-the-art model, but at "one-third the cost and more than twice the speed."
-
How Meta Is Using AI to Standardize and Cut Carbon Emissions
Meta has developed an AI-based approach to improve the quality of Scope 3 emissions estimates across its IT hardware supply chain. The method combines machine learning and generative models to classify hardware components and infer missing product carbon footprint (PCF) data.
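A very rough way to picture those two steps, classifying components and filling in missing footprint data, is sketched below. The categories, emission factors, and use of a simple scikit-learn text classifier are assumptions for illustration and are not Meta's pipeline.

```python
# Toy sketch: classify hardware components from free-text descriptions,
# then fill missing product carbon footprint (PCF) values with per-class
# estimates. Hypothetical illustration only; not Meta's pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled component descriptions (assumed categories).
train_texts = ["32GB DDR5 RDIMM", "1.92TB NVMe SSD", "800W PSU", "64GB DDR4 DIMM"]
train_labels = ["memory", "storage", "power", "memory"]

classifier = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

# Assumed average kgCO2e per category, used only when a supplier PCF is missing.
category_pcf_kgco2e = {"memory": 35.0, "storage": 80.0, "power": 20.0}

components = [
    {"desc": "16GB DDR5 SODIMM", "pcf": None},
    {"desc": "3.84TB NVMe SSD", "pcf": 95.0},  # supplier-reported value kept as-is
]
for comp in components:
    if comp["pcf"] is None:
        category = classifier.predict([comp["desc"]])[0]
        comp["pcf"] = category_pcf_kgco2e[category]

print(components)
```

In practice the classification and imputation would draw on far richer data, but the sketch shows how a category label lets a per-class estimate stand in for a missing supplier-reported footprint.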
-
Google Research Open-Sources the Coral NPU Platform to Help Build AI into Wearables and Edge Devices
Coral NPU is an open-source, full-stack platform designed to help hardware engineers and AI developers overcome the key barriers to integrating AI into wearables and edge devices: performance, fragmentation, and user trust.
-
Instagram Improves Engagement by Reducing Notification Fatigue with New Ranking Framework
Meta has introduced a diversity-aware ranking framework for Instagram notifications. The system applies multiplicative penalties to reduce repetitive alerts from the same creators or product surfaces, improving engagement while maintaining relevance and introducing content variety.
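The mechanism is straightforward to sketch. The example below is a hypothetical illustration, not Meta's code, of applying multiplicative penalties to candidate notification scores based on how often the same creator or product surface has recently been shown; the decay factors and data shapes are assumptions.

```python
# Minimal sketch of diversity-aware ranking with multiplicative penalties.
# Hypothetical illustration only; not Meta's implementation.
from collections import Counter

def penalized_score(base_score: float,
                    prior_creator_hits: int,
                    prior_surface_hits: int,
                    creator_decay: float = 0.7,
                    surface_decay: float = 0.85) -> float:
    """Down-weight a candidate the more often its creator or product
    surface has already appeared in the user's recent notifications."""
    return (base_score
            * creator_decay ** prior_creator_hits
            * surface_decay ** prior_surface_hits)

# Candidate notifications with relevance scores from an upstream ranker.
candidates = [
    {"id": "n1", "creator": "a", "surface": "reels", "score": 0.90},
    {"id": "n2", "creator": "a", "surface": "reels", "score": 0.88},
    {"id": "n3", "creator": "b", "surface": "shop",  "score": 0.80},
]

# Recently sent notifications: creator "a" on "reels" already appeared twice.
recent = [("a", "reels"), ("a", "reels")]
creator_hits = Counter(c for c, _ in recent)
surface_hits = Counter(s for _, s in recent)

ranked = sorted(
    candidates,
    key=lambda c: penalized_score(c["score"],
                                  creator_hits[c["creator"]],
                                  surface_hits[c["surface"]]),
    reverse=True,
)
print([c["id"] for c in ranked])  # n3 now outranks the repetitive n1 and n2
```

Because the penalties multiply, repeated sources are suppressed progressively rather than cut off outright, which matches the stated goal of keeping relevance while adding variety.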
-
An AI-Driven Approach to Creating Effective Learning Experiences at QCon
At InfoQ Dev Summit Boston, Wes Reisz described an AI-driven experiment he led around a certification program at QCon London. The program included special events during the conference, a pre-conference breakfast where participants could learn about upcoming activities, and an AI-driven workshop immediately following the conference.
-
How Netflix is Reimagining Data Engineering for Video, Audio, and Text
Netflix has introduced a new engineering specialization, Media ML Data Engineering, alongside a Media Data Lake designed to handle video, audio, text, and image assets at scale. Early results include richer ML models trained on standardized media, faster evaluation cycles, and deeper insights into creative workflows.
-
Roblox Open-Sources AI System to Detect Conversations Potentially Harmful to Kids
Roblox Sentinel is an AI system designed to detect early signs of potential child endangerment for further analysis and investigation. Implemented as a Python library, Sentinel uses contrastive learning to handle highly imbalanced datasets that often challenge traditional classifiers and can be applied to a wide range of use cases.
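As a rough illustration of why contrastive objectives help with heavy class imbalance, the sketch below trains a tiny embedding model with a supervised contrastive-style loss in PyTorch, pulling same-label examples together and pushing different-label examples apart. It is a toy example under assumed shapes and labels, not Sentinel's actual code.

```python
# Toy sketch of a supervised contrastive objective on an imbalanced batch.
# Hypothetical illustration only; not the Roblox Sentinel implementation.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """Pull same-label embeddings together and push different-label ones apart.
    The loss is defined over pairs rather than per-class counts, which is why
    it degrades less than a plain classifier when one class is very rare."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                        # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)  # positive pairs
    mask.fill_diagonal_(False)

    # Row-wise log-softmax, excluding each example's similarity with itself.
    logits = sim - torch.eye(len(z), device=z.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average log-probability of positives for each anchor that has any.
    pos_counts = mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * mask).sum(dim=1) / pos_counts
    return loss[mask.any(dim=1)].mean()

# Example: embeddings from a stand-in encoder, 2 positives among 6 negatives.
encoder = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 8))
x = torch.randn(8, 16)
y = torch.tensor([1, 0, 0, 0, 0, 0, 0, 1])  # highly imbalanced batch
loss = supervised_contrastive_loss(encoder(x), y)
loss.backward()
```

Because the objective is computed over pairs within a batch, rare positive examples still produce a meaningful training signal even when they make up a tiny fraction of the data.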