Model Inference Content on InfoQ
News
KubeCon NA 2025 - Robert Nishihara on Open Source AI Compute with Kubernetes, Ray, PyTorch, and vLLM
AI workloads are growing more complex in terms of compute and data, and technologies like Kubernetes and PyTorch can help build production-ready AI systems to support them. Robert Nishihara of Anyscale recently spoke at KubeCon + CloudNativeCon North America 2025 about how an AI compute stack comprising Kubernetes, PyTorch, vLLM, and Ray can support these new workloads.
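As a minimal sketch of how two layers of such a stack compose (not code from the talk), the snippet below wraps a vLLM engine in a Ray actor; the model name and prompt are illustrative placeholders.

import ray
from vllm import LLM, SamplingParams

ray.init()  # connect to an existing Ray cluster, or start a local one

@ray.remote(num_gpus=1)
class InferenceWorker:
    def __init__(self, model: str):
        # vLLM handles weight loading, KV-cache management, and batching
        self.llm = LLM(model=model)

    def generate(self, prompts: list[str]) -> list[str]:
        params = SamplingParams(temperature=0.7, max_tokens=128)
        return [o.outputs[0].text for o in self.llm.generate(prompts, params)]

# Hypothetical model ID; any vLLM-supported checkpoint works the same way.
worker = InferenceWorker.remote("meta-llama/Llama-3.1-8B-Instruct")
print(ray.get(worker.generate.remote(["What does Ray add on top of Kubernetes?"])))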
LinkedIn Re-Architects Edge-Building System to Support Diverse Inference Workflows
LinkedIn has detailed its re-architected edge-building system, an evolution designed to support diverse inference workflows for delivering fresher and more personalized recommendations to members worldwide. The new architecture addresses growing demands for real-time scalability, cost efficiency, and flexibility across its global platform.
Gemma 3n Introduces Novel Techniques for Enhanced Mobile AI Inference
Launched in early preview last May, Gemma 3n is now officially available. It targets mobile-first, on-device AI applications, using new techniques designed to increase efficiency and improve performance, such as per-layer embeddings and transformer nesting (the MatFormer architecture).
Nvidia's GB200 NVL72 Supercomputer Achieves 2.7× Faster Inference on DeepSeek V3
In collaboration with Nvidia, researchers from the SGLang project have published early benchmarks of the GB200 (Grace Blackwell) NVL72 system, showing up to a 2.7× increase in LLM inference throughput over the H100 on the 671B-parameter DeepSeek-V3 model.
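For context, SGLang exposes an offline engine API in Python. A rough sketch of driving it is shown below; the tensor-parallel size and sampling parameters are illustrative assumptions, and a 671B-parameter model needs a multi-GPU deployment like the benchmarked rack-scale systems.

import sglang as sgl

# Offline engine; tp_size shards the model across GPUs (assumed settings)
llm = sgl.Engine(model_path="deepseek-ai/DeepSeek-V3", tp_size=8, trust_remote_code=True)
outputs = llm.generate(
    ["What does NVL72 refer to?"],
    {"temperature": 0.7, "max_new_tokens": 64},
)
print(outputs[0]["text"])
llm.shutdown()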
Google Brings Gemini Nano to ML Kit with New On-Device GenAI APIs
The new GenAI APIs recently added to ML Kit enable developers to use Gemini Nano for on-device inference in Android apps, supporting features like summarization, proofreading, rewriting, and image description.
Google Unveils Ironwood TPU for AI Inference
Google's Ironwood TPU is its most advanced custom AI accelerator to date, built for what the company calls the "age of inference." A full pod scales to 9,216 liquid-cooled chips delivering 42.5 exaflops of compute. Engineered for high-efficiency, low-latency inference workloads, Ironwood was designed with the help of AlphaChip, Google's AI-assisted approach to chip layout.
Anthropic's "AI Microscope" Explores the Inner Workings of Large Language Models
Two recent papers from Anthropic attempt to shed light on the processes that take place inside a large language model. They explore how to locate interpretable concepts and link them to the computational "circuits" that translate them into language, and how to characterize crucial behaviors of Claude 3.5 Haiku, including hallucinations, planning, and other key traits.
Hugging Face Expands Serverless Inference Options with New Provider Integrations
Hugging Face has integrated four serverless inference providers, Fal, Replicate, SambaNova, and Together AI, directly into its model pages. These providers are also integrated into Hugging Face's client SDKs for JavaScript and Python, allowing users to run inference on various models with minimal setup.
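As an illustrative sketch, selecting one of these providers through the Python SDK (huggingface_hub) looks roughly like this; the model ID and prompt are placeholders.

from huggingface_hub import InferenceClient

# Provider slugs include "together", "replicate", "fal-ai", and "sambanova"
client = InferenceClient(provider="together")

response = client.chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is serverless inference?"}],
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    max_tokens=100,
)
print(response.choices[0].message.content)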
Nvidia Announces Arm-Powered Project Digits, Its First Personal AI Computer
Capable of running 200B-parameter models, Nvidia Project Digits packs the new Nvidia GB10 Grace Blackwell chip, letting developers fine-tune and run AI models on their local machines. Starting at $3,000, Project Digits targets AI researchers, data scientists, and students, who can build models on a desktop system and then deploy them on cloud or data-center infrastructure.
Meta Optimises AI Inference by Improving Tail Utilisation
Meta (formerly Facebook) has reported substantial improvements in the efficiency and reliability of its machine-learning model serving infrastructure by focusing on optimising tail utilisation, the load on its most heavily utilised serving hosts.
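To illustrate the metric itself (a toy sketch with made-up numbers, not Meta's implementation): tail utilisation compares the busiest hosts against the fleet average, and a wide gap means capacity is provisioned for a few hot hosts while most of the fleet sits underused.

import random
import statistics

random.seed(0)
# Simulated per-host utilisation (fraction of capacity) for a 1,000-host fleet
fleet = [random.betavariate(2, 5) for _ in range(1000)]

mean = statistics.mean(fleet)
p99 = statistics.quantiles(fleet, n=100)[98]  # 99th-percentile host
print(f"mean utilisation: {mean:.2f}, p99 (tail) utilisation: {p99:.2f}")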
JLama: The First Pure Java Model Inference Engine Implemented With Vector API and Project Panama
Karpathy's 700-line llama2.c inference engine demystified how developers can interact with LLMs. Even before that, JLama had started its journey toward becoming the first pure-Java inference engine for any Hugging Face model, from Gemma to Mixtral. Leveraging the new Vector API through its PanamaTensorOperations class, with a native fallback, the library is available on Maven Central.