Model Inference Content on InfoQ
News
Hugging Face Expands Serverless Inference Options with New Provider Integrations
Hugging Face has integrated four serverless inference providers (Fal, Replicate, SambaNova, and Together AI) directly into its model pages. The providers are also available through Hugging Face's client SDKs for JavaScript and Python, allowing users to run inference on a wide range of models with minimal setup.
Nvidia Announces Arm-Powered Project Digits, Its First Personal AI Computer
Capable of running 200B-parameter models, Nvidia Project Digits packs the new Nvidia GB10 Grace Blackwell chip, allowing developers to fine-tune and run AI models on their local machines. Starting at $3,000, Project Digits targets AI researchers, data scientists, and students, who can build models on a desktop system and then deploy them on cloud or data-center infrastructure.
Meta Optimises AI Inference by Improving Tail Utilisation
Meta (formerly Facebook) has reported substantial improvements in the efficiency and reliability of its machine-learning model-serving infrastructure by focusing on optimising tail utilisation, i.e. the utilisation of its most heavily loaded serving hosts.
JLama: The First Pure Java Model Inference Engine Implemented With Vector API and Project Panama
Karpathy's roughly 700-line llama2.c inference engine demystified how developers can interact with LLMs. Even before that, JLama started its journey toward becoming the first pure-Java inference engine for any Hugging Face model, from Gemma to Mixtral. Leveraging the new Vector API and a PanamaTensorOperations class with native fallback, the library is available on Maven Central.