Effectively measuring the performance of applications that leverage Large Language Models (LLMs) is critical to the adoption of AI technologies in organizations. Legare Kerrison and Cedric Clyburn from the Red Hat team recently spoke at the Arc of AI 2026 Conference about practical methods for evaluating and optimizing LLM inference. They discussed the resource requirements and cost implications of different workloads in AI applications, like Retrieval Augmented Generation (RAG) and Agentic AI. Kerrison and Clyburn also talked about the importance of metrics like Requests Per Second (RPS), Time to First Token (TTFT), and Inter-Token Latency (ITL) when evaluating these applications.
The speakers began the presentation by highlighting that 2023 was the year of LLMs, with Hugging Face and other model providers; 2024 was the year of RAG; 2025 was the year of model fine-tuning and AI agents; and they predicted that 2026 will be about LLM evaluations. In terms of challenges with AI deployments and LLM evaluation and performance, public leaderboards are helpful, but they tend to be generic. Some websites use criteria like hard prompts, coding, math, and creative writing to rank the models. Your unique business problems and data are not represented in these benchmarks, so they need to be used with their limitations in mind. Software development teams should understand the overall AI technology landscape to choose the best model and provider for their specific use cases.
The speakers highlighted the common pain points they experienced in real-world projects deploying LLMs, where delivering production-ready models means navigating the "tradeoff triangle" between model quality (accuracy), responsiveness (latency), and overall cost. Optimizing for any two of these factors impacts the third. For example, focusing on high accuracy and low latency leads to higher deployment costs. Applications built for low cost and high accuracy typically incur high latency. And too much focus on low cost and low latency results in low model accuracy. When choosing the right model, performance targets, and hardware infrastructure for your workloads, clear measurements and evaluations help make informed decisions.
Teams need to shift their focus from model choices alone to the actual application requirements and priorities of their systems in order to provide the right solutions to their customers. Service level objectives (SLOs) with clearly defined key performance and quality metrics ensure applications stay fast, useful, and trustworthy for end users, and can guide structured comparisons across models and hardware, enabling cost optimizations. The Requests Per Second (RPS) metric measures how many inference requests a system can handle per second; it indicates overall throughput and how well the serving stack scales under load. Time to First Token (TTFT) is the time between sending a request and receiving the first generated token; it captures the latency perceived by the user. Inter-Token Latency (ITL) is the time between each subsequent token after the first one; it reflects how fast streaming output feels to the user and indicates decoder efficiency.
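As an illustration, the sketch below shows one way TTFT and ITL could be measured against an OpenAI-compatible streaming endpoint such as a local vLLM server; the base URL, model name, and prompt are placeholder assumptions, and each streamed chunk is treated as a single token for simplicity.

```python
# Minimal sketch: measure TTFT and ITL from a streaming chat completion.
# The endpoint, model name, and prompt below are placeholder assumptions.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize our return policy in two sentences."}],
    stream=True,
)

# Record the arrival time of each streamed chunk (treated here as one token each).
token_times = [
    time.perf_counter()
    for chunk in stream
    if chunk.choices and chunk.choices[0].delta.content
]

ttft = token_times[0] - start                                   # Time to First Token
itls = [b - a for a, b in zip(token_times, token_times[1:])]    # Inter-Token Latencies
avg_itl = sum(itls) / len(itls) if itls else 0.0

print(f"TTFT: {ttft * 1000:.1f} ms, average ITL: {avg_itl * 1000:.1f} ms")
```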
They showed examples of SLOs for different workloads, covering a variety of use cases and benchmarking metrics. An e-commerce chatbot solution would require fast, conversational responses; the TTFT target for this use case would typically be ≤200ms and the ITL target ≤50ms for 99% of requests (P99). On the other hand, a RAG-based application would prioritize accuracy and completeness over raw speed, and RAG use cases tend to consume more input tokens and produce fewer output tokens. The targets for TTFT, ITL, and request latency would be ≤300ms, ≤100ms (if streamed), and ≤3000ms, respectively, for 99% of requests.
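Such targets become actionable once they are encoded as explicit checks over measured latencies. Below is a minimal sketch of that idea, using randomly generated sample data and the chatbot thresholds above purely as illustrative placeholders.

```python
# Minimal sketch: compare P99 latency measurements against SLO targets.
# The sample data and thresholds are illustrative placeholders only.
import numpy as np

# Per-request measurements in milliseconds, normally collected from a benchmark run.
ttft_ms = np.random.lognormal(mean=5.0, sigma=0.3, size=1_000)    # ~150 ms median TTFT
itl_ms = np.random.lognormal(mean=3.5, sigma=0.3, size=10_000)    # ~33 ms median ITL

slo_targets = {"ttft_p99_ms": 200, "itl_p99_ms": 50}
measured = {
    "ttft_p99_ms": float(np.percentile(ttft_ms, 99)),
    "itl_p99_ms": float(np.percentile(itl_ms, 99)),
}

for name, target in slo_targets.items():
    status = "PASS" if measured[name] <= target else "FAIL"
    print(f"{name}: measured {measured[name]:.1f} ms vs target {target} ms -> {status}")
```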
After deciding on the application priorities, teams should focus on hardware requirements. LLM inference has two phases: prefill, which is compute-bound, and decode, which is memory-bound. The prefill phase processes the input prompt and produces the first token, so it is relatively easy to keep the hardware busy, whereas the decode phase generates each subsequent token one at a time and is harder to utilize efficiently. Techniques like structured generation, speculative decoding, prefix caching, and session caching can help serve LLMs more efficiently. The speakers mentioned that running LLMs locally, where it makes sense, has the advantage of not sending requests to the cloud, so it can be more efficient for specific use cases.
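Prefix caching, for example, lets the serving engine reuse the computed key-value cache for a shared prompt prefix, such as a long system prompt, across requests. Below is a minimal sketch using vLLM's offline API; the model name and prompts are placeholders, and flag names can vary between vLLM versions.

```python
# Minimal sketch: enable automatic prefix caching in vLLM so a shared prompt
# prefix (e.g., a long system prompt) is not recomputed for every request.
# The model name and prompts are placeholders; flag names may vary by version.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support assistant for an online store. Answer briefly.\n\n"
prompts = [
    shared_prefix + "Where is my order #1234?",
    shared_prefix + "How do I return a damaged item?",
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for output in outputs:
    print(output.outputs[0].text.strip())
```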
They defined the term Model Evaluation as the process of assessing a model's overall performance and suitability for its intended purpose across various criteria, i.e., how a specific model runs under a given workload on specific hardware. Model benchmarking was defined as a standardized comparison of a model's performance against predefined datasets, tasks, and other models.
They talked about what their teams typically measure for LLMs across different request flow patterns. In the standard request flow, the response is returned only after all of the tokens have been generated, so end-to-end request latency is the important metric for this pattern. In the streaming request flow, tokens are delivered as they are generated, and because LLM requests are not homogeneous, metrics like TTFT and ITL need to be formally tracked.
LLM performance metrics are affected by factors like model architecture & size, quantization (compressing models by reducing the precision of their weights), the serving engine (e.g., Ollama, vLLM, TGI, Triton), hardware (GPU memory), and batching & concurrency choices.
Model inference performance assessment is time-consuming and fragmented, which makes it difficult to measure LLM deployments consistently. Kerrison and Clyburn showed some examples of LLM workloads that teams need to plan for, and the kinds of questions evaluations should answer, like "With an NVIDIA H200, should I use Llama 3.1 8B or Llama 3.1 70B Instruct to create a customer service chatbot?" or "How many servers do I need to keep my service running under maximum load?"
The speakers then discussed benchmarking with open source toolkits like GuideLLM, which supports SLO-aware benchmarking of LLM deployments. GuideLLM, part of the vLLM project, works by simulating real-world traffic and measuring metrics like throughput and latency. Its process flow includes steps like model selection and customization, dataset selection with real or synthetic data, workload configuration, and running the benchmark tests. If the model meets the desired SLO goals, it can be deployed in production on the vLLM engine.
Clyburn showed GuideLLM test results with simulated workload profiles like synchronous (a single stream of requests, one at a time) and concurrent (a fixed number of synchronous streams running in parallel), using Hugging Face (ShareGPT), file-based, and in-memory datasets. He shared the benchmark statistics for different workloads like chat, RAG, summarization, and code generation, for P99 (99th percentile) and P90 (90th percentile) latency metrics.
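To give a feel for what those two profiles do, the sketch below hand-rolls a synchronous stream and a fixed number of concurrent streams against an OpenAI-compatible endpoint; it is not GuideLLM itself, and the endpoint, model, and prompt are placeholder assumptions.

```python
# Hand-rolled illustration (not GuideLLM) of the two workload profiles above:
# "synchronous" is one stream of requests issued back to back, while "concurrent"
# is several such streams running in parallel. Endpoint, model, and prompt are
# placeholder assumptions.
import asyncio
import time

from openai import AsyncOpenAI

async def one_stream(client: AsyncOpenAI, num_requests: int) -> list[float]:
    """Issue requests back to back on a single stream and record end-to-end latencies."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        await client.chat.completions.create(
            model="meta-llama/Llama-3.1-8B-Instruct",
            messages=[{"role": "user", "content": "Write a one-sentence product blurb."}],
            max_tokens=64,
        )
        latencies.append(time.perf_counter() - start)
    return latencies

async def run_profile(profile: str, streams: int = 4, num_requests: int = 5) -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
    n = 1 if profile == "synchronous" else streams
    per_stream = await asyncio.gather(*(one_stream(client, num_requests) for _ in range(n)))
    flat = [latency for stream in per_stream for latency in stream]
    print(f"{profile}: {len(flat)} requests, mean latency {sum(flat) / len(flat):.2f}s")

asyncio.run(run_profile("synchronous"))
asyncio.run(run_profile("concurrent"))
```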
In addition to LLM inference performance, we also need to consider the evaluation of model accuracy. LLM accuracy evaluation use cases should include categories like model accuracy, pipeline accuracy (for RAG and AI agents), and application accuracy. Some of the open source evaluation tools include the following:
- Model-centric evaluation: lm-eval-harness (which powers the Hugging Face Open LLM Leaderboard; see the sketch after this list), Unitxt, OpenAI Evals
- RAG-centric evaluation: Ragas, LlamaIndex Evals, Haystack Eval framework
- App/Workflow/Agent evaluation: Ragas (extended), Langfuse, TruLens
- Human + LLM-as-judge evaluation: Human annotation, LLM-as-a-judge
- Domain-Specific Accuracy: PubMedQA (biomedical), FiQA (finance), CaseHOLD (legal)
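As an illustration of the model-centric category, the rough sketch below runs lm-eval-harness from Python; the model, tasks, and argument names follow the project's documented API but are assumptions that may vary by version.

```python
# Rough sketch: model-centric accuracy evaluation with lm-eval-harness.
# Model and task choices are illustrative; argument names follow the project's
# documented Python API and may differ between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                          # Hugging Face Transformers backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["hellaswag", "arc_easy"],
    num_fewshot=0,
    batch_size=8,
)

# Print the per-task metric dictionaries (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```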
The speakers concluded the talk by emphasizing that application teams should look into LLM optimization techniques like quantization, noting that compressing models is often more effective than niche optimization techniques. In one instance, quantization using GPTQModifier resulted in a 45% reduction in model size. Another technique is KV caching, which saves redundant computation and accelerates decoding, at the cost of additional memory. For additional learning on AI topics, they recommended the Hugging Face website, which hosts Red Hat AI-validated language models, and the deeplearning.ai website for training courses on AI in general.
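For reference, here is a rough sketch of a one-shot weight quantization recipe using the GPTQModifier mentioned above, based on published llm-compressor examples; the model, calibration dataset, and exact import paths are assumptions that may differ by library version and are not necessarily what produced the 45% result.

```python
# Rough sketch: one-shot W4A16 weight quantization with GPTQModifier from the
# llm-compressor project. The model, calibration dataset, and import paths are
# assumptions that may differ by library version.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize the linear layers...
    scheme="W4A16",        # ...to 4-bit weights with 16-bit activations
    ignore=["lm_head"],    # keep the output head at full precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",                  # calibration data for GPTQ
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Llama-3.1-8B-Instruct-W4A16",
)
```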