
Not seeing ROI from your AI? Observability may be the missing link

Tuesday, February 4, 2025, 10:00, by InfoWorld
From chatbots to coding copilots to AI agents, generative AI-powered apps are seeing increased traction among enterprises. As they go mainstream, however, their shortcomings are becoming clearer and more problematic. Incomplete, offensive, or wildly inaccurate responses (aka hallucinations), security vulnerabilities, and disappointingly generic responses can be roadblocks to deploying AI — and for good reason.

In the same way that cloud-based platforms and applications gave birth to new tools designed to evaluate, debug, and monitor those services, the proliferation of AI requires its own set of dedicated observability tools. AI-powered applications are becoming too important to treat as interesting but unreliable test cases — they must be managed with the same rigor as any other business-critical application. In other words, AI needs observability.

What is AI observability?

Observability refers to the technologies and business practices used to understand the complete state of a technical system, platform, or application. For AI-powered applications specifically, observability means understanding all aspects of the system, from end to end. Observability helps companies evaluate and monitor the quality of inputs, outputs, and intermediate results of applications based on large language models (LLMs), and can help to flag and diagnose hallucinations, bias, and toxicity, as well as performance and cost issues.
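
To make this concrete, here is a minimal sketch, in Python, of the kind of trace logging an observability layer builds on: every model or retrieval step writes a structured record of its inputs, outputs, and latency. The function names and the JSONL log file are hypothetical, and call_model is a stand-in for whatever LLM client an application actually uses.

import json
import time
import uuid

def log_llm_call(step_name, inputs, output, latency_s, log_path="llm_trace.jsonl"):
    # Append one structured record per LLM or retrieval step so runs can be inspected later.
    record = {
        "trace_id": str(uuid.uuid4()),
        "step": step_name,
        "inputs": inputs,
        "output": output,
        "latency_s": round(latency_s, 3),
        "timestamp": time.time(),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def call_model(prompt):
    # Stand-in for a real LLM client call.
    return "stubbed model response"

start = time.time()
answer = call_model("Summarize our Q3 results.")
log_llm_call("generation", {"prompt": "Summarize our Q3 results."}, answer, time.time() - start)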

We need observability in AI because the technology is starting to show its limitations at the precise moment that it’s becoming indispensable — and for enterprises, the limitations are unacceptable.

For example, I teach a computer science course on trustworthy machine learning at Stanford University and advise my students to consider LLMs’ answers as hallucinatory unless proven otherwise. Why? Because LLMs are trained to generalize from large bodies of text, generating original text modeled on the general patterns found in the text they were trained on. They are not built to memorize facts. 

But when LLMs are being used in place of search engines, some users approach them with the expectation that they will deliver accurate and helpful results. If the AI fails to do that, it erodes trust. In one egregious example, two lawyers were fined for submitting a legal brief written by AI that cited non-existent cases. 

Hallucinations, security leaks, and incorrect answers undermine the trust businesses need to have in the AI-powered applications they build, and present roadblocks for bringing AI into production. If the LLM produces inappropriate answers, it also hurts the ability of consumers to trust the company itself, causing damage to the brand. 

Moving beyond ‘looks good to me’

As one corporate LLM user told me, “We want an easy way to evaluate and test the accuracy of different models and apps instead of taking the ‘looks good to me’ approach.” From evaluation to ongoing monitoring, observability is increasingly important to any organization using AI applications.

AI observability gives the owners of AI applications the power to monitor, measure, and correct performance, helping in three different aspects of corporate AI use:

Evaluation and experimentation: With so many AI models and tools on the market, it’s important that enterprises can easily determine which elements work best for their specific AI app use case. Observability is critical for evaluating different LLMs, configuration choices, code libraries, and more, enabling users to optimize their tech choices for each project.

Monitoring and iteration: Once an AI app has been deployed and is in use, observability helps with logging execution traces and monitoring its ongoing performance. When a problem crops up, observability is crucial for diagnosing the cause, fixing the problem, and then validating the fix — an iterative process of continuous improvement familiar to anyone who has worked with cloud software.

Tracking costs and latency: Tech leaders are becoming increasingly practical about their AI efforts. Gone are the days of unchecked AI spending — leaders are now deeply concerned with the ROI of AI projects and with understanding which use cases deliver business results. From this perspective, the two essential dimensions to measure are how much an application costs and how much time it takes to deliver answers (known as latency). Throwing more GPUs and servers at an application can reduce latency, but it drives up cost. You can’t find the right balance for your application unless you can measure both accurately. Observability gives enterprises a clearer picture of both dimensions, enabling them to maximize results and minimize costs, as the sketch after this list illustrates.
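
The sketch below times a single model call and estimates its cost from approximate token counts. The per-token prices and the four-characters-per-token heuristic are assumptions for illustration only; real pricing and tokenization depend on the model and provider.

import time

# Illustrative per-1,000-token prices; real prices depend on the model and provider.
PRICE_PER_1K_INPUT_TOKENS = 0.0005
PRICE_PER_1K_OUTPUT_TOKENS = 0.0015

def measure_call(model_fn, prompt):
    # Time one model call and estimate its cost from rough token counts.
    start = time.perf_counter()
    response = model_fn(prompt)
    latency_s = time.perf_counter() - start

    # Crude token estimate (about 4 characters per token); a real tokenizer is more accurate.
    input_tokens = len(prompt) / 4
    output_tokens = len(response) / 4
    cost = (input_tokens / 1000) * PRICE_PER_1K_INPUT_TOKENS + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
    return {"latency_s": round(latency_s, 3), "estimated_cost_usd": round(cost, 6)}

# Usage with a stand-in model function:
print(measure_call(lambda p: "stubbed answer", "Explain our churn drivers."))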

What enterprises should expect and demand from AI

As enterprises bring AI applications into production, they must expect and demand more than “good enough.” For AI to become a reliable, trustworthy component of business infrastructure, its answers must align with the “3H rule,” being honest, harmless, and helpful.

AI needs to be honest, meaning factually accurate and free of hallucinations. Enterprises must be able to use LLMs for tasks where their ability to generalize is desirable: summarizing, generating inferences, and planning. Honest AI also means the system recognizes and acknowledges when it cannot accurately answer a question. For example, if the answer simply does not exist, the LLM should say “I cannot answer that” rather than spitting out something random.

For tasks where memorization of facts is more important, we need to supplement LLMs with additional information and data sources to ensure that responses are accurate. This is an active field of research known as retrieval-augmented generation, or RAG: combining LLMs with databases of factual information that they can retrieve from to answer specific questions.
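
Here is a minimal sketch of the RAG pattern, assuming a toy in-memory document store and a stand-in model_fn for the LLM call. Production systems would use embeddings and a vector database rather than keyword overlap for retrieval.

# A toy knowledge base; in practice this would be a search index or vector database.
DOCUMENTS = [
    "The 2024 annual report was published in March 2025.",
    "Support tickets are answered within 24 hours on business days.",
    "The enterprise plan includes single sign-on and audit logs.",
]

def retrieve(query, documents, k=2):
    # Very crude keyword-overlap retrieval; real systems use embeddings.
    def score(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

def answer_with_rag(query, model_fn):
    context = retrieve(query, DOCUMENTS)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say 'I cannot answer that.'\n\n"
        "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    )
    return model_fn(prompt)

# model_fn stands in for any LLM client call.
print(answer_with_rag("When was the annual report published?", lambda p: "stubbed answer"))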

AI needs to be harmless, meaning answers don’t leak personally identifiable information and are not vulnerable to “jailbreak” attacks designed to circumvent their designers’ guardrails. Those guardrails must ensure that the answers don’t embody bias, hurtful stereotypes, or toxicity. 

Finally, AI needs to be helpful. It needs to deliver answers that match the queries users give it, that are concise and coherent, and that provide useful results.

The RAG Triad: A framework for evaluating AI apps

The RAG Triad is one example of a set of metrics that helps evaluate RAG apps to ensure that they are honest and helpful. It includes three metrics — context relevance, groundedness, and answer relevance — to measure the quality of the three steps of a typical RAG application. 

Context relevance measures how relevant each piece of context retrieved from the knowledge base is to the query that was asked.

Groundedness measures how well the final response is grounded in or supported by the retrieved pieces of context. 

Answer relevance measures how relevant the final response is to the query that was asked.

By decomposing a composite RAG system into its components — query, context, and response — this evaluation framework can triage failure points, provide a clearer understanding of where improvements are needed, and guide targeted optimization.

Figure 1. The RAG Triad (image: Snowflake)
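
As a rough sketch of how the three metrics might be computed, the code below scores each one with simple lexical overlap. This is only an illustrative proxy; production evaluators typically rely on an LLM judge or a trained scoring model rather than word overlap.

def overlap(a, b):
    # Fraction of words in a that also appear in b (a rough lexical proxy only).
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    return len(a_words & b_words) / max(len(a_words), 1)

def rag_triad_scores(query, contexts, response):
    # Score context relevance, groundedness, and answer relevance with crude overlap.
    context_relevance = sum(overlap(query, c) for c in contexts) / max(len(contexts), 1)
    groundedness = overlap(response, " ".join(contexts))
    answer_relevance = overlap(query, response)
    return {
        "context_relevance": round(context_relevance, 2),
        "groundedness": round(groundedness, 2),
        "answer_relevance": round(answer_relevance, 2),
    }

scores = rag_triad_scores(
    query="When was the annual report published?",
    contexts=["The 2024 annual report was published in March 2025."],
    response="The 2024 annual report was published in March 2025.",
)
print(scores)  # Low scores on any metric point to the failing step in the pipeline.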

Guarding against harm involves aligning models to mitigate safety risks (see, for example, Llama Guard) and adding guardrails to applications for metrics related to toxicity, stereotyping, adversarial attacks, and more.
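
A toy guardrail along these lines might scan model output for obvious leaks before it reaches the user. The patterns and the blocked-phrase list below are illustrative placeholders; real guardrails combine safety classifiers, policy engines, and model alignment rather than simple pattern matching.

import re

# Illustrative patterns only; real guardrails combine safety classifiers with policies.
PII_PATTERNS = [
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),       # email addresses
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), # US-style phone numbers
]
BLOCKED_PHRASES = {"ignore previous instructions"}    # naive jailbreak cue

def check_output(text):
    # Flag responses that appear to leak PII or echo a known jailbreak phrase.
    issues = []
    if any(pattern.search(text) for pattern in PII_PATTERNS):
        issues.append("possible PII leak")
    if any(phrase in text.lower() for phrase in BLOCKED_PHRASES):
        issues.append("possible jailbreak echo")
    return issues

print(check_output("Contact me at jane.doe@example.com"))  # ['possible PII leak']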

There has been substantial progress toward achieving the 3H requirements and making AI apps honest, harmless, and helpful. With AI observability, we can guard against hallucinations, catch irrelevant and incomplete responses, and identify security lapses. The rise of agentic workflows raises an additional set of challenges — checking that the right tools were called with the right parameters in the right sequence, that the execution traces from multi-agent distributed systems are properly logged and monitored, and that the end-to-end system behaves as expected — further underscoring the importance of AI observability.
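
For agentic workflows specifically, one simple observability check is to compare the tool calls recorded in an execution trace against the sequence the workflow is expected to follow. The trace format and tool names below are hypothetical.

def check_tool_sequence(trace, expected):
    # Compare the tools an agent actually called against the expected order.
    # trace is a list of (tool_name, params) tuples pulled from execution logs.
    called = [name for name, _ in trace]
    return {"expected": expected, "called": called, "sequence_ok": called == expected}

trace = [("search_docs", {"query": "refund policy"}), ("draft_reply", {"tone": "formal"})]
print(check_tool_sequence(trace, ["search_docs", "draft_reply"]))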

Observability is critical to all applications that are critical to the business. AI observability will be a key enabling technology for helping AI realize its full potential to transform businesses, optimize processes, reduce costs, and unlock new revenue. 

Anupam Datta is the principal research scientist and AI research team lead at Snowflake. He joined Snowflake as part of the acquisition of TruEra, where he served as co-founder, president, and chief scientist. Datta was on the faculty at Carnegie Mellon University from 2007 to 2022, most recently as a tenured professor of electrical and computer engineering and computer science.



Generative AI Insights provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld’s technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.
https://www.infoworld.com/article/3810789/not-seeing-roi-from-your-ai-observability-may-be-the-missi...
