Why AI fails at business context, and what to do about it
Monday, August 18, 2025, 11:00, by InfoWorld
Here’s the awkward truth about today’s “smart” AI: It’s great at syntax, mediocre at semantics, and really bad at business context. That last bit matters because most enterprise value hides in the seams—how your organization defines “active customer,” which discount codes apply on Tuesdays, which SKU names were changed after the acquisition, and why revenue means something different to the finance department than to the sales team.
Models can ace academic tests and even crank out reasonable SQL. Drop them behind the firewall inside a real company, however, and they stumble. Badly. Tom Tunguz highlights a sharp example: The Spider 2.0 benchmarks test how well models translate natural language into SQL across realistic enterprise databases. Models peak around 59% exact-match accuracy and fall to roughly 40% when transformation and code-generation complexity is added. These aren’t toy data sets; they reflect the messy, sprawling schemas that real enterprises run in production. In other words, the closer we get to real business context, the more the AI struggles.

If you build enterprise software, this shouldn’t surprise you. As I’ve noted, developers’ primary issue with AI isn’t whether it can spit out code—it’s whether they can trust it, consistently, on their data and their rules. That’s the “almost-right” tax: You spend time debugging and fact-checking what the model produced because it doesn’t quite understand your specifics.

Why business context is hard for AI

Large models are mostly pattern engines trained on public text. Your business logic—how you calculate churn, the way your sales territories work, the subtle differences between two nearly identical product lines—isn’t on the public web. That information lives in Jira tickets, PowerPoints, institutional knowledge, and databases whose schemas are artifacts of past decisions (and the key to enterprise AI’s memory). Even the data model fights you: tables with a thousand columns, renamed fields, leaky dimensions, and terminology that drifts with each reorg.

Spider 2.0 measures that reality, which is why scores drop as tasks get closer to actual workflows: multi-step queries, joins across unfamiliar schemas, dialect differences, and transformations in DBT. Meanwhile, the enterprise is moving toward agentic models that can browse, run code, or query databases, which only magnifies the risk when the model’s understanding is off. Put differently: Business context isn’t just data; it’s policy plus process plus history. AI gets the shape of the problem but not the lived reality.

Can we fix this?

The good news is we don’t need a philosophical breakthrough in understanding. We just need better engineering around the model’s memory, grounding, governance, and feedback. I’ve made the case that AI doesn’t need more parameters as much as it needs more memory: structured ways to keep track of what happened before and to retrieve the domain data and definitions that matter. Do that well and you narrow the trust gap.

Is the problem fully solvable? In bounded domains, yes. You can make an AI assistant that’s reliable on your finance metrics, your customer tables, your DBT models, and your security policies. But business context is a moving target, and humans will keep changing the rules. That means you’ll always want humans (including developers, of course) in the loop to clarify intent, adjudicate edge cases, and evolve the system to keep up with the business. The goal isn’t to eliminate people; it’s to turn them into context engineers who teach systems how the business actually works. Here’s how to get there.

First, if you want reliable answers about your business, the model has to see your business. That starts with retrieval-augmented generation (RAG) that feeds the model the right slices of data and metadata—DDL, schema diagrams, DBT models, even a few representative row samples—before it answers. For text-to-SQL specifically, include table/column descriptions, lineage notes, and known join keys. Retrieval should pull from governed sources (catalogs, metric stores, lineage graphs), not just a vector soup of PDFs. Spider 2.0’s results make a simple point: When models face unfamiliar schemas, they guess. So reduce the unfamiliarity, as in the sketch below.
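Here’s a minimal sketch of what that grounding can look like for text-to-SQL. Every name in it is illustrative: search_catalog stands in for whatever retriever you run over your governed catalog, and TableDoc is just a convenient shape for the metadata.

```python
# Grounding sketch for text-to-SQL (all names are hypothetical).
# The point: retrieve governed schema context first, then prompt the model.
from dataclasses import dataclass

@dataclass
class TableDoc:
    name: str
    ddl: str              # CREATE TABLE statement
    description: str      # curated table/column notes
    join_keys: list[str]  # known-good join columns
    sample_rows: str      # a few representative rows, serialized

def search_catalog(question: str, k: int = 3) -> list[TableDoc]:
    """Stand-in for a retriever over a governed catalog
    (search over DDL, descriptions, lineage notes, metric stores)."""
    raise NotImplementedError  # plug in your own retrieval

def build_sql_prompt(question: str) -> str:
    """Assemble schema context before the model ever sees the question."""
    docs = search_catalog(question)
    context = "\n\n".join(
        f"-- {d.name}: {d.description}\n"
        f"{d.ddl}\n"
        f"-- join keys: {', '.join(d.join_keys)}\n"
        f"-- sample rows:\n{d.sample_rows}"
        for d in docs
    )
    return (
        "You write SQL against the schemas below. "
        "Use only the tables and join keys shown.\n\n"
        f"{context}\n\nQuestion: {question}\nSQL:"
    )
```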
Second, most AI apps are amnesiacs. They start fresh with each request, unaware of what came before. You need to add layered memory: working, long-term, and episodic. The heart of this memory is the database. Databases, especially ones that can store embeddings, metadata, and event logs, are becoming critical to AI’s “mind.” Memory elevates the model from pattern-matching to context-carrying.

Third, free-form text invites ambiguity; structured interfaces reduce it. For text-to-SQL, consider emitting an abstract syntax tree (AST) or a restricted SQL dialect that your execution layer validates and expands. Snap queries to known dimensions and measures in your semantic layer. Use function/tool calling—not just prose—so the model asks for get_metric('active_users', date_range='Q2') rather than guessing table names. The more you treat the model like a planner using reliable building blocks, the less it hallucinates.

Fourth, humans shouldn’t spend all day correcting commas in SQL. Build an approval flow that focuses attention where ambiguity is highest. For example, highlight risky joins, show previews with row-level diffs against known-good queries, and capture structured feedback (“status_code in (3,5) should be excluded from active customers”) that gets pushed back into memory and retrieval. Over time, your system becomes a better context learner because your experts are training it implicitly as they do their jobs.

Fifth, measure what matters. Benchmarks are useful, but your KPI should be “helped the finance team close the quarter accurately,” not “passed Spider 2.0 at 70%.” Hence, you need to build task-specific assessments. Can the system produce the three canonical revenue queries? Does it respect access controls 100% of the time? Run these evaluations nightly. Spider 2.0 also shows that the more realistic the workflow (think Spider2-V’s multi-step, GUI-spanning tasks), the more room there is to fail. Your evaluations should match that realism. The sketches below make steps two through five concrete.
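For the layered memory in step two, a minimal sketch might look like this, assuming SQLite as the backing store and a placeholder embed() function; both are stand-ins for whatever database and embedding model you actually run.

```python
# Layered-memory sketch: working memory lives in process; long-term facts
# and episodic events live in a database. SQLite and embed() are stand-ins.
import json
import sqlite3
import time

db = sqlite3.connect("memory.db")
db.execute("CREATE TABLE IF NOT EXISTS facts (key TEXT PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS episodes (ts REAL, event TEXT, embedding TEXT)")

working_memory: list[str] = []  # current conversation/task state

def embed(text: str) -> list[float]:
    raise NotImplementedError  # plug in your embedding model

def remember_fact(key: str, value: str) -> None:
    """Long-term memory: durable definitions ('active customer', etc.)."""
    db.execute("INSERT OR REPLACE INTO facts VALUES (?, ?)", (key, value))
    db.commit()

def log_episode(event: str) -> None:
    """Episodic memory: what happened and what was decided, stored with
    an embedding so it can be retrieved semantically later."""
    db.execute("INSERT INTO episodes VALUES (?, ?, ?)",
               (time.time(), event, json.dumps(embed(event))))
    db.commit()
```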
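The structured interface in step three can start as small as a metric registry that validates arguments and expands them to vetted SQL; the metric and its SQL below are invented for illustration, but the shape is the point: the model plans, the semantic layer executes.

```python
# Tool-calling sketch: the model emits get_metric(...) calls; this layer
# validates them against known metrics and dimensions, then expands to
# vetted, parameterized SQL. The metric and SQL are invented examples.
METRICS = {
    "active_users": {
        "sql": ("SELECT count(DISTINCT user_id) FROM events "
                "WHERE ts BETWEEN :start AND :end"),
    },
}
DATE_RANGES = {"Q1", "Q2", "Q3", "Q4"}  # snap to known dimension values

def get_metric(name: str, date_range: str) -> str:
    if name not in METRICS:
        raise ValueError(f"unknown metric: {name}")  # no guessed table names
    if date_range not in DATE_RANGES:
        raise ValueError(f"unknown date_range: {date_range}")
    # The execution layer binds :start/:end for the quarter; the model
    # never writes raw table or column names itself.
    return METRICS[name]["sql"]

# e.g., the model calls get_metric("active_users", date_range="Q2")
```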
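For step four, the approval flow can begin with something as crude as a risk heuristic plus a structured feedback store; both below are illustrative stand-ins for a real review queue and memory layer.

```python
# Review-and-feedback sketch: route only ambiguous queries to a human,
# and push structured corrections back into retrievable memory.
RISKY_PATTERNS = ("CROSS JOIN", "FULL OUTER JOIN", "NATURAL JOIN")

def needs_review(sql: str) -> bool:
    """Crude ambiguity signal: joins that often indicate a guessed schema."""
    return any(p in sql.upper() for p in RISKY_PATTERNS)

feedback_store: dict[str, str] = {}  # stand-in for long-term memory

def record_feedback(rule_key: str, rule: str) -> None:
    """Capture an expert's correction so retrieval surfaces it next time,
    e.g. record_feedback('active_customers',
                         'exclude rows with status_code in (3, 5)')."""
    feedback_store[rule_key] = rule
```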
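And for step five, a nightly, task-specific evaluation might be sketched like this; the canonical query, the roles, and the stubs are all placeholders for your own.

```python
# Task-specific eval sketch: compare the system's answers against canonical
# queries and assert that access controls hold. All cases are placeholders.
CANONICAL = {
    "quarterly revenue": "SELECT sum(amount) FROM invoices WHERE quarter = 'Q2'",
}

def generate_sql(question: str, role: str) -> str:
    raise NotImplementedError  # your text-to-SQL system

def run_sql(sql: str) -> list[tuple]:
    raise NotImplementedError  # execute against a test warehouse

def nightly_eval() -> None:
    for question, golden_sql in CANONICAL.items():
        got = run_sql(generate_sql(question, role="analyst"))
        want = run_sql(golden_sql)
        assert got == want, f"drifted on: {question}"
    # Access control must hold 100% of the time: a restricted role should
    # never be able to touch restricted tables.
    restricted = generate_sql("show me all salaries", role="intern")
    assert "salaries" not in restricted.lower()
```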
People and machines

All this should make it clear that however sophisticated AI may get, we’re still going to need people to make it work well. That’s a feature, not a bug. The business context problem is engineering-solvable within a scope. With the right grounding, memory, constraints, evaluations, and security, you can build systems that answer enterprise questions reliably most of the time. You’ll shrink the “almost-right” tax significantly.

But context is social. It’s negotiated in quarterly business reviews and hallway conversations. New products launch, legal policies change, someone tweaks a definition, a merger redraws everything. That continual renegotiation guarantees you’ll want human judgment in the loop.

The role of developers shifts accordingly. They go from code generators to context engineers: curators of semantic layers, authors of policy as code, designers of retrieval and memory, and stewards of the feedback loops that keep AI aligned with reality. That’s also why developers remain indispensable even as AI gets better. The more we automate, the more valuable it is to have someone who understands both the machine and the business.

If you’re trying to make AI useful in your company, aim for a system that remembers, retrieves, and respects:

- Remembers what happened and what’s been decided (layered memory)
- Retrieves the right internal truth at the right moment (governed grounding)
- Respects your policies, people, and processes (authorization that travels with the task)

Do that and your AI will feel less like a clever autocomplete and more like a colleague who actually gets it. Not because the model magically developed common sense, but because you engineered the surrounding system to supply it. That’s the real story behind Spider 2.0’s sobering scores, which are not an indictment of AI but a blueprint for where to invest. If your model isn’t delivering on business context, the fix isn’t a different model so much as a different architecture—one that pairs the best of human intelligence with the best of artificial intelligence. In my experience, that partnership is not just inevitable. It’s the point.
https://www.infoworld.com/article/4040909/why-ai-fails-at-business-context-and-what-to-do-about-it.h