Before you build your first enterprise AI app
Monday, December 15, 2025, 10:00, by InfoWorld
It is becoming increasingly difficult to separate the signal from the noise in the world of artificial intelligence. Every day brings a new benchmark, a new “state-of-the-art” model, or a new claim that yesterday’s architecture is obsolete. For developers tasked with building their first AI application, particularly within a larger enterprise, the sheer volume of announcements creates a paralysis of choice.
The rankings won’t help. They change too often. In just the past week we got a new model from Mistral, a massive update from Google, and an open-weights contender that claims to beat GPT-4o on coding benchmarks. What are you supposed to do? Should you wait, because if you build on today’s (really yesterday’s) model you’ll be shipping legacy code before you even push to production? Or maybe you have a more fundamental concern: if you’re not yet building fully autonomous agentic systems that plan, reason, and execute complex workflows, you must already be way behind. Stop that nonsense. You’re not.

The reality of enterprise AI has almost nothing to do with the winner of this week’s “chatbot arena.” It has everything to do with the unglamorous, boring work of data engineering, governance, and integration. We are leaving AI’s phase of magical thinking and entering the phase of industrialization. The challenge isn’t picking the smartest model. It is building a system that can survive the inanity of the real world. Here are some suggestions on how to approach building that first app.

It’s a trap!

It is easy to get caught up in the “Leaderboard Illusion.” You see a model score 1% higher on a math benchmark and assume it is the only viable choice. Simon Willison calls this “vibes-based evaluation.” It is a decent proxy for which chatbot feels “smart” in a casual conversation, but it is a terrible proxy for your production workload.

We need to stop looking at AI through the lens of 1990s software wars, where one platform takes all. Model weights are becoming undifferentiated heavy lifting, the boring infrastructure that everyone needs but no one wants to manage. Whether you use Anthropic, OpenAI, or an open-weights model like Llama, you are getting a level of intelligence that is good enough for 90% of enterprise tasks. The differences are marginal for a first version. The “best” model is usually just the one you can actually access securely and reliably.

Andrew Ng, who has seen more AI cycles than almost anyone, recently offered this astute (if unremarkable) bit of advice: “Worry much more about building something valuable.” Seems obvious, but too often it isn’t. Ng argues that the application layer is where the real value sits, not the model layer. If you build a tool that solves a genuine business problem, such as automatically reconciling invoices or summarizing legal briefs, no one (including you!) will care whether the underlying model is ranked first or third on a leaderboard.

The physics of AI are fundamentally different from traditional software. In the open source world, we are used to code being the asset. In the AI world, the model is a transient commodity. The asset is your data and how you feed it to that commodity model.

Think like a database

Of course, once you’ve picked a model, the temptation is to immediately build an “agent” because, well, who doesn’t want the kudos for designing an AI agent that can browse the web, query databases, and make decisions? I suggest caution. You likely aren’t ready for agents. Not because the AI isn’t smart enough, and not even because you may not have much AI experience yet. No, the primary problem is that your data isn’t clean enough.

As I noted recently, AI memory is really a database problem. If you strip an agent of its memory, it is nothing more than a very expensive random number generator. Agents operate at machine speed with human data. If that data is messy, unstructured, or ungoverned, your agent will be confidently wrong at scale.
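To make that concrete, treating memory as a database can be as literal as a table with an owner column and an access check on every read. What follows is a minimal, hypothetical sketch in Python using SQLite; the table, columns, and matching logic are illustrative placeholders, not a prescribed design.

    import sqlite3

    # Hypothetical schema: every fact the agent remembers has an owner,
    # a source, and a timestamp, so retrieval can be scoped and audited
    # like any other transactional data.
    conn = sqlite3.connect("agent_memory.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS memory (
            id INTEGER PRIMARY KEY,
            owner_id TEXT NOT NULL,    -- who may retrieve this fact
            source TEXT NOT NULL,      -- where it came from (doc, ticket, CRM)
            content TEXT NOT NULL,     -- the fact itself
            updated_at TEXT NOT NULL   -- when it was last refreshed
        )
    """)
    conn.commit()

    def recall(user_id: str, query: str, limit: int = 5) -> list[str]:
        """Return only facts this user is authorized to see."""
        rows = conn.execute(
            "SELECT content FROM memory "
            "WHERE owner_id = ? AND content LIKE ? "
            "ORDER BY updated_at DESC LIMIT ?",
            (user_id, f"%{query}%", limit),
        ).fetchall()
        return [row[0] for row in rows]

The retrieval itself is deliberately naive (a LIKE match standing in for full-text or vector search); the point is that nothing reaches the model’s context window without passing a schema and an authorization check.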
Most enterprises are still trying to figure out where their data lives, let alone how to expose it to a large language model. We tend to treat memory in AI as a magical context window. It isn’t. It’s a database. It needs the same rigor we apply to transaction logs, including schemas, access controls, and firewalls that prevent the AI from hallucinating facts or leaking sensitive information to the wrong user. If you are designing your first AI system, start with the memory layer. Decide what the AI is allowed to know, where that knowledge lives, and how it is updated. Then, and only then, worry about the prompt. Oh, and what should you think about first? Inference.

Start with inference

We used to obsess over the massive cost of training models. But for the enterprise, that is largely irrelevant. AI is all about inference now: applying models to power applications. In other words, AI will become truly useful within the enterprise as we apply models to governed enterprise data. The best place to build up your AI muscle isn’t with some moonshot agentic system. It’s a simple retrieval-augmented generation (RAG) pipeline.

What does this mean in practice? Find a corpus of boring, messy documents, such as HR policies, technical documentation, or customer support logs, and build a system that allows a user to ask a question and get an answer based only on that data. This forces you to solve the hard problems that actually build a moat for your company. Some examples:

Data ingestion: How do you chunk and index your PDFs so the model understands them?
Governance: How do you ensure the model doesn’t answer questions the user isn’t authorized to ask?
Latency: How do you make it fast enough that people actually use it?

You may think this is boring work. But as Andrej Karpathy has pointed out, LLMs are effectively the kernel of a new operating system. You don’t interact with the kernel directly. You build user-space applications on top of it. Your job is to build that user space, which includes the UI, the logic, and the data plumbing.

Create a golden path

If you are in a platform engineering role, your instinct might be to lock this down. You want to pick one model, one API, and force every developer to use it. This is a mistake. Platform teams should not act as the “Department of No.” When you build gates, developers just route around them using their personal credit cards and unmonitored APIs.

Instead, build a “golden path.” Create a set of composable services and guardrails that make the right way to build AI apps also the easiest way. Standardize on an interface, like the OpenAI-compatible API format supported by many providers, including vLLM, so you can swap the back-end model later if the leaderboard changes. For now, pick one that is fast, compliant, and available. Then move on.

The goal is to channel developer velocity, not stifle it. Give them a safe sandbox where the data governance is baked in so they can experiment without doing serious damage.

When you build your first application, design it to keep the human in the loop. Don’t try to automate the entire process. Use the AI to generate the first draft of a report or the first pass at a SQL query, and then force a human to review and execute it. This mitigates the risk of hallucinations and ensures you are augmenting human intelligence rather than replacing it with robot drivel.

Still, if you aren’t watching the public rankings, how do you know if your model is good enough? You don’t guess. You test.
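That test does not need to be elaborate. Here is a minimal sketch in Python, assuming your examples live in a JSON file of question/answer pairs and that you call an OpenAI-compatible endpoint; the gateway URL, model names, and substring-match grading are placeholders to swap for your own.

    import json
    from openai import OpenAI

    # Point the standard client at whatever OpenAI-compatible endpoint
    # you have approved: a hosted provider, or a vLLM server you run.
    client = OpenAI(base_url="https://llm-gateway.internal.example/v1",
                    api_key="YOUR_KEY")

    def run_evals(path: str, model: str) -> float:
        """Run your question/answer pairs against one model and return
        the fraction it gets right."""
        with open(path) as f:
            cases = json.load(f)  # [{"question": "...", "expected": "..."}, ...]
        passed = 0
        for case in cases:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["question"]}],
            )
            answer = (response.choices[0].message.content or "").lower()
            # Naive grading: substring match. Replace with whatever check
            # fits your task (exact match, regex, rubric, human review).
            if case["expected"].lower() in answer:
                passed += 1
        return passed / len(cases)

    # Same file, different model names: your own leaderboard.
    print(run_evals("evals.json", "current-production-model"))
    print(run_evals("evals.json", "shiny-new-challenger"))

Run the same file against the incumbent and the challenger, compare the scores and the bill, and you have an answer grounded in your own workload.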
OpenAI and Anthropic both emphasize “eval-driven development,” but you don’t need a complex framework to start. You just need 50 to 100 real examples of what you want the model to do—specific questions with the correct answers—and a script to run them. Whenever some new model drops that promises to take the leaderboard to new heights, just run your 50 examples against it. If it solves your specific problems faster or cheaper than what you have, switch. If not, ignore it. Your own leaderboard is the only one that matters.

Be boring

In short, focus on your data. Focus on your governance. Focus on solving a boring problem for a specific user in your company who is drowning in documentation or repetitive tasks. Ignore the leaderboards. They are vanity metrics for researchers. As I have said before, the AI era will be won by whoever makes intelligence on top of governed data cheap, easy, and safe. It might not get you a viral thread on X, but it will get you an application that actually survives in the enterprise.
https://www.infoworld.com/article/4105976/before-you-build-your-first-enterprise-ai-app.html








