How to automate the testing of AI agents
Tuesday, November 18, 2025, 10:00, by InfoWorld
Testing APIs and applications was challenging in the early devops days. As teams sought to advance their CI/CD pipelines and support continuous deployment, test automation platforms gained popularity, and many devops organizations developed continuous testing strategies.

While shifting left on QA, implementing security testing, and increasing observability are now devsecops non-negotiables for many organizations, these practices are not equally deployed across all applications. Legacy applications lag because of underlying technical debt, and test automation is still an emerging capability for teams developing AI agents.

Developing an LLM testing strategy is challenging because the model’s inputs are open-ended and its responses are non-deterministic. AI agents couple language models with the ability to take actions, both with a human in the middle and fully automated, so testing decision accuracy, performance, and security is vital for building trust and growing employee adoption. As more companies evaluate AI agent development tools and weigh the risks of deploying agents quickly, more devops teams will need to work out how to automate the testing of AI agents. IT and security leaders will expect testing plans that determine release readiness and guard against deploying rogue AI agents.

Developing end-to-end testing strategies

Experts view testing AI agents as a strategic risk management function that encompasses architecture, development, offline testing, and observability for online production agents. The approach enables continuous improvement as AI models evolve and the agent responds to a wider range of human and agent-to-agent inputs in production.

“Testing agentic AI is no longer QA, it is enterprise risk management, and leaders are building digital twins to stress test agents against messy realities: bad data, adversarial inputs, and edge cases,” says Srikumar Ramanathan, chief solutions officer at Mphasis. “Validation must be layered, encompassing accuracy and compliance checks, bias and ethics audits, and drift detection using golden datasets.”

One best practice is to model AI agents’ roles, workflows, and the user goals they are intended to achieve. Developing end-user personas and evaluating whether AI agents meet their objectives can inform the testing of human-AI collaborative workflows and decision-making scenarios.

“AI agents are stochastic systems, and traditional testing methods based on well-defined test plans and tools that verify fixed outputs are not effective,” says Nirmal Mukhi, VP and head of engineering at ASAPP. “Realistic simulation involves modeling various customer profiles, each with a distinct personality, knowledge they may possess, and a set of goals around what they actually want to achieve during the conversation with the agent. Evaluation at scale involves then examining thousands of such simulated conversations to evaluate them based on desired behavior, policies, and checking if the customer’s goals were achieved.”
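Mukhi’s simulate-then-evaluate loop can be expressed as a small harness. The sketch below is a minimal illustration, assuming the agent under test and the simulated customer are both exposed as plain callables; names such as Persona, simulate, and evaluate are placeholders rather than any particular framework’s API.

```python
# Minimal sketch of persona-driven simulation and evaluation at scale.
# Assumes the agent under test and the simulated customer are plain callables.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Persona:
    name: str
    personality: str   # e.g., "impatient", "chatty"
    knowledge: str     # what the simulated customer already knows
    goal: str          # what they want to achieve in the conversation

@dataclass
class Transcript:
    persona: Persona
    turns: List[str] = field(default_factory=list)

Reply = Callable[[Persona, List[str]], str]   # (persona, history) -> next message

def simulate(persona: Persona, customer_reply: Reply, agent_reply: Reply,
             max_turns: int = 10) -> Transcript:
    """Run one simulated conversation between a persona and the agent."""
    t = Transcript(persona)
    for _ in range(max_turns):
        t.turns.append("customer: " + customer_reply(persona, t.turns))
        t.turns.append("agent: " + agent_reply(persona, t.turns))
        if "goal achieved" in t.turns[-1].lower():   # toy stop condition
            break
    return t

def evaluate(transcripts: List[Transcript],
             goal_met: Callable[[Transcript], bool]) -> float:
    """Score a batch of transcripts; returns the goal-achievement rate."""
    hits = sum(1 for t in transcripts if goal_met(t))
    return hits / max(len(transcripts), 1)
```

In practice, customer_reply would be an LLM prompted with the persona, agent_reply would call the deployed agent, and goal_met could be a rules check or an LLM judge applied across thousands of transcripts.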
Ramanathan of Mphasis adds, “The real differentiator is resilience, testing how agents fail, escalate, or recover. Winners will not chase perfection at launch; they will build trust as a living system through sandboxing, monitoring, and continuous adaptation.”

Testing AI agents requires shifts in QA strategy

Testing tools and methodologies are built on the simple premise that a test case is deterministic and is developed with criteria that either pass or fail. QA engineers will need to consider broader criteria, such as whether an AI agent’s actions are appropriate and whether it provides similar responses to comparable inputs.

“The biggest misconception about testing AI agents is treating them like traditional applications with predictable outputs,” says Esko Hannula, SVP of robotics at Copado. “AI agents learn and adapt continuously, meaning your testing strategy must evolve from validating exact responses to ensuring response appropriateness and business alignment.”
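In a test suite, that shift means replacing exact-match assertions with property checks and judged scores. The pytest-style sketch below is illustrative only: run_agent and llm_judge are assumed wiring points into the agent under test and an LLM-as-judge call, and difflib is a crude stand-in for a semantic-similarity measure.

```python
# Contextual checks instead of exact-output assertions (pytest style).
import difflib

def run_agent(prompt: str) -> str:
    """Hypothetical wiring point: call the agent under test here."""
    raise NotImplementedError

def llm_judge(rubric: str, response: str) -> float:
    """Hypothetical wiring point: return a 0-1 score from an LLM judge."""
    raise NotImplementedError

def similar(a: str, b: str) -> float:
    # Crude lexical similarity; swap in embeddings for semantic comparison.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def test_refund_request_gets_appropriate_response():
    response = run_agent("I was double charged. Can I get a refund?")
    assert "refund" in response.lower()                # stays on topic
    assert "social security" not in response.lower()   # respects a policy boundary
    assert llm_judge("polite, on-topic, offers a next step", response) >= 0.8

def test_comparable_inputs_get_comparable_answers():
    a = run_agent("How do I reset my password?")
    b = run_agent("I forgot my password. What should I do?")
    assert similar(a, b) >= 0.6   # threshold tuned per use case
```

Because nothing in these checks depends on a fixed golden string, the same suite can run in development, in test environments, and on sampled production traffic.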
Traditional API and application testing often relies on test automation platforms that run in development and test environments, while separate monitoring tools catch production errors. A second change for testing AI agents, then, is that test scenarios must be automated to run in development, in test environments, and continuously in production. Organizations will want to release new versions of AI agents frequently as LLM platforms ship new versions and the agents are improved based on end-user feedback.

“Agentic systems are non-deterministic and can’t be trusted with traditional QA alone; enterprises need tools that trace reasoning, evaluate judgment, test resilience, and ensure adaptability over time,” says Nikolaos Vasiloglou, VP of research ML at RelationalAI. “Agents may quickly be replaced by newer LLMs, so leaders must continually benchmark their custom solutions against frontier models and avoid the sunk-cost fallacy.”

Validating the response accuracy of AI agents

How should QA engineers validate an AI agent’s response when inputs and outputs are not deterministic? Jerry Ting, head of agentic AI at Workday, shares two recommendations for testing AI agents across the enterprise: “Use AI to create synthetic training data that simulates the messiness of real-life prompts and embedded data. Then, have the same prompt go into different large language models, and orchestrate a tournament of prompts and responses with AI as judges.”
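Ting’s tournament can be pictured as a round-robin over candidate models, with an AI judge picking pairwise winners. The sketch below assumes each model and the judge are plain callables; models and judge are placeholders, not a specific vendor API.

```python
# Rough sketch of a prompt tournament: one prompt, several models, an AI judge.
from collections import Counter
from itertools import combinations
from typing import Callable, Dict

ModelFn = Callable[[str], str]             # prompt -> response
JudgeFn = Callable[[str, str, str], str]   # (prompt, resp_a, resp_b) -> "a" or "b"

def run_tournament(prompt: str, models: Dict[str, ModelFn], judge: JudgeFn) -> Counter:
    """Return pairwise win counts per model for a single prompt."""
    responses = {name: fn(prompt) for name, fn in models.items()}
    wins: Counter = Counter({name: 0 for name in models})
    for a, b in combinations(responses, 2):
        verdict = judge(prompt, responses[a], responses[b])
        wins[a if verdict == "a" else b] += 1
    return wins   # e.g., Counter({"model-x": 2, "model-y": 1, "model-z": 0})
```

Judge models carry their own biases, so human reviewers should still spot-check a sample of the verdicts.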
Part of the implementation strategy will require integrating feedback from production back into development and test environments. Although testing AI agents should be automated, QA engineers will need to develop workflows that include reviews from subject matter experts and feedback from other end users.

“Hierarchical scenario-based testing, sandboxed environments, and integrated regression suites—built with cross-team collaboration—form the core approach for test strategy,” says Chris Li, SVP of products at Xactly. “To ensure valid and accurate responses, sandboxed replays, automated and human reviews, and rich audit trails are reliable approaches for full workflow validation. As agentic AI systems scale in complexity, balancing precision, safety, fairness, and performance becomes essential to delivering robust, trustworthy agents capable of operating successfully in real-world environments.”

QA engineers will also need to automate the calculation of an accuracy metric that can be compared across deployments and AI model upgrades. Without these metrics, it will be challenging to know whether a deployment improved the AI agent’s decision-making and recommendation capabilities.

“The real challenge isn’t whether the agent gives the ‘right’ answer—it’s whether it consistently makes decisions that advance your business objectives while respecting security boundaries,” adds Hannula of Copado. “This requires a fundamental shift from deterministic testing to contextual validation that accounts for the agent’s learning trajectory.”

Ensuring AI agents take the correct actions

More organizations will look to automate workflows with AI agents. Testing must consider that an agent may have several plausible actions, and its responses should help justify its recommendations.

“Testing must validate not only the agent’s thinking (its responses), but also its actions (the operations it executes),” says Zhijie Chen, co-founder and CEO of Verdent. “For high-risk, complex, or ambiguous decision-making scenarios, fully automated testing may not guarantee safety and reliability; in such cases, a human-in-the-loop approach provides a strategic layer of assurance.”

How should QA develop and automate testing around the AI agent’s actions? Consider how challenging it can be to evaluate people’s decisions, especially in real time, before actions are taken. We look for non-verbal cues and seek opinions from outside experts. Testing an AI agent’s recommended and automated actions may require a similar approach.

Mike Finley, co-founder of StellarIQ, says, “One key way to automate testing of agentic AI is to use verifiers, which are AI supervisor agents whose job is to watch the work of others and ensure that they fall in line. Beyond accuracy, they’re also looking for subtle things like tone and other cues. If we want these agents to do human work, we have to watch them like we would human workers.”
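The verifier pattern Finley describes can be approximated with a supervisor that screens every proposed action before it executes and escalates anything it rejects. This is a minimal sketch; the ProposedAction shape and the refund-limit rule are invented for illustration, and a production verifier might also call an LLM to review tone and the agent’s stated rationale.

```python
# Minimal sketch of a verifier gating an agent's proposed actions.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ProposedAction:
    name: str          # e.g., "issue_refund"
    params: dict       # e.g., {"amount": 40.0, "currency": "USD"}
    rationale: str     # the agent's stated justification

# A verifier returns a rejection reason, or None if the action looks fine.
Verifier = Callable[[ProposedAction], Optional[str]]

def within_refund_limit(action: ProposedAction) -> Optional[str]:
    if action.name == "issue_refund" and action.params.get("amount", 0) > 100:
        return "refund exceeds the automatic approval limit"
    return None

def supervise(action: ProposedAction,
              verifiers: List[Verifier],
              execute: Callable[[ProposedAction], None],
              escalate: Callable[[ProposedAction, str], None]) -> None:
    """Run every verifier; escalate to a human on the first rejection."""
    for verify in verifiers:
        reason = verify(action)
        if reason is not None:
            escalate(action, reason)   # human review instead of execution
            return
    execute(action)
```

In a test suite, the same verifiers can be replayed against logged production actions to confirm that nothing the agent executed would have been rejected.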
Establishing release-readiness practices

AI agents combine the complexities of applications, automations, and AI models when teams evaluate their security and operational readiness for production. Many organizations are just getting started with AI agents, and fewer than 5% have reached production. Experts weighed in on security and operational recommendations.

Rishi Bhargava, co-founder of Descope, recommends pressure-testing AI agents against all of the OWASP top 10 for LLM applications during automated testing. His recommendations include:

- Test that AI agent connections with third-party tools and enterprise systems follow recommended implementations of standard protocols such as MCP and OAuth.
- Test AI agent permissions and ensure the agent’s permissions are a subset of the bound user’s permissions at all times; a minimal check for this rule is sketched after this list.
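The permission rule in the second item can be automated as a subset assertion, assuming permissions are represented as scope strings. How scopes are fetched for an agent and its bound user is deployment-specific, so the values below are illustrative.

```python
# Minimal check that an agent's permissions never exceed the bound user's.
from typing import Set

def agent_within_user_scope(agent_scopes: Set[str], user_scopes: Set[str]) -> bool:
    """True only if every scope granted to the agent is also held by the user."""
    return agent_scopes <= user_scopes

def test_agent_permissions_are_subset_of_user():
    user_scopes = {"tickets:read", "tickets:write", "kb:read"}
    agent_scopes = {"tickets:read", "kb:read"}
    assert agent_within_user_scope(agent_scopes, user_scopes)
    # An agent holding any extra scope must fail the check.
    assert not agent_within_user_scope(agent_scopes | {"billing:write"}, user_scopes)
```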
Andrew Filev, CEO and founder of Zencoder, says security considerations for AI agents extend beyond those of deterministic systems such as applications and automations: “You’ve got prompt injection attacks, model manipulation, context poisoning, adversarial inputs, and data extraction attempts—basically a whole new category of vulnerabilities that most security teams may not yet have on their radar.”

Filev also notes that performance testing is tricky because devops teams must do more than evaluate response times. They should consider the following questions:

- Can the agent maintain quality and consistency when it’s being hammered with requests?
- Does the underlying AI model start hallucinating under load?
- How should teams architect performance testing without overspending on API costs?

A good place to start is with practices that should already be in place for validating the release readiness of applications and AI models. Ian Beaver, chief data scientist at Verint, shared the following recommendations:

- Collect detailed AI agent audit logs and record every interaction and action, allowing for inspection and correction if needed.
- Follow a policy of least privilege for all APIs and MCP tools to minimize risk.
- Evaluate LLMs for bias and reliability.

“Comprehensive logging, robust monitoring, and user feedback interfaces are even more critical for agents,” adds Beaver.

Automating testing for agentic AI

As challenging as it is to automate AI agent testing, devops teams must also consider how to future-proof their implementations for agentic AI, an objective that’s not fully realized in most organizations. Devops teams will need testing frameworks that can run validations in production and support agent-to-agent communication.

“Perhaps more important than automating the testing of agents is how to best organize and orchestrate agent interactions in a way that both minimizes the possibility of error while also enabling them to recover gracefully in the event of one,” says Sohrob Kazerounian, distinguished AI researcher at Vectra AI. “By decomposing problems into a set of well-specified tasks, you can not only design agents that are more likely to succeed, but importantly, you can create agents that can evaluate and error-correct at each step of the way.”

Probably the most important lesson here is one that most devops teams have learned from developing applications and automations: There’s no free lunch when it comes to testing, and less than half the work is done when the functional code is completed. For teams looking to deploy AI agents, it is essential to consider test-driven approaches to evaluate quality and establish release readiness.

https://www.infoworld.com/article/4086884/how-to-automate-the-testing-of-ai-agents.html








