Rewriting infrastructure as code for the AI data center
Monday, June 30, 2025, 11:00, by InfoWorld
Generative AI has officially entered the infrastructure as code (IaC) trenches. What started as a bottom-up phenomenon — developers using ChatGPT and Copilot to avoid Googling Terraform syntax or getting bogged down in endless StackExchange threads — has grown into something more complex and widespread. Today, organizations are embracing AI as a tool not just for writing configs, but for shaping infrastructure decisions, integrating with observability, and even catching deployment errors.
But the story isn’t just one of acceleration. As AI-generated IaC becomes more common as part of the AI data center, so do the pitfalls — from silent misconfigurations to public-facing APIs no one meant to expose.

Rise of the machines

Let’s start with what you can probably guess: developers have been using generative AI tools to write IaC config code for some time now. In many places, especially early on, this was a bottom-up movement driven by individual developers.

“A lot of developers I know, especially those who aren’t IaC experts, are leaning on ChatGPT or Copilot to generate Terraform or Ansible configs,” says Siri Varma Vegiraju, Security Tech Lead at Microsoft. “It’s fast, and it helps people avoid looking up every AWS resource syntax or module.”

That speed and accessibility come from the way AI has lowered the bar to writing configuration code. Ivan Novikov, CEO of Wallarm, puts it this way: “AI reduces the threshold for devs to write configs without deep knowledge. Before AI, writing production-ready Kubernetes or Terraform config was for SREs, DevOps, infra teams. Now, any backend dev can open ChatGPT and ask ‘make me a Helm chart for my API with autoscaling and ingress.’ And AI will do it.”

This democratization of IaC also means that a lot of experimentation happens without much oversight. “Many developers quietly use ChatGPT/Copilot to draft IaC templates, especially for unfamiliar cloud services,” says Fergal Glynn, chief marketing officer and AI security advocate at Mindgard. “While this speeds up tasks, unreviewed AI code risks security gaps (e.g., overly permissive rules).”

“In many companies,” says Milankumar Rana, software engineer advisor and senior cloud engineer at FedEx, “such usage began informally — engineers ‘on the sly’ asking ChatGPT how to create a resource block or fix an obscure provider error. However, we are now observing a more structured approach to adoption.”

That shift is being driven by larger organizations that see potential in AI-assisted IaC but want it embedded within guardrails. As Glynn puts it, “Larger orgs use AI-augmented platforms (e.g., Torque’s Environment-as-Code) with guardrails to prevent errors. Startups and devops teams often experiment first, while enterprises prioritize governance frameworks.”

When enterprises get on board with AI

As the use of generative AI expands in infrastructure engineering, many organizations are responding by developing internal tools to guide and govern that usage. Nimisha Mehta, Senior DevOps Engineer at Confluent, notes that “AI-forward tech organizations adopt various tools such as IDEs with AI plugins, and even invest time and money into building bespoke systems and tools to integrate LLMs with their specific environments.”

One increasingly common approach is to create internal AI “playgrounds” — sandbox environments that allow teams to test configurations without risking production infrastructure. “Sandboxes allow developers to experiment with IaC templates and validate outputs to catch errors before deployment,” says Mindgard’s Glynn. “By balancing innovations with oversight, these playgrounds can minimize risks, such as security gaps, while encouraging controlled adoption of AI infrastructure-as-code workflows.”

Sometimes, organizations are driven to develop such internal tools specifically in response to chaotic early AI-generated IaC efforts. Ori Yemini, CTO and co-founder of ControlMonkey, describes one such case: “A customer used ChatGPT to bulk-generate Terraform files for around 80 microservices. It worked, until they realized none of the generated code adhered to their tagging policies, module conventions, or team-based permissions. Their drift detection flagged hundreds of deltas against their baseline. The code ‘worked’ technically, but operationally it created chaos.”

The solution? A tailored internal tool that wrapped the LLM in organizational context. “They shifted toward a controlled approach: using an internal wrapper around the LLM with prompts that inject organization-specific context, like required tags, naming conventions, and known module repositories. This drastically reduced both drift and rework,” Yemini says.
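To make the idea of injected context concrete, here is a minimal sketch of the kind of guardrail such a wrapper might emit or enforce. It uses the AWS provider’s default_tags block so that required tags land on every resource automatically; the tag keys and values are hypothetical illustrations, not ControlMonkey’s or the customer’s actual conventions.

  # A minimal sketch, assuming AWS: enforcing organization-wide tags at the
  # provider level so individual AI-generated resources cannot omit them.
  # The tag keys and values below are hypothetical examples.
  provider "aws" {
    region = "us-east-1"

    default_tags {
      tags = {
        team        = "payments"  # hypothetical team name
        cost_center = "cc-1234"   # hypothetical cost center
        managed_by  = "terraform"
      }
    }
  }

Provider-level defaults like this are one way to turn a written tagging policy into something drift detection will no longer flag after the fact.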
Promises and pitfalls of gen AI and IaC

At its best, generative AI acts as a powerful accelerant for infrastructure work. “We’re seeing a quiet but significant shift in how engineering teams approach Infrastructure as Code,” says ControlMonkey’s Yemini. “It’s not just about writing a quick Terraform snippet anymore, it’s about accelerating infrastructure decisions in environments that are growing more complex by the day.”

FedEx’s Rana echoes this, noting that “what used to take hours of cross-referencing documentation is now often accelerated by a single well-phrased cue.” He points to common use cases like creating reusable Terraform modules, converting shell scripts into Ansible playbooks, and scaffolding TypeScript for Pulumi code.

The benefits go beyond code generation. AI is starting to integrate with observability systems to help manage infrastructure in real time. Microsoft’s Vegiraju notes, “More advanced setups are experimenting with feeding telemetry into AI systems that can suggest or even automatically apply IaC-based fixes. For example, if a service is repeatedly scaling out due to CPU exhaustion, the AI might propose a config tweak to increase CPU limits or change autoscaling thresholds.” While these are mostly proof-of-concept efforts in telemetry-heavy environments, they signal a direction where AI becomes more than just a code-writing assistant.
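As a hedged illustration of the kind of “config tweak” Vegiraju describes, the Terraform sketch below adjusts a target-tracking autoscaling threshold for an ECS service. The resource names, capacities, and the 70 percent target are hypothetical; this is what such an AI-proposed change might look like, not a quoted example from Microsoft.

  # A sketch of an autoscaling-threshold tweak, assuming ECS on AWS.
  # All names and values are hypothetical.
  resource "aws_appautoscaling_target" "svc" {
    max_capacity       = 10
    min_capacity       = 2
    resource_id        = "service/example-cluster/example-service" # hypothetical
    scalable_dimension = "ecs:service:DesiredCount"
    service_namespace  = "ecs"
  }

  resource "aws_appautoscaling_policy" "cpu_tracking" {
    name               = "cpu-target-tracking" # hypothetical policy name
    policy_type        = "TargetTrackingScaling"
    resource_id        = aws_appautoscaling_target.svc.resource_id
    scalable_dimension = aws_appautoscaling_target.svc.scalable_dimension
    service_namespace  = aws_appautoscaling_target.svc.service_namespace

    target_tracking_scaling_configuration {
      predefined_metric_specification {
        predefined_metric_type = "ECSServiceAverageCPUUtilization"
      }
      # An AI assistant watching CPU-exhaustion telemetry might propose
      # changing this value so the service scales earlier or later.
      target_value = 70
    }
  }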
Confluent’s Mehta points to similar developments on the operational side, extolling the troubleshooting prowess of agentic AI. “Say you have a network packet that flows through several layers in the networking stack,” she says. “AI is great at eliminating options to pinpoint the root cause of the issue.” She sees this as a precursor to more autonomous, self-healing systems, though she notes they’re still in early stages.

But for all its promise, AI still lacks a basic quality that human engineers rely on: context. “Although AI is great at writing IaC and YAML manifests,” Mehta says, “its biggest current shortfall is not having visibility on how distributed production-grade infrastructure is actually set up in the real world.” Agentic tools are starting to address this by integrating more directly with infrastructure, but, she notes, “they don’t scale to thousands of compute clusters.”

Wallarm’s Novikov is even more blunt: “Prompts don’t carry full context about your infra and settings. Your infra is big. You have dozens of services, secrets, RBAC rules, sidecars, CI/CD flows, policies, naming rules, and many things in Terraform state. AI doesn’t know all that. When you ask ‘write config for API X,’ it works in a vacuum.”

That vacuum can result in mistakes that are difficult to spot but potentially damaging. “AI-generated configs can be syntactically right but semantically wrong,” says Microsoft’s Vegiraju. He offers an example of Terraform config code written by AI from a simple prompt:

  resource "azurerm_storage_account" "example" {
    name                          = "examplestorageacct1"
    public_network_access_enabled = true
  }

That configuration will deploy successfully — but it also opens the storage account to the public internet. “Without strict network rules or identity-based access controls, this could lead to unauthorized access or data exfiltration,” he says. “In over 90% of real-world scenarios, public network access should be disabled.”
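For contrast, a locked-down version of the same resource might look like the sketch below. This is not from the article; it is a minimal illustration of the “strict network rules” Vegiraju mentions, assuming the azurerm provider, with a hypothetical resource-group reference.

  # A minimal sketch of the locked-down alternative: public access disabled
  # and network rules that deny by default. The reference to
  # azurerm_resource_group.example is hypothetical.
  resource "azurerm_storage_account" "example" {
    name                          = "examplestorageacct1"
    resource_group_name           = azurerm_resource_group.example.name
    location                      = azurerm_resource_group.example.location
    account_tier                  = "Standard"
    account_replication_type      = "LRS"
    public_network_access_enabled = false

    network_rules {
      default_action = "Deny"
      bypass         = ["AzureServices"]
    }
  }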
Security oversights like that are far from theoretical. “Configs often miss security best practices,” says Novikov. “No rate limits, wide network exposure (0.0.0.0/0), missing resource limits, open CORS, and no auth on internal APIs.”

In one real-world case, a fintech developer used AI to generate ingress for an internal API. “They forgot to add IP whitelisting. The API went public, got scanned in 20 minutes, and attackers found an old debug route.”
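The 0.0.0.0/0 pattern Novikov mentions is easy to picture in Terraform. The sketch below contrasts a hypothetical ingress rule of the kind an unreviewed AI draft often produces with the narrower rule a reviewer would likely insist on; the ports, CIDR ranges, and resource names are illustrative assumptions, not Wallarm’s examples.

  # What an unreviewed AI draft often emits: the API reachable from anywhere.
  resource "aws_security_group_rule" "api_ingress_open" {
    type              = "ingress"
    from_port         = 443
    to_port           = 443
    protocol          = "tcp"
    cidr_blocks       = ["0.0.0.0/0"]            # wide network exposure
    security_group_id = aws_security_group.api.id # hypothetical reference
  }

  # The reviewed alternative: ingress restricted to an internal range.
  resource "aws_security_group_rule" "api_ingress_internal" {
    type              = "ingress"
    from_port         = 443
    to_port           = 443
    protocol          = "tcp"
    cidr_blocks       = ["10.0.0.0/8"]           # hypothetical internal CIDR
    security_group_id = aws_security_group.api.id
  }

In practice only one of the two rules would exist; they are shown side by side to make the difference reviewers look for explicit.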
A cautious look ahead at AI and infrastructure

As generative AI becomes more embedded in infrastructure workflows, its role is evolving. “One pattern we’re noticing across several mid-to-large scale orgs is this: AI is being used as a ‘first draft generator,’ but increasingly also as a decision-support tool,” says ControlMonkey’s Yemini. “Engineers aren’t just asking, ‘How do I write this AWS security group?’ They’re asking, ‘What’s the cleanest way to structure this VPC for future scale?’”

He notes that these questions aren’t confined to early design stages — they come up mid-sprint, when real-world blockers hit. “From our perspective, the most successful orgs treat generative AI like an untrained junior engineer: useful for accelerating tasks, but requiring validation, structure, and access to internal standards.”

That need for human oversight was a recurring theme with everyone we spoke to. Microsoft’s Vegiraju puts it simply: “Engineers should first understand the code coming out of the LLM before using it.”

At Confluent, Mehta emphasizes the importance of safeguards: “We need guardrails built into the system to prevent accidental breaking changes, be it due to human error or due to AI-generated changes.” She points to GitOps systems and peer-reviewed version control as ways to build accountability into the workflow.

Mindgard’s Glynn sees a similar pattern emerging. “Models like WISDOM-ANSIBLE generate Ansible playbooks just by providing natural language prompts,” he says, “but AI-generated YAML/Chef files do require manual tweaks for edge cases.” Enterprises may use these tools to enforce compliance — for instance, automatically adding HIPAA settings — but they still review outputs for accuracy before deployment.

Without that diligence, the risks can compound quickly. Wallarm’s Novikov recounts a troubling trend: “One large SaaS org told us 30% of their IaC is now AI-generated. But they also see three times more config misfires in CI/CD than last year — wrong secrets, open ports, wrong S3 policies, unprotected APIs.” That company now uses tools like Checkov, tfsec, and custom Wallarm rules to catch misconfigurations after the fact. But the root cause is often speed outpacing safety. “One junior dev told us: ‘I just paste the prompt, review the YAML looks ok, and push.’ That’s where issues sneak in.”
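Scanners like Checkov and tfsec catch exactly the “wrong S3 policies” category Novikov lists. As a hedged sketch, the hypothetical bucket below is paired with the public-access guardrail such scanners commonly flag when it is missing; the bucket name is invented for illustration.

  # A hypothetical bucket plus the public-access guardrail that IaC scanners
  # such as Checkov and tfsec typically expect to see alongside it.
  resource "aws_s3_bucket" "logs" {
    bucket = "example-org-logs" # hypothetical bucket name
  }

  resource "aws_s3_bucket_public_access_block" "logs" {
    bucket                  = aws_s3_bucket.logs.id
    block_public_acls       = true
    block_public_policy     = true
    ignore_public_acls      = true
    restrict_public_buckets = true
  }

Running a scanner in CI (for example, checkov -d . over the Terraform directory) is how a team can catch AI-generated drafts that omit guardrails like this before they merge.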
The tools are getting better — yet the need for caution is still there. “AI is so powerful,” Novikov says. “But when it comes to PaaS and APIs, it’s risky if used blindly. Without proper policy checks, context awareness, and testing, AI-generated configs become new security debt.”

“You use AI for infra?” he says. “Cool. Just don’t trust it too much.”

FAQ: How AI is rewriting infrastructure as code

What is infrastructure as code (IaC)?
Infrastructure as code is the practice of managing and provisioning infrastructure through code instead of manual processes. It allows you to define resources (like servers, networks, and databases) in configuration files, enabling automation, version control, and consistency.

How are developers currently using AI with IaC?
Initially, individual developers used generative AI tools like ChatGPT and GitHub Copilot informally to quickly generate syntax for IaC tools like Terraform or Ansible, helping them avoid manual lookups and speed up basic coding tasks.

How has enterprise adoption of AI in IaC evolved?
What started as informal use has become more structured. Larger organizations are now embedding AI into IaC workflows with guardrails, creating internal tools and sandbox environments (“AI playgrounds”) to test configurations and ensure adherence to organizational policies before deployment.

What are the main benefits of using AI for IaC?
AI accelerates infrastructure work by quickly generating reusable code modules, converting scripts, and scaffolding new environments. Beyond code generation, it can integrate with observability systems to suggest or even apply automated fixes, and assist in troubleshooting complex issues.

What are the primary risks or pitfalls of AI-generated IaC?
The main risks stem from AI’s lack of full contextual understanding. This can lead to syntactically correct but semantically flawed configurations (e.g., publicly exposing resources), security oversights (e.g., missing rate limits, open ports), non-compliance with internal policies (e.g., tagging), and an increase in “config misfires.”

Why does AI-generated IaC often lack proper context?
AI models typically operate “in a vacuum,” without inherent knowledge of an organization’s specific, complex infrastructure, existing services, secrets, access control rules, naming conventions, or CI/CD flows. This missing context can lead to configurations that are technically functional but operationally problematic or insecure.

How can organizations mitigate the risks of AI in IaC?
Mitigation strategies include human oversight (engineers must understand AI-generated code), building internal wrappers around LLMs to inject organizational context, establishing “AI playgrounds” for testing, implementing strong guardrails and policy checks, and using GitOps systems and peer-reviewed version control for accountability.

Will AI replace human engineers in IaC roles?
The consensus is that AI will not replace engineers, but rather augment their capabilities. AI acts as a powerful assistant for accelerating tasks and decision support, but human engineers remain crucial for validation, providing organizational context, ensuring security, and making strategic infrastructure decisions.

https://www.infoworld.com/article/4013679/how-ai-is-rewriting-infrastructure-as-code.html