Are cloud ops teams too reliant on AI?

Tuesday, July 29, 2025, 11:00, by InfoWorld

The use of artificial intelligence to manage cloud operations has significantly changed how businesses design and oversee their IT systems. The consensus is that using AI for automation provides greater scalability, reliability, and consistency, as well as fewer manual mistakes and faster resolution of common issues. However, a closer look reveals that greater dependence on AI could introduce vulnerabilities in cloud setups that are often overlooked.

Based on my experience with businesses adopting cloud services and technologies, I’ve observed a consistent pattern emerging. Cloud operations experts are increasingly relying on AI-powered automation tools for their tasks and processes. Although these technologies effectively streamline operations, organizations might be delegating too much authority to machines, potentially sidelining human knowledge and skipping essential operational checks.

Issues with oversight and budgets

One of the benefits often highlighted in AI-driven cloud operations is the concept of “set and forget.” It is now common to enable AI-driven processes for tasks such as resource allocation and anomaly detection, trusting these mechanisms to manage systems smoothly without constant supervision. However, being hands-off can erode awareness and vigilance, because automated systems depend heavily on the quality of their training data and algorithms, as well as on how well they understand the environments in which they operate. If an AI misses important context during analysis, it could easily overlook issues within the system.

In some cases, I’ve seen AI monitoring systems miss signs of outages that an experienced operations expert would have noticed immediately. This often happens when anomalies differ from familiar patterns or when AI algorithms are trained on sanitized or incomplete data sets. Relying too much on AI can also breed overdependence: operators stop trusting their instincts and lose their curiosity, causing them to miss opportunities to take proactive steps or make necessary corrections.

Companies often overlook the cost of the AI tools themselves. Those that invest heavily in AI-powered monitoring and automation may face increased overhead costs later on. This includes both upfront licensing fees or subscriptions and the often vague cloud computing expenses that come from the always-on nature of these services. Some businesses realize too late that the AI intended to cut costs actually raises them, especially as automation consumes more resources and performs corrective actions without detailed human oversight.

The widening gap in skills development

The gradual erosion of skills is a risk that arises from AI and automation in the cloud and devops fields, even though these tools are often presented as the solution to skill shortages. “Leave it to the machines to handle” becomes the common attitude. However, this creates a pattern where more and more tasks are delegated to automated systems without professionals retaining the practical knowledge needed to understand, adjust, or even challenge the AI results.

A surprising number of business executives who faced recent service disruptions were caught off guard. Without practiced strategies and creative problem-solving skills, their employees found themselves stuck and unable to troubleshoot. AI technologies excel at managing known issues and routine tasks. However, when these tools encounter something unusual, it is often the human skills and insight gained through years of experience that prove crucial in avoiding a disaster.

This raises concerns that when the AI layer abstracts away certain aspects and tasks, operations professionals might lose some of their understanding of how the core infrastructure and its workloads behave. There’s a chance that skill development may slow down, and career advancement could hit a wall. Eventually, some organizations might end up creating a generation of operations engineers who merely press buttons.

AI-powered automation may also hamper adherence to regulations and security measures. Consider AI operations platforms that automatically remediate security issues to comply with requirements or policies. In the absence of scrutiny and contextual understanding, these automated corrections may unintentionally conceal underlying problems. For example, they could disrupt processes or fail to produce the audit trails needed for compliance documentation.

It is vital to recognize that the growing reliance on AI for decision-making in businesses requires a reassessment of accountability. When mistakes happen, identifying who or what is responsible becomes essential. Does the fault lie with the developer who built and trained the AI, the vendor offering the technology, or the operations team that uses its output as information? The frameworks governing AI are constantly evolving, and organizations must clearly define roles and responsibilities both internally and with their technological partners.

Humans in the loop

Enterprises must now decide: If increasing automation to tackle ongoing operations isn’t the answer, what is? Would a more deliberate approach that strikes a balance between total delegation and on-staff expertise be better?

Like almost everything in IT, there is no one-size-fits-all solution or simplistic answer. AI is a valuable tool to navigate the challenges in cloud operations, and every enterprise will require a somewhat different toolkit. Here are several key measures that all enterprises should consider when devising a strategy for cloud operations.

First, ensure that human oversight is involved throughout the process to prevent unchecked control when using AI for cloud operations. Experienced engineers should review the suggestions made by AI and double-check the results of automated processes to ensure everything is in order. Humans should step in when necessary to address unanticipated circumstances or nuances. This promotes a culture that values both efficiency and accountability.
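One way to picture this kind of review step is a small approval gate. The sketch below is illustrative only: the `ProposedAction` type, the "low"/"high" risk labels, and the `approve` and `run` callbacks are hypothetical names, not part of any particular AI operations product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str  # what the AI tool wants to do
    risk: str         # "low" or "high"; assumed to come from a team-defined policy


def execute_with_oversight(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    run: Callable[[ProposedAction], None],
) -> str:
    """Apply low-risk actions automatically; send everything else to a human."""
    if action.risk == "low":
        run(action)
        return "auto-applied"
    # Anything not explicitly labeled low risk requires an engineer's sign-off.
    if approve(action):
        run(action)
        return "applied after review"
    return "rejected by reviewer"
```

In practice, `approve` might open a ticket or page an on-call engineer. The point of the sketch is only that the automated path and the reviewed path are kept separate.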

A second important aspect is to invest in skills development for professionals in the cloud operations field. People need opportunities to expand their knowledge by gaining a thorough understanding of the fundamentals of AI and machine learning and by improving their troubleshooting skills without relying on AI tools. This can include designated days without AI assistance, hands-on exercises, and simulated incident response drills where automation is intentionally disabled. Companies should cultivate a mindset that encourages growth, where automation supports and improves competence instead of replacing it.

Ensure transparency by leveraging AI for analysis, rather than relying solely on it for automated tasks. Document every automated step that AI takes, and establish a system that encourages oversight and learning from machine-generated decisions within proficient cloud operations teams. A well-rounded approach involves pairing AI automation with observability through metrics, logs, and traces, enabling human supervision for evaluation, and creating a system to capture knowledge about the machine’s decisions. Such practices are crucial for adherence to compliance requirements and security measures, as well as for assessing incidents post-resolution.
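Capturing knowledge about the machine’s decisions can start with something as simple as a structured log entry for each automated action. The following is a minimal sketch under assumed conventions: the field names and the `audit_record` helper are hypothetical, and a real deployment would write to whatever logging or compliance system the team already uses.

```python
import json
from datetime import datetime, timezone


def audit_record(action: str, reason: str, inputs: dict, actor: str = "ai-ops") -> str:
    """Build one JSON line describing an automated decision and why it was made."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,    # which tool or person took the action
        "action": action,  # what was changed
        "reason": reason,  # the explanation the tool gave
        "inputs": inputs,  # the signals the decision was based on
    }
    return json.dumps(record, sort_keys=True)


line = audit_record(
    action="scale web tier from 4 to 6 instances",
    reason="CPU above threshold for 10 minutes",
    inputs={"cpu_percent": 91},
)
print(line)
```

Entries like this give reviewers and auditors something concrete to examine after an incident, which is the purpose of the practice described above.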

Monitor expenses by establishing limits and notifications for AI resource decisions. Regularly review your use of AI and automation tools to prevent unnecessary costs. You can also modify or discontinue AI tasks that do not deliver benefits or save money.
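As a rough illustration of limits and notifications, a spending check might look like the sketch below. The `check_spend` function, the 80% alert threshold, and the three status values are assumptions chosen for the example, not a recommendation for specific numbers.

```python
def check_spend(spend_to_date: float, monthly_budget: float, alert_ratio: float = 0.8) -> str:
    """Return "ok", "alert", or "block" for an automation tool's monthly spend."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend_to_date / monthly_budget
    if ratio >= 1.0:
        return "block"  # budget exhausted: stop further automated spending
    if ratio >= alert_ratio:
        return "alert"  # approaching the limit: notify a human
    return "ok"


print(check_spend(500.0, 1000.0))
print(check_spend(900.0, 1000.0))
print(check_spend(1200.0, 1000.0))
```

A "block" result would typically pause the automation and hand the decision to a person, rather than letting the tool keep allocating resources on its own.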

To achieve success in AI-driven cloud operations, teamwork across disciplines is essential. Collaboration among developers, operations, security and compliance experts, and financial controllers should be prioritized, with these groups regularly reviewing AI tools and automation workflows together to ensure they align with business objectives. At the end of the day, even with AI, it’s all about “trust but verify.”
