IBM can’t afford an unreliable cloud

Tuesday, August 19, 2025, 11:00, by InfoWorld

On August 12, 2025, IBM Cloud experienced its fourth major outage since May, resulting in a two-hour service disruption that affected 27 services globally across 10 regions. This “Severity 1” event left enterprise customers locked out of critical resources due to authentication failures, with users unable to access IBM’s cloud console, CLI, or APIs. Such recurring failures reveal systemic weaknesses in IBM’s control plane architecture, the layer responsible for handling user access, orchestration, and monitoring.

This incident followed previous outages on May 20, June 3, and June 4, and further eroded confidence in IBM’s reliability. This does not reflect well on a provider that promotes itself as a leader in hybrid cloud solutions. For industries with strict compliance requirements or businesses that depend on cloud availability for real-time operations, these disruptions raise doubts about IBM’s ability to meet their needs on an ongoing basis. These recurring incidents give enterprises a reason to consider switching to platforms with more reliable track records, such as AWS, Microsoft Azure, or Google Cloud.

For enterprises that have entrusted IBM Cloud with hybrid strategies that balance on-premises systems with public cloud integration, these events strike at the heart of IBM’s value proposition. The hybrid cloud’s supposed benefit is resilience, giving businesses flexibility in handling workloads. A fragile control plane undermines this perceived advantage, leaving IBM’s multi-billion-dollar investments in hybrid systems on shaky ground.

Opening the door for competitors

IBM has traditionally been a niche player in the cloud market, holding a 2% global market share compared to AWS (30%), Microsoft Azure (21%), and Google Cloud (11%). IBM Cloud targets a specific enterprise audience with hybrid cloud integration and enterprise-grade features.

AWS, Azure, and Google Cloud have consistently demonstrated their reliability, operational efficiency, and capacity to scale. Since the control plane is crucial for managing cloud infrastructure, the Big Three hyperscalers have diversified their architectures to avoid single points of failure. Enterprises having issues with IBM Cloud might now consider switching critical data and applications to one of these larger providers that also offer advanced tools for AI, machine learning, and automation.

These outages couldn’t come at a worse time for IBM. With healthcare, finance, manufacturing, and other industries increasingly depending on AI-driven technologies, companies are focused on cloud reliability. AI workloads require real-time data processing, continuity, and reliable scaling to work effectively. For most organizations, disruptions caused by control-plane failures could lead to catastrophic AI system failures.

What IBM can do

IBM must make major changes if it wants to recover its credibility and regain enterprise trust. Here are several critical steps I would take if I were CTO of IBM:

Adopt a resilient control-plane architecture. Duh. IBM’s reliance on centralized control-plane management has become a liability. A distributed control-plane infrastructure would allow individual regions or functions to operate independently and limit the scope of global outages.

Enhance IAM design with segmentation. Authentication failures have been at the core of the past four outages. Regionally segmented identity and access management (IAM) and distributed identity gateways must replace the globally entangled design currently in place.

Strengthen SLAs targeting control-plane uptime. Cloud customers demand operational guarantees. By implementing robust service-level agreements (SLAs) focused explicitly on control-layer reliability, IBM could reassure customers that their vital management functions will remain stable even under pressure.

Increase transparency and communication. IBM needs to be proactive with customers following outages. Offering incident reports, clear timelines for fixes, and planned updates to infrastructure can help rebuild trust, though it will take time. Silence, on the other hand, will only deepen dissatisfaction.

Accelerate stress-testing procedures. IBM must regularly perform extensive load and resilience testing to identify vulnerabilities before they impact customers. Routine testing in simulated high-pressure operating conditions should be a priority.

Develop hybrid systems with multi-control-plane options. IBM should adopt multi-control-plane designs to enable enterprises to manage workloads independently of centralized limitations. This would enable hybrid strategies to retain their resilience advantage.
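
To make the blast-radius argument behind the first two steps concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions: the `IdentityGateway` class, the `locked_out_regions` helper, and the region names are hypothetical illustrations, not IBM's actual architecture or any real API. It compares one shared identity gateway against one gateway per region when a single gateway fails.

```python
from dataclasses import dataclass


@dataclass
class IdentityGateway:
    """Hypothetical stand-in for an identity (IAM) gateway."""
    name: str
    healthy: bool = True

    def authenticate(self, user: str) -> bool:
        # Toy rule: any login succeeds while the gateway is healthy.
        return self.healthy


def locked_out_regions(gateway_for_region: dict[str, IdentityGateway]) -> list[str]:
    """Return the regions whose users cannot log in right now."""
    return [
        region
        for region, gateway in gateway_for_region.items()
        if not gateway.authenticate("example-user")
    ]


REGIONS = [f"region-{i}" for i in range(1, 11)]  # 10 placeholder regions

# Centralized design: every region depends on the same gateway object.
shared = IdentityGateway("global")
centralized = {region: shared for region in REGIONS}

# Segmented design: each region has its own independent gateway.
segmented = {region: IdentityGateway(region) for region in REGIONS}

# Simulate a single gateway failure in each design.
shared.healthy = False
segmented["region-3"].healthy = False

print(len(locked_out_regions(centralized)))  # 10
print(locked_out_regions(segmented))         # ['region-3']
```

In this simplified model, one failure in the centralized design locks out all 10 regions, while the same failure in the segmented design is contained to a single region. A real system would also have to handle cross-region identity replication and failover, which the sketch deliberately leaves out.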

Increasing enterprise resilience

For enterprises wary of any cloud provider’s reliability, there are several steps to build resilience into their operations:

Adopt a multicloud strategy. By distributing workloads across multiple cloud providers, enterprises reduce dependency on any single vendor. This ensures that even if one provider has a disruption, core business functions remain active.

Integrate disaster recovery automation. Automated failover systems and data backups across multiple regions and providers can minimize downtime when outages occur.

Demand stronger SLAs. Enterprises should negotiate contracts that prioritize uptime guarantees for control planes and include penalties for SLA violations.

Monitor and audit vendor reliability. Enterprises should actively track their cloud providers’ reliability performance metrics and plan for migration if vendors continuously fail to meet standards.
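
For the SLA and auditing steps above, a little arithmetic shows how a single outage compares with an uptime target. The following Python sketch is illustrative only: the 30-day measurement period and the 99.9% target are assumptions chosen for the example, not terms of any actual IBM contract, and the function names are hypothetical. The 120-minute figure corresponds to the two-hour disruption described earlier.

```python
MINUTES_PER_DAY = 24 * 60


def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Availability over the period, as a percentage."""
    total = period_days * MINUTES_PER_DAY
    return 100.0 * (total - downtime_minutes) / total


def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Downtime budget implied by an uptime target over the period."""
    total = period_days * MINUTES_PER_DAY
    return total * (100.0 - sla_percent) / 100.0


def meets_sla(downtime_minutes: float, sla_percent: float, period_days: int = 30) -> bool:
    """True if the observed downtime fits inside the SLA's budget."""
    return downtime_minutes <= allowed_downtime_minutes(sla_percent, period_days)


outage_minutes = 120  # one two-hour outage in the period

print(round(availability_percent(outage_minutes), 2))  # 99.72
print(round(allowed_downtime_minutes(99.9), 1))        # 43.2
print(meets_sla(outage_minutes, 99.9))                 # False
```

Under these assumptions, one two-hour outage uses nearly three times the monthly downtime budget of a 99.9% target, which is why it matters whether a contract's uptime guarantee covers the control plane explicitly.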

IBM has reached a critical juncture. In today’s competitive market, cloud reliability is the baseline expectation, not a value-added bonus. IBM’s repeated failures—particularly at the control-plane level—fundamentally undermine its positioning as a trusted enterprise cloud partner. For many customers, these outages may serve as the final justification to migrate workloads elsewhere.

To recover, IBM must focus on transforming its control-plane architecture, ensuring transparency, and reaffirming its commitment to reliability through clear, actionable changes. Meanwhile, enterprises should see this as a reminder that resilience must be built into their cloud strategies to safeguard their operations, regardless of provider.

In a world increasingly dependent on AI and automation, reliability isn’t optional—it’s essential. IBM has a lot of work ahead.
