AWS launches Flexible Training Plans for inference endpoints in SageMaker AI
Friday, November 28, 2025, 13:54, by InfoWorld
AWS has launched Flexible Training Plans (FTPs) for inference endpoints in Amazon SageMaker AI, its AI and machine learning service, to offer customers guaranteed GPU capacity for planned evaluations and production peaks.
Typically, enterprises use SageMaker AI inference endpoints, which are managed systems, to deploy trained machine learning models in the cloud and run predictions at scale on new data. For instance, a global retail enterprise can use SageMaker inference endpoints to power its personalized-recommendation engine: as millions of customers browse products across different regions, the endpoints automatically scale compute and storage to handle traffic spikes without the company having to manage servers or capacity planning.

However, the auto-scaling nature of these inference endpoints may not be enough in several situations enterprises encounter: workloads that require low latency and consistently high performance, critical testing and pre-production environments where resource availability must be guaranteed, and any scenario where a slow scale-up is unacceptable and could harm the application or business.

According to AWS, FTPs for inference workloads address this by letting enterprises reserve instance types and the required GPUs, since automatic scale-up does not guarantee instant GPU availability given high demand and limited supply. FTP support for SageMaker AI inference is available in US East (N. Virginia), US West (Oregon), and US East (Ohio), AWS said.

Reducing operational load and costs

The guarantee of GPU availability, according to analysts, solves major challenges that enterprises face in scaling AI and machine learning workloads.

"The biggest change is reliability," said Akshat Tyagi, associate practice leader at HFS Research. "Before this update, enterprises had to deploy Inference Endpoints and hope the required GPU instances were available. When GPUs were scarce, deployments failed or got delayed. Now they can reserve the exact GPU capacity weeks or months in advance. This can be huge for teams running LLMs, vision models, or batch inference jobs where downtime isn't an option."

Forrester principal analyst Charlie Dai called the new capability a "meaningful step" toward cost governance that reduces cost unpredictability for AI operationalization: "Customers can align spend with usage patterns and avoid overprovisioning, which will lower idle costs," Dai said.

Tyagi pointed out that by reserving capacity in advance, AWS customers can pay a lower committed rate than on-demand pricing, lock in pricing for a set period, avoid expensive last-minute scrambling or scaling up to costlier instance types, and plan budgets more accurately because the expenditure is fixed upfront.

The ability to reserve instances, Tyagi added, might also end the practice of enterprises running inference endpoints 24/7 out of fear of not being able to secure capacity when needed, which in itself causes more unavailability.

AWS isn't the only hyperscaler offering the option to reserve instances for inference workloads: Microsoft Azure offers reserved capacity for inference via Azure Machine Learning, and Google Cloud provides committed use discounts for Vertex AI.
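To make the workflow concrete, below is a minimal boto3 sketch of what reserving GPU capacity and then deploying an endpoint sized against it could look like. The search_training_plan_offerings, create_training_plan, create_endpoint_config, and create_endpoint calls are existing SageMaker APIs; the instance type, resource names, the "endpoint" target resource, and the way the endpoint configuration ultimately consumes the reserved capacity are assumptions for illustration, not the confirmed interface.

    # Hedged sketch (not the confirmed API surface): reserve GPU capacity with a
    # SageMaker training plan, then deploy an inference endpoint sized to match it.
    import boto3

    sm = boto3.client("sagemaker", region_name="us-east-1")  # one of the supported regions

    # 1. Look for a capacity offering matching the instance type and time window.
    #    Targeting "endpoint" is an assumption based on the announcement; the
    #    pre-existing target resources were training jobs and HyperPod clusters.
    offerings = sm.search_training_plan_offerings(
        InstanceType="ml.g5.2xlarge",   # hypothetical GPU instance choice
        InstanceCount=2,
        DurationHours=168,              # one week of guaranteed capacity
        TargetResources=["endpoint"],
    )
    offering_id = offerings["TrainingPlanOfferings"][0]["TrainingPlanOfferingId"]

    # 2. Purchase the plan so the GPUs are committed up front at the reserved rate.
    plan = sm.create_training_plan(
        TrainingPlanName="holiday-peak-inference",
        TrainingPlanOfferingId=offering_id,
    )
    print("Reserved capacity:", plan["TrainingPlanArn"])

    # 3. Deploy the endpoint as usual, sized to the reserved instances. How the
    #    endpoint configuration references the plan is not shown here; consult
    #    the AWS documentation for the released parameter.
    sm.create_endpoint_config(
        EndpointConfigName="recs-engine-reserved",
        ProductionVariants=[{
            "VariantName": "primary",
            "ModelName": "recs-engine-model",   # model assumed to exist already
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 2,
        }],
    )
    sm.create_endpoint(EndpointName="recs-engine", EndpointConfigName="recs-engine-reserved")

Because the reservation is purchased up front, the committed rate applies for the whole window whether or not the endpoint is busy, which is the trade-off behind Tyagi's point about fixed, predictable budgeting.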
https://www.infoworld.com/article/4097962/aws-launches-flexible-training-plans-for-inference-endpoin...