Snowflake open sources SwiftKV to reduce inference workload costs
Thursday, January 16, 2025, 17:00, by InfoWorld
Cloud-based data warehouse company Snowflake has open-sourced SwiftKV, a formerly proprietary approach designed to reduce the cost of inference workloads for enterprises running generative AI applications. SwiftKV was launched in December.
The development of SwiftKV is significant because inference costs for generative AI applications remain high and act as a deterrent for enterprises looking either to scale these applications or to bring generative AI to new use cases, the company explained.

SwiftKV goes beyond KV cache compression

According to Snowflake's AI research team, SwiftKV tries to go beyond key-value (KV) cache compression, an approach used in large language models (LLMs) to reduce the memory required for storing the key-value pairs generated during inference. The memory reduction is achieved by compressing previously computed data via methods such as pruning, quantization, and adaptive compression, which lets optimized LLMs handle longer contexts and generate output faster with a smaller memory footprint.

However, Snowflake claims that KV cache compression alone may not be enough to "meaningfully" curtail the cost of inference workloads, because most workloads consume more input tokens than output tokens, and the cost of processing input tokens is unaffected by KV cache compression.

In contrast, SwiftKV reduces inference computation during prompt processing (input tokens) by combining techniques such as model rewiring and knowledge-preserving self-distillation. SwiftKV reuses the hidden states of earlier transformer layers to generate the KV cache for later layers, the company's AI research team explained, which eliminates redundant computation in the prefill stage and reduces computational overhead by at least 50%. To maintain the accuracy of the LLM, SwiftKV uses self-distillation to ensure that the rewired, optimized model replicates the behavior of the original model, the research team added.
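The prefill saving described above can be sketched with a toy FLOP count. The sketch below is illustrative only, with made-up Llama-style dimensions, and is not Snowflake's actual implementation: it simply assumes that once the model is rewired at some layer, later layers compute only their K and V projections (from the earlier layer's hidden state) during prompt processing, skipping the rest of their per-layer compute.

```python
# Toy FLOP accounting for SwiftKV-style prefill savings.
# Hypothetical dimensions loosely shaped like a Llama-class decoder;
# purely illustrative, not Snowflake's implementation.

def prefill_flops(n_layers, d_model, d_ff, n_tokens, swiftkv_from=None):
    """Rough matmul FLOPs for processing n_tokens of prompt.

    Per layer, per token (attention-score math ignored for simplicity):
      - Q, K, V, O projections: 4 matmuls of cost 2 * d_model^2
      - MLP (up, gate, down):   3 matmuls of cost 2 * d_model * d_ff

    With swiftkv_from=L, layers >= L compute only the K and V
    projections (fed from layer L's hidden state), skipping Q, O,
    and the MLP during prefill.
    """
    proj = 2 * d_model * d_model               # one projection, per token
    full_layer = 4 * proj + 3 * 2 * d_model * d_ff
    kv_only = 2 * proj                         # just K and V

    total = 0
    for layer in range(n_layers):
        if swiftkv_from is not None and layer >= swiftkv_from:
            total += kv_only * n_tokens
        else:
            total += full_layer * n_tokens
    return total

baseline = prefill_flops(32, 4096, 11008, 1000)
swiftkv = prefill_flops(32, 4096, 11008, 1000, swiftkv_from=16)
reduction = 1 - swiftkv / baseline
print(f"prefill compute reduced by {reduction:.0%}")
```

With the rewiring point at the midpoint, this toy model lands in the same ballpark as the roughly 50% figure quoted above; the exact number depends on how much per-layer compute beyond the K/V projections is actually skipped.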
SwiftKV concept is not new

Analysts view SwiftKV as another clever means of optimizing model inference costs, in line with similar efforts such as prompt caching, flash attention, model pruning, and quantization.

"This idea is not new and Snowflake is certainly not the first to illustrate its value, of course. SAP, for example, introduced this idea with its model plug-in, Finch, earlier in 2024," said Bradley Shimmin, chief analyst at Omdia.

However, despite Snowflake's claims of minimal accuracy loss in SwiftKV-optimized LLMs, Shimmin warned that there could be tradeoffs in how complex such methods are to implement, how much they degrade capability, and how compatible they are with the underlying inference architecture.

"Methods like quantization are super-popular because they do not impose that many tradeoffs. So, if customers find this technique from Snowflake to be of similar value, I imagine they will use it, perhaps even alongside other techniques as required by whatever project they have at hand," Shimmin explained.

How can enterprises use SwiftKV?

Enterprises can access SwiftKV through Snowflake, or deploy it themselves using the model checkpoints published on Hugging Face or the optimized inference code for vLLM. A model checkpoint on Hugging Face is a saved set of a model's weights, while vLLM is a library for LLM inference and serving.

Snowflake customers in particular can take advantage of SwiftKV by accessing the SwiftKV-optimized models, currently Llama 3.3 70B and Llama 3.1 405B, from inside Cortex AI. Snowflake open-sourced the model weights and vLLM code in December, but until now it had not released the SwiftKV-optimized models in Cortex AI or the training code used to develop SwiftKV.
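For context on why Shimmin calls quantization "super-popular," the sketch below shows the kind of symmetric int8 quantization commonly applied to weights or KV-cache entries: 4x smaller storage than float32, with round-trip error bounded by half the quantization step. This is a generic illustration with made-up values, not tied to SwiftKV or any particular library.

```python
# Minimal symmetric int8 quantization sketch, stdlib only.
# Generic illustration of the technique; not Snowflake's method.

def quantize_int8(values):
    """Map floats to int8 codes using a single per-tensor scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid scale=0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from int8 codes."""
    return [x * scale for x in q]

kv_slice = [0.03, -1.7, 0.55, 2.4, -0.002]   # made-up cache values
q, scale = quantize_int8(kv_slice)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(kv_slice, restored))
print(f"4x smaller storage, max round-trip error {max_err:.4f}")
```

The tradeoff Shimmin alludes to is visible here: the coarser the scale (i.e., the wider the value range crammed into 256 levels), the larger the round-trip error, which is why quantization degrades capability only mildly for well-behaved value distributions.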
The company is now also open-sourcing ArcticTraining, a training library that lets engineers build their own SwiftKV models.
https://www.infoworld.com/article/3804018/snowflake-open-sources-swiftkv-to-reduce-inference-workloa...