Perplexity’s open-source tool to run trillion-parameter models without costly upgrades

Thursday, November 6, 2025, 13:52, by InfoWorld
Perplexity AI has released an open-source software tool that solves two expensive problems for enterprises running AI systems: lock-in to a single cloud provider and the need to buy the latest hardware to run massive models.

The tool, called TransferEngine, enables large language models to communicate across different cloud providers’ hardware at full speed. Companies can now run trillion-parameter models like DeepSeek V3 and Kimi K2 on older H100 and H200 GPU systems instead of waiting for expensive next-generation hardware, Perplexity wrote in a research paper. The company also open-sourced the tool on GitHub.

“Existing implementations are locked to specific Network Interface Controllers, hindering integration into inference engines and portability across hardware providers,” the researchers wrote in their paper.

The vendor lock-in trap

That lock-in stems from a fundamental technical incompatibility, according to the research. Cloud providers use different networking protocols for high-speed GPU communication. Nvidia’s ConnectX chips use one standard, while AWS’s Elastic Fabric Adapter (AWS EFA) uses an entirely different proprietary protocol.

Previous solutions worked on one system or the other, but not both, the paper noted. This forced companies to commit to a single provider’s ecosystem, or accept dramatically slower performance.

The problem is particularly acute with newer Mixture-of-Experts models, Perplexity found. DeepSeek V3 packs 671 billion parameters. Kimi K2 hits a full trillion. These models are too large to fit on single eight-GPU systems, according to the research.

The obvious answer would be Nvidia’s new GB200 systems, essentially one giant 72-GPU server. But those cost millions, face extreme supply shortages, and aren’t available everywhere, the researchers noted. Meanwhile, H100 and H200 systems are plentiful and relatively cheap.

The catch: running large models across multiple older systems has traditionally meant brutal performance penalties. “There are no viable cross-provider solutions for LLM inference,” the research team wrote, noting that existing libraries either lack AWS support entirely or suffer severe performance degradation on Amazon’s hardware.

TransferEngine aims to change that. “TransferEngine enables portable point-to-point communication for modern LLM architectures, avoiding vendor lock-in while complementing collective libraries for cloud-native deployments,” the researchers wrote.

How TransferEngine works

TransferEngine acts as a universal translator for GPU-to-GPU communication, according to the paper. It creates a common interface that works across different networking hardware by identifying the core functionality shared by various systems.

TransferEngine uses RDMA (Remote Direct Memory Access) technology. This allows computers to transfer data directly between graphics cards without involving the main processor—think of it as a dedicated express lane between chips.
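To make that idea concrete, here is a minimal sketch, in plain Python with no real networking, of the one-sided "write into a remote buffer" pattern that RDMA enables. The class, function names, and in-process buffer are illustrative assumptions for this article, not TransferEngine's actual API.

```python
from dataclasses import dataclass

# Illustrative stand-in for a registered memory region: with real RDMA the
# buffer lives in GPU or host memory and is pinned and registered with the NIC.
@dataclass
class MemoryRegion:
    buffer: bytearray   # the registered memory
    rkey: int           # remote key a peer must present to write here

def rdma_write(remote: MemoryRegion, rkey: int, offset: int, payload: bytes) -> None:
    """One-sided write: the initiator places bytes directly into the remote
    region. The remote CPU is not involved and posts no matching receive."""
    if rkey != remote.rkey:
        raise PermissionError("invalid remote key")
    remote.buffer[offset:offset + len(payload)] = payload

# Toy usage: the "remote" GPU exposes a 1 KiB region, and the sender writes
# a chunk of cached data straight into it.
region = MemoryRegion(buffer=bytearray(1024), rkey=0x42)
rdma_write(region, rkey=0x42, offset=0, payload=b"kv-cache block 0")
print(bytes(region.buffer[:16]))
```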

Perplexity’s implementation achieved 400 gigabits per second throughput on both Nvidia ConnectX-7 and AWS EFA, matching existing single-platform solutions. TransferEngine also supports using multiple network cards per GPU, aggregating bandwidth for even faster communication.

“We address portability by leveraging the common functionality across heterogeneous RDMA hardware,” the paper explained, noting that the approach works by creating “a reliable abstraction without ordering guarantees” over the underlying protocols.
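A hypothetical sketch of what such an interface could look like: each transfer is submitted individually and completes reliably, but the caller gets no guarantee about the order in which completions arrive, so any ordering has to be built on top, for example by counting completions. The class and method names below are assumptions made for illustration, not the library's real bindings.

```python
from abc import ABC, abstractmethod
from typing import Callable

class TransferHandle:
    """Opaque handle for one submitted transfer (illustrative)."""
    def __init__(self, transfer_id: int) -> None:
        self.transfer_id = transfer_id

class PointToPointEngine(ABC):
    """Illustrative portable interface: reliable delivery per transfer, but
    no ordering guarantees between transfers, a contract that heterogeneous
    RDMA hardware can satisfy efficiently."""

    @abstractmethod
    def submit_write(
        self,
        peer: str,                                   # destination node/GPU identifier
        remote_addr: int,                            # registered remote address
        payload: memoryview,                         # local data to push
        on_done: Callable[[TransferHandle], None],   # fires when this transfer lands
    ) -> TransferHandle:
        ...

# A caller that needs ordering (e.g. "proceed only after all blocks arrived")
# counts completions instead of relying on arrival order.
def notify_when_all_done(total: int, signal: Callable[[], None]) -> Callable[[TransferHandle], None]:
    remaining = {"n": total}
    def on_done(_: TransferHandle) -> None:
        remaining["n"] -= 1
        if remaining["n"] == 0:
            signal()
    return on_done
```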

Already live in production environments

The technology isn’t just theoretical. Perplexity has been using TransferEngine in production to power its AI search engine, according to the company.

The company deployed it across three critical systems. For disaggregated inference, TransferEngine handles the high-speed transfer of cached data between servers, allowing companies to scale their AI services dynamically. The library also powers Perplexity’s reinforcement learning system, achieving weight updates for trillion-parameter models in just 1.3 seconds, the researchers said.
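As a rough illustration of the disaggregated-inference pattern, where one server runs prefill and another runs decode, the sketch below splits a request's cached attention data into fixed-size pages and hands each page to a point-to-point transfer. The page size, function names, and stand-in send routine are simplified assumptions, not Perplexity's production code.

```python
# Illustrative only: a prefill node pushes the KV cache it has just computed
# to the decode node, page by page, so decoding can continue on another machine.
PAGE_TOKENS = 16  # assumed page granularity, not a real TransferEngine setting

def schedule_kv_cache_transfer(kv_cache: list[bytes], send_page) -> int:
    """Split per-token KV entries into pages and submit one transfer per page.
    `send_page(page_index, payload)` stands in for a point-to-point RDMA
    write to the decode server."""
    pages = [
        b"".join(kv_cache[i:i + PAGE_TOKENS])
        for i in range(0, len(kv_cache), PAGE_TOKENS)
    ]
    for idx, page in enumerate(pages):
        send_page(idx, page)
    return len(pages)

# Toy usage: 40 tokens of fake KV data become 3 pages.
fake_kv = [bytes(8) for _ in range(40)]
sent = schedule_kv_cache_transfer(fake_kv, send_page=lambda i, p: None)
print(sent)  # 3
```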

Perhaps most significantly, Perplexity implemented TransferEngine for Mixture-of-Experts routing. These models route different requests to different “experts” within the model, creating far more network traffic than traditional models. DeepSeek built its own DeepEP framework to handle this, but it only worked on Nvidia ConnectX hardware, according to the paper.
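The sketch below illustrates why this routing is so network-heavy: each token can be assigned to experts hosted on different nodes, so tokens are grouped by destination and dispatched as batched point-to-point transfers rather than one tiny message per token. The expert-to-node placement and counts here are invented for illustration.

```python
from collections import defaultdict

# Illustrative MoE dispatch: group tokens by the node hosting their selected
# expert so each destination receives one batched transfer. The layout of
# eight experts per node is an assumption, not a property of any real model.
EXPERTS_PER_NODE = 8

def group_tokens_by_node(token_expert_ids: list[int]) -> dict[int, list[int]]:
    """Map each token index to the node that hosts its selected expert."""
    per_node: dict[int, list[int]] = defaultdict(list)
    for token_idx, expert_id in enumerate(token_expert_ids):
        node = expert_id // EXPERTS_PER_NODE
        per_node[node].append(token_idx)
    return dict(per_node)

# Toy usage: six tokens routed to experts 3, 12, 7, 12, 21, 0.
print(group_tokens_by_node([3, 12, 7, 12, 21, 0]))
# {0: [0, 2, 5], 1: [1, 3], 2: [4]}
```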

TransferEngine matched DeepEP’s performance on ConnectX-7, the researchers said. More importantly, they said it achieved “state-of-the-art latency” on Nvidia hardware while creating “the first viable implementation compatible with AWS EFA.”

In testing DeepSeek V3 and Kimi K2 on AWS H200 instances, Perplexity found substantial performance gains when distributing models across multiple nodes, particularly at medium batch sizes, the sweet spot for production serving.

The open-source bet

Perplexity’s decision to open-source production infrastructure contrasts sharply with competitors like OpenAI and Anthropic, which keep their technical implementations proprietary.

The company released the complete library, including code, Python bindings, and benchmarking tools, under an open license.

The move mirrors Meta’s strategy with PyTorch — open-source a critical tool, help establish an industry standard, and benefit from community contributions. Perplexity said it’s continuing to optimize the technology for AWS, following updates to Amazon’s networking libraries to further reduce latency.
https://www.infoworld.com/article/4085830/perplexitys-open-source-tool-to-run-trillion-parameter-mod...
