Benchmarks show even an old Nvidia RTX 3090 is enough to serve LLMs to thousands

Friday 23 August 2024, 23:00, by TheRegister

With 100 concurrent users, the card delivered 12.88 tokens per second to each user, just slightly faster than average human reading speed
If you want to scale a large language model (LLM) to a few thousand users, you might think a beefy enterprise GPU is a hard requirement. However, at least according to Backprop, all you actually need is a four-year-old graphics card.…

Read more on TheRegister