Advances in Open Models: Running Large Language Models with SGLang, vLLM, and TensorRT-LLM

Exploring the rapidly evolving landscape of Large Language Models (LLMs) and their potential applications, with a focus on performance benchmarking and optimization.

  • 1. The speaker recently worked on evaluating the speed of inference engines for open models.
  • 2. They have been attending the AI Engineer summit since its inception two years ago, and have noticed a shift towards more practical applications of AI technologies.
  • 3. For a long time, the conference was focused on showcasing new technologies rather than providing opportunities to work with them directly.
  • 4. However, the development of open models like Llama, Qwen, and DeepSeek has changed this landscape, making it possible to run interesting experiments and applications using these models.
  • 5. The speaker notes that there are now open-source inference engines available, such as vLLM, SGLang, and TensorRT-LLM, which make it easier to run language models.
  • 6. Running one's own models is becoming less common and would require a strong motivation, such as working for the US government or wanting to use decentralized, crypto-based models.
  • 7. The speaker predicted in 2023 that open models would catch up to proprietary ones once capability requirements saturate.
  • 8. At that time, there were only a few LLM inference libraries available, but now vLLM is a good option that has stuck around.
  • 9. The speaker's benchmarking software allows users to see how different language models perform with various context lengths and engines.
  • 10. The benchmarking results are available on Modal's website and include methodology details, open-source code, and an executive summary.
  • 11. The speaker hopes to add more information about speculative decoding, multi-token prediction, and quantization in the future.
  • 12. The benchmarking software allows users to select a model and engine and see results for different token lengths.
  • 13. For example, the Qwen 3 mixture-of-experts model can process one request per second with 128 tokens in and 1024 tokens out.
  • 14. The speaker notes that it is generally faster to process tokens already in the context than to generate new ones.
  • 15. They also point out that matrix multiplications are more efficient than moving data around or communicating between processors.
  • 16. The speaker encourages users to focus on throughput-oriented tasks and to use lower-precision formats like BF16, which speed up matrix multiplications (a small timing sketch appears after this list).
  • 17. The speaker provides a URL where users can download the raw data and notes that the code is open source for those who want to run their own benchmarks.
  • 18. The speaker's methodology involves calculating maximum throughput by sending a thousand requests at once and dividing the number of requests by the time it took (both measurements are sketched in code after this list).
  • 19. They also measure the fastest possible single-request speed by sending one request at a time and waiting for each to return before sending the next.
  • 20. The speaker notes that total throughput can be increased by scaling out rather than up, and encourages users to consult their benchmarking methodology for more information.
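A rough sketch of the two measurements described in points 18 and 19 is below. This is not the speaker's actual benchmarking code: the endpoint URL, model name, payload shape, and the choice of the httpx client are all assumptions, and any OpenAI-compatible completions server (for example one launched with vLLM or SGLang) could stand in.

```python
import asyncio
import time

import httpx  # assumed async HTTP client; any equivalent would work

URL = "http://localhost:8000/v1/completions"  # hypothetical local server
PAYLOAD = {"model": "my-model", "prompt": "hello", "max_tokens": 1024}  # placeholder values


async def one_request(client: httpx.AsyncClient) -> None:
    # Fire a single completion request and wait for the full response.
    resp = await client.post(URL, json=PAYLOAD, timeout=None)
    resp.raise_for_status()


async def max_throughput(n: int = 1000) -> float:
    # Send n requests at once; throughput is requests divided by wall-clock time.
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client) for _ in range(n)))
        elapsed = time.perf_counter() - start
    return n / elapsed  # requests per second under full load


async def best_case_latency(n: int = 50) -> float:
    # Send one request at a time, waiting for each to return before sending
    # the next; the average round trip is the fastest the server can respond.
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        for _ in range(n):
            await one_request(client)
        elapsed = time.perf_counter() - start
    return elapsed / n  # seconds per request with no concurrency


if __name__ == "__main__":
    print("max throughput (req/s):", asyncio.run(max_throughput()))
    print("best-case latency (s/req):", asyncio.run(best_case_latency()))
```

The concurrent run keeps the server's batches full, so dividing requests by elapsed time approximates maximum throughput, while the sequential loop removes queueing and batching effects and isolates single-request latency.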
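On the BF16 point (item 16), a small PyTorch timing sketch, not taken from the talk, shows one way to compare FP32 and BF16 matrix multiplication on a GPU; the matrix size and iteration count are arbitrary, and the actual speedup depends on the hardware and the kernels selected.

```python
import time

import torch  # assumes a CUDA-capable GPU with BF16 support (Ampere or newer)


def time_matmul(dtype: torch.dtype, n: int = 4096, iters: int = 20) -> float:
    # Time square n x n matrix multiplications in the given precision.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                        # warm up kernels before timing
    torch.cuda.synchronize()     # make sure setup and warm-up are finished
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()     # wait for all kernels to complete
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    print("fp32 s/matmul:", time_matmul(torch.float32))
    print("bf16 s/matmul:", time_matmul(torch.bfloat16))
```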

Source: AI Engineer via YouTube

❓ What do you think? What does the future of language model engines look like, and how will their development impact industries and society as a whole? Feel free to share your thoughts in the comments!