Advances in Open Models: Running Large Language Models with SGLang, vLLM, and TensorRT-LLM
Exploring the rapidly evolving landscape of Large Language Models (LLMs) and their potential applications, with a focus on performance benchmarking and optimization.
- 1. The speaker recently worked on evaluating the speed of inference engines for open models.
- 2. They have been attending the AI Engineer Summit since its inception two years ago and have noticed a shift towards more practical applications of AI technologies.
- 3. For a long time, the conference was focused on showcasing new technologies rather than providing opportunities to work with them directly.
- 4. However, the development of open models like Llama, Qwen, and DeepSeek has changed this landscape, making it possible to run interesting experiments and applications with these models.
- 5. The speaker notes that there are now open-source inference engines available, such as vLLM, SGLang, and TensorRT-LLM, which make it easier to run language models.
- 6. Running one's own models is becoming less common, and would require a strong motivation, such as working for the US government or wanting to use decentralized crypto models.
- 7. The speaker predicted in 2023 that open models would catch up to proprietary ones once capability requirements saturate.
- 8. At that time, there were only a few LLM inference libraries available, but vLLM is one good option that has stuck around.
- 9. The speaker's benchmarking software allows users to see how different language models perform with various context lengths and engines.
- 10. The benchmarking results are available on Modal's website and include methodology details, open-source code, and an executive summary.
- 11. The speaker hopes to add more information about speculative decoding, multi-token prediction, and quantization in the future.
- 12. The benchmarking software allows users to select a model and engine and see results for different token lengths.
- 13. For example, the Qwen 3 mixture-of-experts model can process one request per second with 128 tokens in and 1024 tokens out.
- 14. The speaker notes that processing input (prompt) tokens is generally much faster than generating output tokens, since prompt tokens can be processed in parallel while output tokens are produced one at a time.
- 15. They also point out that matrix multiplications run more efficiently on modern hardware than moving data around memory or communicating between processors.
- 16. The speaker encourages users to focus on throughput-oriented workloads and to use lower-precision formats like BF16, which make matrix multiplications faster (a minimal FP32-vs-BF16 comparison is sketched after this list).
- 17. The speaker provides a URL where users can download the raw data and notes that the code is open source for those who want to run their own benchmarks.
- 18. The speaker's methodology calculates maximum throughput by sending a thousand requests at once and dividing the number of completed requests by the total time taken.
- 19. They also measure the fastest possible single-request speed by sending one request at a time and waiting for each to return before sending the next (both measurements are sketched in code after this list).
- 20. The speaker notes that total throughput can be increased by scaling out rather than up, and encourages users to consult their benchmarking methodology for more information.
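To illustrate the BF16 point in item 16, here is a minimal PyTorch sketch (not from the talk) that compares FP32 and BF16 matrix-multiplication time on a CUDA GPU; the matrix size and iteration count are arbitrary choices for illustration.

```python
# Minimal sketch: FP32 vs BF16 matrix-multiplication time on a GPU.
# Assumes PyTorch and a CUDA device are available; sizes are illustrative.
import time
import torch

def time_matmul(dtype: torch.dtype, n: int = 4096, iters: int = 50) -> float:
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up so kernel launch/setup is not counted
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.perf_counter() - start

if __name__ == "__main__":
    fp32 = time_matmul(torch.float32)
    bf16 = time_matmul(torch.bfloat16)
    print(f"FP32: {fp32:.3f}s  BF16: {bf16:.3f}s  speedup: {fp32 / bf16:.1f}x")
```

On GPUs with BF16 tensor cores, the BF16 run is typically several times faster, which is why throughput-oriented serving favors lower-precision formats.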
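The throughput and latency measurements in items 18 and 19 can be approximated with a short script. The sketch below is not Modal's open-source harness; it assumes an OpenAI-compatible endpoint (as exposed by vLLM or SGLang) at a placeholder URL, with a placeholder model name and illustrative token counts.

```python
# Sketch of the two measurements described above, not the speaker's actual harness.
# Assumes an OpenAI-compatible server (e.g. vLLM or SGLang) at BASE_URL; the model
# name, prompt, and request counts are placeholders.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/completions"  # hypothetical local endpoint
PAYLOAD = {
    "model": "qwen3-moe",      # placeholder model name
    "prompt": "hello " * 128,  # roughly 128 tokens in
    "max_tokens": 1024,        # 1024 tokens out
}

def make_client() -> httpx.AsyncClient:
    # Lift the default connection limit so all requests can actually run concurrently.
    return httpx.AsyncClient(limits=httpx.Limits(max_connections=None), timeout=None)

async def one_request(client: httpx.AsyncClient) -> None:
    resp = await client.post(BASE_URL, json=PAYLOAD)
    resp.raise_for_status()

async def max_throughput(n_requests: int = 1000) -> float:
    """Send all requests at once; throughput = completed requests / wall-clock time."""
    async with make_client() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client) for _ in range(n_requests)))
        elapsed = time.perf_counter() - start
    return n_requests / elapsed

async def best_case_latency(n_requests: int = 10) -> float:
    """Send requests one at a time, waiting for each to finish; return mean latency."""
    async with make_client() as client:
        latencies = []
        for _ in range(n_requests):
            start = time.perf_counter()
            await one_request(client)
            latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)

if __name__ == "__main__":
    print("max throughput (req/s):", asyncio.run(max_throughput()))
    print("best-case latency (s):", asyncio.run(best_case_latency()))
```

In line with item 20, scaling out then means running several such servers and summing their individual throughputs rather than pushing one server past saturation.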
Source: AI Engineer via YouTube
❓ What do you think? What does the future of language model engines look like, and how will their development impact industries and society as a whole? Feel free to share your thoughts in the comments!