Advances in Open Models: Running Large Language Models with SGLang, vLLM, and TensorRT-LLM

Exploring the rapidly evolving landscape of Large Language Models (LLMs) and their potential applications, with a focus on performance benchmarking and optimization.

  • 1. The speaker recently worked on evaluating the speed of inference engines for open models.
  • 2. They have been attending the AI Engineer summit since its inception two years ago, and have noticed a shift towards more practical applications of AI technologies.
  • 3. For a long time, the conference was focused on showcasing new technologies rather than providing opportunities to work with them directly.
  • 4. However, the development of open models like Llama, Qwen, and DeepSeek has changed this landscape, making it possible to run interesting experiments and applications using these models.
  • 5. The speaker notes that there are now open-source inference engines available, such as vLLM, SGLang, and TensorRT-LLM, which make it easier to run language models.
  • 6. Running one's own models is becoming less common and would require a strong motivation, such as working for the US government or wanting to use decentralized, crypto-based models.
  • 7. The speaker predicted in 2023 that open models would catch up to proprietary ones once capability requirements saturate.
  • 8. At that time, there were only a few LLM inference libraries available, but now vLLM is a good option that has stuck around.
  • 9. The speaker's benchmarking software allows users to see how different language models perform with various context lengths and engines.
  • 10. The benchmarking results are available on Modal's website and include methodology details, open-source code, and an executive summary.
  • 11. The speaker hopes to add more information about speculative decoding, multi-token prediction, and quantization in the future.
  • 12. The benchmarking software allows users to select a model and engine and see results for different token lengths.
  • 13. For example, the Qwen 3 mixture-of-experts model can process one request per second with 128 tokens in and 1024 tokens out.
  • 14. The speaker notes that it is generally faster to process tokens already in the context than to generate new ones.
  • 15. They also point out that matrix multiplications are more efficient than moving data around or communicating between processors.
  • 16. The speaker encourages users to focus on throughput-oriented tasks and to use lower-precision formats like BF16, which speed up matrix multiplications (a small timing sketch appears after this list).
  • 17. The speaker provides a URL where users can download the raw data and notes that the code is open source for those who want to run their own benchmarks.
  • 18. The speaker's methodology involves calculating maximum throughput by sending a thousand requests at once and dividing the number of requests by the time it took (both measurements are sketched in code after this list).
  • 19. They also measure the fastest possible single-request speed by sending one request at a time and waiting for each to return before sending the next.
  • 20. The speaker notes that total throughput can be increased by scaling out rather than up, and encourages users to consult their benchmarking methodology for more information.
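A rough sketch of the two measurements described in points 18 and 19 is below. This is not the speaker's actual benchmarking code: the endpoint URL, model name, payload shape, and the choice of the httpx client are all assumptions, and any OpenAI-compatible completions server (for example one launched with vLLM or SGLang) could stand in.

```python
import asyncio
import time

import httpx  # assumed async HTTP client; any equivalent would work

URL = "http://localhost:8000/v1/completions"  # hypothetical local server
PAYLOAD = {"model": "my-model", "prompt": "hello", "max_tokens": 1024}  # placeholder values


async def one_request(client: httpx.AsyncClient) -> None:
    # Fire a single completion request and wait for the full response.
    resp = await client.post(URL, json=PAYLOAD, timeout=None)
    resp.raise_for_status()


async def max_throughput(n: int = 1000) -> float:
    # Send n requests at once; throughput is requests divided by wall-clock time.
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        await asyncio.gather(*(one_request(client) for _ in range(n)))
        elapsed = time.perf_counter() - start
    return n / elapsed  # requests per second under full load


async def best_case_latency(n: int = 50) -> float:
    # Send one request at a time, waiting for each to return before sending
    # the next; the average round trip is the fastest the server can respond.
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        for _ in range(n):
            await one_request(client)
        elapsed = time.perf_counter() - start
    return elapsed / n  # seconds per request with no concurrency


if __name__ == "__main__":
    print("max throughput (req/s):", asyncio.run(max_throughput()))
    print("best-case latency (s/req):", asyncio.run(best_case_latency()))
```

The concurrent run keeps the server's batches full, so dividing requests by elapsed time approximates maximum throughput, while the sequential loop removes queueing and batching effects and isolates single-request latency.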
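On the BF16 point (item 16), a small PyTorch timing sketch, not taken from the talk, shows one way to compare FP32 and BF16 matrix multiplication on a GPU; the matrix size and iteration count are arbitrary, and the actual speedup depends on the hardware and the kernels selected.

```python
import time

import torch  # assumes a CUDA-capable GPU with BF16 support (Ampere or newer)


def time_matmul(dtype: torch.dtype, n: int = 4096, iters: int = 20) -> float:
    # Time square n x n matrix multiplications in the given precision.
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b                        # warm up kernels before timing
    torch.cuda.synchronize()     # make sure setup and warm-up are finished
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()     # wait for all kernels to complete
    return (time.perf_counter() - start) / iters


if __name__ == "__main__":
    print("fp32 s/matmul:", time_matmul(torch.float32))
    print("bf16 s/matmul:", time_matmul(torch.bfloat16))
```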

Source: AI Engineer via YouTube

❓ What do you think? What does the future of language model engines look like, and how will their development impact industries and society as a whole? Feel free to share your thoughts in the comments!