Exponential Growth in Large Language Models: Training Finance Experts with Domain-Specific Models & Extended Context

Leo, Chief Scientific Officer at Gradient, dives into how Gradient trained large language models to become finance experts and discusses two key solutions: a domain-specific finance language model and an extended-context-length model that addresses hallucinations.

  • 1. Leo is the Chief Scientific Officer at Gradient.
  • 2. He will discuss training large language models to be Finance experts.
  • 3. Foundational models have been growing at an exponential rate.
  • 4. Different companies have their own flavor of a language model, each with its own features and use cases.
  • 5. Context length in language models has increased significantly over the past year.
  • 6. The largest context-length models were around 100k tokens a year ago but have grown to about 40 times that now.
  • 7. Large language models are not one-size-fits-all, especially for complicated use cases.
  • 8. Generalist or base language models won't get you very far in more complex scenarios.
  • 9. Gradient has built an AI Foundry, a collection of custom language models and workflow primitives.
  • 10. These pieces are combined to create solutions tailored for their customers.
  • 11. Today, Leo will focus on solutions for the finance domain - building Financial experts.
  • 12. Two components have been particularly useful in building these solutions: a domain-specific Finance language model and context length extension.
  • 13. Six requirements were identified for finance applications of language models, areas where generalist models often fall short.
  • 14. Domain knowledge is crucial because general purpose language models are trained on a broad set of data, but may not perform well with specific technical financial information.
  • 15. Thousands of documents relevant to the topic are needed in the model's pre-training data to reach decent accuracy.
  • 16. To train a finance-specific language model, an automated data pipeline is required due to the vast amount of financial data available.
  • 17. Automated data curation borrows ideas from the membership-inference literature to check whether the model has already seen a candidate document during its original training (a filtering sketch follows this list).
  • 18. After filtering out seen data, human review and synthetic data augmentation are used to create a manageable dataset for training.
  • 19. The training pipeline then runs continued pre-training followed by alignment, i.e. supervised fine-tuning and preference optimization (a loss sketch follows this list).
  • 20. Pre-training is like reading the textbooks, while alignment is like instructing the model on best practices for using that information.
  • 21. Extending context length also helps address hallucinations in language models (a context-extension sketch follows this list).
  • 22. Hallucinations occur when a model generates content that is irrelevant to or inconsistent with the input data.
  • 23. Causes of hallucinations include outdated training data, automated data collection issues, and source reference divergence.
  • 24. In-context learning is the most direct and sample-efficient way to reduce hallucinations, introducing small amounts of relevant information into the prompt at inference time (a grounded-prompt sketch follows this list).
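
As a concrete illustration of the curation step in point 17, one membership-inference-style heuristic is to score each candidate document with the base model's own loss: documents the model already assigns unusually low loss to were likely seen during pre-training and can be dropped. This is a minimal sketch; the model name, threshold, and helper below are assumptions, not Gradient's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; the talk does not name the exact model Gradient starts from.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def doc_loss(text: str) -> float:
    """Average next-token loss the base model assigns to a document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    return model(**inputs, labels=inputs["input_ids"]).loss.item()


def filter_unseen(docs: list[str], threshold: float = 2.5) -> list[str]:
    """Keep documents the model is 'surprised' by (high loss), i.e. likely unseen.

    The threshold is hypothetical; in practice it would be calibrated against
    documents known to be inside and outside the pre-training corpus.
    """
    return [d for d in docs if doc_loss(d) > threshold]
```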
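
Point 19's two stages can be summarized in a couple of loss functions. The sketch below is a minimal PyTorch illustration, assuming a Hugging Face-style causal LM and DPO as the preference-optimization method; the talk does not specify which algorithm Gradient actually uses.

```python
import torch
import torch.nn.functional as F


def continued_pretraining_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """'Reading the textbooks': standard next-token prediction on domain documents."""
    return model(input_ids=input_ids, labels=input_ids).loss


def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: prefer the chosen response over the rejected
    one, measured relative to a frozen reference model. Inputs are the summed
    log-probabilities of each response under the policy and reference models."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Supervised fine-tuning sits between these two stages and uses the same next-token loss, just computed on instruction-response pairs rather than raw documents.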
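
On the context-extension side (points 5, 6, 12, and 21), one widely used recipe for RoPE-based models is to raise the rotary base frequency and then continue training on long documents. The talk does not detail Gradient's exact method, so the checkpoint and values below are illustrative assumptions only.

```python
from transformers import AutoConfig, AutoModelForCausalLM

BASE = "meta-llama/Meta-Llama-3-8B"  # illustrative base checkpoint

config = AutoConfig.from_pretrained(BASE)
config.rope_theta = config.rope_theta * 16     # hypothetical scaling factor
config.max_position_embeddings = 262_144       # hypothetical target context window

# Long-context continued pre-training on documents of this length would follow;
# without it, the model sees position ranges it was never trained on.
model = AutoModelForCausalLM.from_pretrained(BASE, config=config)
```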
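
Finally, the in-context learning of point 24 can be as simple as placing retrieved, up-to-date source text directly in the prompt at inference time, so the model answers from the supplied context rather than stale training data. The prompt template and the `retrieve` step below are placeholders, not a specific product API.

```python
def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Inject retrieved source documents into the prompt to ground the answer."""
    context = "\n\n".join(f"[Document {i + 1}]\n{s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the documents below. "
        "If the answer is not in the documents, say you do not know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical usage; `retrieve` stands in for any search or retrieval step.
# snippets = retrieve("latest 10-K risk factors for ACME Corp", k=3)
# prompt = build_grounded_prompt("What are ACME Corp's main risk factors?", snippets)
```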

Source: AI Engineer via YouTube
