Exponential Growth in Large Language Models: Training Finance Experts with Domain-Specific Models & Extended Context

Leo, Chief Scientific Officer at Gradient, dives into how Gradient trained large language models to become finance experts and discusses two key solutions: a domain-specific finance language model and an extended-context-length model that addresses hallucinations.

  • 1. Leo is the Chief Scientific Officer at Gradient.
  • 2. He will discuss training large language models to be Finance experts.
  • 3. Foundational models have been growing at an exponential rate.
  • 4. Different companies have their own flavor of a language model, each with its own features and use cases.
  • 5. Context length in language models has increased significantly over the past year.
  • 6. The largest context-length models were around 100k tokens a year ago but have grown to about 40 times that now.
  • 7. Large language models are not one-size-fits-all, especially for complicated use cases.
  • 8. Generalist or base language models won't get you very far in more complex scenarios.
  • 9. Gradient has built an AI Foundry, a collection of custom language models and workflow primitives.
  • 10. These pieces are combined to create solutions tailored for their customers.
  • 11. Today, Leo will focus on solutions for the finance domain - building Financial experts.
  • 12. Two components have been particularly useful in building these solutions: a domain-specific Finance language model and context length extension.
  • 13. Six requirements were identified for finance applications of language models, areas where generalist models often fall short.
  • 14. Domain knowledge is crucial because general purpose language models are trained on a broad set of data, but may not perform well with specific technical financial information.
  • 15. Thousands of documents relevant to the topic are needed in the model's pre-training data to reach decent accuracy.
  • 16. To train a finance-specific language model, an automated data pipeline is required due to the vast amount of financial data available.
  • 17. Automated data curation borrows ideas from the membership-inference literature to check whether the model has already seen a candidate document during its original training (a filtering sketch follows this list).
  • 18. After filtering out seen data, human review and synthetic data augmentation are used to create a manageable dataset for training.
  • 19. The training pipeline then runs continued pre-training followed by alignment, i.e. supervised fine-tuning and preference optimization (a loss sketch follows this list).
  • 20. Pre-training is like reading the textbooks, while alignment is like instructing the model on best practices for using that information.
  • 21. Extending context length also helps address hallucinations in language models (a context-extension sketch follows this list).
  • 22. Hallucinations occur when a model generates content that is irrelevant to or inconsistent with the input data.
  • 23. Causes of hallucinations include outdated training data, automated data collection issues, and source reference divergence.
  • 24. In-context learning is the most direct and sample-efficient way to reduce hallucinations, introducing small amounts of relevant information into the prompt at inference time (a grounded-prompt sketch follows this list).
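
As a concrete illustration of the curation step in point 17, one membership-inference-style heuristic is to score each candidate document with the base model's own loss: documents the model already assigns unusually low loss to were likely seen during pre-training and can be dropped. This is a minimal sketch; the model name, threshold, and helper below are assumptions, not Gradient's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base checkpoint; the talk does not name the exact model Gradient starts from.
MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()


@torch.no_grad()
def doc_loss(text: str) -> float:
    """Average next-token loss the base model assigns to a document."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    return model(**inputs, labels=inputs["input_ids"]).loss.item()


def filter_unseen(docs: list[str], threshold: float = 2.5) -> list[str]:
    """Keep documents the model is 'surprised' by (high loss), i.e. likely unseen.

    The threshold is hypothetical; in practice it would be calibrated against
    documents known to be inside and outside the pre-training corpus.
    """
    return [d for d in docs if doc_loss(d) > threshold]
```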
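
Point 19's two stages can be summarized in a couple of loss functions. The sketch below is a minimal PyTorch illustration, assuming a Hugging Face-style causal LM and DPO as the preference-optimization method; the talk does not specify which algorithm Gradient actually uses.

```python
import torch
import torch.nn.functional as F


def continued_pretraining_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """'Reading the textbooks': standard next-token prediction on domain documents."""
    return model(input_ids=input_ids, labels=input_ids).loss


def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: prefer the chosen response over the rejected
    one, measured relative to a frozen reference model. Inputs are the summed
    log-probabilities of each response under the policy and reference models."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Supervised fine-tuning sits between these two stages and uses the same next-token loss, just computed on instruction-response pairs rather than raw documents.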
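
On the context-extension side (points 5, 6, 12, and 21), one widely used recipe for RoPE-based models is to raise the rotary base frequency and then continue training on long documents. The talk does not detail Gradient's exact method, so the checkpoint and values below are illustrative assumptions only.

```python
from transformers import AutoConfig, AutoModelForCausalLM

BASE = "meta-llama/Meta-Llama-3-8B"  # illustrative base checkpoint

config = AutoConfig.from_pretrained(BASE)
config.rope_theta = config.rope_theta * 16     # hypothetical scaling factor
config.max_position_embeddings = 262_144       # hypothetical target context window

# Long-context continued pre-training on documents of this length would follow;
# without it, the model sees position ranges it was never trained on.
model = AutoModelForCausalLM.from_pretrained(BASE, config=config)
```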
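
Finally, the in-context learning of point 24 can be as simple as placing retrieved, up-to-date source text directly in the prompt at inference time, so the model answers from the supplied context rather than stale training data. The prompt template and the `retrieve` step below are placeholders, not a specific product API.

```python
def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    """Inject retrieved source documents into the prompt to ground the answer."""
    context = "\n\n".join(f"[Document {i + 1}]\n{s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using only the documents below. "
        "If the answer is not in the documents, say you do not know.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical usage; `retrieve` stands in for any search or retrieval step.
# snippets = retrieve("latest 10-K risk factors for ACME Corp", k=3)
# prompt = build_grounded_prompt("What are ACME Corp's main risk factors?", snippets)
```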

Source: AI Engineer via YouTube
