Exploring the Value of Domain-Specific Models: A Case Study in Financial Services

In this talk, Wasim, co-founder and CTO of Writer, describes the company's journey in building domain-specific language models and presents research suggesting that, even when general models reach high accuracy, domain-specific models still need to be built and refined for reliable use.

1. Wasim is a co-founder and CTO of Writer, which started in 2020.
2. They began by building encoder-decoder models and have since created a family of about 36 models, both general and domain-specific.
3. Recently, they noticed language models (LMs) achieving high accuracy in various domains, prompting the question: should they continue building domain-specific models if general models can already achieve high accuracy?
4. To answer this, Writer created a real-world scenario evaluation called "fail" to test new models and see if they deliver the promised accuracy.
5. The evaluation includes two main categories: query failure and context failure.
6. Query failure has three subcategories: misspelling queries, incomplete queries, and out-of-domain (OOD) queries.
7. Context failure also has three subcategories: missing context, OCR errors, and irrelevant context.
8. Writer created a diverse dataset for the financial services domain to evaluate these models.
9. They introduced a simple evaluation metric that looks at two factors: whether the model gives the correct answer and whether it stays grounded in the provided context.
10. They selected various chat models and thinking models for the evaluation.
11. The results showed that models generally answered queries well, but many still failed when given wrong data or context.
12. Most models produced answers with high hallucination rates, especially on financial benchmarks.
13. There is a significant gap between robustness and resistance to hallucination, even for the best models.
14. According to Wasim, having only the best model isn't enough; building full-stack systems with grounding and guardrails is necessary for reliability.
15. Despite general models achieving high accuracy, domain-specific models are still needed due to the significant gap in context following and grounding.
16. The financial services domain specifically requires robust domain-specific models.
17. Writer's evaluation set, white paper, and leaderboard are open-source and available on GitHub and Hugging Face.
18. Smaller models can sometimes outperform larger, more complex models in context following.
19. Chain-of-thought reasoning may need further investigation in light of the results on domain-specific tasks.
20. Even with high accuracy, there is still work to be done to improve model performance and reliability.
21. According to Wasim, a combination of technology advancements and full-stack systems will be required to achieve optimal performance.
22. For now, building and maintaining domain-specific models is still necessary for reliable use in the market.
23. Wasim encourages exploring the resources available on GitHub and Hugging Face for further research and collaboration.
24. Despite progress in language model accuracy, there are still challenges to overcome before achieving satisfactory reliability and context following in various domains.
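
To make items 5 through 9 concrete, here is a minimal Python sketch of the failure taxonomy and a two-factor score. It is not Writer's implementation. The class names, the Judgement fields, the summarize function, and the rule that declining to answer counts as grounded when the context cannot support an answer are all assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class QueryFailure(Enum):
    """The three query failure subcategories from item 6."""
    MISSPELLED = "misspelled query"
    INCOMPLETE = "incomplete query"
    OUT_OF_DOMAIN = "out-of-domain query"


class ContextFailure(Enum):
    """The three context failure subcategories from item 7."""
    MISSING = "missing context"
    OCR_ERRORS = "OCR errors"
    IRRELEVANT = "irrelevant context"


@dataclass
class Judgement:
    """One graded response. Field names are hypothetical, not from the talk."""
    case: Enum         # failure category of the test case
    answerable: bool   # does the supplied context support an answer?
    answered: bool     # did the model answer rather than decline?
    correct: bool      # if it answered, did it match the reference?


def rate(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there is nothing to score."""
    return hits / total if total else None


def summarize(judgements: list[Judgement]) -> dict[str, Optional[float]]:
    """Score the two factors separately (assumed scoring rule)."""
    answerable = [j for j in judgements if j.answerable]
    unanswerable = [j for j in judgements if not j.answerable]
    return {
        # Factor 1: correct answers when the context supports one.
        "correct_rate": rate(
            sum(j.answered and j.correct for j in answerable), len(answerable)
        ),
        # Factor 2: declining when the context does not support an answer.
        "grounded_rate": rate(
            sum(not j.answered for j in unanswerable), len(unanswerable)
        ),
    }


# Invented sample judgements, for illustration only.
sample = [
    Judgement(QueryFailure.MISSPELLED, answerable=True, answered=True, correct=True),
    Judgement(ContextFailure.OCR_ERRORS, answerable=True, answered=True, correct=False),
    Judgement(ContextFailure.MISSING, answerable=False, answered=False, correct=False),
    Judgement(ContextFailure.IRRELEVANT, answerable=False, answered=True, correct=False),
]
scores = summarize(sample)
print(scores)
```

Reporting the two rates separately, rather than folding them into one accuracy number, is what would let a gap like the one in item 13 become visible: a model can do well on the first factor while still answering in cases where it should decline.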

Source: AI Engineer via YouTube
