Exploring the Value of Domain-Specific Models: A Case Study in Financial Services

Wasim, co-founder and CTO of Writer, describes the company's journey in building domain-specific language models and presents research suggesting that, even when general models reach high accuracy, domain-specific models still need to be built and refined for reliable use.

  • 1. Wasim is a co-founder and CTO of Writer, which started in 2020.
  • 2. Writer began by building encoder-decoder models and has since created a family of about 36 models, including general and domain-specific ones.
  • 3. Recently, the team noticed language models (LMs) achieving high accuracy in various domains, prompting the question: should they continue building domain-specific models if general models can already reach that accuracy?
  • 4. To answer this, Writer created a real-world scenario evaluation called "fail" to test new models and see if they deliver the promised accuracy.
  • 5. The evaluation includes two main categories: query failure and context failure.
  • 6. Query failure has three subcategories: misspelled queries, incomplete queries, and out-of-domain (OOD) queries.
  • 7. Context failure also has three subcategories: missing context, OCR errors, and irrelevant context.
  • 8. Writer created a diverse dataset for the financial services domain to evaluate these models.
  • 9. They introduced a simple set of key evaluation metrics that look at two factors: whether the model gives the correct answer and whether it stays grounded in the provided context.
  • 10. They selected various chat models and thinking models for the evaluation.
  • 11. The results showed good behavior in answering queries, but many models still failed when given wrong data or context.
  • 12. Most models still produce answers with high hallucination rates, especially on financial benchmarks.
  • 13. There is a significant gap between model robustness and hallucination, even for the best models.
  • 14. According to Wasim, having only the best model isn't enough; building full-stack systems with grounding and guardrails is necessary for reliability.
  • 15. Despite general models achieving high accuracy, domain-specific models are still needed due to the significant gap in context following and grounding.
  • 16. The financial services domain specifically requires robust domain-specific models.
  • 17. Writer's evaluation set, white paper, and leaderboard are open-source and available on GitHub and Hugging Face.
  • 18. Smaller models can sometimes outperform larger, more complex models in context following.
  • 19. The data from domain-specific tasks suggests that chain-of-thought reasoning may need further investigation.
  • 20. Even with high accuracy, there is still work to be done to improve model performance and reliability.
  • 21. According to Wasim, a combination of technology advancements and full-stack systems will be required to achieve optimal performance.
  • 22. For now, building and maintaining domain-specific models is still necessary for reliable use in the market.
  • 23. Wasim encourages exploring the resources available on GitHub and Hugging Face for further research and collaboration.
  • 24. Despite progress in language model accuracy, there are still challenges to overcome before achieving satisfactory reliability and context following in various domains.
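
The two-factor evaluation in items 5 to 13 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the class and field names (`Perturbation`, `Result`, `answerable`, `summarize`) are hypothetical and are not taken from Writer's released evaluation set or code, and the actual benchmark may score answers differently. The sketch only shows why a model can score well on correct answers and still score poorly on grounding.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Perturbation(Enum):
    """The six failure subcategories listed in the summary."""
    # Query failures
    MISSPELLED_QUERY = "misspelled_query"
    INCOMPLETE_QUERY = "incomplete_query"
    OUT_OF_DOMAIN_QUERY = "out_of_domain_query"
    # Context failures
    MISSING_CONTEXT = "missing_context"
    OCR_ERRORS = "ocr_errors"
    IRRELEVANT_CONTEXT = "irrelevant_context"


@dataclass
class Result:
    perturbation: Perturbation
    answerable: bool  # hypothetical label: does the given context support an answer?
    answered: bool    # did the model produce an answer instead of declining?
    correct: bool     # was that answer judged correct? (ignored if not answered)


def _rate(hits: int, total: int) -> Optional[float]:
    # Return None rather than dividing by zero when a group is empty.
    return hits / total if total else None


def summarize(results: List[Result]) -> Dict[str, Optional[float]]:
    """Compute the two factors: correct answers and context grounding."""
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    return {
        # Factor 1: correct answers on cases the context can support.
        "correct_answer_rate": _rate(
            sum(r.answered and r.correct for r in answerable), len(answerable)
        ),
        # Factor 2: declining to answer when the context cannot support one.
        "grounding_rate": _rate(
            sum(not r.answered for r in unanswerable), len(unanswerable)
        ),
    }


if __name__ == "__main__":
    demo = [
        Result(Perturbation.MISSPELLED_QUERY, True, True, True),
        Result(Perturbation.INCOMPLETE_QUERY, True, True, False),
        Result(Perturbation.OCR_ERRORS, True, True, True),
        Result(Perturbation.MISSING_CONTEXT, False, True, False),      # answered anyway
        Result(Perturbation.IRRELEVANT_CONTEXT, False, False, False),  # declined
    ]
    print(summarize(demo))
```

In this toy data the model answers two of three answerable cases correctly but declines only one of the two cases where the context cannot support an answer, which is the kind of gap between answering and grounding that the talk highlights.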

Source: AI Engineer via YouTube
