Scaling AI for Mission-Critical Decisions: A Medical Doctor's Approach to Real-Time, Reference-Free Evaluations

Join Christopher Ljy, medical doctor turned AI engineer, as he shares his insights on building an evaluation system that works at scale and supports mission-critical decisions in healthcare.

1. Christopher LJY is a medical doctor turned AI engineer who will discuss building an evaluation system that works at scale, focusing on mission-critical decisions like those in healthcare.
2. Anterior has scaled to serve insurance providers covering 50 million American lives, and the talk shares insights from the last 18 months.
3. Real-time reference-free evaluations can build customer trust by ensuring accuracy, especially in industries where mistakes are not tolerated, such as healthcare.
4. Going from an MVP (Minimum Viable Product) to serving customers at scale comes with new challenges, including increased edge cases that may not be apparent during the initial development phase.
5. An example of a potential error is misinterpreting medical records: describing a finding as "suspicious" implies there is no confirmed diagnosis, which is wrong when the patient actually has one.
6. Mistakes can lead to lawsuits against US healthcare organizations over inappropriate use of AI automation.
7. To identify and handle failure cases, consider performing human reviews of AI outputs, but be aware that this approach does not scale well as the volume of decisions increases.
8. An internal clinical team and tooling can help make human reviews more efficient by surfacing context in an accessible way without requiring scrolling.
9. Human reviewers can add critiques to flag incorrect answers, which can then be used to generate ground truths (descriptions of the correct answer) for offline evaluations.
10. Offline evaluations using gold-standard datasets can help iterate on AI pipelines and monitor performance over time, but relying solely on them might mean identifying issues too late.
11. A real-time reference-free evaluation system is crucial for large-scale, high-heterogeneity input spaces like medical records, as it allows for immediate evaluation and response to issues.
12. Using an LLM (large language model) as a judge can help estimate confidence in each output by evaluating it before any human review.
13. Scoring systems for LLMs as judges can evaluate helpfulness, conciseness, on-brand tone, and confidence levels in binary or multiclass classifications.
14. Real-time reference-free evaluations can predict estimated performance across all cases, identify relevant cases with the highest probability of error, and dynamically prioritize human reviews bas
15. Validating the validator improves the system's ability to detect edge cases over time, and the resulting system becomes harder for competitors to replicate.
16. Incorporating a reference-free evaluation system into the pipeline can ensure customer trust by providing accurate outputs or taking further actions when necessary.
17. At Anterior, this approach has enabled them to review tens of thousands of cases with a small team of clinical experts instead of hiring hundreds of nurses.
18. Strong alignment between AI and human reviews has been achieved, along with quick error identification and response.
19. Provably industry-leading performance at prior authorization has been attained, leading to customer trust and positive feedback.
20. Principles for building a successful evaluation system include thinking big, using review data to improve the auditing system, evaluating on live production data, getting the best reviewers, empow
21. An effective evaluation system provides real-time performance estimates and enables a scalable, cost-effective solution powered by a small team of experts.
22. The talk recommends reaching out with thoughts or ideas, and encourages interested individuals to check out open positions at Anterior.
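
The reference-free steps described in points 12 to 16 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anterior's implementation: the `Case` fields, the function names, the decision labels, and the 0.8 threshold are all hypothetical, and the judge confidence is assumed to have already been produced by an LLM judge (no model is called here).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Case:
    case_id: str
    ai_decision: str         # hypothetical label, e.g. "approve" or "escalate"
    judge_confidence: float  # assumed: judge's estimated probability the decision is correct, in [0, 1]


def estimated_accuracy(cases: List[Case]) -> float:
    """Reference-free performance estimate: mean judge confidence over all cases."""
    if not cases:
        raise ValueError("no cases to score")
    return sum(c.judge_confidence for c in cases) / len(cases)


def prioritize_for_review(cases: List[Case], budget: int) -> List[Case]:
    """Pick the `budget` cases with the lowest confidence (highest estimated error probability)."""
    return sorted(cases, key=lambda c: c.judge_confidence)[:budget]


def route(case: Case, threshold: float = 0.8) -> str:
    """Pipeline action: return confident outputs, send the rest for human review."""
    return "return_to_customer" if case.judge_confidence >= threshold else "human_review"


def judge_alignment(reviewed: List[Tuple[Case, bool]], threshold: float = 0.8) -> float:
    """Validate the validator: share of human-reviewed cases where the judge's
    confident / not-confident call matches the human's correct / incorrect verdict."""
    if not reviewed:
        raise ValueError("no reviewed cases")
    agree = sum(
        (case.judge_confidence >= threshold) == human_says_correct
        for case, human_says_correct in reviewed
    )
    return agree / len(reviewed)


if __name__ == "__main__":
    cases = [
        Case("a", "approve", 0.95),
        Case("b", "approve", 0.40),
        Case("c", "escalate", 0.70),
        Case("d", "approve", 0.90),
    ]
    print("estimated accuracy:", estimated_accuracy(cases))
    queue = prioritize_for_review(cases, budget=2)
    print("review first:", [c.case_id for c in queue])
    # Hypothetical human verdicts for the two reviewed cases.
    print("judge/human alignment:", judge_alignment([(queue[0], False), (queue[1], True)]))
```

The same confidence score drives all three uses here: an overall performance estimate, the ordering of the human review queue, and the pipeline's decision to return an output or escalate it. Comparing the judge against human verdicts is what lets the threshold and the judge itself be tuned over time.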

Source: AI Engineer via YouTube
