Exploring AI for Scientific Discovery: A Case Study on LLMs and Reasoning Embeddings

Join Hubert, a professional in generative AI and medicinal chemistry, as he explores how to use Large Language Models (LLMs) for scientific reasoning and discovery, sharing insights from his work at Novartis.

  • 1. The speaker works at Novartis, focusing on designing small molecules with generative AI.
  • 2. Topic of talk: using large language models (LLMs) for scientific reasoning and discoveries.
  • 3. Collaboration with colleague Dr. Low, a medicinal chemist.
  • 4. Shared an interesting paradox in biology that led to a Nobel Prize discovery.
  • 5. The paradox involved trying to intensify the color of petunia flowers, which instead resulted in a "color flip."
  • 6. This phenomenon was observed across different subdomains of biology, taking eight years to resolve.
  • 7. Question: Can LLMs help speed up understanding such biological phenomena?
  • 8. Examining how LLMs process questions and produce answers through the RAG (Retrieval-Augmented Generation) pipeline; a minimal sketch of the pipeline appears after this list.
  • 9. Various techniques in the RAG pipeline, such as query rewriting and re-ranking, address different failure modes in question answering.
  • 10. LLMs struggle with complex or convoluted questions, requiring additional reasoning steps before retrieval.
  • 11. One example of pre-retrieval reasoning is routing, which directs a question to the most suitable index or tool before retrieval (see the routing sketch after this list).
  • 12. Papers such as GraphRAG and GraphReader represent the knowledge in documents as graphs, enabling a better grasp of semantically complex questions.
  • 13. Post-retrieval reasoning is typically delegated to LLMs, but it doesn't have to be the case.
  • 14. Different types of reasoning can be delegated to specialized tools, such as causal inference libraries or a code interpreter (REPL) for algorithmic reasoning.
  • 15. Focusing on finding the cause of these biological phenomena, and on defining the type of reasoning and retrieval the specific problem requires.
  • 16. Using RAG to evaluate a language model on scientific papers published before the discovery, avoiding "cheating" with post-discovery knowledge (a date-filtering sketch follows the list).
  • 17. Defining graded success levels for the experiment: recovering known hypotheses, surfacing less obvious links, retrieving exhaustive facts, and generating new hypotheses.
  • 18. Discovered hypotheses should be relevant to the problem, explain how the process happens, and ideally, not have been known before the discovery.
  • 19. Using Jenks natural breaks on similarity scores to determine how many embeddings (chunks) to keep for optimal results (see the sketch after this list).
  • 20. Using strict prompting to keep the model from drawing on knowledge from after the discovery.
  • 21. Implementing a relevance classifier: every chunk in the dataset is passed through the LLM, which is asked whether it is relevant to the problem (sketched together with the strict prompt after this list).
  • 22. Building a more sophisticated prompt, considering the scientist's perspective, helps find related hypotheses in the literature.
  • 23. Pre-retrieval reasoning over both the question and the database brings results closer to the ground truth without cheating.
  • 24. Takeaways: scientific discovery requires solving harder problems than simple Q&A; knowing your problem helps define a more efficient RAG architecture; and generalized needle-in-a-haystack methods might not be enough.
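
The RAG pipeline from item 8 boils down to: embed the question, retrieve the most similar chunks, and generate an answer conditioned on them. Below is a minimal, self-contained sketch; the bag-of-words "embedding" and the prompt-only answer step are stand-ins for a real embedding model and LLM call.

```python
# Minimal RAG sketch: retrieve the most similar chunks, then build an
# augmented prompt for the generator.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would use a dense embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [
    "Overexpressing a pigment gene in petunia unexpectedly silenced it.",
    "Antisense RNA can reduce expression of a target gene.",
    "Chromatography methods for small molecule purification.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return prompt  # in practice: return llm.generate(prompt)

print(answer("Why did boosting a pigment gene make petunia flowers lose color?"))
```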
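
Routing (item 11) is pre-retrieval reasoning: decide which index or tool should serve the question before searching. The keyword heuristic below is a hypothetical stand-in for the LLM-based router a production system would use.

```python
# Pre-retrieval routing sketch: pick the index most likely to hold the answer.
INDEXES = {
    "genetics": ["gene", "rna", "silencing", "expression"],
    "chemistry": ["molecule", "synthesis", "compound", "assay"],
}

def route(question: str) -> str:
    words = set(question.lower().split())
    scores = {name: len(words & set(kws)) for name, kws in INDEXES.items()}
    best, hits = max(scores.items(), key=lambda kv: kv[1])
    # Fall back to searching everything when no index clearly matches.
    return best if hits > 0 else "all"

print(route("What causes gene silencing after overexpression?"))  # -> genetics
```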
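
To avoid "cheating" (item 16), only papers published before the discovery are indexed. A sketch under assumed data, with an illustrative cutoff date and placeholder paper records:

```python
# Time-restricted corpus: index only pre-discovery papers.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    title: str
    published: date
    text: str

CUTOFF = date(1998, 1, 1)  # hypothetical pre-discovery cutoff

papers = [
    Paper("Co-suppression in petunia", date(1990, 4, 1), "..."),
    Paper("Post-discovery review", date(2005, 6, 1), "..."),
]

corpus = [p for p in papers if p.published < CUTOFF]
print([p.title for p in corpus])  # only pre-cutoff papers get indexed
```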
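
Jenks natural breaks (item 19) split one-dimensional values so that within-group variance is minimized; applied to retrieval scores, this finds a natural cutoff instead of a fixed top-k. The minimal one-break (two-class) variant below illustrates the idea; libraries such as jenkspy handle more classes.

```python
# One-break Jenks-style split over similarity scores: keep the chunks
# above the natural break rather than a fixed top-k.
def natural_break(scores: list[float]) -> float:
    xs = sorted(scores, reverse=True)

    def ssd(group: list[float]) -> float:
        # Sum of squared deviations within a group.
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    best_i, best_cost = 1, float("inf")
    for i in range(1, len(xs)):
        cost = ssd(xs[:i]) + ssd(xs[i:])
        if cost < best_cost:
            best_i, best_cost = i, cost
    return xs[best_i - 1]  # lowest score still inside the "keep" class

scores = [0.91, 0.89, 0.87, 0.52, 0.48, 0.45]
threshold = natural_break(scores)
print([s for s in scores if s >= threshold])  # -> [0.91, 0.89, 0.87]
```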
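
The relevance classifier with strict prompting (items 20-21) passes every chunk through the LLM with instructions to judge relevance using only pre-discovery knowledge. The llm() function below is a placeholder to be replaced with a real model client:

```python
# Relevance classifier sketch: each chunk gets a yes/no relevance verdict
# from the LLM under a strict, time-restricted prompt.
STRICT_PROMPT = (
    "You are a scientist working before the discovery in question. "
    "Use ONLY knowledge available at that time. "
    "Is the following text relevant to explaining the phenomenon? "
    "Answer strictly 'yes' or 'no'.\n\nText: {chunk}"
)

def llm(prompt: str) -> str:
    # Placeholder: wire in a real completion call here.
    return "yes"

def filter_relevant(chunks: list[str]) -> list[str]:
    relevant = []
    for chunk in chunks:
        verdict = llm(STRICT_PROMPT.format(chunk=chunk)).strip().lower()
        if verdict.startswith("yes"):
            relevant.append(chunk)
    return relevant

print(filter_relevant(["Overexpression of a transgene silenced the endogenous gene."]))
```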

Source: AI Engineer via YouTube
