Introducing Extended Mind Transformers: A New Approach for Language Models

Join Phoebe, a machine learning engineer at Normal Computing, as she dives into Extended Mind Transformers, a new approach to language models that improves retrieval, enables granular causal citations, and reduces hallucinations.

  • 1. Phoebe is a machine learning engineer at Normal Computing and will be talking about Extended Mind Transformers (EMTs).
  • 2. The problem: Pre-trained language models carry broad general knowledge, but often lack the application-specific or topical information needed to be useful in practice.
  • 3. Current methods for getting this information into the model include long context and RAG (Retrieval-Augmented Generation), which solve the problem in different ways, each with its own downsides.
  • 4. Long context seeks to extend the context window of the Transformer model, but it can be expensive and may include irrelevant information in the prompt.
  • 5. RAG tries to select only the relevant context to include in the prompt, but its choices are made externally to the model and rely on a less granular representation of the data.
  • 6. Extended Mind Transformers aim to improve on these retrieval mechanisms by distinguishing between information that belongs in memory and information that should accompany the inference query in the prompt.
  • 7. EMTs rely on a small edit to the attention mechanism: the model represents the memory data as key-value pairs within each decoder layer.
  • 8. At generation time, each query token retrieves a set number of relevant memory tokens and attends to them (see the attention sketch after this list).
  • 9. The challenge is assigning position information to those retrieved tokens, but recent, softer position embeddings allow the model to generalize to them without further fine-tuning.
  • 10. EMTs have been tested with rotary position embeddings (used in Llama models) and with ALiBi linear biases, which damp down the contribution of tokens that are farther away.
  • 11. A new counterfactual retrieval benchmark was open-sourced to compare EMTs against fine-tuned models and the base Llama model with interpolated position embeddings.
  • 12. EMTs perform well on both short and long inputs, while the fine-tuned models degrade in quality on shorter inputs.
  • 13. Extended Mind Transformers can also reduce hallucinations by using uncertainty metrics to measure how confident the model is about each generated token.
  • 14. If the model is uncertain, it can regenerate that step while drawing on more information from the memories (see the regeneration sketch after this list).
  • 15. Important parameters for EMTs include the stride length and top-k, the number of memory key-value pairs each query token is allowed to retrieve and attend to.
  • 16. Regularization techniques such as similarity masking and removing unknown tokens from the memory keep the retrieved tokens from confusing the model (top-k and similarity masking both appear in the attention sketch after this list).
  • 17. EMT models are available on Hugging Face, with code on GitHub and a paper for more technical details.
  • 18. Using an EMT is as simple as passing the memories in as inputs when the model is instantiated (see the usage sketch after this list).
  • 19. EMTs improve performance on retrieval tasks, enable granular causal citations, and reduce hallucinations without fine-tuning.
  • 20. Open-source models and code are available for easy implementation.
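
For readers who want a concrete picture of the retrieval mechanism described in points 7, 8, 15, and 16, here is a minimal PyTorch sketch of extended-mind attention for a single head in one decoder layer. It is an illustration of the idea, not Normal Computing's released implementation: the tensor shapes, the cosine-similarity retrieval, and the `sim_threshold` knob are assumptions made for clarity, and position handling for the retrieved tokens is omitted.

```python
import torch
import torch.nn.functional as F


def extended_mind_attention(
    q: torch.Tensor,              # (seq_len, d) queries for the current context
    k: torch.Tensor,              # (seq_len, d) keys for the current context
    v: torch.Tensor,              # (seq_len, d) values for the current context
    mem_k: torch.Tensor,          # (mem_len, d) cached keys for the memory tokens
    mem_v: torch.Tensor,          # (mem_len, d) cached values for the memory tokens
    top_k: int = 8,               # memories each query token may retrieve
    sim_threshold: float = 0.25,  # similarity mask: drop weak retrievals
) -> torch.Tensor:
    d = q.shape[-1]
    scale = d ** -0.5

    # Cosine similarity between every query token and every memory key.
    sims = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T        # (seq, mem)

    # Each query token retrieves its own top-k memory key-value pairs ...
    top_sims, top_idx = sims.topk(top_k, dim=-1)                        # (seq, top_k)
    sel_k, sel_v = mem_k[top_idx], mem_v[top_idx]                       # (seq, top_k, d)

    # ... and a similarity mask removes retrievals below the threshold,
    # one of the regularizations mentioned in the talk.
    mem_scores = torch.einsum("sd,std->st", q, sel_k) * scale
    mem_scores = mem_scores.masked_fill(top_sims < sim_threshold, float("-inf"))

    # Standard causal self-attention over the local context.
    local_scores = (q @ k.T) * scale                                    # (seq, seq)
    causal = torch.ones_like(local_scores, dtype=torch.bool).tril()
    local_scores = local_scores.masked_fill(~causal, float("-inf"))

    # One softmax over local tokens and retrieved memory tokens together.
    weights = torch.cat([local_scores, mem_scores], dim=-1).softmax(dim=-1)

    local_out = weights[:, : k.shape[0]] @ v
    mem_out = torch.einsum("st,std->sd", weights[:, k.shape[0]:], sel_v)
    return local_out + mem_out
```

In the full model, a step like this would run in every decoder layer, with the memory keys and values computed once and cached; roughly speaking, the stride length from point 15 governs how the long memory sequence is chunked when those key-value pairs are generated.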
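
Points 13 and 14 describe an uncertainty-driven generation loop; the regeneration sketch below shows one hedged way to realize it. The `step_fn` callable, the entropy threshold, and the specific top-k values are hypothetical, and the released code may measure uncertainty and regenerate differently.

```python
from typing import Callable

import torch


def generate_with_regeneration(
    step_fn: Callable[[torch.Tensor, int], torch.Tensor],  # (input_ids, top_k) -> next-token logits
    input_ids: torch.Tensor,         # (1, prompt_len) tokenized prompt
    max_new_tokens: int = 32,
    base_top_k: int = 2,             # default memory budget per query token
    boosted_top_k: int = 8,          # larger budget used when the model is uncertain
    entropy_threshold: float = 2.5,  # nats; tune per model and task
) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = step_fn(input_ids, base_top_k)              # (vocab_size,)
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()

        # High entropy means the model is unsure about this token, so redo the
        # step while letting each query token attend to more memories.
        if entropy > entropy_threshold:
            logits = step_fn(input_ids, boosted_top_k)

        next_id = logits.argmax().view(1, 1)                 # greedy decoding for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```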
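
Finally, a usage sketch for point 18. The checkpoint name and the `external_memories` keyword below are assumptions based on the project's Hugging Face model cards, not confirmed by the talk; check the model card and the GitHub repo for the exact identifiers before running this.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "normalcomputing/extended-mind-llama-2-7b"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the documents the model should "remember".
memory_text = open("internal_docs.txt").read()
memory_ids = tokenizer(memory_text, return_tensors="pt").input_ids

# Pass the memories in when the model is instantiated; the custom modeling code
# turns them into per-layer key-value pairs that generation can retrieve from.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    external_memories=memory_ids,  # assumed keyword; see the model card
    trust_remote_code=True,
)

prompt = tokenizer("What does our travel policy say about per diems?", return_tensors="pt")
output = model.generate(prompt.input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```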

Source: AI Engineer via YouTube

❓ What do you think? Feel free to share your thoughts on the ideas in this video in the comments!