Introducing Extended Mind Transformers: A New Approach for Language Models
Join Phoebe, a machine learning engineer at Normal Computing, as she dives into the world of Extended Mind Transformers, a new type of language model that enables granular causal citations and reduces hallucinations.
- 1. Phoebe is a machine learning engineer at Normal Computing and will be talking about Extended Mind Transformers (EMTs).
- 2. The problem: pre-trained language models have broad general knowledge, but not the application-specific or topical information needed to be useful for a particular task.
- 3. Current methods for getting this information into the language model include long context and RAG (Retrieval-Augmented Generation), which solve the problem in different ways, each with its own downsides.
- 4. Long context extends the context window of the Transformer, but it is computationally expensive and can flood the prompt with irrelevant information.
- 5. RAG selects a subset of relevant context to place in the prompt, but its choices are made by a component external to the model and rely on a less granular representation of the data.
- 6. Extended Mind Transformers aim to improve retrieval mechanisms by making a distinction between things that should go in memory and things that should be included along with the inference query.
- 7. EMTs make a small edit to the attention mechanism: the external data is represented as key-value pairs within each decoder layer.
- 8. At generation time, each query token retrieves a set number of the most relevant memory tokens and attends to them (see the attention sketch after this list).
- 9. The challenge is assigning position information to those retrieved tokens, but more recent, "softer" position embeddings allow the model to generalize to them without further fine-tuning.
- 10. EMTs have been tested with rotary position embeddings (used in Llama models) and ALiBi linear biases, which attenuate attention to tokens that are farther away.
- 11. A new counterfactual retrieval benchmark was open-sourced to compare EMTs against fine-tuned models and the base Llama model with interpolated position embeddings.
- 12. EMTs perform well on both short and long inputs, while fine-tuned models degrade in quality on shorter inputs.
- 13. Extended Mind Transformers can also reduce hallucinations by using uncertainty metrics to determine how certain the model is about a generated token.
- 14. If the model is uncertain, it can regenerate that step using more information from these memories.
- 15. Important parameters for EMTs include the stride length and top-k (the number of key-value pairs, or memories, each query token is allowed to retrieve and attend to).
- 16. Regularization techniques are used to prevent confusion when retrieving memory tokens, such as similarity masking and eliminating unknown tokens from the memory.
- 17. EMT models are available on Hugging Face, with code on GitHub and a paper for more technical details.
- 18. Using EMTs is as simple as passing the memories in as inputs when the model is instantiated (see the usage sketch after this list).
- 19. EMTs improve performance on retrieval tasks, enable granular causal citations, and reduce hallucinations without fine-tuning.
- 20. Open-source models and code are available for easy implementation.
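To make the retrieval mechanism in points 7–10 and 15–16 concrete, here is a minimal, self-contained PyTorch sketch of single-head attention that also attends to retrieved memories. It is an illustration of the idea rather than Normal Computing's implementation: the function name `extended_attention`, the cosine-similarity scoring, and the `sim_threshold` value are assumptions, and the position-embedding handling discussed in the talk is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def extended_attention(q, k, v, mem_k, mem_v, top_k=4, sim_threshold=0.25):
    """Single-head attention where each query token also attends to its
    top-k most similar memory key-value pairs (illustrative only).

    q, k, v:        (seq, d) tensors for the current context
    mem_k, mem_v:   (mem, d) key-value pairs precomputed from external memories
    top_k:          number of memories each query token may retrieve
    sim_threshold:  retrieved memories scoring below this are masked out
    """
    seq, d = q.shape

    # Score every memory token against every query token (cosine similarity here).
    sims = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T        # (seq, mem)
    top_sims, top_idx = sims.topk(top_k, dim=-1)                        # (seq, top_k)

    # Gather the retrieved key-value pairs per query token.
    sel_k, sel_v = mem_k[top_idx], mem_v[top_idx]                       # (seq, top_k, d)

    # Attention logits over the local context and the retrieved memories.
    local_logits = (q @ k.T) / d**0.5                                   # (seq, seq)
    mem_logits = torch.einsum("sd,std->st", q, sel_k) / d**0.5          # (seq, top_k)

    # Similarity masking: ignore retrieved memories that are not similar enough.
    mem_logits = mem_logits.masked_fill(top_sims < sim_threshold, float("-inf"))

    # Standard causal mask for the local context.
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    local_logits = local_logits.masked_fill(causal, float("-inf"))

    # Softmax over local + memory logits, then mix the corresponding values.
    weights = torch.softmax(torch.cat([local_logits, mem_logits], dim=-1), dim=-1)
    local_w, mem_w = weights[:, :seq], weights[:, seq:]
    return local_w @ v + torch.einsum("st,std->sd", mem_w, sel_v)

# Toy usage with random tensors.
d = 16
q = k = v = torch.randn(8, d)
mem_k, mem_v = torch.randn(100, d), torch.randn(100, d)
print(extended_attention(q, k, v, mem_k, mem_v, top_k=4).shape)  # torch.Size([8, 16])
```

The top-k retrieval, the similarity mask, and the attention over the concatenated local-plus-memory keys correspond to points 8, 15, and 16 above.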
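And for points 17–20, a hedged usage sketch of loading a released checkpoint from Hugging Face and passing memories in at instantiation. The checkpoint id and the `external_memories` keyword below are assumptions, not a confirmed API; check the model cards and the GitHub repo for the exact interface.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example checkpoint id -- see Normal Computing's Hugging Face page for the
# actual released models; `external_memories` is an assumed keyword name.
checkpoint = "normalcomputing/extended-mind-llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The application-specific text the model should "remember".
memories = "The project kickoff meeting is scheduled for Tuesday at 10am in Room 4."
memory_ids = tokenizer(memories, return_tensors="pt").input_ids

# Memories are passed in at instantiation; internally they become key-value
# pairs that each decoder layer can retrieve from during generation.
# Retrieval settings such as top-k and stride are model/config options.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    external_memories=memory_ids,
    trust_remote_code=True,  # the extended attention lives in custom model code
)

prompt = "When is the project kickoff meeting?"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```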
Source: AI Engineer via YouTube
❓ What do you think about the ideas shared in this video? Feel free to share your thoughts in the comments!