Introducing Extended Mind Transformers: A New Approach for Language Models

Join Phoebe, a machine learning engineer at Normal Computing, as she dives into Extended Mind Transformers, a new approach to language models that improves retrieval, enables granular causal citations, and reduces hallucinations.

  • 1. Phoebe is a machine learning engineer at Normal Computing and will be talking about Extended Mind Transformers (EMTs).
  • 2. The problem: Pre-trained language models carry broad general knowledge, but often lack the application-specific or topical information needed to be useful in practice.
  • 3. Current methods for getting this information into the model include long context and RAG (Retrieval-Augmented Generation), which solve the problem in different ways, each with its own downsides.
  • 4. Long context seeks to extend the context window of the Transformer model, but it can be expensive and may include irrelevant information in the prompt.
  • 5. RAG tries to select only the relevant context to include in the prompt, but its choices are made externally to the model and rely on a less granular representation of the data.
  • 6. Extended Mind Transformers aim to improve on these retrieval mechanisms by distinguishing between information that belongs in memory and information that should accompany the inference query in the prompt.
  • 7. EMTs rely on a small edit to the attention mechanism: the model represents the memory data as key-value pairs within each decoder layer.
  • 8. At generation time, each query token retrieves a set number of relevant memory tokens and attends to them (see the attention sketch after this list).
  • 9. The challenge is assigning position information to those retrieved tokens, but recent, softer position embeddings allow the model to generalize to them without further fine-tuning.
  • 10. EMTs have been tested with rotary position embeddings (used in Llama models) and with ALiBi linear biases, which damp down the contribution of tokens that are farther away.
  • 11. A new counterfactual retrieval benchmark was open-sourced to compare EMTs against fine-tuned models and the base Llama model with interpolated position embeddings.
  • 12. EMTs perform well on both short and long inputs, while the fine-tuned models degrade in quality on shorter inputs.
  • 13. Extended Mind Transformers can also reduce hallucinations by using uncertainty metrics to measure how confident the model is about each generated token.
  • 14. If the model is uncertain, it can regenerate that step while drawing on more information from the memories (see the regeneration sketch after this list).
  • 15. Important parameters for EMTs include the stride length and top-k, the number of memory key-value pairs each query token is allowed to retrieve and attend to.
  • 16. Regularization techniques such as similarity masking and removing unknown tokens from the memory keep the retrieved tokens from confusing the model (top-k and similarity masking both appear in the attention sketch after this list).
  • 17. EMT models are available on Hugging Face, with code on GitHub and a paper for more technical details.
  • 18. Using an EMT is as simple as passing the memories in as inputs when the model is instantiated (see the usage sketch after this list).
  • 19. EMTs improve performance on retrieval tasks, enable granular causal citations, and reduce hallucinations without fine-tuning.
  • 20. Open-source models and code are available for easy implementation.
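
For readers who want a concrete picture of the retrieval mechanism described in points 7, 8, 15, and 16, here is a minimal PyTorch sketch of extended-mind attention for a single head in one decoder layer. It is an illustration of the idea, not Normal Computing's released implementation: the tensor shapes, the cosine-similarity retrieval, and the `sim_threshold` knob are assumptions made for clarity, and position handling for the retrieved tokens is omitted.

```python
import torch
import torch.nn.functional as F


def extended_mind_attention(
    q: torch.Tensor,              # (seq_len, d) queries for the current context
    k: torch.Tensor,              # (seq_len, d) keys for the current context
    v: torch.Tensor,              # (seq_len, d) values for the current context
    mem_k: torch.Tensor,          # (mem_len, d) cached keys for the memory tokens
    mem_v: torch.Tensor,          # (mem_len, d) cached values for the memory tokens
    top_k: int = 8,               # memories each query token may retrieve
    sim_threshold: float = 0.25,  # similarity mask: drop weak retrievals
) -> torch.Tensor:
    d = q.shape[-1]
    scale = d ** -0.5

    # Cosine similarity between every query token and every memory key.
    sims = F.normalize(q, dim=-1) @ F.normalize(mem_k, dim=-1).T        # (seq, mem)

    # Each query token retrieves its own top-k memory key-value pairs ...
    top_sims, top_idx = sims.topk(top_k, dim=-1)                        # (seq, top_k)
    sel_k, sel_v = mem_k[top_idx], mem_v[top_idx]                       # (seq, top_k, d)

    # ... and a similarity mask removes retrievals below the threshold,
    # one of the regularizations mentioned in the talk.
    mem_scores = torch.einsum("sd,std->st", q, sel_k) * scale
    mem_scores = mem_scores.masked_fill(top_sims < sim_threshold, float("-inf"))

    # Standard causal self-attention over the local context.
    local_scores = (q @ k.T) * scale                                    # (seq, seq)
    causal = torch.ones_like(local_scores, dtype=torch.bool).tril()
    local_scores = local_scores.masked_fill(~causal, float("-inf"))

    # One softmax over local tokens and retrieved memory tokens together.
    weights = torch.cat([local_scores, mem_scores], dim=-1).softmax(dim=-1)

    local_out = weights[:, : k.shape[0]] @ v
    mem_out = torch.einsum("st,std->sd", weights[:, k.shape[0]:], sel_v)
    return local_out + mem_out
```

In the full model, a step like this would run in every decoder layer, with the memory keys and values computed once and cached; roughly speaking, the stride length from point 15 governs how the long memory sequence is chunked when those key-value pairs are generated.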
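
Points 13 and 14 describe an uncertainty-driven generation loop; the regeneration sketch below shows one hedged way to realize it. The `step_fn` callable, the entropy threshold, and the specific top-k values are hypothetical, and the released code may measure uncertainty and regenerate differently.

```python
from typing import Callable

import torch


def generate_with_regeneration(
    step_fn: Callable[[torch.Tensor, int], torch.Tensor],  # (input_ids, top_k) -> next-token logits
    input_ids: torch.Tensor,         # (1, prompt_len) tokenized prompt
    max_new_tokens: int = 32,
    base_top_k: int = 2,             # default memory budget per query token
    boosted_top_k: int = 8,          # larger budget used when the model is uncertain
    entropy_threshold: float = 2.5,  # nats; tune per model and task
) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = step_fn(input_ids, base_top_k)              # (vocab_size,)
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()

        # High entropy means the model is unsure about this token, so redo the
        # step while letting each query token attend to more memories.
        if entropy > entropy_threshold:
            logits = step_fn(input_ids, boosted_top_k)

        next_id = logits.argmax().view(1, 1)                 # greedy decoding for simplicity
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids
```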
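
Finally, a usage sketch for point 18. The checkpoint name and the `external_memories` keyword below are assumptions based on the project's Hugging Face model cards, not confirmed by the talk; check the model card and the GitHub repo for the exact identifiers before running this.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "normalcomputing/extended-mind-llama-2-7b"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the documents the model should "remember".
memory_text = open("internal_docs.txt").read()
memory_ids = tokenizer(memory_text, return_tensors="pt").input_ids

# Pass the memories in when the model is instantiated; the custom modeling code
# turns them into per-layer key-value pairs that generation can retrieve from.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    external_memories=memory_ids,  # assumed keyword; see the model card
    trust_remote_code=True,
)

prompt = tokenizer("What does our travel policy say about per diems?", return_tensors="pt")
output = model.generate(prompt.input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```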

Source: AI Engineer via YouTube

❓ What do you think? Feel free to share your thoughts on the ideas in this video in the comments!