Exploring Pre-training & Post-training Strategies for LLMs: Importance of High-Quality, Efficiently Formatted Data
Join Noah and Chang as they discuss the importance of data infrastructure in developing AI models and share their experience with LanceDB, a database for multimodal AI that supports fast scans, fast random access, and indexing extensions.
- 1. Chang She is the CEO and co-founder of LanceDB and has been building tools for data science and machine learning for two decades.
- 2. Noah leads the AI data platform at Character.ai, a leading personalized AI platform.
- 3. Both companies focus on training their own foundation models and care about what the models are trained on, including storing that data in a format that supports many downstream uses.
- 4. Pre-training involves thinking broadly about domains (such as books or chat data), while post-training focuses on specific tasks and how difficult they are.
- 5. Favorite current research areas include data-efficient learning, sampling from data, measuring diversity, clean data, evaluations, dataset management, and analytics.
- 6. Data-efficient learning aims to reduce the amount of data needed for good results.
- 7. Sampling metrics and diversity measurement are important in industry and academic papers.
- 8. Clean data is always the starting point, followed by evaluations and dataset management.
- 9. Analytics help understand data collections, including token counts, length, and complexity of code data.
- 10. Reading data has been the biggest win for understanding performance and what's going on.
- 11. Using language models to improve language models includes synthetic data, quality scoring, dataset selection, and dataset augmentation.
- 12. Dataset selection matches the distribution of the available data to the desired behavior through retrieval or clustering.
- 13. Synthetic data generation can be done quickly using existing big data tools and GPU backends for prompting, embedding, and classification.
- 14. Human labeling is used to improve classifiers and rewrite synthetic data with issues.
- 15. The platform tooling aims to accelerate research by separating concerns between data materialization and training jobs.
- 16. The Lance format allows quick random access, so shuffling can operate on references rather than rows, which saves time and improves iteration speed.
- 17. AI workloads require fast scans, fast random access, and the ability to handle large binary data that streams directly into GPUs.
- 18. Existing data formats and infrastructure are good for at most two of the three properties required in AI workloads.
- 19. Lance format is a columnar file format optimized for AI, giving fast scans like Parquet and supporting fast lookups.
- 20. The Lance format is also a lightweight table format that automatically versions data, making it easy to add experimental features and roll them back later.
- 21. Indexing extensions in the Lance format allow billion-scale vector search directly from S3, reducing the need for Elasticsearch clusters.
- 22. The Lance format enables a single table to serve many different workloads, including metadata columns, time series columns, large blobs, and tensors.
- 23. LanceDB is a vector database that provides distributed vector search over billions of vectors at low latency and very high QPS.
- 24. LanceDB is designed to handle multimodal data, including features, audio waveforms, images, vectors, metadata columns, and time series columns.
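
The idea in point 16 of shuffling references rather than rows can be illustrated with a minimal sketch. It is not the speakers' implementation: the function name `shuffled_batches` and the in-memory list `rows` are hypothetical stand-ins, and the sketch only assumes a store where looking up a row by position is cheap.

```python
import random
from typing import Iterator, List, Sequence


def shuffled_batches(
    rows: Sequence[str], batch_size: int, seed: int
) -> Iterator[List[str]]:
    """Yield batches in shuffled order without moving or rewriting `rows`."""
    # Shuffle lightweight integer references, not the rows themselves.
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_ids = indices[start:start + batch_size]
        # Each lookup is a random access, so this is only cheap when the
        # underlying store supports fast point reads.
        yield [rows[i] for i in batch_ids]


# Hypothetical stand-in for a table with fast random access.
rows = [f"example-{i}" for i in range(10)]
epoch = list(shuffled_batches(rows, batch_size=4, seed=0))
```

Because only the index list is permuted, a new shuffle per epoch costs one pass over integers instead of a rewrite of the dataset; the trade-off is that every batch then depends on fast random reads from storage.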
Source: AI Engineer via YouTube