Better Data is Better Than Better Models
What does better data mean, and how can it improve the performance and efficiency of LLMs?
This year, there has been a race to create ever larger and more complex AI language models, such as OpenAI's GPT-3 and GPT-4, Google's LaMDA, and Anthropic's Claude. These models can generate remarkably human-like text and engage in natural conversations.
So why should we focus more on curating better training data? In this post, I will explain what better data means and how it can improve the performance, efficiency, and fairness of LLMs.
💿 What is better data?
Better data is data that is relevant, representative, reliable, and responsible, given the task at hand. Specifically, it should meet the following criteria (a toy filtering sketch follows the list):
- Relevant: The data should match the domain, genre, style, and purpose of the task. For example, if the task is to generate product reviews, the data should consist of product reviews from the target market and not from other domains or languages.
- Representative: The data should reflect the diversity and variability of the real-world phenomena that the task aims to model. For example, if the task is to generate text for a chatbot, the data should include different types of conversations, topics, users, and contexts.
- Reliable: The data should be accurate, consistent, and complete. For example, if the task is to translate text from one language to another, the data should have correct and consistent translations that cover all the relevant aspects of the source text.
- Responsible: The data should be ethical, legal, and respectful of the rights and interests of the data providers and users. For example, if the task is to generate text for a social media platform, the data should not contain harmful or offensive content that may violate the platform’s policies or harm its users.
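To make these criteria concrete, here is a minimal Python sketch of a filtering pass over a corpus of product reviews. Everything in it is illustrative: the keyword set, the blocklist, the length heuristic, and the exact-match deduplication are stand-ins for the trained classifiers and fuzzy matching a production pipeline would use.

```python
import re

# Illustrative word lists; a real pipeline would use trained classifiers.
DOMAIN_KEYWORDS = {"product", "review", "quality", "price"}  # relevance proxy
BLOCKLIST = {"badword1", "badword2"}                         # responsibility proxy

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))

def is_relevant(text: str) -> bool:
    """Crude relevance check: does the text mention domain vocabulary?"""
    return bool(tokens(text) & DOMAIN_KEYWORDS)

def is_reliable(text: str) -> bool:
    """Crude reliability check: drop fragments and truncated records."""
    return len(text.split()) >= 5 and text.strip().endswith((".", "!", "?"))

def is_responsible(text: str) -> bool:
    """Crude safety check: reject records containing blocklisted terms."""
    return not (tokens(text) & BLOCKLIST)

def filter_corpus(records):
    seen = set()
    for text in records:
        key = text.strip().lower()
        if key in seen:  # exact duplicates skew representativeness
            continue
        seen.add(key)
        if is_relevant(text) and is_reliable(text) and is_responsible(text):
            yield text

corpus = [
    "Great product, the price matches the quality.",
    "Great product, the price matches the quality.",  # duplicate, dropped
    "asdf",                                           # fragment, dropped
]
print(list(filter_corpus(corpus)))  # keeps only the first review
```

Representativeness is the hardest criterion to automate; in practice it usually means auditing the distribution of topics, dialects, and user groups across the whole corpus rather than filtering individual records.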
📈 Why is better data more important?
1. Models are only as good as their training data. No matter how sophisticated the architecture is, if the model is trained on low-quality, biased data, it will reflect those flaws. We’ve seen issues like racist and sexist outputs from large models trained on unfiltered web data. Better-vetted datasets guard against this.
2. Curating data well is difficult. Removing toxicity and bias from large datasets requires substantial human effort and care, and doing it well needs to be a priority over simply scraping more data from the web. High-quality training datasets will pay dividends. (One small piece of this work, near-duplicate removal, is sketched after this list.)
3. Personalized data matters. For tasks like conversing naturally, models benefit from being trained on data relevant to the user’s interests and style. A model trained on carefully filtered data from an individual will connect better than a giant model trained on generic data.
4. Targeted data beats generic data. Even for general skills like reasoning and common sense, models do better with datasets carefully designed for the task instead of unstructured web data. Better data eliminates the need for models to learn everything.
5. Data, not models, is the scarce resource. Thanks to cheap computing power, we can scale up models easily now. However, compiling and cleaning large datasets still requires human time, effort, and care. We should focus resources on what is hardest to obtain in AI: diverse, high-quality training data.
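As a taste of what point 2 involves, here is a minimal sketch of near-duplicate removal using word shingles and Jaccard similarity. The three-word shingle size and the 0.8 threshold are illustrative choices, not recommendations; real pipelines process billions of documents and use MinHash/LSH indexing instead of this O(n²) pairwise comparison.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(records, threshold: float = 0.8):
    """Keep a record only if it is not too similar to any kept record."""
    kept, kept_shingles = [], []
    for text in records:
        s = shingles(text)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept

docs = [
    "the cat sat on the mat today",
    "the cat sat on the mat today .",  # near duplicate, dropped
    "training data quality matters more than model size",
]
print(drop_near_duplicates(docs))  # keeps the first and third documents
```

Even this toy version shows why curation is labor-intensive: every heuristic (shingle size, threshold, normalization) has to be tuned and audited by humans against the actual corpus.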
⚙️ How can better data improve LLMs?
- Performance: Better data can help LLMs achieve higher accuracy, fluency, coherence, and diversity in their outputs. Better data can also help LLMs avoid generating incorrect, irrelevant, or inappropriate outputs that may harm their credibility or usability.
- Efficiency: Better data can help LLMs reduce their training time and cost by requiring less data and fewer parameters. Better data can also help LLMs optimize their inference speed and memory by enabling more effective pruning, quantization, or distillation techniques (one of these, dynamic quantization, is sketched after this list).
- Fairness: Better data can help LLMs mitigate their biases and ensure their outputs are fair and inclusive for all groups of users. Better data can also help LLMs respect the privacy and consent of the data providers and users by avoiding exposing or exploiting their personal or sensitive information.
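To illustrate the efficiency point, here is a hedged sketch of one of those techniques: post-training dynamic quantization in PyTorch. The two-layer model is a stand-in for a trained LLM checkpoint; the API call itself (`torch.quantization.quantize_dynamic`) is real, but the speed and memory gains you actually see depend heavily on the model and hardware.

```python
import torch
import torch.nn as nn

# Stand-in model; in practice this would be a trained LLM checkpoint.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization stores Linear weights as int8 and quantizes
# activations on the fly at inference time, shrinking memory use.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller weights
```

The connection to data is indirect but real: a smaller model trained on well-curated data has less redundancy to begin with, which is what makes compression techniques like this one viable without large quality losses.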
Overall, training data is the real bottleneck for better AI, not model scale. Companies and researchers should prioritize building rich datasets over designing gigantic models. With carefully curated data, even smaller models can outperform the giants. For safe, ethical, and useful AI, better data is essential.