Data Curation Best Practices: Building High-Quality Training Datasets
In my previous post on training data ethics, I discussed the "why" of responsible data practices—the principles that should guide our decisions about what data to use and how to source it. Today I want to focus on the "how"—the practical methodologies and workflows that transform ethical commitments into actual datasets.
Data curation is often the least glamorous part of machine learning. It's painstaking, detail-oriented work that lacks the intellectual excitement of novel architectures or the visceral thrill of breakthrough results. Yet I'd argue it's the most important determinant of whether AI systems work reliably in the real world. As the saying goes: garbage in, garbage out. More precisely: low-quality data in, unpredictable behavior out.
At Drane Labs, we've developed systematic approaches to data curation that balance quality with practical constraints. Let me share what we've learned.
Defining Dataset Requirements
The first step in data curation is understanding exactly what you need—not just in broad strokes ("a dataset for object detection") but with precise specifications. Before we begin curating any dataset, we create a detailed requirements document covering:
Target task and evaluation criteria: What will models trained on this data actually do? How will we measure success? The clearer we are about the downstream task, the better we can curate relevant data.
Required diversity axes: What dimensions of variation must the dataset capture? For a vision dataset, this might include lighting conditions, camera angles, object orientations, background contexts, and occlusion levels. For text, it might include genres, formality levels, domains, and linguistic variations.
Minimum representation thresholds: For each important category or subpopulation, what's the minimum number of examples needed? This prevents scenarios where common categories dominate while rare but important categories go underrepresented.
Exclusion criteria: What should definitively not be in the dataset? This includes out-of-scope content, problematic examples, and edge cases that would confuse rather than help training.
Quality standards: What defines a "good" example? What are acceptable noise levels, annotation quality thresholds, and resolution requirements?
This requirements document becomes the north star for all curation decisions. When we're unsure whether to include a particular data source or example, we refer back to these requirements.
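To make a requirements document like this machine-checkable, it can help to encode it as a structured spec that curation scripts read directly. The sketch below is purely illustrative: the field names, categories, and thresholds are hypothetical examples, not a description of our actual tooling.

```python
from dataclasses import dataclass

@dataclass
class DatasetRequirements:
    """Illustrative requirements spec; every field and value here is hypothetical."""
    target_task: str
    evaluation_metrics: list[str]
    diversity_axes: list[str]
    min_representation: dict[str, int]   # minimum example count per category
    exclusion_criteria: list[str]        # content that must not appear
    quality_standards: dict[str, float]  # e.g. resolution or agreement thresholds

requirements = DatasetRequirements(
    target_task="object detection for warehouse robotics",
    evaluation_metrics=["mAP@0.5", "per-class recall"],
    diversity_axes=["lighting", "camera_angle", "occlusion_level"],
    min_representation={"forklift": 2_000, "pallet": 2_000, "person": 5_000},
    exclusion_criteria=["images with identifiable faces", "renders missing scene metadata"],
    quality_standards={"min_resolution_px": 640, "min_annotator_agreement": 0.8},
)
```

A spec in this form can drive automated gates, for example failing a curation run if any category falls below its minimum count.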
Sourcing Strategies
With clear requirements established, the next question is where to find data. We use several complementary sourcing strategies:
Purpose-built collection: For high-stakes applications, we invest in collecting data specifically for model training. This might involve hiring domain experts to generate examples, creating controlled data collection environments, or partnering with organizations that can provide in-domain data. This is expensive but provides maximum control over quality and representation.
Licensed datasets: We purchase or license datasets from reputable providers. This requires careful due diligence—verifying the provider's data collection practices, understanding any usage restrictions, and assessing quality. But it provides access to specialized data we couldn't easily collect ourselves.
Public dataset adaptation: Many high-quality public datasets exist for common tasks. We adapt these by augmenting them with additional examples addressing identified gaps, updating outdated annotations, or combining multiple complementary datasets. This balances cost efficiency with quality control.
Synthetic data generation: For certain applications, we generate synthetic training data using physics simulators, graphics engines, or generative models. This allows precise control over data distributions and coverage of rare scenarios. The challenge is ensuring synthetic data accurately reflects real-world complexity.
User-contributed data: For products already in deployment, we sometimes incorporate user data (with appropriate consent and privacy protections). This provides in-distribution examples reflecting actual usage patterns. However, it risks feedback loops where models amplify existing biases.
Different projects require different mixes of these strategies. A medical imaging model might rely heavily on licensed clinical data, while a robotics project might use mostly synthetic simulation data supplemented with targeted real-world collection.
Quality Control Pipelines
Once we've sourced candidate data, rigorous quality control ensures only suitable examples enter the training set. Our QC pipeline includes multiple layers:
Automated filtering: Programmatic checks catch obvious issues—corrupt files, extreme outliers, duplicates, examples violating format requirements. This filters out clear defects efficiently.
Statistical profiling: We generate statistical summaries of the candidate data—label distributions, feature statistics, correlation analyses. We then compare these against the requirements to identify imbalances or anomalies.
Expert review sampling: Subject matter experts manually review random samples from each data source. They assess quality, identify annotation errors, and flag examples that seem problematic despite passing automated checks. Sample size depends on data volume and stakes—we review higher proportions for critical applications.
Cross-annotation verification: For labeled data, we use multiple independent annotators for random subsets and measure inter-annotator agreement. Low agreement indicates ambiguous examples or unclear annotation guidelines—both problems requiring resolution.
Bias auditing: We systematically test for problematic biases. For demographic attributes, we check whether outcome distributions differ across groups in ways that could lead to discriminatory behavior. For contextual biases, we look for spurious correlations the model might exploit.
Adversarial probing: We red team the dataset, deliberately hunting for adversarial examples, annotation errors, or problematic content that slipped past the earlier filters. This adversarial perspective catches issues that well-intentioned reviewers miss.
Data that passes all these checks enters our curated dataset. Data that fails is either corrected (for fixable issues like annotation errors) or discarded.
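To make the automated-filtering layer concrete, here is a minimal sketch of corrupt-file screening, resolution checks, and exact-duplicate removal for an image dataset. It assumes Pillow is installed, and the threshold and hash-based approach are illustrative choices rather than a description of our production pipeline.

```python
import hashlib
from pathlib import Path

from PIL import Image  # assumes the Pillow package is available

def is_readable_image(path: Path) -> bool:
    """Reject files that cannot be opened or fail Pillow's integrity check."""
    try:
        with Image.open(path) as img:
            img.verify()  # raises on truncated or corrupt files
        return True
    except Exception:
        return False

def automated_filter(paths: list[Path], min_side: int = 640) -> list[Path]:
    """Keep images that are readable, meet a minimum resolution, and are not exact duplicates."""
    seen_hashes: set[str] = set()
    kept: list[Path] = []
    for path in paths:
        if not is_readable_image(path):
            continue  # corrupt or unreadable file
        with Image.open(path) as img:
            if min(img.size) < min_side:
                continue  # below the resolution requirement
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen_hashes:
            continue  # byte-identical duplicate of an earlier file
        seen_hashes.add(digest)
        kept.append(path)
    return kept
```

Content hashing only catches byte-identical copies; near-duplicate detection, for instance via perceptual hashing, is a natural next layer on top of this.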
Annotation Guidelines and Consistency
For supervised learning tasks, annotation quality is paramount. Ambiguous or inconsistent labels undermine training. We invest heavily in developing clear annotation guidelines and ensuring annotator consistency.
Our annotation process follows several principles:
Explicit guidelines: Written documentation that describes exactly how to label each type of example. This includes detailed definitions, example cases, decision trees for ambiguous scenarios, and common pitfalls to avoid.
Calibration training: Before annotating real data, annotators complete training exercises with known ground truth. They receive feedback on errors until they demonstrate consistency with our standards.
Ongoing consistency monitoring: Throughout the annotation process, we regularly measure inter-annotator agreement and individual annotator consistency. Declining metrics trigger retraining or discussions to realign understanding.
Annotator feedback loops: Annotators can flag ambiguous cases or propose guideline clarifications. This bidirectional communication improves guidelines over time and keeps annotators engaged.
Hierarchical review: Complex or ambiguous examples escalate to senior annotators or domain experts for adjudication. This prevents difficult cases from being labeled arbitrarily.
Annotation metadata: We preserve metadata about who annotated each example, when, their confidence level, and any flags or notes. This enables downstream analysis if quality issues emerge.
For some applications, we embrace annotation disagreement as signal rather than noise. If multiple qualified annotators disagree about a label, that reveals genuine ambiguity that models should respect rather than forcing a single canonical answer.
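To make the consistency monitoring concrete, here is a minimal sketch of Cohen's kappa for two annotators labeling the same examples. The labels are invented; for more than two annotators, or when some examples are unlabeled, a statistic such as Krippendorff's alpha is a common alternative.

```python
from collections import Counter

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b), "annotators must label the same examples"
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators independently pick the same label.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())
    if expected == 1.0:
        return 1.0  # degenerate case: only one label ever used
    return (observed - expected) / (1 - expected)

# Hypothetical example: two annotators labeling ten sentiment examples.
ann_a = ["pos", "pos", "neg", "neu", "pos", "neg", "neg", "pos", "neu", "pos"]
ann_b = ["pos", "neg", "neg", "neu", "pos", "neg", "pos", "pos", "neu", "pos"]
print(f"kappa = {cohen_kappa(ann_a, ann_b):.2f}")  # prints 0.68 for this toy data
```

Teams often set a floor below which guidelines get revisited (a kappa around 0.6 is a commonly cited rough threshold), but the right bar depends on the task and the stakes.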
Documentation and Versioning
Comprehensive documentation is essential for dataset usability and reproducibility. Every dataset we curate receives extensive documentation covering:
Provenance: Where did the data come from? How was it collected? What preprocessing was applied? This enables users to assess whether the dataset fits their needs and understand any inherent limitations or biases.
Composition: What categories, distributions, and statistical properties define the dataset? What's the breakdown across important dimensions? This helps users understand coverage and identify gaps.
Intended use: What tasks is this dataset designed for? What are appropriate use cases? This sets expectations and prevents misuse.
Known limitations: What scenarios, populations, or contexts are underrepresented or absent? Where should this dataset not be used? Explicit limitation documentation prevents overconfident deployment.
Annotation methodology: How were labels generated? What guidelines were used? What quality control was applied? This enables users to assess label reliability.
Ethical considerations: Were there special ethical considerations in collection or annotation? Are there recommended safeguards for using this data? This transparency builds trust.
We use semantic versioning for datasets, treating them like software. Minor versions indicate additions or corrections that maintain compatibility. Major versions indicate structural changes that might affect existing pipelines. This versioning discipline prevents confusion and enables reproducibility.
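One lightweight way to keep this documentation attached to the data is a machine-readable dataset card shipped with every release. The structure below is a hypothetical sketch; the exact fields and values are illustrative, not a schema we prescribe.

```python
# Illustrative dataset card, stored alongside the data (e.g., as dataset_card.json).
# All names and numbers are made up for the example.
dataset_card = {
    "name": "warehouse-detection",
    "version": "2.3.1",  # semantic versioning: major.minor.patch
    "provenance": {
        "sources": ["licensed industrial imagery", "purpose-built on-site collection"],
        "preprocessing": ["resized to 1024px max side", "EXIF metadata stripped"],
    },
    "composition": {"num_examples": 184_000, "classes": {"forklift": 21_000, "pallet": 43_000}},
    "intended_use": "object detection in indoor warehouse scenes",
    "known_limitations": ["outdoor scenes absent", "low-light coverage limited"],
    "annotation_methodology": "two-pass annotation, guidelines v4, audited agreement >= 0.8",
    "ethical_considerations": "worker imagery collected with consent; no identifiable faces retained",
}
```

Because the card travels with the data and is versioned with it, downstream users always see the documentation that matches the release they are actually training on.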
Maintenance and Living Datasets
Datasets aren't static artifacts—they require ongoing maintenance. Distribution drift, discovered errors, new edge cases, evolving standards—all these factors mean that yesterday's high-quality dataset might be insufficient tomorrow.
Our approach treats important datasets as "living" resources that evolve over time:
Error correction: When errors are discovered—through model failures, user reports, or periodic audits—we fix them and release updated versions. We maintain public changelogs so users understand what changed.
Expansion: As we identify gaps or underrepresented scenarios, we expand datasets with targeted additional collection. This is more efficient than building new datasets from scratch.
Deprecation: When datasets become outdated or problematic, we formally deprecate them with clear guidance about replacements. We don't simply remove them (which breaks reproducibility) but actively discourage new use.
Community feedback: We solicit feedback from teams using our datasets. They encounter edge cases and limitations we didn't anticipate. This feedback drives prioritization for maintenance efforts.
Regular audits: Scheduled reviews ensure datasets remain aligned with our quality standards and ethical guidelines. Standards evolve, and datasets must evolve with them.
This maintenance work requires ongoing resource allocation—it's a recurring investment, not a one-time cost. But it's essential for keeping dataset quality high over time.
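To keep deprecation and version changes from being silently ignored, data loaders can read the dataset card sketched in the previous section and complain at load time. The guard below is a sketch under that assumption; the "deprecated" and "replacement" fields are hypothetical additions to the card.

```python
import warnings

def check_dataset_release(card: dict, expected_major: int) -> None:
    """Illustrative load-time guard: warn on deprecated releases, fail on incompatible ones."""
    major = int(card["version"].split(".")[0])
    if card.get("deprecated", False):
        warnings.warn(
            f"{card['name']} v{card['version']} is deprecated; "
            f"see {card.get('replacement', 'the dataset changelog')} for the recommended successor."
        )
    if major != expected_major:
        raise RuntimeError(
            f"{card['name']} is at major version {major}, but this pipeline expects "
            f"{expected_major}; structural changes may break existing preprocessing."
        )
```

Failing fast on a major-version mismatch is a deliberate design choice: a little friction at load time is cheaper than silently training against a structurally different dataset.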
Balancing Quality and Scale
A frequent tension in data curation is quality versus quantity. Given fixed resources, should you curate a smaller high-quality dataset or a larger noisy dataset?
The answer depends on context, but general principles guide our decisions:
For critical applications, prioritize quality: In high-stakes domains like medical diagnosis or safety systems, errors have serious consequences. Here we invest in smaller, meticulously curated datasets rather than maximizing volume.
For data-hungry models, scale matters: Large language models and foundation models often benefit from scale even with noisier data. For these applications, we use aggressive automated filtering to maximize clean data volume while accepting some noise.
For rare scenarios, coverage matters: When dealing with long-tail phenomena or rare edge cases, getting any examples at all is more important than perfection. We relax quality thresholds to ensure adequate representation.
For prototype development, iterate quickly: Early-stage projects benefit from rapid iteration over perfection. We curate smaller provisional datasets enabling fast experimentation, then invest in quality as projects mature.
For ongoing systems, compound quality: Once systems are deployed, incremental quality improvements compound over time. We systematically improve datasets rather than pursuing one-time perfect curation.
These tradeoffs require judgment. We make them explicitly, document our reasoning, and remain willing to revisit decisions as projects evolve.
The Human Element
Finally, a reminder that data curation is fundamentally human work. Algorithms can assist, but judgment, creativity, and ethical reasoning come from people. At Drane Labs, we invest in our data curation teams—hiring experienced specialists, providing ongoing training, building supportive processes, and treating this work as the critical technical function it is.
Good data curation requires a blend of skills: domain expertise to assess relevance and quality, statistical literacy to evaluate distributions and biases, engineering rigor to build robust pipelines, and ethical judgment to navigate complex tradeoffs. Building teams with these diverse capabilities is essential.
It also requires organizational culture that values dataset quality. In many ML organizations, glory goes to model architectures and training innovations while data work is undervalued. We push back against that culture. We celebrate great datasets. We recognize excellent curation work. We treat data quality as a first-class engineering concern.
Conclusion
Data curation isn't glamorous. It's meticulous, detail-oriented work that requires patience, rigor, and care. But it's the foundation of everything else in machine learning. Models are only as good as the data they're trained on. Systems are only as reliable as the datasets underlying them.
At Drane Labs, we believe that investing in data curation—building robust methodologies, comprehensive documentation, rigorous quality control, and ongoing maintenance—is the most important thing we do. Everything else follows from this foundation.
I hope sharing our practices helps other organizations develop their own approaches to data curation. The specific techniques will vary by domain and application, but the underlying principles remain: be systematic, prioritize quality, document thoroughly, maintain continuously, and remember that data curation is a human endeavor requiring judgment and care.
The ML community has made enormous progress on models and algorithms. Let's match that progress with equally sophisticated approaches to the datasets that make those models work.
Priya Sandoval is Principal Research Scientist at Drane Labs, where she leads the AI Ethics and Safety team. She holds a PhD in Computational Ethics from MIT and has published extensively on algorithmic fairness, responsible AI development, and training data quality.