Summary of "How is data prepared for machine learning?"
Overview / motivation
- Real-world failure example: Amazon’s ML recruiting tool (reported by Reuters in 2018) was shut down after it learned to penalize resumes mentioning “women’s…” because the training data (10 years of resumes) was heavily male-dominated — a classic case of a faulty, biased dataset causing unfair outcomes.
- Key message: ML models learn from the data they’re given. Data quality, representativeness, and preparation determine success; bad data => bad models.
“Garbage in, garbage out”: models reflect the quality and biases of their training data.
High-level ML data-preparation workflow
Planning & problem formulation
- Define the business problem to solve with ML before collecting and preparing data.
Data collection — size & examples
- No one-size-fits-all for dataset size: collect as much relevant data as possible.
- Examples:
- Gmail smart-reply trained on ~238 million messages.
- Google Translate used very large corpora (trillions of examples across languages).
- Academic example: a Tamkang University professor used ~630 samples to predict concrete compressive strength.
- Dataset size depends on task complexity and chosen algorithms.
Quality & adequacy of data
- Accuracy depends on correctness and domain relevance of data.
- Domain-mismatch example: using Canadian Thanksgiving sales to predict U.S. Thanksgiving turkey demand would be inadequate.
Labeling & features (supervised learning)
- Labeling: provide “correct answers” so the model can learn (manual labeling or use existing labels).
- Features: measurable characteristics describing examples (e.g., for an apple: shape, color, texture).
- Common problem: mislabeled samples (e.g., peaches labeled as apples) — mitigate with cross-checking or multiple labelers.
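The cross-checking idea above can be sketched as a simple majority vote across multiple labelers; this is a minimal illustration (the function name and tie-handling policy are assumptions, not from the source):

```python
from collections import Counter

def majority_label(labels):
    """Resolve one example's label by majority vote across labelers.

    Returns (label, agreed); agreed is False on a tie, flagging the
    example for manual review instead of guessing.
    """
    counts = Counter(labels).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return counts[0][0], False  # tie: send to a human reviewer
    return counts[0][0], True

# Three labelers disagree on one fruit image:
label, agreed = majority_label(["apple", "apple", "peach"])
# label == "apple", agreed == True
```

Real labeling pipelines also track per-labeler accuracy, but even this basic vote catches isolated mistakes like a stray "peach" label on an apple.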
Data reduction & cleansing
- Dimensionality reduction: remove irrelevant, nearly-constant, or redundant features (e.g., drop country if all rows are US; drop year-of-birth if age is present).
- Sampling: use subsets to speed prototyping and to rebalance classes.
- Cleaning: fix or impute missing values (fill blanks with constants or predicted values) and remove corrupted or inaccurate records.
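The two cleansing steps above (imputing blanks, dropping constant columns) can be sketched in plain Python; the `clean` function and its row format are illustrative assumptions:

```python
from statistics import mean

def clean(rows):
    """Impute missing numeric values with the column mean and drop
    columns whose value never varies (near-constant features)."""
    cols = list(rows[0].keys())
    cleaned = [dict(r) for r in rows]
    for col in cols:
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = mean(observed)  # assumes numeric columns
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    # Drop columns that take a single value across all rows.
    keep = [c for c in cols if len({r[c] for r in cleaned}) > 1]
    return [{c: r[c] for c in keep} for r in cleaned]

rows = [
    {"age": 30, "country": 1},   # "country" coded identically everywhere
    {"age": None, "country": 1},
    {"age": 50, "country": 1},
]
cleaned = clean(rows)
# The missing age becomes mean(30, 50) = 40; constant "country" is dropped.
```

In practice a library such as pandas or scikit-learn would handle this, but the logic is the same: fill gaps from observed values, then discard features that carry no signal.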
Data wrangling & normalization
- Formatting: standardize file formats and categorical naming (e.g., “Florida” vs “FL”).
- Normalization/scaling: unify numeric feature scales so features with larger numeric ranges don’t dominate (example: min–max normalization maps values to 0–1).
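Both wrangling steps can be shown in a few lines; the `STATE_CODES` mapping and feature values are made-up examples, and the scaler implements standard min–max normalization:

```python
def min_max_scale(values):
    """Map numeric values linearly onto [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

# Standardize categorical spellings, then scale a numeric feature.
STATE_CODES = {"Florida": "FL", "FL": "FL", "Texas": "TX", "TX": "TX"}

states = ["Florida", "FL", "Texas"]
states = [STATE_CODES[s] for s in states]   # ['FL', 'FL', 'TX']

incomes = [30_000, 55_000, 80_000]
scaled = min_max_scale(incomes)             # [0.0, 0.5, 1.0]
```

After scaling, a feature measured in tens of thousands no longer dominates one measured in single digits during distance-based or gradient-based training.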
Feature engineering
- Create new features from raw data to expose predictive signals (e.g., split datetime into day/month and hour to capture seasonality/time-of-day effects).
- Thoughtful feature design can substantially improve model performance.
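The datetime-splitting example above might look like this in practice; the exact feature set (month, day, hour, weekend flag) is an illustrative choice, not prescribed by the source:

```python
from datetime import datetime

def datetime_features(ts):
    """Split a raw timestamp into features that expose seasonality
    (month) and time-of-day (hour) effects."""
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday=5, Sunday=6
    }

feats = datetime_features(datetime(2023, 11, 25, 14, 30))  # a Saturday
# {'month': 11, 'day': 25, 'hour': 14, 'is_weekend': True}
```

A raw timestamp is nearly useless to most models, but these derived columns let a model learn, say, that turkey sales spike in November or that traffic peaks at rush hour.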
Practical notes & risks
- Class imbalance should be addressed (resampling, stratified sampling) to avoid biased models.
- Use cross-labeling and quality checks to reduce mislabeled data.
- Data preparation can consume up to ~80% of a data science project’s time.
- There are no flawless datasets; aim to reduce bias, improve representativeness, and clean/transform data to match the problem.
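One of the resampling strategies mentioned above, random oversampling of minority classes, can be sketched as follows (the function name, data, and fixed-seed choice are assumptions for illustration):

```python
import random

def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    matches the size of the largest class (simple random oversampling)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

X = [[0.1], [0.2], [0.3], [0.9]]
y = ["ham", "ham", "ham", "spam"]
Xb, yb = oversample_minority(X, y)
# yb now contains 3 "ham" and 3 "spam" labels
```

Undersampling the majority class or stratified sampling are alternatives; oversampling keeps all original data but risks overfitting to duplicated minority examples.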
Resources referenced / examples
- Amazon’s ML recruiting tool (biased outcomes due to historical resume data) — Reuters report.
- Google examples:
- Gmail smart-reply (~238 million messages)
- Google Translate (very large corpora)
- reCAPTCHA (used for label collection)
- Academic example: Tamkang University professor (used ~630 samples for a concrete strength prediction task)
Main speakers / sources
- Video narrator (overview and tutorial-style guide)
- Reuters (source for the Amazon recruiting tool story)
- Google (Gmail smart-reply, Translate, reCAPTCHA examples)
- Tamkang University professor (academic dataset example)
Note: the subtitles were auto-generated; some names/phrases may be slightly inaccurate.