Summary of "How is data prepared for machine learning?"
Overview / motivation
- Real-world failure example: Amazon’s ML recruiting tool (reported by Reuters in 2018) was shut down after it learned to penalize resumes mentioning “women’s…” because the training data (10 years of resumes) was heavily male-dominated — a classic case of a faulty, biased dataset causing unfair outcomes.
- Key message: ML models learn from the data they’re given. Data quality, representativeness, and preparation determine success; bad data => bad models.
“Garbage in, garbage out”: models reflect the quality and biases of their training data.
High-level ML data-preparation workflow
Planning & problem formulation
- Define the business problem to solve with ML before collecting and preparing data.
Data collection — size & examples
- No one-size-fits-all for dataset size: collect as much relevant data as possible.
- Examples:
- Gmail smart-reply trained on ~238 million messages.
- Google Translate used very large corpora (trillions of examples across languages).
- Academic example: a Tamkang University professor used ~630 samples to predict concrete compressive strength.
- Dataset size depends on task complexity and chosen algorithms.
Quality & adequacy of data
- Accuracy depends on correctness and domain relevance of data.
- Domain-mismatch example: using Canadian Thanksgiving sales to predict U.S. Thanksgiving turkey demand would be inadequate.
Labeling & features (supervised learning)
- Labeling: provide “correct answers” so the model can learn (manual labeling or use existing labels).
- Features: measurable characteristics describing examples (e.g., for an apple: shape, color, texture).
- Common problem: mislabeled samples (e.g., peaches labeled as apples) — mitigate with cross-checking or multiple labelers.
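The cross-checking idea above can be sketched as a simple majority vote across multiple labelers; this is a minimal illustration (the function name and tie-handling policy are assumptions, not from the source):

```python
from collections import Counter

def majority_label(labels):
    """Resolve one example's label by majority vote across labelers.

    Returns (label, agreed); agreed is False on a tie, flagging the
    example for manual review instead of guessing.
    """
    counts = Counter(labels).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return counts[0][0], False  # tie: send to a human reviewer
    return counts[0][0], True

# Three labelers disagree on one fruit image:
label, agreed = majority_label(["apple", "apple", "peach"])
# label == "apple", agreed == True
```

Real labeling pipelines also track per-labeler accuracy, but even this basic vote catches isolated mistakes like a stray "peach" label on an apple.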
Data reduction & cleansing
- Dimensionality reduction: remove irrelevant, nearly-constant, or redundant features (e.g., drop country if all rows are US; drop year-of-birth if age is present).
- Sampling: use subsets to speed prototyping and to rebalance classes.
- Cleaning: fix or impute missing values (fill blanks with constants or predicted values) and remove corrupted or inaccurate records.
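The two cleansing steps above (imputing blanks, dropping constant columns) can be sketched in plain Python; the `clean` function and its row format are illustrative assumptions:

```python
from statistics import mean

def clean(rows):
    """Impute missing numeric values with the column mean and drop
    columns whose value never varies (near-constant features)."""
    cols = list(rows[0].keys())
    cleaned = [dict(r) for r in rows]
    for col in cols:
        observed = [r[col] for r in cleaned if r[col] is not None]
        fill = mean(observed)  # assumes numeric columns
        for r in cleaned:
            if r[col] is None:
                r[col] = fill
    # Drop columns that take a single value across all rows.
    keep = [c for c in cols if len({r[c] for r in cleaned}) > 1]
    return [{c: r[c] for c in keep} for r in cleaned]

rows = [
    {"age": 30, "country": 1},   # "country" coded identically everywhere
    {"age": None, "country": 1},
    {"age": 50, "country": 1},
]
cleaned = clean(rows)
# The missing age becomes mean(30, 50) = 40; constant "country" is dropped.
```

In practice a library such as pandas or scikit-learn would handle this, but the logic is the same: fill gaps from observed values, then discard features that carry no signal.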
Data wrangling & normalization
- Formatting: standardize file formats and categorical naming (e.g., “Florida” vs “FL”).
- Normalization/scaling: unify numeric feature scales so features with larger numeric ranges don’t dominate (example: min–max normalization maps values to 0–1).
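Both wrangling steps can be shown in a few lines; the `STATE_CODES` mapping and feature values are made-up examples, and the scaler implements standard min–max normalization:

```python
def min_max_scale(values):
    """Map numeric values linearly onto [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

# Standardize categorical spellings, then scale a numeric feature.
STATE_CODES = {"Florida": "FL", "FL": "FL", "Texas": "TX", "TX": "TX"}

states = ["Florida", "FL", "Texas"]
states = [STATE_CODES[s] for s in states]   # ['FL', 'FL', 'TX']

incomes = [30_000, 55_000, 80_000]
scaled = min_max_scale(incomes)             # [0.0, 0.5, 1.0]
```

After scaling, a feature measured in tens of thousands no longer dominates one measured in single digits during distance-based or gradient-based training.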
Feature engineering
- Create new features from raw data to expose predictive signals (e.g., split datetime into day/month and hour to capture seasonality/time-of-day effects).
- Thoughtful feature design can substantially improve model performance.
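The datetime-splitting example above might look like this in practice; the exact feature set (month, day, hour, weekend flag) is an illustrative choice, not prescribed by the source:

```python
from datetime import datetime

def datetime_features(ts):
    """Split a raw timestamp into features that expose seasonality
    (month) and time-of-day (hour) effects."""
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,  # Saturday=5, Sunday=6
    }

feats = datetime_features(datetime(2023, 11, 25, 14, 30))  # a Saturday
# {'month': 11, 'day': 25, 'hour': 14, 'is_weekend': True}
```

A raw timestamp is nearly useless to most models, but these derived columns let a model learn, say, that turkey sales spike in November or that traffic peaks at rush hour.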
Practical notes & risks
- Class imbalance should be addressed (resampling, stratified sampling) to avoid biased models.
- Use cross-labeling and quality checks to reduce mislabeled data.
- Data preparation can consume up to ~80% of a data science project’s time.
- There are no flawless datasets; aim to reduce bias, improve representativeness, and clean/transform data to match the problem.
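One of the resampling strategies mentioned above, random oversampling of minority classes, can be sketched as follows (the function name, data, and fixed-seed choice are assumptions for illustration):

```python
import random

def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until every class
    matches the size of the largest class (simple random oversampling)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y

X = [[0.1], [0.2], [0.3], [0.9]]
y = ["ham", "ham", "ham", "spam"]
Xb, yb = oversample_minority(X, y)
# yb now contains 3 "ham" and 3 "spam" labels
```

Undersampling the majority class or stratified sampling are alternatives; oversampling keeps all original data but risks overfitting to duplicated minority examples.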
Resources referenced / examples
- Amazon’s ML recruiting tool (biased outcomes due to historical resume data) — Reuters report.
- Google examples:
- Gmail smart-reply (~238 million messages)
- Google Translate (very large corpora)
- reCAPTCHA (used for label collection)
- Academic example: Tamkang University professor (used ~630 samples for a concrete strength prediction task)
Main speakers / sources
- Video narrator (overview and tutorial-style guide)
- Reuters (source for the Amazon recruiting tool story)
- Google (Gmail smart-reply, Translate, reCAPTCHA examples)
- Tamkang University professor (academic dataset example)
Note: the subtitles were auto-generated; some names/phrases may be slightly inaccurate.