Summary of "How is data prepared for machine learning?"


Overview / motivation

“Garbage in, garbage out”: models reflect the quality and biases of their training data.


High-level ML data-preparation workflow

  1. Planning & problem formulation

    • Define the business problem to solve with ML before collecting and preparing data.
  2. Data collection — size & examples

    • No one-size-fits-all for dataset size: collect as much relevant data as possible.
    • Examples:
      • Gmail smart-reply trained on ~238 million messages.
      • Google Translate used very large corpora (trillions of examples across languages).
      • Academic example: a Tamkang University professor used ~630 samples to predict concrete compressive strength.
    • Dataset size depends on task complexity and chosen algorithms.
  3. Quality & adequacy of data

    • Model accuracy depends on the correctness and domain relevance of the training data.
    • Domain-mismatch example: using Canadian Thanksgiving sales to predict U.S. Thanksgiving turkey demand would be inadequate.
  4. Labeling & features (supervised learning)

    • Labeling: provide “correct answers” so the model can learn (label manually or reuse existing labels).
    • Features: measurable characteristics describing examples (e.g., for an apple: shape, color, texture).
    • Common problem: mislabeled samples (e.g., peaches labeled as apples) — mitigate with cross-checking or multiple labelers.
  5. Data reduction & cleansing

    • Dimensionality reduction: remove irrelevant, nearly-constant, or redundant features (e.g., drop country if all rows are US; drop year-of-birth if age is present).
    • Sampling: use subsets to speed prototyping and to rebalance classes.
    • Cleaning: fix or impute missing values (fill blanks with constants or predicted values) and remove corrupted or inaccurate records.
  6. Data wrangling & normalization

    • Formatting: standardize file formats and categorical naming (e.g., “Florida” vs “FL”).
    • Normalization/scaling: unify numeric feature scales so features with larger numeric ranges don’t dominate (example: min–max normalization maps values to 0–1).
  7. Feature engineering

    • Create new features from raw data to expose predictive signals (e.g., split datetime into day/month and hour to capture seasonality/time-of-day effects).
    • Thoughtful feature design can substantially improve model performance.
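
Several of the steps above can be illustrated with short sketches. First, the mislabeling mitigation from step 4 — collecting votes from multiple labelers and flagging low-agreement samples — might look like this minimal plain-Python sketch (the function name and agreement score are illustrative, not from the video):

```python
from collections import Counter

def resolve_label(votes):
    """Resolve disagreement between multiple labelers by majority vote.

    Returns the winning label and an agreement score; a low score flags
    samples worth re-checking (e.g. a peach labeled as an apple).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)

# Two of three labelers say "apple": keep the label, note the 2/3 agreement.
print(resolve_label(["apple", "apple", "peach"]))
```

In practice a threshold on the agreement score decides which samples get routed back for human review.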
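The cleansing step's idea of filling blanks with constants or derived values (step 5) can be sketched in plain Python; real pipelines typically use a library such as pandas or scikit-learn, and this helper name is an assumption for illustration:

```python
from statistics import mean

def impute_missing(values, strategy="mean", constant=0):
    """Fill missing entries (None) in a numeric column.

    strategy="mean" fills blanks with the mean of the observed values;
    strategy="constant" fills them with a fixed value.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else constant
    return [fill if v is None else v for v in values]

# A column of ages with two missing readings.
ages = [25, None, 31, None, 28]
print(impute_missing(ages))                   # blanks replaced by the mean, 28
print(impute_missing(ages, "constant", -1))   # blanks replaced by a sentinel
```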
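The min–max normalization mentioned in step 6 maps each value to the 0–1 range so that a feature like income (tens of thousands) cannot dominate one like age (tens). A minimal sketch:

```python
def min_max_normalize(values):
    """Rescale a numeric feature to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [30_000, 45_000, 60_000]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```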
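Finally, the feature-engineering example from step 7 — splitting a raw timestamp into parts that expose seasonality and time-of-day effects — can be sketched with the standard library (the returned feature names are illustrative):

```python
from datetime import datetime

def expand_datetime(ts):
    """Split a raw ISO timestamp into features a model can use directly."""
    dt = datetime.fromisoformat(ts)
    return {
        "month": dt.month,        # seasonality
        "day": dt.day,
        "hour": dt.hour,          # time-of-day effects
        "weekday": dt.weekday(),  # 0 = Monday
    }

print(expand_datetime("2023-11-23T14:30:00"))
```

A model can learn, for example, that demand spikes in month 11, which the raw timestamp string would never expose.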


Main speakers / sources

Note: the subtitles were auto-generated; some names/phrases may be slightly inaccurate.
