Summary of "Feature Scaling - Normalization | MinMaxScaling | MaxAbsScaling | RobustScaling"
Feature scaling / normalization — overview
Feature scaling changes numeric feature values to a comparable range while preserving relative differences. Proper scaling helps algorithms that depend on magnitudes or distances (gradient-based methods, many linear models, neural networks, KNN, SVM, PCA) perform better. Tree-based models usually do not require scaling.
Main ideas and practical workflow
- Decide whether scaling is necessary for your problem and algorithm. Algorithms that commonly benefit include distance-based and gradient-based methods, many linear models, neural networks, KNN, SVM, and PCA.
- Typical workflow:
- Split data into training and test (and validation) sets.
- Fit the scaler on the training set only.
- Transform the training set and then transform the test/validation sets with the fitted scaler.
- Use `inverse_transform` when you need to convert scaled values back to the original scale.
- Note: scikit-learn transformers convert DataFrames to NumPy arrays, so keep track of column names if you need to restore a DataFrame. A sketch of this workflow follows this list.
- Experimentation is important: no single scaler is best for every dataset. Try several and compare model performance.
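A minimal sketch of the workflow above, assuming scikit-learn and pandas are installed; the data and column names are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Illustrative data; any numeric DataFrame works the same way.
df = pd.DataFrame({"alcohol": [12.1, 13.5, 14.2, 12.8, 13.0, 14.8],
                   "malic_acid": [1.5, 2.2, 1.9, 3.1, 2.5, 1.2]})

# 1) Split first, so the test set never influences the scaler.
train, test = train_test_split(df, test_size=0.33, random_state=42)

# 2) Fit the scaler on the training set only.
scaler = MinMaxScaler()
scaler.fit(train)

# 3) Transform both splits with the fitted scaler (returns NumPy arrays).
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Restore a DataFrame if column names matter downstream.
train_scaled = pd.DataFrame(train_scaled, columns=df.columns, index=train.index)

# 4) inverse_transform maps scaled values back to the original units.
train_original = scaler.inverse_transform(train_scaled)
```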
Scaling methods — formulas, intuition, pros/cons
Min-Max Scaling (MinMaxScaler)
- Formula: `x' = (x - min) / (max - min)`
- Result: values mapped to `[0, 1]` (or another specified range)
- Intuition: compresses each feature into a unit box (a unit interval per feature)
- Pros:
- Preserves the shape of the original distribution (mostly).
- Useful when you know true feature bounds (e.g., image pixels 0–255) or when a model requires bounded inputs.
- Cons:
- Sensitive to outliers — extreme values will compress the rest of the data.
- Use when: you know the true min/max bounds or model requires bounded input ranges.
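For illustration, the formula computed by hand next to the scikit-learn scaler; the input values are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [5.0], [10.0]])  # illustrative values

# Manual min-max: x' = (x - min) / (max - min)
x_manual = (x - x.min()) / (x.max() - x.min())

# scikit-learn equivalent; feature_range selects a different bounded range if needed.
x_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

assert np.allclose(x_manual, x_sklearn)
print(x_sklearn.ravel())  # [0.         0.11111111 0.44444444 1.        ]
```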
Standardization / Z-score (StandardScaler) and Mean Normalization
- Standardization (common form)
- Formula: `x' = (x - mean) / std`
- Result: zero mean and unit variance
- Use when: algorithms expect centered data (many linear models, PCA); often performs well in practice.
- Mean normalization (less common)
- Formula (sometimes used): `x' = (x - mean) / (max - min)`
- Result: centers data around zero; range roughly between `-1` and `+1`, depending on the distribution.
- Note: nomenclature varies across tutorials; StandardScaler (z-score) is the most common “centered” approach.
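A small sketch contrasting the two variants; scikit-learn has no dedicated mean-normalization class, so that variant is computed by hand (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0]])  # illustrative values

# Z-score standardization: x' = (x - mean) / std
x_std = StandardScaler().fit_transform(x)
print(x_std.mean(), x_std.std())  # ~0.0 and 1.0

# Mean normalization: x' = (x - mean) / (max - min), computed with NumPy.
x_mean_norm = (x - x.mean()) / (x.max() - x.min())
print(x_mean_norm.ravel())  # centered on zero, within roughly [-1, +1]
```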
MaxAbs Scaling (MaxAbsScaler)
- Formula: `x' = x / max(|x|)`
- Result: scales data to the range `[-1, 1]` without centering (zeros remain exactly zero)
- Pros:
- Preserves sparsity — useful for sparse data with many zeros.
- Use when: data is sparse and you do not want to shift the mean.
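A short sketch on a sparse matrix, assuming SciPy is available; the values are illustrative:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Sparse, mostly-zero data (illustrative values).
X = csr_matrix(np.array([[0.0, -4.0],
                         [0.0,  2.0],
                         [3.0,  0.0]]))

# MaxAbsScaler divides each column by its maximum absolute value,
# so zeros stay zero and the sparsity pattern is preserved.
X_scaled = MaxAbsScaler().fit_transform(X)
print(X_scaled.toarray())
# [[ 0.  -1. ]
#  [ 0.   0.5]
#  [ 1.   0. ]]
```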
Robust Scaling (RobustScaler)
- Formula: `x' = (x - median) / IQR`, where `IQR = Q3 - Q1` (75th percentile − 25th percentile)
- Result: centers on the median and scales according to the IQR
- Pros:
- Much less sensitive to outliers than Min-Max or standard scaling.
- Use when: data contains significant outliers.
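A minimal sketch of why the median/IQR choice matters, comparing RobustScaler against StandardScaler on data with one extreme outlier (values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Illustrative data with one extreme outlier (1000.0).
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

x_robust = RobustScaler().fit_transform(x)      # uses median and IQR
x_standard = StandardScaler().fit_transform(x)  # uses mean and std

# The outlier inflates the mean and std, squashing the inliers toward a
# single point; the median and IQR keep the inliers sensibly spread.
print(x_robust[:4].ravel())    # [-1.  -0.5  0.   0.5]
print(x_standard[:4].ravel())  # all four values ~ -0.50
```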
Practical example (wine dataset demonstration)
Dataset: wine data (example features: alcohol and malic acid)
Steps in the demo (a code sketch follows the list):
- Inspect feature distributions (histograms / distribution plots).
- Split into train and test sets.
- Fit MinMaxScaler on the training set, transform both training and test sets.
- Convert results back to a DataFrame to inspect ranges and create plots.
- Visualize scatter plots before and after scaling.
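The demo's exact code isn't reproduced in this summary; the following sketch approximates its steps using scikit-learn's built-in wine dataset:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Load the wine data and keep the two features used in the demo.
wine = load_wine(as_frame=True)
X = wine.data[["alcohol", "malic_acid"]]

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit on the training set only; transform both splits with the same scaler.
scaler = MinMaxScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train),
                              columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(scaler.transform(X_test),
                             columns=X_test.columns, index=X_test.index)

# Inspect ranges: training columns span exactly [0, 1]; test values can
# fall slightly outside because the scaler never saw the test extremes.
print(X_train_scaled.describe().loc[["min", "max"]])
print(X_test_scaled.describe().loc[["min", "max"]])
```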
Observations:
- After Min-Max scaling, both the alcohol and malic-acid columns were mapped to `[0, 1]`; the scatter plot becomes compressed into the unit rectangle.
- Distribution shapes are largely preserved by Min-Max scaling, though small distortions are possible depending on the original distribution.
- Min-Max guarantees min → 0 and max → 1 (with the default range), but because each feature is scaled by a different factor, multivariate distances and the relative geometry between points can change.
- RobustScaler is useful when outliers are present; MaxAbsScaler is useful for preserving sparsity.
Rules of thumb for choosing a scaler
- MinMaxScaler: when feature values are naturally bounded and you know min/max (e.g., image pixels 0–255).
- RobustScaler: when data contains outliers.
- MaxAbsScaler: when data is sparse with many zeros and you want to preserve sparsity.
- StandardScaler (z-score): when you need centered data (zero mean) and unit variance.
- Unsure: try multiple scalers and compare model performance (see the comparison sketch after this list).
- Always fit the scaler on training data only, then transform validation/test data.
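A minimal comparison loop in that spirit; the classifier and metric are illustrative choices, not the presenter's:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A pipeline guarantees the scaler is fit on training data only.
for scaler in [MinMaxScaler(), StandardScaler(), MaxAbsScaler(), RobustScaler()]:
    model = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(f"{type(scaler).__name__}: {model.score(X_test, y_test):.3f}")
```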
Implementation notes and gotchas
- Fit on train only; transform both train and test (and validation).
- scikit-learn scalers return NumPy arrays; convert back to a pandas DataFrame if you want column names preserved (see the sketch after this list).
- Use `inverse_transform` to map scaled data back to the original scale for interpretation or plotting.
- Be aware that scaling changes distances and relations between features; interpret plots in scaled space with caution.
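The manual DataFrame rebuild was shown earlier; on scikit-learn 1.2 or newer, the `set_output` API avoids the round-trip through NumPy entirely (a small sketch, assuming that version is available):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative two-column frame.
df = pd.DataFrame({"alcohol": [13.2, 12.8, 14.1], "malic_acid": [1.8, 2.4, 1.1]})

# Requires scikit-learn >= 1.2: transform/fit_transform return DataFrames
# with the original column names and index.
scaler = StandardScaler().set_output(transform="pandas")
scaled = scaler.fit_transform(df)
print(scaled.columns.tolist())  # ['alcohol', 'malic_acid']

# inverse_transform maps scaled values back to the original units.
restored = scaler.inverse_transform(scaled)
```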
Sources and references
- scikit-learn transformer classes: MinMaxScaler, MaxAbsScaler, RobustScaler, StandardScaler.
- Wine dataset (used in the demo; likely the UCI Wine dataset).
- Demonstration presenter / YouTube channel (unnamed) and general machine learning references.