Summary of "Handling Mixed Variables | Feature Engineering"

Handling mixed variables (feature engineering)

Problem: “mixed” columns in tabular data, where either a single cell combines categorical/text and numeric information, or some rows of a column are numeric and others are categorical. Left as-is, such a column is read as strings, so its numeric information is unusable and it behaves like a noisy, high-cardinality categorical feature.

What the video covers (high level)

Scope: two practical patterns of mixed data and how to clean/transform them.
Example dataset: a modified Titanic sample (columns shown in the video: cabin, ticket, number, survive, alone).

Two common patterns and recommended handling

1) Mixed within a single cell (concatenated category + number)

Example: cabin values like C85 or ticket values like A/5 21171.

Recommended steps:

Split the cell into two new columns:
- a categorical column for the alphabetic prefix (e.g., ticket_prefix, cabin_prefix)
- a numeric column for the extracted number (e.g., ticket_number, cabin_number)
Use regular expressions to capture the alphabetic prefix and the numeric part.
Benefits:
- reduces cardinality: the prefix column has far fewer distinct values than the raw combined strings
- converts the numeric portion into a usable numeric feature for downstream models
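
The split described above can be sketched with pandas and regular expressions. This is a minimal illustration, not the presenter's notebook code: the sample rows, the regex patterns, and the output column names (cabin_prefix, cabin_number, ticket_prefix, ticket_number) are assumptions chosen to match the naming suggested above.

```python
import pandas as pd

# Small hand-made sample (assumed values in the style of the Titanic columns)
df = pd.DataFrame({
    "cabin": ["C85", "E46", None, "B42"],
    "ticket": ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"],
})

# Cabin: leading letters form the prefix, the digits that follow form the number
cabin_parts = df["cabin"].str.extract(r"^([A-Za-z]+)(\d+)?")
df["cabin_prefix"] = cabin_parts[0]
df["cabin_number"] = pd.to_numeric(cabin_parts[1])

# Ticket: the trailing run of digits is the number; anything before it
# (separated by whitespace) is the prefix. Tickets with no prefix get NaN.
ticket_parts = df["ticket"].str.extract(r"^(?:(.*\S)\s+)?(\d+)$")
df["ticket_prefix"] = ticket_parts[0]
df["ticket_number"] = pd.to_numeric(ticket_parts[1])

print(df)
```

Rows that do not match a pattern (a missing cabin, or a ticket with no trailing digits) come out as NaN in both new columns, so they can be handled by the same missing-value strategy as the rest of the data.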

2) Mixed by row type (same column contains either numeric rows or categorical rows)

Recommended approach:

Create two separate columns: numeric_column and categorical_column.
For each row, place the numeric value in numeric_column and the categorical value in categorical_column; set the opposite column to NaN.
Fill/encode missing values appropriately:
- numeric missing -> 0 or another meaningful fill (e.g., the median, or a sentinel outside the valid range if 0 is a legitimate value)
- categorical missing -> "missing" or a consistent token/category
Benefits:
- preserves both types of information
- makes downstream transformations (scaling, encoding) straightforward
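
A minimal sketch of this row-type split, assuming a column named number whose values arrive as strings. The sample values and the output column names are hypothetical, and the fills (0 and "missing") simply follow the options listed above.

```python
import pandas as pd

# Assumed sample: some rows hold numbers, others hold category codes
df = pd.DataFrame({"number": ["5", "3", "A", "1", "B", "2"]})

# errors="coerce" turns every value that does not parse as a number into NaN
numeric = pd.to_numeric(df["number"], errors="coerce")

df["number_numeric"] = numeric
# Keep the original value only in rows where numeric parsing failed
df["number_categorical"] = df["number"].where(numeric.isna())

# Fill the gaps; the fill values are a choice, not a rule
df["number_numeric_filled"] = df["number_numeric"].fillna(0)
df["number_categorical_filled"] = df["number_categorical"].fillna("missing")

print(df)
```

Keeping the unfilled columns next to the filled ones makes it easy to swap in a different fill strategy later without redoing the split.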

Implementation notes & practical tips

Inspect unique values and value counts before splitting to understand distributions.
Use pandas + regex (or equivalent parsing) in a notebook to create new columns.
After extraction, reduce category cardinality by grouping or encoding (e.g., keep only the prefix, or merge rare prefixes into a single "other" category).
Treat special flags (like an alone indicator) deliberately; they may be meaningful features on their own.
Keep these transformations in a reusable utility/library so you can apply them quickly to future datasets.
The presenter provides a runnable notebook/code; download and run it to see the transformations in action.
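
The inspection and cardinality-reduction tips can be sketched as follows. This is an illustrative example under stated assumptions: the sample prefixes, the min_count threshold, and the "other" label are hypothetical choices, not values from the video.

```python
import pandas as pd

# Assumed sample of extracted prefixes, including one missing value
prefixes = pd.Series(["C", "C", "B", "E", "C", "B", "T", None], name="cabin_prefix")

# Inspect the distribution first, counting missing values as well
counts = prefixes.value_counts(dropna=False)
print(counts)

# Merge prefixes seen fewer than min_count times into one "other" category,
# leaving missing values untouched so they can be filled separately
min_count = 2
frequent = counts[counts >= min_count].index
grouped = prefixes.where(prefixes.isin(frequent) | prefixes.isna(), "other")

print(grouped.value_counts(dropna=False))
```

Wrapping steps like these in a small function is one way to build the reusable utility mentioned above.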

Tutorials / guides referenced

Short hands-on tutorial: live demonstration in a Jupyter notebook using a fabricated Titanic-like example.
Code/notebook available for download to reproduce the examples (extract, split, and fill strategies).