Summary of "Handling Mixed Variables | Feature Engineering"
Handling mixed variables (feature engineering)
Problem: “Mixed” columns in tabular data, where a single field contains both categorical/text and numeric information, or where some rows in a column are numeric and others categorical. Left as-is, such columns are usually read as strings, so the numeric part cannot be used as a number and almost every distinct value becomes its own category (high-cardinality, noisy features).
What the video covers (high level)
- Scope: two practical patterns of mixed data and how to clean/transform them.
- Example dataset: a modified Titanic sample (columns shown in the video: cabin, ticket, number, survive, alone).
Two common patterns and recommended handling
1) Mixed within a single cell (concatenated category + number)
Example: values like C85 or A/5 21171.
Recommended steps:
- Split the cell into two new columns:
- a categorical column for the alphabetic prefix (e.g., ticket_prefix, cabin_prefix)
- a numeric column for the extracted number (e.g., ticket_number, cabin_number)
- Use regular expressions to capture the alphabetic prefix and the numeric part.
- Benefits:
- reduces cardinality
- converts the numeric portion into a usable numeric feature for downstream models
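The split described in pattern 1 can be sketched with pandas as follows. This is a minimal illustration, not the presenter's notebook code: the sample values and both regular expressions are assumptions, and real cabin or ticket data may need more permissive patterns.

```python
import pandas as pd

# Hypothetical sample rows in the style of the Titanic cabin/ticket columns.
df = pd.DataFrame({
    "cabin": ["C85", "E46", None, "B28"],
    "ticket": ["A/5 21171", "PC 17599", "113803", "STON/O2. 3101282"],
})

# Cabin: leading letters are the category, the digits that follow are the number.
cabin_parts = df["cabin"].str.extract(r"^([A-Za-z]+)(\d+)?", expand=True)
df["cabin_prefix"] = cabin_parts[0]
df["cabin_number"] = pd.to_numeric(cabin_parts[1], errors="coerce")

# Ticket: the trailing run of digits is the number; anything before the
# separating whitespace is the prefix (absent for purely numeric tickets).
ticket_parts = df["ticket"].str.extract(r"^(?:(.*\S)\s+)?(\d+)$", expand=True)
df["ticket_prefix"] = ticket_parts[0]
df["ticket_number"] = pd.to_numeric(ticket_parts[1], errors="coerce")

print(df[["cabin_prefix", "cabin_number", "ticket_prefix", "ticket_number"]])
```

Rows that do not match a pattern (for example a missing cabin) end up as NaN in both new columns, so they can be filled or encoded later like any other missing value.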
2) Mixed by row type (same column contains either numeric rows or categorical rows)
Recommended approach:
- Create two separate columns: numeric_column and categorical_column.
- For each row, place the numeric value in numeric_column and the categorical value in categorical_column; set the opposite column to NaN.
- Fill/encode missing values appropriately:
- numeric missing -> 0 or another meaningful fill
- categorical missing -> "missing" or a consistent token/category
- Benefits:
- preserves both types of information
- makes downstream transformations (scaling, encoding) straightforward
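A minimal pandas sketch of pattern 2, assuming the mixed column is called number and holds string values. The sample data is made up, and using pd.to_numeric with errors="coerce" to tell numeric rows from categorical ones is one possible approach, not necessarily the one used in the video.

```python
import pandas as pd

# Hypothetical mixed column: some rows are numbers, some are category codes.
df = pd.DataFrame({"number": ["3", "A", "12", "B", None, "7"]})

# Values that cannot be parsed as numbers become NaN.
numeric = pd.to_numeric(df["number"], errors="coerce")
df["numeric_column"] = numeric

# Keep the original value only where parsing failed; numeric rows become NaN.
df["categorical_column"] = df["number"].where(numeric.isna())

# Fill the gaps. Filling with 0 makes a non-numeric row look like a real zero,
# so another fill (or an extra indicator column) may suit some datasets better.
df["numeric_column"] = df["numeric_column"].fillna(0)
df["categorical_column"] = df["categorical_column"].fillna("missing")

print(df)
```

Note that a row that was missing in the original column ends up as 0 in numeric_column and "missing" in categorical_column.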
Implementation notes & practical tips
- Inspect unique values and value counts before splitting to understand distributions.
- Use pandas + regex (or equivalent parsing) in a notebook to create new columns.
- After extraction, reduce category cardinality by grouping or encoding (e.g., keep only prefixes).
- Treat special flags (like an alone indicator) deliberately; they may be meaningful features on their own.
- Keep these transformations in a reusable utility/library so you can apply them quickly to future datasets.
- The presenter provides a runnable notebook/code — download and run it to see the transformations in action.
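The inspection and grouping tips above could look like this in a notebook. The helper group_rare, its min_count threshold, and the "other" token are hypothetical names chosen for illustration; they do not come from the presenter's code.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int = 2, other: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with a single token."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; missing values are left untouched.
    return series.where(~series.isin(rare), other)

# Hypothetical extracted ticket prefixes.
prefixes = pd.Series(["PC", "PC", "A/5", "STON/O2.", "PC", "A/5", "W./C."])

# Inspect the distribution before deciding how to group.
print(prefixes.nunique())
print(prefixes.value_counts())

grouped = group_rare(prefixes, min_count=2)
print(grouped.value_counts())
```

Keeping a small function like this in a shared module is one way to reuse the same cleanup across datasets.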
Tutorials / guides referenced
- Short hands-on tutorial: live demonstration in a Jupyter notebook using a fabricated Titanic-like example.
- Code/notebook available for download to reproduce the examples (extract, split, and fill strategies).
What’s next
- The next video will cover handling date and time columns (feature engineering for datetime).
Main speaker / source
- Presenter: host of the YouTube channel “Feature Engineering in the Morning” (video author/presenter).