Summary of "PDS week 1.3"

Summary — main ideas, concepts and lessons

What data science is

Data science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge and insights from data in its various forms.

Key roles and skills of a data scientist

Realistic expectations and pitfalls

Common example applications

Methodology — typical data science process (iterative / non-linear)

  1. Set / identify the research goal

    • Clarify the business context, purpose, expected outcomes and how results will be used.
    • Ask: What? Why? How?
    • Define measures of success (metrics, improvement targets).
    • Agree on deliverables, resources and a timeline in a project charter; typical deliverables include a report, source code, a prototype, or a deployment.
  2. Retrieve and acquire data

    • Identify required internal and external data sources.
    • Consider access, permissions, differing definitions across teams, and practical barriers.
    • Data may require collection (experiments) or access requests.
  3. Data preparation (pre-processing)

    • Data cleansing: remove errors, fix inconsistencies (e.g., gender labels, impossible ages), detect outliers.
    • Data transformation: convert variables for modeling needs (e.g., transforms to satisfy linear assumptions; compute per-capita from totals).
    • Data combining: join/merge datasets carefully and perform sanity checks.
    • Rule of thumb: garbage in → garbage out. Fix errors early to reduce downstream cost.
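The cleansing and combining steps above can be sketched in a few lines of pandas. The table, column names, and label variants here are hypothetical, chosen only to mirror the examples in the bullets (inconsistent gender labels, impossible ages, a join with a sanity check):

```python
import pandas as pd

# Hypothetical raw data illustrating the issues named above.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["F", "female", "M", "male"],  # inconsistent labels
    "age": [34, 29, 178, 41],                # 178 is an impossible age
})

# Data cleansing: harmonize labels, drop impossible values.
clean = raw.copy()
clean["gender"] = clean["gender"].str.upper().str[0]  # "female" -> "F"
clean = clean[clean["age"].between(0, 120)]           # remove impossible ages

# Data combining: merge with a second (hypothetical) table, then sanity-check.
regions = pd.DataFrame({"id": [1, 2, 4], "region": ["N", "S", "E"]})
merged = clean.merge(regions, on="id", how="left", indicator=True)
assert len(merged) == len(clean)  # sanity check: the join added no duplicate rows
```

The `indicator=True` flag adds a `_merge` column showing which rows matched, which makes the post-join sanity check explicit rather than implicit.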
  4. Data exploration

    • Use summary statistics (mean, median, std) and visualizations (histograms, line/bar charts, scatter plots).
    • Identify distributions, outliers, missingness and trends (e.g., weekly patterns); generate hypotheses.
    • Iteratively return to data preparation if new issues are discovered.
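A minimal exploration pass can be done with the standard library alone. The daily-sales figures below are hypothetical; the point is how the mean, median, and a simple two-standard-deviation screen surface a suspect value, which would then send you back to data preparation:

```python
import statistics

# Hypothetical daily sales figures; the last value looks suspicious.
daily_sales = [120, 135, 128, 142, 130, 125, 900]

mean = statistics.mean(daily_sales)      # pulled upward by the extreme value
median = statistics.median(daily_sales)  # robust to the extreme value
std = statistics.stdev(daily_sales)

# Simple screen: flag points more than 2 standard deviations from the mean.
outliers = [x for x in daily_sales if abs(x - mean) > 2 * std]
```

Here the mean (240) sits far from the median (130), which is itself a quick hint that the distribution is skewed by an outlier, the kind of finding a histogram or scatter plot would also reveal.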
  5. Data modeling

    • Choose model family driven by the research question and data type (classification, regression, clustering, recommendation).
    • Consider constraints: numerical vs categorical data, explainability requirements, production/deployment environment.
    • Train/evaluate using training/validation/test splits or held-out sets; when the full dataset is too large, train on a representative subset.
    • Use appropriate evaluation metrics (classification: precision, recall, F1; regression: RMSE, MAE; business KPIs).
    • Balance performance, interpretability and deployability (e.g., deep learning may be less explainable).
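The evaluation metrics named above are easy to compute by hand, which makes their trade-offs concrete. The labels and predictions below are hypothetical; the formulas are the standard definitions of precision, recall, F1, MAE, and RMSE:

```python
import math

# Hypothetical ground truth and predictions for a binary classifier.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)            # of predicted positives, how many were right
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Regression counterparts on hypothetical numeric predictions.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

RMSE penalizes large errors more heavily than MAE, so the choice between them is itself a modeling decision tied to the business question.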
  6. Presentation and automation

    • Storytelling: produce clear, stakeholder-focused reports and visualizations explaining methods, results, and recommended actions.
    • Report structure: cover page; abstract/executive summary (purpose, method, results, conclusions, recommendations); introduction; detailed methodology (at a reproducible level); results (facts only); discussion/conclusion (interpretation, implications, future work).
    • Automation / productionization: package code/process so results can be re-run on new data and deliver artifacts (code, models, dashboards).
    • Use appropriate output forms: tables, figures and structured narratives depending on audience.
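Automation, at its smallest, means persisting the trained artifact and re-running it on new data without manual intervention. This sketch uses a hypothetical stand-in "model" (a plain dict) and Python's standard `pickle` module; a real project would serialize a fitted model object the same way:

```python
import io
import pickle

# Hypothetical trained artifact; any fitted Python object can be
# serialized the same way.
model = {"kind": "mean-predictor", "value": 132.0}

# Persist the artifact so a scheduled job can reload and reuse it.
buffer = io.BytesIO()  # stands in for a file on disk
pickle.dump(model, buffer)

buffer.seek(0)
reloaded = pickle.load(buffer)

def score(new_data, model):
    """Re-run the stored model on fresh data: the automation step in miniature."""
    return [model["value"] for _ in new_data]

predictions = score([10, 20, 30], reloaded)
```

Wrapping load-model-then-score in one callable is what lets the same process run unattended on each new batch of data, the re-runnable deliverable the bullet describes.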

Training and evaluation practicalities

Reporting and reproducibility

Practical lessons and best practices

Sources, examples and references mentioned

Speakers / sources featured
