Summary of "But how do AI images and videos actually work? | Guest video by Welch Labs"

High-level idea

Modern text-to-image and text-to-video models synthesize pixels by repeatedly transforming pure noise into structured images/videos using diffusion-based processes. Mathematically this is like running Brownian motion (diffusion) backwards in very high-dimensional spaces.

Core components and concepts

  1. CLIP (OpenAI, 2021)

    • Two encoders: a text encoder and an image encoder, each outputting a 512‑dimensional embedding.
    • Trained contrastively: matching image–caption pairs are pulled together in embedding space; non-matching pairs are pushed apart (cosine similarity).
    • The learned embedding space encodes semantic directions (e.g., a “hat” vector ≈ difference between “with hat” and “without hat”).
    • CLIP maps image/text → embeddings but cannot generate images from embeddings directly.
  2. Diffusion models (DDPM — Berkeley, 2020)

    • Forward process: gradually add Gaussian noise to training images until the image is destroyed (a random walk/Brownian motion).
    • Reverse process: train a neural network to undo noise and recover images.
    • Key points from DDPM:
      • Training target is the total added noise (ε) rather than only the immediate previous step — this reduces variance and makes learning more efficient.
      • During sampling, DDPM adds noise at each reverse step (stochastic sampling). Adding noise during generation surprisingly improves sharpness and diversity; removing it collapses outputs toward the mean (blurriness).
    • Intuition: the model learns a time-dependent score function / vector field that points back toward higher-density (realistic) data.
  3. Time-conditioning and the vector-field view

    • Models are conditioned on a time variable t (amount of noise), so they learn coarse behavior at high noise and fine structure as t → 0.
    • Diffusion learning can be seen as learning a time-varying vector field (flow) in data space that directs noisy points back to the data manifold.
  4. Why noise during sampling matters

    • Without the stochastic term, reverse dynamics push samples to the dataset mean → resulting images are blurry.
    • The random term is required to sample from the full reverse Gaussian distribution: the network predicts only its mean, and adding Gaussian noise turns that mean into a genuine sample from the distribution.
  5. DDIM and deterministic sampling (Stanford / Google)

    • The stochastic reverse SDE (DDPM) can be mapped to a deterministic ODE (DDIM) with the same endpoint distribution.
    • DDIM sampling is deterministic and can produce high-quality images with far fewer network iterations (faster sampling) by changing step scaling — no retraining required.
    • Flow-matching generalizes DDIM; some video models (e.g., WAN) use these generalized flows.
  6. Conditioning on text and steering generation

    • Straight conditioning: feed CLIP (or other text) embeddings into the diffusion model (e.g., via cross-attention or concatenation) so the denoiser uses text context during training/inference.
    • unCLIP / DALL·E 2 (OpenAI): train diffusion to invert the CLIP image encoder, enabling stronger prompt adherence by generating images consistent with CLIP embeddings.
    • Conditioning alone is often insufficient for strong prompt adherence; additional techniques are commonly used.
  7. Classifier-free guidance (CFG)

    • Technique: train the model sometimes without conditioning (unconditional) and sometimes with conditioning. During sampling, compute conditioned output minus unconditioned output and amplify that difference by a guidance factor α to push samples toward the condition.
    • CFG effectively amplifies the semantic direction corresponding to the prompt, improving adherence and detail. Guidance scale α controls strength (higher α → stronger adherence, but can introduce artifacts).
    • WAN extends this by using “negative prompts” (explicitly encode undesired features, subtract and amplify) to steer outputs away from unwanted attributes. WAN 2.1 is an open-source video model demonstrating these techniques.
  8. Practical open-source models and tools mentioned

    • WAN 2.1: open-source video model used in the video demos.
    • Stable Diffusion (Heidelberg team): open-source image diffusion model used in examples; benefits from classifier-free guidance and DDIM sampling to reduce compute.
    • DALL·E 2 (OpenAI, unCLIP): a closed commercial approach that inverts CLIP to achieve strong prompt adherence.
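The contrastive training described in point 1 can be sketched in a few lines. This is a toy, pure-Python version of a symmetric InfoNCE-style objective over a batch of matching image–caption embedding pairs; the function names and the `temperature` default are illustrative, not CLIP's actual implementation (which operates on learned, normalized tensors at scale):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image matches the i-th
    caption. Diagonal (matching) pairs are pulled together,
    off-diagonal (non-matching) pairs are pushed apart."""
    n = len(image_embs)
    # Similarity matrix, scaled by temperature.
    logits = [[cosine_similarity(im, tx) / temperature
               for tx in text_embs] for im in image_embs]
    loss = 0.0
    for i in range(n):
        # Image -> text direction: the correct caption has index i.
        row = logits[i]
        loss += -row[i] + math.log(sum(math.exp(x) for x in row))
        # Text -> image direction: the correct image has index i.
        col = [logits[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(x) for x in col))
    return loss / (2 * n)
```

A correctly paired batch scores a near-zero loss, while a batch with shuffled captions scores a large one, which is exactly the pressure that shapes the shared embedding space.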
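The forward process from point 2 has a convenient closed form: instead of simulating the random walk step by step, the noisy state at step t can be sampled directly from the clean data. A minimal sketch on toy vectors rather than images; `make_alpha_bars` follows the linear β schedule from the DDPM paper, while the other names are illustrative:

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod(1 - beta_s) for a
    linear beta schedule (the DDPM paper's default settings)."""
    alpha_bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng=random):
    """Jump straight to step t of the forward walk:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I). Returns (x_t, eps): the denoiser is trained
    to predict eps, the *total* added noise, given x_t and t."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
          for x, e in zip(x0, eps)]
    return xt, eps
```

By the final step alpha_bar is effectively zero, so x_T is indistinguishable from pure Gaussian noise, which is what makes "start from noise" a valid entry point for generation.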
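Points 2 and 4 both hinge on the stochastic reverse step. Below is a sketch of one such step, assuming a network has already produced a noise estimate `eps_pred`; the schedule helper and function names are illustrative, and the variance choice sigma_t^2 = beta_t is one of the two options discussed in the DDPM paper:

```python
import math
import random

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Betas and their cumulative products for a linear schedule."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1)
             for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def ddpm_reverse_step(xt, t, eps_pred, betas, alpha_bars, rng=random):
    """One stochastic reverse step. eps_pred (the network's noise
    estimate) determines the mean of the reverse Gaussian; the added
    z term draws a full sample from it. Omitting z on every step
    drives samples toward the mean -- i.e., blurry outputs."""
    alpha_t = 1.0 - betas[t]
    mean_coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - mean_coef * e) / math.sqrt(alpha_t)
            for x, e in zip(xt, eps_pred)]
    if t == 0:
        return mean  # the final step is taken without added noise
    sigma = math.sqrt(betas[t])  # sigma_t^2 = beta_t variance choice
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
```

Running this from t = T-1 down to 0 is DDPM's full (slow) sampler: roughly a thousand network evaluations per image with the default schedule.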
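The DDIM update from point 5 replaces that stochastic step with a deterministic one: recover an estimate of the clean signal from the current state, then deterministically re-noise it to an earlier timestep, possibly many steps back, which is what allows sampling in far fewer iterations. A sketch, where `alpha_bars` holds the cumulative products of any noise schedule and the names are illustrative:

```python
import math

def ddim_step(xt, t, t_prev, eps_pred, alpha_bars):
    """Deterministic DDIM update from step t to step t_prev < t.
    First invert the closed-form forward equation to estimate the
    clean signal x0, then re-apply the same equation at t_prev
    using the predicted noise instead of fresh random noise."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    # Estimate of the clean data implied by the noise prediction.
    x0_pred = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
               for x, e in zip(xt, eps_pred)]
    # Deterministically move to the earlier noise level.
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]
```

Because the update is deterministic, the same starting noise always maps to the same image, and `t_prev` can skip dozens of steps at a time; no retraining of the denoiser is needed, only this change in how its output is used.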
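Classifier-free guidance (point 7) reduces to a single vector operation per sampling step. A sketch, assuming conditioned and unconditioned noise predictions are already available from the same network (names illustrative):

```python
def guided_eps(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: amplify the (conditioned minus
    unconditioned) direction. A scale of 1 recovers the plain
    conditioned prediction; larger scales push samples harder toward
    the prompt, at some risk of artifacts. For negative prompting
    (as in WAN), pass the noise prediction for the *unwanted* prompt
    as eps_uncond -- the same subtraction then steers away from it."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(eps_cond, eps_uncond)]
```

The guided prediction then simply replaces `eps_pred` in whichever sampler (stochastic or deterministic) is being used.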

