Summary of "But how do AI images and videos actually work? | Guest video by Welch Labs"

High-level idea

Modern text-to-image and text-to-video models synthesize pixels by repeatedly transforming pure noise into structured images/videos using diffusion-based processes. Mathematically this is like running Brownian motion (diffusion) backwards in very high-dimensional spaces.

Core components and concepts

  1. CLIP (OpenAI, 2021)

    • Two encoders: a text encoder and an image encoder, each outputting a 512‑dimensional embedding.
    • Trained contrastively: matching image–caption pairs are pulled together in embedding space; non-matching pairs are pushed apart (cosine similarity).
    • The learned embedding space encodes semantic directions (e.g., a “hat” vector ≈ difference between “with hat” and “without hat”).
    • CLIP maps image/text → embeddings but cannot generate images from embeddings directly.
  2. Diffusion models (DDPM — Berkeley, 2020)

    • Forward process: gradually add Gaussian noise to training images until the image is destroyed (a random walk/Brownian motion).
    • Reverse process: train a neural network to undo noise and recover images.
    • Key points from DDPM:
      • Training target is the total added noise (ε) rather than only the immediate previous step — this reduces variance and makes learning more efficient.
      • During sampling, DDPM adds noise at each reverse step (stochastic sampling). Adding noise during generation surprisingly improves sharpness and diversity; removing it collapses outputs toward the mean (blurriness).
    • Intuition: the model learns a time-dependent score function / vector field that points back toward higher-density (realistic) data.
  3. Time-conditioning and the vector-field view

    • Models are conditioned on a time variable t (amount of noise), so they learn coarse behavior at high noise and fine structure as t → 0.
    • Diffusion learning can be seen as learning a time-varying vector field (flow) in data space that directs noisy points back to the data manifold.
  4. Why noise during sampling matters

    • Without the stochastic term, reverse dynamics push samples to the dataset mean → resulting images are blurry.
    • The random term is required to sample from the full reverse Gaussian distribution: the network predicts only its mean, and adding Gaussian noise turns that mean into a genuine sample from the distribution.
  5. DDIM and deterministic sampling (Stanford / Google)

    • The stochastic reverse SDE (DDPM) can be mapped to a deterministic ODE (DDIM) with the same endpoint distribution.
    • DDIM sampling is deterministic and can produce high-quality images with far fewer network iterations (faster sampling) by changing step scaling — no retraining required.
    • Flow-matching generalizes DDIM; some video models (e.g., WAN) use these generalized flows.
  6. Conditioning on text and steering generation

    • Straight conditioning: feed CLIP (or other text) embeddings into the diffusion model (e.g., via cross-attention or concatenation) so the denoiser uses text context during training/inference.
    • unCLIP / DALL·E 2 (OpenAI): train diffusion to invert the CLIP image encoder, enabling stronger prompt adherence by generating images consistent with CLIP embeddings.
    • Conditioning alone is often insufficient for strong prompt adherence; additional techniques are commonly used.
  7. Classifier-free guidance (CFG)

    • Technique: train the model sometimes without conditioning (unconditional) and sometimes with conditioning. During sampling, compute conditioned output minus unconditioned output and amplify that difference by a guidance factor α to push samples toward the condition.
    • CFG effectively amplifies the semantic direction corresponding to the prompt, improving adherence and detail. Guidance scale α controls strength (higher α → stronger adherence, but can introduce artifacts).
    • WAN extends this by using “negative prompts” (explicitly encode undesired features, subtract and amplify) to steer outputs away from unwanted attributes. WAN 2.1 is an open-source video model demonstrating these techniques.
  8. Practical open-source models and tools mentioned

    • WAN 2.1: open-source video model used in the video demos.
    • Stable Diffusion (Heidelberg team): open-source image diffusion model used in examples; benefits from classifier-free guidance and DDIM sampling to reduce compute.
    • DALL·E 2 (OpenAI, unCLIP): a closed commercial approach that inverts CLIP to achieve strong prompt adherence.
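The contrastive training described in point 1 can be sketched in a few lines. This is a toy, pure-Python version of a symmetric InfoNCE-style objective over a batch of matching image–caption embedding pairs; the function names and the `temperature` default are illustrative, not CLIP's actual implementation (which operates on learned, normalized tensors at scale):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the i-th image matches the i-th
    caption. Diagonal (matching) pairs are pulled together,
    off-diagonal (non-matching) pairs are pushed apart."""
    n = len(image_embs)
    # Similarity matrix, scaled by temperature.
    logits = [[cosine_similarity(im, tx) / temperature
               for tx in text_embs] for im in image_embs]
    loss = 0.0
    for i in range(n):
        # Image -> text direction: the correct caption has index i.
        row = logits[i]
        loss += -row[i] + math.log(sum(math.exp(x) for x in row))
        # Text -> image direction: the correct image has index i.
        col = [logits[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(x) for x in col))
    return loss / (2 * n)
```

A correctly paired batch scores a near-zero loss, while a batch with shuffled captions scores a large one, which is exactly the pressure that shapes the shared embedding space.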
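The forward process from point 2 has a convenient closed form: instead of simulating the random walk step by step, the noisy state at step t can be sampled directly from the clean data. A minimal sketch on toy vectors rather than images; `make_alpha_bars` follows the linear β schedule from the DDPM paper, while the other names are illustrative:

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t = prod(1 - beta_s) for a
    linear beta schedule (the DDPM paper's default settings)."""
    alpha_bars, prod = [], 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_noise(x0, t, alpha_bars, rng=random):
    """Jump straight to step t of the forward walk:
        x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps ~ N(0, I). Returns (x_t, eps): the denoiser is trained
    to predict eps, the *total* added noise, given x_t and t."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
          for x, e in zip(x0, eps)]
    return xt, eps
```

By the final step alpha_bar is effectively zero, so x_T is indistinguishable from pure Gaussian noise, which is what makes "start from noise" a valid entry point for generation.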
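Points 2 and 4 both hinge on the stochastic reverse step. Below is a sketch of one such step, assuming a network has already produced a noise estimate `eps_pred`; the schedule helper and function names are illustrative, and the variance choice sigma_t^2 = beta_t is one of the two options discussed in the DDPM paper:

```python
import math
import random

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Betas and their cumulative products for a linear schedule."""
    betas = [beta_start + (beta_end - beta_start) * t / (T - 1)
             for t in range(T)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def ddpm_reverse_step(xt, t, eps_pred, betas, alpha_bars, rng=random):
    """One stochastic reverse step. eps_pred (the network's noise
    estimate) determines the mean of the reverse Gaussian; the added
    z term draws a full sample from it. Omitting z on every step
    drives samples toward the mean -- i.e., blurry outputs."""
    alpha_t = 1.0 - betas[t]
    mean_coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - mean_coef * e) / math.sqrt(alpha_t)
            for x, e in zip(xt, eps_pred)]
    if t == 0:
        return mean  # the final step is taken without added noise
    sigma = math.sqrt(betas[t])  # sigma_t^2 = beta_t variance choice
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
```

Running this from t = T-1 down to 0 is DDPM's full (slow) sampler: roughly a thousand network evaluations per image with the default schedule.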
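The DDIM update from point 5 replaces that stochastic step with a deterministic one: recover an estimate of the clean signal from the current state, then deterministically re-noise it to an earlier timestep, possibly many steps back, which is what allows sampling in far fewer iterations. A sketch, where `alpha_bars` holds the cumulative products of any noise schedule and the names are illustrative:

```python
import math

def ddim_step(xt, t, t_prev, eps_pred, alpha_bars):
    """Deterministic DDIM update from step t to step t_prev < t.
    First invert the closed-form forward equation to estimate the
    clean signal x0, then re-apply the same equation at t_prev
    using the predicted noise instead of fresh random noise."""
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    # Estimate of the clean data implied by the noise prediction.
    x0_pred = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
               for x, e in zip(xt, eps_pred)]
    # Deterministically move to the earlier noise level.
    return [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]
```

Because the update is deterministic, the same starting noise always maps to the same image, and `t_prev` can skip dozens of steps at a time; no retraining of the denoiser is needed, only this change in how its output is used.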
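Classifier-free guidance (point 7) reduces to a single vector operation per sampling step. A sketch, assuming conditioned and unconditioned noise predictions are already available from the same network (names illustrative):

```python
def guided_eps(eps_cond, eps_uncond, guidance_scale):
    """Classifier-free guidance: amplify the (conditioned minus
    unconditioned) direction. A scale of 1 recovers the plain
    conditioned prediction; larger scales push samples harder toward
    the prompt, at some risk of artifacts. For negative prompting
    (as in WAN), pass the noise prediction for the *unwanted* prompt
    as eps_uncond -- the same subtraction then steers away from it."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(eps_cond, eps_uncond)]
```

The guided prediction then simply replaces `eps_pred` in whichever sampler (stochastic or deterministic) is being used.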

