Summary of "Scaling LLM-Based Vulnerability Research via Static Analysis and Document Ranking"

Core idea

Use large language models (LLMs) as scalable, low-cost rankers/filters to find high-value items (vulnerabilities, suspicious functions, etc.) inside very large data sets (patch diffs, binaries, source trees, alert queues). The approach combines document ranking with Monte Carlo sampling: repeatedly sample small batches, have an LLM rank each batch, keep the best items, and iterate until only a small high-value subset remains.

Analogy: sift a huge pile of sand by repeatedly sampling small handfuls, ranking them, keeping the best, and iterating — applied algorithmically with LLMs.
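Below is a minimal Python sketch of that loop, assuming a hypothetical llm_rank_batch() helper for the model call; the real rank tool's interface and statistics differ, this only illustrates the sample-rank-keep-iterate shape:

```python
import random
from collections import defaultdict
from statistics import mean

def llm_rank_batch(batch):
    """Hypothetical stand-in: prompt an LLM to score each item in a small
    batch for the target property (e.g., "looks like a security fix").
    Returns {item: score}."""
    raise NotImplementedError  # wire this to OpenAI, Ollama, etc.

def monte_carlo_rank(items, batch_size=10, rounds=200, keep_frac=0.25):
    """Repeatedly sample small batches, let the LLM rank each batch,
    accumulate per-item scores, keep the top slice, and iterate."""
    pool = list(items)
    while len(pool) > batch_size:
        scores = defaultdict(list)
        for _ in range(rounds):
            batch = random.sample(pool, batch_size)
            for item, score in llm_rank_batch(batch).items():
                scores[item].append(score)
        # items never sampled in this pass sort to the bottom
        pool.sort(key=lambda i: mean(scores[i]) if scores[i] else float("-inf"),
                  reverse=True)
        pool = pool[: max(batch_size, int(len(pool) * keep_frac))]
    return pool
```

Because each prompt only ever contains one small batch, token usage stays bounded per call and roughly linear in the number of sampling rounds, which is plausibly where the claimed token efficiency comes from.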


Key tools & techniques demonstrated

  1. rank (command-line)

    • Purpose: document ranking over large lists of items (strings, functions, call chains); also demoed on a TLD list.
    • Method:
      • Repeated small-batch sampling (e.g., 10 items).
      • Use an LLM to rank items in each batch.
      • Collect statistics (mean, stddev), detect an “inflection point,” keep the top items, and iterate to narrow to a high-value subset (a cut-point sketch follows this list).
    • Features:
      • Outputs JSON with scores and LLM reasoning (explainability).
      • Works with remote APIs (OpenAI) or local models (Ollama, Qwen).
      • Token-efficient.
    • Example demos:
      • Fun demo: rank ~600 TLDs for “mathiness.”
      • Security demo: rank ~1,400 call chains from a SonicOS patch (BinDiff + Binary Ninja HLIL) to find the changed function cluster that fixed an off-by-one SSLVPN bypass; the top result matched the actual fix in minutes and at low cost (~$1).
    • Practical tips: include a reasoning flag to get LLM explanations, and cluster functions by call graph to catch interprocedural fixes (a clustering sketch follows this list).
  2. slice

    • Purpose: scale static-analysis + LLM triage for source-code vulnerability classes (example: use-after-free).
    • Method:
      • Run a deliberately broad static query (CodeQL) with loose interprocedural matching to cast a wide net and generate many candidates.
      • Use tree-sitter to validate call relationships.
      • Feed query results into slice filter/triage/analyze templates that use LLMs to:
        • Triage cheaply (e.g., a smaller model) to eliminate false positives.
        • Analyze remaining candidates more deeply with a stronger model for exploitability, execution path, and recommended next steps (a two-stage triage sketch follows this list).
    • Example: CodeQL returned ~217 use-after-free candidates; slice triage reduced them to 7, and analysis identified the known vulnerability (the published SMB2 use-after-free).
  3. LLM-assisted sandboxed triage (agents/subagents)

    • Workflow: take top-ranked code snippets/functions and pass them to a sandboxed LLM agent (Claude Code with subagents) that mounts the codebase and traces reachability, impact, and behavior.
    • Use-case shown: IoT device binary decompilation; rank functions for business-logic smell, then agent triage found a diagnostic/login endpoint (a possible backdoor) that was not obviously documented.
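To make the “inflection point” step concrete (referenced in the rank Method list above), here is a simplified cut heuristic over per-item batch statistics; the tool's actual statistics are not specified in this summary, so treat this as an assumption-laden sketch:

```python
from statistics import mean, stdev

def inflection_cut(scores):
    """scores: {item: [score, ...]} accumulated across many LLM-ranked
    batches. Sorts items by mean score and cuts at the largest drop
    between neighbours, a crude stand-in for inflection-point detection."""
    stats = sorted(
        ((item, mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
         for item, vals in scores.items()),
        key=lambda t: t[1], reverse=True)
    if len(stats) < 2:
        return stats
    means = [m for _, m, _ in stats]
    gaps = [means[i] - means[i + 1] for i in range(len(means) - 1)]
    cut = gaps.index(max(gaps)) + 1
    # keep the (item, mean, stddev) tuples above the biggest drop;
    # stddev indicates how stable each item's rank was across batches
    return stats[:cut]
```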
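The call-graph clustering tip from the rank section can be implemented with plain connected components, so interprocedural fixes get ranked as one unit. A small sketch, assuming the call graph has already been extracted (e.g., from Binary Ninja) as an adjacency mapping of function names:

```python
from collections import deque

def call_graph_clusters(calls):
    """calls: {caller: {callee, ...}}. Returns connected components
    (edges treated as undirected), so each cluster of related functions
    can be submitted to the ranker as a single document."""
    adj = {}
    for caller, callees in calls.items():
        adj.setdefault(caller, set()).update(callees)
        for callee in callees:
            adj.setdefault(callee, set()).add(caller)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            fn = queue.popleft()
            if fn in comp:
                continue
            comp.add(fn)
            queue.extend(adj[fn] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters
```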
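Finally, the slice-style two-stage triage (cheap model to discard false positives, stronger model for in-depth analysis) might look like the following. The OpenAI client is used as one possible backend; the model names and prompts are illustrative assumptions, not slice's actual templates:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model, prompt):
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def two_stage_triage(candidates, cheap="gpt-4o-mini", strong="gpt-4o"):
    """candidates: code snippets flagged by a deliberately broad CodeQL
    query. Stage 1 drops obvious false positives with a cheap model;
    stage 2 analyzes survivors in depth with a stronger one."""
    survivors = []
    for snippet in candidates:
        verdict = ask(cheap, "Answer YES or NO only: could this plausibly "
                             "be a real use-after-free?\n\n" + snippet)
        if verdict.strip().upper().startswith("YES"):
            survivors.append(snippet)
    reports = [ask(strong,
                   "Assess exploitability, the execution path from free to "
                   "use, and recommended next steps:\n\n" + snippet)
               for snippet in survivors]
    return survivors, reports
```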
