Summary of "Scaling LLM-Based Vulnerability Research via Static Analysis and Document Ranking"
Core idea
Use large language models (LLMs) as scalable, low-cost rankers/filters to find high‑value items (vulnerabilities, suspicious functions, etc.) inside very large data sets (patch diffs, binaries, source trees, alert queues). The approach applies document-ranking + Monte Carlo sampling: repeatedly sample small batches, have an LLM rank them, keep the best, and iterate to narrow down to high-value items.
Analogy: sift a huge pile of sand by repeatedly sampling small handfuls, ranking them, keeping the best, and iterating — applied algorithmically with LLMs.
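The sample-rank-keep loop can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual `rank` implementation: `score_fn` takes the place of an LLM ranking call, and all parameter names and defaults are assumptions.

```python
import random
import statistics

def rank_items(items, score_fn, batch_size=10, rounds=200, keep_frac=0.25):
    """Monte Carlo ranking: repeatedly sample small batches, rank each batch,
    accumulate per-item scores, then keep the top fraction.

    score_fn stands in for an LLM call; here it returns a relevance score
    for a single item (higher = more relevant)."""
    scores = {item: [] for item in items}
    for _ in range(rounds):
        batch = random.sample(items, min(batch_size, len(items)))
        # An LLM would rank the whole batch at once; we approximate that by
        # scoring each member and recording its position within the batch.
        ranked = sorted(batch, key=score_fn, reverse=True)
        for pos, item in enumerate(ranked):
            scores[item].append(len(ranked) - pos)  # higher = better
    means = {item: statistics.mean(s) for item, s in scores.items() if s}
    keep = max(1, int(len(means) * keep_frac))
    return sorted(means, key=means.get, reverse=True)[:keep]
```

Because each item is only ever compared against a small batch, no single prompt needs the full data set in context, which is what makes the approach token-efficient.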
Key tools & techniques demonstrated
- rank (command-line)
- Purpose: document-ranking over large lists of items (strings, functions, call chains, TLD list demo).
- Method:
- Repeated small-batch sampling (e.g., 10 items).
- Use an LLM to rank items in each batch.
- Collect statistics (mean, stddev), detect an “inflection point,” keep top items, iterate to narrow to a high-value subset.
- Features:
- Outputs JSON with scores and LLM reasoning (explainability).
- Works with remote APIs (OpenAI) or local models (Ollama, Qwen).
- Token-efficient.
- Example demos:
- Fun demo: rank ~600 TLDs for “mathiness.”
- Security demo: rank ~1,400 call chains from a SonicOS patch (extracted via bindiff + Binary Ninja HLIL) to find the changed-function cluster that fixed an off-by-one/SSLVPN authentication bypass — the top result matched the actual fix in minutes at low cost (~$1).
- Practical tips: include a reasoning flag to get LLM explanations; cluster functions by call graph to catch interprocedural fixes.
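One way to pick the cutoff after scoring is to sort the per-item mean scores and cut at the largest gap between consecutive scores. This is a simple stand-in for the "inflection point" detection described above; the function name and approach are assumptions, not the tool's actual logic.

```python
def find_inflection(sorted_means):
    """Given per-item mean scores sorted in descending order, return the
    index just after the largest gap between consecutive scores -- items
    before the gap are kept for the next iteration."""
    if len(sorted_means) < 2:
        return len(sorted_means)
    gaps = [sorted_means[i] - sorted_means[i + 1]
            for i in range(len(sorted_means) - 1)]
    return gaps.index(max(gaps)) + 1
```

For example, scores of `[9.5, 9.1, 8.8, 4.2, 4.0, 3.9]` have an obvious cliff after the third item, so the cutoff index is 3.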
- slice
- Purpose: scale static-analysis + LLM triage for source-code vulnerability classes (example: use-after-free).
- Method:
- Run a broad/static query (CodeQL) that deliberately casts a wide net (loose interprocedural matching) to generate many candidates.
- Use tree-sitter to validate call relationships.
- Feed query results into slice filter/triage/analyze templates that use LLMs to:
- Triage cheaply (e.g., a smaller model) to eliminate false positives.
- Analyze remaining candidates more deeply with a stronger model for exploitability, execution path, and recommended next steps.
- Example: CodeQL returned ~217 use-after-free candidates; slice triage reduced to 7, and analysis identified the known exploit (the published SMB2 use-after-free).
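The cheap-model-filters, strong-model-analyzes pattern can be sketched as a small pipeline. This is an illustrative sketch, not slice's actual code: both callables stand in for LLM calls, and the `Finding` type and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Finding:
    location: str  # e.g. "fs/smb/server/oplock.c:123"
    snippet: str   # candidate code from the static query

def tiered_triage(candidates: List[Finding],
                  cheap_filter: Callable[[Finding], bool],
                  deep_analyze: Callable[[Finding], str]) -> Dict[str, str]:
    """Two-tier triage: a cheap model (cheap_filter) discards likely false
    positives; a stronger model (deep_analyze) examines survivors for
    exploitability, execution path, and next steps."""
    survivors = [f for f in candidates if cheap_filter(f)]
    return {f.location: deep_analyze(f) for f in survivors}
```

The economics come from the asymmetry: the cheap model sees all ~217 candidates, while the expensive model only sees the handful that survive filtering.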
- LLM-assisted sandboxed triage (agents/subagents)
- Workflow: take top-ranked code snippets/functions and pass them to a sandboxed LLM agent (Claude Code subagents) that mounts the codebase and traces reachability, impact, and behavior.
- Use-case shown: IoT device binary decompilation — rank functions for business-logic smell, then agent triage found a diagnostic/login endpoint (possible backdoor) not obviously documented.
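The per-finding subagent pattern boils down to giving each finding its own fresh context. A minimal sketch, assuming `spawn_agent` stands in for launching a sandboxed subagent (the names are illustrative, not any agent framework's API):

```python
def triage_with_subagents(findings, spawn_agent):
    """Dispatch each finding to its own freshly spawned agent so that no
    single conversation accumulates the whole codebase's worth of tokens.
    spawn_agent() returns a callable that triages one finding and returns
    a report."""
    reports = []
    for finding in findings:
        agent = spawn_agent()           # fresh, empty context per finding
        reports.append(agent(finding))  # agent traces reachability/impact
    return reports
```

The design choice here mirrors the note below about context windows: validation state for one finding never pollutes, or crowds out, the context for the next.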
Other tooling & integrations
- Binary analysis: Binary Ninja (HLIL decompilation), bindiff to extract changed functions/call-chains from patches.
- Static query: CodeQL for initial wide-net searches.
- AST/relationship validation: tree-sitter.
- LLM providers & models:
- OpenAI GPT family for higher-end analysis.
- Local hosting via Ollama / Qwen-3 on GPU (good cost/perf); approach is token-efficient so local models are feasible.
- Agents: Claude Code / Codex-style agents and subagents for per-finding validation, to avoid blowing the global context window.
Operational & engineering notes
- The approach tolerates many false positives early because LLMs are used as filters to reduce noise to a manageable triage set.
- Cost & speed examples:
- Ranking 1,400 call chains took minutes and a few dollars (often cents).
- Triage of 217 candidates to 7 cost roughly $1.50 in one run.
- Works purely statically (no build/run required), enabling broad scaling across many repos and firmware images.
- Useful to both offensive researchers and defenders:
- Defenders can use ranking to prioritize backlog / bug reports.
- Researchers can scale discovery across many large codebases or closed-source firmware (after extracting binaries/patch diffs).
Practical tips and recommended patterns
- Use small-batch sampling + repeated ranking (Monte Carlo style) to compute robust relevance scores.
- Cluster by call graph to capture interprocedural vulnerabilities rather than analyzing single functions in isolation.
- Add vendor documentation / feature docs to prompts to boost contextual signals.
- Use cheaper models for bulk triage and stronger models for final analysis.
- Use subagents/subtasks per finding to keep context windows small and focused.
- Prefer explicit reasoning output from LLMs for explainability and analyst trust.
Limitations, concerns, and community/operational points
- False positives are common from wide static queries — the LLM triage step is essential.
- Tooling must be transparent to be trusted by practitioners (they need to understand how it works).
- Economic and ethical debates:
- Concerns about volume of reports for low‑resourced open‑source projects (e.g., debates around FFmpeg and automated reports).
- Suggested mitigation: defenders should adopt similar ranking/triage tooling to prioritize fix efforts.
- Uncertainty about how quickly vendor-side adoption of automated search will change the bug bounty ecosystem; there remains impactful work available today.
Guides, resources, and demos mentioned
- Caleb’s talk/transcript on operator.dev (the Offensive AIcon talk) for a deeper primer and background.
- rank (CLI) — demonstration and usage patterns shown live.
- slice — released in August; includes an example reproducing the Linux kernel SMB use-after-free.
- Upcoming blog post about locally hosted models and JSON schema adherence.
Main speakers / sources
- Caleb Gross — presenter, developer of rank & slice, operator.dev blog.
- Steve — host (Off By One Security stream).
- Jonathan — co-host/participant.
- Referenced tools, communities, and researchers: Bishop Fox, Binary Ninja, bindiff, CodeQL, tree-sitter, OpenAI, Ollama / Qwen local models, Claude Code / agent systems, and public researchers (Sean, Thomas Shadwell, Daniel Mesler).