Summary of "Scaling LLM-Based Vulnerability Research via Static Analysis and Document Ranking"
Core idea
Use large language models (LLMs) as scalable, low-cost rankers/filters to find high‑value items (vulnerabilities, suspicious functions, etc.) inside very large data sets (patch diffs, binaries, source trees, alert queues). The approach applies document-ranking + Monte Carlo sampling: repeatedly sample small batches, have an LLM rank them, keep the best, and iterate to narrow down to high-value items.
Analogy: sift a huge pile of sand by repeatedly sampling small handfuls, ranking them, keeping the best, and iterating — applied algorithmically with LLMs.
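The sample-rank-keep loop can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual `rank` implementation: `score_fn` takes the place of an LLM ranking call, and all parameter names and defaults are assumptions.

```python
import random
import statistics

def rank_items(items, score_fn, batch_size=10, rounds=200, keep_frac=0.25):
    """Monte Carlo ranking: repeatedly sample small batches, rank each batch,
    accumulate per-item scores, then keep the top fraction.

    score_fn stands in for an LLM call; here it returns a relevance score
    for a single item (higher = more relevant)."""
    scores = {item: [] for item in items}
    for _ in range(rounds):
        batch = random.sample(items, min(batch_size, len(items)))
        # An LLM would rank the whole batch at once; we approximate that by
        # scoring each member and recording its position within the batch.
        ranked = sorted(batch, key=score_fn, reverse=True)
        for pos, item in enumerate(ranked):
            scores[item].append(len(ranked) - pos)  # higher = better
    means = {item: statistics.mean(s) for item, s in scores.items() if s}
    keep = max(1, int(len(means) * keep_frac))
    return sorted(means, key=means.get, reverse=True)[:keep]
```

Because each item is only ever compared against a small batch, no single prompt needs the full data set in context, which is what makes the approach token-efficient.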
Key tools & techniques demonstrated
- rank (command-line)
- Purpose: document-ranking over large lists of items (strings, functions, call chains, TLD list demo).
- Method:
- Repeated small-batch sampling (e.g., 10 items).
- Use an LLM to rank items in each batch.
- Collect statistics (mean, stddev), detect an “inflection point,” keep top items, iterate to narrow to a high-value subset.
- Features:
- Outputs JSON with scores and LLM reasoning (explainability).
- Works with remote APIs (OpenAI) or local models (Ollama, Qwen).
- Token-efficient.
- Example demos:
- Fun demo: rank ~600 TLDs for “mathiness.”
- Security demo: rank ~1,400 call chains from a SonicOS patch (extracted via bindiff + Binary Ninja HLIL) to find the changed-function cluster that fixed an off-by-one/SSLVPN authentication bypass — the top result matched the actual fix in minutes at low cost (~$1).
- Practical tips: include a reasoning flag to get LLM explanations; cluster functions by call graph to catch interprocedural fixes.
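One way to pick the cutoff after scoring is to sort the per-item mean scores and cut at the largest gap between consecutive scores. This is a simple stand-in for the "inflection point" detection described above; the function name and approach are assumptions, not the tool's actual logic.

```python
def find_inflection(sorted_means):
    """Given per-item mean scores sorted in descending order, return the
    index just after the largest gap between consecutive scores -- items
    before the gap are kept for the next iteration."""
    if len(sorted_means) < 2:
        return len(sorted_means)
    gaps = [sorted_means[i] - sorted_means[i + 1]
            for i in range(len(sorted_means) - 1)]
    return gaps.index(max(gaps)) + 1
```

For example, scores of `[9.5, 9.1, 8.8, 4.2, 4.0, 3.9]` have an obvious cliff after the third item, so the cutoff index is 3.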
- slice
- Purpose: scale static-analysis + LLM triage for source-code vulnerability classes (example: use-after-free).
- Method:
- Run a broad/static query (CodeQL) that deliberately casts a wide net (loose interprocedural matching) to generate many candidates.
- Use tree-sitter to validate call relationships.
- Feed query results into slice filter/triage/analyze templates that use LLMs to:
- Triage cheaply (e.g., a smaller model) to eliminate false positives.
- Analyze remaining candidates more deeply with a stronger model for exploitability, execution path, and recommended next steps.
- Example: CodeQL returned ~217 use-after-free candidates; slice triage reduced to 7, and analysis identified the known exploit (the published SMB2 use-after-free).
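The cheap-model-filters, strong-model-analyzes pattern can be sketched as a small pipeline. This is an illustrative sketch, not slice's actual code: both callables stand in for LLM calls, and the `Finding` type and function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Finding:
    location: str  # e.g. "fs/smb/server/oplock.c:123"
    snippet: str   # candidate code from the static query

def tiered_triage(candidates: List[Finding],
                  cheap_filter: Callable[[Finding], bool],
                  deep_analyze: Callable[[Finding], str]) -> Dict[str, str]:
    """Two-tier triage: a cheap model (cheap_filter) discards likely false
    positives; a stronger model (deep_analyze) examines survivors for
    exploitability, execution path, and next steps."""
    survivors = [f for f in candidates if cheap_filter(f)]
    return {f.location: deep_analyze(f) for f in survivors}
```

The economics come from the asymmetry: the cheap model sees all ~217 candidates, while the expensive model only sees the handful that survive filtering.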
- LLM-assisted sandboxed triage (agents/subagents)
- Workflow: take top-ranked code snippets/functions and pass them to a sandboxed LLM agent (Claude Code subagents) that mounts the codebase and traces reachability, impact, and behavior.
- Use-case shown: IoT device binary decompilation — rank functions for business-logic smell, then agent triage found a diagnostic/login endpoint (possible backdoor) not obviously documented.
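The per-finding subagent pattern boils down to giving each finding its own fresh context. A minimal sketch, assuming `spawn_agent` stands in for launching a sandboxed subagent (the names are illustrative, not any agent framework's API):

```python
def triage_with_subagents(findings, spawn_agent):
    """Dispatch each finding to its own freshly spawned agent so that no
    single conversation accumulates the whole codebase's worth of tokens.
    spawn_agent() returns a callable that triages one finding and returns
    a report."""
    reports = []
    for finding in findings:
        agent = spawn_agent()           # fresh, empty context per finding
        reports.append(agent(finding))  # agent traces reachability/impact
    return reports
```

The design choice here mirrors the note below about context windows: validation state for one finding never pollutes, or crowds out, the context for the next.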
Other tooling & integrations
- Binary analysis: Binary Ninja (HLIL decompilation), bindiff to extract changed functions/call-chains from patches.
- Static query: CodeQL for initial wide-net searches.
- AST/relationship validation: tree-sitter.
- LLM providers & models:
- OpenAI GPT family for higher-end analysis.
- Local hosting via Ollama / Qwen-3 on GPU (good cost/perf); approach is token-efficient so local models are feasible.
- Agents: Claude Code / Codex-style agents and subagents for per-finding validation, to avoid blowing the global context window.
Operational & engineering notes
- The approach tolerates many false positives early because LLMs are used as filters to reduce noise to a manageable triage set.
- Cost & speed examples:
- Ranking 1,400 call chains took minutes and a few dollars (often cents).
- Triage of 217 candidates to 7 cost roughly $1.50 in one run.
- Works purely statically (no build/run required), enabling broad scaling across many repos and firmware images.
- Useful to both offensive researchers and defenders:
- Defenders can use ranking to prioritize backlog / bug reports.
- Researchers can scale discovery across many large codebases or closed-source firmware (after extracting binaries/patch diffs).
Practical tips and recommended patterns
- Use small-batch sampling + repeated ranking (Monte Carlo style) to compute robust relevance scores.
- Cluster by call graph to capture interprocedural vulnerabilities rather than analyzing single functions in isolation.
- Add vendor documentation / feature docs to prompts to boost contextual signals.
- Use cheaper models for bulk triage and stronger models for final analysis.
- Use subagents/subtasks per finding to keep context windows small and focused.
- Prefer explicit reasoning output from LLMs for explainability and analyst trust.
Limitations, concerns, and community/operational points
- False positives are common from wide static queries — the LLM triage step is essential.
- Tooling must be transparent to be trusted by practitioners (they need to understand how it works).
- Economic and ethical debates:
- Concerns about volume of reports for low‑resourced open‑source projects (e.g., debates around FFmpeg and automated reports).
- Suggested mitigation: defenders should adopt similar ranking/triage tooling to prioritize fix efforts.
- Uncertainty about how quickly vendor-side adoption of automated search will change the bug bounty ecosystem; there remains impactful work available today.
Guides, resources, and demos mentioned
- Caleb’s talk/transcript on operator.dev (the Offensive AIcon talk) for a deeper primer and background.
- rank (CLI) — demonstration and usage patterns shown live.
- slice — released in August; includes an example reproducing the Linux kernel SMB use-after-free.
- Upcoming blog post about locally hosted models and JSON schema adherence.
Main speakers / sources
- Caleb Gross — presenter, developer of rank & slice, operator.dev blog.
- Steve — host (Off By One Security stream).
- Jonathan — co-host/participant.
- Referenced tools, communities, and researchers: Bishop Fox, Binary Ninja, bindiff, CodeQL, tree-sitter, OpenAI, Ollama / Qwen local models, Claude Code / agent systems, and public researchers (Sean, Thomas Shadwell, Daniel Mesler).