Security research isn’t about finding bugs—it’s about deciding where to look for them.
Whether you are triaging scanner alerts, analyzing a massive firmware patch, or hunting for zero-days, the bottleneck is almost always resource constraints. You have 2,000 changed functions, 500 alerts, and only enough coffee to analyze a fraction of them.
A recent paper by Caleb Gross, “Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank,” introduces a novel solution: treat security prioritization not as a guessing game, but as an Information Retrieval (IR) problem.
The result is SiftRank, an algorithm that uses Large Language Models (LLMs) to rank thousands of items in linear time. It found a specific vulnerability function among 2,197 candidates in 99 seconds for $0.82.
The Problem: “Score This Function”
The author initially tried to solve the patch-diffing problem by feeding decompiled functions to an LLM and asking for a relevance score (1-10).
The failure modes were predictable:
- Inconsistent Scoring: LLMs struggle to produce calibrated, absolute numbers. Is this function an “8” or a “7”? The distinction is arbitrary to the model.
- Context Loss: Compressing a complex function into a single integer loses nuance.
The Solution: Switch from pointwise scoring to listwise ranking. Instead of asking, “How relevant is this?”, ask, “Which of these 10 items is most relevant?”
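The reframing is easiest to see in the prompts themselves. A minimal sketch (the wording is assumed for illustration, not taken from the paper):

```python
def pointwise_prompt(doc: str, query: str) -> str:
    # The framing SiftRank abandons: ask for an absolute 1-10 score.
    return (f"Query: {query}\n\nDocument:\n{doc}\n\n"
            "Rate this document's relevance to the query from 1 to 10.")

def listwise_prompt(docs: list[str], query: str) -> str:
    # The framing SiftRank adopts: ask for a relative ordering of a batch.
    numbered = "\n".join(f"[{i}] {d}" for i, d in enumerate(docs, 1))
    return (f"Query: {query}\n\nDocuments:\n{numbered}\n\n"
            "List the document numbers from most to least relevant.")
```

The listwise prompt never asks the model to calibrate an absolute scale; it only asks for a comparison, which is the judgment LLMs are reliable at.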
The SiftRank Algorithm
SiftRank is a listwise ranking algorithm. Rather than scoring items one at a time (pointwise, which is fast but unreliable) or comparing every pair (O(n²)), it ranks small batches of roughly 10 items at a time, iterating until it converges on the winner.
It relies on three mechanisms to keep the total work linear in n:
- Stochastic Sampling: Randomly sample small batches to average out positional bias.
- Inflection-Based Convergence: Detect when the “relevant” items separate from the “irrelevant” ones automatically.
- Iterative Refinement: Filter out the junk and re-rank the survivors.
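The first mechanism is simple to sketch: shuffling before every batching pass means an item's neighbors and in-batch position differ from trial to trial, so positional bias averages out. A minimal sketch (helper name and signature are assumptions):

```python
import random

def make_batches(corpus, batch_size=10, rng=None):
    """Shuffle the corpus and split it into batches of at most batch_size.

    A fresh shuffle each trial gives every item different neighbors and
    different in-batch positions, so the LLM's positional bias averages
    out over many trials.
    """
    rng = rng or random.Random()
    shuffled = corpus[:]          # copy first: random.shuffle works in place
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]
```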
How It Works (The Flow)
Imagine you have 100 documents and a query. You don’t ask the LLM to rank all 100 at once (long lists invite dropped items and hallucinated orderings). Instead, you run trials.
```
+--------------------------------------------------------+
| ITERATION 1 (k=1)                                      |
|                                                        |
| Trial 1: Shuffle -> [Batch A] [Batch B] [Batch C]      |
|          LLM ranks -> position scores updated.         |
|                                                        |
| Trial 2: Shuffle -> [Batch D] [Batch E] [Batch F]      |
|          LLM ranks -> position scores updated.         |
| ...                                                    |
| Check: Has the score distribution stabilized?          |
+---------------------------+----------------------------+
                            | (convergence detected at
                            |  the inflection point)
                            v
+--------------------------------------------------------+
| REFINEMENT                                             |
| Partition the list at the "elbow" (inflection point).  |
|   - Top items    -> proceed to Iteration 2.            |
|   - Bottom items -> frozen (discarded).                |
+---------------------------+----------------------------+
                            v
              ITERATION 2 (on the reduced set)
                            ...
                            v
                      FINAL RANKING
```
Comparison of Ranking Approaches
Why is SiftRank different? Here is a comparison of standard methods:
| Approach | Description | Complexity | Pros | Cons |
|---|---|---|---|---|
| Pointwise | Score each doc individually. | O(n) | Fast. | LLMs are bad at absolute scoring; loses nuance. |
| Pairwise | Compare Doc A vs Doc B. | O(n²) or O(n log n) | Good relative judgment. | Slow for large datasets; one error ruins the chain. |
| Listwise | Rank a list of 10 docs at once. | Potential O(n) | Preserves context; efficient. | LLMs struggle with long lists; positional bias. |
| SiftRank | Listwise + Stochastic Sampling + Convergence. | O(n) | Scalable; robust to noise; detects cutoff points. | Requires multiple API calls (mitigated by concurrency). |
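The linear scaling in the table can be sanity-checked with a little arithmetic: each trial costs ceil(n / batch_size) LLM calls, and because each refinement iteration shrinks the corpus geometrically, the total number of calls stays proportional to n. A rough sketch (the trial budget and keep ratio here are illustrative assumptions, not the paper's exact parameters):

```python
import math

def total_llm_calls(n, batch_size=10, trials=25, keep_ratio=0.5):
    # Hypothetical accounting: each iteration runs `trials` shuffled
    # passes over the surviving corpus, then keeps roughly `keep_ratio`
    # of it. The geometric shrinkage keeps the total O(n).
    calls = 0
    while n > 1:
        calls += trials * math.ceil(n / batch_size)
        n = max(1, int(n * keep_ratio))
    return calls
```

Doubling the corpus roughly doubles the call count, which is what makes a 2,197-function corpus tractable for under a dollar.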
The “Magic” of Inflection-Based Convergence
One of the smartest features in SiftRank is how it decides when to stop. It doesn’t use a fixed number of trials. It looks for the Inflection Point.
As the LLM ranks batches, each item accumulates a running average of its in-batch position:
- Relevant items consistently land at position #1 or #2 (low scores).
- Irrelevant items bounce around positions #5-#10 (high scores).
The algorithm plots the scores and finds the “elbow”—the point where the curve bends sharply. This separates the candidates from the noise.
ASCII Visualization of Score Distribution:
```
Score distribution (lower is better)

Score 1.0 | ** ** ** **
          | ** ** ** **
Score 2.5 | ** ** **
          | **
Score 4.0 |                <-- Inflection Point (the "Gap")
          |
Score 6.0 |               ** ** ** **
          |               ** ** ** ** **
Score 8.0 |               ** ** ** ** ** **
          +---------------------------------------------
           [Highly Relevant]  [Irrelevant / Noise]
```
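One simple way to implement the elbow search is a max-gap heuristic over the sorted scores. This is an assumption for illustration, not necessarily the paper's exact detector:

```python
def detect_elbow(sorted_scores):
    """Return the index of the largest gap between consecutive scores.

    `sorted_scores` is ascending (lower = more relevant). Items before
    the returned index sit above the elbow; items after it are noise.
    """
    gaps = [sorted_scores[i + 1] - sorted_scores[i]
            for i in range(len(sorted_scores) - 1)]
    return gaps.index(max(gaps)) + 1
```

On a bimodal distribution like the chart above, the largest consecutive gap falls exactly at the empty band between the two clusters.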
Case Study: SonicWall Firewall Firmware
The author tested SiftRank on a real-world N-day vulnerability: CVE-2024-53704 (SonicWall SonicOS Authentication Bypass).
The Challenge:
- A firmware patch with 2,197 changed functions.
- The binary was stripped (no function names).
- The security advisory only gave vague hints (“Base64-encoded session cookies”).
The Workflow:
1. Decompile functions and generate call chains.
2. Query: the text of the security advisory.
3. Run SiftRank to sort function chains by relevance.
The Code (Pseudocode):
```python
def siftrank(corpus, query, batch_size=10, max_trials=50):
    current_corpus = corpus
    # Iterate until the corpus is narrowed to a single item.
    while len(current_corpus) > 1:
        scores = {doc: 0 for doc in current_corpus}
        # Stochastic trials
        for t in range(max_trials):
            # 1. Shuffle a copy and split into batches
            #    (random.shuffle works in place and returns None,
            #    so shuffle a copy rather than the assignment result).
            shuffled = list(current_corpus)
            random.shuffle(shuffled)
            batches = create_batches(shuffled, batch_size)
            # 2. Rank batches (concurrently, in practice)
            for batch in batches:
                ranked_batch = llm_rank(batch, query)  # LLM returns ordered list
                for position, doc in enumerate(ranked_batch):
                    # Update running average score (lower is better)
                    scores[doc] = update_average(scores[doc], position)
            # 3. Check for convergence
            if has_inflection_point_stabilized(scores):
                break
        # 4. Refinement step
        sorted_corpus = sort_by_score(scores)
        inflection_index = detect_elbow(sorted_corpus)
        # Keep the top portion for the next iteration; freeze the bottom.
        current_corpus = sorted_corpus[:inflection_index]
    return current_corpus
```
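To watch the loop behave end to end, here is a self-contained toy version. A keyword-overlap ranker stands in for the LLM, and a fixed top-half cut stands in for elbow detection; all names and parameters are illustrative:

```python
import random

def mock_rank(batch, query):
    # Stand-in for the LLM judge: order a batch by word overlap with the query.
    q = set(query.lower().split())
    return sorted(batch, key=lambda d: -len(q & set(d.lower().split())))

def toy_siftrank(corpus, query, batch_size=3, trials=20, seed=0):
    rng = random.Random(seed)
    current = list(corpus)
    while len(current) > 1:
        totals = {d: 0 for d in current}   # sum of in-batch positions
        counts = {d: 0 for d in current}   # times each doc was ranked
        for _ in range(trials):
            shuffled = current[:]
            rng.shuffle(shuffled)
            for i in range(0, len(shuffled), batch_size):
                batch = shuffled[i:i + batch_size]
                for pos, doc in enumerate(mock_rank(batch, query)):
                    totals[doc] += pos
                    counts[doc] += 1
        # Lower average position = more relevant.
        ranked = sorted(current, key=lambda d: totals[d] / counts[d])
        # Fixed top-half cut stands in for inflection-based refinement.
        current = ranked[:max(1, len(ranked) // 2)]
    return current[0]
```

Even with a crude ranker, the shuffle-rank-refine loop reliably sifts the one query-relevant document to the top.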
The Results:
SiftRank identified the specific function cluster responsible for the vulnerability fix.
| Metric | Result |
|---|---|
| Total Functions | 2,197 |
| Call Chains Analyzed | 2,713 |
| Execution Time | 99 seconds |
| Inference Cost | $0.82 |
| Model Used | gpt-5-nano-2025-08-07 |
| Accuracy | Correct cluster ranked #1 |
The algorithm successfully surfaced a function containing a vulnerable string comparison loop that exited early on a null byte—the exact mechanism for the auth bypass described in the advisory.
Why This Matters
SiftRank represents a shift in how we apply LLMs to security. We often ask LLMs to be “experts” that output the final answer. SiftRank asks the LLM to be a “comparator”—a judge that sorts options.
This approach is:
- Domain Agnostic: It works on code, text, JSON, or web data.
- Zero-Shot: No fine-tuning required. Just provide the query (advisory) and the documents (functions).
- Scalable: It handles datasets that exceed context windows by breaking them into batches.
References
- Gross, C. (2025). Sift or Get Off the PoC: Applying Information Retrieval to Vulnerability Research with SiftRank. arXiv:2512.06155.
- Sun et al. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. (RankGPT).
- Tang et al. (2024). Found in the middle: Permutation self-consistency improves listwise ranking in large language models.
- Zhuang et al. (2024). A setwise approach for effective and highly efficient zero-shot ranking with large language models.
- You can find the open-source implementation of SiftRank at: https://github.com/noperator/siftrank