Fuzzing is arguably the most successful technique we have for discovering security vulnerabilities. By bombarding software with mutated inputs and monitoring for crashes, tools like Google’s OSS-Fuzz have found tens of thousands of bugs. But despite its success, fuzzing has a massive, often-overlooked bottleneck: The Reachability Gap.
The Problem: The Reachability Gap
Traditional fuzzing assumes you have a way to talk to the software. If you’re testing a command-line utility (like a JPEG converter), that’s easy—you just pipe in a file.
However, modern software is complex. The code you want to test is often buried deep inside an application, locked behind configuration files, runtime environments, and network protocols. We call this the Reachability Gap.
THE REACHABILITY GAP
Most fuzzing tools assume a direct path to the code:
+-----------+                  +--------------+
|  Fuzzer   | --Input--------> |  Target Fn   |
+-----------+                  +--------------+
But in real-world systems, the target is buried behind barriers:
               +-----------+
               |  Fuzzer   |
               +-----------+
                     |
                     | (Input cannot reach directly)
                     v
+-----------------------------------------+
|             System Barriers             |
|                                         |
|  1. Configuration (Files, Settings)     |
|  2. Environment (Dirs, State)           |
|  3. Interaction (Network, System Calls) |
|                                         |
|             +-------------+             |
|             |  Target Fn  |             |
|             +-------------+             |
+-----------------------------------------+
**Result:** Without a manual harness, the fuzzer cannot trigger the code.
Existing automated solutions often fail here because they rely on syntactic approaches or require humans to manually write “fuzz drivers”—essentially hacking together a harness just to get the fuzzer to the front door.
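For reference, a fuzz driver in the libFuzzer style is just a tiny C entry point that forwards the fuzzer’s bytes to the code under test. In the sketch below, `parse_header` is a hypothetical stand-in for whatever internal API is being targeted; someone still has to write and maintain one of these per target, and recreate whatever environment that target expects.

```c
/* driver.c -- a typical manually written fuzz driver (libFuzzer style).
 * parse_header() is a hypothetical stand-in for the internal API under
 * test; a human still has to write one of these per target. */
#include <stddef.h>
#include <stdint.h>

int parse_header(const uint8_t *data, size_t size);  /* hypothetical target API */

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_header(data, size);  /* forward the fuzzer's bytes to the target */
    return 0;
}
```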
A Motivating Example: Nginx SSI
To illustrate how deep this gap can run, imagine you want to test the Server Side Includes (SSI) module in the Nginx web server. The SSI code isn’t exposed via a clean command-line interface, so you can’t simply point a fuzzer at it. To reach it, you have to navigate a chain of dependencies:
GOAL: Reach the SSI Parser Function

Step 1: Configuration
   [Edit nginx.conf] ----> Enable SSI Module
              |
              v
Step 2: Environment
   [Create File] --------> Place HTML with SSI syntax in Web Root
              |
              v
Step 3: Execution
   [System Call] --------> Launch Nginx Server
              |
              v
Step 4: Interaction
   [C Program] ----------> Send specific HTTP Request
              |
              v
   [ SSI PARSER TRIGGERED! ]
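To make Step 4 concrete, here is a minimal sketch of the kind of C client that closes the chain. It is illustrative only, not the agent’s actual output, and it assumes the setup from Steps 1–3: `ssi on;` in nginx.conf, an `index.shtml` containing SSI directives in the web root, and Nginx listening on 127.0.0.1:8080.

```c
/* ssi_client.c -- hypothetical sketch of the Step 4 client: an ordinary
 * HTTP request for a page containing SSI directives, so that Nginx's SSI
 * parser (ngx_http_ssi_parse) runs server-side. Assumes Steps 1-3 above. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* The request itself is plain HTTP; the interesting parsing happens
     * when Nginx serves the .shtml file and expands its SSI directives. */
    const char *req = "GET /index.shtml HTTP/1.1\r\n"
                      "Host: localhost\r\n"
                      "Connection: close\r\n\r\n";
    write(fd, req, strlen(req));

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}
```

Nothing in this client is exotic; the difficulty is that every step before it (config, files, a running server) has to be in place before the request does anything interesting.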
Even in high-profile competitions like the AI Cyber Challenge (AIxCC), participants are often provided with manually written drivers to bypass this exact problem. But to scale security testing to the vast universe of software, we cannot assume a human is available to write a harness for every feature.
Our Solution: LLM Agents + In-Vivo Fuzzing
Our hypothesis is simple: LLMs are excellent at understanding semantic context and “how to use” software, while fuzzers are excellent at finding bugs. We built a system that combines these two strengths into a fully automated, end-to-end testing methodology.
The Workflow
The process works in two distinct phases, visualized below.
PHASE 1: LLM AGENT DISCOVERY

+-------------------+                    +-----------------------+
|     LLM Agent     |                    |  Target Environment   |
| (Reasoning Engine)|                    |       (QEMU VM)       |
+--------+----------+                    +-----------+-----------+
         |                                           ^
         | 1. Action                                 | 2. Feedback
         | (Edit Config,                             | (Coverage data)
         |  Write C Client,                          |
         |  Send Request)                            |
         +--------------------->---------------------+
                               |
                               v
                     Is Target Reached? ---- NO --> Loop back to Agent
                               |
                              YES
                               |
                               v

PHASE 2: IN-VIVO AMPLIFICATION

                     +-----------------+
                     |  Snapshot State |
                     +--------+--------+
                              |
               [Fork Server / In-Vivo Fuzzer]
                              |
              Mutate Inputs (e.g., buffer size)
                              |
                  Explore deep bug variants
Phase 1: The LLM Agent
We designed an LLM agent (built on top of SWE-agent and EnIGMA) that acts as an autonomous security tester. We give it a target—usually a complex function inside a codebase (e.g., ngx_http_ssi_parse in Nginx)—and let it figure out how to get there.
The agent operates in a fully isolated environment and has access to a suite of tools:
- Code Browsing: It can search for function, struct, and macro definitions to understand the codebase.
- Environment Interaction: It can edit config files, create auxiliary files, and compile code.
- Execution Feedback: It receives code coverage feedback after every interaction to see if it’s getting “warmer” or “colder.”
The agent writes and executes C programs to interact with the target software. It iterates—reasoning, acting, observing feedback—until it successfully triggers the target function.
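How does the agent know it has hit the target? Our system reports code coverage after each interaction; as a rough illustration of how a “target reached” signal can be produced (a sketch only, not our actual instrumentation), one option is GCC/Clang’s `-finstrument-functions`, which invokes a hook on every function entry:

```c
/* target_probe.c -- illustrative sketch (not the paper's implementation):
 * one way to emit a "target function reached" signal. Build the target with
 * -finstrument-functions and link this file in; every function entry then
 * calls __cyg_profile_func_enter(), and we compare against the target's
 * address, passed via the hypothetical TARGET_ADDR environment variable
 * (e.g. taken from `nm` on the instrumented binary). */
#include <stdio.h>
#include <stdlib.h>

static void *target_addr;

__attribute__((no_instrument_function, constructor))
static void probe_init(void) {
    const char *s = getenv("TARGET_ADDR");  /* hypothetical convention */
    if (s)
        target_addr = (void *)strtoull(s, NULL, 16);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *fn, void *call_site) {
    (void)call_site;
    if (fn && fn == target_addr)
        fprintf(stderr, "[probe] target function reached\n");
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)fn; (void)call_site;
}
```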
Phase 2: In-Vivo Fuzzing
Once the agent successfully reaches the target function, we don’t just stop. We invoke in-vivo fuzzing.
We identify the “Amplification Point”—usually the system call handling the input (like recv() or read()). The program state is then forked, and a coverage-guided fuzzer takes over, mutating the input in-place to explore the code paths surrounding that target function.
This allows us to find bugs deep inside the software without ever writing a manual harness.
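Conceptually, the amplification point looks like the LD_PRELOAD-style sketch below. It only illustrates the idea, not our fuzzer: the real system snapshots program state and runs a coverage-guided mutation loop, whereas this sketch just shows where the fork and the in-place mutation happen, at the `recv()` that delivers the agent’s input.

```c
/* invivo_sketch.c -- conceptual sketch of an "amplification point" (not the
 * actual fuzzer): intercept recv() via LD_PRELOAD, and at the moment the
 * agent-crafted input arrives, fork so a child re-executes the downstream
 * code with a mutated copy of the buffer.
 * Build: gcc -shared -fPIC invivo_sketch.c -o invivo.so -ldl */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

ssize_t recv(int sockfd, void *buf, size_t len, int flags) {
    static ssize_t (*real_recv)(int, void *, size_t, int);
    if (!real_recv)
        real_recv = (ssize_t (*)(int, void *, size_t, int))dlsym(RTLD_NEXT, "recv");

    ssize_t n = real_recv(sockfd, buf, len, flags);
    if (n <= 0)
        return n;

    /* Amplification point: the parent continues with the original bytes,
     * the child continues with a mutated copy. A real in-vivo fuzzer loops
     * here under coverage guidance instead of forking once. */
    if (fork() == 0) {
        unsigned char *p = buf;
        p[rand() % n] ^= 0xFF;  /* stand-in for a proper mutation engine */
    }
    return n;
}
```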
The Results: Does It Work?
We put our methodology to the test against real-world software, including Nginx, Dnsmasq, Janus, ProFTPD, and the Linux kernel Btrfs driver.
1. Effectiveness of the Agent
Can an LLM actually drive complex software to specific states?
- Success Rate: In our first experiment, the agent successfully reached the target function in 56% of the tasks across five diverse projects.
- Beyond Input Generation: In the vast majority (71%) of successful cases, the agent had to do much more than just generate an input. It autonomously edited configuration files, set up runtime environments, and launched background services.
We also found that coverage feedback outperformed debugger access: when agents had access to GDB, they often got lost in low-level details, whereas with structured code coverage feedback they reached targets more efficiently.
2. End-to-End Security Testing
In our second experiment, we targeted four open-source projects (Nginx, Dnsmasq, Lighttpd, Mosquitto). We selected the 20 most complex functions in each, tasked the agent with reaching them, and then applied in-vivo fuzzing.
- Coverage Gains: Our automated approach achieved significantly higher code coverage than the manually written, expert-crafted fuzz drivers in OSS-Fuzz. In three out of four cases, the coverage increase was substantial.
- Real-World Vulnerability: Most importantly, the system discovered a previously unknown vulnerability in Dnsmasq (an out-of-bounds read). The bug was confirmed, patched by the maintainers, and assigned CVE-2025-54318.
Conclusion
The reachability gap has been a silent ceiling limiting the scalability of automated security testing. By shifting the burden of “how to run the software” from human engineers to LLM agents, we can finally apply the power of fuzzing to arbitrary software systems.
Our results show that this isn’t just theoretical—it works. LLM agents can navigate complex system configurations to trigger deep internal states, and when paired with in-vivo fuzzing, they can outperform manual efforts and find real bugs.
To explore our tool and data, visit our GitHub repository: GPSapia/ReachabilityAgent_ICSE
Paper Reference: Sapia, G., & Böhme, M. (2026). Scaling Security Testing by Addressing the Reachability Gap. In Proceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26).