Modern software systems are developed and deployed at a pace that traditional security testing struggles to match. With constant updates, rapid release cycles, and increasing complexity, there’s a growing need for better, faster, and continuous testing to identify vulnerabilities before they can be exploited.
AI-driven hackbots, autonomous penetration-testing agents, offer a promising solution. These systems can simulate human-like ethical hacking, automate vulnerability discovery, and scale security testing far beyond manual methods. By leveraging advanced reasoning and automation, hackbots can keep up with the speed and scale of modern development.
While traditional scanners usually go for breadth, hackbots can also go for depth, making them an attractive addition to the cybersecurity arsenal. However, these systems are not without risks. A primary concern is hallucinations, where AI generates inaccurate or misleading results. In security contexts, this could mean false positives, missed threats, or flawed guidance. To be truly effective and trustworthy, hackbots must themselves be rigorously tested and validated to ensure their outputs are reliable and actionable.
Evaluating AI Pentesting
At Ethiack, we place reliability at the core of our work. That’s why we approach the development of our own multi-agent AI pentesting system - a.k.a. Hackbot - with a rigorous, scientific mindset.
To support this, we rely on robust evaluation frameworks, and the OWASP Benchmark [1] stands out as especially valuable. Unlike many datasets, it offers a comprehensive set of test cases with clearly defined ground truth labels - each example explicitly marked as either vulnerable or secure. This allows for an accurate evaluation of the Hackbot’s precision: its ability to correctly identify real vulnerabilities while avoiding false alarms.
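For concreteness, here is what that measurement boils down to: a minimal sketch in Python, where the `Sample` container and function names are our own illustrative shorthand rather than OWASP Benchmark tooling.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    truly_vulnerable: bool  # ground-truth label from the benchmark
    flagged: bool           # whether the tool reported a finding

def precision(samples: list[Sample]) -> float:
    """Share of reported findings that correspond to real vulnerabilities."""
    tp = sum(s.flagged and s.truly_vulnerable for s in samples)
    fp = sum(s.flagged and not s.truly_vulnerable for s in samples)
    return tp / (tp + fp) if (tp + fp) else 1.0

def false_positive_rate(samples: list[Sample]) -> float:
    """Share of secure samples that were wrongly flagged as vulnerable."""
    fp = sum(s.flagged and not s.truly_vulnerable for s in samples)
    negatives = sum(not s.truly_vulnerable for s in samples)
    return fp / negatives if negatives else 0.0
```

Because every benchmark case carries an explicit label, both metrics are exact counts rather than estimates.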
Our experiments are conducted from a dynamic analysis perspective, i.e., without access to source code. Furthermore, since rigor is one of our core principles, we evaluate our Hackbot using the following key guidelines:
- Unbiased Assessment
The Hackbot receives no information about the presence or type of vulnerability in any sample. Input prompts are neutral and not tailored to this specific evaluation, ensuring an unbiased testing process.
- Realistic and Sustainable Operation
The number of recursive reasoning cycles (agentic recursions) the Hackbot is allowed per sample is deliberately limited, as sketched after this list. This constraint reflects a realistic and sustainable usage scenario, ensuring that measured performance mirrors practical, real-world conditions.
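In practice, such a cap amounts to a hard budget on the agent's reason-act loop. A minimal sketch, where the agent interface and budget value are hypothetical rather than our production code:

```python
MAX_RECURSIONS = 8  # illustrative budget, not our production value

def assess(agent, sample) -> str:
    """Run the agent's reason-act loop under a hard recursion cap."""
    state = agent.initial_state(sample)    # hypothetical agent API
    for _ in range(MAX_RECURSIONS):
        state = agent.step(state)          # one agentic recursion
        if state.verdict is not None:      # conclusion reached early
            return state.verdict           # e.g. "vulnerable" / "secure"
    return "inconclusive"                  # budget spent, no forced guess
```

The key design choice is the final line: when the budget runs out, the agent reports inconclusive rather than guessing, which keeps false positives down at the cost of some coverage.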
With this rigorous and grounded evaluation setup in place, we can objectively measure our Hackbot’s performance against traditional vulnerability scanners. The following results highlight its effectiveness, particularly in detecting Command Injection and Path Traversal vulnerabilities - two categories that can be challenging for conventional dynamic analysis tools such as OWASP ZAP [2, 3].
Particularly impressive to us is that the Hackbot achieves these results while maintaining a false positive rate below 1% on both vulnerability types (in fact, 0.0% on Path Traversal!). This is on par with our non-Hackbot Machine testing, which sees 0.5% false positives.
Introducing the Verifier
Throughout its operation, Hackbot leverages a suite of custom tools specifically built to verify the existence of vulnerabilities. For example, Cross-Site Scripting (XSS) vulnerabilities can be confirmed using a headless browser to detect whether a dialog was executed, while Remote Code Execution (RCE) may be verified via interactions with a collaborator server (such as interact.sh from Project Discovery). But beyond these granular verification steps, the Hackbot has a secret weapon: a specialized sub-agent known as the Verifier.
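Before we get to the Verifier, a quick illustration of one of those granular checks: the dialog-based XSS confirmation. A minimal sketch using Playwright - our choice of headless browser here is an assumption, not a detail the Hackbot’s architecture prescribes:

```python
from playwright.sync_api import sync_playwright

def dialog_fired(url: str) -> bool:
    """Return True if loading the URL triggers a JavaScript dialog,
    confirming that an injected alert()/confirm() payload executed."""
    fired = False
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        def on_dialog(dialog):
            nonlocal fired
            fired = True
            dialog.dismiss()  # close the dialog so the page can finish loading

        page.on("dialog", on_dialog)
        page.goto(url, wait_until="networkidle")
        browser.close()
    return fired
```

The point of checks like this is that the evidence is behavioral: a dialog either fired in a real browser or it did not, leaving no room for the model to hallucinate a confirmation.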
The Verifier plays a critical role within the Hackbot’s architecture. It is designed to verify the Hackbot’s assessments by performing a deeper, more focused reflection on the system’s reasoning process. Acting as a post-processing step, the Verifier re-evaluates conclusions drawn during the Hackbot’s iterative analysis. This additional layer of reasoning effectively reduces false-positive and false-negative predictions, enhancing our confidence in the final output.
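Conceptually, the Verifier reduces to a second, independent judgment over the full execution trace. A minimal sketch of that shape - the `llm.complete` call and the prompt wording are hypothetical stand-ins, not our actual implementation:

```python
def verify(finding, execution_log, llm) -> bool:
    """Re-examine a candidate finding against the complete execution trace.

    `llm.complete` is a hypothetical text-completion call, not a real API.
    """
    prompt = (
        "You are auditing a penetration-testing agent's conclusion.\n"
        f"Claimed vulnerability: {finding}\n"
        f"Full execution log:\n{execution_log}\n"
        "Reply CONFIRMED only if the log contains concrete proof of "
        "exploitation (e.g. an executed payload or leaked data); "
        "otherwise reply REJECTED."
    )
    return llm.complete(prompt).strip().startswith("CONFIRMED")

# Only findings that survive this second pass reach the final report:
# report = [f for f in findings if verify(f, logs[f], llm)]
```

Because the Verifier sees the whole trace rather than the Hackbot’s running summary of it, it can catch cases where the agent’s conclusion drifted away from what the evidence actually shows.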
For example, on the SQL Injection evaluation samples, the Verifier improves performance by increasing the rate of true positives while decreasing false positives.
Verifier vs Hallucinbot
To better understand the impact of the Verifier, we conducted a set of experiments comparing it against a deliberately altered version of our Hackbot, nicknamed Hallucinbot. This version is engineered to simulate less reliable behavior through an elevated rate of false positives, mimicking what happens when an AI system draws premature or overconfident conclusions without sufficient validation.
To induce this effect, we applied three types of modifications: (i) base-model changes, by using language models with less stable behavior; (ii) prompt alterations, by intentionally degrading key instructional cues from the original Hackbot; (iii) parameter tuning, such as increasing model temperature to introduce more randomness. By applying different combinations of these changes, we created three distinct Hallucinbot variants. This diversity allowed us to test the Verifier’s robustness in filtering out unreliable conclusions across a range of challenging and inconsistent behaviors.
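To make the three axes concrete, here is one way such variants could be expressed as configurations - the names and values are illustrative placeholders, not our actual experiment settings:

```python
BASELINE = {"model": "stable-model", "prompt": "full", "temperature": 0.2}

# Each variant combines one or more degradation axes.
HALLUCINBOT_VARIANTS = {
    "variant-a": {**BASELINE, "model": "less-stable-model"},   # (i)
    "variant-b": {**BASELINE, "prompt": "degraded"},           # (ii)
    "variant-c": {**BASELINE, "prompt": "degraded",
                  "temperature": 1.2},                         # (ii) + (iii)
}
```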
Importantly, the Verifier used in this experiment remains exactly the same as in our earlier evaluations - no modifications, tuning, or optimizations were made to accommodate the Hallucinbot. This ensures a fair and unbiased test of the Verifier’s robustness in filtering out inaccurate or overconfident conclusions.
The evaluation focused on the Cross-Site Scripting (XSS) samples from the OWASP Benchmark. By contrasting Hallucinbot’s performance with and without the Verifier in place, we can observe how post-hoc reasoning improves overall reliability. The Verifier’s ability to critically assess and refine the Hallucinbot’s conclusions proves especially valuable in filtering out hallucinated vulnerabilities and preserving trust in the system’s output.
The results below illustrate just how effective this Verifier can be in mitigating one of the core challenges in AI-driven security testing.
Real-World Corrections: The Verifier in Action
Finally, we present real-world examples where the Verifier successfully corrected false-positive vulnerability detections made by the Hackbot. These cases highlight the practical value of the Verifier beyond synthetic benchmarks, demonstrating how it refines results even in complex, ambiguous scenarios.
Each example shows how initial detections, while technically plausible, were ultimately rejected through deeper analysis by the Verifier. These real-world corrections underscore the importance of validation in AI-driven security and showcase how the Verifier plays a key role in building trust and reliability into AI-driven pentesting systems.
Example 1: Stored XSS in Jenkins Job Description
In this first case, the Hackbot flagged a potential Stored Cross-Site Scripting (XSS) vulnerability after detecting a base64-encoded payload in a Jenkins job description. At first glance, the finding seemed legitimate. However, upon reviewing the full execution logs, the Verifier determined that the Hackbot had merely observed existing content rather than injecting or confirming any executable payloads. Lacking concrete evidence of exploitation, the Verifier correctly dismissed the report as a false positive.
Example 2: SSRF via Redirect Functionality on HackerOne
In this second example, the Hackbot reported a possible Server-Side Request Forgery (SSRF) vulnerability based on response time variability and permissive redirect behavior. While the surface indicators suggested suspicious activity, the Verifier’s analysis of the complete logs found no proof of internal resource access or data leakage. Without evidence of actual exploitation, the Verifier invalidated the report, reinforcing accuracy by filtering out an overreaching conclusion.
Final Thoughts
As AI becomes more involved in security testing, ensuring the reliability of its outputs is essential. The Verifier is a good example of how thoughtful design can address one of AI’s core challenges: knowing when it’s wrong. While not foolproof, this kind of post-hoc verification helps reduce false conclusions and brings us closer to more dependable automation in pentesting.
This is just one of the innovations needed to create reliable hackbots that help us secure the internet. We’ll be discussing what’s next in hackbots at HackAIcon, with confirmed speakers from HackerOne, Tenable, Bedrock, and more.
See you there?
References
[1] OWASP Foundation. "OWASP Benchmark Project." https://owasp.org/www-project-benchmark/
[2] Potti, U., H. Huang, H. Chen, and H. Sun. 2025. "Security Testing Framework for Web Applications: Benchmarking ZAP V2.12.0 and V2.13.0 by OWASP as an example." ArXiv abs/2501.05907. https://doi.org/10.48550/arXiv.2501.05907
[3] OWASP ZAP. "Benchmark." https://www.zaproxy.org/docs/scans/benchmark/