PentestJudge: Judging Agent Behavior Against Operational Requirements - arxiv.org/abs/2508.02921 by @dreadnode
Introducing PentestJudge, an LLM-as-judge system for evaluating the operations of pentesting agents. The scores are compared to human domain experts as a ground-truth…
What's after programmatic verification for offsec?
As we deploy these systems, there's a lot about pentesting we'll want to treat as eval metrics or training objectives that are difficult to verify.
Judges for non-verifiable tasks present a way forward: are they any good?
Incoming: Dreadnode paper drop from @shncldwll and the crew 🏴‍☠️
PentestJudge—Judging Agent Behavior Against Operational Requirements: arxiv.org/abs/2508.02921
Explore how we built an LLM-as-judge system for evaluating the operations of pentesting agents [inspired by @OpenAI's…
Read "Spain’s Huawei Deal Is a Wake-Up Call for U.S. Federal Procurement Reform" in @WarOnTheRocks, written by our very own Head of Policy @velvethamm3r.
✍ After talking AI Action Plan on @CyberScoopNews, wrote up @dreadnode thoughts on implementation ➡️ dreadnode.io/blog/five-take…
‼️ While we debate frameworks, adversaries build AI attack capabilities. We need: evaluation ecosystems, red teaming, and procurement standards.
Will be hanging out at the Agentic Summit this Saturday. Happy to meet up and talk agent observability, evals, and deployment for cyber security.
rdi.berkeley.edu/events/agentic…
Wrote about evals at Dreadnode. This one is for hackers getting up to speed on agents for their use cases. How do you go from PoC to prod?
Don't wait for a lab to build benchmarks that measure what you care about. Do it yourself. Here's how:
Tune in to @CyberScoopNews SafeMode podcast for an in-depth exploration of the new AI Action Plan and its sweeping implications for critical infrastructure security—featuring our very own @velvethamm3r! cyberscoop.com/radio/daria-ba…
We're heading to Vegas August 5-10! Send us a DM if you'd like to meet up onsite.
Happy to share our latest offensive agents, AI red team tooling, custom evals, and training capabilities on the Strikes platform. Plus, "shiny rocks"??
At #CriticalEffectDC, Daria Bahrami presented her pitch for an AI security roadmap to a panel of Congressional staffers in @beauwoods' Cyber Policy Shark Tank and took home first place. In a blog for @dreadnode, Daria outlines her recommendations and next steps for…
Just presented "AI at the Edge: Advancing the State of Offensive Security" with @bradpalmtree at #HammerCon 2025! Watch here: youtube.com/watch?v=JTQ6Fj…. Thread on how we got here and why this work matters for the cyber community 👇🧵 1/3
13 Followers 202 Following💀 Cyber Threat Hunter | AI Explorer | Malware Whisperer
🧠 Merging machine intelligence with human grit to outsmart digital chaos.
#CTI #Malware #OSINT #AI #Th
3K Followers 1K FollowingCTO at Robust Intelligence. Formerly, Microsoft, Endgame/Elastic, Mandiant/FireEye, Sandia & MIT Lincoln Labs.
'He who forgives ends the quarrel'
535 Followers 905 FollowingInterested in social & technical development, partner at https://t.co/xopzu9qHn1, founder of https://t.co/t4DzaJJgY2. Views my own.
6K Followers 602 FollowingCEO and founder of XBOW. Previously: Founder of GitHub Next, founder of GitHub Copilot, CEO and founder of Semmle (GitHub Advanced Security), prof at Oxford.
130K Followers 985 Following⊰•-•⦑ latent space steward ❦ prompt incanter 𓃹 hacker of matrices ⊞ breaker of jails ☣︎ ai danger researcher ⚔︎ red team bt6 ⚕︎ architect-healer ⦒•-•⊱
93K Followers 3K FollowingJournalist - cyber/national security. Author - COUNTDOWN TO ZERO DAY: Stuxnet and the Launch of the World's First Digital Weapon. https://t.co/334DzfSL1f
68K Followers 586 FollowingHigh Queen of the Cybers | Educator | Content Creator | UwU-Anointed Wapp King | Ex-Brit | https://t.co/04RRExvxXj (he/him) 🇺🇸 I run gameshows at DEF CON.
717 Followers 260 FollowingMilitary Cyber Professionals Association is dedicated to developing American military cyber professionals & investing in our nation's future through STEM ed.
34K Followers 832 FollowingProfessor in Computer Science at UC Berkeley, co-Director of Berkeley RDI Center; Building safe, secure, decentralized AI; Serial entrepreneur
87K Followers 6K Followingsecuring what matters | 🎙 pod TO CATCH A THIEF | ✍️ book THIS IS HOW THEY TELL ME THE WORLD ENDS | ex cyber @nyt | backing digital heroes @silverbuckshot 🚀
636K Followers 35 FollowingWe're an AI safety and research company that builds reliable, interpretable, and steerable AI systems. Talk to our AI assistant @claudeai on https://t.co/FhDI3KQh0n.
16K Followers 2K FollowingTargeted Ops Red Team @ TrustedSec | Hacking since Renegade BBS backdoors | Prior CrowdStrike/BHIS | In Christ's grip | I speak for myself only | K1HAQ
45K Followers 64 FollowingStudent of mind and nature, libertarian, chess player, cancer survivor. @ Keen, UAlberta, Amii, https://t.co/u8za2Kod54, The Royal Society, Turing Award
6K Followers 1K FollowingReporter @Reuters covering Google/Alphabet and AI. Formerly @Forbes. [email protected]. kenrick.01 on Signal (no PR pitches). It makes sense dramaturgically.
56K Followers 853 FollowingFiguring out AI @allen_ai, open models, RLHF, fine-tuning, etc
Contact via email.
Writes @interconnectsai
Wrote The RLHF Book
Mountain runner
602K Followers 5K FollowingPresident & CEO @ycombinator —Founder @Initialized—designer/engineer who helps founders—San Francisco Dem accelerating the boom loop—e/acc—technology brother
550 Followers 68 FollowingOpen Source community researching AI Vulnerabilities.
Report an AI Vuln: https://t.co/2sSxAZRcQo…
Join us on discord: https://t.co/gCtRKg1Z4J
1.4M Followers 1K FollowingBuilding @EurekaLabsAI. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets.
20K Followers 97 FollowingThe #1 AI Engineering podcast & newsletter. Technical insights and news today you will use at work tomorrow! Hosted by @swyx and @fanahova
488K Followers 146 FollowingNobel Laureate. Co-Founder & CEO @GoogleDeepMind - working on AGI. Solving disease @IsomorphicLabs. Trying to understand the fundamental nature of reality.
71K Followers 1K FollowingWIRED writer, author of SANDWORM and now TRACERS IN THE DARK: The Global Hunt for the Crime Lords of Cryptocurrency. Andy.01 on Signal. [email protected]
13K Followers 3K FollowingSecurity reporter @WIRED. she/her/my man. Well of course, everything looks bad if you remember it. Signal +1 (347) 722-1347 @[email protected]