Palo Alto: Using AI to Win DARPA's AI Cyber Challenge

Uploaded: 2026-06-25T20:31:09-04:00
Duration: 49 min 35 s
Description: How do you secure a podium finish in one of the world's most grueling hacking competitions? In this deep dive, Michael Brown from Trail of Bits reveals the strategy behind their success in the DARPA AI Cyber Challenge. The challenge wasn't just about...

Palo Alto Networks

06/25/2026

Transcript

Welcome to the AI Security Nexus podcast, the definitive bridge between artificial intelligence and cybersecurity. I'm your host, Charlie McCarthy, and each episode we take a deep dive into the ever-growing AI security threat landscape. This time, I sat down with Michael Brown, head of AI/ML Security Research at Trail of Bits. We touched on a wide variety of things, but most impressively, their team's recent success at DEF CON 33 in the AI Cyber Challenge presented by DARPA. This is a good one. Let's get into it.

Before we dive into the meat of this interview, do you mind getting the audience a little bit familiar with your professional journey and what brought you to this point in your career working on AI/ML with Trail of Bits?

Yeah, sure. It's been a bit of a ride, that's for sure. I've always been interested in computing as a profession. I went to the University of Cincinnati for my undergrad and got my degree in computer science. Then I went and did something completely different for like eight years after that: I joined the Army and flew helicopters. As I was getting out of the Army, I was getting married, having kids, so I decided, you know, I probably need to get back into a line of work that leaves me home a little bit more. So I got my master's degree from Georgia Tech online while I was still in the military, got out, started working for them, and kind of found my way into doing security research just through the work that I had studied as a graduate student. Did most of the PhD at Georgia Tech and then eventually moved on to come work at Trail of Bits here. I started as a senior security engineer and worked myself up to a principal security engineer. Now I lead a team of about seven or eight full-time employees who do work on AI/ML security research. So it really has kind of two primary flavors. The one that we're going to talk about today is more using AI and ML to achieve security objectives.
So this means trying to solve problems where conventional analysis techniques for cybersecurity have kind of plateaued, but now there are all kinds of opportunities to really push the boundary using the fuzzier techniques that you get with AI/ML. And then we also do a little bit of work in securing AI/ML systems. We're currently doing some work building out AI bill of materials tools, but the work that we did on the AI Cyber Challenge definitely falls into: how do you take an AI/ML tool and use it to help you solve conventional cybersecurity problems that we've been stymied by for the last several years?

Right. Awesome. Okay. Thank you for that. And also thank you for your service. What an interesting career that spans, yeah, a couple of different categories. That's pretty cool. So to your point, on the MLSecOps podcast and within the community, we, for the most part, talk about AI security on the side of the coin that covers securing the AI ecosystem. But today we're going to be talking more about the other side of the coin, using AI to secure more traditional software and protect it. Before we get into the details of Buttercup, which was super innovative, can you give the audience a little bit of the origin story of the DARPA AI Cyber Challenge, and maybe what the fundamental problem was that it was trying to solve and the stakes for the teams that were involved?

Yeah, sure. So I'll go all the way back to the beginning. DARPA is the Defense Advanced Research Projects Agency. They exist as a funder of research for which there's no immediate commercial market available. So this is kind of like, you know, medical research gets done in a lot of universities, and that's because there's not necessarily a commercial demand for it, but it lifts up humanity when we do these kinds of research projects. In this case, one of the things that DARPA had recognized is that there was a serious security gap with open source software.
A lot of the software libraries that make up applications in commercial, government, and personal use are open source software packages that are thanklessly, and without pay, being maintained by really passionate maintainers from all over the world. And when we find vulnerabilities in these systems, we don't really have the infrastructure in place to go out there and remediate them the way that we would probably expect to, given how important they are to how we build software systems these days. A good example of this is Log4j. When the vulnerabilities were discovered in Log4j, they became a huge issue because it turned out that this open source package that somebody had created, wasn't getting paid for, and was maintaining in their own volunteer hours was now pervasive. It was everywhere. Everyone was vulnerable, and they had to go back and take out the feature that was vulnerable. So DARPA recognized that securing this giant ecosystem of open source software is really tilted to the advantage of attackers. Attackers have to be right in one place at one time, and defenders have to be right all the time, everywhere. And organizations are doing a good job of defending themselves. They have IT staff, they have security engineering staff, and they have their own calculus for how they want to defend their software, but they don't look at open source software. You know, some of the big players in tech will invest via the open source security foundations in trying to improve the security posture of these various tools and ecosystems. But that's few and far between. So they came up with this idea: what if we challenge the research staff that exists to bid and propose on DARPA projects, these big swing, high risk, high reward type projects?
Let's challenge them with an open competition to build a fully automated, AI-driven cyber reasoning system that can find and patch vulnerabilities within the open source ecosystem. So the idea is that we tip the scales back in favor of defenders, where now they can use tools like AI that give us the promise of massive scale, working while we sleep, and now, with the advent of large language models and artificial intelligence more generally, the ability to go out and actually solve some of these really challenging problems. So the AI Cyber Challenge started two years ago. It consisted of three phases. The first was a concept competition. The second was an actual competition round that took place last year at DEF CON. And then the finals took place at DEF CON just a couple of months ago. Trail of Bits competed in all three of those. We're a longtime DARPA performer, so we typically work on these big, high risk, high reward type projects, the most challenging problems in security. So, yeah, and we did really well. We took second place, and we won about six million dollars in prizes along the way: for our original concept, for a win in the semifinals, and then a win in the finals.

Wow. Sorry, did you say, Michael, that the first phase of the competition started two years ago?

Yeah, that's correct.

Wow, that's crazy. And so your entire team has worked through these three phases over the last couple of years. That's a lot of time and energy as you got into the final phases. I mean, it's a pretty high stakes competition with significant prizes. What did that feel like, especially as you moved into phase three, for you and the team? You know, being in that final round and competing against, I'm sure, a lot of other super innovative solutions. What was that like for you guys, just from a personal and emotional standpoint?

It was a ride, I'm not going to lie.
So typically when we work on DARPA projects, they'll put out a solicitation, they'll give us this big problem that they want to go solve, and then we and our competitors, whether it be universities or other small businesses like Trail of Bits, all put our heads together and come up with an idea for what we think will work. We propose it, and then we hope that our bid gets accepted. And if it does, then we typically work on the project for anywhere between 18 and 36 months. But we know up front that we won the contract, and we're going to go execute on a contract. This competition was different because there was no guarantee of payoff, ever. So it was much more nerve-wracking, and certainly much more challenging. And there was the fact that it was an open competition, that anybody could go in and try to win this whole thing. There are a lot of unknowns. Usually the competitive aspect of working on federal research programs stops once you put a proposal in: you either got accepted to work on it or you didn't. Here, you're constantly dealing with a cut-down at the next level. So, for example, in the first phase, the concept white paper phase, DARPA awarded $1 million in seed funding to seven small businesses to help them offset the sort of weird financial restrictions that keep them from participating in competitions like these. Universities have it a little bit easier. They have different economics around grad students versus professional researchers. So there was a time where we were asking, what do we want to do if our concept paper doesn't win seed funding? Do we still compete at risk, or do we maybe look at teaming up with someone else who did get seed funding? And fortunately, we didn't have to broach that plan B.
You know, we were seed funded. We got a two million dollar prize in the semifinals for being one of the top seven teams and advancing to the finals. So even though it was really nerve-wracking, we still always had the financial aspect of it taken care of because we were successful at each stage. But that wasn't the case for a lot of our competitors. So when I say it was nerve-wracking and a departure from how we usually work, we had it easy compared to what a lot of our competitors had to deal with.

Yeah, that makes sense. Having that financing in the background to lessen the justification that you have to make for doing the work and having the people resources and stuff is very nice. OK, so Trail of Bits' system that took second place, again for the audience, is called Buttercup. I want to pause there because that's a memorable name. The first thing it makes me think of is The Princess Bride. So I have to ask, how did it get its name?

I wish we had a really great story for this, but we kind of don't. When I wrote the original concept for Buttercup for the competition, my kids were obsessed with the movie WALL-E, so I actually named it Patchy to start. But then a little bit further down the line, we decided that we wanted something a little bit more marketable, a little bit easier to remember. And then somebody suggested Buttercup. We were all fans of The Princess Bride. So the name kind of is a Princess Bride reference.

Yeah.

Yeah. So I mean, there were a couple of names being floated around. That one became the biggest snowball rolling downhill, and eventually it took over.

So it is much easier to remember and say than Patchy. And you don't have to worry about, you know, having an obscure Disney reference.
I think more people have seen and appreciated The Princess Bride, you know, than the small Disney robot. I don't know.

Well, I mean, it's probably the other way around, actually, but I don't know.

It resonated with me, and I've seen practically nothing. I am not up on pop culture. So there we go. OK, can you talk to us, Michael, about the design philosophy behind Buttercup? There's a presentation, I think, on the Trail of Bits website, which I'll share a link to in the transcript for the show, that mentions a guiding principle of breaking down the problem and then using the best technique for each sub-problem within that larger problem. What did that mean for y'all in practice?

Yeah, so when we started off with this project... these DARPA programs, they're massive. And most of the time when you read a problem that DARPA lays out in front of us, the first thing we say is, OK, that's impossible. And then we think about it a little bit more: OK, maybe it'll be possible if... And eventually it gets to a point where it's like, OK, this has a pretty decent chance of working if these different things all work out. Very rarely do you get a problem put in front of you that is so massive, comprised of so many sub-problems, where everything has to go right in sequence for things to work. So right off the bat, these problems are among the most challenging problems in research. You can't look up a solution for these because no one's even tried to do them before. If it's something that's been done before, it's not something that DARPA or ARPA-H or any of the government funding agencies take on. That's just not what they're there to do. So at the risk of, you know, I don't know, sounding too cool for school, I've been working at the intersection of AI/ML, compilers, software analysis, and security research for the last eight or nine years now.
So I was working on how you solve conventional security problems with AI/ML solutions before the large language model became the predominant form of the technology. Over the last couple of years, one of the things we've been helping our clientele, various government organizations, understand is: what are the actual capabilities of large language models with respect to traditional security problems? So when this competition came around, we had some unique insights, and frankly probably some more pragmatic insights, as to what these models were actually capable of doing versus what they were not capable of doing. When we took a look at how we would build the system, we said, OK, we've got to do these five things, and some of these problems are really already solved by conventional techniques. Some of these are still very open, but they lend themselves toward an AI/ML type of solution. So from the beginning, we wanted a best of both worlds approach. We wanted to combine what would work best from the AI/ML collection of problem solving techniques and the conventional software analysis collection of problem solving techniques, and piece them together and chain them together. We wanted to use a component where it was strong, and we wanted to minimize the number of compounding errors that would occur over the course of the pipeline trying to solve this long end-to-end problem. So, yeah, from the very beginning, we were never going to use AI for everything. And from the very beginning, we were never going to ignore the fact that it was the AI Cyber Challenge and try to do it with all conventional techniques. We knew both of those extremes were just never going to work. So we took a very hybrid, very pragmatic approach to going about this competition.

Got it, makes sense.
One of the things that really stood out to me within the design framework, when I was looking through that PDF presentation on the website, was the patcher portion of the design. The reason it stood out is that I have been seeing within the open source world some LLM-based vulnerability detection systems, but all that they do is detect vulnerabilities. There's not a remediation piece necessarily on the back end, and definitely not patching, at least not something that comprehensive that I've seen thus far. So the multi-agent system approach that y'all took was particularly fascinating. You had three, I think, correct me if I'm wrong: software, security, and quality engineer agents. Can you talk about what each of those agents did, why you chose a multi-agent approach, and how they work together to create that patch?

Yeah, it's a great question. So first off, I want to give kudos to DARPA and ARPA-H for the way they structured the competition. They made patching the most important part of this competition. Submitting a patch for a vulnerability was worth three times as much as discovering the vulnerability, and that's really reflective of the actual priorities we have when it comes to fixing open source software, or really any kind of software. I mean, honestly, it's kind of a general thing that affects humanity. It's much easier to point out a problem than it is to actually suggest a solution that'll work.

Exactly.

So open source maintainers, if you think about it, are constantly being told their software is broken, but no one is stepping up to help them fix it.

Right, what are you supposed to do about it?

Yeah, so the goal of this program is to take this one step further. You know, there's some great work that Google has done supporting the open source community with their tool OSS-Fuzz. It does a great job for them of finding vulnerabilities in software.
Really all they ask is, hey, if you build software packages, give us some shim code or some interface for us to be able to use our fuzzer against it, and we'll fuzz it and help you find the vulnerabilities. Well, this still doesn't handle the remediation part: what do we actually do with it? So even though, as a community, we've made tremendous investments and there's lots of work being done to find vulnerabilities, patching them is still the most important part, because otherwise they just go from being a zero-day to an n-day, and that doesn't really change anything for us. We actually need to fix these to be more secure. So, like I said, kudos to DARPA and ARPA-H for making that really the centerpiece of this. That's part of the reason why our patcher was the most complex component of Buttercup, because that was the most important part. That was the part that we really wanted to knock out of the park.

When we first started this project, it was two years ago, which is like five lifetimes in the development of AI technology these days. We had drawn some inspiration from some very early papers that weren't even peer-reviewed. They were showing up on arXiv, talking about how to solve really complex problems, because at the time you would ask GPT-3.5 or Claude Sonnet 3, hey, find a vulnerability in the software and patch it, and it just absolutely would not work. You could give it a lot of context, like here's all the code, and it still would have trouble. That was because the problem wasn't being broken down into smaller chunks that were solvable by the model. Later models from OpenAI and Anthropic, like the reasoning or thinking models, try to do this on their own, but the biggest gap in problem solving for these long, complex, end-to-end problems is that they're really hard to encapsulate into a single prompt and get a single answer from.
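The "shim code" mentioned above is what fuzzing tools usually call a harness: a small function that takes raw bytes from the fuzzer and feeds them to the library code under test. The sketch below is a toy illustration only. It does not use OSS-Fuzz or any real fuzzing engine, and `parse_record`, `fuzz_one_input`, and `run_fuzzer` are hypothetical names invented for this example. The random driver stands in for a real coverage-guided fuzzer.

```python
import random

def parse_record(data: bytes) -> bytes:
    """Toy parser under test (hypothetical): first byte is a declared payload length."""
    if not data:
        return b""
    declared = data[0]
    payload = data[1:]
    checksum = 0
    # BUG (deliberate): trusts the declared length without checking the payload size.
    for i in range(declared):
        checksum ^= payload[i]  # IndexError when declared > len(payload)
    return payload[:declared]

def fuzz_one_input(data: bytes) -> None:
    """The harness, or shim: the only thing the fuzzer needs to know how to call."""
    parse_record(data)

def run_fuzzer(target, iterations: int = 200, seed: int = 0):
    """Stand-in for a real fuzzing engine: throws random bytes at the harness."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 8)))
        try:
            target(data)
        except Exception as exc:  # a real fuzzer would detect crashes and sanitizer reports
            crashes.append((data, type(exc).__name__))
    return crashes

if __name__ == "__main__":
    found = run_fuzzer(fuzz_one_input)
    print(f"{len(found)} crashing inputs found; first: {found[0] if found else None}")
```

Each crashing input is the kind of evidence a discovery component can hand to a patcher: it proves the bug exists and gives a test case to re-run after a fix.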
You know, if I need to ask the model, hey, give me advice on what outfit I should wear this evening, it's a pretty constrained problem, and you can constrain it even further by saying, I have a blue shirt, I have a green shirt, I have a pair of jeans and a brown pair of pants, and this is the event that I'm going to, what should I wear? Asking it to find and patch a vulnerability is incredibly context-heavy to get a good answer for. The initial research had shown that if you break the problem up for a model, and you really constrain the problem you're asking it to solve to something where you can provide really rich context, you're much more likely to get a better result. But what that means is you have to chain together the sequence of problem solving. So we went with a multi-agent patching system because it let us do that for what is basically a really complex problem. We had three or four distinct personas. I think it was three for the semifinals and kind of four for the finals. By personas, I mean we had separate LLMs that were given separate prompts and separate bits of context and asked to solve a small subset of the problem. So we would go into the patcher knowing that we had discovered a vulnerability, and we'd ask an agent that had been prompted to act as a software engineer and say, hey, thanks for the program, it's great, but it has this bug on this line, and we need you to go fix it. So it would go at it with an "OK, I'm going to generate code" type of mindset, and it would generate code to fix the problem. Then that would be handed off to a quality assurance agent that had been prompted with: hey, your job is testing this code to make sure it works.
So that agent was configured with the ability to compile the code, the ability to test it against the original crashing test case that we found to make sure that the bug had been fixed, and the ability to run the unit tests, integration tests, and functionality tests that were provided along with the program, to make sure that we didn't break anything else while we tried to fix this one problem. If the quality assurance agent found issues, it would know, OK, I've got to go report this back to the software engineer. Now, occasionally these two would get stuck, and we'd have to reflect on the reasoning behind the problem. This is where we had a third persona, the security engineer, asking: how do I actually have to patch this? Because it might be a little bit more complicated than fixing a bug. So ultimately we were able to pass around different subsets of the problem to these three different agents. We additionally added some other agents that did things like better context retrieval, better reflection, that kind of thing. But for the most part, we really had these three personas working in tandem. None of them had to solve the whole problem themselves. We didn't ask, hey, generate code, but also think about the security problems, and also make sure it passes these tests.

Yeah, which I imagine is why it's so successful, because they each got their own narrow scope within those personas.

Yeah. It really allowed us to also validate the outputs much more easily from each component. So when the software engineering agent produced code, the quality assurance agent acted as basically a check to make sure that it actually passed these various tests, and if it failed, it had its own context for assessing the reason why. The same with the security engineering agents. So it made it a lot easier for us to avoid having this one giant black box where a prompt goes in, a vulnerability goes in, and we just hope a good patch comes out.

Right.
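The hand-off between the personas described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not Buttercup's actual code: the three agents are plain Python callables standing in for separately prompted LLMs, and every name here (`QAReport`, `patch_loop`, `engineer`, `qa`, `reflect`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class QAReport:
    """What the quality assurance persona checks, per the description above."""
    builds: bool        # the patched code compiles
    crash_fixed: bool   # the original crashing test case no longer crashes
    tests_pass: bool    # the program's own test suites still pass

    @property
    def ok(self) -> bool:
        return self.builds and self.crash_fixed and self.tests_pass

def patch_loop(
    bug_report: str,
    engineer: Callable[[str, str], str],        # (bug, feedback) -> candidate patch
    qa: Callable[[str], QAReport],              # patch -> validation report
    reflect: Callable[[str, str, QAReport], str],  # third persona: rethink the approach
    max_attempts: int = 4,
) -> Optional[str]:
    """Return a validated patch, or None rather than submit something unproven."""
    feedback = ""
    for attempt in range(max_attempts):
        patch = engineer(bug_report, feedback)
        report = qa(patch)
        if report.ok:
            return patch
        # First failure: send the raw QA findings back to the engineer.
        feedback = (f"builds={report.builds}, crash_fixed={report.crash_fixed}, "
                    f"tests_pass={report.tests_pass}")
        # Repeated failure: the two are stuck, so ask the third persona for guidance.
        if attempt >= 1:
            feedback = reflect(bug_report, patch, report)
    return None

def demo() -> Optional[str]:
    """Scripted stand-ins for the LLM agents, so the loop can run offline."""
    candidates = iter(["patch-v1", "patch-v2", "patch-v3"])
    def engineer(bug: str, feedback: str) -> str:
        return next(candidates)
    def qa(patch: str) -> QAReport:
        return QAReport(builds=True,
                        crash_fixed=patch != "patch-v1",
                        tests_pass=patch == "patch-v3")
    def reflect(bug: str, patch: str, report: QAReport) -> str:
        return "The fix is in the right place but changes behavior the tests rely on."
    return patch_loop("out-of-bounds read in the parser", engineer, qa, reflect)

if __name__ == "__main__":
    print(demo())
```

Because each step has a narrow job and a checkable output, a failure can be traced to a specific stage instead of to one opaque prompt.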
Yeah, because we could have just tested the patch afterward, but then, say, you go back to the model and say, hey, it didn't work. You have to hope that it kind of...

Goes back and begins to identify where things went wrong, you know.

Yeah, you have to hope that it reproduces this process that would actually take place in a real software engineering company, where you have people who have different jobs and do different things, and you need them to solve different subsets of problems. You don't just ask one 10x engineer to go handle every single issue that comes up. It's actually good to have specialists, and we recreated that, more or less, with our patcher.

Awesome, yeah, that's good logic. You mentioned the software engineer agent or persona getting tangled up or stuck with the quality engineer agent occasionally. Did that have anything to do with the hang-up that y'all called out during round two of the finals? I read something about y'all experiencing a pretty big crash in round two. What happened there, and how did you bounce back for the third round, if you can share?

Yeah, no, I can. So fortunately that was not the problem, because if it was, that would have been a harder problem to solve. We were using third-party large language models. We don't really have control over how their inner workings go, and the models are pretty non-deterministic, so we wouldn't really be able to reliably solve that problem. So, for context, for the finals we had one final scored round that took place right before DEF CON, but before that we had three practice rounds. This was to make sure that your cyber reasoning system would work when it's deployed at scale and running. So even though this was a competition and it was research focused, there was a tremendous amount of engineering for uptime and reliability that went into this.
It was also a really big focus for us as we went about solving this problem, and we thought we were doing really well. In the first exhibition round, there were only a couple of challenges, only a couple of vulnerabilities to find. We found all of them, and we did it with 100% accuracy, and that was better than any of the other teams had done, so we're like, oh man, we're doing great. We go into the second round, and we process a couple of challenges, and then the system just puked entirely. It was actually the worst possible case, because if it had just bombed out at the very beginning, we would have thought, OK, something bad happened, let's go fix this.

Yeah.

But it processed basically the first three of 18 total challenges and then just stopped. We were like, oh man.

You got a false sense of security, no pun intended.

Yeah, no, we all started polishing our resumes, and we're like, we're getting fired for this, there's no way we come out of this. And then we looked at it for a day or two and fortunately found out that it was just a very basic system issue.

Oh good.

Basically what had happened was we had to store information about crashes, and we were using information from the crash itself to make the file name, to make sure the file names would be unique. We had an assumption that that would always be a short enough file name, and it turned out to be a bad assumption. So after basically the fourth challenge, we were producing crashes that were causing us to try to give something a file name on the system that was too long, and that happened to be just the right type of crash to make the component that was in charge of vulnerability discovery completely stop working. And it stopped working for the rest of the competition time period.
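One common way to avoid this class of failure is to derive the file name from a fixed-length hash of the crash information rather than from the information itself. The sketch below is an assumption about how such a fix could look, not a description of the change the team actually made. The names `naive_crash_file_name` and `crash_file_name` are hypothetical, and the 255-byte limit is the typical per-name maximum on common Linux filesystems, used here as an assumed bound.

```python
import hashlib

MAX_NAME_BYTES = 255  # assumed limit: typical maximum file-name length on Linux filesystems

def naive_crash_file_name(crash_summary: str) -> str:
    """Fragile approach: unique, but the length grows with the crash details."""
    return crash_summary.replace("/", "_").replace(" ", "_") + ".crash"

def crash_file_name(crash_summary: str, suffix: str = ".crash") -> str:
    """Bounded approach: hash the crash details into a fixed-length, stable name."""
    digest = hashlib.sha256(crash_summary.encode("utf-8")).hexdigest()  # always 64 hex chars
    return f"crash-{digest}{suffix}"

if __name__ == "__main__":
    # A long synthetic stack trace, standing in for real crash output.
    summary = " ".join(f"frame{i} in some/deeply/nested/module.c:{i}" for i in range(40))
    print(len(naive_crash_file_name(summary).encode()), "bytes (naive)")
    print(len(crash_file_name(summary).encode()), "bytes (hashed)")
```

The same crash always maps to the same name, which keeps deduplication working, and the full crash details can be stored inside the file instead of in its name.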
So we fixed that very easily, we reran everything, and we did quite well when we looked at the 18 challenges that they had sent us. We basically redid our own run of round two using our own resources and found that we had actually found and patched vulnerabilities in all 18 challenges. So when I said earlier it's been a ride, that was the ride. Your success is never guaranteed, and you're only a year away at best from a check on how good you actually are at doing your job. Also, the open competition part of this was super nerve-wracking, because usually if you fail on research projects, it's pretty quiet. Nobody knows about it except for you and the people who are funding you. Here you had the potential to bomb, and bomb very publicly, and nobody wanted that.

Oh my gosh, what a roller coaster. But it all turned out okay, so the final scored round was obviously a huge success for y'all. Are there any key metrics, Michael, or accomplishments from that final scored round that you can call out for us, that maybe you were most proud of?

Yeah, so one of the things I was really proud of our team and our cyber reasoning system for was that we did everything well. We didn't come in first place. The team that came in first place also used a hybrid approach. They had a slightly larger team and they used some more resources, so there's a little bit of economies of scale working there. But they're from Georgia Tech, so they're from my alma mater, and if I've got to get beat by anybody, I'm happy to get beat by the people that I used to go to grad school with. So yeah, they're great folks and really, really smart. And it was kind of vindicating, because they used a really similar approach to ours.
So I think in the ideological battle of, should this all be AI, or should this all be conventional stuff, is it all hype, or is it a combination of both, the combination-of-both story really won out for first and second place. So getting back to the original question, some of the other competitors that we beat out did really, really well at finding vulnerabilities. They found more vulnerabilities than we did, but they had issues with engineering, and they weren't able to patch as many of their vulnerabilities. There was another competitor that had also reported that they had a bug that got introduced really late. So there ended up being this kind of hidden component, which was engineering discipline, and that's one of the things we pride ourselves on as a small business. We produce open source tools, and we produce research-grade tools, but we produce them at a high enough standard that people can reuse them. They don't bit rot immediately. So the engineering discipline that we brought to the problem meant that when we did everything, we did everything well, and we did it with high accuracy too. You know, I mentioned before that the people we're trying to benefit, the people we're trying to change the game for, are these open source maintainers who get told all the time that their stuff is broken, and they just have to hope they have enough volunteer time to fix it. We want to change the game so that, whether it's an AI system that's validated by a human or just the AI system itself, we're saying: here's a vulnerability, here's how we know it exists, so it's not a false positive, and here's a patch that will fix it and that, to the best of our knowledge, doesn't break anything else.
So now you don't have to solve the problem, you just have to approve the pull request. That's still not nothing, and that's not trivial, but it's a lot closer to automating the process of helping these people out and maintaining their software. If your patches don't work, or they're broken, or you can't prove that the actual vulnerability exists even though you're submitting a pull request, now you still have a lot of burden that goes on that open source maintainer. So if you actually take a look at us in second place, we had about a 90% accuracy, and basically you would have reduced accuracy if you had patches that didn't work. The reason our accuracy dropped from 100% down to 90% was that we would find out, as the competition was going on, that we had created a patch that didn't fully solve the problem, and we would throw it out and go make a better version of the patch. So that's where our accuracy drop came from. But ultimately, our design philosophy was: we are not submitting anything unless we know it to be true, unless we know it to be a true positive. False positives are such a problem with AI, and alert fatigue is a real thing. So we knew that if we went out there and produced something that had low accuracy, one, we weren't going to win the competition, because accuracy was part of the scoring, but two, nobody would use our tool afterward. The third place finisher relied on AI a bit more up front, and they were a bit more speculative, and they had an accuracy of about 45%. And we beat them by nine points. It's kind of hard to tell, because I don't have access to their data, so I don't know exactly how it happened.
It's possible that by being speculative, they kind of caught up and got into third place, but it's also possible that by allowing or being okay with false positives, they actually took themselves out of the running and let us kind of jump them for second place. I won't know until DARPA releases all the data, but I can say that by being good at everything, having solid engineering discipline, and by having high accuracy, which is reflective of the needs of real world maintainers, that's what put us in second place. One of the other things that really stood out to me as I was reading about the project was Trail of Bits seemed to be very cost effective. Especially toward the end, I think the number was y'all used only like 40% or excuse me, $40,000 of the available budget in the final round. And that seems important when you think about moving out of like a competition environment. Can you talk about that efficiency and why that might be an important thing down the road? Yeah, I mean, so even in the semifinals, Buttercup was always one of the cheapest tools to run. At Trail of Bits, we work on these government research projects, but to the maximum degree possible, and this is determined by the US government on the contract, they get the final say, we make our tools open source and make them freely available. We believe that security, the work that we do, particularly that we do on taxpayer dollars, it should be a rising tide that lifts all ships. It shouldn't be like some tool that we keep inside and hope that we can make money off of later. We hire smart people and we do really hard work. That's how we make money. We don't need to worry about trying to productize or shrink wrap this piece of software that we built. We want people to use it. And if we create a system that requires $100,000 and 15 days to run, then people aren't able to use it. 
So we always kind of went into this, particularly our best of both worlds approach, between conventional and AI-driven solutions to problems. This was always because we had costs in mind as a factor for us in terms of winning the competition and creating a tool that would actually be used after the competition. In the semifinals, we got a shout out from Andrew Carney, the program manager who ran the Cyber Challenge. He's a program manager at ARPA-H and also a former seat at DARPA. He actually said, hey, Trail of Bits, we had a minuscule budget. We had like $500 in the LLM credits for the first, for the semifinals. And they told us that, I said, you know, Trail of Bits, we were the most judicious with the use of our LLMs. I think we used something like $13 out of all of it. And we were one of the top seven teams. When we were participating in these exhibition rounds, this first exhibition round that we did well in, if you remember, I said, you know, we found all the vulnerabilities, we patched all of them. We did it for about $1,000 and we had about a $30,000 budget available. So getting to the finals, the scale in terms of like how much compute we could use, how many LLM credits we had available to us, it was massive, absolutely massive. I think, if I remember correctly, the final budget was something like $130,000. You had like $80,000 in compute that you could, you know, just burn cycles on Azure. And then you had like something like $40,000 or $50,000 in LLM credits that you could use as well. We only used $40,000. If you remember, you know, the team that came in first, they found a lot more vulnerabilities and they patched more vulnerabilities than we did, but they also spent two and a half times as much resources. They spent something close to $120,000 out of the budget. Now, to be fair, we did try to use that money. We didn't want to leave anything on the table. So we did try to scale our solution up to the maximum level possible. 
We also had to be conservative. We wanted to make sure we didn't run out of budget before we had processed all the challenges. So we had to be a little conservative, and ultimately that conservatism, you know, meant we only spent about a third of the budget that was available, but we still came in second place and we still did really, really well. And part of that is because, I mean, honestly, we probably could have gotten the same performance and spent maybe $20,000 or maybe $10,000. A lot of that was resources that came from scaling wide, not necessarily being more in-depth with our analysis. And ultimately what that has done is that has let us take Buttercup to the next step, which we made available as we were being announced as one of the winners at DEF CON. We had to open source Buttercup as part of the terms of competing in the competition. We actually took it one step further. So not only did we release the competition versions of Buttercup, we released a version that is standalone. It doesn't have anything to do with DARPA's infrastructure, so anybody can use it. More importantly, it runs on a laptop and it runs for $100. So anybody can use this now. So if you have a laptop and you're willing to download our software and you're willing to put an OpenAI, a Google, or an Anthropic LLM key in there and give us $100 of budget to work with, we can show you through a demo, using one of the sample problems that was used in the competition, that Buttercup can find a vulnerability and it can patch it, and it does it in a little bit less than an hour. So that's a simple example. We left all the hooks in to scale it up and down. So really, by being judicious with our resources from the beginning, we've made a tool now that anybody can use at any scale. You can use it on the laptop that we're recording this podcast on. A medium-sized organization can put it in the cloud and they can run it for $30,000 if that's the budget they have.
And big tech companies, we hope folks like Google, who did great work with OSS, steal this and use it to keep doing it, because they're the ones with the billions of dollars in accounts in Ireland or something that they can use to really make a dent with this. Our company doesn't have that kind of money. We're not going to be able to do that. So yeah, the reason why we were really judicious with resources was so we could achieve the reality that exists today, where anybody can use the tool: whatever budget you have, you can use Buttercup. Wow. Such a juggling act and what a testament to just the brainpower and ingenuity on your team, especially with that second place finish. And it makes it really exciting to think about the future of automated security because this opens up so many ideas and it's going to get creative juices flowing for people. One of the big focuses of our show, Michael, of course, is education for the industry. You and your team, were there any particular learning outcomes or key takeaways that you all took away as professionals from this two-year experience? And what's next? Yeah, there were definitely a couple of things that came out from this. So I'm going to start off with one that pats myself on the back, and that is we were really... But don't worry, I'm going to follow up with one that I was wrong about too. So the first one, it really vindicated our approach, coming in second place and also having the first place finishers adopt a really similar strategy to ours. We said from the beginning, even while the AI hype cycle is intense and never ending, that AI was not going to be the only... That you weren't going to be able to get in here and just win this with a prompt. You had to do, honest to goodness, real security engineering. You had to combine it with judicious and really careful applications of AI, and you had to do everything well with a strong engineering discipline to win.
Both of the teams that took first and second place did that. The team that took third place was a little bit LLM-heavy. They called themselves, or this is their words, Tyler Nighswander from Theori, he said this himself, and he's a great guy, and they did an amazing job. He said they were more AI forward, but they still had components that were conventional software security analysis in there. So nobody went into this and won this with a prompt. Nobody went in there and said, hey, I'm just going to use o3 and say, here's the code, find the vulnerability, and give me a patch. That didn't work. So it was really vindicating. So we confirmed what we already kind of knew, but what had yet to become mainstream knowledge: that AI is going to be a multiplier. It's going to help us solve problems for which we need solutions, but the problems are kind of descriptive in nature versus stuff that we can go solve with an algorithm. These models are going to work well when paired with the things that we've already built as humans to help us solve problems that are well-defined. So there's no reason to try and go back and gut these pre-existing solutions and throw them out and just throw AI at everything. We need to put these things together. It's another tool in the box. It works for a very specific set of problems that we didn't have good answers for before. So it is a huge advancement, but it's not going to replace humans at solving a lot of problems. It's not going to replace the existing solutions and tools that we have for solving a lot of problems, but it's going to help us solve new things. And by extension, it's going to help us compose solutions that solve bigger problems end to end, because now we can fill in these specific gaps that we didn't have answers for before with AI. So that's the first thing we learned. The second one is that, I'm going to be honest, for most of my career, I've been an AI skeptic, which has sadly made me a really good AI security researcher.
I was kind of blown away at a couple of different times in this competition at how well AI could perform. There were times at the beginning where I was like, this is going to bomb. The people who win this are going to get a couple of points, because AI is not going to be able to handle patching, or it's not really going to be able to make the vulnerability discovery stuff move fast enough to win this competition or to do well. Somebody was going to win it. Somebody was going to score some points, but I didn't expect the success level to be where it was. So I was blown away and kind of had to re-evaluate some of my thoughts on how effective large language models would be. When I saw how well our patcher worked in the first scored round, the semifinals, our patcher was one of the strongest components. I think our team did the best in patching. Anytime we found a vulnerability, our patcher did it. Our patcher was able to patch it. That was more than I expected. Our success rate with patching using a multi-agent large language model-based system is way above what I expected. Similarly, we use large language models to help us accelerate the process of discovering vulnerabilities with conventional approaches. We kind of attach them to a fuzzing system, and we use them to help the fuzzing system both find vulnerabilities faster and find vulnerabilities more often than we find just kind of typical bugs that aren't exploitable. I was really also stunned at how effective they were. I went into this much more skeptical than I am now. Still skeptical, but just a lot less. Now I'm at a moderate level of skepticism as opposed to very skeptical. Those are kind of the two biggest things that I took away from the competition. Got it.
I'm going to ask you for a hot take, and you kind of alluded to it in the first part of your last answer, but for the future of automated cybersecurity, do you see us getting to a point in the future, and if so, when, where these systems are kind of monitoring themselves and patching themselves and humans don't have to touch it a whole lot? Or do you envision a human always being in the loop? And what is that timeline if you had to, and I won't hold you to this quote later, because things are moving, but just right now, what does that feel like for you? I'm going to give you an answer you probably don't expect, and the answer is yes, but it's going to be heavily caveated. Yes, it's all going to be automated on its own without much human interaction. Yeah, but the answer is for certain classes of security issues. Certain classes of security issues, large language models are really well suited for. So for example, when you get a brand new wireless router that you set up in your house, if you don't change the default credentials, that's a security problem that is very much a low-level, low-hanging fruit type of security problem that large language models are going to be very adept at finding and identifying and even telling you how to fix. So are other types of common misconfigurations. Now, I think these are the kind of classes of vulnerabilities that, as we start deploying fully automated AI vulnerability scanning and remediation systems, we're largely going to automate away. But there's always going to be these really nasty classes of vulnerabilities, like vulnerabilities that cross the software stack, like memory corruption vulnerabilities, and vulnerabilities that don't even actually exist as flaws within the building blocks of the things we build. It's how we put them together, things like logic bugs. We still need humans to solve these.
And I think with AI and the large language model as it's currently architected, as the transformer architecture is currently built, you know, barring two or three major leaps or breakthroughs in technology, these models are not going to handle these bugs. So really, you know, when I say this is heavily caveated, I like to bring this back to the concept of the CVE severity score. So right now when we find a vulnerability we score it from 0 to 10 on how bad it is. And the reason why we do that is so we can feel good about ignoring it, because we don't have enough time or money to fix everything. If we think about it, when we find a vulnerability and the question comes up of should we fix this, the answer should be yes. It shouldn't be, well it depends on how bad is it. So when I talked about systems like Buttercup being able to tip the scales back in the favor of defenders, what I mean by that is we have this giant, you know, population of bugs out there that need to get fixed. A lot of them are exploitable, a lot of them aren't. But right now the people who work in the security space get inundated, and they have to try and triage and figure out which ones to fix. A lot of stuff just gets accepted as risk because we think it's not severe, but maybe it is severe and we just didn't know. It was a lack of imagination that made us rate something a 5 instead of a 9.8 as a severity score. I see a future where, you know, systems like Buttercup, they handle the low-hanging fruit and they handle it almost automated, almost completely automatic. And then the people that we have working as security engineers, they don't have to worry about triage tasks. They don't have to worry about dealing with a thousand new bug reports that come up, you know, over the course of the year.
They look at only the ones that the AI models can't solve on their own because they're too complicated, and then we use human intelligence to go solve those. So yeah, you know, I said before like AI isn't gonna fix everything and I mean that, but AI can fix a big part of it and it can do it at scale and that's still really important. Even though AI isn't gonna go out there tomorrow and stop the next, you know, Spectre, and it's not gonna prevent, you know, groups like the NSO Group from developing, you know, these zero-click and one-click iPhone messaging exploits. Those things are still gonna exist because our systems are just infinitely complex. But what we can do is we can protect all the people who, you know, forget to change the credentials on their wireless router, or cover the cases when software engineers make stupid, boneheaded mistakes that are very easy to spot with pattern recognition machines. We can fix all that stuff, and we can make it so that the cost to do that is extremely small, that it's amortized, and we take care of all the low-hanging fruit. So we clear out the clutter on the security engineer's job board and we get them focused on the things that they actually need human intelligence to solve. What a time to be alive. Like, as you're talking right now, and we've just got a couple minutes, I'm gonna wrap us up here in a minute, Michael, but as you're talking I'm just thinking now about AI, its capabilities, the blast radius, and attack surface, and envisioning, like, in 10 years. You're talking about the CVSS.
I'm like one day somebody is going to decide or a group of people are going to decide that it is okay and sufficient for us to use AI to score AI vulnerabilities or other types of vulnerabilities and you know we're not gonna have to worry about the creativity or you know us scoring it as a five and someone's gonna hack into that system and make it so that you know certain vulnerabilities are scored a certain way so that they can have this chain of attacks to just it you never know right like it's all it's just crazy. Yeah it's turtles all the way down. Probably not feasible but you know who knows what the future holds. Yeah you know I think everything you said there is pretty reasonable honestly I like there's a natural there's a natural use case for triage when it comes to using AI systems. It's a it's a great problem for it because it's fuzzy, we're tolerant of mistakes because humans make mistakes at this all the time, and these are the kinds of places for us to use AI. You know we need to try and get away from the prevailing notion that we're going to use AI to solve everything. When we need the right answer and you need the right answer every time AI is actually not really that great of a solution because they're pretty much guaranteed to give you a wrong answer at least some of the time. So you know the advances we're gonna we're gonna see are gonna be where we make smart and dutiful applications of AI for constrained context-rich problems that we have a tolerance for false positives for and once we kind of start focusing on that as a group of technologists we're gonna I think we're all gonna be really really surprised at how far this actually takes us. 
Yeah but and what we need to keep top of mind is when we're thinking about these use cases you know what's smart what we actually need it for what can improve our lives and help optimize that security piece like okay I'm using it for this thing but how can someone exploit it you know and make sure that we're defending against that too. All right sir this was a fantastic conversation I really appreciate your time for our audience go check out Buttercup at trailofbits.com slash Buttercup once again I want to thank our distinguished guest Michael Brown and Michael I hope we get to talk again very soon. I'm saying thanks for having me on.

TL;DR

Trail of Bits secured second place in DARPA's AI Cyber Challenge with Buttercup, a hybrid system combining LLM-based patching agents with traditional program analysis, winning approximately $6 million across three competition phases.
Buttercup used specialized AI agents (software engineering, quality assurance, security engineering) that mirrored real development teams, achieving about 90% accuracy while using only $40,000 of a roughly $130,000 budget, about a third of what the first-place team spent.
The team prioritized engineering discipline and accuracy over volume, refusing to submit patches without verification—a philosophy aligned with real-world open source maintainer needs and critical for avoiding alert fatigue.
Brown believes AI will largely automate remediation of common vulnerabilities (misconfigurations, default credentials, pattern-based bugs), allowing human engineers to focus on complex issues like logic bugs and memory corruption.
Trail of Bits has open sourced Buttercup, including a standalone version that runs on a laptop with about $100 of LLM budget, reflecting their belief that taxpayer-funded security research should lift all ships rather than becoming proprietary products, making cost-effective vulnerability management accessible to the broader community.

DARPA AI Cyber Challenge Overview

The Defense Advanced Research Projects Agency (DARPA) launched the AI Cyber Challenge to address a critical security gap in open source software. The competition challenged teams to build fully automated, AI-driven cyber reasoning systems capable of finding and patching vulnerabilities in open source packages without human intervention. The initiative aimed to tip the scales back in favor of defenders by leveraging AI's promise of massive scale and the ability to work continuously. Trail of Bits competed across all three phases over two years, ultimately securing second place and winning approximately six million dollars in prizes through seed funding, semifinal victories, and final placement.

Buttercup's Hybrid Architecture

Trail of Bits developed Buttercup using a hybrid approach that combined traditional program analysis with large language models. The system employed specialized AI agents organized by role: software engineering agents generated patches, quality assurance agents validated functionality through testing, and security engineering agents ensured vulnerabilities were properly addressed. This multi-agent architecture mirrored real software engineering teams, with each agent having narrow scope and clear validation criteria. The approach avoided the black box problem of single-prompt solutions while maintaining high accuracy. Buttercup complemented LLM-based patching with conventional fuzzing techniques accelerated by AI, allowing the system to discover vulnerabilities faster and more reliably than traditional methods alone.
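The role split described above can be sketched as a small retry loop. This is a minimal illustration only, not Buttercup's actual code: the `Finding` type, the three callables standing in for the agents, and the feedback strings are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Finding:
    """A vulnerability plus the input that demonstrates it (hypothetical type)."""
    description: str
    crashing_input: bytes


def produce_verified_patch(
    finding: Finding,
    write_patch: Callable[[Finding, str], str],   # stands in for the software engineering agent
    tests_pass: Callable[[str], bool],            # stands in for the quality assurance agent
    crash_fixed: Callable[[str, bytes], bool],    # stands in for the security engineering agent
    max_attempts: int = 3,
) -> Optional[str]:
    """Return a patch only if every check accepts it, otherwise None."""
    feedback = ""
    for _ in range(max_attempts):
        patch = write_patch(finding, feedback)
        if not tests_pass(patch):
            feedback = "functional tests failed"
            continue
        if not crash_fixed(patch, finding.crashing_input):
            feedback = "crashing input still triggers the bug"
            continue
        return patch
    return None  # nothing is submitted without verification


# Stub agents for demonstration; a real system would call LLMs and test harnesses.
def demo_write_patch(finding: Finding, feedback: str) -> str:
    # The first attempt is deliberately wrong; the retry reacts to feedback.
    return "add bounds check" if feedback else "no-op"


def demo_tests_pass(patch: str) -> bool:
    return True


def demo_crash_fixed(patch: str, crashing_input: bytes) -> bool:
    return patch == "add bounds check"


finding = Finding("heap overflow in parser", b"\xff" * 64)
result = produce_verified_patch(finding, demo_write_patch, demo_tests_pass, demo_crash_fixed)
print(result)
```

Keeping each role behind its own narrow check means a rejected patch comes back with a specific reason, which the patch-writing step can use on the next attempt.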

Competition Performance and Cost Efficiency

In the final scored round, Buttercup achieved approximately 90% accuracy while using only $40,000 of the available $130,000 budget, significantly less than first-place finisher Georgia Tech, which spent around $120,000. Trail of Bits prioritized engineering discipline and accuracy over volume, refusing to submit patches unless they could verify true positives. This philosophy reflected the real-world needs of open source maintainers who face alert fatigue and limited volunteer time. The team successfully found and patched vulnerabilities across all challenge categories, demonstrating that cost-effective, accurate solutions could compete with more resource-intensive approaches. A critical system bug in round two, a file naming issue that caused complete system failure, was quickly diagnosed and resolved, validating the team's engineering rigor.
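The budget figures above reduce to a few lines of arithmetic. The sketch below uses the round numbers quoted in the interview, which Brown gives from memory, so the results are approximate. The submission counts passed to `accuracy` are hypothetical and chosen only to reproduce a 90% figure.

```python
# Round figures as quoted in the interview (approximate).
budget_total = 130_000
spent_trail_of_bits = 40_000
spent_first_place = 120_000

share_used = spent_trail_of_bits / budget_total          # fraction of budget consumed
spend_ratio = spent_first_place / spent_trail_of_bits    # first place vs. second place


def accuracy(held_up: int, submitted: int) -> float:
    """Fraction of submissions that were not later withdrawn or rejected."""
    if submitted == 0:
        raise ValueError("no submissions")
    return held_up / submitted


print(f"budget used: {share_used:.0%}")          # 31%
print(f"first-place spend: {spend_ratio:.1f}x")  # 3.0x
print(f"accuracy: {accuracy(9, 10):.0%}")        # 90% with hypothetical counts
```

With these round numbers the first-place spend works out to about three times Buttercup's, while the interview describes it as roughly two and a half times, so the dollar amounts should be read as estimates.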

Future of Automated Vulnerability Management

Brown envisions AI systems like Buttercup automating the remediation of low-hanging fruit vulnerabilities (misconfigurations, default credentials, and common coding mistakes) while human security engineers focus on complex issues like memory corruption bugs, logic flaws, and cross-stack vulnerabilities. This division of labor would reduce the need for severity-based triage, allowing organizations to fix far more of the issues they discover rather than accepting risk due to resource constraints. The transformer architecture underlying current LLMs shows promise for pattern recognition tasks but, in Brown's view, will require major breakthroughs to handle the most sophisticated vulnerability classes. Trail of Bits has open sourced Buttercup, consistent with their philosophy that taxpayer-funded security research should benefit the entire community rather than becoming proprietary products.
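The division of labor described above amounts to a routing rule. The following sketch is illustrative only: the `VulnClass` names and the `route` function are hypothetical, and the split simply mirrors the categories named in the interview. Anything unrecognized goes to a human, which is a conservative assumption made for this example.

```python
from enum import Enum


class VulnClass(Enum):
    DEFAULT_CREDENTIALS = "default credentials"
    MISCONFIGURATION = "misconfiguration"
    COMMON_CODING_MISTAKE = "common coding mistake"
    MEMORY_CORRUPTION = "memory corruption"
    LOGIC_BUG = "logic bug"
    CROSS_STACK = "cross-stack"
    UNKNOWN = "unknown"


# Classes the interview describes as low-hanging fruit for automation.
AUTOMATABLE = {
    VulnClass.DEFAULT_CREDENTIALS,
    VulnClass.MISCONFIGURATION,
    VulnClass.COMMON_CODING_MISTAKE,
}


def route(findings: list[tuple[str, VulnClass]]) -> dict[str, list[str]]:
    """Split findings into an automated queue and a human queue."""
    queues: dict[str, list[str]] = {"automated": [], "human": []}
    for name, vuln_class in findings:
        # Anything not explicitly automatable is escalated to a person.
        target = "automated" if vuln_class in AUTOMATABLE else "human"
        queues[target].append(name)
    return queues


queues = route([
    ("router admin/admin login", VulnClass.DEFAULT_CREDENTIALS),
    ("world-readable config file", VulnClass.MISCONFIGURATION),
    ("use-after-free in decoder", VulnClass.MEMORY_CORRUPTION),
    ("checkout skips payment step", VulnClass.LOGIC_BUG),
    ("unclassified report", VulnClass.UNKNOWN),
])
print(queues)
```

The point of the sketch is that engineers only see the second queue, so the routine reports never reach their backlog.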

Chapters

0:00 - Introduction and Background
1:02 - Michael Brown's Career Journey
3:33 - DARPA AI Cyber Challenge Origins
7:04 - Competition Experience and Stakes
10:25 - Buttercup System Name and Design
21:21 - Multi-Agent Architecture Explained
23:12 - Round Two System Failure
26:33 - Final Round Performance Metrics
31:14 - Cost Efficiency and Budget
40:40 - Lessons Learned About AI
42:00 - Future of Automated Security
48:51 - Closing Remarks

Key Quotes

4:00 "DARPA has recognized that securing this giant ecosystem of open source software, it's really tilted in the advantage of attackers. Attackers have to be right one place at one time, and defenders have to be right all the time everywhere."
6:02 "Let's challenge them with an open competition to build a fully automated, AI driven cyber reasoning system that can find and patch vulnerabilities within the open source ecosystem."
21:40 "It really allowed us to also validate the outputs much more easily from each component. So when the software engineering agent like produced code like the quality assurance agent acted as basically a check to make sure that actually passed these various tests and if it failed it had its own context for assessing the reason why."
27:24 "They use a really similar approach to ours. So I think in kind of like the ideological battle of should this all be AI, or should this all be conventional stuff, is it all hype, or is it a combination of both, the combination-of-both story really won out for first and second place."
29:55 "We want to be able to change the game so that, whether it's an AI system that's validated by a human or just the AI system itself, we're saying: here's a vulnerability, here's how we know it exists, so it's not a false positive, and here's a patch that will fix it and that, to the best of our knowledge, doesn't break anything else."
30:07 "False positives are such a problem with AI, and alert fatigue is a real thing. So we knew if we went out there and produced something that had low accuracy, that one, we weren't going to win the competition because accuracy was part of the scoring, but two, also nobody would use our tool afterward."
32:11 "We believe that security, the work that we do, particularly that we do on taxpayer dollars, it should be a rising tide that lifts all ships. It shouldn't be like some tool that we keep inside and hope that we can make money off of later."
32:22 "We hire smart people and we do really hard work. That's how we make money. We don't need to worry about trying to productize or shrink wrap this piece of software that we built. We want people to use it."
41:07 "Our patcher was able to patch it. That was more than I expected. Our success rate with patching using a multi-agent large language model-based system is way above what I expected."
44:32 "When we find a vulnerability and the question comes up of should we fix this, the answer should be yes. It shouldn't be, well it depends on how bad is it."
45:22 "I see a future where, you know, systems like Buttercup, they handle the low-hanging fruit and they handle it almost automated, almost completely automatic. And then the people that we have working as security engineers, they don't have to worry about triage tasks."
45:51 "AI isn't gonna fix everything and I mean that, but AI can fix a big part of it and it can do it at scale and that's still really important."

FAQ

What made Buttercup's approach different from other AI Cyber Challenge competitors?

Buttercup used a hybrid architecture combining LLM-based multi-agent systems with traditional program analysis. The team organized AI agents by role (software engineering, quality assurance, security engineering) rather than using a single black box prompt, and complemented LLM patching with conventional fuzzing accelerated by AI. This approach prioritized accuracy and cost efficiency over raw volume of findings.

How did Trail of Bits maintain such high accuracy while keeping costs low?

The team refused to submit patches unless they could verify true positives, using specialized agents with narrow scope and clear validation criteria. They employed conventional program analysis to validate LLM outputs and threw out patches that didn't fully solve problems rather than accepting false positives. This engineering discipline meant they only used $40,000 of available budget while achieving 90% accuracy.

Will AI completely automate vulnerability discovery and patching in the future?

Brown believes AI will automate remediation of common vulnerability classes (misconfigurations, default credentials, pattern-based bugs) but humans will remain essential for complex issues like memory corruption, logic bugs, and cross-stack vulnerabilities. The goal is eliminating low-hanging fruit at scale so security engineers can focus on problems requiring human intelligence, rather than spending time on triage and routine issues.

Categories:

» Cybersecurity » Application Security
» Data Protection
