Transcript
Hi, Mike Matchett with Small World Big Data, and we are here today talking about security. Go figure, it's one of the hot topics, but today we're going to touch on code security. We're going to talk about secrets. Where do you keep them? How do you keep them? What makes a secret a secret, and not something everybody else knows, when it comes to your source code? We've got GitGuardian here today, and we're going to discuss some of the ways you should be protecting your secrets and some of the best practices for doing so. Hang on just a minute.
Hey. Welcome, Dwayne. Welcome to our show.
Hi, Mike. Thanks for having me.
All right. So everybody who's watching this probably knows about repositories. Going way back, we know about SVN. We know about Git, which everyone uses (all the cool kids use Git today) in a number of different places, whether it's GitHub or Bitbucket or whatever it is. But one of the things I think people chase as they're doing security is using APIs, and what we're going to call secrets today: the keys you have to provide to get access to things. And they often get stored in repositories by mistake. So tell us a little bit about what you see as the scale of the problem and how GitGuardian got started on it.
So the way we define secrets around here is anything that grants access to another system, anything that authorizes one system to talk to another. It could be a human that's using the system, but by and large we're really talking about the non-human identity scale problem. By conservative estimates, it's about 45 to 1: for every 45 machine identities, you have one person actually logging into something. That's the scale of the problem we're talking about. And we're just adding more and more APIs and more endpoints and more service accounts, more everything, as fast as we can. That's how we're growing.
One of the things we do at GitGuardian is look at every public repository on GitHub as new commits happen. I think it was 1.3 billion commits last year across, you know, 40 or 62 million repositories. We scan them for secrets and ask, hey, is there anything here that committer should know about? And last year we found 22.77 million hard-coded credentials out there on GitHub.
In public, right? You don't want to tell all those hackers out there that there are 22 million credentials just available.
Oh, I'm sorry, I said 22. It's 23.77 million. It's even worse. But the thing is, we're not announcing that to the world as it's going on. GitHub has an API endpoint that anybody can subscribe to. It's a firehose, but it's everything that happens on Git publicly, because, well, it happens in public. So attackers already know about this. They're already doing the work of "let's go find these credentials and use them against these companies." We're just the white hats out there with a Good Samaritan program: we email the committer immediately and say, hey, you did this in public, you should probably clean that up. That's the basis of our public monitoring product, and that's one of the things we sell: we can look in public for the secrets leaked by people inside your organization, in the repos you know about. But the real value is to triangulate the data and expand that perimeter to repositories you're probably not watching.
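To make that firehose idea concrete, here is a minimal sketch, in Python, of polling GitHub's public events API for push events and flagging a couple of obviously secret-shaped strings. The endpoint and the token prefixes (GitHub classic tokens start with ghp_, AWS access key IDs with AKIA) are real; everything else is deliberately simplified, and nothing here reflects how GitGuardian's own pipeline is built.

```python
import re
import time
import requests

# GitHub's public events feed: every public push, fork, star, etc.
EVENTS_URL = "https://api.github.com/events"

# A couple of well-known, prefix-based token formats (simplified).
PATTERNS = {
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_text(text):
    """Return a list of (detector_name, match) pairs found in text."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            hits.append((name, match))
    return hits

while True:
    events = requests.get(EVENTS_URL, timeout=10).json()
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo = event["repo"]["name"]
        # A real scanner would fetch each commit's diff and scan the patch;
        # here we only scan commit messages to keep the sketch short.
        for commit in event["payload"].get("commits", []):
            for detector, secret in scan_text(commit.get("message", "")):
                print(f"[{detector}] possible secret in {repo}: {secret[:8]}...")
    time.sleep(60)  # stay well under unauthenticated rate limits
```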
So if your developer, under one of their personal email accounts, somehow accidentally pushes part of a repo into public, or accidentally copies the wrong file to the wrong place: Git is awesome, but it doesn't make you more or less secure in and of itself. It's "the stupid content tracker"; that's still, to this day, what man git tells you if you type it into your terminal. So what we want to do is help organizations get an alert quickly: hey, you have a public incident and you need to get a handle on this, even though it's outside your defined perimeter, because, well, identity really is the new perimeter.
Yeah. So let me just stop there a little bit and put this in context. Not everybody seeing this is a DevOps or dev person who understands what we're even talking about with API keys. Let's put this in the context of a larger zero trust program for a company. We talk about securing secrets and identity for people, and you mentioned there are 45 machine identities for every human identity, so the problem is 45 times bigger in some ways, or at least multiplied out. What we're really referring to is that when someone writes code, and that code has to access an API or another database or some service, it has to present credentials as part of its identity, right? Because those things are password protected. And people sometimes put those credentials into plain text and push them into public repositories where anybody can read them, which is just terrible. I just want to level set with folks, because that's what we're talking about. Now tell me a bit more about how the landscape is changing and evolving with non-human identities. Is this problem getting contained when you're doing your public monitoring, or is it still growing out of control? What's happening?
Oh, I wish I could say it was getting better, but it's not. We're seeing an increase again: the 23.77 million number I shared earlier is a 25% increase from the previous year. And that's not cumulative, not added up over the years; that was just what was added in 2024. So we know this problem is accelerating, and it's not just in public repos. We also do private repo monitoring, and we look in other data sources, the places where the conversation happens around the code: Jira and Slack, Confluence, your container registries, anywhere connected to that software development lifecycle.
Right. So that's part of the zero trust thing. Even within the vast perimeter of the firewall, that ancient idea of a wall around everything, you still shouldn't be trusting everything inside it. You should be taking more of a zero trust approach, not just authenticating the identities of people. You shouldn't be putting your API keys out where everybody in the company who can access them, or who has hacked in, can see them either, right?
Yeah, that's exactly what I was going to say. In the research we did last year, we found it's significantly more likely, something like a 35% chance, that you're going to have a secret inside a private repository. But yes, it absolutely violates that zero trust principle that anybody who has that key can just pick it up and use it.
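For anyone who hasn't seen it first-hand, this is the pattern being described: a credential pasted straight into source versus the same code reading it from the environment at runtime. Both the key value and the variable name below are made up for illustration.

```python
import os

# Anti-pattern: a credential pasted straight into source code. It now lives
# in the repository history and travels with every clone, fork, and push.
HARD_CODED_KEY = "sk_live_EXAMPLE_DO_NOT_DO_THIS"  # illustrative fake value

# Better: the code carries no secret at all; the value is injected at runtime
# (from a vault, CI variable, or orchestrator) and the app fails fast if missing.
def load_api_key() -> str:
    key = os.environ.get("PAYMENTS_API_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("PAYMENTS_API_KEY is not set; refusing to start")
    return key
```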
So if you do get breached, or your code gets leaked, which happens all the time (we hear about so many data leaks and breaches that result in the release of a code base), that's not very zero trust if anybody can just pick up that API key and run with it.
So how do you then... I'm assuming you're providing automation here with your monitoring and scanning. How do you find secrets? Because it seems to me that if it's just a bunch of encrypted hash inside a code base, it might not look like anything.
Yeah, Base64 encoding is very real. Well, there are basically two kinds of detectors in the world. There's really only one kind of secret, the kind you shouldn't write down in plain text, but there are two kinds of detectors. Specific detectors go to known, mapped systems: the secrets either start with a prefix or they're a set length; they're recognizable patterns. Every secret scanner in the world, even if you write your own using regular expressions, is going to follow that. That's where you're going to start. And then there's the entire rest of the world: homegrown APIs, homegrown Kubernetes clusters, all the tools we build internally, all the microservices we've built, and that is just all over the map. For those you need contextual awareness. How is the string being used? Where is it being used? Is it granting access? Does it behave as if it's a key? That's something our platform does really, really well. We introduced a machine learning layer in the past year, and we went from conservatively saying "we're pretty sure this is a generic password," because that's what we can tell from deterministic programming, to being able to tell you with much more accuracy, "this is definitely a string that's being used elsewhere in the code to grant access; we're pretty sure this shouldn't be here."
And really, you're looking at the kinds of keys that could be used by hundreds and hundreds of different services, and you've got these rules. I've tried to write regular expression syntax, and it would be painful to duplicate any of that personally, starting from scratch. So that's good. Now, you talked about this growth of, I think you called it, non-human identities. Tell me a little more about the challenges someone has if they're trying to get a handle on this themselves, if they've got responsibility for it. In fact, you can even start by saying who is responsible for it. But what are some of the challenges?
I wish I had a magical answer on who's responsible for this. We can blame the person who put it in there. We can blame the person who actually ran it in production. We can blame the security team, because they'll eventually get blamed anyway.
Yeah.
But really, there are two sides to this challenge. We've been talking about secrets this whole time; traditionally, that's what GitGuardian is known for, that's our claim to fame. We're the best in the industry at detecting those secrets, wherever they may be. But the other side of it is governance. If you're not storing a secret in code, then where should you be storing it? That's where the conversation starts. And why are secrets part of any security program? Well, all of these things need these credentials to function. If you can't communicate, you can't get your work done.
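To make the two kinds of detectors concrete, here is a toy sketch: specific detectors are just recognizable patterns, while generic detection has to lean on weaker signals such as how a string is assigned and how random it looks (Shannon entropy). GitGuardian's contextual and machine learning layer is far more involved than this; the thresholds and patterns below are illustrative only.

```python
import math
import re
from collections import Counter

# Specific detectors: known, prefix- or length-based formats.
SPECIFIC = {
    "github_pat": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

# Generic detection: no known format, so use weak contextual signals instead.
ASSIGNMENT = re.compile(r"""(?i)(password|secret|token|api_key)\s*[:=]\s*["']([^"']{12,})["']""")

def shannon_entropy(s: str) -> float:
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def scan_line(line: str):
    findings = []
    for name, pattern in SPECIFIC.items():
        findings += [(name, m) for m in pattern.findall(line)]
    for var, value in ASSIGNMENT.findall(line):
        if shannon_entropy(value) > 3.5:   # random-looking, unlike "changeme"
            findings.append((f"generic:{var.lower()}", value))
    return findings

print(scan_line('aws_key = "AKIAIOSFODNN7EXAMPLE"'))   # caught by the specific detector
print(scan_line('db_password = "changemeplease"'))     # low entropy: ignored
```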
If you need to authenticate, well, you need something to authenticate with, and those are almost exclusively secrets: long-lived credentials or tokens of some sort. The other side of that is governance. How long should those credentials live? How long should this identity exist? You, as a human, have an offboarding date at some point; whether you like it or not, you do, you're human. Machines? Not so much. Unless there's very specific governance in place, you're going to get credentials and identities that live indefinitely, literally forever. And then attackers can find those credentials and use them against you whenever they want.
But there are other pieces as well. Are you storing it correctly? Enterprise vaults are the way to go; I think at this point in 2025 we're all in agreement on that. Your secrets should be stored encrypted at rest, encrypted in transit, and able to be pulled in programmatically only when needed, not just living in memory forever in whatever system you're putting them into. But how do you know whether it's in there correctly or not? That's our newest offering with our governance capabilities: we can map all of the secrets in your vault and say, hey, we found this secret out in public, it's also already in this vault, but it's also in these other vaults. Vault sprawl is very real. If you put the same secret across many vaults, which is the correct system of record? Which is the most correct place, the true source of truth?
Yeah, the master record. And that does sound like another problem, by the way. Not only did you say you matched what might be in your vault to something in public, which is itself a leak, but you've got duplications, which are going to create internal problems if you try to change one, and that might keep people from changing it, right? This might be a big source of the friction that keeps anybody from actually doing the right thing and rolling and rotating their secrets.
Well, that ties directly into the other problem of rotation. It's not just that we don't know where it is; we don't know which one to rotate. And one of the big problems is that a lot of the time we don't know what permissions were set by that developer initially. What's going on? Who did that years ago, and who owns it? Going back to zero trust: was it set correctly, following the principle of least privilege? Is this giving just enough access to get the work done? Unfortunately, the answer is no. Another thing we found in our State of Secrets Sprawl report is that a lot of credentials, when we analyzed them, were just way over-permissioned. Like GitHub access tokens that really only needed read access but could delete the repo if they wanted to. Not something you really want out in the world.
So that is the governance side again. We aligned with OWASP's new Non-Human Identity (NHI) Top 10 risks, the newest in the Top 10 family. We love OWASP around here. And we said, well, let's go down the list and solve this problem. Is it leaked? We're already good at that. Is it being used across environments? Are multiple things calling it? Is the secret duplicated across vaults? Is it living longer than policy should allow? It definitely shouldn't live for a year; should it live for a month?
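As a concrete picture of "encrypted at rest, pulled programmatically only when needed," this is roughly what reading a credential out of HashiCorp Vault's KV v2 engine looks like with the hvac client library. The address, mount point, path, and key name are placeholders, and a real deployment would authenticate with AppRole, Kubernetes, or cloud IAM rather than a raw token.

```python
import os
import hvac

# Connect to Vault. In production you'd authenticate with AppRole, Kubernetes,
# or cloud IAM rather than a raw token from the environment.
client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "https://vault.example.internal:8200"),
    token=os.environ["VAULT_TOKEN"],
)

# Read the latest version of a secret from the KV v2 engine. Nothing is
# hard-coded in the application; the value lives only in memory, only for
# as long as this process needs it.
response = client.secrets.kv.v2.read_secret_version(
    mount_point="secret",      # default KV v2 mount
    path="payments/stripe",    # hypothetical path
)
api_key = response["data"]["data"]["api_key"]
```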
Well, that's going to depend on your system and how you want to set it up. But that's where we're starting with our governance tooling: are you breaching these policies? And we give it to you as a map view, so you can see where that secret lives, what it connects to, what environments call it, and what ultimately is at risk if it does get abused. Part of that mapping is a secret analyzer, we call it, that does a scoping analysis and says what permissions even exist. It doesn't work on every known secret, or on the generic ones, but we are increasing the number of services it covers. So it gives you that insight: is this over-permissioned, and how do we go about solving that?
Yeah. And going back to basic principles: if you can't measure it, you can't manage it. If you don't have any visibility into where your secrets are, how they're being stored, how they're being used, or whether they're public or private, you can't do anything about it. There's compliance and regulation coming in, this governance idea you're talking about, and it's absolutely important for people to get a handle on it. And then there's even just your own IP, right? You don't want to give away the company. Even if you just don't want to be hacked, this is the vulnerability you're trying to avoid, getting taken down. So you do the mapping, the visibility, which is incredible, to actually see where secrets might be stored across your environment and its multiple vaults, you're saying. And I understand you can help people actually use their vaults better, and encourage them to rotate secrets and things. Is that true?
Yeah, absolutely. If you find a secret that's in the vault already, well, if it's in the vault and the code already accounts for pulling it from the vault, then let's just go ahead and rotate it. That's pretty straightforward to automate. Our platform is API driven; as fancy as you want to get with your scripting, that's how fast you can go with that auto-remediation. With our newest integration with the vaults, HashiCorp Vault and CyberArk specifically, we have a push-to-vault feature. Say we find a secret outside of the vault, it's not in the vault at all, because we know, because we looked. With one push of a button, we can push it in there. And then again, as fast and as fancy as you want to get with your scripting, just auto-remediate.
Let's talk about... I have to ask you this, and I apologize if it's coming out of left field, but I can't do an interview these days without talking about AI. How are AI and agentic AI affecting both the problem here and maybe encouraging better solutions?
Unfortunately, it's making the problem worse. We know that, again, from our report: it's 40% more likely that you're going to leak a secret if you are relying on Copilot. That's not a good thing. But the underlying reason isn't that Copilot wants you to be insecure; I think Copilot, at the end of the day, wants you to be secure. But it's trained on the internet. It's trained on all of GitHub. And I don't know if you've seen code on GitHub, but a lot of it's just terrible. It takes presence of mind to notice that it told me to hard-code a credential on line four of my file. It takes experience to know, hey, that's not a good idea.
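Circling back to the rotation and push-to-vault workflow described a moment ago: once applications read their credentials from the vault instead of from code, remediation can shrink to writing a new version of the secret. This is a hedged sketch against a KV v2 engine using hvac; the detection platform's side of the trigger is only stubbed, and in practice the new value would come from the upstream provider's key-issuing API, with the old key revoked there as well.

```python
import os
import secrets
import hvac

client = hvac.Client(
    url=os.environ.get("VAULT_ADDR", "https://vault.example.internal:8200"),
    token=os.environ["VAULT_TOKEN"],
)

def rotate_leaked_secret(path: str, key: str) -> None:
    """Write a new value as the latest version of an existing KV v2 secret.

    In a real setup the new value would come from the upstream provider's
    "create new API key" endpoint, and the old key would be revoked there too;
    here we just generate a random placeholder to show the vault side.
    """
    new_value = secrets.token_urlsafe(32)
    client.secrets.kv.v2.create_or_update_secret(
        mount_point="secret",
        path=path,
        secret={key: new_value},
    )
    # Consumers that pull from the vault at startup or on a refresh interval
    # pick up the new version automatically; nothing in the codebase changes.

# Hypothetical handler for a detection platform's "secret leaked" incident.
rotate_leaked_secret(path="payments/stripe", key="api_key")
```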
I know we internally use Vault, HashiCorp Vault, so I'm going to go make a call into Vault; I'm going to store this properly in Vault. But that's a process. That's human know-how. Unfortunately, we're seeing more and more people relying on "the AI told me to do it that way, so I'm just going to run with it." Especially in the world of vibe coding and low-code, no-code solutions, where people say, I really don't know what I'm doing, but I need to get a prototype up and running because I had a customer call.
Yeah. And then they put in the root privileges as well, not just a narrow privilege for what they're doing, and put that out in plain text. I think we're in trouble in a lot of cases if you're relying on that.
But AI is also helping with the answer to this. I touched on it earlier when I talked about generic secrets detection, but the machine learning model we built into the platform is all sealed off; it is completely its own thing, it doesn't make calls anywhere else in the world. It's ours. We can contextually analyze the code and say, hey, how is this being used? Is this really a secret? And that helps us do this faster and at scale, with much, much lower false positive rates. So when we raise an alert, that's actually going to be something you can take action against. It's definitely actionable, and you need to step up and do it.
Awesome. I know there's more to talk about here; you guys have some things coming along that are pretty interesting, but I think we have to wrap up right now. I'm sure we're going to get together soon and talk some more. If someone wants to get their hands on, say, that security survey you guys just completed, the State of Secrets Sprawl, or learn some more about GitGuardian and how to start getting a handle on their secrets sprawl, we'll call it that, where would you say they should start looking?
We're very proud to say that we do not gate our reports at all. So the State of Secrets Sprawl report, our Voice of Practitioners report, all of those reports: you can just go to GitGuardian.com and look under our platform section, under resources, I should say; that's the one that will actually take you to the free reports out there. Anybody who wants to try GitGuardian: we're very happy to say we have a very generous trial policy. In fact, we are completely free for individual developers. If you just want to get a handle on it as a developer, for your own work, we would love for you to use it and to stay safe out there. It's free for teams under 25, because you're a small team and you just want to do the code scanning; we would love for you to get started and develop and grow in a secure way. And for open source projects, if you're a project out there that just wants the peace of mind of everyone knowing this repo is getting scanned, we have awesome developer tools, pre-commit hooks, a VS Code extension, so you can use it completely for free under those circumstances. For everyone else, there's a 30-day business trial: you can hook up your repos and ask, how bad is the problem today, and get that quick snapshot. And there's something we also do without any signup: a quick snapshot of your public exposure.
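On the developer tools Dwayne mentions: the off-the-shelf route is GitGuardian's ggshield pre-commit hook, but even a hand-rolled hook shows the idea, scan what is about to be committed and refuse the commit if anything secret-shaped appears. This simplified sketch reuses two well-known token patterns and is not how ggshield works internally.

```python
#!/usr/bin/env python3
"""Minimal pre-commit secret check: save as .git/hooks/pre-commit (executable)."""
import re
import subprocess
import sys

PATTERNS = {
    "github_pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
}

# Only look at lines being added in the staged diff.
diff = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

findings = []
for line in diff.splitlines():
    if not line.startswith("+") or line.startswith("+++"):
        continue
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            findings.append((name, line[:60]))

if findings:
    for name, snippet in findings:
        print(f"Possible {name} in staged changes: {snippet}...", file=sys.stderr)
    print("Commit blocked. Remove the secret (and rotate it if it was real).", file=sys.stderr)
    sys.exit(1)  # non-zero exit aborts the commit
```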
You can go to our resources section, and there's a very large piece that says "Get your company GitHub audit." We can do a quick report for you on the spot, give you a quick letter grade, and let you know: is this something you should probably be looking into?
And you're not storing the secrets, by the way? You're not going in there and...
Oh, yeah. No, to be clear, to be very clear: what we do is take a hash of every secret and then keep a fingerprint. So we have a database full of these fingerprints, but that's all it is. We know it goes to something, and we know where it maps to. We know, in public, where it maps to.
And that allows you to scan public repos for somebody's internal secrets without knowing their internal secrets.
Yep, yep, yep.
Which is very cool. Wow, there's a lot going on there, and I can't believe that people don't do this. But with 23 million published plaintext, readable secrets out there in Git today, it's clearly a big problem. I don't understand it, as a security person, but people need to get on it. It seems like you have some big opportunities ahead of you. Check it out, folks. Any last recommendation? What would be your final recommendation if someone's going, "Secrets, I'm not sure where to start"?
Absolutely. Just start on our website, GitGuardian.com: sign up for it, test it against your own repositories, test it against your company repos. That's going to be the place to start. And don't forget that it's not just the repos, it's the places around the code: your Jira, your Slack, your Confluence, your artifacts.
Well, you know, it's a SaaS world out there. There are more and more APIs, people are connecting more and more things to each other, and there's just more and more of this non-human interaction and non-human identity to track as we go further. And agentic AI is going to blow the wheels off that too. So you guys are becoming more and more central, I think, to both governance and security overall. I think we'd love to have you back, Dwayne, and show us what's coming next. But with that, folks, check it out, GitGuardian, and find your secrets. Thanks.
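As a closing technical footnote on the fingerprinting approach described above: the general idea is to hash each secret and keep only a digest (a fingerprint), so leaked values can be matched later without the service ever holding the plaintext. GitGuardian's exact scheme isn't spelled out in the conversation; this is a generic keyed-hash sketch, and the environment variable and example values are invented.

```python
import hashlib
import hmac
import os

# A server-side key makes the fingerprints useless to anyone who steals the
# database: without it you can't recompute or brute-force the digests.
FINGERPRINT_KEY = os.environ["FINGERPRINT_KEY"].encode()  # hypothetical variable

def fingerprint(secret: str) -> str:
    """Return a short, stable identifier for a secret without storing it."""
    digest = hmac.new(FINGERPRINT_KEY, secret.encode(), hashlib.sha256).hexdigest()
    return digest[:16]  # a prefix is enough to match; the plaintext is never kept

# Matching a freshly found candidate against known fingerprints:
known = {fingerprint("sk_live_EXAMPLE_DO_NOT_DO_THIS")}  # built when secrets were vaulted
candidate = "sk_live_EXAMPLE_DO_NOT_DO_THIS"             # e.g. spotted in a public commit
print("known leak!" if fingerprint(candidate) in known else "no match")
```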