Transcript
Today, Rati and I will be talking about how Vault, as the secret manager for our Databricks pipelines, helped us enhance data security and compliance overall and achieve the data governance part of it. First of all, huge thanks to HashiCorp Vault, who gave MIQ an opportunity to talk about some of the great work that we do with respect to secret management. A brief introduction about both of us. I'm Sunil Khandelwal, currently working as an engineering manager in the processing team, the team that enables MIQ to analyze and act on any data in a secure and governed manner through services, technologies, and platforms like Databricks. I have Rati along with me, who is a tech lead in the processing team. And this is what we are going to talk about today. We'll talk about what MIQ is and what we do as a business. We'll spend a couple of minutes on the problem statements that we are trying to solve and the approaches that we took, followed by the solutions that we designed, the secret interfaces that we built, and Q&A.
A brief introduction about MIQ. We are a leading programmatic media partner. We have been in this industry for the last 12 years, and we have partnered with multiple agencies and marketers to deliver programmatic media buying through connected insights and data-driven technologies. We have sold 100 billion plus impressions to date. We have worked with 10,000 plus advertisers across the globe. We are 1,100 plus employees spread across nine different countries. We have won many awards, not only for the services and solutions that we build, but also for the people and the culture that we have in MIQ. We are a 1,000 plus strong team of award-winning programmatic professionals, which includes top client services professionals, commercially-minded data scientists, and expert programmatic traders, who help us build the strategies to run media campaigns.
We are a people-powered company. When we say we are a people-powered company, we are not just saying it. We hire inclusive and diverse people into the team, and this is the only way to help our clients outperform their competitors. We won 10 plus awards in 2023 for our people and culture, among them Best Places to Work 2023 and Best Firm for Data Scientists to Work.
This is a glimpse of how programmatic media advertising happens. You can call it a small component of ad tech. Imagine a user visits a web page, and there are lots of ads coming up. What happens behind the scenes? Website publishers communicate with an ad marketplace to put up impressions for auction. There are a limited number of ad slots available, and millions of people are trying to put their ads in those slots, which demands a real-time auction. The real-time auction is held among the advertisers competing for that particular impression. That's where the MIQ team comes into the picture. MIQ, being a programmatic media agency, works with the advertisers and provides a medium for real-time analytics, data science, and modeling, to ensure the right user is targeted for the right personas. And the beauty of this is that it happens within 0.1 milliseconds, and the advertiser with the highest bid wins. This has been going on for several years. MIQ handles 40 plus billion impressions on a daily basis, generating 20 plus terabytes of data a day, which helps us compute which advertisers are the right ones to go ahead with, and whether the advertiser is meeting the intent of targeting the right users in its own context or not.
Moving on, as we know, data plays a very important role in any decision making. That's how data is at the core of programmatic decision making as well.
Here are some of the stats that we have collected. MIQ is running 2,000 plus campaigns on a daily basis, generating and processing 30 plus terabytes of data. We are running 3,000 plus Databricks jobs, for which the incurred cost is more than $40,000 monthly in DBUs. We have 7,000 plus trading strategies, and 750 million users have been targeted so far.
Let's come to the problem statement. To solve programmatic media decisioning, we generate and process terabytes of data, which requires data ingestion, pipelines, and processing jobs. Hard-coding secrets directly into the data pipeline is a common but risky practice, which poses several security and operational challenges and calls for a secret manager. We looked at the problem statement in four parts. How do we capture the hard-coded secrets which already exist in the data pipelines? How do we fix them, that is, how do we build an automated system to move the hard-coded secrets to a secret manager? How do we ensure that no more secrets are added to pipelines created in the future? And how do we do all of this without disrupting the user experience?
With that, I will hand over to Rati, who will talk about the approaches that we took and the solutions that we designed to solve this particular problem. Over to you, Rati.
Hello, everyone. Thank you once again for joining us today for our talk. I'll take over from where Sunil left off. We'll quickly go through the approach that we took to solve this problem, and then we'll deep dive into each part one by one. Starting off with the very first problem, which was capturing the statistics. As Sunil mentioned, we had secrets lying around across all our repositories. We kept coming across them, but we did not have any consolidated report on how many secrets we were talking about, what kind of secrets they were, where they were lying, and who owned them.
So our first step was to capture all the statistics. For this, the tool that we used was TruffleHog, which is quite well known. Eventually, we also added Gitleaks to our system. We captured all the results and stored them, so we had a repository telling us how many secrets there were, where they were lying, and who the owner of each one was.
The next step for us was fixing it. Fixing this problem had two parts to it. One part was a one-time replacement of all the secrets lying around. The second part was providing users with a utility through which they can talk to the new secret manager. As part of the first phase, we created a user-level folder structure in Vault. Every single user within MIQ got their own folder so that they can store their secrets there. The second piece was the automated secret push and replacement. We ran an automated secret scan and secret push script, which captured all the secrets lying across all the repositories. It pushed the secrets into each user's respective folder, and then replaced the hard-coded values in their code with references to Vault. We also provided users with a utility library. We'll talk about the utility library in detail in subsequent slides.
The next step for us was UX, user experience. It would have been quite disruptive if we had just asked users to move all their secrets to the new secret manager, and, going forward, asked them to change the way they have been doing their development by going to a new application, a new UI, and making changes in their code. So what we did is we created a UI in our in-house application called Studio. Let me quickly introduce Studio to you. It's a no-code, low-code, drag-and-drop automation tool built on top of data platforms like Databricks, StreamSets, NiFi, and others.
We saw an opportunity here, so we added a UI on top of this automation tool where you don't need to go to Vault and you don't need to go to another application. In the same application, you can fix your code, and from the UI itself you can push the secret and reference it in your code. So we were able to capture, fix, and also improve the UX for the user.
The fourth step for us was to ensure that we did not end up back in the state we were in. For that, we started monitoring on a regular basis to figure out how many secrets are stored, where they are stored, and who the accountable user is. We have reports scheduled, and these get shared regularly, weekly at this point, with the respective users and the leads.
Let's jump to each one of them one by one. For capturing secrets across the code repositories, first of all, we figured out all the commonly used secrets in MIQ, worked out the regexes which would help us capture them, fed these into TruffleHog, and ran the scan. The screenshot that you see on the right-hand side is the very first report we generated, which was shared with us as well.
The second part was the Python utility library which we created. It had three components. One was the library itself, installed in Databricks. Another was the private key, stored in AWS Secrets Manager. The third component was Vault itself. We wrote this library, and what it does is connect with Vault, figure out who the user is, get a user-specific JWT token, and then make a subsequent call to fetch the user-specific secrets. The sequence looks somewhat like this: I go to a Databricks notebook and try to fetch my tokens. I invoke the method. The Python utility library makes a call to AWS Secrets Manager and gets the private key.
This private key is the one used to sign all the communication between the Databricks infrastructure and Vault. Once the library has the private key, it makes a call to Vault and gets the user-specific JWT token. That call also carries information about the user, which we get from Databricks Utilities itself. Once we have the user-specific JWT token, we make another call to the user-specific path, that is, the secret engine and the folder within it, and fetch the secret the user has registered.
The third part was fixing the UX. The utility library was self-sufficient, but again, you have to write code to access it. What if, as a user, I do not want to write code? What do I do? The in-house tool which we talked about, Studio, is what we integrated Vault with. We interfaced Vault with Studio. The screenshot in the bottom left corner shows part of the Studio canvas and the feature that we have built. You can trigger scanning of notebooks from the Studio UI itself, and you'll get the details about it. The screenshot on the top right shows how you see the details associated with your task: what kind of validation it is, what violation type it is, and what matched string we are talking about.
Here is another screenshot from Studio itself. Studio also provides a way to write Python code, and we integrated an IDE-like feature into the Python code editor. So while you're typing your code, inline, you get to know which secrets you have added. The validation will not allow you to save until you have moved the secret out of the code editor. So what do I do as a user if I have to move this out of the code editor? Do I have to go to the Vault UI, paste it, and come back and reference Vault from here? No, you don't have to do that.
All you have to do is scroll down a bit, and then you see this update variable section, where you can paste your secret, give it a name, and then replace the secret in the code editor with the name that you have provided. And everything works like a charm.
The fourth and last section was the monitoring bit. We were able to capture, we were able to fix, and we provided users with all the utilities, but that doesn't ensure that nobody is going to add secrets to their notebooks. So we have this report, scheduled on a weekly basis, which scans all the notebooks across MIQ, generates a report, and shares it with the respective owner along with the leads of the team the owner belongs to. It also gets shared with us, so we are able to keep an eye on it. The screenshot which you're looking at right now was taken last week. It is for one of our Databricks workspaces, called BA Commons. We have about 10 to 12 Databricks workspaces. BA Commons is Business Analytics Commons, a common workspace used for different verticals. You can see that there are 10 secrets lying around right now. The screenshot on the top right is of the Excel sheet which gets attached as part of the report, where you can see what the path is, who the user might be, what the detection regex is, and what string it might have matched.
That would be it from our side. If you have any questions, feel free to ask. Thank you.
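The weekly notebook scan described in the monitoring section can be illustrated with a minimal sketch. It is not the scanner used in the talk, which relied on TruffleHog and Gitleaks; it only shows the general idea of running regex detectors over notebook source and producing report rows with the path, the owner, the detector, and a redacted match. The detector patterns, the `Finding` type, and the function names are all hypothetical and chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical detectors, for illustration only. These are not the
# regexes fed into TruffleHog in the talk.
DETECTORS = {
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters/digits.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # A quoted literal of 6+ characters assigned to something named "password".
    "hardcoded_password": re.compile(
        r"""\bpassword\s*=\s*["'][^"']{6,}["']""", re.IGNORECASE
    ),
}


@dataclass(frozen=True)
class Finding:
    """One row of the report: where, whose, which detector, what matched."""
    path: str
    owner: str
    detector: str
    matched: str


def redact(text, keep=4):
    """Keep the first few characters so the report does not leak the secret."""
    return text[:keep] + "*" * max(len(text) - keep, 0)


def scan_notebook(path, owner, source):
    """Run every detector over one notebook's source text."""
    findings = []
    for name, pattern in DETECTORS.items():
        for match in pattern.finditer(source):
            findings.append(Finding(path, owner, name, redact(match.group(0))))
    return findings


def weekly_report(notebooks):
    """notebooks is an iterable of (path, owner, source) tuples."""
    rows = []
    for path, owner, source in notebooks:
        rows.extend(scan_notebook(path, owner, source))
    return rows
```

Calling `weekly_report` with a list of `(path, owner, source)` tuples returns one `Finding` per match, which could then be grouped by owner before being sent out. Redacting the match before it goes into the report is a design choice made here so that the report itself does not become another place where secrets are stored.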