In this podcast, Srini Penchikala spoke with Dan Benjamin, the CEO of Dig Security, about three main topics: cloud data security, Data Security Posture Management (DSPM), and Data Detection and Response (DDR).
Key Takeaways
- Cloud Data Security
- Data Security Posture Management
- Data Detection and Response (DDR)
Transcript
Hi, everyone. Registration is now open for QCon London 2023, taking place from March 27th to the 29th. QCon International Software Development Conferences focus on the people that develop and work with future technologies. You'll find practical inspiration from over 60 software leaders deep in the trenches creating software, scaling architectures, and fine-tuning their technical leadership, to help you adopt the right patterns and practices. Learn more at QConLondon.com.
Srini Penchikala: Hi, everyone. My name is Srini Penchikala. I am the lead editor for the AI/ML and Data Engineering community at InfoQ, and I'm also a podcast host. Thank you for tuning into this podcast today. In today's podcast, I will be speaking with Dan Benjamin, the CEO of Dig Security. We will discuss three main topics that are very important in the current cloud computing environment: cloud data security, Data Security Posture Management, and Data Detection and Response. Let me first introduce our guest. Dan Benjamin is the co-founder and CEO of Dig Security. He leads a team of security professionals focused on discovering, classifying, and protecting cloud data. Previously, Dan held cloud and security leadership roles at Microsoft and Google. Hi, Dan. Thank you for joining me in this podcast.
Dan Benjamin: Thank you so much for having me.
Introductions [01:16]
Srini Penchikala: Before we get started, can you introduce yourself and tell our listeners about your career in the cloud data security area and any other accomplishments you would like to highlight?
Dan Benjamin: Of course. So hi, everyone. Thank you so much for listening. Again, I'm Dan. I'm the co-founder and CEO of Dig Security. I've been in the cloud, security, and engineering space for the past 20 years. I started in the Israeli Army, in Unit 8200, where I was for about four years. After the Army, I co-founded my first cybersecurity company in the IAM space, which helped organizations right-size permissions across their on-prem environments. That company was acquired by CA Technologies five years later. After that, I spent many years at Google Cloud and at Microsoft, both in Azure and in Office 365.
At Google Cloud, I mostly worked on building the startup program for Google Cloud, which helped around 15,000 startups from around the world build their entire infrastructure on top of Google Cloud. At Microsoft, I led the Microsoft CASB, which is cloud access security broker, or SaaS security. My last position was helping lead multi-cloud security strategy for the organization: how does Microsoft transition from being a single-cloud security vendor, just doing Azure security, to helping organizations protect their AWS and GCP environments as well. I left Microsoft to co-found Dig, and I'm super happy to be here.
Cloud Data Security [02:37]
Srini Penchikala: Let's get started with the cloud data security topic. What is the current state of cloud data security?
Dan Benjamin: I think that currently, what we're seeing is that most organizations today don't really know how to answer three main questions. First, what data do we even own across our clouds, across AWS, Azure, and GCP, especially if we're running across multiple clouds? Second, how is that data being used, and by which machines, users, vendors, and contractors? And lastly, how do we protect that data from being exfiltrated, misused, or breached? Most of these organizations don't have a lot of tools to help them answer these types of questions. I always split the landscape into the different kinds of options they have. If we go to the cloud-native solutions, we can potentially use AWS Macie, Azure Purview, or Google Cloud DLP. But those solutions are not multi-cloud, they're very manual, very expensive, and only support a very small subset of the data stores that a typical organization owns.
Today, a typical enterprise holds at least 20 different types of data stores, whether it is PaaS solutions like RDS, Azure SQL, and Google BigQuery, IaaS data stores like a VM running a database such as MySQL or MongoDB, or even environments like Snowflake, MongoDB Atlas, and Databricks. With this sprawl of technologies, different data store types, and thousands of data store instances, we need a solution that will help us discover and understand what data we own, classify that information, and of course help us protect that data. So we talked a little bit about the cloud-native solutions, Macie, Purview, Google Cloud DLP, that are not hitting the target for these organizations. Then we can talk about the older vendors. We used to have data security solutions for on-prem, but those solutions don't really work well for the public cloud today.
Take Varonis, which I think is one of the best data security companies in the market. They only cover buckets in AWS, right? A typical enterprise today has 20, 30, or 40 different types of additional data stores in the cloud that these solutions don't support. Or take, for example, the privacy vendors. We've had multiple privacy vendors born in the last five to 10 years. Those solutions were built for on-prem. They don't really tackle the cloud-native technologies, and they're not security vendors. They were born to help lawyers validate and make sure that we're privacy-compliant.
So most organizations today don't have proper solutions to protect data in the cloud. And what we see is that many times, when we come into an organization, they have built many homegrown tools. They either build proxies themselves or write scripts to discover data across their databases and buckets. So the landscape for data security in the cloud today is super fragmented, but I think we are seeing a shift, both by the cloud-native vendors and by multiple vendors that are coming up in the DSPM space and the data security space, trying to solve problems that up until now remained unsolved. We're super happy to talk about this topic today.
Challenges in securing data in the cloud [05:55]
Srini Penchikala: You mentioned a couple of challenges with cloud data security, but is there anything else you can highlight in terms of the constraints? I know storing data in the cloud has become the main strategy for a lot of organizations. So what are the challenges in securing data in the cloud?
Dan Benjamin: Beyond the fact that we just have a lot of different types of technologies, most organizations, without properly understanding, threat modeling, and going deep into how these data stores actually work, don't really know what to detect and how to properly manage these data stores. For example, up until about a week ago, a file stored in a bucket could easily end up publicly accessible. Just now, AWS changed that configuration setting, and today AWS makes sure that files within new buckets are private unless you explicitly configure them otherwise. Now, a typical organization that doesn't have a lot of experience working in the cloud will not know how to properly configure these types of data stores. That's where Data Security Posture Management comes to life. On the other side of it, you have different types of information that you keep.
So you have unstructured data like files, you have semi-structured data like JSON, and then you have structured data, relational databases like RDS, Azure SQL, and Google BigQuery. And with the sprawl of technologies and data store types, these organizations mostly don't know how to answer: How is the data moving inside the organization? Where do we have PII, PHI, or PCI data? Where do we have regulated data? And all of this is mandated by regulation. Take a healthcare company: HIPAA mandates that they know where their PHI is. Take a credit card or payments company: PCI DSS mandates that they understand and keep specific types of information secure. And when you come to these organizations today, they're stranded. They just don't have the right tool sets to build these types of controls. And I think that's where we're seeing this explosion in the space of cloud data security that helps organizations tackle these types of problems.
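As a concrete illustration of the kind of posture check Dan is describing, here is a minimal sketch, assuming Python with boto3 installed and AWS credentials already configured, that flags S3 buckets where Block Public Access is not fully enabled. It is illustrative only, not a complete DSPM control.

```python
# Minimal sketch: flag S3 buckets whose Block Public Access settings are not fully enabled.
# Assumes boto3 is installed and AWS credentials are configured; pagination and error handling
# are kept to the bare minimum for illustration.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def buckets_missing_block_public_access():
    risky = []
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            config = s3.get_public_access_block(Bucket=name)["PublicAccessBlockConfiguration"]
            if not all(config.values()):
                risky.append(name)
        except ClientError as err:
            # No Block Public Access configuration at all is treated as risky.
            if err.response["Error"]["Code"] == "NoSuchPublicAccessBlockConfiguration":
                risky.append(name)
            else:
                raise
    return risky

if __name__ == "__main__":
    for name in buckets_missing_block_public_access():
        print(f"Bucket without full Block Public Access: {name}")
```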
Process lifecycle for securing data in the cloud [07:57]
Srini Penchikala: Can you discuss what a typical process lifecycle would look like for securing data in the cloud? It is not the same as securing data on-premises. So what does a typical process look like in the cloud?
Dan Benjamin: Whenever we talk with customers about securing data in the cloud, we always talk about three main steps. The first step is understanding what data we have, and that's a full discovery process. We need to understand where we have data, whether it is in a managed solution like RDS, DynamoDB, Redis, and so on, or an unmanaged solution, where someone booted up a VM and installed MongoDB on it, or a database as a service. So initially, everything starts with full data asset discovery: we need to find the data stores that we own across our cloud environments. Then we need to understand what the data security posture is. And by data security posture, I'm talking about multiple different topics. First off, of course, how is data flowing inside the organization, and who has access to it?
Where do we have over-permissive access? Is the data sitting in the right locations around the globe? A lot of countries have data sovereignty requirements. So for example, European customer data today needs to sit in Europe, and Brazilian customer data needs to sit in Brazil. Then, what is mandated by our compliance? Some compliance regulations mandate specific controls around our data, like data retention policies, data deletion policies, data encryption policies, logging, and authorization. So we need to make sure that the posture of our data is correct. And once we've done both the data discovery and the data posture work, or in parallel, typically we see this done in parallel, we also build the threat detection capabilities for the data stores themselves. What are the actions that we do allow, and what are the actions that we want to be able to detect?
I'll give you a very simple scenario. Let's say you have a million customers, and that million-customer dataset sits in a specific RDS instance. You would like to know if someone tries to download the entire production database to their personal machine, right? That's a very obvious requirement. Or you would like to know if someone runs a SELECT * and downloads the entire production database to their personal laptop. These are very common scenarios, and for a typical enterprise today, they are super hard to detect.
And why? Because they need to properly threat model each type of data store. They need to understand: how do we detect something like this in the logs? Do we have the right logs enabled to properly detect it? I always talk about these three main steps: discover the data, make sure that we have the right posture for our data, and then make sure that we're able to detect the malicious events that we don't want to happen with our data. And what do we do if we do see these types of events, whether it is a mass download, a copy outside of our cloud, a copy to an FTP server, or a copy to an external asset? All of these are events that we need to think about, talk about, test, and then of course make sure that we have the right controls for.
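To make the detection idea concrete, here is a minimal, hedged sketch of such a rule. The log record fields (user, query, rows_returned) and the threshold are hypothetical; in practice they would map onto whatever audit logs the specific data store actually produces.

```python
# Illustrative sketch of a "mass download" detection over parsed database audit-log records.
# The record schema (user, query, rows_returned) is hypothetical; real fields depend on the
# specific data store's audit logging.
from dataclasses import dataclass

ROW_THRESHOLD = 100_000  # tune per table size, e.g. most of a million-customer table

@dataclass
class QueryLogRecord:
    user: str
    query: str
    rows_returned: int

def is_mass_download(record: QueryLogRecord) -> bool:
    """Flag queries that look like a full-table export, e.g. SELECT * returning huge row counts."""
    full_scan = record.query.strip().lower().startswith("select *")
    return full_scan and record.rows_returned >= ROW_THRESHOLD

# Example usage with a fabricated record:
suspicious = QueryLogRecord(user="contractor-42", query="SELECT * FROM customers", rows_returned=1_000_000)
if is_mass_download(suspicious):
    print(f"ALERT: possible mass download of customer data by {suspicious.user}")
```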
Data Security Posture Management [10:56]
Srini Penchikala: So let's get into DSPM in more detail. How does a DSPM program help with discovering and classifying data assets?
Dan Benjamin: First off, when we talk about DSPM, this is a very new term. It was coined by Gartner, I think, around June or July of 2022, so it's very young. It's essentially about how we build data-centric security: looking at the data and building controls outwards, versus trying to protect the perimeter and hoping that will protect our data as well. Data Security Posture Management is very similar to, or takes from, its older brother CSPM, Cloud Security Posture Management, which is about how we protect our infrastructure in the cloud from bad configuration and bad posture of our cloud resources.
Data Security Posture Management is about how we discover all of our data stores automatically across our clouds. How do we make sure that we understand what data we own, classify it properly, have a deep understanding of what data we should have and how that data is being interacted with, and of course make sure that we have the proper configurations for our data, wherever it sits? That's across dozens, if not hundreds, of different types of data stores, and thousands and thousands of VMs, machines, PaaS services, databases as a service, anything that might hold our information out there.
So that's where this originates from. And I think we're seeing an uptick in the number of companies coming into this space, and an uptick in the need of customers to build and deploy DSPM projects. I think it's a blessing, because most organizations today, when you come to them, will admit, at least behind closed doors, that in the last 18 months they encountered some sort of a data breach. They don't feel that they have the right controls on top of their data. They don't feel that they have the right controls to match their compliance regulations, and it's just super hard to protect the data today. So DSPM is definitely a great step in the right direction, but it's still a very young category. When we started the company at the end of 2021, most organizations had struggled with data security projects in the past.
Take a data security project on-prem: it used to take 12 to 18 months to implement for just 30% of your on-prem data stores. The cloud technologies that allow us to deploy projects like this much quicker have increased adoption. So today, we see one out of every five organizations implementing data security projects. We're seeing a very big explosion in the space, and we're happy to say that we're one of the leaders in this space in the market today.
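To make the auto-discovery step Dan describes more concrete, here is a minimal sketch, assuming Python with boto3 and read-only AWS credentials, that enumerates a few common AWS data store types. A real DSPM tool would of course cover many more store types, handle pagination, and span the other clouds; this is only an illustration.

```python
# Minimal sketch of the data-asset discovery step on AWS, assuming boto3 and read-only credentials.
# Pagination is ignored and only a few store types are listed, purely for illustration.
import boto3

def discover_aws_data_stores(region="us-east-1"):
    stores = []

    # S3 buckets (object storage)
    s3 = boto3.client("s3", region_name=region)
    for bucket in s3.list_buckets()["Buckets"]:
        stores.append(("s3_bucket", bucket["Name"]))

    # Managed relational databases
    rds = boto3.client("rds", region_name=region)
    for db in rds.describe_db_instances()["DBInstances"]:
        stores.append((f"rds_{db['Engine']}", db["DBInstanceIdentifier"]))

    # Managed NoSQL tables
    dynamodb = boto3.client("dynamodb", region_name=region)
    for table in dynamodb.list_tables()["TableNames"]:
        stores.append(("dynamodb_table", table))

    return stores

if __name__ == "__main__":
    for kind, name in discover_aws_data_stores():
        print(f"{kind}: {name}")
```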
Srini Penchikala: If somebody is interested in learning more about DSPM, are there any online resources outside of your website that you recommend?
Dan Benjamin: First off, I think that the Gartner report is really good. It's the Hype Cycle for Data Security. It's very succinct, but it talks about why the market is moving into Data Security Posture Management and what the different catalysts are. We're now seeing more and more resources, either by companies like us, so Dig Security and our competitors, or by different types of other vendors. I even asked ChatGPT the other day, "Hey, ChatGPT, what is Data Security Posture Management?" And it actually gave a pretty good definition, which I think is super cool to see. There are a lot of good resources. On the Dig blog, we have fantastic resources, and I think some of our competitors do as well.
Definitely Google "Data Security Posture Management." And I would deep dive into the topic of "What is the difference between CSPM and DSPM?", which I think is an excellent question to ask, or the differences between on-prem data security and cloud data security, which I think is also very interesting. Those are guiding questions that I would definitely encourage people to explore. And if anyone wants to reach out and ask additional questions, I'm happy to answer. My email is Dan@dig.security.
Srini Penchikala: You mentioned ChatGPT, right? I find that very interesting. I think we just need something called JustDoItGPT, right? So not only can it tell us what the solution is, but it can also do it. I'm just joking, but let's go to the next main topic, Dan: Data Detection and Response, probably the most important part of cloud data security. Maybe we can start with the definition. Can you define what Data Detection and Response is?
Dan Benjamin: Yeah, so Data Detection and Response is about how we detect and respond to bad and malicious events that happen with our data, wherever the data lives today. I think it draws from endpoint detection and response and network detection and response, and now I'm even seeing identity detection and response. It focuses on the 40% of resources that we have in the cloud today that need protection but that we don't have the right controls to tackle.
So Data Detection and Response protects the data from bad usage. Bad usage can be either mistakes or a malicious actor, and we see both of these in customer environments today. But drawing back to our previous discussion, when we think about cloud data security, we always think that a combination of Data Security Posture Management and Data Detection and Response is the way to go. And why? Because if we start with posture management, what we solve by deploying a Data Security Posture Management solution is discovering what data we own, what the posture, configuration, and encryption status are, and what kind of content we own in the specific data stores.
But even if we deployed that type of solution, we're still going to have hundreds, if not thousands, of people, machines, vendors, and contractors that have access to the data, and that access goes completely unmonitored. We don't have any tools today to look at the activity that is done with the data itself. With Data Detection and Response technologies, we look at each type of data activity: admin events, data events, connections, resource events, and so on, and we're able to detect if something bad happens with the data. I'll give you an example. We onboarded one of the largest banks in the US, and what we found is that for the past three years, every single day, there was a cron job that ran at 3:00 PM. Once a day, that cron job was copying all their financial reports to an external AWS account that didn't belong to them.
Now, how can something like this even happen? Because no one really monitors how data is being used after permissions are given; we don't have the right tool sets. A detection and response solution threat-models each type of data store, understands what bad actions can happen to it, draws context from the data sensitivity, the data posture, and the auto-discovery capabilities of the DSPM, and then is able to flag, detect, and respond to anything bad that might happen, whether it is a mass download, a mass upload, someone disabling encryption on an asset, making an asset public, someone copying data outside of a residency area, a machine mounting a database backup and copying the information out, or copying data to an FTP server. Data Detection and Response is protection at the runtime level: how do we protect against anything bad that might happen with the data that is not related to posture? At the bank, they had the proper encryption, they had right-sized permissions, they had the right posture and controls there, but usage was where they failed to protect the information itself.
So combining both usage and posture, I think, is the way to go. And this is not a new concept. If we go back to how we used to do data security on-prem, we used to have eDiscovery solutions that would find the data, and then we had DLP controls that made sure no sensitive data would be exfiltrated through the actual endpoint, the specific machine, or the specific email. In the cloud, we just don't have one single entry and exit point, and that's why we need detection and response capabilities. So I see DDR as the evolution of DLP for the cloud, which allows us to understand and monitor how data should be used and respond to any malicious event that happens. Hopefully that made sense.
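As an illustration of the kind of runtime signal Dan describes, here is a hedged sketch that scans S3 data events in a downloaded CloudTrail log file for object access by accounts outside an allow-list. The field names follow the standard CloudTrail record schema, but the allow-list, the watched events, and the file handling are illustrative only, not a complete DDR rule.

```python
# Hedged sketch: scan S3 data events from a CloudTrail log file for object-level access by
# accounts outside an allow-list. Field names follow the standard CloudTrail record schema;
# the allow-list and choice of events are illustrative.
import json

TRUSTED_ACCOUNTS = {"111111111111"}           # hypothetical: our own AWS account IDs
WATCHED_EVENTS = {"GetObject", "CopyObject"}  # object-level reads/copies of interest

def flag_cross_account_access(cloudtrail_file: str):
    with open(cloudtrail_file) as f:
        records = json.load(f).get("Records", [])
        for record in records:
            if record.get("eventSource") != "s3.amazonaws.com":
                continue
            if record.get("eventName") not in WATCHED_EVENTS:
                continue
            caller_account = record.get("userIdentity", {}).get("accountId")
            if caller_account and caller_account not in TRUSTED_ACCOUNTS:
                yield {
                    "event": record["eventName"],
                    "caller_account": caller_account,
                    "bucket": record.get("requestParameters", {}).get("bucketName"),
                    "time": record.get("eventTime"),
                }

# Example usage against a downloaded CloudTrail log file (hypothetical filename):
# for finding in flag_cross_account_access("cloudtrail-s3-data-events.json"):
#     print("Possible cross-account data access:", finding)
```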
Srini Penchikala: Yes. So the idea is to protect the data in all phases of the data lifecycle, like data at rest, data in transit, and data in use, right?
Dan Benjamin: Definitely.
Data Detection and Response [19:23]
Srini Penchikala: So DDR solutions use real-time log analytics to monitor cloud environments that store data and detect data risks as soon as they occur. Can you discuss how a typical DDR solution works and what application developers should be aware of?
Dan Benjamin: Of course. So a DDR solution, as you mentioned, taps into the existing logs that the organization owns, whether it is admin events stored in CloudTrail or in Stackdriver for Google Cloud. They tap into the data events, and they tap into the resource logging and connection logging for each type of data store. Now, of course, not all organizations have all the right logs enabled, and a DDR solution should flag what the organization is blind to: which logs we should enable to be able to detect something important. A typical DDR solution will tap into the existing logs, either collecting them itself or tapping into the organization's sink. Typically, if you have all logs being funneled to one specific system, let's say Splunk or a SIEM, then the DDR solution should also tap into that same sink, the one you already funnel all the logs into.
DDR solutions should also enrich the content that these logs bring. That can be enrichment through threat intelligence sources, whether from CrowdStrike, Microsoft, or any other solution that can say, "This IP is compromised," or "This specific actor might be compromised," and so on. They should also bring in more information, because sometimes the logs provided by the clouds are lacking. Sometimes AWS will say, "A specific resource changed," but AWS doesn't mention what exactly has changed. So a DDR solution should pull in the additional information and context it needs from AWS, Azure, or GCP. Once we've collected the logs and enriched them with either threat intelligence or additional API calls, we need to clean up the actual data, de-duplicate, aggregate information, and be able to do four main types of detections.
The first type of detection is, of course, single-event detection: "If event X happens, raise an alert." Then you have multi-event detections: if events X and Y both happen, then alert on that specific combination. Then you have sliding-window alerts, which say, "If a combination of events happens in the last 30 minutes, then also raise an alert." And the last piece, of course, is aggregation: "If you see more than X events, or more than 1% of the events look something like this, then also raise an alert." These are the different types of alerts that we already see and that a typical DDR solution should be able to detect and respond to. The last piece is the response piece. Let's say we saw an event, a mass exfiltration event, in a typical customer environment. What do we do now?
Some DDR solutions will respond themselves, and that can be suspending the user, removing the permissions, locking the identity out, requiring MFA, and so on. On the other side, some DDR solutions will also plug into a SOAR solution, a security orchestration solution like Torq, Tines, or Demisto, and that solution will be the one that responds to the malicious event. I see that most DDR solutions today can do both: either respond themselves or plug into the existing workflow that you already own. Detection without response is not strong enough, but having the right detections across your clouds and across your different types of data store technologies is super important as well.
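As a rough sketch of two of the detection patterns Dan lists, single-event and sliding-window, plus one possible automated response, here is some illustrative Python. The event schema and thresholds are hypothetical; the IAM call is a real boto3 API, shown only as one example of a response action, not as how any particular DDR product implements it.

```python
# Minimal sketch of single-event and sliding-window detections, plus an illustrative response
# hook. Event names/fields and thresholds are hypothetical; the IAM call is one possible
# automated response, assuming boto3 and suitable permissions.
import time
from collections import deque

import boto3

WINDOW_SECONDS = 30 * 60     # sliding window: the last 30 minutes
DOWNLOAD_THRESHOLD = 50      # e.g. more than 50 object downloads in the window

class SlidingWindowRule:
    def __init__(self, window_seconds=WINDOW_SECONDS, threshold=DOWNLOAD_THRESHOLD):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # timestamps of matching events

    def observe(self, event: dict) -> bool:
        """Return True when the rule fires for this event stream."""
        now = event.get("timestamp", time.time())
        # Single-event detection: some events alert immediately.
        if event.get("name") == "DeleteBucketEncryption":
            return True
        # Sliding-window detection: too many downloads in the last N minutes.
        if event.get("name") == "GetObject":
            self.events.append(now)
            while self.events and now - self.events[0] > self.window_seconds:
                self.events.popleft()
            return len(self.events) > self.threshold
        return False

def respond_deactivate_key(user_name: str, access_key_id: str):
    """One possible response: deactivate the offending IAM user's access key."""
    boto3.client("iam").update_access_key(
        UserName=user_name, AccessKeyId=access_key_id, Status="Inactive"
    )
```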
Srini Penchikala: Can you mention some of the best practices for implementing cloud data security?
Dan Benjamin: When it comes to best practices for cloud data security, first off, I would say learn and understand your objectives. What are you trying to do? What are you trying to achieve? Some organizations just want to understand what data they own and clean up, and I think that's a valid approach. But many of the larger enterprises need to build a proper program, and for that program you need a leader, an owner who will acknowledge and handle these types of risks. You need to have the right tooling in place, whether it is Dig Security, some sort of cloud-native solution, or one of your existing vendors that wants to expand into the data security space. And once you have the program, the right person in place, and the right tools, start resolving the issues that you find across your clouds.
Focus on the big things first. Focus on the biggest locations of the most explosive data, as I like to say. We actually have a blog post about it, so I definitely recommend people read our blog post that talks about how to build an enterprise data security team. We talk about this process that a lot of organizations are going through today, how to identify the right person to lead the team, what the responsibilities of a data security team are today, and how to properly protect the information in the cloud. So I definitely recommend it.
Srini Penchikala: Yeah, definitely. I think the team is the most important part of the strategy. Right?
Dan Benjamin: Definitely.
Wrap Up [24:29]
Srini Penchikala: And also, are there any standards or guidelines or any consortiums that our listeners can follow on this topic?
Dan Benjamin: Not yet. I think that this year we're going to see more and more consolidation in the market. This is a very new category. I know that Gartner is now working on an innovation insight that will talk about what a DSPM solution should look like, based on the 10 or so vendors that they've already seen. I also think that the privacy and data security regulations are going to mandate additional controls, and that will become more aligned across the industry. Today it's a little bit of a Wild West: you can read the GDPR requirements and the FTC Safeguards requirements, which mandate every FTC-regulated company to follow certain controls, and you can look at PCI requirements and CCPA requirements. Each one of them is a little bit different, but they're very similar in nature and in what they're trying to achieve for the end user.
Srini Penchikala: Yeah, definitely. I think shift-left security is the new strategy, right? Do you have any additional comments before we wrap up today's discussion?
Dan Benjamin: I'll be happy to talk with any organization exploring data security controls for their cloud infrastructure, exploring Data Security Posture Management, or exploring Data Detection and Response. Please visit us at dig.security to learn more.
Srini Penchikala: Sounds good. Thank you, Dan, very much for joining this podcast. It's been great to discuss one of the most important topics, cloud data security. This is the perfect time to talk about how to secure data in the cloud, as moving data to the cloud becomes more and more popular. To our listeners, thank you for listening to this podcast. If you would like to learn more about data engineering, AI and ML topics, and security as well, check out the AI/ML and Data Engineering community page on the InfoQ website. I encourage you to listen to the recent podcasts we have posted, and also check out the articles and news items we have on these topics on the website. Thank you.