Using machine learning across data lakes with @IoTahoe


Summary: How can machine learning improve insight into and visibility across data lakes? In this short video podcast, we meet a new disruptive player, Io-Tahoe, and learn how its data management platform works across both structured and unstructured data so you can uncover the relationships among all the data in your environment and get insight well beyond traditional approaches. Transcript below:


Mike Matchett:                  Hi, I'm Mike Matchett with Small World Big Data, and I'm here today with Rohit Mahajan, who's the CTO and product owner for Io-Tahoe. Io-Tahoe is an exciting new entry into the data management space. It's gonna help people apply machine learning to really get a grasp on their data lakes and on both structured and unstructured data. So welcome, Rohit.

Rohit Mahajan:                  Hi Mike, thanks so much for having me.

Mike Matchett:                  So first just start off, tell us a little bit about Io-Tahoe, where you guys are coming from and what your sort of core vision is for data management.

Rohit Mahajan:                  Sure. So Io-Tahoe is a data discovery platform, and what we do really, really well is actually discover the data across both the data lakes and the structured data stores. The way we discover the data relationships is by leveraging machine learning, and we have a patented algorithm that we run across the data to discover those relationships. Now the key differentiating feature is our philosophy that we brute force the data. The data tells the story; the most accurate and truest story is told by the data itself. So that's our underlying philosophy, right? So we brute force the data. We read, obviously, the meta, but we don't stop there. And there's tons of implied relationships, because we've all grown up in the last decade, couple of decades, and the data has grown organically, and most of the time these relationships are lost. You know, very few metadata stores are kept updated.
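To make the idea of reading the data itself rather than only the declared metadata more concrete, here is a minimal sketch of one simple, generic technique: checking value overlap between columns to suggest undeclared key relationships. It is not Io-Tahoe's patented algorithm, which the interview does not describe. The function name `candidate_relationships`, the `min_confidence` threshold, and the sample tables are all hypothetical.

```python
from itertools import permutations


def candidate_relationships(tables, min_confidence=0.9):
    """Suggest (child column -> parent column) links from the values alone.

    Hypothetical illustration: a pair is reported when the parent column has
    no duplicate values (it looks like a key) and at least `min_confidence`
    of the child's values also occur in the parent.
    """
    columns = {
        (table, col): values
        for table, cols in tables.items()
        for col, values in cols.items()
    }
    found = []
    for (child, child_vals), (parent, parent_vals) in permutations(columns.items(), 2):
        if child[0] == parent[0] or not child_vals:
            continue  # this sketch only looks across different tables
        parent_set = set(parent_vals)
        if len(parent_set) != len(parent_vals):
            continue  # duplicates: the parent does not look like a key
        confidence = sum(v in parent_set for v in child_vals) / len(child_vals)
        if confidence >= min_confidence:
            found.append((child, parent, round(confidence, 2)))
    return found


# Made-up sample data: nothing in the column names links "cust" to "id".
tables = {
    "customers": {"id": [1, 2, 3, 4], "name": ["Ann", "Bo", "Cy", "Di"]},
    "orders": {"order_no": [10, 11, 12, 13, 14], "cust": [1, 1, 3, 4, 4]},
}
results = candidate_relationships(tables)
print(results)
```

Comparing every column pair is quadratic in the number of columns, which is one reason a first full pass over years of data would be expensive. The confidence value here is only the share of matching values, offered as a stand-in for whatever score a real product reports.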

Mike Matchett:                  Yeah, copies are made, subsets are made, things are emailed all over and there's databases forked off and then suddenly you don't know where everything is, right?

Rohit Mahajan:                  [inaudible 00:01:58] development, right? Absolutely. So we brute force the data and we project the data out to the end user. Essentially what I would say is that we ingest the data, we run our machine learning algorithms, and we project the relationships, be it within the lake or the non-lake, that is, the traditional data stores. We then let the end users either accept or reject the projections that we have made. We give a level of confidence, and the comments that I normally get are that clients have a high degree of confidence in Io-Tahoe's confidence. So that's a good sort of, you know-

Mike Matchett:                  Right, so if I understand this, somebody brings Io-Tahoe into their complex data environment and points it at all these data stores that they have. You release the beast, so to speak, and it goes and crunches through all the data, does this machine learning thing, and discovers all the relationships that are inherent in the data, not just what might be declared in the metadata, which we know is insufficient. So then you come back to the user and you say, "Okay, here's the picture of what you've got, and where we think the relationships actually exist." So if you're trying to do data governance, now you actually know this database is related to that database, which is cool. Is this a one-time thing? Do people just run this once or is this, I mean, data governance has got to be an ongoing thing. Do you brute force the data every day? That seems like a lot of overhead.

Rohit Mahajan:                  No, you're absolutely right. To answer your last question first, we don't brute force the data every day. We take the deltas, but obviously the initial run is pretty intense, because you are now brute forcing five years, ten years, a decade's worth of data, right? And trying to figure out the relationships and give them back to the end users. But Io-Tahoe will be there as long as your SDLC keeps running. Your data relationships do change. They don't change daily, but you do ingest data sources, you do develop new business rules on top of those data sources, you do make software changes in your systems, right, as a client. And as long as you are making those changes, well, SDLC is one of my personal favorite use cases. As long as there's an SDLC in an organization, you need Io-Tahoe.
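The interview does not say how the deltas are detected. As a rough sketch of one common approach, assuming a content fingerprint is stored per column after each run, only the columns whose fingerprint has changed need to be analyzed again. The names `fingerprint` and `changed_columns` and the sample snapshots are hypothetical.

```python
import hashlib


def fingerprint(values):
    """Hash a column's values in order (hypothetical helper)."""
    digest = hashlib.sha256()
    for value in values:
        digest.update(repr(value).encode("utf-8"))
        digest.update(b"\x1f")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def changed_columns(snapshot, previous_fingerprints):
    """Return the columns that are new or changed, plus the fingerprints to store."""
    current = {name: fingerprint(values) for name, values in snapshot.items()}
    changed = sorted(
        name for name, fp in current.items() if previous_fingerprints.get(name) != fp
    )
    return changed, current


# Initial run: nothing stored yet, so every column counts as changed.
first_run = {"orders.cust": [1, 1, 3], "customers.id": [1, 2, 3]}
changed, stored = changed_columns(first_run, {})

# Later run: only the column that received new rows is picked up.
second_run = {"orders.cust": [1, 1, 3, 2], "customers.id": [1, 2, 3]}
changed_again, stored = changed_columns(second_run, stored)
print(changed, changed_again)
```

Hashing still reads every value, so in practice a system might instead rely on change logs or load timestamps. The sketch is only meant to show why later runs can be much lighter than the initial one.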

Mike Matchett:                  Right, so discover something new every day, right? Every day's a new discovery which is kind of a cool thing for people who think data governance is a, it's kind of a static backroom thing. It's like you're able to discover things. But now you've got this, you're coming out, you've got a GA, you've got that machine learning based discovery, but you're finding that you're actually able now to give someone a catalog view and catalog features, right? Tagging features and so on? Which really gives them data governance. Can you tell me a little bit about that level of functionality?

Rohit Mahajan:                  Sure, so we have developed a feature set that we are calling data catalog, and that entails tagging, which is also based on machine learning. We have a data governance feature set which actually lets the users define the data storage and so on, define the various relationships, so that's data governance. We also have what we call a little bit of a business glossary; that's the initial rollout that we are doing. So when you pull all of this together, the tagging, the data governance, the definition of the business rules, you're letting the end user get a grasp of their data asset from a tech as well as a business perspective. Now, the reason we are significantly differentiated even in data catalog is two-fold. One, it is sitting on top of our data discovery, which is already leveraging machine learning and which has discovered a lot more data relationships than just the meta. On top of that, data catalog is also leveraging machine learning, with [phonetics 00:06:07], NLP, and n-gram type algorithms, to actually predict the catalog.

                                                      I just want to use one example: say you have two attributes or columns, and one called transaction ID is really flowing to something called 'attribute 1'. It is very challenging to catalog or tag something that is called 'attribute 1,' right? But if you're leveraging Io-Tahoe, because of its discovery of data flow as well as its various machine learning algorithms on tagging, we will be able to tag 'attribute 1' as a transaction identifier.
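The 'attribute 1' example can be sketched as tag propagation along discovered data flows: once a well-named source column carries a tag, any opaque column it feeds can inherit that tag. This is a simplified illustration under that assumption, not Io-Tahoe's tagging algorithm, and the function `propagate_tags` and all column names are hypothetical.

```python
from collections import deque


def propagate_tags(flows, known_tags):
    """Copy tags from tagged columns to the untagged columns they flow into.

    `flows` is a list of (source_column, target_column) pairs, assumed to come
    from an earlier data flow discovery step.
    """
    tags = dict(known_tags)
    downstream = {}
    for source, target in flows:
        downstream.setdefault(source, []).append(target)

    queue = deque(known_tags)  # start from every column that already has a tag
    while queue:
        column = queue.popleft()
        for target in downstream.get(column, []):
            if target not in tags:  # never overwrite an existing tag
                tags[target] = tags[column]
                queue.append(target)
    return tags


flows = [
    ("payments.transaction_id", "staging.attribute_1"),
    ("staging.attribute_1", "report.col_7"),
]
known = {"payments.transaction_id": "transaction identifier"}
tags = propagate_tags(flows, known)
print(tags["staging.attribute_1"])
```

A real catalog would more likely present the inherited tag as a suggestion with a confidence level for a person to accept or reject, in line with the accept-or-reject step described earlier in the conversation.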

Mike Matchett:                  Okay.

Rohit Mahajan:                  And that's a very powerful feature set, especially to the end governance user.

Mike Matchett:                  I mean, where the primary key relationships are, where the foreign keys are, where the personally identifying information is. It sounds like this is something that can now allow someone to really handle the scope and scale of these larger and larger data environments that aren't just structured data but now also include data lakes. And I know you guys eat a lot of different data sources. I'd love to talk to you some more; maybe we'll get you back on here to talk about some of the upcoming stuff that you've got cooking, because I know you've got more machine learning in the works. But this is exciting enough. Where can someone find out more information about Io-Tahoe today?

Rohit Mahajan:                  So you're welcome to get on There's a lot of information there. It tells you about various feature sets. So just to summarize, right, there's discovery, which actually does primary and foreign keys, it does the sensitive data discovery, it does the data flow, data redundancy, there's tons. We call it discovery, but there's a lot of insight in there. Now you take that insight and you get into data catalog, oh boy. We've now created a whole data management gamut of good information out there for both the tech and the business users. There's a lot of good upcoming feature sets. One of the biggest ones that I am personally driving is a business data view, back to the business owners. That, I think, is going to be very disruptive and valuable to the end clients.

Mike Matchett:                  I will look forward to seeing that. Thank you Rohit, for being on today, and thank you for watching.

Rohit Mahajan:                  Thank you very much, Mike.

Mike Matchett:                  Take care. All right, bye.