How to make big data self-service for your users

05/03/2018

Notable Quote: (Kelly Stirman of Dremio): "Companies have huge, huge amounts of data, but it's spread across many different systems, and it seems like, for decades, the answer to the problem was, "Hey, just move all your data into a new silo," but you just never finish, and the next silo pops up. The idea was we have data consumers in companies, the BI users, data scientists, executives who make data essential to their daily lives and how they get their jobs done, but getting access to the data means they have to get into what I'll call a data breadline, where they're waiting their turn to get from IT the things that they need."

Transcript:

Mike Matchett: Hi. I'm Mike Matchett with Small World Big Data, and, today, I'm going to talk about big data. I've got Kelly Stirman, who's the VP of Strategy with Dremio, which is an exciting, new product that I know all of you are going to want to hear about. What basically is happening is you've got a lot of big data, you've got data across many different repositories in your environment, you've got things in Oracle, you've got things in your big data lakes that you've just spent a lot of time constructing, you've even got data in Mongo and some other new and NoSQL databases, and your users want to marry that data, they want to join it together in some intelligent way and get value out of it, but it's hard. There's a lot of challenges. First of all, finding the data, stitching it together, getting good performance. How can you do a join between Oracle and Hive, right? That's just not really possible.

We're going to find out today. There is a new product out here that's making some of that possible. Welcome to the show, Kelly.

Kelly Stirman: Hey, Mike. Nice to have you. Nice to see you. Thanks for having me.

Mike Matchett: No problem. Dremio's kind of a new idea. How did you guys come up with this idea, and what exactly are you doing?

Kelly Stirman: Sure. I think those of us here at Dremio have been in big data and open source and distributed systems for over a decade, and we see the same patterns over and over again. Companies have huge, huge amounts of data, but it's spread across many different systems, and it seems like, for decades, the answer to the problem was, "Hey, just move all your data into a new silo," but you just never finish, and the next silo pops up. The idea was we have data consumers in companies, the BI users, data scientists, executives who make data essential to their daily lives and how they get their jobs done, but getting access to the data means they have to get into what I'll call a data breadline, where they're waiting their turn to get from IT the things that they need.

What everyone wants is to be independent and self-directed and to work at their own pace, and what Dremio is about is making data self-service so that, whether you're using Tableau or Python or any other kind of tool, you can get the data you need on your own terms and make sure you have a great, fast experience with that data no matter what tool you're using.

Mike Matchett: This really says to people who do have those disparate sources of data, "Here's a way you can stitch those together and provide those to that class of users universally and seamlessly," and not just do that, but you guys have worked really hard on accelerating that, so there's a performance aspect to this.

Kelly Stirman: There's actually a lot of things you need to do and do extremely well to deliver on this vision. One of them, as you said, is stitching the data together. The data's never in the shape you need for whatever job you're trying to do. Traditionally, what that's meant is, "Okay, I need to go to IT and say, 'Hey, here's what I'm looking for,' and then IT goes and makes a copy of the data and puts it somewhere where you can access it." That's a real challenge for companies because companies don't want thousands of copies of the data. It's a big risk, and it's very expensive.

In Dremio, we handle that curation of the data in a virtual context. There is one master copy of the data in whatever system you're using to store that data, but every user can have exactly what they want in a virtual way. In terms of accelerating the data, we have a really exciting patent-pending capability called Data Reflections that, invisible to an end user, gives you interactive speed on massive datasets, so whether it's in your data lake or in a data warehouse or in MongoDB or Elasticsearch, you can get sub-second response times from Tableau or Python or any other tool, and do it in a way that Dremio's managing for you invisibly behind the scenes.

Getting the data, making it able so you can stitch it together very efficiently in a virtual way, but then invisibly accelerating the access is core to how we're delivering on this vision.

Mike Matchett: To be clear, you're not picking up the data and making copies of it. You're not going through and pre-building cubes of this thing. You're doing something very intelligent, keeping the data where it is, looking at what's there, able to accelerate the query, basically providing a virtual view across the whole thing in a very fast way.

Kelly Stirman: Exactly. A common example is I think every company has a data warehouse, at least one, right?

Mike Matchett: Mm-hmm (affirmative).

Kelly Stirman: Many companies now also have data lakes where they have raw data or unstructured data sitting in a large Hadoop cluster or maybe you're on AWS using S3 or on Azure and ADLS, the Data Lake Store product, but you have data in both environments, at least those two, and you want to join the transactional data that's in your data warehouse to the unstructured data that you're pulling from social media feeds. What that's meant traditionally is you've got to find some new place to combine those datasets before you can query them.

With Dremio, the data can live in the data lake, in the data warehouse, and you can use Dremio to, in this virtual layer, combine the data from those two different systems, and then Dremio in the background is making the access really, really fast. By the way, we made that so a BI user could do that for themselves instead of opening a ticket with IT to do it for them.
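The join Kelly describes, where each dataset stays in its own system and is combined only at query time, can be illustrated with a minimal Python sketch. This is not Dremio's API or engine. The in-memory SQLite table stands in for a data warehouse, the JSON lines stand in for raw files in a data lake, and the names `warehouse`, `lake_lines`, and `virtual_join` are hypothetical.

```python
import json
import sqlite3

# Stand-in for a data warehouse: an in-memory SQLite table of orders.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, 35.5), (1, 20.0)],
)

# Stand-in for a data lake: raw JSON lines, as they might sit in object storage.
lake_lines = [
    '{"customer_id": 1, "mention": "love the new release"}',
    '{"customer_id": 3, "mention": "where is my order?"}',
]


def virtual_join(conn, json_lines):
    """Join the two sources at query time; nothing is copied to a third store."""
    # Build a hash table over the lake side, keyed on the join column.
    mentions = {}
    for line in json_lines:
        record = json.loads(line)
        mentions.setdefault(record["customer_id"], []).append(record["mention"])

    # Let the warehouse do its own aggregation, then probe the hash table.
    query = (
        "SELECT customer_id, SUM(amount) FROM orders "
        "GROUP BY customer_id ORDER BY customer_id"
    )
    for customer_id, total in conn.execute(query):
        for mention in mentions.get(customer_id, []):
            yield {"customer_id": customer_id, "total": total, "mention": mention}


rows = list(virtual_join(warehouse, lake_lines))
print(rows)
```

Only customer 1 appears in both sources, so the result is a single joined row. The point of the sketch is that both inputs are read where they live and the combined view exists only while the query runs.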

Mike Matchett: You've actually been working a lot on some things that are autonomous and some things that are predictive and some things that are dynamic so that you can actually ... It keeps itself current and helps the user find things and you don't have to actually crawl through it and manually set that up. Part of this is getting automated and getting predictive.

Kelly Stirman: Exactly. One of the models that we like, and we thought about it in designing the product is, it is sort of like Google Docs, but for your datasets. When you add Dremio to your Oracle database and to your data lake and to your MongoDB cluster, we automatically catalog the schema of those sources, so you already have a searchable starting point where any user can go in and do a Google search and say, "Hey, I'm searching for this dataset," and Dremio's search results will be datasets from all of these different systems you've connected to, and the users can, from there, click a button and launch their favorite tool connected to that dataset.

To make that possible, Dremio needs to be able to detect and catalog schema, to index it automatically, to be able to rewrite the query from the user's tool automatically, to speak the language of the underlying data technology. It needs to detect changes in the schema of the source automatically and adapt to those changes behind the scenes without the user needing to know. There's a lot we're doing to make it easy for the end user, but, also, in terms of administering the care and feeding of the system while it's in production.
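As a rough illustration of the cataloging, search, and schema-change detection described above, here is a small Python sketch. It assumes two in-memory SQLite databases as stand-ins for the connected sources, and the helpers `catalog_source` and `search` are hypothetical names invented for this example. It does not reflect how Dremio implements its catalog.

```python
import sqlite3


def catalog_source(name, conn):
    """Return {dataset: [column, ...]} for every table in one connected source."""
    schema = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        schema[f"{name}.{table}"] = columns
    return schema


def search(catalog, term):
    """Find datasets whose name or any column name contains the search term."""
    term = term.lower()
    return sorted(
        dataset
        for dataset, columns in catalog.items()
        if term in dataset.lower() or any(term in c.lower() for c in columns)
    )


# Two stand-in sources (hypothetical): a warehouse and a data lake.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE clickstream (customer_id INTEGER, url TEXT)")

# One searchable catalog across both sources.
catalog = {}
catalog.update(catalog_source("warehouse", warehouse))
catalog.update(catalog_source("lake", lake))
hits = search(catalog, "customer")
print(hits)

# The source schema changes; re-cataloging and diffing detects the new column.
warehouse.execute("ALTER TABLE customers ADD COLUMN segment TEXT")
refreshed = catalog_source("warehouse", warehouse)
added = sorted(
    set(refreshed["warehouse.customers"]) - set(catalog["warehouse.customers"])
)
print(added)
```

A search for "customer" returns datasets from both sources because each has a `customer_id` column, and the diff after the `ALTER TABLE` picks up the new `segment` column without anyone editing the catalog by hand.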

Mike Matchett: I know we talked earlier, you also have some collaboration intelligence built in that looks at what people have done and starts to become predictive about what you might want to do based on queries and things like that. You've also added some support, started to add support for security, being able to do row and column access controls and some other smart things, as the data [inaudible 00:06:58], but just to wrap this up, what is Apache Arrow, and how does that fit into this puzzle?

Kelly Stirman: Oh, that's a great question. It's interesting that I think a lot of us have watched computing move to an in-memory model because accessing data from RAM is about 10,000 times faster than accessing data from disk. There's been a big move in recent years to do more and more of data access in memory, but there is no standard for representing analytical data in memory without Arrow, and that might be a big surprise to people, but the analogy I like to use is without Arrow, it was like when you used to go to Europe on vacation, and you were going to do five countries in seven days, and when you were going to go from Switzerland to Germany, you were going to wait at the border and get your passport stamped and then you were going to have to convert your money, and you were going to lose some money along the way, so there was this friction built in for each country you were going to visit.

Well, Arrow is a standard that everyone can use, and it's like going to Europe on the Euro and the EU. You just drive over the border, and you use one currency everywhere. When you look at in-memory computing, one of the big bottlenecks now is converting the data between different representations for different processes, and with Arrow, everyone can share one standard and completely remove those friction points and dramatically improve the efficiency of in-memory processing.

Arrow's a project we started about two and a half years ago that is core to the Dremio engine, but it's being used by dozens of different open source projects today because there was no standard before, and it just makes perfect sense. Whether you're working in GPUs and machine learning and AI or whether you're working from a Python library or whether you're running SQL queries in Dremio, Arrow is the right way to manage data in memory for analytics.
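The friction Kelly describes, converting data between representations at every hand-off, can be shown with a small Python sketch. It does not use the Arrow library or its format. As an assumption for illustration, a standard-library `array` plays the role of a shared columnar buffer, and the function names are hypothetical.

```python
import json
from array import array

# One column of analytical data, stored contiguously as 64-bit floats.
prices = array("d", [19.99, 5.25, 102.5, 0.75])


def total_via_conversion(column):
    """Hand-off by conversion: the producer encodes, the consumer decodes."""
    wire = json.dumps(list(column))  # convert to a text representation
    received = json.loads(wire)      # convert back on the other side
    return sum(received)


def total_via_shared_buffer(column):
    """Hand-off by sharing: the consumer reads the producer's memory in place."""
    view = memoryview(column)        # no copy, same underlying buffer
    return sum(view)


# The view really is shared memory: a write by the producer is visible through it.
view = memoryview(prices)
prices[0] = 20.0
print(view[0])

print(total_via_conversion(prices), total_via_shared_buffer(prices))
```

Both functions compute the same total, but the first pays to encode and decode every value, while the second reads the original buffer directly. That second path is the kind of hand-off a shared in-memory standard is meant to make possible between different tools.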

Mike Matchett: It sounds like Dremio is really focused on making data, I have to say, fungible, but making data fluid and flexible across and allowing an access and interchange of access no matter where the data sits, no matter what the format is in, and you just want to get there.

Kelly Stirman: We make it fast.

Mike Matchett: How can someone find out more about Dremio?

Kelly Stirman: Well, one of the things we wanted to share with the world, Dremio's open source, so everyone can go to dremio.com and download. You can run it on your laptop to try it out. It's really designed to work in clusters of dozens, hundreds, even thousands of computers. We, of course, have an enterprise edition that we sell that has some nice features around security and management, but the really exciting things about Dremio are in the open source edition, and they're for everyone to take advantage of and use.

Mike Matchett: I know I'm going to install it and get some access out of this. As we were talking, I'm like, "I have a problem that that can solve for me," so I'm getting right on that. Thank you for being here today, Kelly. I look forward to hearing more about what you guys are up to. Thanks.

Kelly Stirman: Thank you, Mike. It was a pleasure being with you.

Mike Matchett: Thank you for watching. I'm sure we're going to be back with more about Dremio and some big data topics soon, so stay tuned. Thanks.

Channels:

Mike Matchett: Small World Big Data
