How do you manage petabytes of data with global workflows?


Learn about a new technology designed to overcome the challenges of managing large global datasets and workflows.

As datasets become increasingly large, with workflows that are increasingly global, management becomes extremely cumbersome very quickly. In this video, learn about a new technology designed to overcome this with a radical approach to the way files are stored and the way workflows and policies are deployed across them. If you need to manage petabytes of data, and potentially trillions of files, legacy file systems are not the most efficient way. File systems are familiar and useful, but there's a new and better way to manage them.


Mike Matchett:                  Hi. I'm Mike Matchett with Small World Big Data. And I'm here today with Peter Godman, founder and CTO of Qumulo. Qumulo's been a company I've been watching and working with for many years. They're doing some great, interesting things on the performance end of file systems. Welcome, Peter.

Peter Godman:                   Well, thank you very much for having me.

Mike Matchett:                  First, just tell me a little bit about what Qumulo did uniquely when you guys came out a couple years ago. What was it you brought to market that really differentiated you from all the other kinds of file storage?


A Short History of Qumulo

Peter Godman:                   We founded Qumulo just a little bit more than six years ago, and in March of 2015, we announced our company and product after working away at it for right about three years. And that product was something unusual. It's a scalable file system that, then, we called Qumulo Core, now we refer to as the Qumulo File Fabric. A scalable file system that helped folks not only manage ... sorry, not only store enormous data sets, but also manage them. And we did that by helping them understand what data they had, how it was growing over time, how it was getting used, et cetera. And so, we set out to radically simplify the process of storing and managing huge data sets.

Mike Matchett:                  And when we say "scale" and "huge" ... And I talked to some other people. Their definitions vary. You really mean millions, billions, and trillions, right? It's not a couple hundred thousand [inaudible 00:01:27].

Peter Godman:                   Yeah, that's right. So, for us, large file counts are tens of billions. That's where things start to get large. And a lot of folks just have millions. When we think about capacity, for us, large starts to happen in the tens of petabytes. But, everything is relative, as you say. I routinely talk to folks who will say, tens of petabytes isn't large, exabytes is large. Or, whatever.

Mike Matchett:                  Right, but we're definitely talking about file systems that are on the top end. So, these are things that-

Peter Godman:                   Absolutely.

Mike Matchett:                  In a prior world, these would be just HPC systems. They would be-

Peter Godman:                   That's right.

Mike Matchett:                  And we could rattle those off, but you sort of brought that back and said, hey, there are plenty of enterprise use cases for this, or plenty of people doing media and entertainment, or plenty of people doing globally distributed work. There are plenty of people now getting into IoT. These are no longer nichey, HPC things. These are rank-and-file scale things that people need.


Use Cases for Qumulo

Peter Godman:                   Autonomous vehicles are a really interesting example of a brand new use case that is consuming, already, exabytes of file storage. Files specifically. Not object [inaudible 00:02:37] file.

Mike Matchett:                  All right. So tell me, what is the key data management challenge when I get to those scales of objects? What are some of the problems, and then what are some of the things you guys have brought to the table to help stay on top of that scale?

Peter Godman:                   The challenge is, file systems are just trees. That's really what they are. They're trees that map names onto folders, and names onto individual files. That's what these things are. And trees aren't the best way to understand large data sets, necessarily, because you have to go visit everything. So at Qumulo, we have these T-shirts that say "no tree walks" on them. And the reason we do that is, we try to systematically eliminate all tree-walking from traditional file system context. People like file systems, but they hate tree walks. So, we eliminate tree walks from systems.

                                                      We do this in various ways. I'll give you one example of something we released in the last year. We added a quotas function to our product. And, whereas all the quotas functionality that other folks have added to their products requires re-scans of entire directory trees, or rebuilding a volume, or building a new qtree with a scan, processes which can take weeks or months to complete, Qumulo quotas are always added instantaneously. So, I can take a Qumulo system that has a billion files and a petabyte of data in it, go anywhere in the tree, and say, I want to add a quota here and I want to set it to 1.1 petabytes, and it will take effect immediately. Or, I can say, I want to set it to 900 terabytes, and it will cut it off immediately, because Qumulo File Fabric always knows how much data lives everywhere. This is one of the many ways that we're making data management really, really easy for people who are trying to manage enormous data sets.
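Editor's illustration: the interview does not describe how Qumulo File Fabric tracks usage internally, so the following is only a minimal sketch of one general technique that makes "instant" quotas possible, namely keeping a running byte total on every directory so that no scan is needed when a limit is set. The names `Dir`, `QuotaExceeded`, `set_quota`, and `write` are hypothetical and are not part of any Qumulo API.

```python
class QuotaExceeded(Exception):
    """Raised when a write would push a directory past its quota."""


class Dir:
    """A directory node that caches the total bytes stored beneath it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.total_bytes = 0      # running aggregate for the whole subtree
        self.quota_bytes = None   # None means "no limit"

    def _self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def set_quota(self, limit_bytes):
        # Constant time: the aggregate already exists, so nothing is scanned.
        self.quota_bytes = limit_bytes

    def over_quota(self):
        return self.quota_bytes is not None and self.total_bytes > self.quota_bytes

    def write(self, nbytes):
        # Cost is proportional to the depth of the tree, not the file count.
        chain = list(self._self_and_ancestors())
        for node in chain:
            limit = node.quota_bytes
            if limit is not None and node.total_bytes + nbytes > limit:
                raise QuotaExceeded(f"{node.name}: limit {limit} bytes")
        for node in chain:
            node.total_bytes += nbytes


TB = 10**12
PB = 10**15

root = Dir("/")
projects = Dir("projects", root)
renders = Dir("renders", projects)

renders.write(1 * PB)            # a petabyte already lives under /projects

projects.set_quota(1100 * TB)    # 1.1 PB limit: applies at once, usage is under it
print(projects.over_quota())

projects.set_quota(900 * TB)     # 900 TB limit: usage is already over it
print(projects.over_quota())
```

In this sketch the cost moves from quota creation to each write, which updates one counter per ancestor directory. Setting the limit to 900 terabytes flips `over_quota()` immediately, and any further `write` under `projects` raises `QuotaExceeded` without visiting a single file.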

Mike Matchett:                  We talked about a number of tasks earlier that really become hard at scale because of things like tree walks, because of the time it would take to look, and snapshots we looked at. So, that's a really hard thing to do. Getting a handle on performance. Finding where the capacity is really being utilized. And just a number of issues that you don't really necessarily think about when you're working with a small file system would be the problem. And then, people quickly discover that, when they get really big, that ... wait a minute. Here's the real problem. And you guys are way ahead of those issues for that group of people, right?

Peter Godman:                   Yeah. You're absolutely right. In the old days, a million files was a lot, and when you wanted to know something about it, you'd just go visit all of them and take a look at what you had. And to do that with a billion files would typically take people months. To do it with ten billion, it would take them years. It falls apart at scale. You can't do things [inaudible 00:05:15] anymore.
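Editor's illustration: the scaling described here is linear, so the arithmetic can be sketched in a few lines. The visit rate below is a purely hypothetical figure chosen for illustration, not a measurement from the interview or from any product.

```python
SECONDS_PER_DAY = 86_400


def walk_days(file_count, files_per_second):
    """Days needed to visit every file once at a steady rate."""
    return file_count / files_per_second / SECONDS_PER_DAY


# Hypothetical assumption: a full tree walk inspects 200 files per second.
ASSUMED_RATE = 200

for count in (10**6, 10**9, 10**10):
    print(f"{count:>14,} files: {walk_days(count, ASSUMED_RATE):,.1f} days")
```

At that assumed rate a million files takes a small fraction of a day, a billion takes roughly two months, and ten billion takes about a year and a half, which is the same shape as the "months" and "years" described above.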

Mike Matchett:                  So, we talked a little bit about some of the things that you guys are looking forward to doing. And you talk about being able to be more distributed. Being more ubiquitous I think was the word we used. And adding more values. [inaudible 00:05:35] there anything that you want to give us some hints about what you're gonna be bringing to the table going forward?

Peter Godman:                   Yeah. Hey look, the world is changing radically in three different ways. One, globalization. We're becoming a global species very, very rapidly. Two, public cloud is emerging as a completely, more-than-viable way of dealing with large data sets. And folks need to have their data moving between public cloud and on premises. And then, three, people are managing these gigantic data sets that we just discussed. Any file storage platform that is designed for the future is going to have to take on all three of those challenges. It's gonna have to help people create global workflows. It's gonna have to work in the public cloud and on premises. And it's gonna have to help people deal with billions and billions of files.

                                                      So, everything that you see us building and talking about is, in some way, gonna tie back to one of those three different things. And usually, more than one of those different things. So, what you know today is, we built a software-defined platform. So, we have our own appliances, we work great with HPE servers, we work in the public cloud. We have replication functionality that you can expect us to extend to more and more collaborative experiences for our users. And we built this really phenomenal core called Qumulo DB into our [inaudible 00:06:54] that helps folks manage enormous data sets. And you can expect to see that get more and more powerful over time.

Mike Matchett:                  And there's far more to discuss with you guys about that, if people are interested, and we'll send them to your website here in a minute. But, one of the things you've just ... or, are just about to announce, and maybe by the time this comes out you will have, is an all-flash system. But not just a run-of-the-mill all-flash file system. What's unique about this all-flash file system?


Qumulo Product Highlights

Peter Godman:                   Our all-flash system is based entirely on NVMe. So, we're always looking to what the future of technology is, rather than backwards at what the past is. So, rather than build something on SAS, [inaudible 00:07:34] something like that, we said, okay, great, all NVMe. 100 gigabit. It's based on modern Skylake-SP CPUs. So, this is really cutting-edge hardware technology in the form of appliances, but all standard. All standard. Nothing proprietary about it.

                                                      And, it is difficult to build the fastest all-flash, scalable file systems in the world without using any proprietary hardware, but that's exactly what we've done. So, we built throughput machines. The throughputs ... Sorry, the throughput of our all-flash Qumulo File Fabric systems beats our competitors hands down. Some of our competitors we're beating now by a factor of four in throughput on a [inaudible 00:08:20] basis. This is kind of no contest.

Mike Matchett:                  You're taking the gauntlet, throwing it down, and saying, look, we've got a scalable file system going to billions, and it's the fastest one, so we got you beat there, too.

Peter Godman:                   It's a fallacy that you just have to have proprietary hardware to win in performance. Proprietary hardware is actually a huge drag, because most storage vendors just simply do not have the volume they need to get great economies of scale doing that. Qumulo takes a completely different approach. We say, look, there are all these folks out there building fantastic servers based on modern, standard hardware approaches. How do we ride on all of the R&D and investment that they're making, and not make our customers pay for our bespoke, niche hardware development enterprise, which doesn't make any sense at all? So, we're based entirely on standard hardware and we're winning in performance.

Mike Matchett:                  So, you have capacity, performance, and some sort of cost efficiency and reliability.

Peter Godman:                   Oh, of course.

Mike Matchett:                  Proven reliability. So, I think we have ... We have to stop talking, there. I'd love to get you back on camera and drill down more into your roadmap there, at some point. But, thank you for being here today, Peter.

Peter Godman:                   Well, thank you very much for having me, and I really appreciate the opportunity.

Mike Matchett:                  And thank you for watching today. I'm Mike Matchett with Small World Big Data, and we'll be back soon. Thanks.

Peter Godman:                   Great. Thank you, Mike.