How can I handle the massive amount of unstructured


The Big Picture: It's no secret unstructured data is growing at an explosive rate. Add to this the complexity of multiple workflows, sites and cloud infrastructures...not to mention security...and you've got a recipe for trouble. You can't apply yesterday's legacy technologies to solve today's problems. They're inefficient at best and non-compliant or unsafe for your company's data at worst. In this short video, learn how Aparavi takes advantage of today's technologies and webscale infrastructure so you can manage a limitless quantity of unstructured data...regardless of how heterogeneous your infrastructure is.


Mike Matchett:                  Hi. I'm Mike Matchett with Small World Big Data and I've got here with me today Rod Christensen who's the CTO, the smart guy from Aparavi. Aparavi's in the, technically the backup space but I tell you what, listening to these guys and what they're doing backup has changed significantly. We are now not just talking about taking your files, taking copies of things, moving them off in a snapshot way and filling up 10 petabytes of old archives and running your cloud bill through the roof.

                                                      We're talking about how do you get the maximum value out of your backups, how do you get all the functionality you want out of it, how do you do some really cool things with it, and most importantly, do backup in a multi-cloud world so that it makes sense for you, optimizes your space, your cost, your benefits, your values, your functionality, all this stuff. These guys have thought it through. Welcome to the show, Rod. How you guys doing today?

Rod Christensen:              [inaudible 00:00:53] Mike, I appreciate being here. We're doing great. How are you going?

Mike Matchett:                  Good, good. Aparavi, multi-cloud, active archive. Let's just start with that multi-cloud bit first because I think that's really interesting. What do you have to bring to the table in a new kind of solution to really handle multiple clouds today, what is it that you really had to re-think through?

Rod Christensen:              You know, when a lot of vendors say multi-cloud, they mean they have the ability to target one cloud and use that as an output device. Then, if they want to, they can go to another cloud and use that as a different device, but you actually really have to pick one or the other. Now with Aparavi, it allows you to put data into one cloud and then later on, after you get a better deal on another cloud or a different relationship with a different cloud vendor, you can switch over to that cloud and use both clouds simultaneously.

                                                      Without actually moving your data from the first cloud, you can start writing new data over to the second cloud. It seamlessly picks ... When it goes to recover data it seamlessly knows where all that stuff is and picks out the right data from the right cloud at the right time.
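To illustrate the behavior Rod describes (old data stays in the first cloud, new writes go to the second, and recovery finds each object wherever it lives), here is a minimal sketch. It is not Aparavi's implementation: the `MultiCloudCatalog` class, its method names, and the in-memory dictionaries standing in for cloud buckets are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MultiCloudCatalog:
    """Hypothetical catalog that remembers which cloud holds each object."""

    # In-memory dicts stand in for real cloud buckets (assumption).
    clouds: dict[str, dict[str, bytes]] = field(default_factory=dict)
    # Object key -> name of the cloud that holds it.
    location: dict[str, str] = field(default_factory=dict)
    active: Optional[str] = None

    def add_cloud(self, name: str) -> None:
        self.clouds.setdefault(name, {})

    def set_active(self, name: str) -> None:
        """Direct new writes to this cloud; existing data is not moved."""
        if name not in self.clouds:
            raise KeyError(f"unknown cloud: {name}")
        self.active = name

    def write(self, key: str, data: bytes) -> None:
        if self.active is None:
            raise RuntimeError("no active cloud selected")
        self.clouds[self.active][key] = data
        self.location[key] = self.active

    def read(self, key: str) -> bytes:
        """Recover an object from whichever cloud it was written to."""
        return self.clouds[self.location[key]][key]


catalog = MultiCloudCatalog()
catalog.add_cloud("cloud_a")
catalog.add_cloud("cloud_b")

catalog.set_active("cloud_a")
catalog.write("2018/report.pdf", b"old data")

# Switch vendors later: nothing is copied out of cloud_a.
catalog.set_active("cloud_b")
catalog.write("2019/report.pdf", b"new data")

old = catalog.read("2018/report.pdf")  # served from cloud_a
new = catalog.read("2019/report.pdf")  # served from cloud_b
```

The design point is that the catalog, not the caller, records where each object lives, so switching the write target never requires migrating what is already stored.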

Mike Matchett:                  Yeah so, and this is part of the value of Aparavi. What we're really saying is, it's not [inaudible 00:02:14], right? Aparavi comes from a different word you were telling me, to prepare and to plan and to get things right, not to make your data disappear, although that's what I think of with the Harry Potter part. We're really talking about how do you deal with masses of unstructured data that are growing today? You don't just want to deal with it the same way we used to, we know we're going to end up with petabytes of stuff.

                                                      If I use one of today's, I hate to say it, legacy backup solutions and I target the cloud with it, I also end up with the situation where I can't really use what's in the cloud directly. I have to still come back out and, in some ways, rehydrate or come back out of that system before I can use that data. That's not the approach you guys took, you looked at it a little bit differently, right?

Rod Christensen:              We looked at it very differently and that's how you access data once it's in the cloud. Obviously, you can recover data from the cloud to on premise and get all your data. It's rehydrated, de-duped, de-deltified and all that kind of stuff. The real value of the data is actually in data analytics, e-Discovery, and things like that, while the data's sitting in the cloud. How do you make use of that?

                                                      We've actually come out with a public domain DLL and shared object that you can actually put up into a cloud instance, write a program or connect to an e-Discovery or a gateway to an e-Discovery. It gives you complete access to the archive data without bringing it down back on premise. Basically, if you have 10 petabytes of data sitting up in the cloud you can access that data without the rehydration and the egress fees that are normally associated with it.

                                                      Trying to bring that data back on prem for any kind of analysis is just impossible once it [inaudible 00:04:03]. You have to be able to make the data accessible where the data is.

Mike Matchett:                  There's really three things with that, and I know we talked about, yeah, this open data format and what you're saying so I can get to that data in a standard way no matter where it's sitting. That makes it very useful, globally accessible, makes copies also for test data and things like that. You need to have a security layer on top of it so you still have to impose all the constraints to get at different things. We know that's one of the values you guys also bring to the table.

                                                      You're not just dumping it up there into S3 and saying here's the bucket, go get it. You still have to go through your management system to get at the data. Tell me a little bit about it. I think what most people first think of is, "Hey, I'm putting all this data up in the cloud, capacity optimization. I don't want my cloud subscription costs to go through the roof, I want to use multiple clouds."

                                                      What do you guys do to the data that makes the cloud really an effective and cost efficient option?

Rod Christensen:              That's a great question. The first step in really optimizing data storage and data capacity is recognizing what you have in the first place, what kind of data you're dealing with. If it's documents or Excel spreadsheets or PDF files or something like that, you really need to understand what you're dealing with. Once you understand that then you can classify the data as to its importance.

                                                      Once you classify the data as to its importance, then you can set policies on how long that data is to be retained. For example, say if you have PDFs that you want to keep a couple years but all your DOCXs that have legal information, have special aspects and characteristics to comply with regulations, you need to keep those a lot longer so you can set data retention periods of seven years on those.
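The retention example Rod gives (PDFs kept a couple of years, DOCX files kept seven) can be sketched as a simple policy table. This is only an illustration under assumptions: the table, the default of one year for unclassified files, the function names, and the 365-day approximation of a year are all hypothetical, and real classification would look at content and importance rather than just the file extension.

```python
from datetime import date, timedelta
from pathlib import PurePath

# Hypothetical policy table: file extension -> retention period in years.
RETENTION_YEARS = {".pdf": 2, ".docx": 7}
DEFAULT_RETENTION_YEARS = 1  # assumption for files with no explicit policy


def retention_years(filename: str) -> int:
    """Look up the retention period for a file by its extension."""
    suffix = PurePath(filename).suffix.lower()
    return RETENTION_YEARS.get(suffix, DEFAULT_RETENTION_YEARS)


def expires_on(filename: str, archived: date) -> date:
    """Date after which the file may be removed (a year is taken as 365 days)."""
    return archived + timedelta(days=365 * retention_years(filename))


pdf_expiry = expires_on("invoice.pdf", date(2020, 1, 1))
docx_years = retention_years("Contract.DOCX")
```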

                                                      In addition to that, we can recognize things like social security numbers, phone numbers, addresses, and things like that within the dataset itself while we're actually copying it. You can do queries on it to say give me all the documents with personal identification in it or search for a particular person or word or whatever you want to do. That's how you actually multi-use the data that's out there.
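A query like "give me all the documents with personal identification in it" could look roughly like the sketch below. The regular expressions are deliberately simplified US-style patterns chosen for illustration, and `find_pii` and `documents_with_pii` are hypothetical names; nothing here reflects how Aparavi actually recognizes such content.

```python
import re

# Simplified patterns, for illustration only (assumptions, not production rules).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text: str) -> set[str]:
    """Return the labels of every PII pattern that appears in the text."""
    return {label for label, pattern in PII_PATTERNS.items() if pattern.search(text)}


def documents_with_pii(documents: dict[str, str]) -> list[str]:
    """Names of documents that contain at least one PII match, sorted."""
    return sorted(name for name, text in documents.items() if find_pii(text))


docs = {
    "memo.txt": "Lunch is at noon.",
    "hr.txt": "Employee SSN 123-45-6789 on file.",
    "contacts.txt": "Call 555-867-5309 after five.",
}
flagged = documents_with_pii(docs)
```

Running the scan while data is being copied, as Rod describes, means the labels can be stored alongside each document and queried later without re-reading the archive.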

                                                      The most important part of it is, once the data is no longer needed, get rid of it, you don't need it anymore. You can save a huge amount of cost. A lot of companies are actually saying, "Okay, well we're just going to throw stuff up in the cloud and we'll keep it for seven years." What happens if you only need three-quarters of that data? There are savings you can actually obtain by getting rid of 75% of the data for years three through seven.
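As a rough worked example of that saving, the sketch below compares keeping a whole dataset for seven years against keeping all of it for two years and only 25% of it for years three through seven. The assumptions are stated loudly: a flat price per unit of data per year, a dataset that does not grow, and the 75% removal figure taken from the remark above. `storage_units` is a hypothetical helper, and real cloud pricing has tiers and request fees that this ignores.

```python
def storage_units(total_years: int, prune_after: int, kept_fraction: float) -> float:
    """Storage consumed, in 'dataset-years', when only kept_fraction of the
    data remains after the first prune_after years."""
    return prune_after + (total_years - prune_after) * kept_fraction


# Keep everything for the full seven years.
keep_everything = storage_units(7, 7, 1.0)

# Keep everything for two years, then only 25% for years three through seven.
with_pruning = storage_units(7, 2, 0.25)

# Fraction of the seven-year storage bill avoided by pruning.
savings = 1 - with_pruning / keep_everything
```

Under these assumptions the pruned archive consumes 3.25 dataset-years instead of 7, so a little over half of the storage cost is avoided.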

Mike Matchett:                  Yeah. In that way of going in and carving pieces out of it, you've got a couple clever technologies, right? You've got some things we talked about earlier, I think you were talking about pruning is what this whole thing is called. You don't necessarily ... Somebody says give me all the documents in this dataset, you don't necessarily deliver the metadata pointers to each of the objects in that dataset directly because that gives them a static thing and it's almost like you can't garbage collect it underneath.

                                                      What you do is you give them some metadata pointer that you can go back and still prune back within that live and dynamically. Maybe you can explain a little bit more about that.

Rod Christensen:              With pruning, pruning is actually some of the secret sauce area, it really is, and it's actually pretty complicated to map out.

Mike Matchett:                  Well, we only have a couple more minutes, Rod, so ...

Rod Christensen:              I know, I know. The thing is, is that what pruning does, it only keeps the data as long as it needs to be kept and once all ... Let's say that you have a retention period set for two years. Anything beyond that two years will automatically be removed from the system so you no longer need it. Unless that is being held or referenced by something else further down the line, say eight, six months ago, it will actually keep that data until that secondary document that relies on that data actually expires and then it can get rid of the whole thing.

                                                      That's how the pruning of data management works. I have a white paper on it and it's how many pages, three, four, pages long that explains it. It's pretty technical though, I will tell you as a warning.
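The rule Rod outlines (data past its retention period is removed unless something newer still references it) can be approximated with a reachability check. This is a generic sketch, not the algorithm from the white paper: the `prunable` function, the `expires` and `depends_on` structures, and the example item names are all assumptions made for illustration.

```python
from datetime import date


def prunable(
    expires: dict[str, date],
    depends_on: dict[str, set[str]],
    today: date,
) -> set[str]:
    """Items that are past retention and not needed by any retained item."""
    # Start from everything still inside its own retention period.
    keep = {item for item, expiry in expires.items() if expiry > today}
    # Anything a kept item references must be kept too, transitively.
    stack = list(keep)
    while stack:
        item = stack.pop()
        for dep in depends_on.get(item, set()):
            if dep not in keep:
                keep.add(dep)
                stack.append(dep)
    return set(expires) - keep


# A base copy whose retention ends first, and a later delta that relies on it.
expires = {"base": date(2020, 1, 1), "delta": date(2020, 7, 1)}
depends_on = {"delta": {"base"}}

held = prunable(expires, depends_on, today=date(2020, 3, 1))
released = prunable(expires, depends_on, today=date(2020, 8, 1))
```

In March the base copy is past its retention date but is held because the delta still relies on it; by August both have expired and the whole chain can be removed together.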

Mike Matchett:                  Well tell the world, tell us where we can find that paper. Is that on the ...

Rod Christensen:              It's on

Mike Matchett:        , great. I should point out that Aparavi is, at first level, a SaaS service and works on whatever targets you want across the board so people can get up and start pretty quickly with it, right?

Rod Christensen:              Yep, absolutely. Onboarding is very simple.

Mike Matchett:                  Yeah, awesome. I guess, and there's a whole bunch more of technologies that I would love to get into, we got some snapshotting discussion that we got into. We got into some things about data analytics and about the storage analytics and about the open data format, so tons more things to really dive into there, all kind of uniquely packaged together in what I think is really one of the smartest assemblages of backup and archive software for the cloud. It just really puts it together in one piece, so kudos to you guys for that.

Rod Christensen:              Oh thank you very much, I appreciate it.

Mike Matchett:                  Again, it's Aparavi, check it out. I think we don't have too much more time, do you have any final thoughts?

Rod Christensen:              Not for me, last call.

Mike Matchett:                  Well thank you for being here, Rod. Hopefully we can do this again and dive into some of the other topics in some deeper episodes. Thanks for attending today. Check it out, there's lots going on in the backup, archive, and data management space. I'm sure I'm going to hear more about Aparavi as they roll out yet some more features going forward, they're just getting started. I'm Mike Matchett with Small World Big Data, thanks for being here, Rod.

Rod Christensen:              Thank you, Mike.

Mike Matchett:                  Thanks for watching and come back soon. Bye.