CelerData: Breaking the Data Pipeline for Real-Time Analytics

Truth in IT
11/20/2025
Transcript


Mike Matchett: Hi, Mike Matchett here with Small World Big Data. We're here to talk about one of the great emerging things happening right now: how to take massive amounts of data, query it in real time, get at the most current data, and eventually feed it to the AI agents we're all developing. It's really about making use of massive amounts of data quickly. For that, we have to look at what we already have, which is often a data lakehouse full of big tables, lots of what we might consider BI or OLAP data we want to mine, and we have to bring that into the real world and make it behave like a real-time system. We've got CelerData here to talk to us about StarRocks, and in particular their latest release, StarRocks 4.0. Hey, Sida Shen, welcome to the show.

Sida Shen: Hello, Mike. How are you doing?

Mike Matchett: Good, good. That's a lot to talk about. Really, what you're doing at CelerData is carrying a lot of this forward as open source, and we'll get into that; you have an enterprise edition too, which we'll get to later. You're applying some really modern ideas about how to implement high-performance, high-scale, real-time access, in an SQL way if I can abuse the terms, to what most people consider their data lakehouse data. That's huge volumes of stuff that needs data pipelines to preprocess it. If you want a dashboard on your data lakehouse, you've got development engineers working for days to build these pipelines of queries, creating static artifacts that process everything from the day before. You're not really getting real time out of that. But the world is changing. So before we dive in, from your perspective, what are people trying to do with that old OLAP approach that isn't working? What challenges are they running into?

Sida Shen: That makes sense. Data lakes are big, and people used to think they were only suited to that kind of batch workload. Anything low-latency, high-concurrency, or otherwise performance-sensitive, they would immediately move to a proprietary system. People did that for years, until the emergence of data lakehouse table formats: data lakes that behave like data warehouses. Projects like Apache Iceberg and Delta Lake add data-warehouse-like features, but they still sit on top of Parquet files as a data lake. People are now moving toward running all of their workloads on open table formats like Apache Iceberg and Delta. The reason is that if you have a separate database for each workload, you create a lot of data silos. You don't want to manage different copies of the same data scattered across 15 different systems because you have 15 types of workload, especially with the emergence of AI.
People don't know what they're going to do even tomorrow, and they don't know whether their proprietary storage will support the kind of tooling they're going to need in the coming years. That's why people are moving even more toward this open-format environment. We see a lot of users running low-latency dashboards, even customer-facing, in-product analytics, serving analytical capabilities directly to their end users or to AI agents, straight from the data lakehouse. That's the trend we've been seeing, and we've been a key player in this field: we support that kind of low-latency, high-concurrency workload directly on your data lake.

Mike Matchett: Just as an example, pick your favorite online retailer, where you might ask: what are my recent orders, what's my current status, what's popular right now? These things happen at scale, and the e-tailer wants to get that information back to the user in real time, but there's a massive pyramid of data underneath it. How do we do that fast? Give us some clues. Before we even dive into StarRocks, what has to happen to become more real time, or to break up this idea of pre-built data pipelines?

Sida Shen: A lot of those pipelines are pre-processing pipelines. If you have multiple tables you want to query, and that's the norm, since a relational database management system practically implies multiple tables, a lot of engines can't join them on the fly, or at least can't return the result fast enough. So people got stuck with pipelines. They denormalize, which basically means turning those multiple tables into one big flat table that contains all the information from all of them. That kind of pipeline is very popular, a bad kind of popular, because of the limitations of those engines, and those pipelines are very expensive. We figured out a way to run those queries fast enough on the fly so you don't have to create those pre-processing pipelines.

Mike Matchett: We want to avoid pre-processing because it bakes in obsolescence the first time you do it: you answer one question, but you bake in the inflexibility to answer different questions, because you'd have to build a different pipeline for each one. Or you make another copy, and once you start making copies of all your big data, you lose all the governance and you can't do much else with it.
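To make that contrast concrete, here is a minimal sketch of the two approaches described above. It assumes a MySQL-protocol endpoint (the StarRocks frontend speaks that protocol); the host, port, credentials, and the orders/customers/products tables are all hypothetical and not taken from the interview.

    # Minimal sketch, not CelerData's implementation: contrast a scheduled
    # denormalization pipeline with running the same multi-table join on the fly.
    import pymysql

    HOST, PORT, USER, PASSWORD = "fe.example.com", 9030, "analyst", ""

    # (a) Pipeline approach: a nightly job flattens several tables into one wide
    #     table so the dashboard never joins at read time. The result is always
    #     as stale as the last pipeline run.
    NIGHTLY_DENORMALIZE = """
    CREATE TABLE orders_flat AS
    SELECT o.order_id, o.order_ts, o.status,
           c.customer_id, c.region,
           p.product_id, p.category
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products  p ON o.product_id  = p.product_id
    """

    # (b) On-the-fly approach: run the join at query time, so results reflect the
    #     newest data and no flat copy has to be maintained.
    ON_THE_FLY = """
    SELECT c.region, p.category, COUNT(*) AS orders_today
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    JOIN products  p ON o.product_id  = p.product_id
    WHERE o.order_ts >= CURDATE()
    GROUP BY c.region, p.category
    """

    def run(sql: str):
        conn = pymysql.connect(host=HOST, port=PORT, user=USER, password=PASSWORD)
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        finally:
            conn.close()

    if __name__ == "__main__":
        for region, category, n in run(ON_THE_FLY):
            print(region, category, n)

The point of the sketch is only that the join moves from a scheduled job into the query itself; nothing about it is specific to one engine.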
Mike Matchett: So let's talk about that. First of all, real time kind of breaks that model anyway: if you have to preprocess versus getting a real-time look, there's a problem right there. And tell me about governance. What happens to governance at scale when we do that?

Sida Shen: It's always easier to govern your data when there's one copy. If you have the same data in 15 different places, it's horrible for your data quality, and it's extremely difficult to do security and access control. Also, having all of your data in one place is really good for your AI agents, under access control policies of course. AI SQL agents really benefit from having a unified view of all the data in your organization: one big, performant data lake that contains a single copy of everything. That's really good if you want to develop AI agents on top of that data.

Mike Matchett: So let's look at CelerData and StarRocks in particular. What does StarRocks do differently to solve this? You've got these problems with latency, and with the inflexibility of large pre-canned pipeline development and processing. What does StarRocks do differently here?

Sida Shen: Great question, Mike. To get rid of all the data pipelines, you basically need to run all the queries on the fly, fast enough. Getting that kind of performance comes down to two things. First, think about what an OLAP analytical query looks like. It's always something like "how many people in this room are wearing red today?": scan all the people, apply a filter, and aggregate. Almost every analytical query is columnar-focused and batch-focused. So you want columnar storage, storing and processing data in columns, and vectorization, so each CPU cycle can process multiple rows of data in mini-batches. The more batched and columnar the execution, the faster the query runs. So the first thing is baseline performance. The second is how you scale horizontally: how you make the system serve not just one user but tens or hundreds of thousands of end users and AI agents in the future, and run not just on gigabytes of data but on petabytes. You want a system that is naturally distributed, without any bottleneck. One architecture for that is MPP, massively parallel processing: you reshuffle the data according to whatever the query is asking for, distributing it evenly across all of your nodes so there is no bottleneck.
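As a rough illustration of that filter-and-aggregate query shape and why columnar, vectorized execution helps, here is a small Python/NumPy sketch. It only models the "how many people are wearing red" example; it is not StarRocks code and says nothing about StarRocks internals.

    # The filter and the count each run over a whole column in tight native loops
    # instead of one row per interpreted step. Pure NumPy illustration.
    import time
    import numpy as np

    N = 1_000_000
    rng = np.random.default_rng(0)
    shirt_color = rng.integers(0, 8, size=N, dtype=np.int8)  # one encoded column
    RED = 3                                                   # hypothetical code for "red"

    def count_red_rowwise(colors) -> int:
        # Row-at-a-time: one comparison per loop iteration.
        count = 0
        for c in colors:
            if c == RED:
                count += 1
        return count

    def count_red_columnar(colors) -> int:
        # Columnar / vectorized: one batch operation over the whole column.
        return int(np.count_nonzero(colors == RED))

    t0 = time.perf_counter(); slow = count_red_rowwise(shirt_color)
    t1 = time.perf_counter(); fast = count_red_columnar(shirt_color)
    t2 = time.perf_counter()
    print(slow, fast, f"rowwise {t1 - t0:.3f}s vs columnar {t2 - t1:.5f}s")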
Sida Shen: So you can keep scaling horizontally to handle however much data the query has to scan. The second thing, for the number of users, is that the database needs to be able to slice and dice the data so it only processes the data that belongs to one tenant, which lets you scale that dimension linearly as well. Another very important thing for keeping queries stable across different tenants is resource isolation: you want to make sure that all of your tenants, all of your customers, get the appropriate amount of resources to run their workloads properly. So physical resource isolation is very important here as well.

Mike Matchett: Okay, back to you. We went from talking about the old, slow way of doing BI to the use cases you mentioned earlier, like serving a real-time dashboard to end customers at scale. Now let me ask about AI agents and the brave new world of AI demand. Could you say something about how you see AI agents coming to want something more like StarRocks rather than just the old lakehouse?

Sida Shen: Yes. The lakehouse concept is still important, but you need different kinds of tooling on that data lake. That's the magic of building on open formats: if you don't like the tools you're using today, you can ditch them and switch to another one without moving your data. That really encourages multiple tools on top of one big single source of truth for your entire organization. I think the data lakehouse concept is extremely important for AI agents, because AI agents benefit even more than humans from having an overall view of all the data in the organization, along with performance good enough to support whatever workload those agents are trying to run. The second point is that you want to shorten the time from the generation of your data to intelligence with context coming back from the chatbot. Right now a lot of that bottleneck is on the infrastructure side: querying a petabyte of data is slow, time-consuming, and expensive. We're here to solve that problem, to lower the latency for whatever type of query you or your AI agent sends to your massive data lake and return the result at a satisfactory latency.

Mike Matchett: That makes sense. We see a lot of AI emerging, and some people think they can just access everything instantly, by magic. No, there's still a lot of work people have to do to make all the data in an organization available in AI time.
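To picture the per-tenant slicing Sida describes for customer-facing serving, here is a hypothetical sketch of a dashboard or agent request scoped to one customer: the predicate on the tenant key is what lets a distributed engine prune down to that tenant's slice of a much larger table. The table, columns, host, and credentials are invented; this is not CelerData-specific code.

    # Tenant-scoped, customer-facing query against any MySQL-protocol endpoint
    # (the StarRocks frontend speaks that protocol). All names are placeholders.
    import pymysql

    RECENT_ORDERS = """
    SELECT order_id, status, order_ts
    FROM orders
    WHERE customer_id = %s                      -- tenant key used for pruning
      AND order_ts >= NOW() - INTERVAL 7 DAY    -- keep the scan recent and small
    ORDER BY order_ts DESC
    LIMIT 20
    """

    def recent_orders(customer_id: int):
        conn = pymysql.connect(host="fe.example.com", port=9030, user="app", password="")
        try:
            with conn.cursor() as cur:
                cur.execute(RECENT_ORDERS, (customer_id,))
                return cur.fetchall()
        finally:
            conn.close()

    if __name__ == "__main__":
        print(recent_orders(42))

Resource isolation between tenants is a separate, engine-level control (grouping compute and capping its use per workload) and is not shown here.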
Sida Shen: Yeah. In any organization without a data lakehouse system, the data infrastructure is extremely scattered: you have data here in a key-value store, there in different kinds of transactional stores, and across different analytical engines, with each team running its own special tooling. Extracting all of that and serving it to an agent without centralized storage is very difficult. There are tools out there that claim to have connectors for every database the world has ever seen, but those connectors can never deliver the kind of performance an agent needs. They're fine for a batch ETL job that runs all night and spends two hours extracting data for one big batch operation, but that's not what an AI agent is asking for. It's asking for an overall view and good performance. So having a centralized data lakehouse in open formats is definitely the answer for us, definitely the way it's going.

Mike Matchett: Okay. Now tell us a little about the 4.0 release. You have a thriving open source community that keeps growing, with increasing numbers of people contributing and following the project. Tell us about 4.0.

Sida Shen: Thank you. First of all, even though we already have this level of performance, we keep striving to push the performance barrier further. Year over year we see around 60% gains in raw performance, in joins and aggregations. We're also expanding into the kinds of workloads AI agents are going to need: semi-structured data such as JSON, with JSON performance approaching that of our native columnar tables, and vector indexes for vector search, so you can search by semantics for semantic-search and hybrid-search scenarios and for building RAG to empower your LLMs. We're also determined to be even more open. In the 4.0 release we've deepened our integration with Apache Iceberg: not only queries, but now writes and Iceberg table maintenance as well, because if your Iceberg table is messy, it doesn't matter how well we optimize the query; scanning the data is going to take all of the time. So there are even more ecosystem integrations with Apache Iceberg and Delta. And we're even more committed to open source: we're open sourcing two of our features. The first is StarOS, our implementation of storage-compute separation, which is going into the open source project. The second is multi-warehouse, which is basically a way to group your different compute nodes into physically isolated compute warehouses to serve different kinds of workloads, with compute-ingest separation, compute-compute separation, and compute-compaction separation, so all of your customers have the right amount of resources to keep their services running and hit their SLAs.
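For readers who want to see what querying the lakehouse in place can look like, here is a rough sketch of attaching an Apache Iceberg catalog to StarRocks and querying an Iceberg table over the MySQL protocol. The CREATE EXTERNAL CATALOG property keys follow the general pattern in the StarRocks documentation for a Hive metastore, but they may differ by version and catalog type, so treat them as placeholders and check the current docs; the metastore URI, database, and table names are invented, and the 4.0 write and table-maintenance features mentioned above are not shown.

    # Rough sketch, not verified against a live cluster: attach an Iceberg catalog
    # to StarRocks and read an Iceberg table in place over the MySQL protocol.
    import pymysql

    CREATE_CATALOG = """
    CREATE EXTERNAL CATALOG iceberg_demo
    PROPERTIES (
        "type" = "iceberg",
        "iceberg.catalog.type" = "hive",
        "hive.metastore.uris" = "thrift://metastore.example.com:9083"
    )
    """

    # Three-part naming (catalog.database.table) reads the Iceberg table without
    # copying it into StarRocks-native storage.
    DAILY_EVENTS = """
    SELECT event_date, COUNT(*) AS events
    FROM iceberg_demo.analytics.web_events
    GROUP BY event_date
    ORDER BY event_date DESC
    LIMIT 7
    """

    conn = pymysql.connect(host="fe.example.com", port=9030, user="analyst", password="")
    try:
        with conn.cursor() as cur:
            cur.execute(CREATE_CATALOG)
            cur.execute(DAILY_EVENTS)
            for row in cur.fetchall():
                print(row)
    finally:
        conn.close()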
Mike Matchett: All right, so that's 4.0, and you've got a roadmap ahead too. I'm sure people can look that up if they're interested, because it is an open source project. But you also have an enterprise version of StarRocks in the cloud, and given the complexities involved in some of this, it might make sense for some people to have you run their StarRocks. Tell us a little about it; I believe you're calling it CelerData Cloud.

Sida Shen: Yeah. StarRocks is the open source project, and CelerData is the company behind it that supports it; we initiated the project and we maintain it. CelerData Cloud is basically the enterprise version of StarRocks on the cloud, StarRocks on the cloud for enterprises. Its basic architecture is bring-your-own-cloud: we automatically deploy enterprise StarRocks clusters directly into your AWS VPC, into your AWS account. That's a more secure way to run enterprise StarRocks on the cloud. We do all of the maintenance and management remotely from our VPC, so you get the best of both worlds.

Mike Matchett: So people don't have to copy their data lakehouse into a third party?

Sida Shen: No, they don't.

Mike Matchett: They keep what they've got, you insert StarRocks on top of it, and then you remotely manage it, basically.

Sida Shen: Yes. It works like a query engine: you have your tables on S3, you connect, and you query them directly. No data copy into our side; we don't want to make one more copy.

Mike Matchett: That's pretty cool. I think we're running toward the end of our time here. If someone wants to learn more about the open source project, find it and start to get comfortable with it, or look into CelerData Cloud and explore what it could do for them, where would you recommend they start?

Sida Shen: Thank you. To check out CelerData Cloud, enterprise StarRocks in your own cloud VPC, go to celerdata.com; we're currently running a 30-day free trial on the cloud, so definitely check that out. If you want to check out the open source project, go to the StarRocks GitHub page and give us a star, visit the StarRocks website at starrocks.io, and join the Slack channel and the conversation.

Mike Matchett: All right. It sounds like there's a lot someone can do, especially if they've already put a lot of effort into their lakehouse and they've got these burgeoning demands from AI agents saying "give me access to all your data," and initiatives asking how to get value out of AI. You're going to get more value out of AI if you can give it all your data to work with. So it sounds like this is a key piece of architecture that could really make that happen, particularly at large scale and with a more real-time application mindset, rather than writing BI off as old school and refusing to run it at all.
Customer-facing workloads on your data lake: who would have thought that three or five years ago? The customer-facing data lake is probably a good way to summarize it. So check it out, folks. I appreciate you being here today and having the patience to walk me through some complex topics under the hood, but I think we've got the gist of it. Thank you very much.

Sida Shen: Thank you, I appreciate it. Thank you, Mike.

In this interview, Mike Matchett of Small World Big Data sits down with Sida Shen, Product Manager at CelerData, to discuss the release of StarRocks 4.0 and how it’s redefining performance for the modern data lakehouse.

They dive into the challenges of pre-processing pipelines, latency, and governance across massive data volumes, and how CelerData’s distributed MPP architecture and columnar engine deliver real-time analytics directly on open formats like Apache Iceberg and Delta Lake.

The conversation also explores how CelerData Cloud gives enterprises secure, managed StarRocks clusters in their own VPCs, eliminating data duplication while improving scalability and performance for AI workloads.

With vectorized execution, JSON support, and integration for semantic and vector search, StarRocks 4.0 positions itself as the bridge between analytics and the AI-driven enterprise.

Categories:
  • » Small World Big Data
  • » Cloud Webinars » Hybrid Cloud Webinars
  • » Cybersecurity Webinars » Data Security
Tags:
  • inbrief
  • matchett
  • celerdata
  • starrocks
  • starrocks 4.0
  • data lakehouse
  • mpp database
  • real-time analytics
  • apache iceberg
  • delta lake
  • vectorized execution
  • ai
  • data infrastructure