Small Data SF 2024

Small Data By The Numbers: Fireside Chat

Join MotherDuck CEO Jordan Tigani and Fivetran CEO George Fraser for a fireside chat on the realities of Small Data. George's extensive research on benchmarking will provide a foundation for a discussion on where we are today and how the 'useful data' movement will evolve in the future.
Speaker
George Fraser

CEO

Fivetran

George Fraser is the co-founder and CEO of Fivetran. Prior to Fivetran, he worked as a neuroscientist before transitioning into the tech industry and leveraging his analytical background to solve complex data challenges.

0:00 [Music]

0:14 Thanks, George, for being here. It's an honor to have someone of your stature come chat with us. I appreciate you taking some time out of your, I'm sure, busy schedule to come here and help us celebrate what can be done with small data.

0:28 >> I'm very glad to be here. I think small data is a very important trend, maybe the most important trend right now.

0:35 >> Wow, that's a lot of responsibility to deal with. But you're

0:45 building this company, and you see a lot of data. You see data from some of the biggest customers in the world, some of the largest enterprises. Surely they all have big data.

0:58 So what interests you in small data? >> Fivetran has been around for a while, over 10 years. We're part of that modern data stack that's been mentioned in earlier talks here. In fact, I think my co-founder deserves partial credit for popularizing that term almost 10 years ago. And we've been doing

1:16 data replication. If you don't know Fivetran: we replicate all your company's data from your systems of record, like Salesforce and your production database, into one place, a place of your choosing.

1:27 And all that while, from the very beginning, one of the things we observed was: where are the big data sets? We were replicating all these databases and applications, and the data in general was surprisingly small. And it's not a selection effect.

1:42 Fivetran has huge customers. We replicate some of the largest data sets in the world. We replicate OpenAI's production data sets. We replicate Procter & Gamble's ERP system, which is one of the largest SAP installs in the world. So we replicate some of the biggest data sets that are out there, and those are big. But in

2:02 general, big data sets are very rare, and the theory I have developed over the years is that most big data sets in a business environment come from inefficient data pipelines: pipelines that are not doing change data capture, that are copying a new copy of everything every time and preserving it all. That's how you get petabyte-scale data sets most

2:23commonly in businesses. And because we always do everything based on change data capture no matter the source uh we end up only having as much data as is actually there which generally turns out to be less than you would think.

2:37 >> I can imagine that database vendors like those inefficient pipelines that let them rack up very large data sets.

2:45 >> You know, sometimes people speculate like that. What I have found, working very closely with Snowflake and others over the years, is that those companies really are very customer-centric and are eager to help their customers be efficient, even when it costs their own revenue. I mean, Snowflake switched to Graviton years ago, and that cost them over

3:04 5% of their revenue, and they just gave that to their customers. So I think, in fact, the vendors

3:13 are really trying to make things efficient, and the inefficiencies that do happen are just due to patterns of behavior in the industry that have built up over many years. In the example I just gave of inefficient data pipelines: snapshot pipelines are easy to build and maintain, so when you do it yourself, unlike with Fivetran, it's

3:32 very common to fall into that pattern. >> Sure. So we just recently had

3:38 a speaker talk about this Redshift data set, Redset. I think it had half a billion queries against Redshift, several hundred customers, lots and lots of databases. I think it's a real treasure trove of information about how people are actually using Redshift. I wrote a blog post on it called the

4:01 Redshift Files, but I am obviously

4:06 a motivated and interested observer. You also wrote a blog post about your experiences with that data set, and there was a Snowflake data set as well. I would love it if you could share some of the things you found as a

4:24 somewhat less opinionated... well, maybe you're opinionated, but >> I'm opinionated about everything.

4:31 you're not necessarily incentivized to find certain types of patterns.

4:36 >> Yeah, this data set was really fascinating, and kudos to Amazon for publishing it. When I saw the commentary on it when it came out, I learned that Snowflake actually published a similar data set years ago, which I had just somehow missed the whole time. So I spent a bunch of time looking at the Redshift

4:53 data set and the Snowflake data set myself. You can look at my Python notebook on GitHub if you want to. I actually used DuckDB to run all the queries, because that was the most convenient way to interact with it. And I found very consistent results between the two systems. There's all

5:10 kinds of interesting stuff in that data. If you're a startup founder working in the data ecosystem, I encourage you to look at Redset and Snowset yourself, to try to figure out what it means for whatever you're working on, because it's a sample of

5:26 real-world workloads. But the things that really struck me were, I think, some of the same things that struck you.

5:32 The average query is astonishingly small. You look at the distribution, and the middle of the distribution is something like 64 megabytes of data scanned,

5:45 and that's true of both systems, so this isn't some quirk of the sample in one or the other; it's very consistent between the two systems. Most queries are not just small but tiny. And the other thing that was really interesting to me, as someone running a data pipelines vendor, is that 30%

6:04 of the workload is ingest. I was astonished by that. I mean, we had seen signs of this in Fivetran customers before, but I guess I just still didn't totally believe it.

6:16 In general, about a third of what your data warehouse does is just accepting ingest queries from Fivetran or whatever your data pipeline is. And that is a huge amount of dollars getting spent. Our fastest-growing destination now at Fivetran is data

6:36 lakes, where we write the data directly into Iceberg or Delta table formats. When we do that, we do the ingest compute, and one of the things we observed as we built this over the last two years is that, by controlling both the pipelines and the ingest portion into a

6:56 data lake, we were able to build a special-purpose ingest engine meant just for Fivetran pipelines, and we were able to make it so much more efficient that we just roll it into our regular pricing model. So, for example, if you use Fivetran to write to a data lake, that entire ingest cost line item, from your

7:15 perspective, just disappears; it's included in the existing Fivetran pricing model. So I think there are some profound observations in that data set, and one of them is that there are a lot of workloads that are sort of up for grabs in this new era that we're going into. And the

7:34 workloads are not necessarily the ones people thought were inside these data warehouses. The stuff that's going on in these systems is quite different than I imagined it was, and than I think a lot of people imagined.
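
The two headline numbers in this exchange, a median scan around 64 MB and roughly 30% of queries being ingest, are the kind of statistics anyone can recompute from a query log in a few lines. A minimal sketch with synthetic, invented rows (not actual Redset or Snowset data):

```python
# Sketch only: a made-up query log standing in for Redset/Snowset rows.
import statistics

# (query_type, megabytes_scanned) -- invented values for illustration
log = (
    [("select", 64)] * 600      # typical analytic queries, tiny scans
    + [("select", 4_000)] * 100 # occasional bigger scans
    + [("insert", 32)] * 300    # pipeline ingest, ~30% of all queries
)

median_scan = statistics.median(mb for _, mb in log)
ingest_share = sum(1 for t, _ in log if t == "insert") / len(log)

print(f"median scan: {median_scan} MB")    # 64 MB
print(f"ingest share: {ingest_share:.0%}") # 30%
```

The point of the sketch is that both observations are simple aggregations over a query log, which is why publishing the raw data (as AWS and Snowflake did) lets anyone check the conclusions independently.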

7:46 >> As Gaurav talked about earlier, TPC, which was designed to be what people thought a real-world data warehouse workload looks like, is just so different from what is actually happening in these systems. So, given that so much of the

8:03 computation is ingest, do you think that's a sign of inefficient pipelines, or do you think there's actually something deeper in the nature of how people use these warehouses? >> No; Fivetran works very hard to optimize our ingest code, and even when you're

8:26 100% on Fivetran, your ingest costs are like 20%. That's what ours are in our own data warehouse: about 20% of it is our own queries ingesting the data. So it's not that the ingest is horrendously inefficient. It's that people prioritize latency. If you say you want even one-hour latency on your analytic database, you have

8:45 to do a lot of ingest; you're constantly patching the data. >> Any feedback that you've gotten since you posted that? I posted a similar roundup of my own interpretations, and I found some of the same things:

9:04 people tend not to scan as much data as you would think, and most databases are actually pretty small. I looked at the users, and most users basically never do anything that is categorized as big data. I did get some pushback from

9:22 the AWS team; they sent me some grumpy emails. An MIT professor who also works at AWS sent me an email, as did Peter Boncz, who's one of our advisers at MotherDuck. He was the person that invented column stores and vectorized execution, and so it felt like he was

9:41 being brought in like the teacher on this report. And they were sort of pushing back: yeah, well, this isn't necessarily a representative example. But I think, as you mentioned, the fact that the Snowflake data set says sort of the same thing... I mean, there are

10:00 some slightly skewed things there, and I think, strictly speaking, maybe it's hard to make some of those broad characterizations about the full fleet, because you don't have the full fleet; it's just this one data set. On the

10:20 other hand, they're not picking those workloads because they're small. They didn't publish this data set saying, hey, there's a bunch of really interesting big workloads,

10:29 but we're going to hold those out, we're just going to give you the small ones. I think if things are skewed, chances are they're skewed towards the larger side. But have you gotten any other feedback, anything interesting, any other surprises

10:45 in the data set? >> Well, first of all, it's so great when people just put the data out there, because different people can look at it, have different takes, and offer up their own conclusions.

10:56 Gaurav can offer his take on his own system, which you should probably read first, and you can offer yours and I can offer mine, and it's fantastic. And on the representativeness question: the conclusions I drew were the same in Snowflake and Redshift, and so I think that lends some strength to

11:15 hey, this is what's really going on there. And then in terms of reactions, I think one of the most interesting was someone who wrote on Twitter, in reply to my post of my own analysis, something very similar to what we just heard: this is

11:31 exactly like mainframes to PCs. We have these PCs, which is your laptop running DuckDB, and it can do a lot. Maybe it can't

11:43 do everything, but it can do a lot. And for that last 1%, at some point people will start to say: maybe I'll just find another way to solve that super-long-tail problem.

11:58 Now, I think that's a very provocative take. But you asked me for interesting reactions;

12:04 that's one. I think what Gaurav said earlier is also very true: if you look at this compute-weighted instead of query-weighted, that 0.1% is a lot of your compute, because those queries are so big. And my forecast for what's going to happen is that this interacts in a very

12:24 interesting way with open table formats. I think open table formats are going to create this Cambrian explosion where you can mix and match different compute engines against the same underlying database. And I mean database in the way the academics mean it, the concept, not the execution engine. So you'll have Redshift in there, and you'll

12:45 have DuckDB in there, and you'll have Power BI running the same calculation engine that they've been running since the '90s, which is still good, by the way, directly against the data lake. I think you're going to see a lot of jump balls for different workloads. If you're using Fivetran, the Fivetran data lake writer

13:03 will be doing the ingest portion of the workload for you on our own engine, which we built to do that very efficiently.

13:10 So I think you're going to see more diversity. I think the MPP systems,

13:16 these sort of big-iron mainframes, will still be in there.

13:19 They'll still be doing a lot. And I think the great thing about data, if you're a vendor, is that data budgets never go down. If you make the systems more efficient, people just come up with more things to do, and I think that is how this will play out as well.
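
The compute-weighted versus query-weighted distinction George credits to Gaurav is easy to see with invented numbers: in a heavy-tailed workload, the median query stays tiny while a single outlier accounts for most of the bytes scanned.

```python
# Illustrative only: invented scan sizes, not real Redset/Snowset figures.
MB_PER_TB = 1_000_000  # megabytes per terabyte, decimal units

# 9,999 typical queries scanning 64 MB each, plus one 10 TB outlier --
# a 0.01% tail of very large queries.
scans_mb = [64] * 9_999 + [10 * MB_PER_TB]

scans_mb.sort()
median_mb = scans_mb[len(scans_mb) // 2]     # the query-weighted view
tail_share = scans_mb[-1] / sum(scans_mb)    # the compute-weighted view

print(f"median scan: {median_mb} MB")
print(f"outlier's share of bytes scanned: {tail_share:.0%}")
```

Counting queries, the workload looks entirely "small data" (median 64 MB); weighting by bytes scanned, the one outlier dominates, which is why both the laptop-scale engines and the big MPP systems keep a role in George's forecast.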

13:34 >> So that's something you and I have agreed and disagreed on in the past. I tend to be a storage person, so I'm like: no, database people are always going to want to use database storage. But it does seem like there's a lot of excitement around these open table formats. You

13:51 know, Databricks just spent a couple billion dollars on a tiny startup

13:59 so that they could kind of own these formats.

14:05 But I guess you're in a position where you actually see not what people are saying they're going to do, but what they're actually doing. How much use of the open table formats are you seeing? Do you see the

14:21 move towards that as happening quickly? >> Well, first of all, no one owns the open table formats, especially Iceberg. Buying Tabular does not buy Iceberg. There are many participants in that ecosystem at this point, including Fivetran, and open table formats mean open, and it's going to stay that way. And, like I said, it's the fastest-growing

14:42 destination at Fivetran. Sometimes that's a euphemism for "it's nothing," right, because it's just growing: oh, it went from $1 to $3. But we're doing about 1.5% of revenue writing to data lake destinations, and it was basically zero six months ago. It is still a little bit

15:01 of a tinkerer's tool. You have to set up scripts to register tables in your destinations from the data lake.

15:07 It's still not quite all wired together and as seamless as other destinations. But it is getting better week by week, because we have to collaborate with all the other systems in order to build the links so that you get the same user experience from a data lake. In six months it'll be

15:22 exactly like setting up native Snowflake or Databricks or Redshift or whatever. And I think it's really exciting, because I think there's so much opportunity to save money by

15:34 using a mix of computation engines, but also to create a better user experience. For those small queries, in many cases you would really be better off running them on your laptop: the latency would be shorter, and you don't have to share the compute with anybody else.

15:49 >> Sure. And certainly if people do move to these open table formats, it's good if you're not the incumbent, like MotherDuck, because then we get access to that data.

16:00 But it's super interesting that you're saying that, because we chatted a couple of months ago, and I remember you saying that the technology wasn't quite there yet. So it sounds like things are actually progressing pretty quickly.

16:15 >> It's the catalog stuff that's really changed in the last six months, and specifically support for the catalogs by the data platform

16:24 vendors. That's what has to happen for this to become super user-friendly: everyone needs to link together at the catalog layer, and it is literally happening week by week.

16:35 You have to check the data platforms' docs every couple of weeks, because they keep adding more capabilities. >> Mhm.

16:41 >> So, fast-forward five years. In my talk this morning, I talked about how the big data threshold, the threshold between what you can do on a single server or a single laptop and what you can't, is getting bigger and bigger.

16:58 Wes showed that network speeds are improving, so you're going to be able to download more stuff to your laptop, and you see more and more open formats. What does the world look like in five years, in your opinion?

17:15 >> I think Wes has it exactly right. The network speeds are getting faster, and the CPUs are getting faster.

17:23 What's the M5 going to be? What's the M6 going to be? That trend is outpacing the growth of data sizes, because, like I keep saying, in a business, data sets are not getting that much bigger that fast. So for more and more of these queries, it's going to make sense to run them locally, and some of them are

17:41 not queries; they might be Python scripts doing feature engineering or whatever. My own analysis of

17:48 the Redset and Snowset data sets I ran on my Mac Studio, and by the way, scanning the entire data set took 7 seconds.

17:57 So I think we're going to see a lot of stuff actually get pulled down to local compute, because it is so cheap and because it gives you such a good user experience. And then we'll see specialized engines, like Fivetran's data lake writer that I mentioned earlier, which is a specialized engine for efficiently ingesting data

18:18 into an Iceberg or Delta data lake, data that comes from Fivetran, and it relies on various invariants that we maintain in the data we deliver to be more efficient. So I think you're going to see those sorts of peripheral workloads get pulled away from the

18:36 vertically integrated platforms into local compute and these more ad hoc engines.

18:42 >> So it sounds like you're a user of DuckDB, and I also heard that Fivetran uses DuckDB in various places. Can you maybe share how you use DuckDB?

18:51 >> Yeah. The data lake writer service is powered by DuckDB at the lowest level; that is what it does. It runs DuckDB to rewrite the Parquet files. And the reason that was such a great unlock for us is that when Fivetran ingests data into your destination, the queries are

19:09 much more complicated than you might think. There are a lot of subtle corner cases we have to worry about, especially involving recovery from failure, migrations between different modes, and things like that. So we have all this code built up, written in SQL, that knows how to do all this stuff, and by using DuckDB we were

19:28 able to execute that same code, the code that we run against Snowflake and Redshift and all of them, against plain Parquet files.

19:35 So it sped up development a lot, and it's a very efficient execution engine. I suspect, over time, as we continue to make it more and more efficient, we'll write special-purpose code paths for the most common cases, where we just write C++ code. But we haven't done that yet,

19:52 and it hasn't yet risen to the top of the optimization priority list, because DuckDB, just in its stock configuration, is a very efficient system.
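
Fivetran's actual writer isn't public, so the following is only a hypothetical sketch of the pattern George describes: CDC-style merge logic written as plain SQL, so the same statement can run against a warehouse or a local engine. Python's stdlib sqlite3 stands in for DuckDB here purely to keep the example self-contained; the engine-portable upsert shape is the point, not the engine.

```python
# Hypothetical sketch: applying a batch of captured changes with one
# engine-portable upsert statement. Table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted INTEGER DEFAULT 0)"
)

# A batch of CDC events: (id, email, is_delete). The second event for each id
# supersedes the first, exactly as replayed changes would.
changes = [(1, "a@x.com", 0), (2, "b@x.com", 0), (1, "a2@x.com", 0), (2, None, 1)]

# Insert new rows, update changed ones, and soft-delete, in one statement.
conn.executemany(
    """
    INSERT INTO users (id, email, deleted) VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET email = excluded.email, deleted = excluded.deleted
    """,
    changes,
)

rows = conn.execute("SELECT id, email, deleted FROM users ORDER BY id").fetchall()
print(rows)  # [(1, 'a2@x.com', 0), (2, None, 1)]
```

Because the merge is expressed as ordinary SQL, swapping the connection for another engine (the warehouse, or DuckDB over Parquet files) reuses the same logic, which is the development speed-up George attributes to running Fivetran's ingest SQL on DuckDB.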

20:00 >> Awesome. Well, thank you so much, George. It's been a pleasure to chat with you, and I hope you've gotten a chance to enjoy the conference.

20:10 >> Yeah, thanks for having me. It's been great. [Applause] [Music]
