Small Data SF 2025

Building Distributed DuckDB Processing for Lakes

DuckDB is the best way to execute SQL on a single node. But with its embedding-friendly nature, it makes an excellent foundation for building distributed systems. George Fraser, CEO of Fivetran, will tell us how Fivetran used DuckDB to power its Iceberg data lake writer—coordinating thousands of small, parallel tasks across a fleet of workers, each running DuckDB queries on bounded datasets. The result is a high-throughput, dual-format (Iceberg + Delta) data lake architecture where every write scales linearly, snapshots stay perfectly in sync, and performance rivals a commercial database while remaining open and portable.
Speaker
George Fraser

CEO, Fivetran

George Fraser is the co-founder and CEO of Fivetran. Prior to Fivetran, he worked as a neuroscientist before transitioning into the tech industry and leveraging his analytical background to solve complex data challenges.

0:00 [music]

0:05 So, my name is George Fraser. I'm CEO and co-founder at Fivetran, where I've been for the last almost 13 years. And I'm here today to talk about distributed processing with DuckDB. Now, you might be saying, I thought DuckDB was a single-node system. Why are we talking about distributed data processing and looking at an army of cyborg ducks,

0:32 which, thank you ChatGPT for helping me with these slides. Well, it turns out that DuckDB is also a great component

0:41 for building a distributed system, and that is what we have done at Fivetran. So Fivetran: you may know us as the company that does this. We replicate data from your sources, your Postgres database, Oracle database, your apps like Salesforce, Jira, NetSuite, you name it, from all the places where your business data lives. We replicate

1:07 data into your data warehouse. We've been doing this for many years, and we pride ourselves on our high-quality, easy-to-use connectors that just work, so that you don't have to be in the business of troubleshooting the many, many, many corner cases of these APIs.

1:23 And one of the destinations we support is data lakes. We support Iceberg and Delta data lakes as a destination.

1:29 And in fact, we have been working on this for some time. We started working on data lakes in 2020, but you didn't see it for the first couple of years because it didn't work right away.

1:40 It took several runs at it to get data lakes working at Fivetran.

1:48 So if you think about a data lake from a Fivetran perspective, the problem to be solved is that you have a bunch of inserts, updates, deletes, and other operations that you need to apply to a bunch of parquet files. There's more to data lakes in Iceberg and Delta format than parquet files, but the base of them, the part

2:08 that actually contains all of the records, is a big list of parquet files. And so from a Fivetran perspective, you have a Fivetran connector that produces a patch that needs to go into the data lake, that needs to be applied to the set of parquet files, and it is not an append-only system,

2:29 right? Consider the sources we support, like databases. Databases produce write-ahead logs. Write-ahead logs have all kinds of changes in them: the simple ones you see here, but also a lot of very complicated ones. There are a lot of complicated corner cases in the changes that Fivetran has to process. Examples

2:48 of these are things like: a lot of databases produce updates in a patch format, where they only tell you the cell that has changed in the record.

2:59 Some databases, like Oracle and HANA, don't even give you the primary key. They give you a record offset, which is like an internal identifier. It's actually kind of the real primary key of the database, because these are heap databases. And then there are other complicated scenarios like history mode. One of the options in Fivetran is

3:23 that for any database-like system, we can deliver a type 2 history table that preserves every version of every row that has ever existed. And that gets really complicated, especially when you're doing things like recovering from a failure and you're not totally 100% sure what's in the data warehouse and you have to somehow converge back to a

3:42 correct state, for some definition of correct. And many, many more. So this is a long way of saying that the problem of applying changes, the kinds of changes that we see from all the different data sources that we support, to a table, that

3:58 table being represented by a set of parquet files, is a very complex problem. And if you look at the implementation of our destinations, or you go into your Fivetran account, into your Fivetran-fed data warehouse, and you look at the queries you are running, you will see that sometimes the queries are really crazy. They get really complicated in

4:18 migration scenarios and recoveries from failure and all these other things, like: I'm in regular mode and now I want to go to history mode, and I have to go and move things around in the existing data that's already there in order to get it ready, or vice versa. Right? All these things have to be supported. So the

4:34 implementation of Fivetran destinations, which people imagine is sort of just, oh, you write a merge query and you join on the primary key and that's it, right? That is it, like, 95% of the time, but the other 5% is the hard part, and we have to support all these things with data lakes. And this is why

4:51 it took us a couple of years to even figure out a strategy that worked. So, most of Fivetran is written in Java, and our first attempt was like, okay, let's write a bunch of Java code to update the parquet files. There are parquet readers and writers implemented in Java, so let's go to it, you know, public static int update,

5:12 here we go. And we worked at that for a while. The problem was the sheer breadth of scenarios that we have to cover in our destinations, which I was just talking about. There was so much work to be done to cover all these scenarios. All of our other destinations are SQL destinations, and they share a

5:29 lot of implementation. There are all these SQL templates inside Fivetran that define how to do all that crazy stuff that I was just describing. And we had to recapitulate all of that logic in Java, which was a very tedious and error-prone process. And so after some time of attempting to do that, we did what you do a lot when you start

5:48 a big new software engineering project: we gave up, and we started over, and we tried a different approach. We said, hey, DuckDB is getting pretty mature, and it's really fast, and it gives us a SQL interface on top of parquet files, which is pretty close to what we need to make a

6:07 parquet file look kind of like the destinations we're accustomed to working with. And very concretely, if we do it this way, we can literally use the same SQL templates that we use for all of our other destinations.

6:22 We can basically reuse the expressions; the outer query you have to build a little differently. So we gave that a shot, and we were able to get correctness. That was a big step: we were able to get a correct implementation.
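
To make that concrete, here is a minimal sketch, with hypothetical file, column, and operation-marker names, of what pointing a merge-style SQL template at parquet files through DuckDB can look like. It is an illustration of the approach, not Fivetran's actual templates.

import duckdb

# Hypothetical merge-style template: for each primary key, prefer the value
# from the incoming batch, fall back to the existing row, and drop deletes.
MERGE_TEMPLATE = """
    SELECT COALESCE(b.id, e.id)                 AS id,
           COALESCE(b.name, e.name)             AS name,
           COALESCE(b.updated_at, e.updated_at) AS updated_at
    FROM {existing} AS e
    FULL OUTER JOIN {batch} AS b ON e.id = b.id
    WHERE COALESCE(b._op, 'keep') <> 'delete'
"""

con = duckdb.connect()

# For a warehouse destination, {existing} and {batch} would be table names;
# here they are DuckDB expressions that scan parquet files directly.
query = MERGE_TEMPLATE.format(
    existing="read_parquet('table_part_0001.parquet')",
    batch="read_parquet('incoming_batch.parquet')",
)

# Write the merged result out as a new parquet file.
con.execute(f"COPY ({query}) TO 'table_part_0001_new.parquet' (FORMAT PARQUET)")

In a sketch like this, the COALESCE pattern also covers patch-style updates: cells the source did not change arrive as NULL and fall back to the existing value.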

6:38 The problem was that we could not make it work with large tables, especially when we have to do operations where you restate the entire table, which happens sometimes. Fivetran is mostly applying little patches, but occasionally we have to do big batch operations where we reorganize the entire world, and when we would try to do that with a DuckDB worker that's

6:59 sitting in the same process that's running the sync, it explodes. We tried making these VMs bigger, we gave them more memory, but we eventually deemed that strategy unworkable. So while it was correct, it had a problem with scale, because these processes are not that big. The default size for a Fivetran process that moves data in the happy

7:22 case, when you're not doing an initial sync or a resync, is a few gigabytes of memory. So the resources are limited.

7:30 So then we went to the third approach, which worked. It was still DuckDB. It was still the same basic implementation that was correct but not scalable; we scaled it out with a distributed system. And this was actually only the second major additional service we built inside Fivetran. There are now more, but Fivetran at that time

7:54 was an extremely macroservice-oriented company. Our attitude about infrastructure was almost nihilistic. It was like, everything goes in that process; if you want to do something, the answer is it goes in there. And this really broke that model. If you're curious, the other thing, many years earlier, that broke that model is actually the webhooks collector, because it's an

8:14 always-on service that can't ever drop an event. But we built a separate service, deployed independently and scaling independently, for running these DuckDB workers that patch parquet files. Fundamentally, that is what they do, and that is represented by our little cyborg duck here. So

8:36 what does that service look like in action? The sync process here is represented by the little process in the center. That is getting data from the source, whichever source that is. Maybe it's a Salesforce instance. Maybe it's a database feeding you a change log. It gets a bunch of data. It decides, I've got enough, it's time to write some data

8:55 to the destination. And it packs that data up into one or more parquet files, most often one, because the batches are usually pretty small in the happy path where you're doing incremental updates. Then you need to figure out how to distribute your problem. What is your distribution strategy? If you want to build a distributed system, you need a

9:15 distribution strategy. Every worker represented on the right is using DuckDB to patch parquet files, but you somehow have to assign parquet files to workers.

9:25 And what we do is basically a broadcast join. We distribute the files arbitrarily: every DuckDB worker just gets an arbitrary subset of files of a target size. This works because of certain properties that are enforced by Fivetran connectors. Remember, we are not building a general-purpose database, which is a very important insight into how

9:47 we built this system. We are not trying to execute arbitrary SQL queries. We are just trying to do the particular thing that Fivetran sync jobs need to do. And Fivetran connectors have lots of invariants that they enforce universally; if you do not do this as an employee at Fivetran, you have written a bug. One of those invariants is that

10:07 every table has a primary key, and there is some prefix of that primary key such that within that prefix, all operations are local. You don't need to know about anything outside of that set of records in order to perform an update to the destination. That statement is probably more complicated [laughter] than you thought it would be. You probably

10:26 thought I was going to stop at primary key. This is a good example of how things in Fivetran are always more complicated than they seem from the outside. Cases like history mode and certain other scenarios mean that just a primary key is not quite enough. But there are these little domains within the data

10:47 within which you can apply a correct patch without needing to know anything else. So we ensure that those domains always live in the same file, and that means that we can distribute the files arbitrarily to DuckDB workers. So

11:03 as I said, it's like a broadcast join: the entire incoming batch gets broadcast to every worker. In cases where the incoming batch is very large, which are rare but do happen, we do something very simple and very inefficient, which is that we just divide the incoming batch and do a loop. We

11:23 basically do a nested loop on our update. But because we are operating the service internally, it only needs to be efficient in aggregate. It's okay if it's really bad at some things, as long as there's a way to throw compute at it.
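
Here is a sketch of what that coordinator-side planning can look like. The names and size limits are made up for illustration; this is not Fivetran's actual service code.

from dataclasses import dataclass

TARGET_FILES_BYTES = 256 * 1024 * 1024   # hypothetical per-worker chunk size
MAX_BATCH_BYTES = 1024 * 1024 * 1024     # hypothetical broadcast size limit

@dataclass
class Task:
    table_files: list[str]   # arbitrary subset of the table's parquet files
    batch_files: list[str]   # piece of the incoming patch, broadcast to every chunk

def chunk_by_size(files, limit):
    # Greedily group (path, size_in_bytes) pairs into chunks under `limit`.
    chunks, current, size = [], [], 0
    for path, nbytes in files:
        current.append(path)
        size += nbytes
        if size >= limit:
            chunks.append(current)
            current, size = [], 0
    if current:
        chunks.append(current)
    return chunks or [[]]

def plan_tasks(table_files, batch_files):
    # Files can be grouped arbitrarily because every primary-key-prefix
    # "domain" is guaranteed to live inside a single file.
    file_chunks = chunk_by_size(table_files, TARGET_FILES_BYTES)
    # Broadcast the batch to every chunk; oversized batches get split and
    # applied piece by piece, a nested-loop update that is rare and
    # inefficient but only has to be efficient in aggregate.
    batch_pieces = chunk_by_size(batch_files, MAX_BATCH_BYTES)
    return [Task(chunk, piece) for piece in batch_pieces for chunk in file_chunks]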

11:38 And those bad things are rare. This is another example of how we are not trying to build a general-purpose SQL execution engine. We are just trying to do this one thing internally, and we're leveraging that, right? We're saying it's okay if it's really awful in a few cases, as long as in aggregate it's economical for Fivetran to operate.

12:02 And this has worked. This is the basic structure of the data lake that you see today. There are some more interesting details of how this update process works. So if you look at the data lake from a checkpoint perspective, a particular version of it, represented by the column of files in the center of this slide, a column of

12:22 parquet files, is enumerated by a

12:27 metadata snapshot, which you see on the left. And I haven't mentioned this, but all Fivetran data lakes are actually dual-format, Iceberg and Delta, data lakes. We write both metadata formats referencing the same parquet files, so that you just don't have to decide. You can use whichever one you want, which is a

12:49 cool benefit of the way we've built it. But in this diagram, I'm just showing one metadata snapshot. So that's what you see on the left. You have an existing snapshot. You want to make an update. As part of that update, there are some row versions in this table that are going to get invalidated.

13:07 And the way we deal with them is we delete them, but we delete them in a specific way: we squeeze them out of the old files. So we write a new file.

13:15 That's represented by that little green file on the right that is missing the deleted row. And over time, if files get really small, we will compact them. But we don't actually put the new row version back into the same file. And the reason for this is that if you do it this way, you will naturally reorganize the table into a kind of young

13:33 generation of recently updated data and a tenured generation of data that hasn't been updated in a long time, which is a very useful property because it tends to reduce the write amplification of your data lake over time. So we follow this very specific update and compaction strategy, which we have found works in practice

13:53 for Fivetran connectors to reduce the amount of write amplification. And then, after you're done with this, so you've squeezed out all the invalidated row versions, you may have done some compaction if those files have gotten too small, and then you've appended on the files with the updated data, you write a new snapshot that enumerates all the new versions of the files, and you're done.
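
As a sketch of that squeeze-out rewrite, assuming a single integer primary key named id and hypothetical file names, this illustrates the strategy described above rather than the production code:

import duckdb

con = duckdb.connect()

# 1. Rewrite an existing file without the row versions the batch invalidates.
#    The new versions are deliberately NOT written back into this file.
con.execute("""
    COPY (
        SELECT old.*
        FROM read_parquet('part-0007.parquet') AS old
        WHERE NOT EXISTS (
            SELECT 1 FROM read_parquet('batch.parquet') AS b
            WHERE b.id = old.id)
    ) TO 'part-0007-v2.parquet' (FORMAT PARQUET)
""")

# 2. Append the updated and inserted row versions as a fresh file.
con.execute("""
    COPY (
        SELECT * EXCLUDE (_op)
        FROM read_parquet('batch.parquet')
        WHERE _op <> 'delete'
    ) TO 'part-0101.parquet' (FORMAT PARQUET)
""")

# 3. The coordinator then commits a new Iceberg/Delta snapshot that lists
#    part-0007-v2.parquet and part-0101.parquet in place of part-0007.parquet.

Because the updated row versions land in their own file rather than back in the rewritten one, recently changed data naturally accumulates in a young generation separate from the tenured data.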

14:13 And the way this concretely works inside Fivetran is that the same sync process, which originally we tried to just jam DuckDB into and have it do the entire update, still acts as the coordinator for the process. So the data lake writer service just understands parquet files. It does the heavy lifting of rewriting all the

14:33 parquet files, and then the metadata is managed inside that parent process, that sync process, that is calling

14:43 the data lake writer service. So this has been a very successful implementation for us. It works with real customers, with real connectors.

14:54 We've been operating it successfully for about two years now. It's really grown a lot this year; it's doubling in usage about every six months now. And it's a reasonable cost for Fivetran to operate. So when you use a data lake as a destination, we don't actually charge you anything additional on top of the regular Fivetran connector

15:16 costs. We just roll it into the same pricing model that we already had, even though we're doing all this work that previously would have been done by, like, Snowflake or Databricks or something like that. We've done the math,

15:30 and at this point, if you're paying Fivetran a dollar and you're using data lakes as a destination, on average it costs about 20 cents for us to do those data lake operations, and we think we can actually make that a lot better. There are a lot of things we can improve to improve efficiency.

15:46 You will notice, if you look closely at this diagram, that we are not actually doing positional deletes, which are a feature of Iceberg and Delta. And the reason is that until recently hardly any of the readers supported them, and we have to support a lot of different readers. So that does make it a lot more work to update the data lake.

16:05 Fortunately, the reader support is finally getting there, so I think we will finally be able to do this. And that speaks to a larger principle of what we care about when we write these data lakes. One of the things we care about is that they work well with every destination. Unlike a lot of other

16:22 vendors who will write Iceberg or Delta data lakes for you, we want to be interoperable with everybody. We want to make sure that all the readers work well. So we benchmark our implementation against a whole array of readers. There are lots of choices when you write these data lakes about the details of how you do this. What are

16:41 your parquet file sizes? What are your row group target sizes? What compression algorithms are you using? What metadata are you putting into the snapshots? Because you can copy metadata into the commits in Iceberg and Delta format. So there are actually a lot of knobs on these formats. And when we've tuned those knobs, we've done it

17:03 by benchmarking against an array of readers to make sure that we have good performance across the entire ecosystem.
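
For a sense of what those knobs look like in practice, DuckDB's parquet writer exposes them as COPY options. The codec and sizes below are illustrative, not Fivetran's settings.

import duckdb

con = duckdb.connect()

# Illustrative only: rewrite some staged files into a single parquet file
# with an explicit compression codec and row group size.
con.execute("""
    COPY (SELECT * FROM read_parquet('staging/*.parquet'))
    TO 'part-0001.parquet'
    (FORMAT PARQUET,
     COMPRESSION ZSTD,        -- compression codec
     ROW_GROUP_SIZE 122880)   -- rows per row group
""")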

17:11 And I'm sure we haven't done that perfectly, but to the extent that we, you know, discover that reader X or reader Y is not doing well in a particular scenario in the future, we are going to fix it, because that's what we're all about. We're all about interoperating with the whole ecosystem.

17:28 So the service is not done. I think we've made a lot of progress. There are a lot of users; many of them are happy.

17:35 And I think it's going well, but there are a lot of things still coming that we see on the road ahead. Another great [snorts] slide here. Just a few of them that are particularly top of mind right now: one of them is write amplification. There's still a lot of

17:53 write amplification because of the way that we squeeze out deletes right now. You know, the young-generation reorganization of the table helps with write amplification, but it's still pretty bad in many cases.

18:09 We'll be able to fix that by incorporating positional deletes now that they are widely supported across the ecosystem. I think we're finally there, and so that will be coming out soon. I think we will have to do something a little bit complicated where we continue to write a lower-frequency full update periodically for readers that are not quite ready. But we're

18:29 excited about that. It's going to improve latencies and reduce storage costs for our customers, and it's going to save a lot of compute costs for us.

18:36 Read amplification is another problem. What I just described will fix write amplification, but it won't fix read amplification, because you still have to go find the correct row inside the parquet file. We think what we're going to do is actually build a custom index in order to shortcut that. It's a

18:56 very specific problem we have to solve. It's okay to have false positives, and everything is in batch, so it's kind of like Bloom filters, but batch. We think that there is a good strategy for writing a custom index that will enable us, most of the time, to avoid having to go search for the deleted record.
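
As a toy illustration of that kind of structure (not Fivetran's design, which isn't described in detail here), a per-file key filter that allows false positives but never false negatives would let a worker skip files that definitely contain none of the batch's keys:

import hashlib

class FileKeyFilter:
    # A simple Bloom-filter-style set of primary keys seen in one parquet file.
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # May return True for a key that is absent (false positive),
        # but never False for a key that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def files_to_search(batch_keys, filters_by_file):
    # Keep only the files that might contain at least one key from the batch.
    return [path for path, flt in filters_by_file.items()
            if any(flt.might_contain(k) for k in batch_keys)]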

19:14 And clustering, user-facing clustering, is something we don't support today. Right now everybody gets this sort of most-recently-used clustering pattern in the data lake.

19:25 And we would like to allow users to specify how they want the data lake to be clustered, and then we would compromise between those two goals. So that's coming, and lots of other little things. It's a really great service. We're very proud of it.

We think it's a great foundation for people who want a core data

19:47 set that is going to support many different readers, including ones that haven't even been invented yet. And I encourage you to give it a try. Fivetran has a free tier, and this is in it. So if you haven't driven it around, give it a shot. Go look at the files and see how you like it. Thank you very much.

20:07>> [music]
