2024 Small Data SF

Give Every User Their Own Database! Unleashing The Untapped Power Of Small Data

Is big data a thing? Or can it be seen as a collection of small data? If we were more intentional about what that collection entails, could we build bigger and faster?
In this talk, we will explore how massive multitenancy with SQLite - the king of small databases - can be used to deliver fast and memorable OLTP experiences for your API.
Speaker
Glauber Costa

Founder & CEO

Turso

Glauber started his career working with the Linux Kernel, where he met Turso Co-founder and CTO Pekka Enberg. Before Turso, Glauber worked as Staff Engineer at Datadog, where he authored the Glommio Rust async executor. In his experience at companies like IBM and Red Hat, he has worked with virtualization technology, storage, and containers. He then spent almost a decade at ScyllaDB, serving as VP of Field Engineering and designing core database features.

0:00[Music]

0:15Welcome, everybody. It's great to be here. We're very happy to be co-organizing this conference, and we're in very good company with MotherDuck and Ollama. I also want to make sure to thank all of the sponsors that are here and, of course, all of you. I'm Glauber Costa, the founder and CEO of Turso. As I said, we're very happy to be here. We

0:38actually brought a bunch of Turso people to the conference, including my co-founder, Pekka, who's right there at the back. He never travels because he lives in the boonies in Finland, somewhere in the woods. So it's very hard to get him out of there, but he's here. So make sure you talk to him. You might never have this opportunity again

0:54in your life, but it's a great place. I've been there and I don't blame him, you know. We have the booth outside, so make sure you check us out. Today, I want to talk about why and how you would want to give each of your users their own database and essentially build a multi-tenant architecture. This is an

1:15architecture that was very hard to pull off even a couple of years ago, but tools like MotherDuck and Turso make it easy, make it doable, and there's a lot of power to extract out of those architectures. Before we get there, I just want to get everybody on the same page, so let's start with a refresher. Everybody

1:35should know that, but what are the kinds of workloads that databases can do? There are essentially two main workloads that you can run on a database: the analytical workloads, or OLAP, and the transactional workloads, or OLTP.

1:50If you want to learn more about OLAP, you're not going to learn it from me. I recommend you go talk to Jordan or any of those MotherDuckers out there. I love the name of that company, by the way. It comes with the pun, right? So again, we're just going to give you a very brief refresher, but if you

2:05want to learn more about OLAP, I do recommend you talk to MotherDuck. Quick overview on that: OLAP workloads, analytical workloads, tend to be long running. We've learned that from Jordan's talk: we talk about minutes sometimes. They tend to scan large chunks of the data, especially in comparison with the transactional ones. Jordan was talking

2:28about a terabyte and things like that. That's already pretty large for a transactional workload, but in analytics you will see that quite often. As a consequence, the latency is less important, once more at least compared to the OLTP workloads. These are things like your business intelligence dashboards. So those are

2:47examples of things that will be analytical. Today in this talk I'm going to be talking about OLTP, the transactional workloads, because that's what our product does. In OLTP you tend to have short queries. A very easy example to digest is a user profile. And sure, same thing as Jordan was talking about, you can have a

3:09massive amount of data, but you're going to have an index on that data. I'm going to give you a user ID and you're going to retrieve that one user ID, so you're going to be scanning a very small chunk of your data. You get the profile associated with that, and you scan as little as possible. Those queries

3:25are short-lived, and the latency of that query is the most important metric. We're talking about microseconds, you know, plus network time.
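
To make the contrast concrete, here is a minimal sketch using Python's standard `sqlite3` module. The `profiles` table and its columns are hypothetical, invented for illustration. It asks SQLite for its query plan for an indexed point lookup by user ID and for an aggregate over the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per user, keyed by user_id.
conn.execute("CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, name TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO profiles (user_id, name, plan) VALUES (?, ?, ?)",
    [(i, f"user-{i}", "free") for i in range(10_000)],
)


def query_plan(sql, params=()):
    """Return SQLite's human-readable plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row holds the description text.
    return " | ".join(row[-1] for row in rows)


# OLTP-style point lookup: the primary key lets SQLite go straight to one row.
lookup_plan = query_plan("SELECT name, plan FROM profiles WHERE user_id = ?", (4242,))

# Analytical-style aggregate: there is no index to use, so every row is visited.
scan_plan = query_plan("SELECT plan, COUNT(*) FROM profiles GROUP BY plan")

print(lookup_plan)
print(scan_plan)
```

The exact plan wording varies between SQLite versions, but the lookup should be reported as a search on the primary key and the aggregate as a scan of the table.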

3:38There are transactional workloads that write a lot but most of the transactional workloads also tend to be very read intensive.

3:47SQLite, as an example (and Turso, which I'll talk more about later, is a database built on top of it), is notoriously not the best database in the world for write-heavy work like heavy data ingestion. But again, that's fine, because there are so many workloads in the transactional space that mostly read. We discussed the user profile example, but you're going to

4:06have things like shopping carts, content layers, most of the stuff that you run on the web. Those are the things where you just read a lot. You have your data there: retrieve me one thing, give me another thing, combine it, join with something else. But at the end of the day, those are the

4:22kind of things that you run. Beyond the technical requirements of just what is a query and what do I need, there are many more modern requirements that those transactional workloads tend to have, for example compliance, privacy,

4:39et cetera. So an example of data that you can have in a transactional database is, again, a user profile, but it could be patient data or credit card information. Those are things that in the modern world, especially in Europe, where my co-founder Pekka comes from, you have the GDPR, and you essentially have

5:00all those laws regulating what you can do with this data. So if a user tells you, "I want you to delete all of my data," you need to do so in a specific way. You may have geographical requirements for where this data is placed, right? And things like that. And even if the laws are not the things covering

5:16that, this is still sensitive private data. And look, security is its own industry, and I'm not even going there, but a lot of having a secure, compliant, private system starts with just having a good architecture that you can build upon. So the more isolated those things are, the more isolated those pieces of

5:37information are, the fewer opportunities you have to screw things up. That's essentially the way it is. Now, with that in mind, why would you want to build a multi-tenant system?

5:51If we could keep per-user data in completely different databases, the privacy level and the security immediately go up. Compliance is easier to achieve, right? If the user calls you and says, "Delete all of my data," all you need to do is delete their per-user database. If they ask for a full copy of their

6:12database, you make a copy of the user database and give it to them. And look, if you want to make sure that nobody will ever read data they're not supposed to read, you can encrypt those databases with different keys. So it allows for a much better foundational system from the get-go.
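
As a rough sketch of what database-per-user compliance operations can look like, the following uses plain SQLite files and Python's standard `sqlite3` module. The directory layout, the `notes` table, and all function names are assumptions made up for this example; it is not Turso's API, and per-database encryption is left out because the standard module does not provide it.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical storage root: one SQLite file per user lives here.
DATA_DIR = Path(tempfile.mkdtemp(prefix="tenants-"))


def db_path(user_id: str) -> Path:
    # A real system must validate user_id before building a path from it.
    return DATA_DIR / f"{user_id}.db"


def open_user_db(user_id: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path(user_id))
    conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
    return conn


def export_user_data(user_id: str, dest: Path) -> Path:
    """Give the user a full copy of their data: a consistent snapshot of one file."""
    src = sqlite3.connect(db_path(user_id))
    dst = sqlite3.connect(dest)
    try:
        src.backup(dst)  # SQLite's online backup API
    finally:
        dst.close()
        src.close()
    return dest


def delete_user_data(user_id: str) -> None:
    """Deleting all of a user's data means deleting their database file."""
    db_path(user_id).unlink(missing_ok=True)


# Usage: write some data for one user, export it, then erase the original.
alice = open_user_db("alice")
alice.execute("INSERT INTO notes (body) VALUES ('hello')")
alice.commit()
alice.close()

export_path = export_user_data("alice", DATA_DIR / "alice-export.db")
delete_user_data("alice")
```

No other user's file is opened or touched by either operation, which is the isolation property the argument relies on.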

6:31My favorite advantage of building a multi-tenant system though is just developer velocity. Every time you are less afraid of making mistakes, developer velocity increases.

6:44So if you make a mistake on a big shared database, think about the consequences of that. And developers make mistakes all the time.

6:51I've been told, and I remember, that it does happen. So every time you make a mistake on a big shared database, you might have read user data you're not supposed to read. You might have updated data that you're not supposed to update.

7:05You might have deleted data that you're not supposed to delete. Now, if every user has their own database, the blast radius of those problems is more limited, right? So it's easier to deal with, and as we're going to see later, you can just restore a backup just for that user. So that happens. And if

7:22your team is less afraid of making mistakes, they can move faster. Something that pairs very well with this is the way replication can work in a multi-tenant system, because now you can also make independent replicas of those databases. We all saw from Jordan's talk that one of the things he recommends out of this small data manifesto is that if latency

7:44matters, put those things as close as possible to your user. Now, if you have a large centralized database, that becomes pretty hard because it's a lot of data. Which data do I copy? Do you need a sync engine? And there are very good sync engines out there. But if your workload is now composed of all

8:03of those small tiny databases, well, you just get this one user database and you copy it close to where the user is, right?

8:10So it's easy to do that. But beyond just that, the way Turso works,

8:17and we're going to talk specifically about that, is that you can replicate those databases wherever you want. So if the user has a mobile device, you can create a copy that is kept up to date on that mobile device, just for that user's data. If you have a server, you can create a

8:34constantly up-to-date copy on that server. If you have a microservices architecture, each one of those microservices can have its own copy of the database that is kept up to date. So developer velocity comes in here again, because now you're operating on a copy of the database. You can have the worst query in the world and you're not going

8:55to affect the production quality of the main database.
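
The idea of running risky queries against a copy can be sketched with SQLite's online backup API from Python's standard `sqlite3` module. This is a stand-in for illustration only: it takes a full snapshot on demand, whereas the replicas described in the talk are kept up to date continuously, and the `orders` table and `snapshot_replica` helper are hypothetical.

```python
import sqlite3

# The "main" database, here in memory for the sake of a self-contained example.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
primary.executemany(
    "INSERT INTO orders (total) VALUES (?)", [(float(i),) for i in range(1000)]
)
primary.commit()


def snapshot_replica(source: sqlite3.Connection) -> sqlite3.Connection:
    """Return a new, independent copy of `source` as of this moment."""
    replica = sqlite3.connect(":memory:")
    source.backup(replica)
    return replica


replica = snapshot_replica(primary)

# A deliberately wasteful self-join runs on the replica only.
heavy = replica.execute(
    "SELECT COUNT(*) FROM orders a, orders b WHERE a.total < b.total"
).fetchall()[0][0]

# Writes to the primary are invisible to the old snapshot until a new one is taken.
primary.execute("INSERT INTO orders (total) VALUES (5000.0)")
primary.commit()
stale_count = replica.execute("SELECT COUNT(*) FROM orders").fetchall()[0][0]
fresh_count = snapshot_replica(primary).execute("SELECT COUNT(*) FROM orders").fetchall()[0][0]
```

Because each tenant's database is small, copying it whole is cheap, which is what makes this pattern practical per user.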

9:08Not to mention the latency advantages that we already discussed, because now, again, you can get a copy of this database that is put exactly where you want, as close as possible to your users.

9:20Reading data from a database

9:24that is right there on your device, on your server, in your file system now takes microseconds. It doesn't take milliseconds, right? In a transactional workload, the latency, which is the most important metric in that system, is usually dominated by the network traffic. If you don't have any network, if you just read data right there, you don't

9:45have N+1 query problems. You don't have anything of the sort. You just read. You can't really get any faster than that.

9:53All right. So the way you do it is, what? Just give everybody a SQL database. SQL databases have been there for a long time. We have many users that we usually deal with; if you have, for example, an application development platform, there are many systems today in the AI

10:14space and in other spaces where users are creating their own systems and AI is creating code for you. If you want each one of those things to have their own database, you can. If you want those things to be throwaway databases, you can create a database that lives for one hour. That's it. You create a database, let whatever model you want

10:35generate SQL queries of any questionable quality level on that. That's fine, because those architectures allow you to have databases that are very nimble, very fast, and can be spread out everywhere.

10:53It makes a lot in your architecture simpler. So if you have, for example, data that is now split into those per-user databases, you don't have to worry about things like row-level security, because the database is now the boundary. Once you give a user access to the database, they can do whatever they want.

11:15There's no need for caching. A lot of the users that we see on our platform don't have a separate cache, because they just make a copy of the database that they keep up to date inside their application server. No need for a cache.

11:27So architecture becomes simpler. Mess up something, which we all invariably do at some point? Restore a backup just for that user. Allow the user to control. In fact, what we start to see is that things that were hidden behind an API, because, oh, if we were to give users direct SQL access to their data it would be a complete

11:49nightmare, become possible. So a multi-tenant architecture has all of those advantages. Now let's talk about how we can achieve this, because look, if giving a database to each one of our users is so good, why wouldn't you do it, right? So let's do it. And the reason you don't do it is this: it costs too much.

12:13Now, this is expectation versus reality, right? So the expectation is that I have a thousand users or a million users or however many users you want. Each of them will come and use the database more or less uniformly.

12:29So you just give each one of them a database. That's fine. Each database is in a container. Each database is on a VM. There is some overhead here, but it's not tremendously large. It's fine, right? There's some overhead in managing all of those VMs, but today with Kubernetes and other automated tools that's okay. There is

12:51some overhead in the fact that you're going to be wasting a little bit of memory here and there, but it's totally fine. Just put every user in their own VM, end of story.

13:02The problem is reality is not like this. Reality looks a lot more like this. What you have is a couple of users that use your system a lot, all the time. You have a couple of users in the middle. They use the system a fair bunch. But then you have a long tail of people who almost never use your system.

13:24They show up once a day, they do a query per minute, or maybe at night, for a period of time, they do a query per second and then they go away as well. So you have this power law distribution in reality, and if you keep a database up all the time for

13:41those users, it doesn't work. Now, there are ways you can get your cost down. You can do, for example, a scale-to-zero architecture. In a scale-to-zero architecture, you have a VM or a container. A request comes, the VM is down, you bring it back up. Kubernetes does things like that. The statefulness of the system

14:00makes it a little bit harder, but you can have a separation of storage and compute. It's all fine. You bring the VM up, you handle that request, you keep it up for a couple of minutes in case there's another request. The request doesn't come, you shut it down again.

14:14That actually works decently well for the very end of the long tail: a request per month, a request per day, things like that. But especially for those users in the middle, imagine you scale to zero after 10 idle minutes and then you have a user that does a query every 10 minutes. It costs as much as

14:30keeping the database up all the time, right? And this is only about the CPU.

14:36In practice, again, this is transactional data. You don't need to keep the data extremely hot for the long-tail users, but it has to be reasonably hot. You can't just leave it on S3 and then fetch it upon request, because it just takes too long, right?

14:53So you want to keep memory around, you want to keep some caches around. And when you do that, it is still too expensive.
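
The scale-to-zero arithmetic can be written down as a small model. This is a simplification under stated assumptions: one tenant per instance, evenly spaced requests, and no startup time. The function name and parameters are made up for the example.

```python
def uptime_fraction(request_interval_min: float, idle_timeout_min: float) -> float:
    """Fraction of time a scale-to-zero instance stays up for a single tenant.

    Assumes evenly spaced requests and ignores startup time. After each request
    the instance stays up for `idle_timeout_min` minutes before shutting down.
    """
    if request_interval_min <= idle_timeout_min:
        # The next request always arrives before the shutdown timer fires.
        return 1.0
    return idle_timeout_min / request_interval_min


# With a 10-minute idle timeout, compare tenants with different request rates.
for interval in (1, 10, 60, 24 * 60):
    share = uptime_fraction(interval, idle_timeout_min=10)
    print(f"one request every {interval:>4} min -> instance up {share:.1%} of the time")
```

Under this model the tenant that queries every 10 minutes never lets the instance shut down, while a once-a-day tenant keeps it up for well under 1% of the day. That is why scale-to-zero helps the far end of the long tail but not the users in the middle.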

15:02So how can we solve this problem? SQLite and things like SQLite are a great solution, because now you have a database in a file. For the ones who don't know, we have SQLite, and at Turso we maintain a fork of SQLite called libSQL, where we add a lot of quality-of-life improvements on top. So

15:23I'm here talking interchangeably about them. People have been doing multi-tenant architectures with SQLite for a long time. In fact, when we announced our product, we did not do multi-tenancy.

15:35We were using SQLite in a different direction, but users would come to us and ask, "Why can't I create 10,000 databases and have per-user databases? I've been experimenting with doing this with SQLite. It's just a file. Give me a file," right? And then we moved in that direction. Why is SQLite a good foundation for

15:57this problem? SQLite is known to be a good database for, quote unquote, small data. I mean, it runs on your toaster. That's the example. I actually believe it might run on some toasters. Now we're in the AI phase, but if you remember the IoT phase, where every single day you had a new smart device

16:17for things that don't necessarily need to be smart, I'm sure we had many toasters. But aside from

16:25just the fact that it's very optimized for those architectures, it has the seeds of what a multi-tenant system needs.

16:31You can, for example, use pragmas and other internal APIs to control, to the byte, the amount of memory that a database will use.
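
One pragma of this kind that SQLite itself documents is `cache_size`, shown below through Python's standard `sqlite3` module. It is only one of several knobs and is sized in KiB rather than bytes; the helper name and the 512 KiB budget are arbitrary choices for the example, and the talk does not say which settings Turso actually uses.

```python
import sqlite3


def open_with_cache_budget(path: str, cache_kib: int) -> sqlite3.Connection:
    """Open a database whose page cache is capped at roughly `cache_kib` KiB."""
    conn = sqlite3.connect(path)
    # A negative cache_size is interpreted as a size in KiB instead of a page count.
    conn.execute(f"PRAGMA cache_size = {-int(cache_kib)}")
    return conn


conn = open_with_cache_budget(":memory:", 512)
cache_setting = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache_setting)  # negative value: the budget in KiB
```

With one connection per tenant, a per-connection budget like this gives a predictable upper bound on how much page cache each tenant's database can hold.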

16:41You can also somewhat limit the amount of CPU that the database will consume at once. SQLite is unique in the sense that it is a database that compiles SQL to bytecode, and we can count how many instructions of this bytecode you can run before yielding. So you can have an event loop, a very efficient event loop,

16:59that moves from database to database. It's not time based. It's not like you can control it to a couple of microseconds, but with libSQL we're actually moving in that direction. But you can control it well enough to say, look, just get me a multi-tenant system out of this.
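
Stock SQLite exposes instruction counting through its progress handler, which Python's standard `sqlite3` module wraps as `Connection.set_progress_handler`. The sketch below uses it to cut off a statement after an approximate instruction budget. Note the difference from what the talk describes: this callback can only continue or abort a statement, it does not yield to another database, and the `InstructionBudget` class is a hypothetical helper, not part of libSQL or Turso.

```python
import sqlite3


class InstructionBudget:
    """Abort a statement after roughly `max_instructions` bytecode instructions."""

    def __init__(self, conn: sqlite3.Connection, max_instructions: int, tick: int = 1000):
        self.remaining_ticks = max_instructions // tick
        # SQLite invokes the handler once every `tick` virtual-machine instructions.
        conn.set_progress_handler(self._on_tick, tick)

    def _on_tick(self) -> int:
        self.remaining_ticks -= 1
        # Returning a non-zero value tells SQLite to interrupt the running statement.
        return 1 if self.remaining_ticks < 0 else 0


conn = sqlite3.connect(":memory:")
budget = InstructionBudget(conn, max_instructions=100_000)

# A recursive query with no termination condition: it would never finish on its own.
runaway = """
WITH RECURSIVE n(i) AS (SELECT 1 UNION ALL SELECT i + 1 FROM n)
SELECT COUNT(*) FROM n
"""

try:
    conn.execute(runaway).fetchall()
    interrupted = False
except sqlite3.OperationalError:
    interrupted = True

print("interrupted:", interrupted)
```

Because the limit is counted in bytecode instructions rather than wall-clock time, it bounds work per statement deterministically, which matches the "not time based" point above.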

17:15I also love to remind people what small actually means, because again, we are at the Small Data conference, but SQLite can actually handle 200 terabytes of data.

17:23They don't have machines this big, but the software stack can. It's not great for writes, but not great still means something like 2,000 writes a second. So small in 2024, I think, is one of the things that allow the small data movement to happen. It's not, you know, this-clicker small.

17:42It's pretty large. And when you put together a lot of those small units, you have something like Turso. Now, you don't have to do any multi-tenant architectures with Turso, right? We are multi-tenant, so you can show up and just get a very cheap database. That's fine. A lot of our customers do that,

18:02but you have a very powerful architecture on your hands, in which you can have your own multi-tenant architecture. Sometimes I like to think of our software stack not so much as a database but more as a web server. Just to illuminate how that happens: essentially, every request has a route

18:23that is backed by a SQLite file. That's it, right? So the database itself is SQLite, but you send SQL requests to those routes, and the request just goes to the right database. Nothing is consumed except for the storage that it uses.
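
The "web server where every route is a SQLite file" picture can be sketched in a few lines with Python's standard `sqlite3` module. Everything here is an illustrative assumption: the `handle_request` function, the route format, and the on-disk layout are invented, there is no HTTP layer, and this is not how Turso's server is implemented.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical storage root: each route maps to one SQLite file in this directory.
DB_ROOT = Path(tempfile.mkdtemp(prefix="routes-"))


def handle_request(route: str, sql: str, params=()):
    """Serve one request: open the file behind the route, run the SQL, close it."""
    name = route.strip("/")
    if not name.isalnum():
        raise ValueError(f"invalid route: {route!r}")
    conn = sqlite3.connect(DB_ROOT / f"{name}.db")
    try:
        with conn:  # commit on success, roll back on error
            return conn.execute(sql, params).fetchall()
    finally:
        # Between requests a tenant holds no connection, memory, or CPU.
        conn.close()


# Usage: two tenants, two routes, two independent files.
handle_request("/alice", "CREATE TABLE IF NOT EXISTS cart (item TEXT)")
handle_request("/alice", "INSERT INTO cart (item) VALUES (?)", ("book",))
handle_request("/bob", "CREATE TABLE IF NOT EXISTS cart (item TEXT)")

alice_cart = handle_request("/alice", "SELECT item FROM cart")
bob_cart = handle_request("/bob", "SELECT item FROM cart")
```

In this sketch an inactive tenant is nothing more than a file on disk, so its only ongoing cost is storage.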

18:41And the long tail is usually databases in the megabytes, so it's fine. If you don't count the amount of storage that is used, the cost of your inactive database is not close to zero. It is zero,

18:59because again, you get the request over HTTP, you route to the right database, you serve it, it goes away, nothing. So it's really good for all those architectures in which you want to achieve multi-tenancy. If you're interested in starting with Turso, you can start today.

19:15You actually get a month free of any of our scalar and app plans at tours/small.

19:23We have a booth outside. Thank you so much and have a great conference.

19:29[Music]
