Small Data SF 2024

Paddling In Circles: The Return Of Edge Computing

The Small Data movement is being driven in part by long-running cyclical trends in edge computing. Richard's talk explores this history and connects it to his work at Tableau and DuckDB Labs.
Speaker
Richard Wesley

Group-By Therapist

DuckDB Labs

Richard is an IBM brat who started programming in the mid-1970s. After a brief flirtation with pure math, he moved to Seattle in 1989, where he worked in the software industry, specializing in digital signal processing. In 2004, he joined visualization pioneer Tableau, where he worked on connectivity, query compilation, and performance, including being the lead developer for the Tableau Data Engine. He tried to retire in 2019 but spent so much time hacking on DuckDB that Hannes tracked him down and invited him to join the DuckDB Labs team, where he focuses on temporal data processing.

0:16 Welcome. I'm going to be talking about what I call edge computing, which is different from edge servers, as someone pointed out to me earlier.

0:25 A little background on me. Before I was a duck (an old duck), I was a math graduate student; I dropped out of that and went into software engineering. And most of my experience before getting into databases was in

0:42 digital signal processing, which turned out to be a great transition to column stores, because it's all big arrays of numbers.

0:49 I switched over to Tableau very early on and was one of the people building their first data engine, and I'm going to be talking a lot about that in a moment. Then I left just before the pandemic and spent a lot of time at home hacking on DuckDB, till Hannes tracked me down and asked me to join the group.

1:08 What I do there is mostly what I would call temporal analytics: anything involving time, windowing, and so forth. So I'd like to take a

1:20 step back through time and talk about what I call the tides of computing. In the original days of computing, we all had computers. Well, we didn't all have computers, but the computers were all single nodes, and you had to go up to them physically because there was no such thing as a network. These machines were expensive, so they were a

1:42 shared resource, and you had to do things like bring in your deck of punch cards, have someone else schedule when it would run, and come back the next day to find all of the typos in your punch cards and have to punch them again. But eventually you got something done.

2:00 But in the early 1960s, it became clear that this was something that everyone was going to need to be able to do in the future. So some universities started setting up systems where everybody could use the mainframe that they had, which was a very expensive piece of equipment.

2:21 And so, at my alma mater, Dartmouth, they set up this machine and then put 20 teletypes around the campus where anyone could come in and use the computer. You no longer had to have the priesthood that was taking care of the machine to schedule things; the time-sharing system that they invented would do this for

2:41 you. And I want to mention that part of their motivation was education.

2:47 They didn't just want mathematicians and engineers to be using these machines. They wanted everyone to be able to use them: statisticians, social scientists, and so forth. And this was such a great idea that soon everybody was doing it. There's a picture of an IBM 360 time-sharing system there. And the great thing about this was you no

3:08 longer had to go in to the computer, but the software was still kind of locked down by this priesthood. And this offended some free thinkers out in a weird place called California.

3:19 A group of them got together and formed something called the Homebrew Computer Club, whose motto was computers for everyone. And this led to the creation of desktop hardware.

3:33 Some of the more famous people in the Homebrew Computer Club were Steve Wozniak and Steve Jobs, whose names may be familiar to you. And they came up with the original Apple computers.

3:45 IBM then came up with the PC. And this led to the era of shrink-wrap software.

3:51 So finally, not only could everyone have a computer, but they could control what was sitting on that computer.

4:00 But people still had to share some data.

4:04 And the typical way this was done back in those days was you had another desktop computer sitting there, called a file server. You put a big disk on it and everybody threw all of their shared data onto it. So that was great for stuff your organization was using. But if you wanted to use some shared resource,

4:22 like government data on weather or something like that, you had to use these horrible protocols with names like Gopher to go find these things. And this bothered a guy at CERN named Tim Berners-Lee, who came up with a more

4:39 user-friendly way of doing this involving hypertext. And the idea then was that you would build your own homepage, which had URLs to the things you wanted to look at. And then people built these things called web browsers, which would allow you to go and look at these pages.

4:58 And this led to a whole new way of building customer-facing interfaces for businesses and other organizations. You could now set up something where a user would go visit your site, and people realized that they didn't have to serve up static pages anymore; they could actually generate these pages on the fly. And so you wound up with all of

5:23 these systems for building these types of web pages. One of the most popular was something called the LAMP architecture, where you had Linux as the operating system, Apache as the web server, MySQL as the database, and then this thing called PHP, which I guess is still around, to do the coding

5:46 and glue it all together. And there were a couple of other systems from other major vendors.

5:51 So people started building these things, and they got more and more elaborate as e-commerce took off. And I'm now going to completely bowdlerize what Jordan was saying this morning. Basically, these companies got better and better at building these systems and managing all the hardware back there, scaling it, and so forth. And they realized that they could start

6:15 selling this as a service. This was great for them because it made them lots of money, but it was great for everybody else because now you didn't have to worry about scaling, you didn't have to worry about having in-house expertise to operate these machines, and you could outsource all of your operational expenses. And for users, it was also

6:37 great because they could now have a shared hardware system, but one where they could control what software was on it.

6:47 And then, as these machines continued to get more powerful, it turned out that our ability to do computation at the edge scaled up again. And so now we once again have all of this compute power sitting in front of us on laptops. But it's not just laptops; there are a lot of other systems out there that have chips built into them that can

7:13 do this kind of processing. Last year at DaMoN, which is a hardware database workshop, I saw a system where they had built a database where you could actually push some of the processing down onto the memory chips, which had small processors on them. And the most fun example of what

7:35 people are now doing with all of this compute power at the edge: there are now two different brands of smart electrical meters out there that have their own app store.

7:47 So that was a very high-level view of what happened with processing power and where compute happens. I thought I'd now move on to my own experience with this, which I call the Tableau extract saga.

8:03 When I arrived at Tableau in 2004, they had this great new desktop product, something I didn't think I would ever see again.

8:11 But the problem was that, because they were an early-stage startup, they had only come up with connectivity to basically one database system, all Microsoft products, which is fine, but we needed to expand this out. And so my first project was: the most popular database in the world is MySQL, why don't we talk

8:33 to that? So I redid the whole connectivity layer. And this led to an explosion of connectors; I think this may have been the most valuable thing I did there. But the problem was that all of these databases were slow, and we wanted to do interactive analytics, which requires response times of around 200 milliseconds before people start getting bored and going off to

8:55 find coffee. Most of the databases were too slow, and this wasn't always a problem with the database itself; often it was just that the systems were very overloaded by everyone using them. So we tried a long series of solutions to this problem. The first one: with a lot of these databases, the data the user actually wants is often rolled up,

9:16 filtered, and aggregated, so why don't we try to stick that data into a temporary table? This sort of worked, but we still had a lot of problems. There were a lot of round-trip latencies, there were a lot of issues like whether people had permissions to create these tables on a shared server,

9:35 and it took time to build. So we started thinking about whether we could bring it to the user's desktop machine. And the

9:47 original Tableau was a Windows-only product. Everybody had Office, and so they had this thing called Microsoft Access, and it used a version of SQL that we already supported. So we said, "Okay, let's start pulling this data down." One of the big advantages we found was that all of a sudden people could take their data with them

10:08 on an airplane, because this was long before airplanes had Wi-Fi. And the thing was usually already installed, so we didn't have to do anything special. But the problem was that Access was not much of a database. It had limited SQL; some of its join syntax, which we used a lot for filtering, was especially bad. It was very slow to

10:31 insert into, and occasionally we'd run into people who actually didn't have Office, and that was a bit of a problem. So we started looking around to see if we could find another database we could use. The other problem with Access was that it was Windows-only, and we were starting to think about moving out to Mac and Linux. And we found

10:49 this thing called Firebird, which is an open-source SQL engine. And it was tiny, which was important 10 or 15 years ago, because everything was being downloaded from the internet. It had a rich type system. It had an active developer community, so there were people actually working on it. And because it was open source, if we found

11:09 something that they didn't want to fix, we could fix it. But it had other problems. It was basically based on System R, and that meant using algorithms from the 1970s, which were all sorting-based. Those don't scale linearly with the data; they're super-linear. And so when you do something like this, people start

11:31 throwing more and more data at it, and it kept getting proportionately slower and slower, and people started complaining about that. So we said, okay, let's go look and see if there's anything newer we could use. And that's when we happened to catch the wave of the column store revolution in the database community. The idea is you

11:52 store data in columns, not rows, and there are a lot of advantages to this. IBM had actually tried this in the 1970s, but couldn't get it to work because their storage system was rotating drums.

12:04 But the whole idea was resurrected by a group at CWI, the Dutch research institute that gave you Python and DuckDB.

12:13 And they got this to work with a tool called MonetDB. There was a lot of progress and excitement in the 2000s around column stores. And we looked at this and said, "Hey, we can build this." I did some experiments and found that it was 200 times faster than what we were originally using. So we went right at

12:34 it and built this thing, even though none of us knew what we were doing, including me. And so we had this in-house codebase that only I knew how to fix, and that was not good for a startup that was going places. So they went looking around and found a new approach that

12:56 had been developed at the Technical University of Munich, where you took the queries and compiled them right down to LLVM and then down to assembly.

13:06 And this was a startup; it came with an entire development team, and we just bought the whole team, lock, stock, and barrel. And this was great because I could now be hit by a bus. But there were a few problems with it. There was a certain amount of time it took to integrate the product, because it spoke SQL

13:25 and Tableau's semantics are not quite SQL, and it had a very complex codebase. I briefly tried to work on it and made some progress, but honestly the code was so weird that in order to really be effective at it, you kind of had to join a cult.

13:44 And it also turns out that this approach doesn't necessarily produce the best performance. There was a famous paper written by the two proponents, comparing Tectorwise and Typer, and they found that the two systems had different strengths and weaknesses, so there wasn't really a clear winner. But I'd always felt,

14:07 as a professional software engineer, that maintainability was an important

14:15 attribute of a system. And that brings me to DuckDB. So what does DuckDB offer to computing at the edge, which is where we are today?

14:28 Well, first of all, okay: these are my solar panels, and my solar panels connect to a switch. The switch connects to the grid, a battery, and my house. My house connects to a charging component, and the charger connects to my car. So I can charge my car from my solar panels. And this all works because

14:50 these things all know how to talk to each other. And I think the most valuable thing that a cool piece of software can do is talk to lots of things. This is what I found at Tableau, and this is one of the biggest things I think DuckDB offers. We have a huge number of connectors to

15:08 data frames and other databases like Postgres and SQLite. Just being able to get data in and out is extremely valuable, and there's a whole subculture of people who are just using DuckDB to transform data from one format to another.
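To make that concrete: a minimal sketch of this kind of format-to-format transformation, using the DuckDB Python API (the file names here are placeholders, not from the talk):

    import duckdb

    con = duckdb.connect()  # an in-memory database is enough for a pass-through

    # Read a Parquet file and write it back out as CSV in one statement.
    # 'events.parquet' and 'events.csv' are hypothetical file names.
    con.execute("""
        COPY (SELECT * FROM 'events.parquet')
        TO 'events.csv' (HEADER, DELIMITER ',')
    """)

The same pattern runs in the other direction, and with the relevant extensions loaded, the source or target can be an attached Postgres or SQLite database.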

15:25 So how does it work? Well, there are a number of components. The one that everyone hears about is that it is a vectorized engine. And this is based on the idea that there are different kinds of memory, arranged in what's called a hierarchy.

15:38 You start with registers. You go to data and instruction caches, then main memory, then disk, then all the way out to the network.

15:46 The traditional systems did what was called tuple-at-a-time processing. There you take one row and just go as far as you can with it. This makes very good use of the data cache, because you're only keeping a small amount of data there (though it doesn't actually use all of the data cache), but it completely

16:05 blows out the instruction cache, because you're constantly changing what you're doing. So when MonetDB came along, they had this column-at-a-time system. This was great for the instruction cache, because basically you just sit there with a loop that does the same thing to every value in the column, or the two or three columns, that you're

16:24 working on. But by the time you get to the end of the column and it's time to go do the next thing, all the data has long gone way out into main memory. So the insight of the vectorization approach is that you break the data up into chunks of one or two thousand rows, and

16:43 then you get the benefit of both approaches: all the data you're working with fits in the data caches, and the code you're using stays in the instruction caches.
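As a toy illustration of that chunk-at-a-time idea (this is not DuckDB's engine code; the 2048-row vector size and the filter-and-sum "query" are assumptions made for the sketch):

    import numpy as np

    VECTOR_SIZE = 2048  # a cache-friendly chunk size, in the spirit of DuckDB's vectors

    def chunks(column: np.ndarray):
        # Yield the column one vector at a time, rather than one value
        # (tuple-at-a-time) or the whole column (column-at-a-time).
        for start in range(0, len(column), VECTOR_SIZE):
            yield column[start:start + VECTOR_SIZE]

    def sum_of_positives(column: np.ndarray) -> float:
        total = 0.0
        for chunk in chunks(column):
            # A tight loop over data that fits in the data cache, running
            # the same few instructions, which stay in the instruction cache.
            total += chunk[chunk > 0].sum()
        return total

    print(sum_of_positives(np.random.randn(1_000_000)))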

16:55 Now, these caches aren't just attached to the chip as a whole; the caches are attached to individual cores. So if you can start using all of the cores with parallelism, then you can start leveraging that memory, multiplying it by the number of cores you have, which on my silly little laptop is 18 these

17:19 days. And so we have gone and parallelized all of our operators, and we're pretty good at scaling linearly. That's what this graph here shows. It's one of the TPC-H benchmarks (I'm glad you had that previous talk, so now you know what that is), and as you can see, as you add cores, the

17:41 scaling stays pretty linear. So that's using the CPUs.
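In DuckDB, the degree of parallelism is an ordinary setting; a small sketch (the thread count is an arbitrary example, not a recommendation):

    import duckdb

    con = duckdb.connect()

    # DuckDB uses all available cores by default; the worker pool size can be changed.
    con.execute("SET threads = 8")  # arbitrary example value
    print(con.sql("SELECT current_setting('threads')").fetchone())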

17:49 But memory isn't always enough. Sometimes you have queries that require more than the amount of RAM you have.

17:56 And so one of our recent projects is to make the system spill data to disk in the middle of queries, so that we can use your disk as well as your memory and park the stuff we're not working with right now. We've made a lot of progress on this. This chart over here is again TPC-H, at scale

18:16 factor 300. The point is not to show how fast it is, but that on the x-axis we're cranking the memory the system is allowed to use down from 30 gigabytes to one, and almost all of the queries still complete.

18:31 There are three that don't quite make it, and we're looking at those. Windowing isn't quite done; I'm almost ready to commit that.
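The knobs involved are ordinary settings as well; a minimal sketch of capping memory and choosing where spill files go (the cap and the path are just examples):

    import duckdb

    con = duckdb.connect()

    # With a memory cap set, operators that would exceed it spill
    # intermediate data to the temp directory instead of failing the query.
    con.execute("SET memory_limit = '1GB'")          # example cap
    con.execute("SET temp_directory = '/tmp/duck'")  # example spill location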

18:41 But there's more than hardware involved. We also use the literature. The company was founded by database researchers, and we track that literature; we have an entire page full of all of the papers that we have taken from the community and put into this tool.

18:59 But we're not a purely academic project here. We don't just take everything; we have a very practical focus. And as an example of that, we

19:10 have a very good range join, for when you have intersections between two time intervals. There are a number of algorithms for doing that. Ours isn't optimal, but it works for anything instead of just special cases.
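An interval-intersection join like that comes down to a pair of inequality predicates; a sketch with made-up tables (the schema and names are hypothetical):

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE meetings(id INT, start_ts TIMESTAMP, end_ts TIMESTAMP)")
    con.execute("CREATE TABLE outages(id INT, start_ts TIMESTAMP, end_ts TIMESTAMP)")

    # Two intervals intersect exactly when each starts before the other ends;
    # this is the shape of predicate a range join is designed to handle.
    overlaps = con.sql("""
        SELECT m.id AS meeting, o.id AS outage
        FROM meetings m
        JOIN outages o
          ON m.start_ts < o.end_ts
         AND o.start_ts < m.end_ts
    """)
    print(overlaps)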

19:24 So that's a good example of us being practical. The last thing we use is the community.

19:31 We have an extension library. The product is fully extensible. We have built-in extensions, and a number that we maintain. But now we've also added community extensions, which is where you all come in. And lest you think that your extensions are second-class citizens: we use the same APIs internally as you would use in an extension.
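Trying a community extension takes one statement each to install and load; a minimal sketch (h3, a hierarchical geospatial index, is one example published in the community repository):

    import duckdb

    con = duckdb.connect()

    # Fetch an extension from the DuckDB community repository and load it.
    con.execute("INSTALL h3 FROM community")
    con.execute("LOAD h3")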

20:01 So welcome to the flock. I was going to take questions, but I don't have time, so you can find me later.
