2024 Small Data SF

Retooling For A Smaller Data Era

In this talk, I will offer my perspective on the landscape of modern data tools, particularly user-facing tools for interactive data science and data exploration.
The latest trends in composable data systems and embeddable query engines like DuckDB and DataFusion create both challenges and opportunities for building a more coherent and productive stack of tools for end-user data scientists and for developers building data systems. You'll come away with tactical tips and lessons from my own experience to carry forward in your work!
Speaker
Wes McKinney

Principal Architect & Co-Founder

Posit PBC

Wes McKinney is an entrepreneur and open-source software developer who focuses on analytical computing. He is currently a Principal Architect at Posit. Previously, Wes co-founded Voltron Data, where he now serves on the advisory board. Wes is also the creator or co-creator of the pandas, Apache Arrow, and Ibis projects, a member of the Apache Software Foundation (ASF), and the author of three editions of Python for Data Analysis.

0:00 [Music]

0:16 So this talk is about a bunch of things, but mostly it focuses on a variety of open-source efforts that we've been engaged in over the last decade to basically overhaul the computing stack for working with data: to make big data feel smaller, to make it easier to work locally, and to take

0:38 local workflows and more gracefully scale them to run on remote clusters, and to make that whole pipeline a great deal more efficient and ergonomic for end users. Most of you probably know me from the Python pandas project, which is now a 16-year-old project that I started right when I got out of school. I

1:02 wrote the book Python for Data Analysis in 2012; it's now in its third edition and still very popular. A lot of people apparently still buy it to learn how to use pandas and to get started using Python for data analysis. I co-created the Apache Arrow project, which I will talk about some in this

1:21 talk, as well as the Ibis project for Python, which is an almost 10-year-old but still growing project that is part of the story of how we can facilitate local development that scales. I work for Posit, formerly known as RStudio, where I'm a software architect, but I also co-founded Voltron Data, which

1:42 is a startup working on hardware-accelerated analytics for this ecosystem, and I'm a small-scale venture investor through my fund, Composed Ventures, as a side project. A bunch of people asked me when it launched earlier this year, "Wes, are you a VC now?" Well, only kind of; I'm only doing it

2:03 part-time and have no plans to be a full-time investor. A number of people have asked me what I've been doing lately, especially at Posit: I've been working as part of the Positron team. We're building a new, beautiful IDE for data science built on top of the VS Code platform. It's in

2:24 public beta, a soft launch, so we're not doing a lot of marketing, but if you know, you know: it's on GitHub, you can download it, you can use it, and it's a great time to give feedback about how we can build and develop these tools to facilitate working

2:43 especially with small data that works at laptop scale. But as we all know, data size is relative, and what was once big data is no longer big. The amount of data that can fit on your mobile phone or your laptop is orders of magnitude larger, and what you can process

3:07 is orders of magnitude larger than it once was. I know Jordan spoke about the original MapReduce paper from Google in 2004. I don't know if anyone knows, but do you know how many cores were common on server processors in 2004?

3:29 The answer is one. In 2004, the top-of-the-line Xeon

3:36 server processor from Intel had one physical processor core, which goes to show how much hardware has changed in the last 20 years. Today, in 2024, you can buy a top-of-the-line EPYC processor from AMD with 96 physical cores, which is 192 concurrent threads

3:59 with hyperthreading. You can get a server with two of these installed, so that's 384 concurrent threads of processing, which is more than two orders of magnitude more processing power than what we had in 2004.

4:18 But at the same time, the clock speed on the processors in that era was about the same, around 3.6 to 3.7 GHz. So clock

4:27 speeds haven't gone up in the last 20 years, but we've seen this exponential rise in core counts and parallel processing capability for the same level of power consumption. We've seen similar exponential increases in disk performance, in throughput as well as latency, in the transition from spinning hard drives to the first-generation solid-state drives to the latest

4:54 non-volatile memory. The same thing has happened in networking: back in 2004 we were working with one-gigabit networks, so drinking a milkshake through a very small straw. We've seen exponential increases in networking performance as well. If you look at NVIDIA's

5:18 keynotes, for example: many people think of NVIDIA as a graphics card company, but actually they are in the business of building these very complex

5:29 accelerated systems that provide high performance not just by having a huge number of processor cores on a GPU, but also through all of this high-bandwidth memory, networking, and interconnectivity within a rack of machines. So performance comes not just from the processors but

5:51 also through the synergistic relationship between storage, high-bandwidth memory, and networking. And of course, the most intense advances in parallel processing capability have been happening in graphics cards and in application-specific integrated circuits (ASICs). The latest NVIDIA Blackwell processor is theorized (they haven't

6:16 actually announced it) to have almost 25,000 parallel cores on a single die, which is pretty impressive. Now, this all sounds really wonderful, but we have this big problem, which is that over the last 20 years, two parallel worlds have developed with entirely different ergonomics for working with data. On one

6:43 parallel track were folks who are concerned with scalability and how we can have the capability to process data at scale. I came from a totally different side of the aisle, working on small data tools: data that fits into memory, you have a CSV file, you have an Excel file, the data is

7:05 not big, but there are all these diverse things you need to do with it. So we've been very concerned with the ergonomics of facilitating these complex and intricate manipulations. The way I describe it to people is the kind of work you'd imagine a watchmaker doing on a pocket watch,

7:24 this very nitty-gritty, intricate work, whereas the big data world is a little bit more like blowing things up with dynamite, or with a bazooka. So a very different type of work. And as Jordan pointed out, and I love this paper from

7:41 2015 by ex-Microsoft Research folks, the first generation of big data systems achieved impressive scalability, but they also introduced a lot of overhead. So they scale, yes, but they're very bloated. It's a bit frustrating, because if you look at the code that we are writing and how we are

8:05 working, whether it's in SQL or in a programming language like Python or R or Java or Rust, we're doing many of the same things. We might write this SQL query, or we might write this pandas code, or we might write this R code, and so you say to yourself, well, we're kind

8:23 of all doing the same thing, so why is there this big dichotomy in terms of system efficiency, scalability, performance, and ergonomics? There's kind of a hierarchy of needs at work: if your primary concern is scalability, like you have this mountain of data and you need to be able to

8:43 process it, then that is your primary concern, and only after that can you start to think about, well, how do we make this fast, and then how do we make it efficient? And once you have a system that is scalable and performant and efficient, then, and this is something we've only started

9:01 to think about recently, how do we fit these systems together in a way where they can play nice with each other, they can exchange data, and you can build heterogeneous end-to-end pipelines that do raw data processing, feature engineering, model inference, model

9:21 serving, so that we can create end-to-end data preparation and machine learning pipelines that are performing very different tasks? Recently that's led us to this idea of a composable data system, where we look at the different contact points within the layers of a full-stack data system and ask: how can we introduce

9:45 open standards and open protocols to facilitate data interchange, reuse of components, and modularity, where we can swap out and bring in new components to make things incrementally faster over time, the way you might swap out a part like a hard drive or memory or a graphics card in a PC? Can we achieve that level of

10:08 modularity in data systems? In doing this we can also greatly reduce the overhead involved with building distributed systems, which can make our scale-out story a lot better in terms of efficiency. Our goal is to create a world where we can resist vertical integration and build a virtuous cycle where we can

10:31 hyper-specialize at the different layers of the stack, similar to what's happened in the semiconductor industry, and get people really focused on building high-quality, reusable components that make our data processing pipelines easy to use and more efficient. There's a variety of open-source projects that have popped up here,

10:56 including a protocol I'll talk a little bit about, and a lot of the rest of this talk is going to be about computing engines, which is partly why we're all here: we have a lot of love for what's happening with DuckDB and the projects around DuckDB.

11:15 I started thinking about this almost 10 years ago. This is a slide from a talk that I gave in 2015, and the idea is that we want there to be this decoupling: to have people focusing on building really high-quality user interfaces, ergonomic APIs, and tools to

11:35 build data pipelines, and allow the folks who are really good at building execution engines and storage engines to just focus on that problem. Say, hey, let's create a standardized API, a way that you can give me data and I can give you data, and make things as efficient as possible. And the folks who are really good at API design

11:53 and thinking about the ergonomics of writing code and building these data pipelines, which is what I specialize in, can focus on that and not think so much about how to make it fast and efficient. In the data science world, things have been very fragmented historically, and so after a couple of years

12:11 of thinking about this: why shouldn't we have reusable libraries and systems that we could use across programming languages? Why should it be that the Python world has a whole silo of tools that they build from the ground up and that aren't reusable in the R world or in

12:29 the Java world? This feels like something we've been working toward, but once upon a time this was pie-in-the-sky thinking, and it's taken a lot of work to get where we are today. One of the big tools that has helped is this

12:46 project Apache Arrow, which has been around since 2016. We realized

12:52 right in this mid-2010s moment that we needed an interoperable, columnar table format that could be used portably across different processing engines and programming languages, and this has ended up being the glue that ties together much of this new ecosystem.
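
For illustration, here is a minimal sketch of what that glue looks like from Python, assuming pyarrow, pandas, and Polars are installed; the table and column names are made up:

```python
import pyarrow as pa
import polars as pl

# One Arrow table held in memory...
events = pa.table({"user_id": [1, 2, 1], "value": [10.0, 3.5, 7.2]})

# ...can be handed to different libraries without re-serializing it
# into each tool's private format.
df_pandas = events.to_pandas()      # pandas DataFrame
df_polars = pl.from_arrow(events)   # Polars DataFrame, zero-copy where possible
```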

13:14 DuckDB also came around in this era. There was a recognition that we needed a cutting-edge columnar execution engine, a full-stack database system of course, but an engine that could deliver state-of-the-art performance and that could be used absolutely everywhere. What's been really amazing is to see how the combination of a cutting-edge embeddable database engine along with an interoperable memory format has enabled

13:40 this new ecosystem, where you can compile DuckDB to WebAssembly and run it in the mobile browser. If another part of your system knows about Arrow, you can feed large binary blobs of Arrow data to DuckDB via the Wasm interface, and there's no need to serialize to JSON or to convert to some intermediate memory

13:57 format, because DuckDB speaks Arrow natively. So you get a system that is orders of magnitude more efficient than what you might have built a decade ago.
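
A hedged sketch of that Arrow-native path from Python (rather than Wasm); the table here is made up, but the pattern of DuckDB scanning an Arrow table in place is the point:

```python
import duckdb
import pyarrow as pa

orders = pa.table({"status": ["shipped", "open", "shipped"],
                   "amount": [12.5, 8.0, 30.0]})

# DuckDB's Python API can scan the Arrow table directly by name;
# nothing is serialized to JSON or copied into an intermediate row format.
result = duckdb.sql("SELECT status, SUM(amount) AS total FROM orders GROUP BY status")
print(result.arrow())  # results can come back as Arrow as well
```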

14:10 I'm very much a DuckDB acolyte, and when I learned about the project in late 2019 or early 2020, I thought, I need to be a part of this. I was lucky that my company was able to become one of the first members of the DuckDB Foundation and to fund and support

14:27 the work that DuckDB Labs is doing, and I've done everything I can in recent years to help support and grow the ecosystem around DuckDB. One really cool thing that's happening now, with everyone piling on and building these modular, embeddable execution engines, is

14:48 that we have a few of them. We have DuckDB, we also have DataFusion, which is written in Rust, there's Polars, and the folks at Meta are building a system called Velox, which they initially started using to accelerate Presto. All of this is open source, and what's being done

15:08 is that these accelerated execution engines are being used to provide

15:15 faster, accelerated versions of the APIs that people are already using. A lot of people have Spark. Databricks has their own proprietary accelerator for Spark called Photon, and there are at least

15:29 two open-source efforts to accelerate Spark, plus a number of proprietary efforts that I'm aware of. Apple has hired a

15:39 big chunk of the DataFusion team and is working on a project called Comet to accelerate Spark, and Intel has been working with Meta on the Gluten project to accelerate Spark with Velox. So

15:51 when you have the biggest companies in the world piling into this ecosystem, I think that's a very good sign that this is where things are headed. But one quibble that I have is this problem of APIs and API sprawl. There's all this work going

16:13 on to accelerate Spark, but why do we need Spark? Why should we be stuck with Spark? A lot of it is that people have written a lot of Spark code, so they want their code and the workflows they've developed to be portable and reusable without as much

16:29 disruption to their existing production pipelines. But longer term, we would like not to be locked into one full-stack system, and we'd like the freedom to choose the configuration of components that yields the best cost efficiency or performance for a given workflow, and to be able to

16:54 more gracefully scale up and scale down and select the cost function that makes sense for our workload. Now, in theory, SQL was supposed to be the thing that made this possible: we write SQL queries, and then we have lots of different SQL databases, which can be scale-out big data systems or small data

17:16 systems that run on a single node. But SQL dialects in practice are not portable. There are some query transpilation systems that help with this, but it's also non-trivial to decide in the moment which engine to use, which engine will deliver the best performance or the best efficiency.
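
One example of such a transpilation tool is sqlglot; a small, illustrative sketch of dialect translation (the query itself is made up):

```python
import sqlglot

query = "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id LIMIT 10"

# Rewrite the same statement from one SQL dialect to another.
print(sqlglot.transpile(query, read="duckdb", write="spark")[0])
```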

17:39 We have a similar conundrum happening in the Python ecosystem right now, what I would describe as dataframe API sprawl, where on one hand we have folks who want to emulate the pandas API for

17:53 engine portability, to be able to essentially do pandas for big data. Now of course, pandas was designed for data that has been loaded into memory inside the Python process, so this is a very difficult problem. There's been a lot of work here: Snowflake hired the Modin team, which is working very actively on this.
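
As a sketch of what that drop-in approach looks like (the file and column names are hypothetical), Modin keeps the pandas API but runs it on a parallel execution engine:

```python
# Instead of `import pandas as pd`:
import modin.pandas as pd

df = pd.read_csv("events.csv")                    # hypothetical file
summary = df.groupby("user_id")["value"].sum()    # same pandas API, parallel engine
```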

18:15 There's also a new project called Narwhals, which is a query portability layer using the Polars API. And I started a project almost 10 years ago, Ibis, which is a portable dataframe API solving this problem. We have a team at Voltron Data that works on Ibis, and we've been working very actively to productionize and expand this project

18:37 to have a high degree of portability across different execution backends, to try to bring together the best of modern SQL with a fluent, Pythonic dataframe API. If this is a problem that interests you, I would encourage you to take a look. You get this fluent dataframe API, you can write Python functions and build complex queries

18:59 as though you were writing in a real programming language, since that can be a bit hard to do in SQL. Our goal is really to facilitate local development: develop locally with DuckDB, build your ETL pipelines, and if you need to deploy them someplace else, use a different execution engine or a different cloud infrastructure environment; we want to make that easy for you to do.
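
A hedged sketch of that workflow with Ibis (the file and column names are made up): the same expression is written once against a local DuckDB backend and could later be pointed at a different backend.

```python
import ibis

# Develop locally against DuckDB, Ibis's in-process default backend.
con = ibis.duckdb.connect()
orders = con.read_parquet("orders.parquet")   # hypothetical file

expr = (
    orders.filter(orders.status == "shipped")
    .group_by("customer_id")
    .aggregate(total=orders.amount.sum())
    .order_by(ibis.desc("total"))
)
print(expr.to_pandas().head())

# To deploy elsewhere, connect to another backend (e.g. a cloud warehouse)
# and run the same expression there instead of rewriting the pipeline.
```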

19:16 But our emphasis is really on facilitating, and designing around, that local development experience. I'm hopeful that we can keep working toward this, and certainly today, versus a decade ago, we have a lot of new tools that help. I do believe that the future will be a

19:38 multi-engine data stack where, based on the data scale, we will choose different tools and ways to execute, but hopefully our APIs and our workflows will become more and more common, so that we can work locally and deploy anywhere. I appreciate your attention, and I look forward to the rest of the

19:58 conference. Thanks for having me.

20:03 [Music]
