Small Data SF 2024 panel

Data Minimalism: Delivering Business Value For The 99%

Businesses run on data, from Big Tech to consumer, the enterprise, and the public sector. But is data volume really the key to success? In this panel moderated by Ravit Jain, our panelists share insights and stories from the trenches as users, leaders, and data experts building with analytics and AI to deliver meaningful business value.
Speakers
Ravit S. Jain

Founder & Host

The Ravit Show

Ravit is the Founder & Host of The Ravit Show, a platform and community dedicated to the intersection of media and technology. He has interviewed more than 500 data and AI leaders to amplify their thought leadership and uncover emerging trends in the industry. Ravit is known for hosting and moderating live events, and has built a thriving global community for data and AI practitioners committed to sharing knowledge, learning and growing together, and fostering meaningful connections within the tech community.

Jake Thomas

Manager of Data Foundations

Okta

Jake currently manages the Data Foundations team at Okta after transitioning from Principal Engineer on Okta's Defensive Cyber Operations team. He previously led data platform teams at Shopify and CarGurus, has taught various O'Reilly courses, and regularly contributes to data-oriented OSS projects.

Celina Wong

CEO

Data Culture

Celina Wong is CEO at Data Culture, a data consultancy that offers data engineering and data visualization services. Prior to Data Culture, she was a three-time Head of Data/Analytics with two successful startup exits. With 15 years of experience in data and finance across several industries, Celina has been both the data practitioner and the business stakeholder. Understanding both sides has been the key to her success in building truly data-driven organizations.

James Winegar

CEO

CorrDyn

James founded CorrDyn in 2015 with the goal of providing clients the level of service and expertise traditionally provided by only the largest consulting firms. He holds a Master’s Degree in Data Science from the University of California, Berkeley, where he is also on the faculty, teaching Machine Learning at Scale, Fundamentals of Data Engineering, and Machine Learning Operations. As Software Product & Platform Engineering Senior Manager and Infrastructure Principal at Accenture, James led very large infrastructure optimization and integration projects for many of the world's largest companies across cloud, on-premise, security, and data. His extensive expertise spans a variety of industries including finance, telecom, logistics, consumer packaged goods, non-profit/association, healthcare billing/management, and entertainment.

Josh Wills

Technical Staff

DatologyAI

Josh Wills is a software engineer at DatologyAI and the creator of the dbt-duckdb open-source project. He has worked on data pipelines both large and small for more than 20 years at places like Slack, Cloudera, and Google.

0:27Hello everyone. I'm very excited to be chatting today with our amazing panelists up here. You've already heard the introductions. I know we have 30 minutes, and I heard we'll be sweating up here, so we're all set. I'm super excited to be chatting about various things, but let's start with: what the hell is small data, right? We've all been

0:47hearing about it since like 8:00 a.m., 9:00 a.m. I've been interviewing folks all day; I've done almost nine interviews. I've chatted with enterprise leaders, vendors, community folks, all excited about what the hell small data is.

1:02Do you want to pick it up, Josh? >> No, I have no desire to. I want to disagree with whatever the first person says, basically, as much as possible, to make it a good, interesting panel. That's kind of my aim. So ask someone else, and then I'll disagree with them.

1:15>> So I think I'll go with Jake. >> Oh my god. >> Thank you. Josh disagrees with everything I say anyway, so this is perfect. Oh man, I could go a lot of different directions with this, but what is small data? I'm trying not to regurgitate everything everyone else said, but,

1:37for me it is two things. One,

1:42workloads that would be very large if you stuffed them all together, broken out into more digestible chunks. So ingest, transformation, etc.,

1:54etc., is one obvious one. But also serving, you know, user-facing applications that have to be sharded by users, or by cells in deployments, or by different companies, for compliance reasons. So that is kind of what I've centered on.

2:12>> So that's completely wrong. Anyway, what small data actually is for me, having been through a few different eras of this stuff: in the big data era, we didn't care about individual computers. An individual machine didn't matter, because you had several thousand of them doing whatever it is you were doing. So no one

2:31cared about any individual machine. And for me, small data is a reawakening, in a kind of great-awakening sense, of the fact that our laptops are really, really good. They're really powerful. We can use them for stuff. And I was telling the

2:50panelists, I feel a little bad being here, because I'm literally running a 60-terabyte Spark pipeline right now. I'm like a token representative from big data here at the small data conference. But the unifying thing for me, doing big data and doing AI stuff, is that we once again care about the power of an individual

3:08machine. When you only have 256 H100s, every one of them is precious, right? They all matter.

3:16They're all part of the job. They're all part of the workflow. And that, for me, is the key differentiator. We care about individual machines. We are excited about their potential, and we are writing software in a way that optimizes the potential of a single machine. We're not just focused on lots and lots

3:30of dumb individual machines anymore. That's what it is for me. >> Celina, what do you think about it? I know you also work with a lot of data teams, so I'm curious to hear your thoughts when it comes to small data. I know you've worked for big data companies out there.

3:46So I'm curious to learn more about it. >> Yeah, I mean, I've worked for big enterprise companies like American Express, whose P&L is, like, trillions of dollars, right? But setting aside tooling and technology, which we've been talking a lot about: when I think about small versus big data, what it comes down to is,

4:06even if you're working in chunks of data, as we're talking about, it's really about what data the business finds most valuable, right? Small data, to me, equates to that hot data. One of the earlier speakers spoke about hot versus cold data, and I think small data is really the hot data. I mean, I know we

4:25heard the line that data scientists are sexy and whatnot, but it's really about the layer of data that you're actually using and working with to drive business value, because if you're not driving business value, I would worry about your job security, right? So when I think about small data, to me it equates to hot data: the data

4:44that's actually driving business decisions, not what's sitting in storage. >> Right. Jake? >> Yeah.

4:51>> So, yeah, James. >> Yeah, >> there are too many J names. We got Jake, we got James, we got Josh.

4:57So, >> yeah. I mean, you know, we can quote Jordan from earlier today, right?

5:02Or, you know, >> do I have to do a shuffle? You know, there's the pets versus cattle argument.

5:09>> Pets versus cattle. >> Yeah. So, you know, I agree with Celina on, hey, you've got to actually do something useful with it, otherwise you're wasting your time. So, you know, we're all speaking the same language.

5:22>> Yeah. No, I think that makes a lot of sense in terms of what you're all talking about. I'm also curious to learn a little about, you know, when we talk about small data or big data, it's also about right-sizing data infrastructure investments, right? That kind of goes into the

5:37business. So how can businesses identify the right amount of data they need to make impactful decisions without falling into the trap of overengineering, which is one big problem at a lot of data enterprises out there? Curious to learn your thoughts. Celina?

5:55>> Yeah. I mean, I guess if I go first, then you're going to go and disagree, right, Josh?

5:59>> Oh, yeah. I'm going to totally disagree with whatever you say. >> Great. Because get ready for this one.

6:03>> Okay, good. Give me a hard one. >> I think that right-sizing, in my mind, is about strategic planning. And I know that people hate the word strategic, but I really do think, you know, having been head of data three times, and having worked in a big enterprise setting as well as

6:22at startups, and now serving clients across the spectrum, it's really about aligning on what you are trying to answer, right? It's cool and all that you get to go and play with these tools and you have a lot of data to work with, but what's really important is: at your organizations and the companies you're at today, do you

6:39know what's actually driving the business? Do you know how you're actually making money, or what the top costs for your business are, and how that ladders up to the data you're working with? I think that's how you right-size your data infrastructure investment, because you have to think about your impact and how it's tied back to your business's P&L,

6:59and how your company makes money and also loses money, right? The cost side of things. And then go into the lens of: what should I be investing in? How should I think about my data infrastructure?

7:10What's actually important here? Because no one's going to give you a gold star for developing something on the side that doesn't really matter.

7:21>> It's a challenge, but I think I can take this one. >> So, I work at an AI startup, so the laws of economics do not apply to me at all, in any way, shape, or form. Making money, the impact of things: ridiculous stuff to care about. Totally doesn't matter.

7:35>> Um, the trap of overengineering, that's actually a fun one. I think there's something very American about the big data effort. Everything Americans do is like: we're going to be the biggest thing in the world. This is going to be the to-do app to dominate all other to-do apps. This is what we do

7:55here. This is what we're about. We are world-changing, always. We are ambitious, always, in all the things we do. And that is sort of the core of the big data ethos. It's like that meme from Arrested Development, where Lindsay and Tobias are sitting on the bed, and Lindsay's like, "But did those people

8:10need Snowflake?" And Tobias is like, "No, they didn't need it at all. These people somehow delude themselves into thinking they might. But we might need Snowflake."

8:20I thought that would have been funnier. No? Not really? Okay, that's fine. That's all right. It's sort of the potential, the what-if, right? For me, engineering systems, like, I used to work at a climate tech company doing managed charging for EVs, right? And when you're engineering a system, you look at

8:38your current load and then you plan, basically, for 10x whatever your load is right now. That's it: you plan for 10x. What if we had 10 times as much data, 10 times as many cars, all that kind of stuff? And you build a system that can handle 10x the load, right? And I think, for me, again, I'm back

8:55to my theme of: hardware rules everything around me. >> The hardware that can handle 10x the load from where you're at now has come down significantly over the last few years. You really don't need that much hardware to handle 10 times whatever it is you're doing now. And that, to me, is the big unlock here. So again,

9:11everything Celina said was totally wrong. I'm embarrassed for her that she said it, etc., etc. Yeah. Anyway, >> Josh, I'll come back to you, but I know Jake can disagree with you as well. Let's try.

No, I mean, I could take this in a couple of different directions. >> I'll disagree with that one: I care a lot about money, and saving money, >> and I find

9:37it is very easy to fall into overengineering, but you can also fall very quickly into overprovisioning and oversizing for some hypothetical 10x magical-unicorn thing, and that usually doesn't happen. So a lot of my time of late has gone in two different directions. One: these very large implementations of system X

10:01that are highly overprovisioned, where it's very easy to start pulling workloads out. So, in almost all of

10:09my cases thus far, it hasn't been about the data, but rather: hey, if I take this little hammer, that's all that's needed to tap this little nail over here. I can save a ridiculous amount of cost, and a ridiculous amount of complexity, on and on and on. So it's overprovisioning and cost saving.

10:31Um, but also, once the toolbox is minimal, it's a lot easier to carry a hammer around than a sledgehammer, or a jackhammer. And I can take this hammer to FedRAMP Moderate, or FedRAMP High, or IL4 and IL5, with the same hammer, and the tool sets don't need

10:50to change across very different data volumes, or very different compliance mandates, etc. So it's been very interesting. That's kind of my take there. >> Those are interesting insights. I'm pretty sure Josh will have a contradictory statement, but we'll move on for now, Josh.

11:09James, I have an interesting question for you, because I chatted with you earlier and you mentioned you work closely with a lot of enterprises out there. And when I think about enterprises, I think about big data, to be honest. So when you think about big data, that obviously also gets me to a question about quality versus quantity.

11:28We often hear that more data leads to better insights. But how do you approach ensuring data quality over quantity, especially for enterprise businesses?

11:38How does it all work? >> Yeah, I think, like we were talking about, there's the quantity that you actually care about, which is your cold versus hot data and things like that. The data you actually care about, for most use cases, is from today, yesterday, the last week, or whatever. And so,

11:57you know, it's about defining your problem in terms of what your actual requirements are and what you're trying to do. Do I need to go look at the last 16 years of data to answer this question? Probably not. I probably just need to look at the last week, or hour, or day, or whatever. And so, when you're

12:13thinking about those problems: what are your actual requirements, and then what's the sledgehammer versus hammer you're bringing to the table, right? And for the vast majority of use cases, you can just bring your little hammer, or maybe a screwdriver.

12:30>> Josh, what do you think? >> I guess my question is: what do you do when you need a sledgehammer?

12:36Because, I mean, sledgehammer problems happen, you know, and I wish that sledgehammers were easier to carry around, >> but they are pretty heavyweight, right? So when a sledgehammer... I don't know. I mean, I'm kind of torn here. Obviously,

12:52I want to be contrarian, right? But I agree with Jake's point. A lot of people are carrying around sledgehammers, and they look ridiculous, right? They look insane, because they're super worried about the day when they really need that sledgehammer.

13:05>> But sometimes that happens to people. Sometimes they need that sledgehammer. And then I'm kind of like, well, what do you do? What is the plan? What do you all do when the sledgehammer problem arrives?

13:15>> You walk over to the shed and you grab the sledgehammer. >> But you need to build... I mean, the whole joke of the sledgehammer, right (it's weird that we've gotten to this metaphor, but that's okay), is that I need to build the tool shed first, right? I am literally right now, well, not literally

13:29right now, because I'm on stage with y'all, but today I am provisioning a Kubernetes cluster to run Spark on K8s, with Karpenter doing my auto node provisioning and stuff. I have to do a ton of work to make it easy, in some sense, to run these gigantic workloads. It's not, you know,

13:48the great thing about DuckDB is that it's there. It's a hammer. You can carry it with you. It's easy. You can bring it wherever it needs to go. This is just not true for the big data workload. So I'm still asking: what do you do when these things show up on your door? That's my question.

14:00>> Josh, I feel like you're handing us a slow-pitch softball here. >> Am I really? >> Like, if you separate compute from storage, and standardize the storage, and make it interchangeable across small hammer, medium hammer, and large hammer, you have something pretty powerful.

14:15I was a Tabular investor, so I have some interest here. >> Happens to be pretty useful.

14:21>> I'm so glad to hear that. That makes me really happy. I feel like I get to do the VC humblebrag thing: so happy to be a part of the journey. Anyway. Yeah.

14:30>> But I mean, that's great. That's fantastic, and that was my dream for Tabular. When Ryan was getting the company going, it was like: this is exactly what I wanted. I didn't want people to be locked in anymore. >> Yeah, that was fantastic.

14:42>> Very interesting. >> This is not controversial enough, right? We've got to give you something else. Give us something harder.

14:47Yeah. >> Celina, anything on data quality versus quantity? >> Yeah. I think, you know, contrarian stuff aside, we can all agree that more data doesn't always lead to better insights.

15:02And an example I have: at the last company where I was head of data, we were a startup, right? We were no older than five or six years old.

15:13And we were acquired by Procter & Gamble in 2022. >> And so what I'm trying to get to is that they acquired us. It was a skincare startup, and we were actually building it with data and tech in mind, which is what made us super attractive to Procter & Gamble. And why is that? Even though Procter & Gamble has been around for a

15:35long time, you know, you'll probably have to fact-check me on how long they've been around, and they're probably sitting on top of a lot of big data out there.

15:44>> One of the things that attracted them to us (and I think I can say this, but whatever) was that we were able to take our data and draw customer insights from it to drive our growth, right? They were sitting on all this big data, but unable to think about it holistically, right?

16:04Because of how siloed their data was, the way things were just sitting in storage, they were unable to get to that layer of active, usable data to drive their businesses. And when we were acquired and I was at Procter & Gamble, this is a classic marketing problem that still exists today. A lot

16:26of people want to know about marketing attribution: how are my leads coming to my website, or buying from us, right?

16:34And at the time, I basically...

16:38well, the company's name was Tula, or is Tula; it's still around today. At Tula we had bought an MTA product, and we were searching for an MMM product.

16:48And for those of you out there who don't know MMM, it's media mix modeling. It would help us understand, if you invest in paid media channels such as Google search, Instagram, TikTok, all that stuff,

17:03how it's driving traffic, not only to your DTC website but to Amazon and to all these retail businesses we were selling through. And so, even though we had five to six years of data, the traditional MMM vendors out there were like: we need a lot of big data to make sense of this and drive insight to tell you how

17:19to invest in your marketing strategy. Whereas there's some new MMM tooling out there that's like: we only need two years of data, at most. In fact, Bayesian MMM takes priors into account. So even if your company hasn't been around for a long time, or times are moving pretty fast for your business, you can bring in

17:39those priors to assess what actually should be brought into the model, right? And so what Procter & Gamble realized was: even though they're sitting on years upon years of data, and using big companies out there like Nielsen and whatnot for MMM, they were turning their heads to us, because they're like, "Wait, you don't need this

17:59big data to actually have great insights and drive growth at the pace we were driving Tula at." And so, prior to my coming to Data Culture, they were like, "We'd love for you to stay and do this across our brands," because Procter & Gamble has plenty of brands that are probably sitting in your houses and

18:18apartments today, right? >> And it's only being sold at Walmart or Target, but they want to take that data and harness it, right? And I know there was a joke about unlocking your data, but enterprises love those words: how do I unlock my data with a smaller set of data, versus sitting on all this big data and not doing much

18:36with it. >> Yeah. No, thanks for sharing that use case. I'm also curious to learn a little about the big question of tools and technologies, right? Are there any specific tools or approaches you recommend for businesses aiming to adopt a minimalist data strategy or culture? How do they compare to the

18:57more enterprise-heavy tools out there that just got into the market? Josh. >> Oh man, tools and technologies. Let's see. I mean, I'm biased here. I guess, you know, I wrote dbt-duckdb, stuff like that, and I think there are at least a few users of dbt-duckdb here. It's been nice

19:17to see some of y'all. Thank you so much for using my stuff. DuckDB is great. I would have done things at Slack so much differently had DuckDB existed, because it was kind of the situation, and this is the theme here, and kind of what Jake has done at Okta, that we didn't have a big

19:33data problem so much as we had a relatively large number of small data problems, >> and just being able to have the tooling and the architecture that would let us refactor all of this sort of stuff, where big data was literally the only option for working with lots of small data,

19:50right? Now we can just work with lots of small data directly, and stuff like that. So I think, yeah, it sort of begins and ends with DuckDB for me, and then anything else around that, whatever your own personal preference is, is fantastic. But yeah, I could not imagine life without it, any more than I can

20:06imagine life without Claude or Cursor or any of the other modern tools I use now for AI stuff.

20:11>> Sorry. >> All right, James. >> Yeah, I think, almost to his point, we have so many more options available to us now. We have interop between everything, because we have, you know, separation of storage and compute, you know, Iceberg, >> Iceberg's good, >> you know, all that other stuff. And so, you

20:30know, everything's much more pluggable than it used to be. You used to have to take a tool, and that was the tool you used; you brought that to bear.

20:40So if you had a problem that was untenable >> with cheaper tools or whatever, then that was the tool you ended up using for everything, because you were locked into a choice for your overall stack. And now, because we have interop across the query execution layer and the storage format and all that other type of stuff,

21:00we have the option to compose our approach to different problems, instead of: oh, I'm bringing just BigQuery to the table, or just Databricks, or just Snowflake, because they work for these problems I have to solve, but, you know, it's like the one to two percent of problems.

21:18>> Yeah, Jake. >> Yeah, I think my perspective here is not necessarily tool-set specific, but rather:

21:30number one, knowing where you want to be, knowing what the goal is, and knowing the tools very well, and then turning the levers to optimize in a direction. Do I want all of this at the expense of high cost? Then I can have that, and I can go this route.

21:52But if I want to reduce cost, or reduce latency, or I can sacrifice some durability, and so on and so on, I think there's a lot of value in flipping those levers correctly. An easy example: at work I have dozens and dozens of Snowflake instances, and we have to serve

22:11like five metrics on a dashboard that everyone hits when they log in as admin.

22:16And creating, or keeping, an extra-small, single-cluster Snowflake warehouse up 24/7 to serve literally five numbers on a dashboard is like a few hundred grand. Not an option. So,

22:31I know that use case, I know the tools, and you can kind of mix and match, with the understanding that doing that too often, or diverging too far, is going to increase complexity over the long run. So, you know, I mean, DuckDB is cool.

22:49>> It's okay. >> There's all kinds: Iceberg is cool, Delta is cool.

22:54>> Yeah. So, >> it's more about >> knowing them. >> Yeah, knowing them makes it easier, and it also depends on the use case, I would say. >> Yeah. >> Celina? >> I mean, I was going to try to be contrarian, but I guess I'll add additional tools from some of the

23:12examples of clients I've been working with now. >> Please. >> Sorry, Josh. Did you have a contrarian thought?

23:18>> No, I'm actually legitimately curious now. >> Oh, >> I mean, I'll be contrarian after, but for the moment, I'm very curious.

23:24>> Love it. Well, the example is: I'm currently working with a client who's got thousands of dbt models, right? So if you're sitting in the room today and you just walked into a job, or you're currently sitting at a job, where you're dealing with all these models and sitting there going, "What the hell is this table doing here? And

23:42why do we need this table?", and your DAG looks worse than Medusa, right? I don't even know if it's a DAG anymore.

23:50Um, we were brought in to untangle this web, and to be honest with you, I looked at it and went, "Oh, hell no. What did we just walk into?" And then, when I thought about, you know, data minimalism, small data, how do I make this something smaller? We were working on updating their model to

24:09redefine a new metric they were introducing to the business. Guess what? It took 10 hours, and it was still going. We were updating the model, and it took 10 hours to test our change. And at that point I'm like, well, my client's going to be pissed for two reasons, right? The cost, and how long

24:26it's taking, right? Because if this change doesn't work, we've got to go back to the drawing board and do it over again. And so, I don't know if Toby is in the room, but shout out to SQLMesh and Tobiko Data. That was a tool I thought to bring into the mix, because I'm like, how do we bring

24:42SQLMesh into the mix, where SQLMesh is free, right? And I was telling my client, I'm like, hey, I know you're running on dbt Cloud right now, but you can use SQLMesh locally to test this change without having to rerun all these models to check whether it worked or not. And so that's

25:00how we're actually going to test it. We haven't done it yet, so maybe I'll have to report back, but that's how we were trying to make a big problem something small and sizable to work with.

25:13>> Thanks, Celina. We'll follow up for the report-back for sure. And I'm also curious to learn a little about, you know, the challenges. We've been talking about how small data could be the next big thing, but what about the challenges of going minimal? What are some common challenges you've

25:32seen, that businesses face when they try to scale down their operations? Because sometimes it's like: very big data is working now; how do you get it down? There are problems and challenges. What are some of those challenges, and how do we overcome them? Jake? >> Yeah, I think the thing that I've seen

25:54consistently here is: once upon a time, you would precompute stuff over here. You would pipeline a Redis database out to your application, and it would serve metrics, and it would be very cheap and simple, and on and on. And there's, you know, application programming, product engineering over here, and then data

26:16teams, which don't really speak that language. >> But also, conversely, telling those people that you can put an OLAP database into a process and it will just eliminate pipelines and stuff is, like, laughable. So there are a lot of, I don't know, conceptual, philosophical challenges,

26:40and when you can start connecting those dots, you know, it's both challenging and, like, laughable once it starts clicking, because it happens so fast: oh yeah, I don't need all this stuff, or

26:54this stuff, to serve a similar outcome. So I think that's consistently my challenge: telling the architect of the data team that I can do this in process, on files in S3, and hearing, "No, you can't." Okay, sure. And then just going away for a while and being silent until

27:15it's just spreading like wildfire. So it's very, you know, yeah. >> I see Josh is agreeing there, so he has a point. They know.

27:23>> No, I do and I don't. I mean, I do and I don't. I think about: what are we going to miss from the big data era when it's gone? That's kind of my question. Whenever I'm in an organization and there's a reorg, you know,

27:34executives love reorgs, right? Literally nothing makes them happier. It makes them feel useful. I'm always curious, when you do a reorg: what am I giving up? What is the invisible thing I take for granted that I'm not going to have anymore once this is gone? And I think, I'm not sure, but at least what I feel now,

27:50doing AI stuff, is that I very much feel it when I lose one of my H100s.

27:55Like, I feel it, right? In a way that I never really did in the big data days. I didn't care about losing a node. I lost a node, who cares? The whole system is designed to be resilient to it. I don't even have to think about it. It's not even a factor for me. And I worry that as we

28:11shrink things down, the stuff that's left becomes much more important, and when it fails... how do we preserve the reliability, and the indifference to failure, that we got used to? That's maybe my big question for small data. Again, there are good people working on

28:26this kind of stuff, and I'm optimistic, but it's my worry. It's what I worry we're going to miss.

28:32>> Yeah. >> I really miss batching bolts in Storm. I really do. >> Oh, I remember Storm. I remember Storm.

28:38Sure. Exactly. Well, Nathan's got the new thing. If you want to use his new thing, you can do that, if you're just tired of being happy. I don't know, do whatever you want.

28:46Anyway, >> James, what do you think? >> I'm going to completely change direction from what they've been talking about, but one of the tradeoffs you can make is your engineering complexity versus the cost of your

29:02system, right? So if you're running a big batch job because it was easy to write a big batch job, your engineering time is very low, relatively speaking. But now you want to get it into, let's just say, DuckDB for whatever reason. Well, maybe the server you can run it on is

29:21smaller than the actual size of the dataset, and so now you've got to change this batch process into an incremental process. By changing that into an incremental process, you've made it a more challenging engineering problem.

29:33>> By making it a more challenging engineering problem, you've switched where your complexity lives. Right? You had your distributed shuffle for your larger-scale system, but now you have to model incrementally and manage the engineering complexity of the smaller-scale system.
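The batch-versus-incremental tradeoff James describes can be made concrete in a few lines of Python. This is an illustrative sketch, not code from the panel; all names and data are invented:

```python
# Sketch of the tradeoff: a batch job is easy to write but needs the whole
# dataset; an incremental job fits on a small machine, but you now own the
# state management (the complexity that moved off the cluster).
from collections import Counter

def batch_totals(all_events):
    """Recompute everything from scratch: simple, but memory scales with data."""
    totals = Counter()
    for user, amount in all_events:
        totals[user] += amount
    return dict(totals)

def incremental_update(state, new_events):
    """Fold only new records into persisted state: cheap per run, but you
    must now handle late data, duplicates, and backfills yourself."""
    for user, amount in new_events:
        state[user] = state.get(user, 0) + amount
    return state

events = [("a", 1), ("b", 2), ("a", 3)]
state = incremental_update({}, events[:2])    # first micro-batch
state = incremental_update(state, events[2:])  # next micro-batch
assert batch_totals(events) == state  # same answer, complexity relocated
print(state)  # {'a': 4, 'b': 2}
```

The persisted `state` dictionary is exactly the new engineering burden: the answers match, but correctness now depends on code the batch version never needed.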

29:49>> Love it. Celina? >> I think I'll summarize it with two points. The challenges of going minimal are very similar to, remember when everyone was watching Marie Kondo, right? You've got a lot of clutter. You've got a lot of data you're dealing with. That's data overload, right? That's equal to all the clutter that's living in your

30:08house today. I'm guilty. So one of the challenges is: how do you even get started on it? And then I think the second part is resource constraints, right? Similar to the Marie Kondo concept, people call on Marie Kondo because they themselves either don't have the time or don't have the capacity to go through that clutter. And I think

30:27in our data world, if I'm the head of data, or if you're on a data team today or running a data product, do you really want to spend the resources on decluttering, right?

30:46And if you're sitting on a data team, how do you prove that value to someone?

30:49It's like dealing with technical debt, right? You know it's there and you hate it, and every time you go through your process you're like, I've got to deal with this. But at the same time, it's hard to prove the value of why I have to deal with it right now.

31:02>> That's right. >> I love that analogy, and I'm going to use it to dunk on you now. That's awesome. Because everyone saw Marie Kondo had kids, right? And now suddenly she's not so into minimalism anymore, because, all of us, people got kids, right? You're going to be one of those

31:15parents who say we're not going to have too many toys, we're not going to have too many books. Do all the parents here have too many toys? I have too many toys. Does everyone have too many books?

31:24I have too many books. Right. Yeah. Thank you. Appreciate that. Thank you, Renee. >> Yeah.

31:29>> He's nodding along, but I'm like, okay, eventually you come back to: let's put these in boxes and get this organized again.

31:36>> But I feel like I do the same thing. I go to a new company and I'm like, okay, this time I'm going to do it differently.

This time I'm gonna be minimalist, right? But then, one of the jokes we had at Slack was that you could basically predict our AWS costs perfectly based on how many engineers we hired. That was the single best predictor of how much money we would spend on AWS. Not customers, blah blah blah. How many engineers do you

31:59have? That's how much money you're going to spend on AWS, right? How many little side projects and ideas people have and want to try, how many times you incur some technical debt or let them go off and do some compute because it makes them happy and they won't bother you

32:14anymore, that kind of stuff. It's the same sort of truth in some ways.

32:18>> Yeah. I think the last Marie Kondo thing is, we should go hard on this.

32:22I mean, the whole famous line of "what sparks joy," right? I'm not saying you should go to your CEO and say this data project doesn't spark joy and therefore we should toss it out.

32:33>> I think you should, though. I don't know. >> What I'm saying about "sparks joy" is that it's analogous to: is it actually important to the business, right, or the organization, or the product you're building?

32:43>> No one in the data world uses the words "Spark" and "joy" in the same sentence.

32:46Seriously, literally no one. I run Spark jobs all day. It does not spark joy in me in any way. It's just what I happen to be good at. It's my one useful skill. But man, it does not make me happy. >> I feel like you just found everybody's joy right there. They're like, "Oh, finally somebody said this." It was sort

33:02of the subtext of the whole conversation. >> I'm just the guy who says things out loud because I can. Anyway. >> All right, this is a good discussion.

We'll keep the discussion going over drinks after this as well. But one last question for all of you, and I'll just need one sentence for it. I know it's not going to be one sentence, but you can all try. Where do you see small data going in the next one to two

33:28years? Josh, we'll start with you, real quick. >> That's a great question. I mean, it would be amazing if it could dethrone the AI hype cycle, you know? That would be super cool. I don't see exactly how that happens, though. I suspect small data will

33:46land somewhere between AI and crypto on the hype cycle. That's a fairly safe prediction. >> So you're saying for the next 20 years, we're going to be hearing about small data? >> What do you mean? I don't understand. >> Like, small data will be talked about for the next 20 years if we get that hype.

34:01>> Well, the joke at Google was that Google didn't call it big data. Google had a word for big data: data, right? Eventually, we will have a word for small data: data. >> Like, it's done. We don't think about it anymore. That's always my hope. Yeah.

34:15>> Okay. >> Yeah, the goalposts move every year, right? So we're just going to move the goalposts, and then 10 years from now, it'll be wherever the goalposts are based on the hardware we've got.

34:25>> Yeah. Awesome. Celina? >> I would say in one to two years, small data is still going to stay in this niche space, right? Because think about how long it took for us to adopt the term big data. It wasn't until probably Harvard Business Review made it a big thing. And then

34:41executives who don't know anything about data are now saying, "I need a data scientist to deal with my big data." So, you know, maybe in five to ten years. Who knows? But that's my take. >> Jake? >> I think it'll be very similar to SQLite: literally everywhere.

35:01>> I hope so. That'd be awesome. >> That's awesome. >> Yeah, that's the panel for you guys. Thank you very much to all the panelists, and to everyone attending this session.
