When not to use Spark?
Speaker

Holden Karau
Apache Spark PMC and Author of O'Reilly books on Apache Spark and Machine Learning
Holden is a transgender Canadian open source developer with a focus on Apache Spark and related "big data" tools. By day (and night, go go startup life) she works on bringing large language models and other AI tools to help healthcare users deal with insurance through https://www.fighthealthinsurance.com & https://www.fightpaperwork.com. She is the co-author of Learning Spark, High Performance Spark, and a few others. She is a committer and PMC member on Apache Spark. She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal.
0:00 [music]
0:05 [music]
0:11 So I am going to talk about when not to use Spark, right? Spark is not suitable for everything. It is super suitable for some things. I love Spark, but a lot of the frustration in my life comes from people misapplying Spark. So I'm going to do the thing which you wouldn't normally
0:32 expect, which is, as someone who works on Spark, try and convince you not to use it.
0:38 And I want to be clear: I am pretty biased, right? I think Spark is going to apply in a lot of situations, in part because I would like you to buy several copies of my books.
0:50 I've worked on it for about a decade. I have tattoos from my books about
0:56 Spark, and that really gives you an idea of my bias here. Another thing that I think matters is that I've tended to work at very large companies, with Fight Health Insurance being sort of an exception. Although we're a small company, we still have really massive data. It turns out there are a huge number of health
1:15 insurance denials out there, arguably too big to fit on a single machine. So that is a pretty good application of Spark. But not everyone is going to be working at that same scale, and that's completely valid, right? If your analytics happen to fit in memory, that's fantastic. The other thing: I tend to write
1:35 Python and Scala code. I love those languages. If you don't like working in Python or Scala or Java, that's something to keep in mind.
1:45 Spark is really good in Python and Scala, and we do have new APIs through Spark Connect for other languages, but the experience isn't as good. The other thing is, you know, I work on things that are pretty close to big data. So this talk is about why not to use Spark, but we are going to start
2:05 with a little bit of the why. And here's a cloud that you cannot run Spark on. I did try. It turns out at Electric Daisy Carnival, they do not like it when you try and deploy software to their lights. Okay, cool. So, you know,
2:23 the first reason to use Spark is when your data is just too big to fit in memory. And what does "too big to fit in memory" mean? One option: I can't open it in Excel. I would say that's probably not yet big data, but it's a good starting point. If you can open it in Excel, you probably
2:41 shouldn't use Spark, with some exceptions. If you can open it in pandas, that's also probably going to be a lot easier. Similarly, if you can work in MotherDuck, it's going to be a lot easier when it's on a single machine.
2:52 But if it's not going to fit on a single machine, this is probably a good sign that it might be time to use Spark.
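Roughly, that decision looks like this. A minimal sketch, assuming a hypothetical events dataset with day and amount columns; the file paths are made up:

```python
# If pandas can load it comfortably, pandas is the simpler tool.
import pandas as pd

df = pd.read_csv("events.csv")             # fine while this fits in RAM
daily = df.groupby("day")["amount"].sum()

# The same aggregation in PySpark, for when one machine can't hold the data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events").getOrCreate()
sdf = spark.read.csv("events/", header=True, inferSchema=True)
sdf.groupBy("day").agg(F.sum("amount").alias("amount")).write.parquet("daily/")
```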
2:59 The other thing here is that more things fit on a single machine now, right? It's like our house: we lived in a very tiny place, and we had a very tiny dog. However, the dog started to grow, and so we moved to a bigger place. And unlike in San Francisco, it's actually cheaper to get bigger computers now. It turns out
3:19 it is more expensive to rent bigger apartments, unfortunately, but that's okay. You can get a bigger house for your dog, and you can get a bigger computer for your analytics without spending all of your money on it, unlike us with our dog, because living in San Francisco is just a little expensive. The other
3:39 good reason to use Spark, even when your data does fit on a single computer, is if your analytics are too slow, or if you're doing something more complex than analytics. For example, if you're calling a bunch of LLMs, or doing a bunch of model fine-tuning, even if your data fits on a single machine, trying to
3:58 run it all on a single machine might not be good. And so what we can do is take our dog and put our dog on something faster, in this case Spark, right? Or my wife, carrying the dog in a little backpack.
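That "slow per-record work" case is where Spark can earn its keep even on modest data. A hedged sketch, where call_llm is a hypothetical stand-in for whatever model client you actually use:

```python
from typing import Iterator

import pandas as pd
from pyspark.sql import SparkSession

def call_llm(text: str) -> str:
    # Hypothetical: replace with your real (slow) model or API call.
    return "denial" if "denied" in text else "other"

def classify(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        pdf["label"] = pdf["text"].map(call_llm)  # the slow part, now parallel
        yield pdf

spark = SparkSession.builder.getOrCreate()
records = spark.read.parquet("documents/")  # assumed: has a 'text' column
labeled = records.mapInPandas(classify, schema="text string, label string")
```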
4:12 He surprisingly does not hate this. Yeah. So the thing is, there are more ways to go fast, right? Spark is one of the many ways you can take things and make them go fast, and it's not ideal for everything. Just like my Tiger 900: I love it, it's fantastic, but I would not ride it to
4:32 Seattle right now, right? It would be very uncomfortable. Similarly, Spark is not going to do well for all problems. But yeah, we can get a 55-CPU server on eBay
4:47 for not all that much money, right? And back when we started with Spark, that would have cost all of the money I had, and then some. Now we can actually get 55 CPUs, and it's like, well, maybe single-machine parallelism is the right way to do this. It's still work, though, right? It's not free.
5:06 Having 55 CPUs still means you have to write your code in a way that's a little different than if you were just doing linear processing, but it's still probably going to be less work than switching to Spark entirely. Oh, sorry: I said 55 CPUs. I meant 56. My bad. I forgot zero indexing. We're not living in Julia. It's fantastic.
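What "a little different than linear processing" can look like on one box: a minimal sketch using just the standard library, with made-up file paths:

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def count_lines(path: Path) -> int:
    # Stand-in for your real per-file work.
    with path.open() as f:
        return sum(1 for _ in f)

if __name__ == "__main__":
    files = sorted(Path("data/").glob("*.csv"))
    # One worker process per core; 56 to match the machine in question.
    with ProcessPoolExecutor(max_workers=56) as pool:
        total = sum(pool.map(count_lines, files))
    print(total)
```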
5:17 Okay. The other reason that some people choose Spark is they want to use machine learning. Yay, very fancy. We will get money from venture capitalists. If anyone knows how to do that, please send me a message on LinkedIn. It hasn't
5:37 worked out super well for my plan so far, but one of these days I'll figure out how to tell them that we're doing generative AI and receive large sums of money. But the thing is, Spark does ML; it doesn't do generative AI, and ML is just not cool anymore. More seriously, though:
5:54 Spark does have some machine learning libraries. In practice, though, the machine learning libraries it has haven't really been the winner. What's happened is we have tools like TensorFlow and all kinds of other wonderful machine learning tools that you tend to use with Spark, or on the output of Spark, rather than using Spark's built-in machine learning tools.
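That common split looks something like this. A sketch, with scikit-learn standing in for "your favorite single-machine trainer"; all paths and column names are made up:

```python
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import LogisticRegression

# Spark does the heavy lifting: filtering and feature prep over big raw data.
spark = SparkSession.builder.getOrCreate()
(spark.read.parquet("raw/")
      .select("age", "amount", "denied")
      .dropna()
      .write.mode("overwrite").parquet("features/"))

# The prepared training set is small, so a single-machine library takes over.
train = pd.read_parquet("features/")  # assumes pyarrow is installed
model = LogisticRegression().fit(train[["age", "amount"]], train["denied"])
```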
6:16 So this is cool, and you should definitely buy Adi's book from this morning; she has a book about using Spark for ML. So, corporate credit card people, please go and buy several copies. But otherwise, maybe you don't have to use Spark for your machine learning.
6:31 Okay, now let's talk about all the things that are terrible. Yay.
6:37 No one? Okay, some excitement. Okay, so: what are all the things that are terrible? This is sort of a prelude. We've got a very sad Ducati. Very sad. That's gasoline.
6:50 Bad news bears. We've got testing; at least that came back well. And then we've got a bunch of RAM. Okay, fantastic.
6:57 So, how many people here like writing Python code? Okay, keep your hand up if you like reading Java stack traces from your Python code.
7:07 Okay, yeah, this one is a bit of a low blow, right? But Spark is this fantastic thing where, when anything goes wrong, you get a Java stack trace. And the problem is, if you're writing Python code, reading Java stack traces might not be your core competency, and it probably shouldn't be, right? So Spark's debuggability is, at the very best, kind of
7:30 painful, because of some architectural decisions we've made. And that doesn't even get into the part where we have an error on a different computer. The saying about distributed systems is that a distributed system means the failure of a computer I didn't realize existed can result in the failure of my code. And so now you'll get
7:51 these errors, and they might not even be related to anything that your software is doing. It could just be that us-east-1 decided it was going to go to the spa for the day, right? And then you just get a bunch of Java stack traces from your Python code, and you check Twitter to see what's
8:06 happening with us-east-1. Uh, sorry, don't go to Twitter anymore. Bluesky, or, you know, insert your favorite less far-right social networking site here. Okay. So, on the flip side, debugging is actually getting easier, and these pictures do make some sense, because there's a tool called Sparklens,
8:26 and here we have lenses. It's a bit of a stretch. But it's fantastic, and we actually are getting better at making Spark debuggable. It's still not 100%, though, and inherently it's always going to be worse than on a single machine, because we always have another machine somewhere that can cause failures. So it's just going to be more
8:48 difficult. There's a fantastic profiling PR from one of my former colleagues, and it's amazing. It's PR number 52679.
8:57 If anyone's on GitHub right now, just being like, "Let me go take a look": it's fantastic. But the thing is, and I think this really illustrates it, it's taking existing single-machine tools and making it so we can apply them in a distributed fashion. And we're always going to be a little bit behind the state of the art, because the
9:15 state-of-the-art debugging tools are built for single machines. That's just going to be inherent in this problem.
9:22 How many people here use Spark and have never had an out-of-memory exception?
9:28 Okay, that is zero people. Right. So this is, once again, kind of a low blow. Out-of-memory exceptions can happen in single-machine cases too, but it's a little bit harder to figure out what's going on when you've got a whole bunch of different containers and one of them just happens to have an out-of-memory exception.
9:48 We can buy more memory. Here we have me buying more memory so that I can try and do some data processing. This actually went into a 4U server that I'm currently decommissioning; if anyone happens to be interested in value servers, give me a call. But more seriously, out-of-memory exceptions are kind of rough in Spark.
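The usual first aid, sketched below: more headroom per executor and smaller partitions so each task holds less at once. These are real Spark settings, but the numbers are made up and generally need to be set before the executors launch:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.memory", "8g")          # heap per executor
         .config("spark.executor.memoryOverhead", "2g")  # off-heap headroom
         .config("spark.sql.shuffle.partitions", "400")  # smaller shuffle chunks
         .getOrCreate())

df = spark.read.parquet("big/")
df = df.repartition(400)  # more, smaller partitions -> less memory per task
```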
10:06 The other one is records that are too large. And that's fundamentally the same problem that I had with this Ducati Multistrada. It turns out that while it's amazing and looks incredibly cute, and I love the red color, I am not that tall, and riding a motorcycle in heels is a bad idea. So pretty frequently I would drop my Ducati
10:27 Multistrada, and Spark will pretty frequently drop your large records on the floor, because it's designed for record-level parallelism. It's not designed to handle really, really big records; it's designed to handle tons of tiny records. Part of me wants to say that a one-gig record size ought to be enough for anyone, but those statements tend to age pretty poorly.
10:49 You know, "300-something kilobytes should be enough for anyone." Those statements just don't last, right? So this is a good reason to not use Spark.
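One hedged workaround, if you're stuck with huge records: split them into smaller ones yourself, before Spark has to hold a whole record per task. Here big_records is a hypothetical RDD of (id, blob) pairs:

```python
CHUNK = 1 << 20  # 1 MiB pieces; an arbitrary choice

def split_record(rec):
    rec_id, blob = rec
    for i in range(0, len(blob), CHUNK):
        yield (rec_id, i // CHUNK, blob[i:i + CHUNK])

# Each output record is small enough for ordinary record-level parallelism.
chunks = big_records.flatMap(split_record)
```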
11:01 Okay: testing. So, we should all probably test our data code more, and realistically, even the single-machine data code that I see is probably not tested as much as it should be. Like, how many people here think they have really good test coverage of their data pipelines? Okay, we've got two, and one of them works at a
11:19 company making testing software, at least. The second one might as well. Do you work at a company making testing software? No? Okay, so we've got one person whose day job is not testing but who has good test coverage, right?
11:34 And so this is the thing, right? Spark makes it difficult to test your code, and it's something that we all know we should do. But imagine that you had to fight a plushy stuffed animal to go and brush your teeth at the end of the night. You're probably not brushing your teeth, right? We all know
11:52 we should do this, but the more barriers we put in the way, the less likely you are to actually test.
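Lowering that barrier mostly means a reusable local session. A minimal pytest sketch (helper libraries like chispa add friendlier DataFrame assertions, but plain asserts work for small fixtures):

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One small local session shared across the whole test run.
    return (SparkSession.builder
            .master("local[2]")
            .appName("tests")
            .getOrCreate())

def test_dedupe(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    assert df.distinct().count() == 2
```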
11:59 Ops burden. Who loves waking up at four o'clock in the morning? This used to be me, but then I got ADHD medication, which I did not take today.
12:09 But, right, the thing is, Spark gives you many exciting opportunities to be woken up in the middle of the night by us-east-1. Yay. And this is
12:22 actually my USA Mobility pager. For a brief period in time, I did actually have a Motorola pager, back when I worked at Amazon. The [snorts] sound of a Motorola pager haunts me to this day.
12:32 I believe in Seattle, if you play that sound, you can probably find a lot of people just not having a good day. It is rough, right? So, a lot of employers, a lot of people who have paid me money in the past, will solve this for you if you give them money. Like, if you give
12:50 Databricks a bunch of money, yeah, they'll keep Spark running for you. It's great. On the flip side, do you want to pay someone to deal with this? Alternatively, if you're at a startup, there's a really good chance they're going to decide not to pay someone to deal with this and instead give you exciting opportunities for
13:08 career growth and advancement through a pager. Which is also not real, but you know, it's great. You can pretend. Yay. Okay, back to other things that don't work. So, here we go: we've got me listening to, let's say, some music.
13:25 Yes. And then over here we have a bunch of people listening to some music. Now, Spark is not like a listening party where one person listens at a time and we pass the headphones down. It's much more like a concert, where we shove in everyone we can, hope the fire
13:41 marshal doesn't notice, and play a bunch of music really quickly. This is cool, but it means that if we needed to do these things in order, if it was important that, say, the person with the red hair heard the music after the person with the backwards hat, we can't really make those guarantees. And that's kind of important for
14:00 a lot of things, right? If our bank were to just randomly order transactions... [clears throat] Wells Fargo. Actually, I guess that's not so much randomly ordered as sorted in the least convenient way possible. But imagine that things were just randomly ordered in the financial network. This might be what we describe as bad, and could be very unpleasant.
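When order does matter, you generally have to impose it per key rather than expect it globally. A sketch, where transactions is a hypothetical DataFrame with account, txn_time, and amount columns:

```python
from pyspark.sql import Window, functions as F

# A running balance per account: order is enforced within each account,
# even though partitions are still processed in parallel.
w = Window.partitionBy("account").orderBy("txn_time")
with_balance = transactions.withColumn("balance", F.sum("amount").over(w))
```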
14:20 The other thing is that the real world gets in the way of so many cool things, right? And this is why I largely spend my time, with my wife, who's amazing, I love her, at places that are not the real world, like Disneyland, or Casa Bonita in Denver.
14:39 Fantastic. They're currently on strike, though, so do not go to Casa Bonita right now. They deserve to be paid more.
14:46 And Beyond Wonderland, down in Southern California. Okay. So, what does this have to do with data processing? What it comes down to is:
14:57 when we try and do things in the real world, we can't just run them multiple times. Let's go back to our bank. For example, if we were doing some kind of side effect inside of a foreach, we would be kind of upset if it withdrew money from our bank account twice because the
15:14 partition happened to fail and had to be re-executed, right? This is a
15:20 little weird, but it's because Spark's approach to resiliency is to say, "Hey, things don't fail that often;
15:28 I'll just retry it whenever it fails." And that works really, really well when things don't fail all that often and retrying is idempotent, meaning doing it twice is totally fine. Whereas, for example, if we fly down to Disneyland twice, my bank account will notice that I have flown down to Disneyland twice, and I will have
15:50 some [sighs] not exactly unpleasant conversations, but I'll have to cut back on something else, like, you know, sparkly dresses or something, and that's just terrible, right? We have to do things in order. Sorry: side effects matter. Okay, so why does this matter? In general, what do we need to come away from this with?
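The standard way to make retried side effects safe is to key every external write on a deterministic id, so a re-run overwrites instead of repeating. A sketch, where payments is a hypothetical DataFrame and upsert() stands in for your store's insert-or-replace operation:

```python
def upsert(key, value):
    # Hypothetical: an insert-or-replace against your external system,
    # so writing the same key twice has the same effect as writing it once.
    ...

def write_partition(rows):
    for row in rows:
        # Same row -> same key on retry, so the side effect stays idempotent.
        upsert(key=row["txn_id"], value=row.asDict())

payments.foreachPartition(write_partition)
```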
16:09 So: Spark is not the one true ring. It is not going to solve all of our problems. And anyone who tells you that whatever they have is the one true solution to all of your data problems is probably trying to sell you something.
16:21 And they're probably not doing a very good job of selling it to you, either.
16:24 Right? Spark is really, really awesome, but it is not for everything. If we try and use Spark to process financial payments, it's probably going to go pretty poorly. If we try and use Spark to handle records that are too large, it's going to fall over. If we try and use Spark without an ops team, I'm going to
16:43 get woken up at three o'clock in the morning, and all of these things will be terrible. But the important thing here is that single-machine parallelism can get you pretty far. We can do a lot of really cool things with a 56-CPU machine, and you can buy or rent those for incredibly affordable prices.
17:02 And in fact, I am currently trying to offload one 56-CPU machine, so if anyone is looking for a 2U, 56-CPU machine as well, give me a call. Scaling before you need to has a real cost, right? I see a lot of people use Spark because they're like, "I'm going to have really big data," because
17:22 it's the American fallacy, right? We are all temporarily poor and will very soon be millionaires, of course, and, you know, stock market go up, number go up, great success. But just because we might have really big data in the future does not mean that we need to use really huge scaling tools right away. The most important
17:46 message that I want to leave you with today is that you should buy several copies of my books. Right? If you go to amazon.com, or anywhere that fine books are sold, you can buy several copies today with your corporate credit card.
18:01 Okay. It looks like I am actually early, so I have one minute and 48 seconds left. But we're going to go to this last slide, and I am probably going to disappear unless anyone has a question for my remaining one minute and 35 seconds. Does anyone have a, like, one-minute-long question?
18:22 No? Okay. Thank you all so much for coming right after lunch. If anyone wants to find me to ask questions in private, maybe, you know, you've got a Spark deployment and you've got an out-of-memory exception and you want someone to explain it to you: definitely don't look for me. But if you have a
18:39 more fun problem, or, for example, buckets of money, you can come find me in this sparkly dinosaur dress. I will be out in the hallway drinking coffee if it's still available, or, if it's not, I will be out in the hallway looking for coffee, wishing it was still available.
18:55 So thank you all. Thank you all. [music]


