Small Data SF 2025

Projection Pushdown vs Predicate Pushdown: Rethinking Query Efficiency

We were told to scale compute. But what if the real problem was never about big data, but about bad data access? In this talk, we’ll unpack two powerful, often misunderstood techniques—projection pushdown and predicate pushdown—and why they matter more than ever in a world where we want lightweight, fast queries over large datasets. These optimizations aren’t just academic—they’re the difference between querying a terabyte in seconds vs. minutes. We’ll show how systems like Flink and DuckDB leverage these techniques, what limits them (hello, Protobuf), and how smart schema and storage design, especially in formats like Iceberg and Arrow, can unlock dramatic speed gains. Along the way, we’ll highlight the importance of landing data in queryable formats, and why indexing and query engines matter just as much as compute. This talk is for anyone who wants to stop fully scanning their data lakes just to read one field.
Speaker
Adi Polak

Author of O'Reilly Scaling Machine Learning with Spark

Confluent

Adi is an experienced software engineer and people manager. She has worked with data, AI, and machine learning for operations and analytics for over a decade. As a data practitioner, she developed algorithms to solve real-world problems using machine learning techniques while leveraging expertise in distributed large-scale systems to build machine learning and data streaming pipelines. As a manager, Adi builds high-performance teams focused on trust, excellence, and ownership. Adi has taught thousands of students how to scale machine learning systems and is the author of the successful books Scaling Machine Learning with Spark and High Performance Spark, 2nd Edition.

0:11 So, just so I know where I am. Is this a big data conference? >> No, that's a small data conference.

0:20 Okay, I'm in the right place. Uh, super excited to be here today. Has anyone ever heard about projection pushdown or predicate pushdown before?

0:29 All of you? And if you're new to this, don't worry, we'll cover everything. We're going to keep it very high level. There's a lot that goes into query optimization and into thinking about storage and compute and so on, but we're going to keep it at a level where each of you, if

0:43 you decide to, can go and dive deeper into these topics and learn about them later on as well. One of the things I do a lot of is profiling queries, also on my team, and I'll get to that in a second. One of the things you discover when you profile queries is that

1:02 data movement dominates the execution cost. And what does that mean? What is data movement when we're running a query? Just to give you some statistics: roughly 55% of the cost of a query, in terms of time, comes from moving data between machines and copying some of the

1:26 data, and so on. It becomes very, very expensive. So if in the past we were told that we need to scale compute, now we know that sometimes the bottleneck is not exactly the compute but something in between: the storage that we have and how our query engine decides what to read and

1:45 where to read. Throughout the years, if you've been in the big data area, you know that we build clusters, we scale pipelines, and recently we even started throwing GPUs at them. Has everyone heard about things like CUDA, perhaps a little bit here and there? So we started running GPUs in them, and we moved from a system

2:07 that is only CPU bound to one where we combine CPU and GPU in order to run our analytics. But the reality of

2:17 things is that it was never only about scaling horizontally. It was more about scaling up the machines that we need, and thinking about how we avoid scanning the entire data lake just to read one column.

2:32 So what if the problem was never big data? What if it was just bad data access on our behalf? I'm Adi Polak, and I'm very excited to be here. I've worked in data and analytics for more than 15 years now. I started my career in the machine learning space, in what we call the winter of machine learning, the previous

2:52 wave. Then I moved all the way to big data, and today I'm with Confluent doing data streaming, which is also a branch of distributed systems. I moved from being an IC, after 10 years, to being a people manager. I wrote two books for O'Reilly, both of them about Spark; some copies are in

3:12 the swag room so you can grab them if you'd like, and this is my LinkedIn in the QR code if you're interested in connecting.

3:20 All right. In order to understand how we can optimize things, and think through storage and the different types of pushdowns, we need to understand the journey of a query. This is a very simplified journey of a query, of course, but the reality is they all run roughly the same thing with different

3:37 optimizations that go into them. First, we need to write SQL, or some code that's going to create the query.

3:44 Second, we have a parser. We're going to take this query, parse it, and try to understand what the different operations are that we need to run. After the parser, there's a planner. The planner is going to take all that information about the different operations it

3:59 needs to run. Then it's going to check statistics: what statistics we have on the data, what optimizations we can do there, and other related aspects, like the min and max of the values, and whether the values even exist there. There's a rule engine that goes into it: can we prune some of these operators, or do we actually have to read all of that or not?

4:22 So the planner here becomes critical for us, and then we move into the storage layer. Everything the planner planned for turns into an actual execution plan against the storage layer, and the storage layer is going to compute what is necessary and pass it on to the compute engine. So you see, a lot of the work before we actually start to

4:44 compute, and before we need the compute itself, actually happens on one machine, because the query, the parser, the planner, all of them can run on one simple machine. And of course, there is a correlation and a combination here, a partnership between the storage layer and the compute engine at the end.
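As a small, hedged illustration of that journey, here is one way to peek at what a planner produces, using DuckDB's Python API; the file and column names are placeholders invented for this sketch, not anything from the talk.

```python
import duckdb

# Create a tiny, hypothetical Parquet file so the plan below has something to bind to.
duckdb.sql("COPY (SELECT 1 AS order_id, 'US' AS country, 9.99 AS amount) "
           "TO 'orders.parquet' (FORMAT PARQUET)")

# EXPLAIN shows what the parser and planner produced before anything executes.
# The Parquet scan operator in the plan should list the projected columns and
# the filter that was pushed down into the reader.
print(duckdb.sql("""
    EXPLAIN
    SELECT order_id, amount
    FROM 'orders.parquet'
    WHERE country = 'US'
"""))
```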

5:02 The I/O part can become really expensive when multiple machines need to read the data. That involves serialization and deserialization of the data over the network, and that's often what dominates latency. Every time we need to read the data from more and more machines, it creates more latency for us; it becomes a

5:24 bottleneck. Something to think about when we're thinking about planning and query execution. And how does it all connect to predicate pushdown and projection pushdown? So imagine you have a 500-

5:40 page book, kind of like a dataset that we have, but you only need the pages with the recipes in it. So out of those 500, maybe you need 50 pages. This is exactly what projection pushdown means.

5:53 You can imagine the librarian, or ChatGPT, hands you just the exact pages that you need and not necessarily the entire book. So if

6:05 we simplify it and say it in very easy words: the query engine only reads the columns you actually need.

6:16 That's going to reduce I/O because, if you can imagine, we're backing it up with the planner. Before we reach the storage, the planner already knows that it doesn't need to read all of these files. It has statistics, it knows where the files live, it gets the metadata, and it knows

6:34 it doesn't need to read all these files. So already here, when we have projection pushdown and we decide to read only specific columns, this is where we get the first optimization. Now storage has to comply.

6:45 I have to have a columnar layout there; storage has to work together with my projection pushdown. And of course I have to have metadata about what exists there. It can't be just Parquet files on S3; I have to have another layer that gives me that information so I can retrieve it.
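As a minimal sketch of what that column skipping looks like in code, here's projection over a Parquet file with PyArrow; the file and column names are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A hypothetical file standing in for one file in the lake.
pq.write_table(
    pa.table({"user_id": [1, 2], "country": ["US", "DE"], "payload": ["...", "..."]}),
    "users.parquet",
)

# Projection pushdown: only the requested columns are decoded from the
# columnar file; the (potentially huge) payload column is never read.
users = pq.read_table("users.parquet", columns=["user_id", "country"])
print(users)
```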

7:03 Projection pushdown essentially makes everything faster and cheaper. One thing I will say: please don't do SELECT *.

7:13 If you do it, please stop. Easy to say, hard to convince the rest of the company, I know. But if you write SELECT *, at the end of the day it needs to bring everything, so it's going to be a full scan either way. This approach, if we optimize, means less memory and less CPU

7:33 used, which is what we want, faster queries, and also lower cost, because for every machine we're spinning up, compute at the end of the day is more expensive than storage. Nothing we can do about it.

7:42 Storage is relatively cheap, so if we can reduce compute, we can reduce costs. Now the question is: what if it fails? What if I have a nested row, and within that nested row I have things like JSON?

7:58 With JSON, we cannot skip columns. So we need to make sure our data is structured in a way that actually supports what we want to achieve in the queries.

8:07 So if you can avoid storing JSON as much as possible, I recommend doing it.

8:12 Coming from the data streaming world, I know there are a lot of formats that don't always translate well between data streaming and analytics, but we'll talk about that a little bit later; I'll show you a diagram of more technologies nested together. Another thing that doesn't translate well is UDFs, user-defined functions. These are functions that we decide and define as

8:32 part of our SQL query. These don't translate well because most query engines don't know how to handle them.

8:38 That means it's a black box for the query engine; it's really hard to run any rules on top of them. Some of the rules that exist today are more statistics-based. So if my production runs the same query or the same job every hour, every 10 hours, every day, then I'll get some statistics

8:55 about it, and then the machines in my warehouse can optimize it a little bit. But I still won't necessarily get all the benefits I would have had if I was using a function that the engine is familiar with.

9:10 So, something to think about. Now, how is that different from predicate pushdown? Predicate pushdown essentially means filtering the data as close to the source as possible, on the row level, which means we're pushing down the WHERE condition. You can see the SELECT * here, don't do that, but the WHERE on

9:30 country is going to help us filter

9:34 some of these rows, because the engine knows it only wants the specific rows where the country is United States, and it's going to filter out the rest. So why does it matter? Here we shift from reading fewer columns to reading fewer rows: less I/O and compute cost, because we're doing the same kind of thing. And it's

9:55 especially powerful if our data is partitioned or indexed. Again, we still need to have an index on the data; we need to think about our storage and how we're doing that. Where is it going to fail?

10:05 Again: JSON, complex expressions or UDFs, and connectors that don't support filtering. Now, this is a big one. There are a lot of connectors today in the industry; I see a lot of connectors from Kafka to Iceberg, from a bunch of technologies to a bunch of technologies. Many of them don't necessarily support this filtering. So when you

10:28 decide which connector to use, think about what you need and whether you need this filtering or not, because it's going to be a game changer for your performance.
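Here is a small, hedged sketch of a filter being pushed into a Parquet scan with DuckDB; the file, table, and column names are made up for the example.

```python
import duckdb

# A hypothetical events file; in a real lake this would be many files.
duckdb.sql("""
    COPY (SELECT * FROM (VALUES (1, 'US', 10.0), (2, 'DE', 20.0)) AS t(id, country, amount))
    TO 'events.parquet' (FORMAT PARQUET)
""")

# The WHERE clause is pushed into the Parquet scan: row groups whose min/max
# statistics rule out country = 'US' can be skipped without being decoded,
# and only the id and amount columns are projected.
us_rows = duckdb.sql("""
    SELECT id, amount
    FROM 'events.parquet'
    WHERE country = 'US'
""").fetchall()
print(us_rows)
```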

10:38 Let's summarize a little bit. We have projection pushdown, which means we skip columns we don't need, and then we have predicate pushdown, which means we skip rows we don't need.

10:50 Here's a little diagram: I have the query, with the filter condition and the related columns, which is where predicate pushdown comes into play. The row group stats are going to filter down to the rows I need to read, versus projection pushdown, which, once I've filtered down to the specific rows I need, or if my

11:07 storage is columnar, only reads the columns I need from there, again depending on how my storage is organized. This diagram is Parquet, because for most of our analytics we live in the Parquet world, or columnar space, but these two tackle different things that together give us the performance

11:26 that we want. Now, there are a lot of solutions that do this well; there's a full ecosystem. I just brought a couple of them. I brought one.

11:34 It's considered state-of-the-art: SQL Server. Old, maybe some people are still using it, but even though it's very old it works well and knows how to do all these optimizations. It's one of the most interesting query engines in the world, with all its different capabilities. As well as Oracle: Oracle is extremely well known for publishing lots of great articles in that space. If you are a

11:57 database nerd and you like the internals, they have a lot of material. And there are also other solutions that we know from the open source space, like Apache Flink, which is mostly known for data streaming but also has a batch world, Apache Spark that was mentioned before, and of course DuckDB, which also gives us these capabilities. Each one of them is slightly

12:17 different in implementation. Some of them have catalog stats that they know how to collect; some of them have file-level pruning. Again, we need to know what we're doing and have an understanding of where we want to head with that.

12:32 Now, the real bottleneck, and I've kind of sprinkled it through the talk today, is that roughly we can think about four different blocks of bottlenecks that can happen between my query planner and my storage. One: I don't have any statistics. I have a query, but I just didn't collect any

12:50 statistics about the data. I don't have a job that runs once an hour or once a day that makes sure I have everything up to date, and if I don't have it, I'm not going to expose it. The query engine essentially has no clue where your data lives, so it has to go and scan

13:10 everything. It's kind of a failsafe mechanism: it's going to scan everything and hope to find a good match.

13:20 It could also be that the data is outdated. We're running the metadata job once a week, but actually our data is being updated every five minutes. So again, data is missing and outdated.

13:31 It's a problem. One of the issues I had, I think maybe 12 years ago: we had a nightly job that updates the metadata, and there was a change in time. The clocks moved, like we had recently, and I had a job that ran at 1:00 a.m. So at midnight, the metadata job ran and updated

13:55 everything. And at 1:00 a.m. I had a job that was supposed to run, but because of the hour change, we missed that metadata update. So the job that ran, ran on old metadata from the previous days and produced the wrong results.

14:11 I dug into the profiler, I dug into the query engine, everything looked legit. I dug into the metadata; the metadata looked legit, because, again, I saw it afterwards, when I had to debug what happened with that query.

14:23 Only then did I speak with a couple of folks: hey, we just moved the clock and there is an hour difference between these jobs. So that means the query I ran was running on

14:36 missing, outdated metadata. It was a big one. It was very hard to find, and again, it's not something that will show up every day, but keep in mind that you need to keep the metadata up to date.

14:48 Encoded columns that don't have a schema can also be a bit of a problem: you need to read and parse every record. And then there's lack of min/max data.

14:58 When our query engine starts to think through the files it has, if I have a filter on a specific field or a specific value, let's say an integer, it needs to have some min/max information there so it will know how to prune.

15:16 For example, if I have values from 0 to 15, and in my query the filter is greater than 20, then it's irrelevant for me. A good query planner will say don't run it, because you're just going to do a full scan for nothing. So that's a

15:34 really important part, the min/max data, as well. The last part that's missing in there was actually another piece related to the storage and the columns, but you'll have the slides so you'll be able to see it.
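For intuition, here is a minimal sketch of the min/max statistics a Parquet file already carries per row group and column; the file name is hypothetical, and this assumes the writer recorded statistics, which most writers do by default.

```python
import pyarrow.parquet as pq

# Inspect the footer metadata a planner would use to prune row groups.
metadata = pq.ParquetFile("events.parquet").metadata
for rg_index in range(metadata.num_row_groups):
    row_group = metadata.row_group(rg_index)
    for col_index in range(row_group.num_columns):
        column = row_group.column(col_index)
        stats = column.statistics
        if stats is not None:
            # e.g. a filter "amount > 20" can skip any row group whose max is below 20.
            print(rg_index, column.path_in_schema, stats.min, stats.max)
```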

15:50 So let's take the full flow, from ingesting to querying. Let's say I have a solution like Apache Kafka, and I believe everyone here has heard about Apache Kafka before; it's existed for a while now. So I'm ingesting data: I have an app, I'm collecting data through Kafka, or I'm moving data between the different data warehouses or data solutions

16:09 where I capture my data. Then I'm doing some operational analytics with Flink, and I can decide to create some indexes with Flink as well, just before I'm writing that data into Iceberg tables. In between, I also need to update the metadata. So there's metadata, and Iceberg has

16:33 support for an Iceberg catalog. The Iceberg catalog only gives us what to implement; it gives us the interface. We still need to implement it and use a catalog. It supports things like Glue, for example, which you can use; you'll need to configure it, but you can use it, and there are other data

16:51 warehouses that give us implementations of this catalog. But again, this is something we need, because the catalog is going to give us the statistics; it's going to help the query engine make decisions as we query.
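As a hedged sketch of what that looks like from the query side with PyIceberg: the catalog name, table name, and columns below are hypothetical, and the catalog configuration depends entirely on your deployment.

```python
from pyiceberg.catalog import load_catalog

# Load a configured catalog (for example Glue- or REST-backed); "analytics"
# and "events.orders" are placeholder names for this sketch.
catalog = load_catalog("analytics")
table = catalog.load_table("events.orders")

# Both pushdowns are expressed against Iceberg's metadata layer:
scan = table.scan(
    row_filter="country = 'US'",              # predicate pushdown: prune data files
    selected_fields=("order_id", "amount"),   # projection pushdown: read only these columns
)
result = scan.to_arrow()
```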

17:03 Now, there's another layer in there, and there's a good reason for it. This is Arrow. A lot of our solutions today need multiple query engines, different types of query engines, reading the data where we have the storage. And if all this data already lives in Arrow, I'm able to read it from multiple engines and exchange

17:24 data that complies with the Arrow requirements, and that's going to save me a lot of cost on data exchanges as well. Many solutions already use it internally. In the big data world, back in the day, it was a game changer, but it's still here and it's here to stay, and we see it more

17:43 and more with new data formats; if you've been exploring data formats in the machine learning space, things like Lance and others.

17:52 Arrow is a big deal. So we want to support that; we want to continue having it and make sure it's part of what we do. And lastly, when I have my storage all shiny and nice with all the metadata, I can use solutions like DuckDB, which is relatively cost-efficient for me, to query that data,

18:09 and I can give it to anyone in the company to query the data where they want, because I can run it in my web browser. It's super efficient, super easy to use, of course, as long as I have everything else sorted out.
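A minimal sketch of that exchange, with DuckDB scanning an in-memory Arrow table directly and handing the result back as Arrow; the table and columns are made up for the example.

```python
import duckdb
import pyarrow as pa

# A hypothetical Arrow table standing in for data that already landed in Arrow.
orders = pa.table({
    "order_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "amount": [10.0, 20.0, 30.0],
})

# DuckDB can scan the Arrow table in place (no copy into its own storage)
# and return the result as Arrow for the next engine or library in the chain.
totals = duckdb.sql(
    "SELECT country, SUM(amount) AS total FROM orders GROUP BY country"
).arrow()
print(totals)
```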

18:25 So, let's put things together. If we want to design for queryable data, we need to think about the format. How are we saving that data? We need to make sure we have a schema in there, right?

18:35 Parquet, Arrow, Iceberg, whatever you choose, make sure you have a schema and you're able to do the pushdowns as necessary. Keep metadata fresh.

18:47 Don't be me; or, you know, I was fine, but the clock change happened. Make sure the metadata is fresh: you have the min/max values, the file stats, the schema versions, everything the engine needs in order to decide what to skip, because every time we skip reading from

19:07 storage, we're saving money, because we're not bringing it into the compute, and we're saving on the I/O bottleneck as well.

19:15 Now, indexing still matters, even in a world where we have a lot of optimizations under the hood and the query profiler collects a lot of statistics about our queries and how to optimize them. It still needs a little bit of a push with indexing. It needs that support of: here's how we're going to index the

19:34 data, this is my primary, secondary, and so on. Pushdowns help us reduce what needs to be reduced through these indexes. Essentially, indexes speed up everything that we do, and there are a lot of types of indexes, and a lot of types of gains from that space as well: things

19:57 like partition evolution, thinking about how to split the data, Z-ordering, clustering, and so on. There are different algorithms depending on your needs, and many of them are already implemented in table formats that you might use today. So just make sure to use them as well.
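As one concrete, hedged example of that kind of layout work, here is a sketch of writing a Hive-partitioned Parquet dataset with PyArrow so a later filter can skip whole partitions; the directory, table, and columns are hypothetical.

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical sample data; in practice this would be your landed table.
events = pa.table({
    "country": ["US", "DE", "US"],
    "amount": [10.0, 20.0, 30.0],
})

# Partition the files by country so a predicate on country skips whole directories.
ds.write_dataset(
    events,
    "events_by_country",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("country", pa.string())]), flavor="hive"),
)

# The filter and column list are pushed into the scan: non-matching partitions
# are never opened, and only the amount column is read from the rest.
dataset = ds.dataset("events_by_country", format="parquet", partitioning="hive")
us_amounts = dataset.to_table(filter=ds.field("country") == "US", columns=["amount"])
```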

20:17 Hopefully I was able to convince you that, you know, this is not a big data conference.

20:23 In order for us to improve our queries and really leverage the technologies that we have today, we need to think about good data access and what our query needs in order to be efficient. It never ends there; we can make it better, faster, more efficient, and definitely win the day. I can share a little bit

20:43 about Confluent. At Confluent we have extremely low latency. I don't know if you've heard about our single-digit millisecond latency for processing a lot of Kafka events, and some of the reason for that is because we implemented a rack-awareness solution.

21:01 So even if we don't want to exchange data, when we do, we know exactly which rack to send it to, whichever is the closest in our data warehouse.

21:14 This came, of course, after a lot of research and a lot of learning about how to build a cloud that is efficient, and so on. So again: think about good data access and where your data is stored. Thank you so much.
