Know Thy Customer: Why TPC Is Not Enough
Speaker

Gaurav Saxena is a Principal Engineer at Amazon Web Services currently working on Amazon Redshift, Amazon's fully managed, petabyte-scale data warehouse service. Gaurav is leading Redshift's move towards a touch-free autonomous system and has architected several features in this space, like automatic workload management, intelligent scaling, and automatic materialized views. Gaurav has published many research papers and patents in the field of ML for systems. Gaurav received his master's in computer science from the University of California, San Diego, and his bachelor's from the Indian Institute of Technology, Madras, India.
0:15 Today we'll talk about why our favorite benchmarks, the TPC benchmarks, are just not enough for our customers.
0:25 I will talk about the insights from the paper we published at VLDB 2024, "Why TPC Is Not Enough," and its accompanying dataset, called Redset.
0:37 Redset contains query patterns, workload patterns, and data patterns from Amazon Redshift customers. To put all this in context, I will go back a little and talk about the evolution of business analytics use cases over time. I will argue that a new breed of use cases is emerging that challenges the databases of today, and that these use cases are not adequately
1:08 represented by TPC benchmarks. I'll present a case study derived from Redset and show that the query patterns and data patterns we have observed in Redset do provide this insight. So let's go back a little. In the 1980s, the only way for a business analyst to come up with business insights was to walk up to a DBA and explain
1:36 the business problem; the DBA would write a SQL query, submit it to the database, and publish reports on paper as charts and tables.
1:48 In the 1990s, this began to change.
1:52 Business analysts got access to personal computers in the office, and the first business intelligence software started to show up, like Cognos and BusinessObjects. For the first time, business analysts could drag and drop tables and columns on their PCs, and the SQL would
2:13 get generated. They would run this SQL against the database, the charts and tables would show up on the screen, and they would iterate.
2:24 DBAs in this era became like the data engineers of today: they would curate and create datasets for business analysts. To capture this workflow, TPC created the first benchmark for decision support systems, TPC-D, back in 1994. It was a very simple star schema model with 17 ad hoc queries on the
2:52 tables and two refresh functions to update the tables. The 1990s were an era of big data.
3:01 Data volumes were growing fast, so the first refresh of this benchmark was TPC-H, released in 1999.
3:13 Here the dataset size increased from 3 terabytes in TPC-D to almost 200 terabytes for TPC-H. The number of ad hoc queries increased to 22, and they became a lot more complex.
3:28 In the 2000s, the complexity of
3:33 data modeling kept increasing as business intelligence spread to newer departments and newer industries. In the next refresh, TPC-DS, in 2012,
3:48 TPC created a much more complex data model based on a snowflake schema, with multiple sources of data the queries can draw from. The number of ad hoc queries increased to 99.
4:03 The queries became even more complex, but overall the flow of information remained the same.
4:12 That held until the late 2010s, when business intelligence software started to be delivered online. At this point, BI software started reaching what we call line-of-business users, like support, sales,
4:33 and operations. Some businesses even put analytics over user data into their customer-facing websites. For example, I wanted to cancel my Amex card. I called support, and the support person told me I had spent so many thousands of dollars over the last few months and racked up so many hundreds of thousands of points.
5:00 Why don't I keep it longer? This person has access to my data as soon as I call.
5:07 Similarly, when I go to my
5:11 T-Mobile web app, I see a historical analysis of the bills I have paid and a monthly analysis of how much internet I have used, and this gets refreshed every few minutes.
5:26 So the third part of the new workloads that are coming up is continuous ingestion and continuous transformations.
5:36 Now let's flip the lens and look at this from the database's point of view. What does a database see when these workloads arrive?
5:49 Here I show you the query distribution
5:53 by runtime for the Redshift fleet. On the x-axis
5:57 are buckets by runtime: 0 to 10 milliseconds, 10 to 100 milliseconds, and so on. On the y-axis is the percentage of queries which fall into each of these buckets.
6:10 The green line shows the Redshift fleet data, the blue line shows TPC-H, and the yellow line shows TPC-DS. As you can see, most of the queries fall
6:26 below one minute; in fact, 99% of the queries Redshift sees take less than 60 seconds. But there is a long tail of queries which stretches over hours and hours.
6:42 Now, TPC-H and TPC-DS at the 3 TB scale we used capture only a part of this distribution.
6:52 Arguably, if we used 100 gigabytes or 100 terabytes, the covered part could shift left or right, but it would never cover the whole distribution.
7:05 Now, to dive deeper into this, let's split this workload into two: the long tail of long queries, and the dominant short queries. The third part, which is not shown here, is the continuous ingestion and transformation. Let's see what kind of patterns we see on the Redshift fleet for these. First, let's look at how much of the database capacity
7:30 or resources these queries are using. This is the same chart: on the x-axis are the buckets by runtime, and on the y-axis is the percentage of the sum of execution time,
7:43 meaning that we took the queries in each bucket, summed up their execution time, and divided it by the total execution time of all queries.
7:56 Look at the 10-second to 1-minute bucket and below. As I said earlier, 99% of the queries fall in those buckets, but they do not use most of the database capacity; most of the capacity, the hardware, is used by the long queries, the long tail.
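The two charts described above boil down to a simple bucketing computation. A minimal sketch in Python, run on a toy workload (the bucket labels and numbers here are illustrative, not Redset data):

```python
import math
from collections import defaultdict

def bucket_label(runtime_s):
    """Decade bucket for a runtime in seconds, e.g. '0.01-0.1s'."""
    lo = 10 ** math.floor(math.log10(runtime_s))
    return f"{lo:g}-{lo * 10:g}s"

def runtime_distribution(runtimes_s):
    """Per-bucket share (%) of query count and of summed execution time."""
    counts = defaultdict(int)
    exec_time = defaultdict(float)
    for r in runtimes_s:
        b = bucket_label(r)
        counts[b] += 1
        exec_time[b] += r
    n, total = len(runtimes_s), sum(runtimes_s)
    return {b: (100 * counts[b] / n, 100 * exec_time[b] / total)
            for b in counts}

# Toy workload: 99 short queries and one hour-long query.
dist = runtime_distribution([0.05] * 99 + [3600.0])
# 99% of queries land in the 0.01-0.1s bucket, yet the single long
# query accounts for well over 99% of the summed execution time.
```

Even this toy example reproduces the talk's point: query-count share and execution-time share tell opposite stories about the same workload.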
8:19 Interestingly enough, TPC-H and TPC-DS are able to capture the long-tail part here, but not so much the short queries.
8:30 Now let's look at some of the plausible reasons why these long queries use so much of the database capacity.
8:39 The first aspect we look at is query complexity. On average, Redshift queries are not very complex. On the x-axis here are SQL operators, and on the y-axis is the number of operators per query.
8:58 The average Redshift query has about two table scans, the scanned tables being combined by a join, a union, a subquery, and so on.
9:08 Only half the queries have joins and aggregations. But if you look at Redshift queries which run longer than 60 seconds, shown here by the dark green boxes,
9:22 this number jumps up by 3x: these queries scan six tables on average, and the number of joins increases by 3x.
9:38 And if you look at TPC-H, shown here by the blue boxes, and TPC-DS, shown by the yellow boxes, we see they match the Redshift queries longer than 60 seconds much more closely than the average case.
9:57 The next dimension we look at is data complexity. These long queries are, first, complex; second, we see that
10:09 they may be scanning a larger number of rows. This is shown by how
10:15 the tables are distributed over buckets of row counts. On the x-axis we bucket the rows: 1 to 10 rows, 10 to 100 rows, and so on.
10:24 On the y-axis is the percentage of tables which fall into each bucket. Again, we see the long tail effect here.
10:34 About 70% of Redshift tables have less
10:38 than 1 million rows. But there is a long tail of tables which have more than 1 billion rows.
10:48 Now let's compare this to the TPC-H and TPC-DS data. They cannot cover the full range of this distribution because they simply do not have enough tables. What we do see is coverage on the smaller side from the dimension tables.
11:03 But the most interesting part is that TPC-H and TPC-DS are able to cover the long tail of row counts present in Redshift tables.
11:17 Let's look at one more aspect of data complexity, which is column skew. In general, the skew in Redshift tables is very high. We show it here; on the y-axis is a measure of skew.
11:35 It is basically measuring how different the column value distribution is compared to a uniform distribution: the higher the number, the higher the divergence.
11:49 The Redshift fleet data is shown here by the blue line, and 50% of the columns already have 100x the skew of a uniform distribution, while for TPC-DS the 100x mark is
12:04 reached only at the 99th percentile. TPC-H does not have much skew because it mostly has uniform distributions.
12:14 And again we see the long tail here: the 99th percentile for Redshift reaches very high numbers, all the way up to 10^14 or 10^15.
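The talk doesn't spell out the exact skew metric, but a divergence-from-uniform measure of this kind can be sketched as follows. This is a simplified illustration using KL divergence; the paper's actual metric may differ:

```python
import math
from collections import Counter

def skew_vs_uniform(values):
    """KL divergence (in nats) of the observed column-value distribution
    from a uniform distribution over the same distinct values.
    0 means perfectly uniform; larger means more skewed."""
    counts = Counter(values)
    n = len(values)
    k = len(counts)        # number of distinct values in the column
    uniform_p = 1.0 / k
    return sum((c / n) * math.log((c / n) / uniform_p)
               for c in counts.values())

uniform_col = [1, 2, 3, 4] * 25        # every value equally frequent
skewed_col = [1] * 97 + [2, 3, 4]      # one value dominates
# skew_vs_uniform(uniform_col) is 0; skew_vs_uniform(skewed_col) is large
```

The interesting property for benchmarking is exactly what the chart shows: synthetic generators tend to produce near-uniform columns (score near 0), while real customer columns often score orders of magnitude higher.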
12:23 So what did we learn? We learned that there is a long tail of queries.
12:28 They use most of the database resources, and that's probably because these are complex queries running on large tables and running into data skews,
12:43 which makes them hard to scale. So how should a database work with this kind of mixed workload? It is obvious that
12:55 the warehouse needed to run these long queries should be a big warehouse. But if we run the short,
13:05 dominant workload also on such a warehouse, the cost would become prohibitive. Therefore, today's database systems allow workloads to be split.
13:16 They allow the customer to create a big warehouse for their long queries and numerous smaller warehouses to address the short, dominant queries.
13:29 But this needs orchestration on the part of the customers, which isn't the best experience. There are ML-driven techniques today which can split the workload and create warehouses commensurate in size with these workloads.
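One plausible shape for such splitting is to route each query by its predicted runtime. This is a hypothetical sketch, not Redshift's actual routing logic; the warehouse names and the 60-second threshold are assumptions for illustration:

```python
def route_query(predicted_runtime_s, short_pool, big_warehouse,
                threshold_s=60.0):
    """Send predicted-short queries to a pool of small warehouses
    (round-robin) and predicted-long queries to the big warehouse."""
    if predicted_runtime_s < threshold_s:
        # rotate through the small warehouses for the dominant short queries
        wh = short_pool[0]
        short_pool.append(short_pool.pop(0))
        return wh
    return big_warehouse

pool = ["small-1", "small-2"]
route_query(0.2, pool, "big-1")     # a short query goes to a small warehouse
route_query(7200.0, pool, "big-1")  # a predicted-long query goes to "big-1"
```

In a real system the runtime prediction would come from an ML model trained on query plans, and misprediction handling (e.g. moving a query that overruns its prediction) is the hard part this sketch omits.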
13:47 Now let's switch our focus to the other part of the workload, the short queries. What attributes do we see for these short query workloads?
13:59 One of the most important is query repetition. And that makes sense, because this short query workload is coming from line-of-business users and customers. It is not coming from analysts who would arrange tables in different ways or pick different columns; these are canned queries. So the query repetition we see is very high. On the y-axis we plot the
14:21 repetition percentage, and on the x-axis the percentile of clusters. The first chart shows the query repetition.
14:31 The 50th percentile of clusters has about 50% of queries repeating on a per-day basis, represented here by the red and green lines. On a weekly and monthly basis this number goes up to 75%, represented here by the orange and blue lines. Now if you take a step back, remove, say, the
14:52 filter values from the queries, and consider only query templates, this repetition percentage goes all the way up to 80%.
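Templating of this kind can be sketched by stripping literals from the query text before counting repeats. This is a minimal regex-based illustration; real systems work on parse trees and handle many more literal forms:

```python
import re

def query_template(sql):
    """Replace literals with placeholders so queries differing only in
    filter values map to the same template."""
    t = re.sub(r"'[^']*'", "?", sql)         # string literals
    t = re.sub(r"\b\d+(\.\d+)?\b", "?", t)   # numeric literals
    return re.sub(r"\s+", " ", t).strip().lower()

def repetition_pct(queries, key=query_template):
    """Share (%) of queries whose templated text has been seen before."""
    seen, repeats = set(), 0
    for q in queries:
        k = key(q)
        repeats += k in seen
        seen.add(k)
    return 100 * repeats / len(queries)

q1 = "SELECT * FROM bills WHERE user_id = 42 AND month = '2024-01'"
q2 = "SELECT * FROM bills WHERE user_id = 7 AND month = '2024-02'"
# q1 and q2 differ only in filter values, so they share one template,
# which is exactly why template-level repetition is higher than exact-match.
```

Passing `key=str` instead of `key=query_template` gives the exact-match repetition rate, so the same function reproduces both curves from the chart.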
15:02 Another attribute of these workloads is burstiness, or spikiness. This chart shows a measure of spikiness on the y-axis and the percentile of clusters on the x-axis. This measure basically compares the number of queries run in the current period to the number in the last period; the higher the number, the higher the difference. For the 50th percentile of
15:30 clusters it's only about 10% different on a daily basis, which is the first chart, but if we consider an hourly basis this number jumps up to 40%.
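The exact spikiness formula isn't given in the talk; one plausible formulation compares consecutive periods like this (an illustrative sketch with toy counts):

```python
def spikiness(counts_per_period):
    """Median absolute relative change (%) in query count between
    consecutive periods; higher means a burstier workload."""
    changes = []
    for prev, cur in zip(counts_per_period, counts_per_period[1:]):
        denom = max(prev, 1)  # avoid division by zero for idle periods
        changes.append(abs(cur - prev) / denom * 100)
    changes.sort()
    return changes[len(changes) // 2]

steady = [100, 101, 99, 100, 102]   # e.g. daily query counts
bursty = [10, 200, 5, 180, 15]      # e.g. hourly query counts
# spikiness(steady) is a few percent; spikiness(bursty) is in the hundreds
```

The talk's observation maps directly onto the choice of period: the same cluster looks steady at daily granularity and bursty at hourly granularity.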
15:41 So these workloads are bursty. What kind of features should a database have to address these workloads? First, to address the repetitiveness, the idea is to cache, cache, cache, and not compute. These
15:59 database systems have features like a result cache, which can cache the result when a query comes in for the first time and return the same data if it comes again. These systems also have materialized views: if a query comes in and a materialized view can be used to answer it, the query gets rewritten automatically to use the
16:23 materialized view. This reduces the compute cost and improves the latency of such queries.
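A result cache can be sketched in a few lines. This is a toy model; real caches invalidate per table, bound their memory, and account for session state:

```python
class ResultCache:
    """Toy result cache: serve a repeated query from memory instead of
    recomputing, and drop entries when underlying data changes."""
    def __init__(self, execute):
        self.execute = execute   # function: sql -> rows
        self.cache = {}          # sql text -> cached rows
        self.computed = 0        # how many queries actually ran

    def run(self, sql):
        if sql not in self.cache:
            self.cache[sql] = self.execute(sql)
            self.computed += 1
        return self.cache[sql]

    def invalidate(self):
        """Called when base tables are written to."""
        self.cache.clear()

# Stand-in for real query execution (hypothetical query and result).
db = ResultCache(lambda sql: [("total", 123)])
db.run("SELECT sum(points) FROM cards")
db.run("SELECT sum(points) FROM cards")  # second call served from cache
```

Given the repetition rates above (50-80% of queries repeat), even this simple mechanism eliminates a large fraction of compute for the short, canned workload.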
16:30 Now, to address the bursty nature of this workload, the idea is to autoscale: create serverless architectures around these databases so that when a burst of queries comes in, serverless instances can be spun up within a few seconds to absorb the spike using one or more warehouses, and then scale back down when
16:57 there's no workload left. Now let's look at the third part of the workload, the continuous ingestion and transformations. From the database system's point of view, these look like write queries, and
17:10 for Redshift we see the workload split almost evenly between reads and writes: almost 50% read queries and 40% write queries, with 10% system queries. When we compare this to TPC-H and TPC-DS, TPC-H has almost 80% read queries and 20% write queries. What this implies for a database is that write
17:37 scalability is as important as read scalability, meaning they have to create data warehouses which run in parallel, split the workload, and are able to write to the same dataset.
17:54 With this, let's try to summarize what our customers want. They want database systems which can satisfy the traditional business analyst use cases, crunching large amounts of data with complex queries as fast as possible.
18:10 Second, these database systems should be able to handle the canned queries coming from line-of-business users and customers; these are short queries, repetitive and bursty in nature. And finally, these database systems should be able to handle a large amount of writes.
18:31 Therefore we need a database system which can handle long, heavy-tailed queries, short, repetitive, and bursty workloads, and as many writes as reads.
18:40 And we need new benchmarks which can measure these databases and provide customers with feedback on which database to choose. Thank you.


