Partition Wise Joins

http://blogs.oracle.com/datawarehousing/entry/partition_wise_joins

Some say they are the holy grail of parallel computing and PWJ is the basis for a shared nothing system and the only join method that is available on a shared nothing system (yes this is oversimplified!). The magic in Oracle is of course that is one of many ways to join data. And yes, this is the old flexibility vs. simplicity discussion all over, so I won't go there... the point is that what you must do in a shared nothing system, you can do in Oracle with the same speed and methods.

The Theory

A partition wise join is a join between (for simplicity) two tables that are partitioned on the same column with the same partitioning scheme. In shared nothing this is effectively hard partitioning locating data on a specific node / storage combo. In Oracle is is logical partitioning.

If you now join the two tables on that partitioned column you can break up the join in smaller joins exactly along the partitions in the data. Since they are partitioned (grouped) into the same buckets, all values required to do the join live in the equivalent bucket on either sides. No need to talk to anyone else, no need to redistribute data to anyone else... in short, the optimal join method for parallel processing of two large data sets.

PWJ's in Oracle

Since we do not hard partition the data across nodes in Oracle we use the Partitioning option to the database to create the buckets, then set the Degree of Parallelism (or run Auto DOP - see here) and get our PWJs. The main questions always asked are:

How many partitions should I create?
What should my DOP be?

In a shared nothing system the answer is of course, as many partitions as there are nodes which will be your DOP. In Oracle we do want you to look at the workload and concurrency, and once you know that to understand the following rules of thumb.

Within Oracle we have more ways of joining of data, so it is important to understand some of the PWJ ideas and what it means if you have an uneven distribution across processes.

Assume we have a simple scenario where we partition the data on a hash key resulting in 4 hash partitions (H1 -H4). We have 2 parallel processes that have been tasked with reading these partitions (P1 - P2). The work is evenly divided assuming the partitions are the same size and we can scan this in time t1 as shown below.

Now assume that we have changed the system and have a 5th partition but still have our 2 workers P1 and P2. The time it takes is actually 50% more assuming the 5th partition has the same size as the original H1 - H4 partitions.

In other words to scan these 5 partitions, the time t2 it takes is not 1/5th more expensive, it is a lot more expensive and some other join plans may now start to look exciting to the optimizer. Just to post the disclaimer, it is not as simple as I state it here, but you get the idea on how much more expensive this plan may now look...

Based on this little example there are a few rules of thumb to follow to get the partition wise joins.

First, choose a DOP that is a factor of two (2). So always choose something like 2, 4, 8, 16, 32 and so on...

Second, choose a number of partitions that is larger or equal to 2* DOP.

Third, make sure the number of partitions is divisible through 2 without orphans. This is also known as an even number...

Fourth, choose a stable partition count strategy, which is typically hash, which can be a sub partitioning strategy rather than the main strategy (range - hash is a popular one).

Fifth, make sure you do this on the join key between the two large tables you want to join (and this should be the obvious one...).

Translating this into an example:

DOP = 8 (determined based on concurrency or by using Auto DOP with a cap due to concurrency) says that the number of partitions >= 16.

Number of hash (sub) partitions = 32, which gives each process four partitions to work on. This number is somewhat arbitrary and depends on your data and system. In this case my main reasoning is that if you get more room on the box you can easily move the DOP for the query to 16 without repartitioning... and of course it makes for no leftovers on the table...

And yes, we recommend up-to-date statistics. And before you start complaining, do read this post on a cool way to do stats in 11.

Partition Wise Joins II

http://blogs.oracle.com/datawarehousing/entry/partition_wise_joins_ii

One of the things that I did not talk about in the initial partition wise join post was the effect it has on resource allocation on the database server.

When Oracle applies a different join method - e.g. not PWJ - what you will see in SQL Monitor (in Enterprise Manager) or in an Explain Plan is a set of producers and a set of consumers. The producers scan the tables in the the join. If there are two tables the producers first scan one table, then the other. The producers thus provide data to the consumers, and when the consumers have the data from both scans they do the join and give the data to the query coordinator.

Now that behavior means that if you choose a degree of parallelism of 4 to run such query with, Oracle will allocate 8 parallel processes. Of these 8 processes 4 are producers and 4 are consumers. The consumers only actually do work once the producers are fully done with scanning both sides of the join.

In the plan above you can see that the producers access table SALES [line 11] and then do a PX SEND [line 9]. That is the producer set of processes working. The consumers receive that data [line 8] and twiddle their thumbs while the producers go on and scan CUSTOMERS. The producers send that data to the consumer indicated by PX SEND [line 5]. After receiving that data [line 4] the consumers do the actual join [line 3] and give the data to the QC [line 2].

BTW, the myth that you see twice the number of processes due to the setting PARALLEL_THREADS_PER_CPU=2 is obviously not true. The above is why you will see 2 times the processes of the DOP.

In a PWJ plan the consumers are not present. Instead of producing rows and giving those to different processes, a PWJ only uses a single set of processes. Each process reads its piece of the join across the two tables and performs the join.

The plan here is notably different from the initial plan. First of all the hash join is done right on top of both table scans [line 8]. This query is a little more complex than the previous so there is a bit of noise above that bit of info, but for this post, lets ignore that (sort stuff).

The important piece here is that the PWJ plan typically will be faster and from a PX process number / resources typically cheaper. You may want to look out for those plans and try to get those to appear a lot...

CREDITS: credits for the plans and some of the info on the plans go to Maria, as she actually produced these plans and is the expert on plans in general... You can see her talk about explaining the explain plan and other optimizer stuff over here:

ODTUG in Washington DC, June 27 - July 1
On the Optimizer blog
At OpenWorld in San Francisco, September 19 - 23

Happy joining and hope to see you all at ODTUG and OOW...

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。