循序漸進(jìn)講解Oracle數(shù)據(jù)庫(kù)的Hash join
Oracle Hash join 是一種非常高效的join 算法,主要以CPU(hash計(jì)算)和內(nèi)存空間(創(chuàng)建hash table)為代價(jià)獲得最大的效率。Hash join一般用于大表和小表之間的連接,我們將小表構(gòu)建到內(nèi)存中,稱為Hash cluster,大表稱為probe表。
效率
Hash join具有較高效率的兩個(gè)原因:
1.Hash 查詢,根據(jù)映射關(guān)系來(lái)查詢值,不需要遍歷整個(gè)數(shù)據(jù)結(jié)構(gòu)。
2.Mem 訪問速度是Disk的萬(wàn)倍以上。
理想化的Hash join的效率是接近對(duì)大表的單表選擇掃描的。
首先我們來(lái)比較一下,幾種join之間的效率,首先 optimizer會(huì)自動(dòng)選擇使用hash join。
注意到Cost= 221
SQL> select * from vendition t,customer b WHERE t.customerid = b.customerid;
100000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3402771356
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 106K| 22M| 221 (3)| 00:00:03 |
|* 1 | HASH JOIN | | 106K| 22M| 221 (3)| 00:00:03 |
| 2 | TABLE ACCESS FULL| CUSTOMER | 5000 | 424K| 9 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL| VENDITION | 106K| 14M| 210 (2)| 00:00:03 |
--------------------------------------------------------------------------------
不使用hash,這時(shí)optimizer自動(dòng)選擇了merge join。。
注意到Cost=3507大大的增加了。
SQL> select /*+ USE_MERGE (t b) */* from vendition t,customer b WHERE t.customerid = b.customerid;
100000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 1076153206
-----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 106K| 22M| | 3507 (1)| 00:00:43 |
| 1 | MERGE JOIN | | 106K| 22M| | 3507 (1)| 00:00:43 |
| 2 | SORT JOIN | | 5000 | 424K| | 10 (10)| 00:00:01 |
| 3 | TABLE ACCESS FULL| CUSTOMER | 5000 | 424K| | 9 (0)| 00:00:01 |
|* 4 | SORT JOIN | | 106K| 14M| 31M| 3496 (1)| 00:00:42 |
| 5 | TABLE ACCESS FULL| VENDITION | 106K| 14M| | 210 (2)| 00:00:03 |
-----------------------------------------------------------------------------------------
那么Nest loop呢,經(jīng)過(guò)漫長(zhǎng)的等待后,發(fā)現(xiàn)Cost達(dá)到了驚人的828K,同時(shí)伴隨3814337 consistent gets(由于沒有建索引),可見在這個(gè)測(cè)試中,Nest loop是最低效的。在給customerid建立唯一索引后,減低到106K,但仍然是內(nèi)存join的上千倍。
SQL> select /*+ USE_NL(t b) */* from vendition t,customer b WHERE t.customerid = b.customerid;
100000 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2015764663
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 106K| 22M| 828K (2)| 02:45:41 |
| 1 | NESTED LOOPS | | 106K| 22M| 828K (2)| 02:45:41 |
| 2 | TABLE ACCESS FULL| VENDITION | 106K| 14M| 210 (2)| 00:00:03 |
|* 3 | TABLE ACCESS FULL| CUSTOMER | 1 | 87 | 8 (0)| 00:00:01 |
HASH的內(nèi)部
HASH_AREA_SIZE在Oracle 9i 和以前,都是影響hash join性能的一個(gè)重要的參數(shù)。但是在10g發(fā)生了一些變化。Oracle不建議使用這個(gè)參數(shù),除非你是在MTS模式下。Oracle建議采用自動(dòng)PGA管理(設(shè)置PGA_AGGREGATE_TARGET和WORKAREA_SIZE_POLICY)來(lái),替代使用這個(gè)參數(shù)。由于我的測(cè)試環(huán)境是mts環(huán)境,自動(dòng)內(nèi)存管理,所以我在這里只討論mts下的hash join。
Mts的PGA中,只包含了一些棧空間信息,UGA則包含在large pool中,那么實(shí)際類似hash,sort,merge等操作都是有l(wèi)arge pool來(lái)分配空間,large pool同時(shí)也是auto管理的,它和SGA_TARGET有關(guān)。所以在這種條件下,內(nèi)存的分配是很靈活。
Hash連接根據(jù)內(nèi)存分配的大小,可以有三種不同的效果:
1.optimal 內(nèi)存完全足夠
2.onepass 內(nèi)存不能裝載完小表
3.multipass workarea executions 內(nèi)存嚴(yán)重不足
下面,分別測(cè)試小表為50行,500行和5000行,內(nèi)存的分配情況(內(nèi)存都能完全轉(zhuǎn)載)。
Vendition表 10W條記錄
Customer表 5000
Customer_small 500,去Customer表前500行建立
Customer_pity 50,取Customer表前50行建立
表的統(tǒng)計(jì)信息如下:
SQL> SELECT s.table_name,S.BLOCKS,S.AVG_SPACE,S.NUM_ROWS,S.AVG_ROW_LEN,S.EMPTY_BLOCKS FROM user_tables S WHERE table_name IN ('CUSTOMER','VENDITION','CUSTOMER_SMALL','CUSTOMER_PITY') ;
TABLE_NAME BLOCKS AVG_SPACE NUM_ROWS AVG_ROW_LEN EMPTY_BLOCKS
CUSTOMER 35 1167 5000 38 5
CUSTOMER_PITY 4 6096 50 37 4
CUSTOMER_SMALL 6 1719 500 36 2
VENDITION 936 1021 100000 64 88打開10104事件追蹤:(hash 連接追蹤)
ALTER SYSTEM SET EVENTS ‘ 10104 TRACE NAME CONTEXT,LEVEL 2’;
測(cè)試SQL
SELECT * FROM vendition a,customer b WHERE a.customerid = b.customerid;
SELECT * FROM vendition a,customer_small b WHERE a.customerid = b.customerid;
SELECT * FROM vendition a,customer_pity b WHERE a.customerid = b.customerid;
小表50行時(shí)候的trace分析:
*** 2008-03-23 18:17:49.467
*** SESSION ID:(773.23969) 2008-03-23 18:17:49.467
kxhfInit(): enter
kxhfInit(): exit
*** RowSrcId: 1 HASH JOIN STATISTICS (INITIALIZATION) ***
Join Type: INNER join
Original hash-area size: 3883510
PS:hash area的大小,大約380k,本例中最大的表也不過(guò)250塊左右,所以內(nèi)存完全可以完全裝載
Memory for slot table: 2826240
Calculated overhead for partitions and row/slot managers: 1057270
Hash-join fanout: 8
Number of partitions: 8
PS:hash 表數(shù)據(jù)連一個(gè)塊都沒裝滿,Oracle仍然對(duì)數(shù)據(jù)進(jìn)行了分區(qū),這里和以前在一些文檔上看到的,當(dāng)內(nèi)存不足時(shí)才會(huì)對(duì)數(shù)據(jù)分區(qū)的說(shuō)法,發(fā)生了變化。
Number of slots: 23
Multiblock IO: 15
Block size(KB): 8
Cluster (slot) size(KB): 120
PS:分區(qū)中全部行占有的cluster的size
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 270
Estimated build size (KB): 0
Estimated Build Row Length (includes overhead): 45
# Immutable Flags:
Not BUFFER(execution) output of the join for PQ
Evaluate Left Input Row Vector
Evaluate Right Input Row Vector
# Mutable Flags:
IO sync
kxhfSetPhase: phase=BUILD
kxhfAddChunk: add chunk 0 (sz=32) to slot table
kxhfAddChunk: chunk 0 (lbs=0x2a97825c38, slotTab=0x2a97825e00) successfuly added
kxhfSetPhase: phase=PROBE_1
qerhjFetch: max build row length (mbl=44)
*** RowSrcId: 1 END OF HASH JOIN BUILD (PHASE 1) ***
Revised row length: 45
Revised build size: 2KB
kxhfResize(enter): resize to 12 slots (numAlloc=8, max=23)
kxhfResize(exit): resized to 12 slots (numAlloc=8, max=12)
Slot table resized: old=23 wanted=12 got=12 unload=0
*** RowSrcId: 1 HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 50
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 66960
### Partition Distribution ###
Partition:0 rows:5 clusters:1 slots:1 kept=1
Partition:1 rows:6 clusters:1 slots:1 kept=1
Partition:2 rows:4 clusters:1 slots:1 kept=1
Partition:3 rows:9 clusters:1 slots:1 kept=1
Partition:4 rows:5 clusters:1 slots:1 kept=1
Partition:5 rows:9 clusters:1 slots:1 kept=1
Partition:6 rows:4 clusters:1 slots:1 kept=1
Partition:7 rows:8 clusters:1 slots:1 kept=1
PS:每個(gè)分區(qū)只有不到10行,這里有一個(gè)重要的參數(shù)Kept,1在內(nèi)存中,0在磁盤
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
PS:hash join的第一階段,但是要觀察更多的階段,需提高trace的level,這里略過(guò)
Revised number of hash buckets (after flushing): 50
Allocating new hash table.
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Requested size of hash table: 16
Actual size of hash table: 16
Number of buckets: 128
Match bit vector allocated: FALSE
kxhfResize(enter): resize to 14 slots (numAlloc=8, max=12)
kxhfResize(exit): resized to 14 slots (numAlloc=8, max=14)
freeze work area size to: 2359K (14 slots)
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of rows (may have changed): 50
Number of in-memory partitions (may have changed): 8
Final number of hash buckets: 128
Size (in bytes) of hash table: 1024
kxhfIterate(end_iterate): numAlloc=8, maxSlots=14
*** (continued) HASH JOIN BUILD HASH TABLE (PHASE 1) ***
### Hash table ###
# NOTE: The calculated number of rows in non-empty buckets may be smaller
# than the true number.
Number of buckets with 0 rows: 86
Number of buckets with 1 rows: 37
Number of buckets with 2 rows: 5
Number of buckets with 3 rows: 0
PS:桶里面的行數(shù),最大的桶也只有2行,理論上,桶里面的行數(shù)越少,性能越佳。
Number of buckets with 4 rows: 0
Number of buckets with 5 rows: 0
Number of buckets with 6 rows: 0
Number of buckets with 7 rows: 0
Number of buckets with 8 rows: 0
Number of buckets with 9 rows: 0
Number of buckets with between 10 and 19 rows: 0
Number of buckets with between 20 and 29 rows: 0
Number of buckets with between 30 and 39 rows: 0
Number of buckets with between 40 and 49 rows: 0
Number of buckets with between 50 and 59 rows: 0
Number of buckets with between 60 and 69 rows: 0
Number of buckets with between 70 and 79 rows: 0
Nmber of buckets with between 80 and 89 rows: 0
Number of buckets with between 90 and 99 rows: 0
Number of buckets with 100 or more rows: 0
### Hash table overall statistics ###
Total buckets: 128 Empty buckets: 86 Non-empty buckets: 42
PS:創(chuàng)建了128個(gè)桶,Oracle 7開始的計(jì)算公式
Bucket數(shù)=0.8*hash_area_size/(hash_multiblock_io_count*db_block_size)
但是不準(zhǔn)確,估計(jì)10g發(fā)生了變化。
Total number of rows: 50
Maximum number of rows in a bucket: 2
Average number of rows in non-empty buckets: 1.190476
小表500行時(shí)候的trace分析
Original hash-area size: 3925453
Memory for slot table: 2826240
。。。
Hash-join fanout: 8
Number of partitions: 8
。。。
### Partition Distribution ###
Partition:0 rows:52 clusters:1 slots:1 kept=1
Partition:1 rows:63 clusters:1 slots:1 kept=1
Partition:2 rows:55 clusters:1 slots:1 kept=1
Partition:3 rows:74 clusters:1 slots:1 kept=1
Partition:4 rows:66 clusters:1 slots:1 kept=1
Partition:5 rows:66 clusters:1 slots:1 kept=1
Partition:6 rows:54 clusters:1 slots:1 kept=1
Partition:7 rows:70 clusters:1 slots:1 kept=1
PS:每個(gè)partition的行數(shù)增加
。。。
Number of buckets with 0 rows: 622
Number of buckets with 1 rows: 319
Number of buckets with 2 rows: 71
Number of buckets with 3 rows: 10
Number of buckets with 4 rows: 2
Number of buckets with 5 rows: 0
。。。
### Hash table overall statistics ###
Total buckets: 1024 Empty buckets: 622 Non-empty buckets: 402
Total number of rows: 500
Maximum number of rows in a bucket: 4
Average number of rows in non-empty buckets: 1.243781
小表5000行時(shí)候的trace分析
Original hash-area size: 3809692
Memory for slot table: 2826240
。。。
Hash-join fanout: 8
Number of partitions: 8
Nuber of slots: 23
Multiblock IO: 15
Block size(KB): 8
Cluster (slot) size(KB): 120
Minimum number of bytes per block: 8160
Bit vector memory allocation(KB): 128
Per partition bit vector length(KB): 16
Maximum possible row length: 270
Estimated build size (KB): 0
。。。
### Partition Distribution ###
Partition:0 rows:588 clusters:1 slots:1 kept=1
Partition:1 rows:638 clusters:1 slots:1 kept=1
Partition:2 rows:621 clusters:1 slots:1 kept=1
Partiton:3 rows:651 clusters:1 slots:1 kept=1
Partition:4 rows:645 clusters:1 slots:1 kept=1
Partition:5 rows:611 clusters:1 slots:1 kept=1
Partitio:6 rows:590 clusters:1 slots:1 kept=1
Partition:7 rows:656 clusters:1 slots:1 kept=1
。。。
# than the true number.
Number of buckets with 0 rows: 4429
Number of buckets with 1 rows: 2762
Number of buckets with 2 rows: 794
Number of buckets with 3 rows: 182
Number of buckets with 4 rows: 23
Number of buckets with 5 rows: 2
Number of buckets with 6 rows: 0
。。。
### Hash table overall statistics ###
Total buckets: 8192 Empty buckets: 4429 Non-empty buckets: 3763
Total number of rows: 5000
Maximum number of rows in a bucket: 5
PS:當(dāng)小表上升到5000行的時(shí)候,bucket的rows最大也不過(guò)5行。注意,如果bucket行數(shù)過(guò)多,遍歷帶來(lái)的開銷會(huì)帶來(lái)性能的嚴(yán)重下降。
Average number of rows in non-empty buckets: 1.328727
結(jié)論:
Oracle數(shù)據(jù)庫(kù)10g中,內(nèi)存問題并不是干擾Hash join的首要問題,現(xiàn)今硬件價(jià)格越來(lái)越便宜,內(nèi)存2G,8G,64G的環(huán)境也很常見。大家在針對(duì)hash join調(diào)優(yōu)的過(guò)程,更要偏重于partition和bucket的數(shù)據(jù)分配診斷。
