海量数据sql优化实践一例
使用intersetct
select
user_id
from aaa_log
where statis_date>=20091001
and statis_date<=20091031
intersect
select user_id from aaa_log
intersect
select user_id from aaa_log
where statis_date>=20091001
and statis_date<=20091031
执行计划:
SELECT STATEMENT, GOAL = ALL_ROWS 147881 352550000 11986700000
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10002
INTERSECTION
SORT UNIQUE 73940 352550000 5993350000
PX RECEIVE 3685 352550000 5993350000
PX SEND HASH SYS :TQ10000 3685 352550000 5993350000
PX BLOCK ITERATOR 3685 352550000 5993350000 7 1 32
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 7 6849 7840
SORT UNIQUE 73940 352550000 5993350000
PX RECEIVE 3685 352550000 5993350000
PX SEND HASH SYS :TQ10001 3685 352550000 5993350000
PX BLOCK ITERATOR 3685 352550000 5993350000 12 1 32
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 12 6849 7840
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10002
INTERSECTION
SORT UNIQUE 73940 352550000 5993350000
PX RECEIVE 3685 352550000 5993350000
PX SEND HASH SYS :TQ10000 3685 352550000 5993350000
PX BLOCK ITERATOR 3685 352550000 5993350000 7 1 32
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 7 6849 7840
SORT UNIQUE 73940 352550000 5993350000
PX RECEIVE 3685 352550000 5993350000
PX SEND HASH SYS :TQ10001 3685 352550000 5993350000
PX BLOCK ITERATOR 3685 352550000 5993350000 12 1 32
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 12 6849 7840
使用内连接
select
distinct
user_id
from aaa_log t1, aa_log t2
where 1=1
and t1. user_id=t2. user_id
and t1.statis_date>=20091001
and t1.statis_date<=20091031
and t2.statis_date>=20090901
and t2.statis_date<=20090930
from aaa_log t1, aa_log t2
where 1=1
and t1. user_id=t2. user_id
and t1.statis_date>=20091001
and t1.statis_date<=20091031
and t2.statis_date>=20090901
and t2.statis_date<=20090930
执行计划:
SELECT STATEMENT, GOAL = ALL_ROWS 600780126 3880 131920
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10001 600780126 3880 131920
HASH UNIQUE 600780126 3880 131920
PX RECEIVE 31480377 38997282262394 1.3259075969214E15
PX SEND HASH SYS :TQ10000 31480377 38997282262394 1.3259075969214E15
PX PARTITION HASH ALL 31480377 38997282262394 1.3259075969214E15 6 1 32
HASH JOIN 31480377 38997282262394 1.3259075969214E15
PX PARTITION RANGE ITERATOR 3685 352550000 5993350000 8 215 245
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 8 6849 7840
PX PARTITION RANGE ITERATOR 3672 352550000 5993350000 10 185 214
TABLE ACCESS FULL DW AAA_LOG3672 352550000 5993350000 10 5889 6848
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10001 600780126 3880 131920
HASH UNIQUE 600780126 3880 131920
PX RECEIVE 31480377 38997282262394 1.3259075969214E15
PX SEND HASH SYS :TQ10000 31480377 38997282262394 1.3259075969214E15
PX PARTITION HASH ALL 31480377 38997282262394 1.3259075969214E15 6 1 32
HASH JOIN 31480377 38997282262394 1.3259075969214E15
PX PARTITION RANGE ITERATOR 3685 352550000 5993350000 8 215 245
TABLE ACCESS FULL DW AAA_LOG 3685 352550000 5993350000 8 6849 7840
PX PARTITION RANGE ITERATOR 3672 352550000 5993350000 10 185 214
TABLE ACCESS FULL DW AAA_LOG3672 352550000 5993350000 10 5889 6848
观察这个执行计划发现,使用内连接的消耗远远大于使用intersect。究其原因主要是log表中user_id有重复的,这样导致HASH JOIN的时候有笛卡尔积。
在没有重复记录的时候一般是内连接比intersect快。
使用exits
select
distinct
user_id
from aaa_log t1
where 1=1
and t1.statis_date>=20091001
and t1.statis_date<=20091031
and exists( select 1 from aaa_log t2
where t2. user_id=t1. user_id
and t2.statis_date>=20090901
and t2.statis_date<=20090930
)
from aaa_log t1
where 1=1
and t1.statis_date>=20091001
and t1.statis_date<=20091031
and exists( select 1 from aaa_log t2
where t2. user_id=t1. user_id
and t2.statis_date>=20090901
and t2.statis_date<=20090930
)
执行计划
SELECT STATEMENT, GOAL = ALL_ROWS 11003 3880 131920
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10001 11003 3880 131920
HASH UNIQUE 11003 3880 131920
PX RECEIVE 8003 352550000 11986700000
PX SEND HASH SYS :TQ10000 8003 352550000 11986700000
PX PARTITION HASH ALL 8003 352550000 11986700000 6 1 32
HASH JOIN SEMI 8003 352550000 11986700000
PX PARTITION RANGE ITERATOR 3691 352550000 5993350000 8 215 245
TABLE ACCESS FULL DW AAA_LOG 3691 352550000 5993350000 8 6849 7840
PX PARTITION RANGE ITERATOR 3672 352550000 5993350000 10 185 214
TABLE ACCESS FULL DW AAA_LOG 3672 352550000 5993350000 10 5889 6848
PX COORDINATOR
PX SEND QC (RANDOM) SYS :TQ10001 11003 3880 131920
HASH UNIQUE 11003 3880 131920
PX RECEIVE 8003 352550000 11986700000
PX SEND HASH SYS :TQ10000 8003 352550000 11986700000
PX PARTITION HASH ALL 8003 352550000 11986700000 6 1 32
HASH JOIN SEMI 8003 352550000 11986700000
PX PARTITION RANGE ITERATOR 3691 352550000 5993350000 8 215 245
TABLE ACCESS FULL DW AAA_LOG 3691 352550000 5993350000 8 6849 7840
PX PARTITION RANGE ITERATOR 3672 352550000 5993350000 10 185 214
TABLE ACCESS FULL DW AAA_LOG 3672 352550000 5993350000 10 5889 6848
通过比较返现这种情况下使用exits最快,使用interset次之,但是差距不大,使用内连接特别差,对于海量数据可能几天也算不出来。
转载于:https://blog.51cto.com/xqy1522/240308