Friday, 26 March 2010

Which SQL is causing my Cluster Waits?

I have a 2 node RAC cluster under test, and the throughput isn't as much as hoped. I needed to work out what SQL was causing all my cluster waits...

First step - identify the snapshots that cover the time period of the testing by looking in dba_hist_snapshot (I've not put in the SQL for this as it's fairly obvious)


My two snapshots are 541 and 542.

Next do something with a dual purpose, check the number of waits associated with each class (why chase things that won't have a big impact) and get the wait_class_id..

SQL >col wait_class for a20
SQL> select wait_class_id, wait_class, count(*) cnt
from dba_hist_active_sess_history
where snap_id between 541 and 542
group by wait_class_id, wait_class
order by 3;

WAIT_CLASS_ID WAIT_CLASS                  CNT
------------- -------------------- ----------
   2000153315 Network                       1
   3290255840 Configuration                 1
   1740759767 User I/O                      2
   1893977003 Other                        22
   3875070507 Concurrency                  27
   4108307767 System I/O                   47
   4217450380 Application                  48
   3386400367 Commit                      265
                                         1045
   3871361733 Cluster                    1273

10 rows selected.


Well, the majority of my waits are cluster related (though next on my hit list will be the commit related events). Let's find out which cluster waits are giving me problems

SQL> col event for a30
SQL> select event_id, event, count(*) cnt from dba_hist_active_sess_history
where snap_id between 541 and 542 and wait_class_id=3871361733
group by event_id, event
order by 3;


  EVENT_ID EVENT                                 CNT
---------- ------------------------------ ----------
1742950045 gc current retry                        1
 661121159 gc cr multi block request               1
3905407295 gc current request                      1
2705335821 gc cr block congested                   3
3897775868 gc current multi block request          3
 105117041 gc buffer busy release                  3
1520064534 gc cr block busy                        5
3785617759 gc current block congested              7
2685450749 gc current grant 2-way                 12
1912606394 gc buffer busy acquire                 62
2277737081 gc current grant busy                  73
2701629120 gc current block busy                 112
 737661873 gc cr block 2-way                     360
 111015833 gc current block 2-way                630

14 rows selected.



So most of my waits are 'gc cr block 2-way' and 'gc current block 2-way'. Take their event_id and see which SQL was doing most of the waiting for those events.


SQL> select sql_id, count(*) cnt from dba_hist_active_sess_history
where snap_id between 541 and 542
and event_id in ( 737661873,111015833 )
group by sql_id
having count(*)>200
 order by 2


SQL_ID CNT
------------- ----------
3dh7u84crfyw6 526


So now we have the SQL ID, we can get the text associated with it, and hopefully come up with a fix on the SQL.

SQL> select sql_text from dba_hist_sqltext where sql_id='3dh7u84crfyw6';

SQL_TEXT
--------------------------------------------------------------------------------
INSERT INTO TRANSACTION_LOG( TRL_ID, TRL_BUSINESS_DATE, TRL_SYSTEM_TIMESTAMP, TR

So the majority of our cluster waits is caused by the insert into the transaction log...


The two waits are related to transferring and requesting blocks in memory from the other instance. The Current block is where we're competing for blocks to update. If a block has fewer rows, then the chance that both instances will need the same block is reduced.

This leaves me with 2 simple fixes that doesn't need the application fiddling with:-
1. Reduce row density (tweak PCTFREE)
2. Reduce my blocksize from 8k to 2k.

I chose the 2k blocksize option by creating a new tablespace with the smaller blocksize and a new buffer pool to support it. It increased throughput by 10%

No comments:

Post a Comment