Last week we had a weird issue, all of sudden a batch job, which runs monthy, took lots of time to finish, in fact it didn’t finish at all since we decided to stop it.
First investgation showed that the explain plan of the query changed, this is normal and happens from time to time, this is what the CBO does, for good and sometimes as in this case for bad.
the explain plan used to execute the query used parallelism, which in our case wasn’t beneficial at all.
this is how the plan looked like :
————————————————————————————————————————————————————————————–
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | TQ |IN-OUT| PQ Distrib | A-Rows | A-Time | Buffers | Reads |
————————————————————————————————————————————————————————————–
|* 1 | VIEW | | 1 | 1 | 26 | 4106 (1)| 00:00:50 | | | | 1 |00:03:08.04 | 823K| 799K|
PLAN_TABLE_OUTPUT
——————————————————————————————————————————————————————————————————–
|* 2 | COUNT STOPKEY | | 1 | | | | | | | | 1 |00:03:08.04 | 823K| 799K|
|* 3 | TABLE ACCESS BY INDEX ROWID| R_J | 1 | 1 | 21 | 4 (0)| 00:00:01 | | | | 1 |00:03:08.04 | 823K| 799K|
|* 4 | INDEX RANGE SCAN | R_J_EPOCH | 1 | 1 | | 3 (0)| 00:00:01 | | | | 1 |00:03:08.04 | 823K| 799K|
| 5 | SORT AGGREGATE | | 1 | 1 | 15 | | | | | | 1 |00:03:08.04 | 823K| 799K|
| 6 | PX COORDINATOR | | 1 | | | | | | | | 1 |00:03:08.04 | 823K| 799K|
| 7 | PX SEND QC (RANDOM) | :TQ10000 | 1 | 1 | 15 | | | Q1,00 | P->S | QC (RAND) | 1 |00:03:08.03 | 823K| 799K|
| 8 | SORT AGGREGATE | | 1 | 1 | 15 | | | Q1,00 | PCWP | | 1 |00:03:08.03 | 823K| 799K|
| 9 | PX BLOCK ITERATOR | | 1 | 5227 | 78405 | 4102 (1)| 00:00:50 | Q1,00 | PCWC | | 5 |00:02:09.54 | 823K| 799K|
|* 10 | TABLE ACCESS FULL | R_J | 1 | 5227 | 78405 | 4102 (1)| 00:00:50 | Q1,00 | PCWP | | 5 |00:02:09.54 | 823K| 799K|
sorry for the bad formatting …
the normal plan looked like this :
PLAN_TABLE_OUTPUT
——————————————————————————————————————————————————————————————————–
————————————————————————————————————————————————————
| Id | Operation | Name | Starts | E-Rows |E-Bytes| Cost (%CPU)| E-Time | A-Rows | A-Time | Buffers | Reads |
————————————————————————————————————————————————————
|* 1 | VIEW | | 1 | 1 | 26 | 4337 (1)| 00:00:53 | 1 |00:00:00.01 | 12 | 2 |
|* 2 | COUNT STOPKEY | | 1 | | | | | 1 |00:00:00.01 | 12 | 2 |
|* 3 | TABLE ACCESS BY INDEX ROWID | R_J | 1 | 1 | 21 | 4 (0)| 00:00:01 | 1 |00:00:00.01 | 12 | 2 |
|* 4 | INDEX RANGE SCAN | R_J_EPOCH | 1 | 1 | | 3 (0)| 00:00:01 | 1 |00:00:00.01 | 11 | 2 |
| 5 | SORT AGGREGATE | | 1 | 1 | 15 | | | 1 |00:00:00.01 | 8 | 2 |
|* 6 | TABLE ACCESS BY INDEX ROWID| RJ | 1 | 5227 | 78405 | 4333 (1)| 00:00:52 | 5 |00:00:00.01 | 8 | 2 |
|* 7 | INDEX RANGE SCAN | R_J_ID | 1 | 5227 | | 25 (0)| 00:00:01 | 6 |00:00:00.01 | 3 | 0 |
PLAN_TABLE_OUTPUT
The problem was that manualy reexcuting the query, didn’t show the messed up plan, it took the right plan !!!
sometimes good plan sometimes bad plan ….
so i couldn’t set 10053 event because it wasn’t reproducible…
During two days I tried to reproduce the issue but i couldn’t.
During two days I tried to reproduce the issue but i couldn’t.
i tried it again with some random values when the load was higher
and now could in a consistent way reproduce it.
By comparing the CBO traces is saw that the cpu_count was different .
A couple of weeks ago the sysadmin implemented solaris cpu pools to dynamically assign CPU resources, sometime more sometime less ….
this caused the optimizer to take the wrong plan when more than 11 cpu’s were assigned….
Case closed 😉