Planning Manager诊断过程和10046事件
Posted on May 26, 2010
在一次Planning Manager性能问题的诊断中,用到了Oracle ERP环境中非常典型的诊断操作和过程,特全程记录于此。
当发现某个程序存在性能问题时,首先当然是快速检查一下SESSION的相关信息,看看主要在执行或等待些什么:
SELECT stat.sid,
n.name,
n.class,
stat.value
FROM v$sesstat stat,
v$statname n
WHERE stat.statistic# = n.statistic#
AND stat.sid = 121
ORDER BY upper(n.name);
这里观察到bytes received via SQL*Net from client, bytes sent via SQL*Net to client 和 SQL*Net roundtrips to/from client 都在飙升。这对应了两个events:SQL*Net message to client和SQL*Net message from client,这两个事件数值大并不意味着网络状况不好,前者仅仅表示Oracle数据库将数据放入TCP send buffer的时间,而后者表示从客户端接收到反馈指令的时间,两者反映了SQL*Net和数据库之间的时间消耗,而不是网络传输的时间消耗。基于对系统架构的了解,排除SQL*Net的性能问题,那么两个事件预示了什么呢?
SELECT * FROM v$sess_io v WHERE v.sid = 61;
观察会话中的数据库读取情况,CONSISTENT_GETS 值在持续增长。结合上面SQL*Net的情况,大致可以判断出该程序在反复执行某个查询语句,每次获取的数据量很小但频率很高。我们知道,客户端在读取数据时,可能影响的因素有Array Fetch Size、SQL*Net中的SDU和数据库服务器每次读取的数据量,简单计算SQL*Net每次传输大致的数据量后将问题定位到SQL的性能问题上。
开始跟踪Planning Manager的全过程。Planning Manager 是一种特殊的并发程序(Immediate),它直接依附于并发管理器MRP Manager执行。但是作为并发请求,它的常规诊断方式是一样的。如果有计划管理器在运行,则先停掉它,然后启用Trace:

关于详细操作和Trace文件的查找可参考另一篇文章《跟踪(Trace)并发请求》
进入MRP计划管理员设置界面,启动Planning Manager。对于其他独立的并发请求,可以直接手工提交。等待Planning Manager处理完预先准备的接口数据后,立即取消该请求,并取消Trace。
接下来就是对Trace文件的解读。对于并发请求的Trace,其实就是对应的一次10046事件(event),该事件可简单分为以下四个级别:
- Level 1: 跟踪SQL
- Level 4: Level 1 + 跟踪绑定变量
- Level 8: Level 1 + 跟踪等待事件
- Level 12: Level 4 + Level 8 (跟踪SQL, 绑定变量和等待事件)
手工方式的话可通过下面的命令启用:
alter session set events='10046 trace name context forever, level 8' ;
它和下面的命令类似:
ALTER SESSION SET SQL_TRACE = TRUE
只是前一个命令,可以根据需要指定不同的level,比如此处的level 8,表示跟踪到SQL和等待事件,而不获取绑定变量的数据。也可以通过包DBMS_System来启用,例如:
EXEC DBMS_System.Set_Sql_Trace_In_Session(sid, serial#, true ); EXEC DBMS_System.Set_Ev(sid, serial#, event, level, name);
在某些场合下,针对并发请求的跟踪可能会产生不止一个Trace文件,此时第一个Trace文件中会指明第二个Trace文件标识,如:
PARSE #41:c=0,e=565,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=19728618248238 EXEC #41:c=0,e=781,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=19728618249248 ===================== PARSING IN CURSOR #41 len=54 dep=1 uid=44 oct=42 lid=44 tim=19728618250006 hv=4121095646 ad='f7b931f8' ALTER SESSION SET TRACEFILE_IDENTIFIER='XZB_CR8909488' END OF STMT PARSE #41:c=0,e=249,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=19728618250001 *** TRACE DUMP CONTINUES IN FILE ……/udump/test_ora_14443_XZB_CR8909488.trc ***
先来大致看一下Trace文件中都有些什么。
Dump file ……/udump/test_ora_14443.trc Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production With the Partitioning, OLAP and Oracle Data Mining options JServer Release 9.2.0.6.0 - Production ORACLE_HOME = /u08/test/proddb/9.2.0 System name: SunOS Node name: erptest Release: 5.10 Version: Generic_118833-03 Machine: sun4u Instance name: TEST Redo thread mounted by this instance: 1 Oracle process number: 59 Unix process pid: 14443, image: oracle@erptest (TNS V1-V3)
这是文件头,显示了相关环境信息。接着往下翻,关键在具体某条SQL的执行情况上,如下:
===================== PARSING IN CURSOR #75 len=332 dep=0 uid=44 oct=3 lid=44 tim=19728619902280 hv=3581312695 ad='f2b4eb00' select count(*) into :b0 from wip_requirement_operations wro ,mtl_system_items items where ((((((wro.inventory_item_id=items.inventory_item_id and wro.organization_id=items.organization_id) and items.bom_item_type=4) and wro.wip_entity_id=:b1) and wro.wip_supply_type=6) and wro.organization_id=:b2) and wro.inventory_item_id=:b3) END OF STMT PARSE #75:c=10000,e=2221,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=0,tim=19728619902273 EXEC #75:c=0,e=2841,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728619905363 WAIT #75: nam='db file sequential read' ela= 6392 p1=66 p2=135783 p3=1 …… FETCH #74:c=0,e=1248,p=0,cr=36,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777309437 WAIT #74: nam='SQL*Net message to client' ela= 6 p1=1952673792 p2=1 p3=0 WAIT #74: nam='SQL*Net message from client' ela= 338 p1=1952673792 p2=1 p3=0 EXEC #74:c=0,e=256,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777310344 WAIT #74: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0 WAIT #74: nam='SQL*Net message from client' ela= 205 p1=1952673792 p2=1 p3=0 EXEC #75:c=0,e=110,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777310795 WAIT #75: nam='SQL*Net message to client' ela= 2 p1=1952673792 p2=1 p3=0 FETCH #75:c=30000,e=26138,p=0,cr=2028,cu=0,mis=0,r=1,dep=0,og=4,tim=19728777336976 WAIT #75: nam='SQL*Net message from client' ela= 278 p1=1952673792 p2=1 p3=0 FETCH #74:c=0,e=749,p=0,cr=36,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777338168 WAIT #74: nam='SQL*Net message to client' ela= 2 p1=1952673792 p2=1 p3=0 WAIT #74: nam='SQL*Net message from client' ela= 304 p1=1952673792 p2=1 p3=0 EXEC #74:c=0,e=112,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777338769 WAIT #74: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0 WAIT #74: nam='SQL*Net message from client' ela= 227 p1=1952673792 p2=1 p3=0 EXEC #75:c=0,e=272,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777339533 WAIT #75: nam='SQL*Net message to client' ela= 7 p1=1952673792 p2=1 p3=0 FETCH #75:c=30000,e=27212,p=0,cr=2028,cu=0,mis=0,r=1,dep=0,og=4,tim=19728777366810 WAIT #75: nam='SQL*Net message from client' ela= 415 p1=1952673792 p2=1 p3=0 ……
这里显示了cursor语句的执行情况,发现编号为#74和#75的语句执行次数非常多,并且间隔执行,每次execute速度很快,fetch消耗时间相对较多,每次读取后立即发送给客户端。(格式的解读参考附注)
由于无法查看源代码,也没有任何技术文档,所以这里,大胆猜测这两个语句在一个LOOP中执行的可能性非常大。Tkprof格式化后,迅速定位最耗时的那个步骤:
select count(*) into :b0
from
wip_requirement_operations wro ,mtl_system_items items where
((((((wro.inventory_item_id=items.inventory_item_id and wro.organization_id=
items.organization_id) and items.bom_item_type=4) and wro.wip_entity_id=:b1)
and wro.wip_supply_type=6) and wro.organization_id=:b2) and
wro.inventory_item_id=:b3)
call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.00 0 0 0 0
Execute 8890 0.89 0.92 0 0 0 0
Fetch 8890 223.66 231.45 1951 18028921 0 8890
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 17781 224.56 232.37 1951 18028921 0 8890
Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 44
Rows Row Source Operation
------- ---------------------------------------------------
8890 SORT AGGREGATE
0 NESTED LOOPS
8890 TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B
8890 INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (object id 18977994)
0 TABLE ACCESS BY INDEX ROWID WIP_REQUIREMENT_OPERATIONS
61341000 INDEX RANGE SCAN WIP_REQUIREMENT_OPERATIONS_N1 (object id 18977444)
Elapsed times include waiting on following events:
Event waited on Times Max. Wait Total Waited
---------------------------------------- Waited ---------- ------------
db file sequential read 1951 0.08 11.65
SQL*Net message to client 8891 0.00 0.03
SQL*Net message from client 8891 0.01 2.82
latch free 19 0.01 0.12
到这里终于确定了主要问题所在,下一步就是对该语句进行优化。由于无法修改代码,因此只能从执行计划角度考虑。检查该语句的执行计划,仔细检查发现虽然利用了索引WIP_REQUIREMENT_OPERATIONS_N1,看上去很漂亮,但是仔细检查索引字段发现这是个组合索引,若走这个索引或导致大量的逻辑块读。到此,整个诊断过程也顺利结束了。
附Trace格式解读:
PARSING IN CURSOR len Length of SQL statement. dep Recursive depth of the cursor. uid Schema user id of parsing user. oct Oracle command type. lid Privilege user id. ela Elapsed time. 8i: in 1/1000th of a second, 9i: 1/1'000'000th of a second tim Timestamp. hv Hash id. ad SQLTEXT address (see?v$sqlarea?and?v$sqltext). PARSE, EXEC or FETCH #n n = number of cursor c cpu time e elapsed time p physical reads cr consistant reads cu current mode reads mis miss in cache r rows processed dep recursive depth og optimizer goal tim time
详见 http://www.adp-gmbh.ch/ora/misc/trace_file_format.html
有用的链接:
- 《SQL*Net Performance Tuning Using Underlying Network Protocols》
- 《SQL*Net message to client wait isn’t really what it’s thought to be》
Update
2010-05-27 其实是可以看到源代码的,下一篇文章会详细阐述。
» Filed Under Oracle