Planning Manager诊断过程和10046事件

在一次Planning Manager性能问题的诊断中,用到了Oracle ERP环境中非常典型的诊断操作和过程,特全程记录于此。

当发现某个程序存在性能问题时,首先当然是快速检查一下SESSION的相关信息,看看主要在执行或等待些什么:

SELECT stat.sid,
       n.name,
       n.class,
       stat.value
  FROM v$sesstat  stat,
       v$statname n
 WHERE stat.statistic# = n.statistic#
   AND stat.sid = 121
 ORDER BY upper(n.name);

这里观察到bytes received via SQL*Net from client, bytes sent via SQL*Net to clientSQL*Net roundtrips to/from client 都在飙升。这对应了两个events:SQL*Net message to clientSQL*Net message from client,这两个事件数值大并不意味着网络状况不好,前者仅仅表示Oracle数据库将数据放入TCP send buffer的时间,而后者表示从客户端接收到反馈指令的时间,两者反映了SQL*Net和数据库之间的时间消耗,而不是网络传输的时间消耗。基于对系统架构的了解,排除SQL*Net的性能问题,那么两个事件预示了什么呢?

SELECT * FROM v$sess_io v WHERE v.sid = 61;

观察会话中的数据库读取情况,CONSISTENT_GETS 值在持续增长。结合上面SQL*Net的情况,大致可以判断出该程序在反复执行某个查询语句,每次获取的数据量很小但频率很高。我们知道,客户端在读取数据时,可能影响的因素有Array Fetch Size、SQL*Net中的SDU和数据库服务器每次读取的数据量,简单计算SQL*Net每次传输大致的数据量后将问题定位到SQL的性能问题上。

开始跟踪Planning Manager的全过程。Planning Manager 是一种特殊的并发程序(Immediate),它直接依附于并发管理器MRP Manager执行。但是作为并发请求,它的常规诊断方式是一样的。如果有计划管理器在运行,则先停掉它,然后启用Trace:

关于详细操作和Trace文件的查找可参考另一篇文章《跟踪(Trace)并发请求

进入MRP计划管理员设置界面,启动Planning Manager。对于其他独立的并发请求,可以直接手工提交。等待Planning Manager处理完预先准备的接口数据后,立即取消该请求,并取消Trace。

接下来就是对Trace文件的解读。对于并发请求的Trace,其实就是对应的一次10046事件(event),该事件可简单分为以下四个级别:

  • Level 1: 跟踪SQL
  • Level 4: Level 1 + 跟踪绑定变量
  • Level 8: Level 1 + 跟踪等待事件
  • Level 12: Level 4 + Level 8 (跟踪SQL, 绑定变量和等待事件)

手工方式的话可通过下面的命令启用:

alter session set events='10046 trace name context forever, level 8' ;

它和下面的命令类似:

ALTER SESSION SET SQL_TRACE = TRUE

只是前一个命令,可以根据需要指定不同的level,比如此处的level 8,表示跟踪到SQL和等待事件,而不获取绑定变量的数据。也可以通过包DBMS_System来启用,例如:

EXEC DBMS_System.Set_Sql_Trace_In_Session(sid, serial#, true );
EXEC DBMS_System.Set_Ev(sid, serial#, event, level, name);

在某些场合下,针对并发请求的跟踪可能会产生不止一个Trace文件,此时第一个Trace文件中会指明第二个Trace文件标识,如:

PARSE #41:c=0,e=565,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=19728618248238
EXEC #41:c=0,e=781,p=0,cr=0,cu=0,mis=0,r=0,dep=1,og=4,tim=19728618249248
=====================
PARSING IN CURSOR #41 len=54 dep=1 uid=44 oct=42 lid=44 tim=19728618250006 hv=4121095646 ad='f7b931f8'
ALTER SESSION SET TRACEFILE_IDENTIFIER='XZB_CR8909488'
END OF STMT
PARSE #41:c=0,e=249,p=0,cr=0,cu=0,mis=1,r=0,dep=1,og=4,tim=19728618250001

*** TRACE DUMP CONTINUES IN FILE 
……/udump/test_ora_14443_XZB_CR8909488.trc *** 

先来大致看一下Trace文件中都有些什么。

Dump file ……/udump/test_ora_14443.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /u08/test/proddb/9.2.0
System name:	SunOS
Node name:	erptest
Release:	5.10
Version:	Generic_118833-03
Machine:	sun4u
Instance name: TEST
Redo thread mounted by this instance: 1
Oracle process number: 59
Unix process pid: 14443, image: oracle@erptest (TNS V1-V3)

这是文件头,显示了相关环境信息。接着往下翻,关键在具体某条SQL的执行情况上,如下:

=====================
PARSING IN CURSOR #75 len=332 dep=0 uid=44 oct=3 lid=44 tim=19728619902280 hv=3581312695 ad='f2b4eb00'
select count(*)  into :b0  from wip_requirement_operations wro ,mtl_system_items items 
where ((((((wro.inventory_item_id=items.inventory_item_id and wro.organization_id=items.organization_id) and items.bom_item_type=4) 
and wro.wip_entity_id=:b1) and wro.wip_supply_type=6) 
and wro.organization_id=:b2) and wro.inventory_item_id=:b3)
END OF STMT
PARSE #75:c=10000,e=2221,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=0,tim=19728619902273
EXEC #75:c=0,e=2841,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728619905363
WAIT #75: nam='db file sequential read' ela= 6392 p1=66 p2=135783 p3=1 
……
FETCH #74:c=0,e=1248,p=0,cr=36,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777309437
WAIT #74: nam='SQL*Net message to client' ela= 6 p1=1952673792 p2=1 p3=0
WAIT #74: nam='SQL*Net message from client' ela= 338 p1=1952673792 p2=1 p3=0
EXEC #74:c=0,e=256,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777310344
WAIT #74: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0
WAIT #74: nam='SQL*Net message from client' ela= 205 p1=1952673792 p2=1 p3=0
EXEC #75:c=0,e=110,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777310795
WAIT #75: nam='SQL*Net message to client' ela= 2 p1=1952673792 p2=1 p3=0
FETCH #75:c=30000,e=26138,p=0,cr=2028,cu=0,mis=0,r=1,dep=0,og=4,tim=19728777336976
WAIT #75: nam='SQL*Net message from client' ela= 278 p1=1952673792 p2=1 p3=0
FETCH #74:c=0,e=749,p=0,cr=36,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777338168
WAIT #74: nam='SQL*Net message to client' ela= 2 p1=1952673792 p2=1 p3=0
WAIT #74: nam='SQL*Net message from client' ela= 304 p1=1952673792 p2=1 p3=0
EXEC #74:c=0,e=112,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777338769
WAIT #74: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0
WAIT #74: nam='SQL*Net message from client' ela= 227 p1=1952673792 p2=1 p3=0
EXEC #75:c=0,e=272,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=4,tim=19728777339533
WAIT #75: nam='SQL*Net message to client' ela= 7 p1=1952673792 p2=1 p3=0
FETCH #75:c=30000,e=27212,p=0,cr=2028,cu=0,mis=0,r=1,dep=0,og=4,tim=19728777366810
WAIT #75: nam='SQL*Net message from client' ela= 415 p1=1952673792 p2=1 p3=0
……

这里显示了cursor语句的执行情况,发现编号为#74和#75的语句执行次数非常多,并且间隔执行,每次execute速度很快,fetch消耗时间相对较多,每次读取后立即发送给客户端。(格式的解读参考附注)

由于无法查看源代码,也没有任何技术文档,所以这里,大胆猜测这两个语句在一个LOOP中执行的可能性非常大。Tkprof格式化后,迅速定位最耗时的那个步骤:

select count(*)  into :b0  
from
 wip_requirement_operations wro ,mtl_system_items items where 
  ((((((wro.inventory_item_id=items.inventory_item_id and wro.organization_id=
  items.organization_id) and items.bom_item_type=4) and wro.wip_entity_id=:b1)
   and wro.wip_supply_type=6) and wro.organization_id=:b2) and 
  wro.inventory_item_id=:b3)


call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.01       0.00          0          0          0           0
Execute   8890      0.89       0.92          0          0          0           0
Fetch     8890    223.66     231.45       1951   18028921          0        8890
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    17781    224.56     232.37       1951   18028921          0        8890

Misses in library cache during parse: 1
Optimizer goal: CHOOSE
Parsing user id: 44  

Rows     Row Source Operation
-------  ---------------------------------------------------
   8890  SORT AGGREGATE 
      0   NESTED LOOPS  
   8890    TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B 
   8890     INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (object id 18977994)
      0    TABLE ACCESS BY INDEX ROWID WIP_REQUIREMENT_OPERATIONS 
61341000     INDEX RANGE SCAN WIP_REQUIREMENT_OPERATIONS_N1 (object id 18977444)


Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  db file sequential read                      1951        0.08         11.65
  SQL*Net message to client                    8891        0.00          0.03
  SQL*Net message from client                  8891        0.01          2.82
  latch free                                     19        0.01          0.12

到这里终于确定了主要问题所在,下一步就是对该语句进行优化。由于无法修改代码,因此只能从执行计划角度考虑。检查该语句的执行计划,仔细检查发现虽然利用了索引WIP_REQUIREMENT_OPERATIONS_N1,看上去很漂亮,但是仔细检查索引字段发现这是个组合索引,若走这个索引或导致大量的逻辑块读。到此,整个诊断过程也顺利结束了。

附Trace格式解读:

PARSING IN CURSOR
len	Length of SQL statement.
dep	Recursive depth of the cursor.
uid	Schema user id of parsing user.
oct	Oracle command type.
lid	Privilege user id.
ela	Elapsed time. 8i: in 1/1000th of a second, 9i: 1/1'000'000th of a second
tim	Timestamp.
hv	Hash id.
ad	SQLTEXT address (see?v$sqlarea?and?v$sqltext).

PARSE, EXEC or FETCH
#n	n = number of cursor
c	cpu time
e	elapsed time
p	physical reads
cr	consistant reads
cu	current mode reads
mis	miss in cache
r	rows processed
dep	recursive depth
og	optimizer goal
tim	time

详见 http://www.adp-gmbh.ch/ora/misc/trace_file_format.html

有用的链接:

Update
2010-05-27 其实是可以看到源代码的,下一篇文章会详细阐述。