小编给大家分享一下Oracle 11.2.0.3数据库CJQ进程造成row cache lock等待事件影响job无法停止怎么办,希望大家阅读完这篇文章之后都有所收获,下面让我们一起去探讨吧!
一, 现象描述
某生产系统应用人员反映自己发起的job异常,没有正常运行,导致相关表无数据,自己尝试停止job但未成功,也无法重新发起job。
二, 问题分析
job的登录模式一般为’DBMS_SCHEDULER’,通过PL/SQL工具登录运行如下语句进行查看
select S.STATUS,
S.INST_ID,
S.SID,
S.PADDR,
S.SQL_ID,
S.SQL_EXEC_START,
S.ACTION,
S.LAST_CALL_ET,
S.BLOCKING_INSTANCE,
S.BLOCKING_SESSION,
S.EVENT,
S.WAIT_CLASS
from v$session S
where module = 'DBMS_SCHEDULER';
查询到一个被kill的job,状态为killed,并未释放,通过下面的语句查询到spid后,通过登录后台kill -9 的方式杀掉了系统进程。
Select p.spid from v$session s,v$processes p where p.addr=s.paddr and s.sid=508
之后让应用再次发起job,但job依然无法执行,通过后台发现该job发生了等待事件row cache lock,等待会话为1209,P1值为11,通过下面的语句查看无返回结果,不能定位row cache lock发生在哪里
Select * from v$rowcache where cache#=11;
查询1209会话发现为系统后台进程CJQ0, CJQn和Jnnn均为为oracle作业队列进程,后台进程为什么会锁正常的job呢?
Select * from v$session where sid=1209;
应用人员通过PL/SQL提供的JOB任务发现存在如下异常现象,
(上面的截图缺少了任务CSH_BRANCH_D_1DAY,因为截取的是已经处理后的图像,其实在出现问的时候是有的,只不过忘了截取了)
当前运行很多job,但没有都没有session_id,感觉像是僵尸job,既然无法通过会话杀掉job,
我们尝试通过dbms_scheduler包进行停止,但无法停止,而且依然出现了row cache lock事件,依然被1209后台进程锁住。
Row cache lock无法通过P1定位到具体的锁对象,后台进程为什么会锁住job呢,通过上面的结果初步分析是因为之前有很多僵死的job没有被正常停止,导致再次发起同样的job任务或对该job进行处理就会出现锁的情况,而且是被后台进程CJQn锁住。
SQL>Show parameter processes
通过查看参数job_queue_processes值为1000,aq_tm_processes为10000,并未达到峰值
进一步分析
登录后台通过DEBUG命令收集相关信息分析
用sysdba登录到数据库上:
$sqlplus / as sysdba
或者
$sqlplus -prelim / as sysdba <==当数据库已经很慢或者hang到无法连接
以下为操作截屏,采用的hanganalyze 3,systemstate dump 258级别。
相关说明:
Level 的含义:
1-2:只有hanganalyze输出,不dump任何进程
3:Level2+Dump出在IN_HANG状态的进程
4:Level3+Dump出在等待链里面的blockers(状态为LEAF/LEAF_NW/IGN_DMP)
5:Level4+Dump出所有在等待链中的进程(状态为NLEAF)
Oracle官方建议不要超过level 3,一般level 3也能够解决问题,超过level 3会给系统带来额外负担。
systemstate dump有多个级别:
2: dump (不包括lock element)
10: dump
11: dump + global cache of RAC
256: short stack (函数堆栈)
258: 256+2 –>short stack +dump(不包括lock element)
266: 256+10 –>short stack+ dump
267: 256+11 –>short stack+ dump + global cache of RAC
level 11和 267会 dump global cache, 会生成较大的trace 文件,一般情况下不推荐。
一般情况下,如果进程不是太多,推荐用266,因为这样可以dump出来进程的函数堆栈,可以用来分析进程在执行什么操作。
但是生成short stack比较耗时,如果进程非常多,比如2000个进程,那么可能耗时30分钟以上。这种情况下,可以生成level 10 或者 level 258, level 258 比 level 10会多收集short short stack, 但比level 10少收集一些lock element data.
以下为hanganalyze的截取输出,可以看到主要等待为row cache lock,session id:7/porcess id:128被session id:1209/process id: 49锁住,和上面的分析一致。
Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'rdbms ipc message'<='row cache lock'
Chain 1 Signature Hash: 0xfe4c0898
Chain 1:
——————————————————————————-
Oracle session identified by:
{
instance: 2 (cmbd.cmbd2)
os id: 43516004
process id: 128, oracle@jkptdb2
session id: 7
session serial #: 47563
}
is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0xb
p2: 'mode'=0x0
p3: 'request'=0x5
time in wait: 0.077179 sec
heur. time in wait: 16 min 18 sec
timeout after: 2.922821 sec
wait id: 13467
blocking: 0 sessions
wait history:
* time between current wait and wait #1: 0.000045 sec
1. event: 'row cache lock'
time waited: 3.000020 sec
wait id: 13466 p1: 'cache id'=0xb
p2: 'mode'=0x0
p3: 'request'=0x5
* time between wait #1 and #2: 0.000030 sec
2. event: 'row cache lock'
time waited: 3.000016 sec
wait id: 13465 p1: 'cache id'=0xb
p2: 'mode'=0x0
p3: 'request'=0x5
* time between wait #2 and #3: 0.000032 sec
3. event: 'row cache lock'
time waited: 3.000016 sec
wait id: 13464 p1: 'cache id'=0xb
p2: 'mode'=0x0
p3: 'request'=0x5
}
and is blocked by
=> Oracle session identified by:
{
instance: 2 (cmbd.cmbd2)
os id: 19792004
process id: 49, oracle@jkptdb2 (CJQ0)
session id: 1209
session serial #: 9
}
which is waiting for 'rdbms ipc message' with wait info:
{
p1: 'timeout'=0x14
time in wait: 0.168064 sec
heur. time in wait: 2.171697 sec
timeout after: 0.031936 sec
wait id: 72756966
blocking: 1 session
wait history:
* time between current wait and wait #1: 0.000350 sec
1. event: 'rdbms ipc message'
time waited: 0.200014 sec
wait id: 72756965 p1: 'timeout'=0x14
* time between wait #1 and #2: 0.000304 sec
2. event: 'rdbms ipc message'
time waited: 0.200016 sec
wait id: 72756964 p1: 'timeout'=0x14
* time between wait #2 and #3: 0.000307 sec
3. event: 'rdbms ipc message'
time waited: 0.200041 sec
wait id: 72756963 p1: 'timeout'=0x14
}
Chain 1 Signature: 'rdbms ipc message'<='row cache lock'
Chain 1 Signature Hash: 0xfe4c0898
以下为systemstate dump(部分截取信息)输出如下:
上锁进程(session id:129):
PROCESS 49: CJQ0
下面的描述大致为:系统进程49,也就是session id 129锁住一个会话是sid:7,等待事件为row cache lock,和上面的分析吻合。
There are 1 sessions blocked by this session.
Dumping one waiter:
inst: 2, sid: 7, ser: 47563
wait event: 'row cache lock'
p1: 'cache id'=0xb
p2: 'mode'=0x0
p3: 'request'=0x5
row_wait_obj#: 520, block#: 11728, row#: 0, file# 1
min_blocked_time: 0 secs, waiter_cache_ver: 21656
下面可以看到CJQ0进程持有S锁,而对象正是=CSH_BRANCH_D_1DAY_JOB,也就是我们上面打算停止的job名称。
kssdmh: so 0x7000001b029ced8 (type 75, "row cache enqueues") may not have children
—————————————-
SO: 0x7000001b02dd868, type: 75, owner: 0x700000191e0d808, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x7000001aa6dc268, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0
row cache enqueue: count=2 session=7000001adb58460 object=7000001810e7238, mode=S
savepoint=0x351e5fa5
row cache parent object: address=7000001810e7238 cid=11(dc_objects)
hash=af8ef272 typ=21 transaction=0 flags=00000002
objectno=352585 ownerid=57 nsp=1
name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S
status=VALID/-/-/-/-/-/-/-/-
request=N release=FALSE flags=8
instance lock=QI af8ef272 41d157e6
set=0, complete=FALSE
data=
00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000
00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
被锁进程(session id:7)
PROCESS 128:
—————————————-
Short stack dump:
ksedsts()+360<-ksdxfstk()+44<-ksdxcb()+3384<-sspuser()+116<-47dc<-sskgpwwait()+32<-skgpwwait()+180
<-ksliwat()+11032<-kslwaitctx()+180<-kqrigt()+1424<-kqrLockAndPinPo()+532<-kqrpre1()+864<-kqrpre()+28
<-jskaObjGet()+1024<-jskaJobRun()+2788<-jsiRunJob()+776<-jsaRunJob()+780<-spefcmpa()+600
<-spefmccallstd()+1340<-pextproc()+168<-peftrusted()+164<-psdexsp()+276<-rpiswu2()+480
<-psdextp()+632<-pefccal()+588<-pefcal()+260<-pevm_FCAL()+144<-pfrinstr_FCAL()+96
<-pfrrun_no_tool()+96<-pfrrun()+1032<-plsql_run()+716<-peicnt()+320<-kkxexe()+508
<-opiexe()+17168<-kpoal8()+1856<-opiodr()+720<-ttcpip()+1028<-opitsk()+1508<-opiino()+940
<-opiodr()+720<-opidrv()+1132<-sou2o()+136<-opimai_real()+608<-ssthrdmain()+268
<-main()+204<-__start()+112
当前正在等待事件row cache lock
Current Wait Stack:
0: waiting for 'row cache lock'
cache id=0xb, mode=0x0, request=0x5
wait_id=13602 seq_num=13603 snap_id=1
wait times: snap=0.003162 sec, exc=0.003162 sec, total=0.003162 sec
wait times: max=3.000000 sec, heur=23 min 0 sec
wait counts: calls=1 os=1
in_wait=1 iflags=0x5a2
下面显示了该进程在请求一个X锁,请求的对象是正是CSH_BRANCH_D_1DAY_JOB,和上面的分析是温和的,CJQ进程对 CSH_BRANCH_D_1DAY_JOB持有一个S锁,而当前的进程需要对CSH_BRANCH_D_1DAY_JOB进程操作,需要一个X锁。
SO: 0x7000001b2c1be58, type: 75, owner: 0x70000019c72b078, flag: INIT/-/-/0x00 if: 0x1 c: 0x1
proc=0x7000001ab6cb410, name=row cache enqueues, file=kqr.h LINE:2064 ID:, pg=0
row cache enqueue: count=1 session=7000001ab826ce0 object=7000001810e7238, request=X
savepoint=0x893a
row cache parent object: address=7000001810e7238 cid=11(dc_objects)
hash=af8ef272 typ=21 transaction=0 flags=00000002
objectno=352585 ownerid=57 nsp=1
name=CSH_BRANCH_D_1DAY_JOB own=7000001810e7300[7000001b02dd910,7000001b02dd910] wat=7000001810e7310[7000001b2c1bf00,7000001b2c1bf00] mode=S
status=VALID/-/-/-/-/-/-/-/-
request=N release=FALSE flags=8
instance lock=QI af8ef272 41d157e6
set=0, complete=FALSE
data=
00000039 00154353 485f4252 414e4348 5f445f31 4441595f 4a4f4200 00000000
00000000 01000000 00000000 00000000 00000000 00000000 00000000 00000000
011044c6 00000000 07000001 810e7300
下面记录了操作语句:
begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
—————————————-
SO: 0x70000015d5f35c8, type: 78, owner: 0x7000001ab826ce0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x7000001ab6cb410, name=LIBRARY OBJECT LOCK, file=kgl.h LINE:8548 ID:, pg=0
LibraryObjectLock: Address=70000015d5f35c8 Handle=7000000ba27f998 Mode=N CanBeBrokenCount=1 Incarnation=1 ExecutionCount=1
User=7000001ab826ce0 Session=7000001ab826ce0 ReferenceCount=1 Flags=CNB/[0001] SavepointNum=5a52c6da
LibraryHandle: Address=7000000ba27f998 Hash=a53cfcdd LockMode=N PinMode=0 LoadLockMode=0 Status=VALD
ObjectName:Name=begin sys.dbms_scheduler.run_job('CMBD.CSH_BRANCH_D_1DAY_JOB', false); end;
三,问题定位
通过相关关键字row cache lock cjq在metlink搜索查看,最终定位到了一个BUG16994952 –Unable to unscheduled propagation due to CJQ self-deadlock/hang。通过BUG的描述发现和上面的分析吻合。
Bug 16994952 Unable to unschedule propagation due to CJQ self-deadlock / hang
This note gives a brief overview of bug 16994952.
The content was last updated on: 23-AUG-2015
Click here for details of each of the sections below.
Affects:
Product (Component) |
Oracle Server (Rdbms) |
Range of versions believed to be affected |
(Not specified) |
Versions confirmed as being affected |
· 11.2.0.3 |
Platforms affected |
Generic (all / most platforms affected) |
Fixed:
The fix for 16994952 is first included in |
· 12.1.0.1 (Base Release) |
Interim patches may be available for earlier versions – click here to check.
Symptoms: |
Related To: |
· Deadlock · Hang (Process Hang) · Waits for "row cache lock" · Stack is likely to include jskaObjGet |
· Advanced Queuing · Job Queues |
Description
Unable to unschedule propagation due to CJQ self-deadlock / hang.
Rediscovery Information:
This bug may be the cause if the CJQ0 process is stuck in a
self-deadlock waiting on "row cache lock" for a dc_objects
entry in X mode for an object that it already holds in S mode.
The stack is likely to include jskaObjGet
Workaround:
Kill the hung CJQ process
HOOKS "WAITEVENT:row cache lock" STACKHAS:jskaObjGet WAITEVENT:row cache lock STACKHAS:jskaObjGet LIKELYAFFECTS XAFFECTS_11.2.0.1 XAFFECTS_V11020001 AFFECTS=11.2.0.1 XAFFECTS_11.2.0.2 XAFFECTS_V11020002 AFFECTS=11.2.0.2 XAFFECTS_11.2.0.3 XAFFECTS_V11020003 AFFECTS=11.2.0.3 XAFFECTS_11.2.0.4 XAFFECTS_V11020004 AFFECTS=11.2.0.4 XPRODID_5 PRODUCT_ID=5 PRODID-5 RDBMS XCOMP_RDBMS COMPONENT=RDBMS TAG_AQ TAG_DEADLOCK TAG_HANG TAG_JOBQ AQ DEADLOCK HANG JOBQ FIXED_12.1.0.1 SOLVED_SR=2 SOLVED_SR_BAND_2
Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support. |
References
Bug:16994952 (This link will only work for PUBLISHED bugs)
Note:245840.1 Information on the sections in this article
四,问题处理
Kill the hung CJQ process,杀掉CJQ进程
SQL>Select spid from v$session s,v$process p where p.addr=s.paddr and s.sid=129;
再次通过dbms_scheduler包进行停止,可以正常停止,依次停止僵死的job后,应用可以再次正常发起job,恢复正常。
看完了这篇文章,相信你对“Oracle 11.2.0.3数据库CJQ进程造成row cache lock等待事件影响job无法停止怎么办”有了一定的了解,如果想了解更多相关知识,欢迎关注亿速云行业资讯频道,感谢各位的阅读!
原创文章,作者:ItWorker,如若转载,请注明出处:https://blog.ytso.com/205536.html