今天开发人员发来SIR,需要给生产库上一张表加字段,SQL语句如下,表名用 table_test 代替 ALTER TABLE table_test ADD COLUMN column_a integer NOT NULL DEFAULT 0;
当时偶查了一下这张表的记录数,才3000多条,这么小的表,接着查了下 pg_stata_activity,发现没有多少活动的会话,于是开始执行上面SQL,执行了大概10S之后,发现还没执行完,觉得奇怪,这么小的表,应该很快才对,估计是被HOLD住了, 详细操作过程如下。
表结构
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
|
testtf=# /d pg_class; Table "pg_catalog.pg_class" Column | Type | Modifiers
relname | name | not null relnamespace | oid | not null reltype | oid | not null reloftype | oid | not null relowner | oid | not null relam | oid | not null relfilenode | oid | not null reltablespace | oid | not null relpages | integer | not null reltuples | real | not null reltoastrelid | oid | not null reltoastidxid | oid | not null relhasindex | boolean | not null relisshared | boolean | not null relistemp | boolean | not null relkind | "char" | not null relnatts | smallint | not null relchecks | smallint | not null relhasoids | boolean | not null relhaspkey | boolean | not null relhasexclusion | boolean | not null relhasrules | boolean | not null relhastriggers | boolean | not null relhassubclass | boolean | not null relfrozenxid | xid | not null relacl | aclitem[] | reloptions | text[] | Indexes: "pg_class_oid_index" UNIQUE, btree (oid) "pg_class_relname_nsp_index" UNIQUE, btree (relname, relnamespace)
|
先查出表table_test 的oid
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
|
testtf=# select oid,relname from pg_class where relname='table_test'; oid | relname -------+--------- 16678 | table_test (1 row)
testtf=# /d pg_locks; View "pg_catalog.pg_locks" Column | Type | Modifiers --------------------+----------+----------- locktype | text | database | oid | relation | oid | page | integer | tuple | smallint | virtualxid | text | transactionid | xid | classid | oid | objid | oid | objsubid | smallint | virtualtransaction | text | pid | integer | mode | text | granted | boolean |
|
查询表table_test上持有的锁
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
|
testtf=# select locktype,database,pid,relation ,mode from pg_locks where relation=16678; locktype | database | pid | relation | mode ----------+----------+-------+----------+--------------------- relation | 16400 | 9905 | 16678 | AccessShareLock relation | 16400 | 9902 | 16678 | AccessShareLock relation | 16400 | 9127 | 16678 | AccessShareLock relation | 16400 | 9909 | 16678 | AccessShareLock relation | 16400 | 6781 | 16678 | AccessShareLock relation | 16400 | 9910 | 16678 | AccessShareLock relation | 16400 | 9903 | 16678 | AccessShareLock relation | 16400 | 6779 | 16678 | AccessShareLock relation | 16400 | 9904 | 16678 | AccessShareLock relation | 16400 | 9907 | 16678 | AccessShareLock relation | 16400 | 9436 | 16678 | AccessExclusiveLock relation | 16400 | 29393 | 16678 | AccessShareLock relation | 16400 | 9908 | 16678 | AccessShareLock relation | 16400 | 9912 | 16678 | AccessShareLock relation | 16400 | 9911 | 16678 | AccessShareLock relation | 16400 | 4615 | 16678 | AccessShareLock relation | 16400 | 9128 | 16678 | AccessShareLock (17 rows)
testtf=# select locktype,database,pid,relation ,mode from pg_locks where relation=16678; locktype | database | pid | relation | mode ----------+----------+-------+----------+----------------- relation | 16400 | 29393 | 16678 | AccessShareLock testtf=# select locktype,database,pid,relation ,mode from pg_locks where relation=16678; locktype | database | pid | relation | mode ----------+----------+-------+----------+----------------- relation | 16400 | 29393 | 16678 | AccessShareLock
|
备注:此时表 table_test 只有一个会话(pid=29393)持有 AccessShareLock 锁,这应该是个Select查询;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
|
testtf=# /d pg_stat_activity; View "pg_catalog.pg_stat_activity" Column | Type | Modifiers ------------------+--------------------------+----------- datid | oid | datname | name | procpid | integer | usesysid | oid | usename | name | application_name | text | client_addr | inet | client_port | integer | backend_start | timestamp with time zone | xact_start | timestamp with time zone | query_start | timestamp with time zone | waiting | boolean | current_query | text | testtf=# select usename,current_query ,xact_start,procpid from pg_stat_activity where procpid=29393; usename | current_query | xact_start | procpid ---------+-----------------------+-------------------------------+--------- wapnews | <IDLE> in transaction | 2010-07-08 13:53:07.906297+08 | 29393
|
备注:从上面可以看出 空闲事务 9436 持有共享锁”AccessShareLock”,这个事务从 2010-07-08 13:53:07 开始到现在还没有结束。
经和开发人员联系,可以KILL这个会话
1 2 3 4 5
|
testtf=# select pg_terminate_backend(29393) pg_terminate_backend ---------------------- t (1 row)
|
会话消失
1 2 3 4
|
testtf= usename | current_query | xact_start | procpid ---------+---------------+------------+--------- (0 rows)
|
再次执行DDL,执行成功
1 2
|
testtf=# ALTER TABLE table_test ADD COLUMN column_a integer NOT NULL DEFAULT 0; ALTER TABLE
|
总结
- 原因分析: 在生产库执行类似ALTER TABLE 的DDL时应该非常小心,因为此时加的是 Table 级的 ACCESS EXCLUSIVE 锁 ,ACCESS EXCLUSIVE和其它所有类型所有锁都冲突,包括SELECT,所以当执行Select这张表的事务还没有结束时,执行 Alter Table时,表table_test会因获取不到 ACCESS EXCLUSIVE 而发生等侍。
- 建议方法:当执行ALTER TABLE类似操作时,正确的方法是先查看 pg_locks 是否有这张表所持有的锁,如果有,等系统空闲的时候 再做这种操作
- 扩展:当在生产库执行DDL时,应该首先查看执行DDL对象当前所持有锁的情况,如果和执行DDL请求的锁冲突,应该等系统空闲的时候再执行DDL操作。
- 和开发人员沟通,程序代码及时提交事务。
原创文章,作者:254126420,如若转载,请注明出处:https://blog.ytso.com/236328.html