Copy 进程被 Kill 后数据库无法连接故障

昨天开发人员跑来,说是他的测试环境数据库无法连接,下面是详细过程。

故障现象

登陆到数据库主机,执行 psql

1
2
[postgresql@test_db ~]$ psql  
psql: FATAL: the database system is in recovery mode

果然,连 psql 命令都不能执行,根据报错信息,知道数据库在做恢复。

排查过程

查看 PostgreSQL 后台进程

1
2
3
4
5
6
7
[postgresql@test_db ~]$ ps -ef | grep post  
501 26024 28829 4 17:34 ? 00:00:25 postgres: startup process recovering 00000001000001CA0000001F
root 26476 25974 0 17:35 pts/0 00:00:00 su - postgresql
501 28829 1 0 Apr27 ? 00:00:12 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data
501 28830 28829 0 Apr27 ? 00:01:01 postgres: logger process
root 31526 31408 0 17:44 pts/1 00:00:00 su - postgresql
501 31656 31527 0 17:44 pts/1 00:00:00 grep post

备注:从上面输出,可以看到一个关键信息 “ recovering 00000001000001CA0000001F “ , 这说明数据库在做恢复,于是向开发人员了解一下情况,原来他正在执行一个 copy 操作,觉得太慢,后来在操作系统层面将 copy 进程 kill 了。原来如此。这就是PG 在做恢复的根本原因, 俺再一次和开发人员强调,PG的进程不能通过 kill 来杀。。。

试图关闭数据库

1
2
3
[postgresql@test_db ~]$ pg_ctl stop -m fast  
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

很不幸,数据库无法正常关闭。

查看PostgreSQL 后台进程

1
2
3
4
5
6
7
8
9
10
11
[postgresql@test_db ~]$ ps -ef | grep post  
501 374 31527 0 17:46 pts/1 00:00:00 grep post
501 26024 28829 3 17:34 ? 00:00:25 postgres: startup process recovering 00000001000001CA0000001F
root 26476 25974 0 17:35 pts/0 00:00:00 su - postgresql
501 28829 1 0 Apr27 ? 00:00:13 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data
501 28830 28829 0 Apr27 ? 00:01:01 postgres: logger process
root 31526 31408 0 17:44 pts/1 00:00:00 su - postgresql
root 31952 31751 0 17:45 pts/3 00:00:00 su - postgresql

[postgresql@test_db ~]$ psql
psql: FATAL: the database system is shutting down

数据库还是老样子,不能连接。

查看 Csvlog

1
2
3
4
2011-05-23 17:48:53.006 CST,"mylog","skytflog",1457,"192.168.171.39:55019",4dda2d85.5b1,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""  
2011-05-23 17:48:53.007 CST,"mylog","skytflog",1458,"192.168.171.39:55020",4dda2d85.5b2,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""
2011-05-23 17:48:53.008 CST,"mylog","skytflog",1459,"192.168.171.39:55021",4dda2d85.5b3,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""
2011-05-23 17:48:53.298 CST,"mylog","skytflog",1460,"xxx.xxx.xxx.xx:48985",4dda2d85.5b4,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""

备注:CSV日志里有大量the database system is shutting down ,意思是 PostgreSQL 正在关闭,俺在想,是否应该先等等,或者数据库就关闭了。

查看数据库 WAL

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
[postgresql@test_db pg_xlog]$ ls -lrt  
total 4920304
drwx------ 2 postgresql postgresql 4096 Feb 18 17:32 archive_status
-rw------- 1 postgresql postgresql 67108864 May 23 15:55 00000001000001CA00000022
-rw------- 1 postgresql postgresql 67108864 May 23 16:02 00000001000001CA0000002A
-rw------- 1 postgresql postgresql 67108864 May 23 16:05 00000001000001CA00000021
-rw------- 1 postgresql postgresql 67108864 May 23 16:06 00000001000001CA00000027
-rw------- 1 postgresql postgresql 67108864 May 23 16:08 00000001000001CA00000026
-rw------- 1 postgresql postgresql 67108864 May 23 16:16 00000001000001CA00000024
-rw------- 1 postgresql postgresql 67108864 May 23 16:16 00000001000001CA00000025
-rw------- 1 postgresql postgresql 67108864 May 23 16:20 00000001000001CA00000023
-rw------- 1 postgresql postgresql 67108864 May 23 16:22 00000001000001CA00000028
-rw------- 1 postgresql postgresql 67108864 May 23 16:23 00000001000001CA00000029
-rw------- 1 postgresql postgresql 67108864 May 23 16:24 00000001000001CA00000020
-rw------- 1 postgresql postgresql 67108864 May 23 16:25 00000001000001C90000001F
-rw------- 1 postgresql postgresql 67108864 May 23 16:26 00000001000001C900000020
-rw------- 1 postgresql postgresql 67108864 May 23 16:27 00000001000001C900000021
-rw------- 1 postgresql postgresql 67108864 May 23 16:28 00000001000001C900000022
-rw------- 1 postgresql postgresql 67108864 May 23 16:29 00000001000001C900000023
-rw------- 1 postgresql postgresql 67108864 May 23 16:30 00000001000001C900000024
-rw------- 1 postgresql postgresql 67108864 May 23 16:31 00000001000001C900000025
-rw------- 1 postgresql postgresql 67108864 May 23 16:33 00000001000001C900000026
-rw------- 1 postgresql postgresql 67108864 May 23 16:34 00000001000001C900000027
-rw------- 1 postgresql postgresql 67108864 May 23 16:35 00000001000001C900000028
-rw------- 1 postgresql postgresql 67108864 May 23 16:36 00000001000001C900000029
-rw------- 1 postgresql postgresql 67108864 May 23 16:37 00000001000001C90000002A
-rw------- 1 postgresql postgresql 67108864 May 23 16:42 00000001000001C90000002B
-rw------- 1 postgresql postgresql 67108864 May 23 16:45 00000001000001C90000002C
-rw------- 1 postgresql postgresql 67108864 May 23 16:46 00000001000001C90000002D
-rw------- 1 postgresql postgresql 67108864 May 23 16:46 00000001000001C90000002E
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C90000002F
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C900000030
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C900000031
-rw------- 1 postgresql postgresql 67108864 May 23 16:48 00000001000001C900000032
-rw------- 1 postgresql postgresql 67108864 May 23 16:48 00000001000001C900000033
-rw------- 1 postgresql postgresql 67108864 May 23 16:49 00000001000001C900000034
-rw------- 1 postgresql postgresql 67108864 May 23 16:49 00000001000001C900000035
-rw------- 1 postgresql postgresql 67108864 May 23 16:50 00000001000001C900000036
-rw------- 1 postgresql postgresql 67108864 May 23 16:51 00000001000001C900000037
-rw------- 1 postgresql postgresql 67108864 May 23 16:52 00000001000001C900000038
-rw------- 1 postgresql postgresql 67108864 May 23 16:53 00000001000001C900000039
-rw------- 1 postgresql postgresql 67108864 May 23 16:55 00000001000001C90000003A
-rw------- 1 postgresql postgresql 67108864 May 23 16:56 00000001000001C90000003B
-rw------- 1 postgresql postgresql 67108864 May 23 16:57 00000001000001C90000003C
-rw------- 1 postgresql postgresql 67108864 May 23 16:58 00000001000001C90000003D
-rw------- 1 postgresql postgresql 67108864 May 23 16:59 00000001000001C90000003E
-rw------- 1 postgresql postgresql 67108864 May 23 17:00 00000001000001CA00000000
-rw------- 1 postgresql postgresql 67108864 May 23 17:01 00000001000001CA00000001
-rw------- 1 postgresql postgresql 67108864 May 23 17:02 00000001000001CA00000002
-rw------- 1 postgresql postgresql 67108864 May 23 17:03 00000001000001CA00000003
-rw------- 1 postgresql postgresql 67108864 May 23 17:05 00000001000001CA00000004
-rw------- 1 postgresql postgresql 67108864 May 23 17:06 00000001000001CA00000005
-rw------- 1 postgresql postgresql 67108864 May 23 17:07 00000001000001CA00000006
-rw------- 1 postgresql postgresql 67108864 May 23 17:08 00000001000001CA00000007
-rw------- 1 postgresql postgresql 67108864 May 23 17:15 00000001000001CA00000008
-rw------- 1 postgresql postgresql 67108864 May 23 17:16 00000001000001CA00000009
-rw------- 1 postgresql postgresql 67108864 May 23 17:16 00000001000001CA0000000A
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000B
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000C
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000D
-rw------- 1 postgresql postgresql 67108864 May 23 17:18 00000001000001CA0000000E
-rw------- 1 postgresql postgresql 67108864 May 23 17:18 00000001000001CA0000000F
-rw------- 1 postgresql postgresql 67108864 May 23 17:19 00000001000001CA00000010
-rw------- 1 postgresql postgresql 67108864 May 23 17:19 00000001000001CA00000011
-rw------- 1 postgresql postgresql 67108864 May 23 17:20 00000001000001CA00000012
-rw------- 1 postgresql postgresql 67108864 May 23 17:21 00000001000001CA00000013
-rw------- 1 postgresql postgresql 67108864 May 23 17:22 00000001000001CA00000014
-rw------- 1 postgresql postgresql 67108864 May 23 17:23 00000001000001CA00000015
-rw------- 1 postgresql postgresql 67108864 May 23 17:24 00000001000001CA00000016
-rw------- 1 postgresql postgresql 67108864 May 23 17:25 00000001000001CA00000017
-rw------- 1 postgresql postgresql 67108864 May 23 17:26 00000001000001CA00000018
-rw------- 1 postgresql postgresql 67108864 May 23 17:27 00000001000001CA00000019
-rw------- 1 postgresql postgresql 67108864 May 23 17:29 00000001000001CA0000001A
-rw------- 1 postgresql postgresql 67108864 May 23 17:30 00000001000001CA0000001B
-rw------- 1 postgresql postgresql 67108864 May 23 17:31 00000001000001CA0000001C
-rw------- 1 postgresql postgresql 67108864 May 23 17:32 00000001000001CA0000001D
-rw------- 1 postgresql postgresql 67108864 May 23 17:33 00000001000001CA0000001E
-rw------- 1 postgresql postgresql 67108864 May 23 17:34 00000001000001CA0000001F

备注:注意近一个文件是 “00000001000001CA0000001F” , 再根据 “ postgres: startup process recovering 00000001000001CA0000001F “ 说明,PG已经恢复到最后一个日志文件了,猜测再等等,应该就恢复完成。大概过了10分钟左右, 后来发现PostgreSQL 果然停了,正好应验了前面的猜测,有点险。接着,启动数据库。

启动数据库

1
2
[postgresql@test_db pg_log]$ pg_ctl start -D $PGDATA  
server starting

运气不错,数据库可以正常启动。

连接测试

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
[postgresql@test_db pg_log]$ ps -ef | grep post  
501 5665 1 3 17:56 pts/3 00:00:00 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data
501 5666 5665 1 17:56 ? 00:00:00 postgres: logger process
501 5672 5665 0 17:56 ? 00:00:00 postgres: writer process
501 5673 5665 0 17:56 ? 00:00:00 postgres: wal writer process
501 5674 5665 0 17:56 ? 00:00:00 postgres: autovacuum launcher process
501 5675 5665 0 17:56 ? 00:00:00 postgres: stats collector process
501 5679 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43189) idle
501 5680 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43190) idle
501 5681 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43191) idle
501 5682 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43192) idle
501 5689 31953 0 17:56 pts/3 00:00:00 grep post

[postgresql@test_db pg_log]$ psql postgres postgresql
psql (9.0.3)
Type "help" for help.
postgres=> /l
List of databases
Name | Owner | Encoding | Collation | Ctype | Access privileges
-----------+--------+----------+-----------+-------+-------------------
postgres | mylog | UTF8 | C | C |
skytflog | skyppa | UTF8 | C | C |
template0 | mylog | UTF8 | C | C | =c/mylog +
| | | | | mylog=CTc/mylog
template1 | mylog | UTF8 | C | C | =c/mylog +
| | | | | mylog=CTc/mylog
(4 rows)

汗,总算缓了口气,数据库正常启动,数据还在,再一次向开发人员强调,以后不要通过”kill “ 命令来杀 PostgreSQL 进程。

原创文章,作者:306829225,如若转载,请注明出处:https://blog.ytso.com/236399.html

(0)
上一篇 2022年1月24日
下一篇 2022年1月24日

相关推荐

发表回复

登录后才能评论