昨天开发人员跑来，说是他的测试环境数据库无法连接，下面是详细过程。

故障现象

登陆到数据库主机，执行 psql

1 2	[postgresql@test_db ~]$ psql psql: FATAL: the database system is in recovery mode

果然，连 psql 命令都不能执行，根据报错信息，知道数据库在做恢复。

排查过程

查看 PostgreSQL 后台进程

[postgresql@test_db ~]$ ps -ef | grep post  
501 26024 28829 4 17:34 ? 00:00:25 postgres: startup process recovering 00000001000001CA0000001F  
root 26476 25974 0 17:35 pts/0 00:00:00 su - postgresql  
501 28829 1 0 Apr27 ? 00:00:12 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data  
501 28830 28829 0 Apr27 ? 00:01:01 postgres: logger process  
root 31526 31408 0 17:44 pts/1 00:00:00 su - postgresql  
501 31656 31527 0 17:44 pts/1 00:00:00 grep post

备注：从上面输出，可以看到一个关键信息 “ recovering 00000001000001CA0000001F “ , 这说明数据库在做恢复，于是向开发人员了解一下情况，原来他正在执行一个 copy 操作，觉得太慢，后来在操作系统层面将 copy 进程 kill 了。原来如此。这就是PG 在做恢复的根本原因，俺再一次和开发人员强调，PG的进程不能通过 kill 来杀。。。

试图关闭数据库

1
2
3

[postgresql@test_db ~]$ pg_ctl stop -m fast  
waiting for server to shut down............................................................... failed  
pg_ctl: server does not shut down

很不幸，数据库无法正常关闭。

查看PostgreSQL 后台进程

[postgresql@test_db ~]$ ps -ef | grep post  
501 374 31527 0 17:46 pts/1 00:00:00 grep post  
501 26024 28829 3 17:34 ? 00:00:25 postgres: startup process recovering 00000001000001CA0000001F  
root 26476 25974 0 17:35 pts/0 00:00:00 su - postgresql  
501 28829 1 0 Apr27 ? 00:00:13 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data  
501 28830 28829 0 Apr27 ? 00:01:01 postgres: logger process  
root 31526 31408 0 17:44 pts/1 00:00:00 su - postgresql  
root 31952 31751 0 17:45 pts/3 00:00:00 su - postgresql
  
[postgresql@test_db ~]$ psql  
psql: FATAL: the database system is shutting down

数据库还是老样子，不能连接。

查看 Csvlog

2011-05-23 17:48:53.006 CST,"mylog","skytflog",1457,"192.168.171.39:55019",4dda2d85.5b1,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""  
2011-05-23 17:48:53.007 CST,"mylog","skytflog",1458,"192.168.171.39:55020",4dda2d85.5b2,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""  
2011-05-23 17:48:53.008 CST,"mylog","skytflog",1459,"192.168.171.39:55021",4dda2d85.5b3,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""  
2011-05-23 17:48:53.298 CST,"mylog","skytflog",1460,"xxx.xxx.xxx.xx:48985",4dda2d85.5b4,1,"",2011-05-23 17:48:53 CST,,0,FATAL,57P03,"the database system is shutting down",,,,,,,,,""

备注：CSV日志里有大量the database system is shutting down ，意思是 PostgreSQL 正在关闭，俺在想，是否应该先等等，或者数据库就关闭了。

查看数据库 WAL

[postgresql@test_db pg_xlog]$ ls -lrt  
total 4920304  
drwx------ 2 postgresql postgresql 4096 Feb 18 17:32 archive_status  
-rw------- 1 postgresql postgresql 67108864 May 23 15:55 00000001000001CA00000022  
-rw------- 1 postgresql postgresql 67108864 May 23 16:02 00000001000001CA0000002A  
-rw------- 1 postgresql postgresql 67108864 May 23 16:05 00000001000001CA00000021  
-rw------- 1 postgresql postgresql 67108864 May 23 16:06 00000001000001CA00000027  
-rw------- 1 postgresql postgresql 67108864 May 23 16:08 00000001000001CA00000026  
-rw------- 1 postgresql postgresql 67108864 May 23 16:16 00000001000001CA00000024  
-rw------- 1 postgresql postgresql 67108864 May 23 16:16 00000001000001CA00000025  
-rw------- 1 postgresql postgresql 67108864 May 23 16:20 00000001000001CA00000023  
-rw------- 1 postgresql postgresql 67108864 May 23 16:22 00000001000001CA00000028  
-rw------- 1 postgresql postgresql 67108864 May 23 16:23 00000001000001CA00000029  
-rw------- 1 postgresql postgresql 67108864 May 23 16:24 00000001000001CA00000020  
-rw------- 1 postgresql postgresql 67108864 May 23 16:25 00000001000001C90000001F  
-rw------- 1 postgresql postgresql 67108864 May 23 16:26 00000001000001C900000020  
-rw------- 1 postgresql postgresql 67108864 May 23 16:27 00000001000001C900000021  
-rw------- 1 postgresql postgresql 67108864 May 23 16:28 00000001000001C900000022  
-rw------- 1 postgresql postgresql 67108864 May 23 16:29 00000001000001C900000023  
-rw------- 1 postgresql postgresql 67108864 May 23 16:30 00000001000001C900000024  
-rw------- 1 postgresql postgresql 67108864 May 23 16:31 00000001000001C900000025  
-rw------- 1 postgresql postgresql 67108864 May 23 16:33 00000001000001C900000026  
-rw------- 1 postgresql postgresql 67108864 May 23 16:34 00000001000001C900000027  
-rw------- 1 postgresql postgresql 67108864 May 23 16:35 00000001000001C900000028  
-rw------- 1 postgresql postgresql 67108864 May 23 16:36 00000001000001C900000029  
-rw------- 1 postgresql postgresql 67108864 May 23 16:37 00000001000001C90000002A  
-rw------- 1 postgresql postgresql 67108864 May 23 16:42 00000001000001C90000002B  
-rw------- 1 postgresql postgresql 67108864 May 23 16:45 00000001000001C90000002C  
-rw------- 1 postgresql postgresql 67108864 May 23 16:46 00000001000001C90000002D  
-rw------- 1 postgresql postgresql 67108864 May 23 16:46 00000001000001C90000002E  
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C90000002F  
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C900000030  
-rw------- 1 postgresql postgresql 67108864 May 23 16:47 00000001000001C900000031  
-rw------- 1 postgresql postgresql 67108864 May 23 16:48 00000001000001C900000032  
-rw------- 1 postgresql postgresql 67108864 May 23 16:48 00000001000001C900000033  
-rw------- 1 postgresql postgresql 67108864 May 23 16:49 00000001000001C900000034  
-rw------- 1 postgresql postgresql 67108864 May 23 16:49 00000001000001C900000035  
-rw------- 1 postgresql postgresql 67108864 May 23 16:50 00000001000001C900000036  
-rw------- 1 postgresql postgresql 67108864 May 23 16:51 00000001000001C900000037  
-rw------- 1 postgresql postgresql 67108864 May 23 16:52 00000001000001C900000038  
-rw------- 1 postgresql postgresql 67108864 May 23 16:53 00000001000001C900000039  
-rw------- 1 postgresql postgresql 67108864 May 23 16:55 00000001000001C90000003A  
-rw------- 1 postgresql postgresql 67108864 May 23 16:56 00000001000001C90000003B  
-rw------- 1 postgresql postgresql 67108864 May 23 16:57 00000001000001C90000003C  
-rw------- 1 postgresql postgresql 67108864 May 23 16:58 00000001000001C90000003D  
-rw------- 1 postgresql postgresql 67108864 May 23 16:59 00000001000001C90000003E  
-rw------- 1 postgresql postgresql 67108864 May 23 17:00 00000001000001CA00000000  
-rw------- 1 postgresql postgresql 67108864 May 23 17:01 00000001000001CA00000001  
-rw------- 1 postgresql postgresql 67108864 May 23 17:02 00000001000001CA00000002  
-rw------- 1 postgresql postgresql 67108864 May 23 17:03 00000001000001CA00000003  
-rw------- 1 postgresql postgresql 67108864 May 23 17:05 00000001000001CA00000004  
-rw------- 1 postgresql postgresql 67108864 May 23 17:06 00000001000001CA00000005  
-rw------- 1 postgresql postgresql 67108864 May 23 17:07 00000001000001CA00000006  
-rw------- 1 postgresql postgresql 67108864 May 23 17:08 00000001000001CA00000007  
-rw------- 1 postgresql postgresql 67108864 May 23 17:15 00000001000001CA00000008  
-rw------- 1 postgresql postgresql 67108864 May 23 17:16 00000001000001CA00000009  
-rw------- 1 postgresql postgresql 67108864 May 23 17:16 00000001000001CA0000000A  
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000B  
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000C  
-rw------- 1 postgresql postgresql 67108864 May 23 17:17 00000001000001CA0000000D  
-rw------- 1 postgresql postgresql 67108864 May 23 17:18 00000001000001CA0000000E  
-rw------- 1 postgresql postgresql 67108864 May 23 17:18 00000001000001CA0000000F  
-rw------- 1 postgresql postgresql 67108864 May 23 17:19 00000001000001CA00000010  
-rw------- 1 postgresql postgresql 67108864 May 23 17:19 00000001000001CA00000011  
-rw------- 1 postgresql postgresql 67108864 May 23 17:20 00000001000001CA00000012  
-rw------- 1 postgresql postgresql 67108864 May 23 17:21 00000001000001CA00000013  
-rw------- 1 postgresql postgresql 67108864 May 23 17:22 00000001000001CA00000014  
-rw------- 1 postgresql postgresql 67108864 May 23 17:23 00000001000001CA00000015  
-rw------- 1 postgresql postgresql 67108864 May 23 17:24 00000001000001CA00000016  
-rw------- 1 postgresql postgresql 67108864 May 23 17:25 00000001000001CA00000017  
-rw------- 1 postgresql postgresql 67108864 May 23 17:26 00000001000001CA00000018  
-rw------- 1 postgresql postgresql 67108864 May 23 17:27 00000001000001CA00000019  
-rw------- 1 postgresql postgresql 67108864 May 23 17:29 00000001000001CA0000001A  
-rw------- 1 postgresql postgresql 67108864 May 23 17:30 00000001000001CA0000001B  
-rw------- 1 postgresql postgresql 67108864 May 23 17:31 00000001000001CA0000001C  
-rw------- 1 postgresql postgresql 67108864 May 23 17:32 00000001000001CA0000001D  
-rw------- 1 postgresql postgresql 67108864 May 23 17:33 00000001000001CA0000001E  
-rw------- 1 postgresql postgresql 67108864 May 23 17:34 00000001000001CA0000001F

备注：注意近一个文件是 “00000001000001CA0000001F” , 再根据 “ postgres: startup process recovering 00000001000001CA0000001F “ 说明，PG已经恢复到最后一个日志文件了，猜测再等等，应该就恢复完成。大概过了10分钟左右，后来发现PostgreSQL 果然停了，正好应验了前面的猜测，有点险。接着，启动数据库。

启动数据库

1 2	[postgresql@test_db pg_log]$ pg_ctl start -D $PGDATA server starting

运气不错，数据库可以正常启动。

连接测试

[postgresql@test_db pg_log]$ ps -ef | grep post  
501 5665 1 3 17:56 pts/3 00:00:00 /opt/postgresql-9.0.3/bin/postgres -D /opt/pg/data  
501 5666 5665 1 17:56 ? 00:00:00 postgres: logger process  
501 5672 5665 0 17:56 ? 00:00:00 postgres: writer process  
501 5673 5665 0 17:56 ? 00:00:00 postgres: wal writer process  
501 5674 5665 0 17:56 ? 00:00:00 postgres: autovacuum launcher process  
501 5675 5665 0 17:56 ? 00:00:00 postgres: stats collector process  
501 5679 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43189) idle  
501 5680 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43190) idle  
501 5681 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43191) idle  
501 5682 5665 0 17:56 ? 00:00:00 postgres: mylog skytflog xxx.xxx.xxx.xx(43192) idle  
501 5689 31953 0 17:56 pts/3 00:00:00 grep post

[postgresql@test_db pg_log]$ psql postgres postgresql  
psql (9.0.3)  
Type "help" for help.
postgres=> /l  
 List of databases  
 Name | Owner | Encoding | Collation | Ctype | Access privileges  
-----------+--------+----------+-----------+-------+-------------------  
postgres | mylog | UTF8 | C | C |  
skytflog | skyppa | UTF8 | C | C |  
template0 | mylog | UTF8 | C | C | =c/mylog +  
 | | | | | mylog=CTc/mylog  
template1 | mylog | UTF8 | C | C | =c/mylog +  
 | | | | | mylog=CTc/mylog  
(4 rows)

汗，总算缓了口气，数据库正常启动，数据还在，再一次向开发人员强调，以后不要通过”kill “ 命令来杀 PostgreSQL 进程。

原创文章，作者：306829225，如若转载，请注明出处：https://blog.ytso.com/236399.html

Copy 进程被 Kill 后数据库无法连接故障

故障现象

排查过程

试图关闭数据库

查看 Csvlog

查看数据库 WAL

启动数据库

连接测试

发表回复

Copy 进程被 Kill 后数据库无法连接故障

故障现象

排查过程

试图关闭数据库

查看 Csvlog

查看数据库 WAL

启动数据库

连接测试

相关推荐

发表回复