PostgreSQL的checkpoint简析

一、Checkpoint简介

官方文档对于checkpoint的描述：

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint.
At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the log file. (The change records were previously flushed to the WAL files.) 
In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the log (known as the redo record) from which it should start the REDO operation.
Any changes made to data files before that point are guaranteed to be already on disk.
Hence, after a checkpoint, log segments preceding the one containing the redo record are no longer needed and can be recycled or removed. (When WAL archiving is being done, the log segments must be archived before being recycled or removed.)

简单来说，checkpoint就是一个事务顺序的记录点，checkpoint主要是进行刷脏页，redo时会参考checkpoint进行日志回放。除了刷脏之外还会更新一些位点信息，清理一些不再需要的wal。

下图分为part1-4，4个部分描述checkpoint的触发条件，以及触发后进行的操作等。

二、Checkpoint的触发条件

如图Part1：

在PostgreSQL中Checkpoint是由checkpointer进程执行的，大致的逻辑是这样子的。Checkpointer进程的主流程是一个无条件的for循环，在未触发checkpoint时一直在WaitLatch中sleep，也就是在epoll_wait中观察list链表，查看是否有事件句柄已经就绪（某个条件在触发checkpoint）；

如果已经存在就绪事件，则wake up（通过SetLatch中write pipe的方式wake up），执行checkpoint。

哪些条件会触发checkpoint呢？

Checkpoint是由一些flag来触发的，这些flag并不只是单独作用，大多情况下是根据场景多个flag进行或运算组合为ckpt_flags

根据触发方式flag可以分为两种：

1、checkpointer进程本身通过checkpoint_timeout触发

#define CHECKPOINT_CAUSE_TIME    0x0100    /* Elapsed time */

2、其他进程向checkpointer发送信号触发：

#define CHECKPOINT_IS_SHUTDOWN    0x0001    /* Checkpoint is for shutdown */
主要场景：数据库shutdown时

其它进程调用RequestCheckpoint向checkpointer进程发送SIGINT信号触发

如图Part2：

Step1：修改共享内存CheckpointerShmem->ckpt_flags，传入对应的flags

Step2：向checkpointer进程发送SIGINT信号，唤醒进程

#define CHECKPOINT_END_OF_RECOVERY    0x0002    /* Like shutdown checkpoint, but  issued at end of WAL recovery */
主要场景：startup进程StartupXlog完成时

#define CHECKPOINT_IMMEDIATE    0x0004    /* Do it without delays */
主要场景：当postgres为standalone backend模式请求checkpoint时；Basebackup执行备份时

#define CHECKPOINT_FORCE        0x0008    /* Force even if no activity */
主要场景：手动执行checkpoint命令；standby实例进行promote时

#define CHECKPOINT_FLUSH_ALL    0x0010    /* Flush all pages, including those belonging to unlogged tables */
主要场景：drop database或者create database后

#define CHECKPOINT_CAUSE_XLOG    0x0040    /* XLOG consumption */
主要场景：wal新增数量大于等于CheckPointSegments – 1时，默认参数下大致是42。

在9.5后CheckPointSegments不再是一个单独参数，根据max_wal_size_mb和checkpoint_completion_target参数联动。
CalculateCheckpointSegments函数中计算CheckPointSegments = max_wal_size_mb/(wal_segment_size/(1024*1024))/(1.0 + CheckPointCompletionTarget)
                      = 1024 / (16777216/(1024*1024))/1.5 ≈ 43
XLogCheckpointNeeded函数中判断新增wal数量大于等于CheckPointSegments – 1, 满足时函数返回true，表示需要进行checkpoint。
有时checkpoint比较频繁会提示需要增大max_wal_size，根据计算公式，被除数max_wal_size越大，则CheckPointSegments越大，checkpoint的间隔就越大。

三、Checkpoint会做什么

如图Part4：

表示的是checkpoint触发后，createcheckpoint实际的工作内容

1、Flush Dirty Pages，刷脏，这里不展开了；

2、Update some points，更新XlogCtl和ControlFile，并持久化至pg_control文件；

    /*
     * Update the control file.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (shutdown)
        ControlFile->state = DB_SHUTDOWNED;
    ControlFile->checkPoint = ProcLastRecPtr;
    ControlFile->checkPointCopy = checkPoint;
    ControlFile->time = (pg_time_t) time(NULL);
    /* crash recovery should always recover to the end of WAL */
    ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
    ControlFile->minRecoveryPointTLI = 0;
    /*
     * Persist unloggedLSN value. It's reset on crash recovery, so this goes
     * unused on non-shutdown checkpoints, but seems useful to store it always
     * for debugging purposes.
     */
    SpinLockAcquire(&XLogCtl->ulsn_lck);
    ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
    SpinLockRelease(&XLogCtl->ulsn_lck);
    /*更新pg_control文件*/
    UpdateControlFile();
    LWLockRelease(ControlFileLock);

3、Remove old wal，计算两次checkpoint间的wal数量进行回收重用，并清理不再需要的wal

/*
     * Update the average distance between checkpoints if the prior checkpoint
     * exists.
     */
    if (PriorRedoPtr != InvalidXLogRecPtr)
     /*根据ptr偏移量，预估出两次checkpoint间产生的wal量CheckPointDistanceEstimate*/
        UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
    /*
     * Delete old log files, those no longer needed for last checkpoint to
     * prevent the disk holding the xlog from growing full.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
/*根据min{wal_keep_segments, min(replication_slot.restart_lsn)}计算出_logSegNo，比_logSegNo早的日志后续将会被清理掉*/
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
/*首先根据CheckPointDistanceEstimate 结合一套公式，计算出开始回收重用的recycleSegNo，从这个日志开始回收重用（wal_recycle默认开启，主要是保留日志并rename为新的序列号，回收一个序列号加一）*/
/*然后将_logSegNo之前并已经归档（如果开启归档）的wal都清理掉*/
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

这里就能解释，为什么在没有任何异常的情况下，wal实际保留个数总是大于wal_keep_segments，在remove old wal时已经recycle了一部分了。

四、checkpoint skipped机制

每次Checkpoint都会进行刷脏、清理wal？

如图Part3：

并不是，当system idle时触发checkpoint，会进入checkpoint skipped逻辑，函数中直接return，跳过刷脏、清理wal等步骤；这里将这个机制描述为checkpoint skipped。

System idle：这里可以理解为上次到本次checkpoint之间没有wal写入

看下这里的逻辑：

/*
     * If this isn't a shutdown or forced checkpoint, and if there has been no
     * WAL activity requiring a checkpoint, skip it.  The idea here is to
     * avoid inserting duplicate checkpoints when the system is idle.
     */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (last_important_lsn == ControlFile->checkPoint)
        {
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg('checkpoint skipped because system is idle')));
            return;
        }
    }

在CreateCheckPoint时，如果checkpointFlag不是CHECKPOINT_FORCE（手动执行checkpoint）或者CHECKPOINT_IS_SHUTDOWN，当满足last_important_lsn == ControlFile->checkPoint时，则直接return（当日志级别大于等于DEBUG1会打印checkpoint skipped信息），不进行后续操作。

着重来看if条件的左右值：

1. last_important_lsn

last_important_lsn = WALInsertLocks[lockno].l.lastImportantAt

在写wal时，XLogInsertRecord函数中更新lastImportantAt为wal开始写入的location

WALInsertLocks[lockno].l.lastImportantAt = StartPos;

2. ControlFile->checkPoint

共享内存成员ControlFile->checkPoint的更新位于checkpoint skipped代码块之后，在完成checkpoint操作后，会更新ControlFile并进行持久化（写入pg_control文件）

ControlFile->checkPoint = ProcLastRecPtr;

这里两个变量等值成立的条件大概又可能是什么？controlfile.checkpoint读取的是上次checkpoint完成后的值，wal写入点是当前正在写wal的位置，那么就是说wal写入点一直未更新，也就是说数据库未进行写操作。

当两次checkpoint间没有写操作时，刷脏和清理wal都是不需要的，看起来checkpoint skipped机制是比较合理的。

不过，在特定场景下，还是有些隐患的，需要手动维护下。

特殊场景：

实例持续大并发数据写入，wal归档速度相对较慢，一段时间后停止写入。这时可能会发现wal累积的比较多，甚至远超于保留策略范围，导致磁盘容量告急。由于后续没有写wal的操作，因此每次checkpoint_timeout触发checkpoint后，会进入checkpoint skipped机制，一直不会清理wal，哪怕是归档已经完成。

这个时候就需要手动做一次checkpoint，也就是CHECKPOINT_FORCE的方式触发，是不会进入checkpoint skipped机制的。

五、如何记录checkpoint

打开checkpoint日志，设置log_checkpoints=on;

当触发checkpoint时pglog中会记录两条信息:

一条记录触发的flag，由LogCheckpointStart函数完成。

/*
 * Log start of a checkpoint.
 */
static void
LogCheckpointStart(int flags, bool restartpoint)
{
    elog(LOG, '%s starting:%s%s%s%s%s%s%s%s',
         restartpoint ? 'restartpoint' : 'checkpoint',
         (flags & CHECKPOINT_IS_SHUTDOWN) ? ' shutdown' : '',
         (flags & CHECKPOINT_END_OF_RECOVERY) ? ' end-of-recovery' : '',
         (flags & CHECKPOINT_IMMEDIATE) ? ' immediate' : '',
         (flags & CHECKPOINT_FORCE) ? ' force' : '',
         (flags & CHECKPOINT_WAIT) ? ' wait' : '',
         (flags & CHECKPOINT_CAUSE_XLOG) ? ' wal' : '',
         (flags & CHECKPOINT_CAUSE_TIME) ? ' time' : '',
         (flags & CHECKPOINT_FLUSH_ALL) ? ' flush-all' : '');
}

另外一条记录checkpoint做了什么，刷了多少脏块，新增/清理/回收了多少wal等，由LogCheckpointEnd函数完成。

/*
 * Log end of a checkpoint.
 */
static void
LogCheckpointEnd(bool restartpoint)
{
    /* ............*/
    elog(LOG, '%s complete: wrote %d buffers (%.1f%%); '
         '%d WAL file(s) added, %d removed, %d recycled; '
         'write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s; '
         'sync files=%d, longest=%ld.%03d s, average=%ld.%03d s; '
         'distance=%d kB, estimate=%d kB',
         restartpoint ? 'restartpoint' : 'checkpoint',
         CheckpointStats.ckpt_bufs_written,
         (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
         CheckpointStats.ckpt_segs_added,
         CheckpointStats.ckpt_segs_removed,
         CheckpointStats.ckpt_segs_recycled,
         write_secs, write_usecs / 1000,
         sync_secs, sync_usecs / 1000,
         total_secs, total_usecs / 1000,
         CheckpointStats.ckpt_sync_rels,
         longest_secs, longest_usecs / 1000,
         average_secs, average_usecs / 1000,
         (int) (PrevCheckPointDistance / 1024.0),
         (int) (CheckPointDistanceEstimate / 1024.0));
}

例如这次由CHECKPOINT_CAUSE_XLOG触发的checkpoint记录：

2021-10-10 20:55:08.044 CST,,,10801,,615da38b.2a31,41,,2021-10-06 21:24:27 CST,,0,LOG,00000,'checkpoint starting: wal',,,,,,,,,''
2021-10-10 20:55:18.058 CST,,,10801,,615da38b.2a31,42,,2021-10-06 21:24:27 CST,,0,LOG,00000,'checkpoint complete: wrote 5776 buffers (35.3%); 0 WAL file(s) added, 0 removed, 41 recycled; write=9.147 s, sync=0.565 s, total=10.013 s; sync files=7, longest=0.333 s, average=0.080 s; distance=691976 kB, estimate=691976 kB',,,,,,,,,''

如果开启了log_checkpoints，日志中并未记录checkpoint信息，大概率是触发了checkpoint skipped机制，可以将log_min_messages配置为debug1，观察日志是否打印’checkpoint skipped because system is idle’。

六、checkpoint是否正常

1、可以通过系统函数查看执行时间等

Nick postgres=# select * from pg_control_checkpoint();
-[ RECORD 1 ]--------+-------------------------
checkpoint_lsn       | 18/39FD6C88
redo_lsn             | 18/39FD6C50
redo_wal_file        | 000000010000001800000039
timeline_id          | 1
prev_timeline_id     | 1
full_page_writes     | t
next_xid             | 0:1927987
next_oid             | 51061
next_multixact_id    | 1
next_multi_offset    | 0
oldest_xid           | 479
oldest_xid_dbid      | 1
oldest_active_xid    | 1927987
oldest_multi_xid     | 1
oldest_multi_dbid    | 1
oldest_commit_ts_xid | 0
newest_commit_ts_xid | 0
checkpoint_time      | 2021-10-10 22:15:33+08

2、pg_controldata 工具解析pg_control文件，根据结果分析

3、pstack,gdb,strace观察checkpointer进程是否正常

原创文章，作者：奋斗，如若转载，请注明出处：https://blog.ytso.com/237353.html