PostgreSQL9.5:SQL 新增 TABLESAMPLE 数据取样功能

查看文档中发现 9.5 版本新增数据取样功能,支持查询返回取样数据,这个功能吸引了我的注意,数据取样在很多应用场景都有用到。

TABLESAMPLE 取样方式

TABLESAMPLE 取样方式有两种:

  • SYSTEM: 数据块级的数据取样,后面接取样参数,数据抽取返回以数据块为单位,理论上表上每个数据块被取样的机率是一样的。
  • BERNOULLI: BERNOULLI 取样方式会扫描整张表,后面接取样参数,并且返回指定百分比抽样数据,理论上每条数据被取样的机率是一样的。

环境准备

创建测试表并插入 100 万数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
[pg95@db2 tf]$ psql fdb fdb  
psql (9.5alpha1)
Type "help" for help.

fdb=> create table test_sample(id int4,message text, create_time timestamp(6) without time zone default clock_timestamp());
CREATE TABLE

fdb=> insert into test_sample(id,message) select n, md5(random()::text) from generate_series(1,1000000) n;
INSERT 0 1000000

fdb=> select * from test_sample limit 3;
id |message | create_time
----+----------------------------------+----------------------------
1 | 049559e35471b27b713c66cacfc518ac | 2015-08-09 17:51:52.732066
2 | 3dba81d93b91769071fd2bff5b9d6373 | 2015-08-09 17:51:52.732432
3 | 38746c72d7d5ce80ed0c16311c93ee77 | 2015-08-09 17:51:52.732449

SYSTEM 取样: 取样因子 0.01

1
2
3
4
5
6
7
fdb=> explain analyze select  *  from test_sample TABLESAMPLE SYSTEM(0.01);  
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Sample Scan (system) on test_sample (cost=0.00..1.00 rows=100 width=45) (actual time=0.101..0.180 rows=107 loops=1)
Planning time: 0.100 ms
Execution time: 0.263 ms
(3 rows)

备注:为什么返回是 107 条记录呢,往下看。

查看表占用 page

1
2
3
4
5
6
7
8
9
10
fdb=>  select relname,relpages from pg_class where relname='test_sample';  
relname | relpages
-------------+----------
test_sample | 9346

fdb=> select ceil(1000000/9346::numeric);
ceil
------
107
(1 row)

备注: test_sample 表占用 9346 个数据块,每个数据块数据为 107 条。

查看表 ctid,

1
2
3
4
5
6
7
8
9
fdb=>  select ctid,  *  from test_sample TABLESAMPLE SYSTEM(0.01);  
ctid |id |message | create_time
------------+--------+----------------------------------+----------------------------
(8887,1) | 950910 | e36fe0340ca717af13e50b9cef83441c | 2015-08-09 17:52:37.724168
(8887,2) | 950911 | 0fe2c556544556f9c89c6a51dc2b96eb | 2015-08-09 17:52:37.724281
...
省略输出
(8887,107) | 951016 | 9ab7e88b5a2199f203f3668095d50d1d | 2015-08-09 17:52:37.725332
(107 rows)

备注:从上看出返回的数据都位于数据块 8887 中。

SYSTEM 方式不可以返回少于一个块的数据

1
2
3
4
5
6
fdb=> explain analyze select  *  from test_sample TABLESAMPLE SYSTEM(0.0001);  
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sample Scan (system) on test_sample (cost=0.00..0.01 rows=1 width=45) (actual time=0.466..0.542 rows=107 loops=1)
Planning time: 0.868 ms
Execution time: 0.682 ms

备注:取样因子设置成 0.0001,依然返回了一个 page 的数据,接着看 bernoulli 取样方式。

system 方式随机返回不同 page 的数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
fdb=>  select ctid,  *  from test_sample TABLESAMPLE system (0.01) order by id desc limit 1  ;  
ctid |id |message | create_time
------------+--------+----------------------------------+----------------------------
(2939,107) | 314580 | 6469df3a7ec83782b61e5c125862e02c | 2015-08-09 17:52:00.637544
(1 row)

fdb=> select ctid, * from test_sample TABLESAMPLE system (0.01) order by id desc limit 1 ;
ctid |id |message | create_time
------------+--------+----------------------------------+---------------------------
(5587,107) | 597916 | 41a2769073b791cdbc7c972d6efdd9dc | 2015-08-09 17:52:10.77434
(1 row)

fdb=> select ctid, * from test_sample TABLESAMPLE system (0.01) order by id desc limit 1 ;
ctid |id |message | create_time
-----------+-------+----------------------------------+----------------------------
(123,107) | 13268 | ba85b1b26ba94d9a686f00637dfa1879 | 2015-08-09 17:51:52.957365
(1 row)

备注:根据 ctid 看出返回的 page 是随机的。

BERNOULLI 取样: 取样因子 0.01

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
fdb=> explain analyze select  *  from test_sample TABLESAMPLE bernoulli(0.01)  ;  
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Sample Scan (bernoulli) on test_sample (cost=0.00..9347.00 rows=100 width=45) (actual time=4.658..340.127 rows=106 loops=1)
Planning time: 0.092 ms
Execution time: 340.409 ms
(3 rows)

fdb=> select ctid, * from test_sample TABLESAMPLE bernoulli(0.01) ;
ctid |id |message | create_time
------------+--------+----------------------------------+----------------------------
(89,73) | 9596 | 3389376ab71d7e89642953508e1791a1 | 2015-08-09 17:51:52.890982
(91,27) | 9764 | 5f23f8feb3cdc2c87856b17a7ad4db09 | 2015-08-09 17:51:52.89356
(177,49) | 18988 | 8d58679ada864ca245d1b3fe90afbdc6 | 2015-08-09 17:51:53.049145
...
省略输出
(9152,14) | 979278 | 01b9d63735955da6432af360376daa6a | 2015-08-09 17:52:39.444338
(9227,94) | 987383 | 73ca23c5950d85a615ce8dcd300e2930 | 2015-08-09 17:52:40.680049
(98 rows)

备注: bernoulli 取样方式执行时间要 340 ms 左右,速度比 system 方式慢了很多,从 ctid 可看出返回的数据位于不同的数据块。

关注 bernoulli 的取样因子

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
fdb=> explain analyze select ctid,  *  from test_sample TABLESAMPLE bernoulli(1);  
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Sample Scan (bernoulli) on test_sample (cost=0.00..9446.00 rows=10000 width=51) (actual time=0.055..359.777 rows=10047 loops=1)
Planning time: 0.137 ms
Execution time: 367.271 ms
(3 rows)

fdb=> explain analyze select ctid, * from test_sample TABLESAMPLE bernoulli(2);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Sample Scan (bernoulli) on test_sample (cost=0.00..9546.00 rows=20000 width=51) (actual time=0.033..325.595 rows=19863 loops=1)
Planning time: 0.071 ms
Execution time: 336.098 ms
(3 rows)

备注:bernoulliy 方式取样因子为 1 时返回约 10047 条数据,取样因子为 2 时返回约为 19863 条数据,取样因子正好是数据的百分比。

参考

原创文章,作者:6024010,如若转载,请注明出处:https://blog.ytso.com/tech/database/239638.html

(0)
上一篇 2022年2月12日 13:54
下一篇 2022年2月12日 13:54

相关推荐

发表回复

登录后才能评论