查看文档中发现 9.5 版本新增数据取样功能,支持查询返回取样数据,这个功能吸引了我的注意,数据取样在很多应用场景都有用到。
TABLESAMPLE 取样方式
TABLESAMPLE 取样方式有两种:
- SYSTEM: 数据块级的数据取样,后面接取样参数,数据抽取返回以数据块为单位,理论上表上每个数据块被取样的机率是一样的。
- BERNOULLI: BERNOULLI 取样方式会扫描整张表,后面接取样参数,并且返回指定百分比抽样数据,理论上每条数据被取样的机率是一样的。
环境准备
创建测试表并插入 100 万数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
|
[pg95@db2 tf]$ psql fdb fdb psql (9.5alpha1) Type "help" for help. fdb=> create table test_sample(id int4,message text, create_time timestamp(6) without time zone default clock_timestamp()); CREATE TABLE fdb=> insert into test_sample(id,message) select n, md5(random()::text) from generate_series(1,1000000) n; INSERT 0 1000000 fdb=> select * from test_sample limit 3; id |message | create_time ----+----------------------------------+---------------------------- 1 | 049559e35471b27b713c66cacfc518ac | 2015-08-09 17:51:52.732066 2 | 3dba81d93b91769071fd2bff5b9d6373 | 2015-08-09 17:51:52.732432 3 | 38746c72d7d5ce80ed0c16311c93ee77 | 2015-08-09 17:51:52.732449
|
SYSTEM 取样: 取样因子 0.01
1 2 3 4 5 6 7
|
fdb=> explain analyze select * from test_sample TABLESAMPLE SYSTEM(0.01); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Sample Scan (system) on test_sample (cost=0.00..1.00 rows=100 width=45) (actual time=0.101..0.180 rows=107 loops=1) Planning time: 0.100 ms Execution time: 0.263 ms (3 rows)
|
备注:为什么返回是 107 条记录呢,往下看。
查看表占用 page
1 2 3 4 5 6 7 8 9 10
|
fdb=> select relname,relpages from pg_class where relname='test_sample'; relname | relpages -------------+---------- test_sample | 9346 fdb=> select ceil(1000000/9346::numeric); ceil ------ 107 (1 row)
|
备注: test_sample 表占用 9346 个数据块,每个数据块数据为 107 条。
查看表 ctid,
1 2 3 4 5 6 7 8 9
|
fdb=> select ctid, * from test_sample TABLESAMPLE SYSTEM(0.01); ctid |id |message | create_time ------------+--------+----------------------------------+---------------------------- (8887,1) | 950910 | e36fe0340ca717af13e50b9cef83441c | 2015-08-09 17:52:37.724168 (8887,2) | 950911 | 0fe2c556544556f9c89c6a51dc2b96eb | 2015-08-09 17:52:37.724281 ... 省略输出 (8887,107) | 951016 | 9ab7e88b5a2199f203f3668095d50d1d | 2015-08-09 17:52:37.725332 (107 rows)
|
备注:从上看出返回的数据都位于数据块 8887 中。
SYSTEM 方式不可以返回少于一个块的数据
1 2 3 4 5 6
|
fdb=> explain analyze select * from test_sample TABLESAMPLE SYSTEM(0.0001); QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Sample Scan (system) on test_sample (cost=0.00..0.01 rows=1 width=45) (actual time=0.466..0.542 rows=107 loops=1) Planning time: 0.868 ms Execution time: 0.682 ms
|
备注:取样因子设置成 0.0001,依然返回了一个 page 的数据,接着看 bernoulli 取样方式。
system 方式随机返回不同 page 的数据
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
|
fdb=> select ctid, * from test_sample TABLESAMPLE system (0.01) order by id desc limit 1 ; ctid |id |message | create_time ------------+--------+----------------------------------+---------------------------- (2939,107) | 314580 | 6469df3a7ec83782b61e5c125862e02c | 2015-08-09 17:52:00.637544 (1 row) fdb=> select ctid, * from test_sample TABLESAMPLE system (0.01) order by id desc limit 1 ; ctid |id |message | create_time ------------+--------+----------------------------------+--------------------------- (5587,107) | 597916 | 41a2769073b791cdbc7c972d6efdd9dc | 2015-08-09 17:52:10.77434 (1 row) fdb=> select ctid, * from test_sample TABLESAMPLE system (0.01) order by id desc limit 1 ; ctid |id |message | create_time -----------+-------+----------------------------------+---------------------------- (123,107) | 13268 | ba85b1b26ba94d9a686f00637dfa1879 | 2015-08-09 17:51:52.957365 (1 row)
|
备注:根据 ctid 看出返回的 page 是随机的。
BERNOULLI 取样: 取样因子 0.01
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
|
fdb=> explain analyze select * from test_sample TABLESAMPLE bernoulli(0.01) ; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------ Sample Scan (bernoulli) on test_sample (cost=0.00..9347.00 rows=100 width=45) (actual time=4.658..340.127 rows=106 loops=1) Planning time: 0.092 ms Execution time: 340.409 ms (3 rows) fdb=> select ctid, * from test_sample TABLESAMPLE bernoulli(0.01) ; ctid |id |message | create_time ------------+--------+----------------------------------+---------------------------- (89,73) | 9596 | 3389376ab71d7e89642953508e1791a1 | 2015-08-09 17:51:52.890982 (91,27) | 9764 | 5f23f8feb3cdc2c87856b17a7ad4db09 | 2015-08-09 17:51:52.89356 (177,49) | 18988 | 8d58679ada864ca245d1b3fe90afbdc6 | 2015-08-09 17:51:53.049145 ... 省略输出 (9152,14) | 979278 | 01b9d63735955da6432af360376daa6a | 2015-08-09 17:52:39.444338 (9227,94) | 987383 | 73ca23c5950d85a615ce8dcd300e2930 | 2015-08-09 17:52:40.680049 (98 rows)
|
备注: bernoulli 取样方式执行时间要 340 ms 左右,速度比 system 方式慢了很多,从 ctid 可看出返回的数据位于不同的数据块。
关注 bernoulli 的取样因子
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
|
fdb=> explain analyze select ctid, * from test_sample TABLESAMPLE bernoulli(1); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Sample Scan (bernoulli) on test_sample (cost=0.00..9446.00 rows=10000 width=51) (actual time=0.055..359.777 rows=10047 loops=1) Planning time: 0.137 ms Execution time: 367.271 ms (3 rows) fdb=> explain analyze select ctid, * from test_sample TABLESAMPLE bernoulli(2); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------- Sample Scan (bernoulli) on test_sample (cost=0.00..9546.00 rows=20000 width=51) (actual time=0.033..325.595 rows=19863 loops=1) Planning time: 0.071 ms Execution time: 336.098 ms (3 rows)
|
备注:bernoulliy 方式取样因子为 1 时返回约 10047 条数据,取样因子为 2 时返回约为 19863 条数据,取样因子正好是数据的百分比。
参考
原创文章,作者:6024010,如若转载,请注明出处:https://blog.ytso.com/tech/database/239638.html