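As covered in the previous posts, the Master node does not store user data; user data is instead spread across the segment nodes according to certain rules. This post looks at how data is distributed on those segment nodes.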

1 Start with a model

The model below involves four tables (sale, customer, vendor, product). The first column of each table is its primary key, and the sale table has three foreign keys.
[Image: Greenplum Principles, Part 2: Physical Data Distribution]

Figure: Sample Database Star Schema
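In Greenplum, every table's data is split by a defined rule into non-overlapping parts, and those parts live on the individual nodes. The Master stores only the system tables, while the segment nodes hold all user data. The figure below shows how Greenplum distributes data physically.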
[Figure: physical data distribution in Greenplum]

2 Greenplum data distribution policies

The CREATE TABLE statement provides a distributed clause for setting a table's data distribution policy. There are two main policies.

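2.1 Hash distribution
With hash distribution, one or more columns of the table are designated as the distribution key, and rows with the same key value always land on the same segment node. Hash distribution is the default: if no policy is specified, the table's primary key (if one exists) or otherwise its first column is chosen as the distribution key.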
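
As a minimal sketch of the clause described above, the statements below declare a single-column and a two-column distribution key. The table names sale_by_customer and sale_by_pair and all column names are hypothetical, chosen only to echo the sample model; they are not taken from the original schema.

```sql
-- Hypothetical tables for illustration; column names are assumptions.

-- Single-column distribution key: all rows sharing a customer_id
-- are stored on the same segment.
CREATE TABLE sale_by_customer (
    sale_id     integer,
    customer_id integer,
    product_id  integer,
    amount      numeric(10,2)
)
DISTRIBUTED BY (customer_id);

-- Multi-column distribution key: the hash is computed over the
-- combination (customer_id, product_id).
CREATE TABLE sale_by_pair (
    sale_id     integer,
    customer_id integer,
    product_id  integer,
    amount      numeric(10,2)
)
DISTRIBUTED BY (customer_id, product_id);
```

Neither table declares a primary key, which sidesteps the question of how a key constraint interacts with the chosen distribution key.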

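2.2 Random distribution
With random distribution, rows are spread evenly across the segment nodes, assigned in round-robin fashion in the order they arrive, so rows with the same value do not necessarily end up on the same node. The official documentation considers hash distribution the better choice for performance.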
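
For comparison, a random-distribution table might be declared as in the sketch below. The table name test_random is hypothetical; it simply mirrors the structure of the test table used in the experiment section.

```sql
-- Hypothetical table for illustration: no distribution key is named,
-- so rows are not tied to a segment by their values.
CREATE TABLE test_random (id integer, name varchar(32))
DISTRIBUTED RANDOMLY;

-- Load the same sample data as the hash-distributed test table.
INSERT INTO test_random
SELECT generate_series(1, 1000), 'francs';
```

Because no key ties a row to a segment, a join or aggregation on id may need to move rows between segments at query time, which is one reason hash distribution is usually preferred when a good key exists.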

3 Experiment

3.1 Create an append-only test table and insert data

warehouse=# create table test_3 (id integer , name varchar(32))  
warehouse-# with ( appendonly=true )  
warehouse-# distributed by (id);  
CREATE TABLE
warehouse=# insert into test_3 select generate_series(1,1000),'francs';  
INSERT 0 1000

3.2 Check the data distribution across the nodes

warehouse=# select get_ao_distribution('test_3');  
get_ao_distribution  
---------------------  
(0,501)  
(1,499)  
(2 rows)

3.3 View the current GP configuration

warehouse=# select * from gp_configuration;  
content | definedprimary | dbid | isprimary | valid | hostname | port | datadir  
---------+----------------+------+-----------+-------+----------+-------+------------------------  
 -1 | t | 1 | t | t | gpmaster | 5432 | /opt/gp_data/gp-1  
 0 | t | 2 | t | t | gpnode1 | 50001 | /opt/gp_data/data/gp0  
 1 | t | 3 | t | t | gpnode2 | 50001 | /opt/gp_data/data/gp1  
 0 | f | 4 | f | t | gpnode2 | 60001 | /opt/gp_data/mdata/gp0  
 1 | f | 5 | f | t | gpnode1 | 60001 | /opt/gp_data/mdata/gp1  
(5 rows)

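As shown above, content identifies the portion of data held by an instance: "-1" is the gpmaster node, "0" is the first segment, and "1" is the second segment. A primary segment and its mirror segment share the same value. Going by the result in 3.2, the first segment holds 501 rows and the second holds 499.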
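
The per-segment counts can also be cross-checked without get_ao_distribution. The sketch below assumes your Greenplum version exposes the gp_segment_id system column on user tables; it groups the rows of test_3 by the segment that stores them.

```sql
-- Count the rows of test_3 stored on each segment.
-- gp_segment_id is assumed to be available as a system column;
-- its values should line up with the content column of the
-- primary segments in gp_configuration.
SELECT gp_segment_id, count(*) AS row_count
FROM test_3
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```

Unlike get_ao_distribution, this query is not limited to append-only tables, so it can also be used to look for skew in ordinary heap tables.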

3.4 Appendix: the content column in gp_configuration

The ID for the portion of data on an instance. A primary segment instance and its mirror will have the same content ID. For a segment the value is from 0-N, where N is the number of segments in Greenplum Database. For the master, the value is -1. The combination of content and definedprimary is the PRIMARY KEY.

3.5 Appendix: the get_ao_distribution function

Function	Return Type	Description
get_ao_distribution (oid,name)	Set of (dbid, tuplecount) rows	Shows the distribution of rows of an append-only table across the array. Returns a set of rows, each of which includes a segment dbid and the number of tuples stored on the segment.

Original article by carmelaweatherly. If you repost it, please credit the source: https://blog.ytso.com/236390.html
