1. 编译incubator-datasketches-cpp

Apache所有和Data Stream有关项目的列表：https://github.com/apache?utf8=%E2%9C%93&q=datasketches

编译datasketches-cpp项目

cd $P4_HOME
git clone --recursive https://github.com/apache/incubator-datasketches-cpp
cd $P4_HOME/incubator-datasketches-cpp
git checkout 2.0.x-incubating   ## the latest released active branch
rm -r build
mkdir build
cd build
cmake ..
make
make test
cd ../..

这是结果。

$ make test
Running tests...
Test project /Users/csqjxiao/P4/apache-incubator-datasketches-cpp/build
    Start 1: hll_test
1/6 Test #1: hll_test .........................   Passed   69.06 sec
  Start 2: cpc_test
2/6 Test #2: cpc_test .........................   Passed    2.50 sec
  Start 3: kll_test
3/6 Test #3: kll_test .........................   Passed    2.82 sec
    Start 4: fi_test
4/6 Test #4: fi_test ..........................   Passed    0.32 sec
    Start 5: theta_test
5/6 Test #5: theta_test .......................   Passed    0.34 sec
    Start 6: sampling_test
6/6 Test #6: sampling_test ....................   Passed    0.34 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) =  75.39 sec

在cmake ..的时候，也许会提示CMake版本太低的错误。

CMake Error at CMakeLists.txt:18 (cmake_minimum_required):
  CMake 3.12.0 or higher is required.  You are running version 3.5.1
  
-- Configuring incomplete, errors occurred!

按如下步骤升级CMake。去https://github.com/Kitware/CMake/releases下载cmake-3.12.0.tar.gz。这里，我并没有卸载旧版本的cmake，而是直接安装新版本，而且这样没有出现问题。

cd /tmp
wget https://github.com/Kitware/CMake/releases/download/v3.12.0/cmake-3.12.0.tar.gz
tar zxvf cmake-3.12.0.tar.gz
cd cmake-3.12.0
./bootstrap
./configure
make 
make test
sudo make install
cmake --version  ## 返回 cmake version 3.12.0

编译datasketches-cpp项目的Python Wrapper for Datasketches

来自：https://github.com/apache/incubator-datasketches-cpp/tree/master/python

首先安装python的相关依赖

## python2 dependencies on ubuntu
sudo apt-get install -y python-pip python-setuptools python-wheel python-numpy

# python3 dependencies on ubuntu
sudo apt-get install -y python3 python3-dev python3-pip 
sudo pip3 install setuptools numpy wheel 

sudo pip3.6 install setuptools numpy wheel

# install tox: a generic virtualenv management and test command line tool you can use for:
# https://tox.readthedocs.io/en/latest/
sudo pip install tox zipp
sudo pip3 install tox zipp
sudo pip3.6 install tox zipp

# I am curious about the tox version
tox --version
# 3.15.2 imported from /home/p4-basic/.local/lib/python3.5/site-packages/tox/__init__.py
which tox
# /home/p4-basic/.local/bin/tox

然后安装python bindings。

cd $P4_HOME/incubator-datasketches-cpp
git submodule update --init --recursive

# setup
#sudo python setup.py build  ## error ！！
#sudo python3 setup.py build   ## error ！！
sudo python3.6 setup.py build  # install binding for python 3.6 on ubuntu 16.04

python bindings成功安装到python3.6环境下。然后用tox框架执行单元测试。

cd $P4_HOME/incubator-datasketches-cpp
sudo rm -r .tox # remove old tox stuffs

# python Unit tests
sudo tox


=========================================================================================== test session starts ============================================================================================
platform linux -- Python 3.6.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
cachedir: .tox/py3/.pytest_cache
rootdir: /home/p4-basic/P4/incubator-datasketches-cpp
collected 14 items                                                                                                                                                                                         
cpc_test.py .                                                                                                                                                                                        [  7%]
fi_test.py ..                                                                                                                                                                                        [ 21%]
hll_test.py ...                                                                                                                                                                                      [ 42%]
kll_test.py ...                                                                                                                                                                                      [ 64%]
theta_test.py ..                                                                                                                                                                                     [ 78%]
vector_of_kll_test.py ..                                                                                                                                                                             [ 92%]
vo_test.py .                                                                                                                                                                                         [100%]

=========================================================================================== 14 passed in 18.59s ============================================================================================
_________________________________________________________________________________________________ summary __________________________________________________________________________________________________
  py3: commands succeeded
  congratulations :)

2. 编译incubator-datasketches-postgresql

This code requires C++11. It was tested with GCC 4.8.5 (standard in RedHat at the time of this writing), GCC 8.2.0, GCC 9.2.0, Apple LLVM version 10.0.1 (clang-1001.0.46.4) and version 11.0.0 (clang-1100.0.33.8).

cd $P4_HOME
git clone https://github.com/apache/incubator-datasketches-postgresql
cd $P4_HOME/incubator-datasketches-postgresql
ln -sf ../incubator-datasketches-cpp datasketches-cpp
make clean
make
sudo make install

On Ubuntu系统，sudo make install 可能会报如下的错误。这是因为super user是安全保护的，不允许启动的时候访问/usr/local/目录下的内容。

make: pg_config: Command not found

按照下面的方式安装就没问题了。

$ su
Password: 
$ source /etc/profile
$ make install
/bin/mkdir -p '/usr/local/pgsql/lib'
/bin/mkdir -p '/usr/local/pgsql/share/extension'
/bin/mkdir -p '/usr/local/pgsql/share/extension'
...
$ exit

如果报如下错误，这是因为postgresql是从apt上安装的，内部缺省使用的llvm版本是6.0。切记Ubuntu操作系统一定要从源代码安装postgresql。

cd '/usr/lib/postgresql/12/lib/bitcode' && /usr/lib/llvm-6.0/bin/llvm-lto -thinlto -thinlto-action=thinlink -o datasketches.index.bc datasketches/src/global_hooks.bc datasketches/src/base64.bc datasketches/src/common.bc datasketches/src/kll_float_sketch_pg_functions.bc datasketches/src/kll_float_sketch_c_adapter.bc datasketches/src/cpc_sketch_pg_functions.bc datasketches/src/cpc_sketch_c_adapter.bc datasketches/src/theta_sketch_pg_functions.bc datasketches/src/theta_sketch_c_adapter.bc datasketches/src/frequent_strings_sketch_pg_functions.bc datasketches/src/frequent_strings_sketch_c_adapter.bc datasketches/src/hll_sketch_pg_functions.bc datasketches/src/hll_sketch_c_adapter.bc
error: can't create module summary index for buffer: Expected a single module
LLVM ERROR: ThinLink didn't create an index

On MacOSX Mojave, if you see a warning like this:
```
clang: warning: no such sysroot directory: ‘/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk’ [-Wmissing-sysroot]
```
and the compilation fails because of not being able to find system include files, this is a known OSX problem. There are known solutions on the Internet.

3. Verifying installation of datasketches-postgresql with a test database

Make sure that PostgreSQL is running.

On Ubuntu, start the service:

$ su - postgres -c "source /etc/profile; pg_ctl -l logfile restart"
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started

$ ps -e | grep postgres
 2318 pts/1    00:00:00 postgres
 2320 ?        00:00:00 postgres
...

On Homebrew on MacOSX, start the service:
```
pg_start
```
Now we check the status of postgres with:
```
show-pg-status
```

Create a test database if it does not exist yet (on the command line):

$ su postgres
$ createdb test

Run the client (console) using the test database:

$ psql test

Create datasketches extension in the test database:

# create extension datasketches;

Try some of the datasketches functions:

# select cpc_sketch_to_string(cpc_sketch_build(1));

You should see the following result:

       cpc_sketch_to_string        
-----------------------------------
 ### CPC sketch summary:          +
    lg_k           : 8            +
    seed hash      : 93cc         +
    C              : 1            +
    flavor         : 1            +
    merged         : false        +
    HIP estimate   : 1            +
    kxp            : 255.5        +
    intresting col : 0            +
    table entries  : 1            +
    window         : not allocated+
 ### End sketch summary           +
 
(1 row)

从逻辑上看，schema，table，都是位于database之下。

在test数据库下面，建立test_table表格。

# create table test_table(id integer);
CREATE TABLE
# /q

然后，再在其他的数据库下进行查看：

$ psql test
psql (11.8)
Type "help" for help.

# SELECT * FROM information_schema.tables WHERE table_name='test_table';
 table_catalog | table_schema | table_name | table_type | self_referencing_column_name | reference_generation | user_defined_type_catalog | user_defined_type_schema | user_defined_type_name | is_insertable_into | is_typed | commit_action 
---------------+--------------+------------+------------+------------------------------+----------------------+---------------------------+--------------------------+------------------------+--------------------+----------+---------------
(0 rows)

# /q

4. Advanced Examples of datasketches-postgresql

This datasketches-postgresql module currently supports the following sketches:

CPC (Compressed Probabilistic Counting) sketch – very compact (smaller than HLL when serialized) distinct-counting sketch
Theta sketch – distinct counting with set operations (intersection, a-not-b)
HLL sketch – very compact distinct-counting sketch based on HyperLogLog algorithm
KLL float quantiles sketch – for estimating distributions: quantile, rank, PMF (histogram), CDF
Frequent strings sketch – capture the heaviest items (strings) by count or by some other weight

We will perform experiments on these sketches algorithms.

准备测试数据

注意：整个实验的测试数据非常的占磁盘空间，差不多要5GB多空间。实验结束以后要释放磁盘空间，只需要执行drop table test;的SQL命令即可。

连接postgresql客户端。

su - postgres -c "psql test"

执行如下的sql语句。

create table if not exists random_ints_100m(id integer);

-- list tables under database `test`
/dt

-- show the schema of table `random_ints_100m`
/d random_ints_100m

-- delete all rows in table random_ints_100m
delete from random_ints_100m;

-- tall numbers between 1 and 100 (step 1 by default)
insert into random_ints_100m(id) 
    select generate_series(1,100000000) 
ON CONFLICT DO NOTHING;

会看到下面的执行结果。

test=# /dt  
              List of relations
 Schema |       Name       | Type  |  Owner   
--------+------------------+-------+----------
 public | random_ints_100m | table | postgres
 public | test_table       | table | postgres
(2 rows)

test=# /d random_ints_100m
          Table "public.random_ints_100m"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          |

Distinct counting with CPC sketch

Suppose 100 million random integer values uniformly distributed in the range from 1 to 100M have been generated and inserted into a table named random_ints_100m.

Exact count distinct:

$ time psql test -c "select count(distinct id) from random_ints_100m"
   count   
-----------
 100000000
(1 row)

real	0m57.413s

Approximate count distinct:

$ time psql test -c "select cpc_sketch_distinct(id) from random_ints_100m"
 cpc_sketch_distinct 
---------------------
    100773880.411596
(1 row)

real	0m13.832s

Note that the above one-off distinct count is just to show the basic usage. Most importantly, the sketch can be used as an “additive” distinct count metric in a data cube.

重进刚才的测试数据库

su - postgres -c "psql test"

然后，创建cpc_sketch_test测试表格。

create table if not exists cpc_sketch_test(sketch cpc_sketch);

-- delete all rows in table cpc_sketch_test
delete from cpc_sketch_test;

insert into cpc_sketch_test select cpc_sketch_build(1);
insert into cpc_sketch_test select cpc_sketch_build(2);
insert into cpc_sketch_test select cpc_sketch_build(3);
insert into cpc_sketch_test select cpc_sketch_build(4);
insert into cpc_sketch_test select cpc_sketch_build(5);
insert into cpc_sketch_test select cpc_sketch_build(1);
insert into cpc_sketch_test select cpc_sketch_build(2);
select cpc_sketch_get_estimate(cpc_sketch_union(sketch)) from cpc_sketch_test;

看到如下输出

 cpc_sketch_get_estimate 
-------------------------
        5.00162840932184
(1 row)

或者可以查看cpc_sketch_test表格的所有行

# select * from cpc_sketch_test;
                      sketch                      
--------------------------------------------------
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/p9AAAAAAAAA8D/uAgAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/p9AAAAAAAAA8D86DAAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/59AAAAAAAAA8D+9KgAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/p9AAAAAAAAA8D/aCAAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/59AAAAAAAAA8D+1JAAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/p9AAAAAAAAA8D/uAgAA
 CAEQCwAOzJMBAAAAAQAAAAAAAAAA/p9AAAAAAAAA8D86DAAA
(7 rows)

# /q  // quit

Distinct counting with HLL sketch

See above for the exact distinct count of 100 million random integers

Approximate distinct count:

$ time psql test -c "select hll_sketch_distinct(id) from random_ints_100m"
 hll_sketch_distinct 
---------------------
    100509990.715786
(1 row)

real	0m14.204s

Note that the above one-off distinct count is just to show the basic usage. Most importantly, the sketch can be used as an “additive” distinct count metric in a data cube.

Aggregate union:

create table if not exists hll_sketch_test(sketch hll_sketch);

-- delete all rows in table hll_sketch
delete from hll_sketch;

insert into hll_sketch_test select hll_sketch_build(1);
insert into hll_sketch_test select hll_sketch_build(2);
insert into hll_sketch_test select hll_sketch_build(3);
insert into hll_sketch_test select hll_sketch_build(4);
insert into hll_sketch_test select hll_sketch_build(5);
insert into hll_sketch_test select hll_sketch_build(1);
insert into hll_sketch_test select hll_sketch_build(2);
insert into hll_sketch_test select hll_sketch_union(hll_sketch_build(1), hll_sketch_build(2));
select hll_sketch_get_estimate(hll_sketch_union(sketch)) from hll_sketch_test;

看到如下输出

 hll_sketch_get_estimate 
-------------------------
        5.00000004967054
(1 row)

Non-aggregate union:

select hll_sketch_get_estimate(hll_sketch_union(hll_sketch_build(1), hll_sketch_build(2)));

 hll_sketch_get_estimate 
-------------------------
        2.00000000496705

插入一行，其sketch等于所有之前sketch的合并。

insert into hll_sketch_test select hll_sketch_union(sketch) from hll_sketch_test;
select hll_sketch_get_estimate(hll_sketch_union(sketch)) from hll_sketch_test;

 hll_sketch_get_estimate 
-------------------------
        5.00000004967054
(1 row)

Distinct counting with Theta sketch

See above for the exact distinct count of 100 million random integers

Approximate distinct count:

$ time psql test -c "select theta_sketch_distinct(id) from random_ints_100m"

 theta_sketch_distinct 
-----------------------
      100719190.365712
(1 row)

real	0m14.163s

Note that the above one-off distinct count is just to show the basic usage. Most importantly, the sketch can be used as an “additive” distinct count metric in a data cube.

Aggregate union:

create table if not exists theta_sketch_test(sketch theta_sketch);

-- delete all rows in table theta_sketch_test
delete from theta_sketch_test;

insert into theta_sketch_test select theta_sketch_build(1);
insert into theta_sketch_test select theta_sketch_build(2);
insert into theta_sketch_test select theta_sketch_build(3);
insert into theta_sketch_test select theta_sketch_build(4);
insert into theta_sketch_test select theta_sketch_build(5);
insert into theta_sketch_test select theta_sketch_build(1);
insert into theta_sketch_test select theta_sketch_build(2);
insert into theta_sketch_test select theta_sketch_union(sketch) from theta_sketch_test;

select theta_sketch_get_estimate(theta_sketch_union(sketch)) from theta_sketch_test;
 theta_sketch_get_estimate 
---------------------------
                         5
(1 row)

Non-aggregate set operations:

create table if not exists theta_set_op_test(sketch1 theta_sketch, sketch2 theta_sketch);
insert into theta_set_op_test select theta_sketch_build(1), theta_sketch_build(1);
insert into theta_set_op_test select theta_sketch_build(1), theta_sketch_build(2);

select theta_sketch_get_estimate(theta_sketch_union(sketch1, sketch2)) from theta_set_op_test;

 theta_sketch_get_estimate 
---------------------------
                         1
                         2
(2 rows)

select theta_sketch_get_estimate(theta_sketch_intersection(sketch1, sketch2)) from theta_set_op_test;

 theta_sketch_get_estimate 
---------------------------
                         1
                         0
(2 rows)

select theta_sketch_get_estimate(theta_sketch_a_not_b(sketch1, sketch2)) from theta_set_op_test;

 theta_sketch_get_estimate 
---------------------------
                         0
                         1
(2 rows)

Estimating quanitles, ranks and histograms with KLL sketch

Table “normal” has 1 million values from the normal (Gaussian) distribution with mean=0 and stddev=1.

连接postgresql客户端。

su - postgres -c "psql test"

执行如下的sql语句。

参考网页：https://www.postgresql.org/docs/current/tablefunc.html

Function	Returns	Description
`normal_rand(int numvals, float8 mean, float8 stddev)`	`setof float8`	Produces a set of normally distributed random values

参考网页：https://www.postgresql.org/docs/11/datatype-numeric.html

Name	Storage Size	Description	Range
`smallint`	2 bytes	small-range integer	-32768 to +32767
`integer`	4 bytes	typical choice for integer	-2147483648 to +2147483647
`bigint`	8 bytes	large-range integer	-9223372036854775808 to +9223372036854775807
`decimal`	variable	user-specified precision, exact	up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
`numeric`	variable	user-specified precision, exact	up to 131072 digits before the decimal point; up to 16383 digits after the decimal point
`real`	4 bytes	variable-precision, inexact	6 decimal digits precision
`double precision`	8 bytes	variable-precision, inexact	15 decimal digits precision
`smallserial`	2 bytes	small autoincrementing integer	1 to 32767
`serial`	4 bytes	autoincrementing integer	1 to 2147483647
`bigserial`	8 bytes	large autoincrementing integer	1 to 9223372036854775807

create extension tablefunc;

-- drop table normal;

create table if not exists normal(value real);

-- list tables under database `test`
/dt

-- show the schema of table `random_ints_100m`
/d normal

-- delete all rows in table normal
delete from normal;

insert into normal(value) 
    select normal_rand(1000000,0,1) 
ON CONFLICT DO NOTHING;

We can build a sketch, which represents the distribution:

create table if not exists kll_float_sketch_test(sketch kll_float_sketch);
insert into kll_float_sketch_test select kll_float_sketch_build(value) from normal;

INSERT 0 1

We expect the value with rank 0.5 (median) to be approximately 0:

select kll_float_sketch_get_quantile(sketch, 0.5) from kll_float_sketch_test;

 kll_float_sketch_get_quantile 
-------------------------------
                   -0.00732369
(1 row)

In reverse: we expect the rank of value 0 (true median) to be approximately 0.5:

select kll_float_sketch_get_rank(sketch, 0) from kll_float_sketch_test;

 kll_float_sketch_get_rank 
---------------------------
                  0.501778
(1 row)

Getting several quantiles at once:

select kll_float_sketch_get_quantiles(sketch, ARRAY[0, 0.25, 0.5, 0.75, 1]) from kll_float_sketch_test;

          kll_float_sketch_get_quantiles           
---------------------------------------------------
 {-5.08401,-0.681126,-0.00732369,0.671863,4.94608}
(1 row)

Getting the probability mass function (PMF):

select kll_float_sketch_get_pmf(sketch, ARRAY[-2, -1, 0, 1, 2]) from kll_float_sketch_test;

                kll_float_sketch_get_pmf                
--------------------------------------------------------
 {0.023444,0.138741,0.339593,0.33886,0.136574,0.022788}
(1 row)

The ARRAY[-2, -1, 0, 1, 2] of 5 split points defines 6 intervals (bins): (-inf,-2), [-2,-1), [-1,0), [0,1), [1,2), [2,inf). The result is 6 estimates of probability mass in these bins (fractions of input values that fall into the bins). These fractions can be transformed to counts (histogram) by scaling them by the factor of N (the total number of input values), which can be obtained from the sketch:

select kll_float_sketch_get_n(sketch) from kll_float_sketch_test;

 kll_float_sketch_get_n 
------------------------
                1000000
(1 row)

In this simple example we know the value of N since we constructed this sketch, but in a general case sketches are merged across dimensions of data hypercube, so the vale of N is not known in advance.

Note that the normal distribution was used just to show the basic usage. The sketch does not make any assumptions about the distribution.

Let’s create two more sketches to show merging kll_float_sketch:

insert into kll_float_sketch_test select kll_float_sketch_build(value) from normal;
insert into kll_float_sketch_test select kll_float_sketch_build(value) from normal;
select kll_float_sketch_get_quantile(kll_float_sketch_merge(sketch), 0.5) from kll_float_sketch_test;

 kll_float_sketch_get_quantile 
-------------------------------
                   -0.00719804
(1 row)

Frequent strings

这位教授在自己主页上列出了各种的分布函数模拟工具：https://www.csee.usf.edu/~kchriste/tools/toolpage.html 。我们把这些工具都下载到该文档中。

The following programs generate a times series with a given probability distribution. A summary of the properties of key distributions is here.

gendet.c – Deterministic

genunifc.c – Uniform (continuous)

genunifd.c – Uniform (discrete)

genpeak.c – Peaked

gennorm.c – Normal

genexp.c – Exponential

gengeo.c – Geometric

genpois.c – Poisson

genbin.c – Binomial

generl.c – Erlang

genhyp1.c – Hyperexponential for lambdas and p1

genhyp2.c – Hyperexponential for mean lambda and CoV

genipp.c – Interrupted Poisson Process (IPP)

genpar1.c – Pareto

genpar2.c – Bounded Pareto

genzipf.c – Zipf

genemp.c – Empirical

genuniq.c – Unique random integers

Consider a numeric Zipfian distribution with parameter alpha=1.1 (high skew) and range of 2^{13}, so that the number 1 has the highest frequency, the number 2 appears substantially less frequently and so on. Suppose zipf_1p1_8k_100m table has 100 million random values drawn from such a distribution, and the values are converted to strings.

用以下命令生成100万条zipf分布的数据。执行比较耗时，差不多一个小时生成好。

$ gcc -o genzipf genzipf.c -lm
$ ./genzipf
---------------------------------------- genzipf.c ----- 
-     Program to generate Zipf random variables        - 
-------------------------------------------------------- 
Output file name ===================================> zipf.dat
Random number seed (greater than 0) ================> 12
Alpha value ========================================> 1.1
N value ============================================> 8192
Number of values to generate =======================> 100000000
-------------------------------------------------------- 
-  Generating samples to file                          - 
-------------------------------------------------------- 
-------------------------------------------------------- 
-  Done! 
-------------------------------------------------------- 

#删除每行开头的空格键和TAB键
# $ sed 's/^[ /t]*//g' zipf.dat

#删除每行结尾的空格键和TAB键
$ sed 's/[ /t]*$//g' zipf.dat > zipf2.dat

$ mv zipf2.dat zipf.dat

$ wc -l zipf.dat
100000000 zipf.dat

$ more zipf.dat
1 
47 
1 
23 
...

连接postgresql客户端。

su - postgres -c "psql test"

执行如下的sql语句。

用下面的语句将CSV或者TXT格式的数据导入zipf_1p1_8k_100m表格。

-- drop table zipf_1p1_8k_100m;

create table if not exists zipf_1p1_8k_100m(value varchar);

-- remove old data
delete from zipf_1p1_8k_100m; 

copy zipf_1p1_8k_100m from '/path/to/zipf.dat' DELIMITER ' ';
-- copy zipf_1p1_8k_100m from '/home/p4-basic/P4/zipf.dat' DELIMITER ' ';

Here, about how to import a txt file into PostgreSQL.
参考：https://www.postgresql.org/docs/current/sql-copy.html

COPY is the SQL keyword.

table_name is the name of the table that you want to put the data into. (This is not intuitive if you just look at the syntax.)

FROM is another SQL keyword.

Then you have to specify the filename and the location of the file that you want to copy the data from between apostrophes.

And eventually you have to specify the field separator in your original file by typing DELIMITER and the field separator itself between apostrophes. So in this case ' ' means that the delimiter would be a space.

Suppose the goal is to get the most frequent strings from this table. In terms of the frequent items sketch we have to chose a threshold. Let’s try to capture values that repeat more than 1 million times, or more than 1% of the 100 million entries in the table. According to the error table, frequent items sketch of size 2^{9} must capture all values more frequent then about 0.7% of the input.

The following query is to build a sketch with lg_k=9 and get results with estimated weight above 1 million using “no false negatives” policy. The output format is: value, estimate, lower bound, upper bound.

$ time psql test -c "select frequent_strings_sketch_result_no_false_negatives(frequent_strings_sketch_build(9, value), 1000000) from zipf_1p1_8k_100m"
 
 frequent_strings_sketch_result_no_false_negatives 
---------------------------------------------------
 (1,15326588,15206626,15326588)
 (2,7152554,7032592,7152554)
 (3,4575255,4455293,4575255)
 (4,3334065,3214103,3334065)
 (5,2609685,2489723,2609685)
 (6,2137227,2017265,2137227)
 (7,1804014,1684052,1804014)
 (8,1557923,1437961,1557923)
 (9,1366476,1246514,1366476)
 (10,1217327,1097365,1217327)
 (11,1096711,976749,1096711)
(11 rows)

real	0m42.888s

Here is an equivalent exact computation:

$ time psql test -c "select value, weight from (select value, count(*) as weight from zipf_1p1_8k_100m group by value) t where weight > 1000000 order by weight desc"

 value |  weight  
-------+----------
 1     | 15326588
 2     |  7152554
 3     |  4575255
 4     |  3334065
 5     |  2609685
 6     |  2137227
 7     |  1804014
 8     |  1557923
 9     |  1366476
 10    |  1217327
 11    |  1096711
(11 rows)

real	0m31.810s

In this particular case the exact computation happens to be faster. This is just to show the basic usage. Most importantly, the sketch can be used as an “additive” metric in a data cube, and can be easily merged across dimensions.

Merging frequent_strings_sketch:

create table if not exists frequent_strings_sketch_test(sketch frequent_strings_sketch);
insert into frequent_strings_sketch_test select frequent_strings_sketch_build(9, value) from zipf_1p1_8k_100m;
insert into frequent_strings_sketch_test select frequent_strings_sketch_build(9, value) from zipf_1p1_8k_100m;
insert into frequent_strings_sketch_test select frequent_strings_sketch_build(9, value) from zipf_1p1_8k_100m;
select frequent_strings_sketch_result_no_false_negatives(frequent_strings_sketch_merge(9, sketch), 3000000) from frequent_strings_sketch_test;


  frequent_strings_sketch_result_no_false_negatives 
---------------------------------------------------
 (1,45979764,45619878,45979764)
 (2,21457662,21097776,21457662)
 (3,13725765,13365879,13725765)
 (4,10002195,9642309,10002195)
 (5,7829055,7469169,7829055)
 (6,6411681,6051795,6411681)
 (7,5412042,5052156,5412042)
 (8,4673769,4313883,4673769)
 (9,4099428,3739542,4099428)
 (10,3651981,3292095,3651981)
 (11,3290133,2930247,3290133)
(11 rows)

原创文章，作者：ItWorker，如若转载，请注明出处：https://blog.ytso.com/280267.html

datasketches-postgresql 安装手册

1. 编译incubator-datasketches-cpp

编译datasketches-cpp项目

编译datasketches-cpp项目的Python Wrapper for Datasketches

2. 编译incubator-datasketches-postgresql

3. Verifying installation of datasketches-postgresql with a test database

4. Advanced Examples of datasketches-postgresql

准备测试数据

Distinct counting with CPC sketch

Distinct counting with HLL sketch

Distinct counting with Theta sketch

Estimating quanitles, ranks and histograms with KLL sketch

Frequent strings

发表回复

datasketches-postgresql 安装手册

1. 编译incubator-datasketches-cpp

编译datasketches-cpp项目

编译datasketches-cpp项目的Python Wrapper for Datasketches

2. 编译incubator-datasketches-postgresql

3. Verifying installation of datasketches-postgresql with a test database

4. Advanced Examples of datasketches-postgresql

准备测试数据

Distinct counting with CPC sketch

Distinct counting with HLL sketch

Distinct counting with Theta sketch

Estimating quanitles, ranks and histograms with KLL sketch

Frequent strings

相关推荐

发表回复