Last time we have analyzed behavior in Oracle and found that Oracle treat values outside boundaries in less disruptive manner. (link)
What is the behavior for PostgreSQL optimizer to estimate matching rows for values beyond known boundary?
Setup demo table
create table t (d date, v int);
create index t_n1 on t(d);
Populate the data (2022-03-01 to 2022-03-31, each day with 10,000 rows)
insert into t
select '2022-03-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
generate_series as v
from generate_series(1,31*10000);
Analyze the statistics, and block autovacuum on this table
vacuum analyze t;
alter table t set (autovacuum_analyze_scale_factor=0,autovacuum_analyze_threshold=1000000000);
Populate the data (2022-04-01 to 2022-04-07, each day with 10,000 rows)
insert into t
select '2022-04-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
generate_series as v
from generate_series(1,7*10000);
select relname, relkind, reltuples, relpages
from pg_class where relname in ('t','t_n1');
relname | relkind | reltuples | relpages |
---|---|---|---|
t | r | 379928 | 1682 |
t_n1 | i | 310000 | 250 |
select attname,n_distinct,
array_to_string(most_common_vals, E'\n') as most_common_vals,
array_to_string(most_common_freqs, E'\n') as most_common_freqs
from pg_stats where tablename='t';
attname | n_distinct | most_common_vals | most_common_freqs |
---|---|---|---|
d | 31 | 2022-03-20 | 0.0344 |
2022-03-24 | 0.033833332 | ||
2022-03-19 | 0.0332 | ||
2022-03-29 | 0.0332 | ||
2022-03-23 | 0.033166666 | ||
2022-03-05 | 0.032833334 | ||
2022-03-17 | 0.03266667 | ||
2022-03-21 | 0.0326 | ||
2022-03-12 | 0.032566667 | ||
2022-03-06 | 0.032433335 | ||
2022-03-13 | 0.0324 | ||
2022-03-28 | 0.0324 | ||
2022-03-11 | 0.032233335 | ||
2022-03-15 | 0.032166667 | ||
2022-03-10 | 0.032133333 | ||
2022-03-25 | 0.032133333 | ||
2022-03-27 | 0.0321 | ||
2022-03-16 | 0.032066666 | ||
2022-03-18 | 0.03203333 | ||
2022-03-31 | 0.032 | ||
2022-03-14 | 0.031866666 | ||
2022-03-08 | 0.031833332 | ||
2022-03-26 | 0.031833332 | ||
2022-03-02 | 0.0318 | ||
2022-03-03 | 0.031733334 | ||
2022-03-04 | 0.031533334 | ||
2022-03-09 | 0.031433333 | ||
2022-03-22 | 0.031433333 | ||
2022-03-07 | 0.031366665 | ||
2022-03-01 | 0.0313 | ||
2022-03-30 | 0.0313 | ||
v | -1 | ||
(2 rows) |
Now test PostgreSQL optimizer estimation with different values for column "D"
Test 1: Within Boundary (D = 2022-03-15)
explain (analyze, verbose)
select count(*) from t where d = '2022-03-15'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-15'::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=284.84..284.85 rows=1 width=8) (actual time=1.259..1.260 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using t_n1 on public.t (cost=0.42..254.29 rows=12221 width=0) (actual time=0.026..0.758 rows=10000 loops=1)
Output: d
Index Cond: (t.d = '2022-03-15'::date)
Heap Fetches: 0
Planning Time: 0.149 ms
Execution Time: 1.317 ms
(8 rows)
Math: 379928 * 0.032166667 = 12221.017459976 ~ 12221 rows.
Test 2: At the Boundary (D = 2022-03-31)
explain (analyze, verbose)
select count(*) from t where d = '2022-03-31'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-31'::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=283.58..283.59 rows=1 width=8) (actual time=1.180..1.181 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using t_n1 on public.t (cost=0.42..253.19 rows=12158 width=0) (actual time=0.023..0.687 rows=10000 loops=1)
Output: d
Index Cond: (t.d = '2022-03-31'::date)
Heap Fetches: 0
Planning Time: 0.065 ms
Execution Time: 1.241 ms
(8 rows)
Math: 379928 * 0.032 = 12157.696 ~ 12158 rows.
Test 3: Outside the Boundary (D = 2022-04-01)
explain (analyze, verbose)
select count(*) from t where d = '2022-04-01'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-01'::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4.44..4.45 rows=1 width=8) (actual time=1.194..1.194 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using t_n1 on public.t (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..0.731 rows=10000 loops=1)
Output: d
Index Cond: (t.d = '2022-04-01'::date)
Heap Fetches: 0
Planning Time: 0.061 ms
Execution Time: 1.274 ms
(8 rows)
Test 4: Outside the Boundary further (D = 2022-04-07)
explain (analyze, verbose)
select count(*) from t where d = '2022-04-07'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-07'::date;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4.44..4.45 rows=1 width=8) (actual time=2.368..2.369 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using t_n1 on public.t (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..1.537 rows=10000 loops=1)
Output: d
Index Cond: (t.d = '2022-04-07'::date)
Heap Fetches: 0
Planning Time: 0.081 ms
Execution Time: 2.427 ms
(8 rows)
Test 5: Outside the Boundary even further (D = 2022-04-30)
explain (analyze, verbose)
select count(*) from t where d = '2022-04-30'::date;
mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-30'::date;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4.44..4.45 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
Output: count(*)
-> Index Only Scan using t_n1 on public.t (cost=0.42..4.44 rows=1 width=0) (actual time=0.015..0.015 rows=0 loops=1)
Output: d
Index Cond: (t.d = '2022-04-30'::date)
Heap Fetches: 0
Planning Time: 0.061 ms
Execution Time: 0.074 ms
(8 rows)
Observation:
The amount of estimated matching rows becomes 1
for values outside the low_value and high_value boundary.
Test | Value | Estimate Rows | Actual Rows |
---|---|---|---|
Test 1 | 2022-03-15 | 12221 | 10000 |
Test 2 | 2022-03-31 | 12158 | 10000 |
Test 3 | 2022-04-01 | 1 | 10000 |
Test 4 | 2022-04-07 | 1 | 10000 |
Test 5 | 2022-04-30 | 1 | 0 |
No comments:
Post a Comment