PostgreSQL estimate matching rows for values beyond known boundary (low_value and high_value)

Last time we have analyzed behavior in Oracle and found that Oracle treat values outside boundaries in less disruptive manner. (link)

What is the behavior for PostgreSQL optimizer to estimate matching rows for values beyond known boundary?

Setup demo table

create table t (d date, v int);
create index t_n1 on t(d);

Populate the data (2022-03-01 to 2022-03-31, each day with 10,000 rows)

insert into t 
select '2022-03-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
       generate_series as v
from generate_series(1,31*10000);

Analyze the statistics, and block autovacuum on this table

vacuum analyze t;
alter table t set (autovacuum_analyze_scale_factor=0,autovacuum_analyze_threshold=1000000000);

Populate the data (2022-04-01 to 2022-04-07, each day with 10,000 rows)

insert into t 
select '2022-04-01'::date + floor(generate_series/10000)*(interval '1 day') as d,
       generate_series as v
from generate_series(1,7*10000);

select relname, relkind, reltuples, relpages
from pg_class where relname in ('t','t_n1');

relname	relkind	reltuples	relpages
t	r	379928	1682
t_n1	i	310000	250

select attname,n_distinct,
       array_to_string(most_common_vals, E'\n') as most_common_vals, 
       array_to_string(most_common_freqs, E'\n') as most_common_freqs
from pg_stats where tablename='t';

attname	n_distinct	most_common_vals	most_common_freqs
d	31	2022-03-20	0.0344
		2022-03-24	0.033833332
		2022-03-19	0.0332
		2022-03-29	0.0332
		2022-03-23	0.033166666
		2022-03-05	0.032833334
		2022-03-17	0.03266667
		2022-03-21	0.0326
		2022-03-12	0.032566667
		2022-03-06	0.032433335
		2022-03-13	0.0324
		2022-03-28	0.0324
		2022-03-11	0.032233335
		2022-03-15	0.032166667
		2022-03-10	0.032133333
		2022-03-25	0.032133333
		2022-03-27	0.0321
		2022-03-16	0.032066666
		2022-03-18	0.03203333
		2022-03-31	0.032
		2022-03-14	0.031866666
		2022-03-08	0.031833332
		2022-03-26	0.031833332
		2022-03-02	0.0318
		2022-03-03	0.031733334
		2022-03-04	0.031533334
		2022-03-09	0.031433333
		2022-03-22	0.031433333
		2022-03-07	0.031366665
		2022-03-01	0.0313
		2022-03-30	0.0313
v	-1
(2 rows)

Now test PostgreSQL optimizer estimation with different values for column "D"

Test 1: Within Boundary (D = 2022-03-15)

explain (analyze, verbose) 
select count(*) from t where d = '2022-03-15'::date;

mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-15'::date;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=284.84..284.85 rows=1 width=8) (actual time=1.259..1.260 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..254.29 rows=12221 width=0) (actual time=0.026..0.758 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-03-15'::date)
         Heap Fetches: 0
 Planning Time: 0.149 ms
 Execution Time: 1.317 ms
(8 rows)

Math: 379928 * 0.032166667 = 12221.017459976 ~ 12221 rows.

Test 2: At the Boundary (D = 2022-03-31)

explain (analyze, verbose) 
select count(*) from t where d = '2022-03-31'::date;

mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-03-31'::date;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=283.58..283.59 rows=1 width=8) (actual time=1.180..1.181 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..253.19 rows=12158 width=0) (actual time=0.023..0.687 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-03-31'::date)
         Heap Fetches: 0
 Planning Time: 0.065 ms
 Execution Time: 1.241 ms
(8 rows)

Math: 379928 * 0.032 = 12157.696 ~ 12158 rows.

Test 3: Outside the Boundary (D = 2022-04-01)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-01'::date;

mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-01'::date;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=1.194..1.194 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..0.731 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-01'::date)
         Heap Fetches: 0
 Planning Time: 0.061 ms
 Execution Time: 1.274 ms
(8 rows)

Test 4: Outside the Boundary further (D = 2022-04-07)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-07'::date;

mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-07'::date;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=2.368..2.369 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.024..1.537 rows=10000 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-07'::date)
         Heap Fetches: 0
 Planning Time: 0.081 ms
 Execution Time: 2.427 ms
(8 rows)

Test 5: Outside the Boundary even further (D = 2022-04-30)

explain (analyze, verbose) 
select count(*) from t where d = '2022-04-30'::date;

mytest=> explain (analyze, verbose)
mytest-> select count(*) from t where d = '2022-04-30'::date;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.44..4.45 rows=1 width=8) (actual time=0.018..0.019 rows=1 loops=1)
   Output: count(*)
   ->  Index Only Scan using t_n1 on public.t  (cost=0.42..4.44 rows=1 width=0) (actual time=0.015..0.015 rows=0 loops=1)
         Output: d
         Index Cond: (t.d = '2022-04-30'::date)
         Heap Fetches: 0
 Planning Time: 0.061 ms
 Execution Time: 0.074 ms
(8 rows)

Observation:

The amount of estimated matching rows becomes 1 for values outside the low_value and high_value boundary.

Test	Value	Estimate Rows	Actual Rows
Test 1	2022-03-15	12221	10000
Test 2	2022-03-31	12158	10000
Test 3	2022-04-01	1	10000
Test 4	2022-04-07	1	10000
Test 5	2022-04-30	1	0

Donghua's Blog - DBAGlobe

Monday, March 7, 2022