The partition_depth_histogram in Snowflake's SYSTEM$CLUSTERING_INFORMATION function can be a bit confusing. I tried to break down your questions with examples:
1. Why is it always 17 buckets?
The number of buckets in the partition_depth_histogram is not always 17. It's actually a combination of two factors:
Fixed Buckets (0-16): The first 17 buckets (numbered from "00000" to "00016") represent a fixed range from 0 to 16 overlap depth with increments of 1. This provides detailed information about micro-partitions with very low overlap.
Dynamic Buckets (Larger than 16): For overlap depths exceeding 16, the buckets increase in size based on a doubling scheme. This means the next bucket would be "00032" (twice the size of the previous bucket) and so on. This approach efficiently represents the distribution of overlap depth for a wider range of values.
2. What is the relation between micro-partitions and buckets in the histogram?
The partition_depth_histogram doesn't directly map micro-partitions to specific buckets. Instead, it shows the number of micro-partitions that fall within a certain overlap depth range.
Bucket Value: Each bucket represents a specific overlap depth range. Value in the Bucket: The value associated with a bucket (e.g., 98 in "00032") indicates the number of micro-partitions in that specific overlap depth range.
3.What do the numbers in the histogram mean?
The numbers in the histogram represent the count of micro-partitions within a specific overlap depth range.
In your example:
There are 0 micro-partitions with an overlap depth of 0 ("00000"). There are 3 micro-partitions with an overlap depth between 2 and 3 ("00002"). There are 98 micro-partitions with an overlap depth between 32 and 64 ("00032"). There are 698 micro-partitions with an overlap depth exceeding 128 ("00128").
Please Note:
The provided example also shows a high average_depth (64.07) compared to the total number of micro-partitions (1156). This suggests that a significant portion of the data is accessed by multiple queries, potentially leading to good query performance due to data sharing.