ClickHouse tags a major release around once a month, an order of magnitude more often than comparable projects, and does so while maintaining speed and stability.

One of the drawbacks of a fast release cycle is that large developments are harder to introduce, especially if they can’t be broken down into smaller changes. ClickHouse handles this by introducing such features as experimental, meaning they are not ready for production use: they might not be fully implemented, the design might need more iteration, or they might require more testing or further review. Over time they may be promoted to stable features, replaced by a better alternative, or discarded.

Decimal and LowCardinality types were experimental in the past; Live Views and Window Functions are experimental today. A new addition to this list is projections, introduced in 21.6 by Amos Bird, which can be thought of as materialized views for table parts. A nice aspect about them is that the database decides automatically, based on the query, whether to use a projection and which one.

An example

To present an example with real data I took the GitHub events dataset and loaded the events from January and February 2020.

CREATE TABLE default.github_events
(
    `id` UInt32,
    `type` LowCardinality(String),
    `date` DateTime,
    `username` String,
    `repository` String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(date)
ORDER BY repository
SETTINGS index_granularity = 8192

Once fully merged, I have one partition per day, each internally ordered by repository, so the data for events in the same repository will be stored together on disk. If I wanted to know how many events happened around the PostGIS repository I could count them with:

SELECT count()
FROM github_events
WHERE repository = 'postgis/postgis'
Query id: e6f9c71e-eeaa-46af-9adc-0a40d512cd6e

┌─count()─┐
│     251 │
└─────────┘

1 rows in set. Elapsed: 0.008 sec. Processed 491.52 thousand rows, 15.80 MB (65.29 million rows/s., 2.10 GB/s.)

To compute the result, ClickHouse goes to each one of the table partitions (60), opens the block containing the repository names, and counts how many of them contain this exact string. Since the data is ordered, it doesn’t need to check all names; it can instead look for where 'postgis/postgis' starts and ends. As a result it explores only ~500k rows instead of the whole 113 million.
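One way to confirm this pruning is to ask ClickHouse which parts and granules the primary key selects. Recent versions support an indexes = 1 setting for EXPLAIN (availability may vary by version), which would look something like this:

EXPLAIN indexes = 1
SELECT count()
FROM github_events
WHERE repository = 'postgis/postgis'

The output lists, for each index (partition key, primary key), how many parts and granules were selected out of the total, making it easy to see that the ORDER BY is doing its job.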

On the other hand, if I wanted to know how many events were generated by my username, ClickHouse needs to read the username blocks completely, since the ORDER BY doesn’t help us in this query:

SELECT count()
FROM github_events
WHERE username = 'Algunenano'
Query id: 22286941-e0ce-48bf-85f7-426be78f6741

┌─count()─┐
│     243 │
└─────────┘

1 rows in set. Elapsed: 0.099 sec. Processed 113.09 million rows, 2.19 GB (1.15 billion rows/s., 22.19 GB/s.)

We will also need to read the complete blocks if we mix both columns, for example to know which repositories the users interacting with PostGIS also interact with:

SELECT
    repository,
    count() AS users,
    sum(events) AS events
FROM
(
    SELECT
        username,
        repository,
        count() AS events
    FROM github_events
    WHERE username IN (
        SELECT username
        FROM github_events
        WHERE (repository = 'postgis/postgis') AND (NOT endsWith(username, '[bot]'))
    )
    GROUP BY
        username,
        repository
)
GROUP BY repository
HAVING users > 1
ORDER BY
    users DESC,
    events DESC
LIMIT 20
Query id: f58b8edd-d11d-4650-bbc8-c6b4e275b3b9

┌─repository─────────────────┬─users─┬─events─┐
│ postgis/postgis            │    50 │    248 │
│ postgres/postgres          │     7 │      7 │
│ qgis/QGIS                  │     4 │     92 │
│ libgeos/geos               │     4 │     86 │
│ consbio/mbtileserver       │     4 │     15 │
│ postgis/docker-postgis     │     4 │      7 │
│ openmaptiles/openmaptiles  │     3 │    100 │
│ Toblerity/Shapely          │     3 │     10 │
│ geotools/geotools          │     3 │      8 │
│ golang/go                  │     3 │      6 │
│ OSGeo/PROJ                 │     3 │      5 │
│ mapbox/mapbox-gl-js        │     3 │      4 │
│ elastic/kibana             │     3 │      4 │
│ OSGeo/gdal                 │     3 │      4 │
│ appropriate/docker-postgis │     3 │      4 │
│ Turfjs/turf                │     3 │      3 │
│ domlysz/BlenderGIS         │     3 │      3 │
│ dwmkerr/hacker-laws        │     3 │      3 │
│ cli/cli                    │     3 │      3 │
│ h5bp/html5-boilerplate     │     3 │      3 │
└────────────────────────────┴───────┴────────┘

20 rows in set. Elapsed: 0.158 sec. Processed 113.58 million rows, 2.82 GB (717.33 million rows/s., 17.79 GB/s.)

One of the subqueries, the one looking for interactions with PostGIS, benefits from ordering by repository, while the subquery that counts events per username would benefit from ordering by username. There is no single sort order that improves both subqueries.

A possible solution could be to create a materialized view. With a second table, this time ordered by username, we would have the data ordered both by repository and by username and could use one or the other depending on the query. This has a clear drawback: you now need to remember which table to use depending on what’s best when writing the query.
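As a sketch, such a materialized view (with a hypothetical name, and using POPULATE to backfill the existing data; note that rows inserted while POPULATE runs would be missed) could look like this:

-- Hypothetical second copy of the table, ordered by username
CREATE MATERIALIZED VIEW github_events_by_user
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(date)
ORDER BY username
POPULATE
AS SELECT * FROM github_events;

-- The caller must remember to target it explicitly:
SELECT count() FROM github_events_by_user WHERE username = 'Algunenano';

The last line shows the drawback in practice: the query author has to know this second table exists and pick it by hand.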

Let’s see how we could address this with projections instead:

SET allow_experimental_projection_optimization = 1;
ALTER TABLE github_events ADD PROJECTION projection_user_sort ( SELECT * ORDER BY username );
ALTER TABLE github_events MATERIALIZE PROJECTION projection_user_sort;
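Once materialized, you can inspect the extra parts the projection created. Recent versions expose a system.projection_parts table that mirrors system.parts (the table name and columns may vary across releases), so something like the following would show the storage cost:

-- How much extra disk the projection uses, grouped by projection name
SELECT
    name,
    sum(bytes_on_disk) AS bytes
FROM system.projection_parts
WHERE (database = 'default') AND (table = 'github_events') AND active
GROUP BY name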

Now, the exact same query requires reading far fewer rows from disk:

Query id: 31167652-9629-4856-a6d0-f8dbb311b0a7

┌─repository─────────────────┬─users─┬─events─┐
│ postgis/postgis            │    50 │    248 │
│ postgres/postgres          │     7 │      7 │
│ qgis/QGIS                  │     4 │     92 │
│ libgeos/geos               │     4 │     86 │
│ consbio/mbtileserver       │     4 │     15 │
│ postgis/docker-postgis     │     4 │      7 │
│ openmaptiles/openmaptiles  │     3 │    100 │
│ Toblerity/Shapely          │     3 │     10 │
│ geotools/geotools          │     3 │      8 │
│ golang/go                  │     3 │      6 │
│ OSGeo/PROJ                 │     3 │      5 │
│ mapbox/mapbox-gl-js        │     3 │      4 │
│ elastic/kibana             │     3 │      4 │
│ OSGeo/gdal                 │     3 │      4 │
│ appropriate/docker-postgis │     3 │      4 │
│ Turfjs/turf                │     3 │      3 │
│ domlysz/BlenderGIS         │     3 │      3 │
│ dwmkerr/hacker-laws        │     3 │      3 │
│ cli/cli                    │     3 │      3 │
│ h5bp/html5-boilerplate     │     3 │      3 │
└────────────────────────────┴───────┴────────┘

20 rows in set. Elapsed: 0.081 sec. Processed 21.65 million rows, 559.36 MB (267.94 million rows/s., 6.92 GB/s.)

In the background ClickHouse will:

  • Find all the usernames that have interacted with the 'postgis/postgis' repository. For that it uses the regular table parts, ordered by repository, so it only needs to read ~0.5M rows.
  • Out of those usernames, discard the ones ending in [bot]. No extra disk reads needed.
  • For each remaining username (50), find which repositories they have interacted with. Using the projection parts, ordered by username, it reads only ~0.5M rows per lookup instead of the whole dataset: 50 usernames x ~0.5M rows = ~25M rows (if you go one by one).
  • Lastly, it groups everything together to count them. No extra reading required.
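One way to double-check these read counts, assuming query logging is enabled, is to look the query up in system.query_log using the query id printed by the client:

-- Verify rows and bytes read for a specific query
SELECT
    read_rows,
    formatReadableSize(read_bytes) AS read
FROM system.query_log
WHERE (type = 'QueryFinish')
  AND (query_id = '31167652-9629-4856-a6d0-f8dbb311b0a7')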

Without requiring any change from the caller, we’ve reduced the data read from disk by over 80%, at the cost of roughly doubling the storage used.

But wait, there is more. ClickHouse really shines when you know the kind of queries you are going to run. In the same way that we could keep an aggregation in a materialized view, we can create a projection that stores the relation between usernames and repositories:

ALTER TABLE github_events DROP PROJECTION projection_user_sort;
ALTER TABLE github_events ADD PROJECTION projection_user_repository (SELECT username, repository, count() GROUP BY username, repository);
ALTER TABLE github_events MATERIALIZE PROJECTION projection_user_repository;

When we execute the exact same query, ClickHouse detects that using the new projection makes sense and uses it. This automatic selection is the magical part that wasn’t available in ClickHouse before:

20 rows in set. Elapsed: 0.084 sec. Processed 17.01 million rows, 567.29 MB (202.03 million rows/s., 6.74 GB/s.)

Note that we could theoretically improve things further if the new projection were sorted to make later searches easier, but this isn’t currently supported. It is an experimental feature, after all.

Projections and Skipping Indexes

How do projections differ from other ClickHouse features such as skipping indexes? Indexes are more limited, since they are applied per granule ("the smallest indivisible data set that ClickHouse reads when selecting data") and only a few types are available, but depending on the query they can be equally powerful:

ALTER TABLE github_events DROP PROJECTION projection_user_repository;
ALTER TABLE github_events ADD INDEX username_bloom_index (username) TYPE bloom_filter GRANULARITY 1;
ALTER TABLE github_events MATERIALIZE INDEX username_bloom_index;
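To see which skipping indexes are defined on a table, recent versions provide a system.data_skipping_indices table (availability may vary by version):

-- List the skipping indexes defined on the table
SELECT name, type, expr, granularity
FROM system.data_skipping_indices
WHERE (database = 'default') AND (table = 'github_events')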

The result is the same with slightly more data read:

20 rows in set. Elapsed: 0.106 sec. Processed 28.68 million rows, 1.26 GB (269.53 million rows/s., 11.83 GB/s.)

Both skipping indexes and projections share the limitation that they are only available for the MergeTree family, and the benefit that they are applied to queries automatically. Since indexes can be much smaller, I would consider them the first option when optimizing the storage for a query, and only think of projections when a more complex transformation or aggregation is needed.

Projections and Materialized Views

Materialized views have some clear benefits over projections aside from the fact that they are already a stable feature: you can use them with any table engine and you can output several views to the same target table. With materialized views you can easily enrich the input rows with data from multiple sources.

Having a target table is both a benefit and a drawback of materialized views. For simple cases, like aggregations or transformations of the source table, having to change your queries or know there are materialized tables aside from the main one can be inconvenient, but as queries become more complex, hiding them behind a materialized table can be a very good thing.

One important distinction is that even though both projections and materialized views are populated after the data has been inserted into the main table, they handle failures of the applied query completely differently. A failure while adding data through a materialized view means that this data won’t be available in the target table. This leads to many headaches: some data may have been inserted while some failed, and recovering from this inconsistent state requires extra information.

Projections, on the other hand, are generated in the background and deal with failures silently. This is fine because queries can mix data from both projection parts and normal parts, and since the issue was caused by the projection’s query, you can expect to hit the same exception when you run that query directly later. I find this way of error handling preferable: the caller either gets a clear error from their SQL query or a full result with no missing data.