# Practical ClickHouse for real-time analytics

Recently, we read this great post from Haki Benita on how to do lots of operations like pivot tables, subtotals, linear regression, binning, or interpolation on Postgres when coming from a Pandas background.

Postgres is great as an OLTP database, but if you need real-time queries on hundreds of millions of rows, it’s not going to be fast enough. ClickHouse is. Most queries from Postgres will look very similar on ClickHouse. Here we’ll show you how to adapt most of the Postgres queries from Haki’s post to run on ClickHouse.

## Common table expressions

ClickHouse supports CTEs. Both the `WITH <expression> AS <identifier>` and the `WITH <identifier> AS <subquery expression>` syntaxes are supported.
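For example (two sketches of ours, using the `numbers` table function rather than the snippets from the original post):

```sql
-- WITH <expression> AS <identifier>
WITH (SELECT max(number) FROM numbers(10)) AS max_n
SELECT number, max_n
FROM numbers(3);

-- WITH <identifier> AS <subquery expression>
WITH top_n AS (
    SELECT number FROM numbers(10) ORDER BY number DESC LIMIT 3
)
SELECT * FROM top_n;
```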

### On Tinybird

For now, we only support the `WITH <expression> AS <identifier>` syntax, so the previous queries would have to be rewritten like this:
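A minimal sketch of ours with that first syntax:

```sql
-- the scalar subquery is evaluated once and bound to a name
WITH (SELECT avg(number) FROM numbers(10)) AS mean_n
SELECT number - mean_n AS centered
FROM numbers(3);
```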

There’s a difference between CTEs on Postgres and on ClickHouse. In Postgres, as the original post says, “CTEs are a great way to split a big query into smaller chunks, perform recursive queries and even to cache intermediate results”. On ClickHouse, CTEs can only return one row, so those intermediate results can’t have multiple rows. For a similar result, on ClickHouse you have to use subqueries.

On Tinybird, pipes act like notebooks where each node is a subquery and you can refer to the results of one node in another node. It’s great to see intermediate results and reduce the complexity of your queries. If you’d like to try it out, sign up here.

## Generating data

As you can see in the original article, in Postgres there are several ways to generate data:

### Union all

This works the same in ClickHouse as in Postgres:
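For instance, a small hand-built table (our sketch):

```sql
SELECT 1 AS n, 'red' AS v
UNION ALL
SELECT 2, 'green'
UNION ALL
SELECT 3, 'blue';
```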

The `VALUES` keyword won’t work on ClickHouse to select data, only to insert it.

### Unnest - arrayJoin

`arrayJoin` is the ClickHouse equivalent of `unnest` on Postgres. So this Postgres query:
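A sketch of what such a Postgres query might look like (our reconstruction, not the original snippet):

```sql
SELECT unnest(array[1, 2, 3]) AS n;
```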

Would be rewritten on ClickHouse like this:
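For example (our sketch):

```sql
SELECT arrayJoin([1, 2, 3]) AS n;
```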

### Generate_series

The `generate_series` function doesn’t exist on ClickHouse, but with the `numbers` function we can accomplish a lot as well. This is its basic usage:
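A quick sketch:

```sql
SELECT number FROM numbers(5);      -- rows 0, 1, 2, 3, 4
SELECT number FROM numbers(10, 5);  -- with (offset, count): rows 10 to 14
```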

A similar result can be obtained with the `range` function, which returns arrays. If we only provide one argument, it behaves like `numbers`. And with `range` we can also specify a `start`, `end` and `step`:
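For instance:

```sql
SELECT range(5);         -- [0, 1, 2, 3, 4]
SELECT range(0, 10, 2);  -- [0, 2, 4, 6, 8]
```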

This, combined with `arrayJoin`, lets us do the same as `generate_series`:
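A sketch of ours:

```sql
SELECT arrayJoin(range(1, 10, 2)) AS n;  -- rows 1, 3, 5, 7, 9
```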

### Generating date ranges

`generate_series` can produce results with types other than integers, while `range` only outputs integers. But with some smart logic we can achieve the same results. For example, on Postgres you’d generate a series with a datetime for each hour in a day this way, as in the original post:
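Something along these lines (our reconstruction, not the original snippet):

```sql
SELECT *
FROM generate_series(
    '2021-01-01 00:00'::timestamp,
    '2021-01-01 23:00'::timestamp,
    interval '1 hour'
);
```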

#### Generate a time series specifying the start date and the number of intervals

On ClickHouse, you can achieve the same this way:
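A sketch of ours, adding `number` hours (as seconds) to a start date:

```sql
SELECT toDateTime('2021-01-01 00:00:00') + number * 3600 AS hour
FROM numbers(24);
```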

#### Generate a time series specifying the start and end date and the step

Another way of doing the same thing:
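For example, converting the dates to Unix timestamps so that `range` can step over them (our sketch):

```sql
SELECT toDateTime(arrayJoin(range(
    toUInt32(toDateTime('2021-01-01 00:00:00')),
    toUInt32(toDateTime('2021-01-02 00:00:00')),
    3600
))) AS hour;
```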

#### Generate a time series using timeSlots

Using the `timeSlots` function, we can specify the start (DateTime), duration (seconds) and step (seconds), and it generates an array of DateTime values.
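For instance (our sketch, combined with `arrayJoin` to turn the array into rows):

```sql
SELECT arrayJoin(
    timeSlots(toDateTime('2021-01-01 00:00:00'), toUInt32(86400), toUInt32(3600))
) AS hour;
```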

## Random

The `rand` function in ClickHouse is akin to `random` in Postgres, with the difference that `rand` returns a random UInt32 number between 0 and 4294967295. So to get random floats between 0 and 1 like `random` does, you have to divide the result by 4294967295.

To get more than one row, you’d simply do:
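A sketch of ours:

```sql
SELECT rand() / 4294967295 AS r
FROM numbers(5);
```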

### Generating random discrete values from a range of integers

You would use the `floor` or `ceil` function (not `round`, for the reasons explained here), multiplying the result of `rand` by the max of the range of integers you want to generate, like this:
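For example (our sketch):

```sql
SELECT floor(rand() / 4294967295 * 3) AS n  -- 0, 1 or 2
FROM numbers(10);
```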

And here you can see that the distribution is uniform (this wouldn’t happen if you had used `round`):
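A quick check of ours: each bucket should get roughly a third of the rows.

```sql
SELECT floor(rand() / 4294967295 * 3) AS n, count() AS c
FROM numbers(100000)
GROUP BY n
ORDER BY n;
```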

### Random choice from a list of values

This is how you’d take samples with replacement from a list in Postgres:

In ClickHouse, this is how you’d do it:
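A sketch of ours, picking a random element by index (ClickHouse arrays are 1-based):

```sql
SELECT ['red', 'green', 'blue'][(rand() % 3) + 1] AS color
FROM numbers(5);
```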

To get only one value, you’d remove the `FROM numbers(5)` part. Note that to define an array on ClickHouse you can either call `array('red', 'green', 'blue')` or write `['red', 'green', 'blue']`, as in the code snippet.

## Sampling

Sorting data by `rand()` can be used to get a random sample, like here:
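For instance (`my_table` is a hypothetical table name of ours):

```sql
SELECT *
FROM my_table
ORDER BY rand()
LIMIT 10;
```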

But this is slow, as it requires a full scan of the table.

A more efficient way to do it is using the `SAMPLE` clause. You can pass an integer to it (it should be large enough, typically above 100000):
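A sketch of ours (note that `SAMPLE` only works on tables created with a `SAMPLE BY` clause; `my_table` is hypothetical):

```sql
SELECT count()
FROM my_table
SAMPLE 100000;
```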

And you can also pass a float between 0 and 1, to indicate the fraction of the data that will be sampled.
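For example (same hypothetical table):

```sql
SELECT count()
FROM my_table
SAMPLE 0.1;  -- roughly 10% of the data
```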

## Descriptive statistics

### Describing a numeric series

ClickHouse also comes with lots of statistical functions, like Postgres does (see this section of the original post). The first query, written on Postgres this way:

Would be done on ClickHouse like this:
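A sketch of ours over a generated series (the aliases and sample data are made up):

```sql
SELECT
    count(n)                      AS cnt,
    avg(n)                        AS mean,
    stddevPop(n)                  AS std,
    min(n)                        AS min_n,
    quantiles(0.25, 0.5, 0.75)(n) AS quartiles,
    max(n)                        AS max_n
FROM (SELECT arrayJoin(range(1, 11)) AS n);
```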

### Describing a categorical series

ClickHouse can also be used to get some statistics from discrete values. While on Postgres you’d do this:

On ClickHouse you’d do:
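For example (our sketch, with made-up sample data):

```sql
SELECT
    count(v)      AS cnt,
    uniq(v)       AS uniq_values,
    topK(1)(v)[1] AS top_value
FROM (SELECT arrayJoin(['a', 'a', 'b', 'c']) AS v);
```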

As a side note, if you have categorical columns, most likely you’ll have better performance and lower storage cost by using `LowCardinality` data types. The performance of `LowCardinality` will be better than using the base data types even on columns with more than a few million different values. Quoting the Instana post:

> When we came across the LowCardinality data type the first time, it seemed like nothing we could use. We assumed that our data is just not homogeneous enough to be able to use it. But when looking at it recently again, it turns out we were very wrong. The name LowCardinality is slightly misleading. It actually can be understood as a dictionary. And according to our tests, it still performs better and is faster even when the column contains millions of different values.
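As a rough sketch of what this looks like in a table definition (the table and columns here are hypothetical):

```sql
CREATE TABLE events
(
    ts       DateTime,
    category LowCardinality(String),
    value    Float64
)
ENGINE = MergeTree
ORDER BY ts;
```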

## Subtotals

The same operations done in this section of Haki’s post can be done with ClickHouse.

Given a table that contains this data:

Finding the number of employees with each role is straightforward:
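A sketch of ours (`employees` and its columns are a hypothetical name for the table above):

```sql
SELECT role, count() AS c
FROM employees
GROUP BY role;
```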

### Rollup

The `ROLLUP` modifier is also available on ClickHouse, although the syntax is slightly different than on Postgres. This query on Postgres:

would be written on ClickHouse like this:
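Something like this (our sketch with hypothetical column names; where Postgres writes `GROUP BY ROLLUP (department, role)`, ClickHouse appends the modifier after the grouping keys):

```sql
SELECT department, role, count() AS c
FROM employees
GROUP BY department, role
WITH ROLLUP;
```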

It allows you to have more subtotals (but not all of them). To have subtotals for all the possible combinations of grouping keys, you need to use the `CUBE` modifier:
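For example (same hypothetical table):

```sql
SELECT department, role, count() AS c
FROM employees
GROUP BY department, role
WITH CUBE;
```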

## Pivot tables and conditional expressions

Pivot tables let you reshape data when you have, typically, a column with keys, a column with categories and a column with values, and you want to aggregate those values and use the categories as the columns of a new table.

On Postgres you could do it this way:

On ClickHouse, you could do the same this way:
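A sketch of ours with the `sumIf` combinator (table and column names are made up):

```sql
SELECT
    name,
    sumIf(value, category = 'red')   AS red,
    sumIf(value, category = 'green') AS green,
    sumIf(value, category = 'blue')  AS blue
FROM my_table
GROUP BY name;
```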

The original post doesn’t mention this, but Postgres has a very convenient `crosstab` function that lets you do what we’ve done in one line. If the number of categories to pivot is large, you can imagine how long this query could become if written manually, and how handy the `crosstab` function can get. Something like this is not available yet on ClickHouse, unfortunately.

## Running and Cumulative Aggregation

Aggregations over sliding windows are a very common thing, as Haki says. The best way to achieve this is with window functions, and they’re coming soon to ClickHouse, as we wrote about here.

## Linear regression

Given this data:

On Postgres, as we can see in the original post, you’d do linear regression like this:
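Something along these lines, using the `regr_*` aggregate functions (our sketch; `t` is a hypothetical table with columns `x` and `y`):

```sql
SELECT
    regr_slope(y, x)     AS slope,
    regr_intercept(y, x) AS intercept,
    regr_r2(y, x)        AS r2
FROM t;
```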

It’s also possible to do it in ClickHouse. Here’s how:
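A sketch of ours with the `simpleLinearRegression` aggregate function (same hypothetical table):

```sql
SELECT simpleLinearRegression(x, y) AS coeffs  -- returns a tuple (k, b) for y = k*x + b
FROM t;
```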

There’s no function on ClickHouse like `regr_r2` that gives you the R2 coefficient, but it wouldn’t be hard to calculate it yourself, as the formula is simple.

## Interpolation

Filling null values with Pandas is a one-liner. Imagine we have this table:

### Fill with constant

The way to replace all the `null` values with a constant value is using the `coalesce` function, which works in ClickHouse the same way it does in Postgres:
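For example (our sketch; `tb` stands for the table above, with a nullable column `v`):

```sql
SELECT n, coalesce(v, 'X') AS v_filled
FROM tb;
```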

### Back and forward fill

This is also a one-liner in Pandas. In this section of the original post, the author does it using correlated subqueries, but those aren’t supported yet on ClickHouse. Fortunately, ClickHouse comes with a lot of powerful array functions like `groupArray`, `arrayFill` and `arrayReverseFill`.

`groupArray`, like the other aggregate functions on ClickHouse, skips null values. So the solution involves first replacing them with another value (make sure that the new value doesn’t appear in the column). This is done with the `ifNull` function. Add some array magic in, and this is how you’d do it:
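A sketch of ours, again over the hypothetical `tb` table, using `'X'` as a sentinel that doesn’t occur in the column (you’d `arrayJoin` the results to get back to rows):

```sql
SELECT
    arrayFill(x -> x != 'X', groupArray(ifNull(v, 'X')))        AS ffill,  -- forward fill
    arrayReverseFill(x -> x != 'X', groupArray(ifNull(v, 'X'))) AS bfill   -- backward fill
FROM tb;
```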

To understand what’s going on here, check this out: it has a step-by-step explanation and the results of the transformations that are taking place.

### Linear interpolation

Reference: https://altinity.com/blog/harnessing-the-power-of-clickhouse-arrays-part-3

## Binning

```sql
WITH tb AS (
  SELECT *
  FROM (VALUES
    (1, 'A'),
    (2, 'B'),
    (3, null),
    (4, 'D'),
    (5, null),
    (6, null),
    (7, 'G')
  ) AS t(n, v)
)
```