Time series predictions are one of the most common use cases you can find. Predicting the future, enables you to get ready for it (and act accordingly) so, as you would expect, it is something every company would love to do. Good news is that there are many methods to do it: from sophisticated Machine Learning algorithms or advanced forecasting libraries like prophet to simpler approaches based on simpler statistical foundations, like the one we describe below.

In this case, we are going to use the famous NYC taxi dataset to predict the number of pick-ups for a specific day around the same area.

Importing the data

The data consists of a set of CSV files with “yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts”.

First thing, we will load the data from 2017 and 2018 using the Tinybird Datasource API. We will be using pick-ups in 2017 to “train” the model and the ones in 2018 to validate it. This will load around 220M records in your account.

#!/bin/bash
TOKEN=<DATASOURCE:CREATE token> 
for year in 2017 2018
do
    for month in $(seq -f "%02g" 1 12)
    do
      curl -H "Authorization: Bearer $TOKEN" \
      -X POST "https://api.tinybird.co/v0/datasources?url=https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_${year}-${month}.csv&name=nyc_taxi_17_18&mode=append"
    done
done

Remember that you’ll need to grab your DATASOURCE:CREATE from your tokens page

The method

The simplest method we can use is to assume that the number of pickups is exactly the same than one year ago, but taking into account the average growth for the year, in other words:

pickups_predicted(date) = pickups(date - ‘1 year’) * growth_ratio

So let’s start by predicting the total pick-ups number for August.

Calculating the average growth

It can be done per day, per week, per month or using any other time range, but we will try with a month. It’s a good balance between having enough samples to avoid outliers, and not too many that it does not pick up the tendency.

const READ_TOKEN = 'p.eyJ1IjogImMzZTMwNDIxLTYwNzctNGZhMS1iMjY1LWQwM2JhZDIzZGRlOCIsICJpZCI6ICI3NjNmZjYwNy02MDA3LTQzMTUtYTZlYy01Mjg0NDQwMTQzMjUifQ.KJaTOlGcg3Dx96g2ZkHJLCtijZ6Zu-_jLrCTzebYrWU';
var nytaxi = tinybird(READ_TOKEN).pipe('nyc_taxi_17_18_pipe')
var res = await nytaxi.json(`
  WITH
  (
    SELECT count(1) c FROM _
    WHERE toYYYYMM(tpep_pickup_datetime) == 201803
  ) as c_2018,
  (
    SELECT count(1) c FROM _
    WHERE toYYYYMM(tpep_pickup_datetime) == 201703

  ) as c_2017
        SELECT c_2018 / c_2017 ratio
`)
console.log(res.data[0]['ratio'], res.statistics.elapsed)

Some notes: Instead of using the pipe name, the underscore (_) symbol is used. In Tinybird _ is a shortcut to avoid typing pipe names all the time (check many other shortcuts tinybird.js has). It substitutes the pipe name used when instantiating the nyc_taxi_17_18_pipe object. This also lets you reuse queries when changing pipe names, for example, when changing the environment from staging to production. toYYYYMM function returns an integer of a datetime. It could be replaced by tpep_pickup_datetime between ‘2018-03-01’ and ‘2018-03-31’ but it’s shorter and we don’t have to deal with different month lengths.

As a side note, we always run benchmarks and examples on the smallest machine we can get from our infrastructure provider. Better than showing how fast Tinybird can be, we prefer to show you the baseline, which we can always scale up linearly.

console.log(`this query took ${res.statistics.elapsed} seconds on the smallest tinybird account`)

Fetching previous year info

As mentioned above, we are going to calculate the number of pick-ups based on the previous year’s data. To get the same day, but one year before, we will use the addYears function:

var res = await nytaxi.json(`
    WITH
        toDate('2018-01-01') AS date
    SELECT
        addYears(date, -1) AS one_year_ago
`)

console.log(res['data'][0]['one_year_ago'], res.statistics.elapsed)

In order to match the day of the week, we are going to use a nice trick from Xoel López. It involves getting the date 364 days before, instead of 365 (this will not work on a leap year but let’s simplify to check if the model works).

Now, let’s evaluate how our model performs by calculating it for every day and comparing the predicted value with the actual one (2018).

var res = await nytaxi.json(`
    -- precompute the growth year-to-year for March
    WITH (
      SELECT
        c_2018 / c_2017 ratio,
        countIf(toYear(tpep_pickup_datetime) == 2018) c_2018,
        countIf(toYear(tpep_pickup_datetime) == 2017) c_2017
      FROM _
      WHERE toMonth(tpep_pickup_datetime) == 2
    ) as growth

    -- compute diff between 2018 prediction and 2018 real data
    SELECT
        toDate('2018-01-01') + d as day,
        growth.1 as ratio,
        -- this is the actual model :)
        c_2017 * ratio as estimated,
        c_2018,
        abs(c_2018 - estimated) as abs_error,
        abs(c_2018 - estimated)/c_2018 as rel_error

    -- 2018 data to evaluate the model
    FROM (
        SELECT toInt32(toDayOfYear((tpep_pickup_datetime))) d, count(1) c_2018
        FROM _ WHERE toYYYYMM(tpep_pickup_datetime) == 201802
        GROUP BY d
    )

    -- 2017 data, note the 1-day offset
    ANY INNER JOIN
    (
        SELECT toDayOfYear((tpep_pickup_datetime)) - 1 d, count(1) c_2017
        FROM _ WHERE toYYYYMM(tpep_pickup_datetime) == 201702
        GROUP BY d
    )
    USING d
    ORDER BY day
`)

console.log(`      day      |   error  `)
console.log(`--------------------------`)
for (var day of res.data) {
  console.log(`  ${day.day}   |   ${(100*day.rel_error).toFixed(1)}%`)
}
console.log(`--------------------------`, res.statistics.elapsed)

Not bad! almost everything under 10% can be considered pretty good taking into account the simplicity of our model.

The final function

Next, once we have checked that the model works, we should create a function to predict data for a given day. It would look like

async function predict_pickups(date) {
    var res = await nytaxi.first(`
        -- precompute the growth year-to-year for date month
        WITH
        (
            SELECT count(1) c FROM _
            WHERE
            toYYYYMM(tpep_pickup_datetime) == toYYYYMM(addYears( toDate('${date}'), -1))
        ) as prev_count,
        (
            SELECT count(1) c FROM _
            WHERE
            toYYYYMM(tpep_pickup_datetime) == toYYYYMM( toDate('${date}'))
        ) as curr_count

        -- compute prediction
        SELECT
            throwIf(prev_count == 0),
            toDate(tpep_pickup_datetime) d,
            curr_count/prev_count as ratio,
            round(count(1) * ratio) as estimated
            FROM _
            WHERE toDate(tpep_pickup_datetime) == addDays(toDate('${date}'), -364)
            GROUP BY d
    `)
    if (res.error) {
       throw new Error("failed to get the prediction")
    }
    return res['estimated']
}

// use the function
const date = '2018-01-15'
var pickups = await predict_pickups(date)
console.log(`predicted pickups for date ${date} -> ${pickups}`)

Some comments: In this case we are using addDays(date, -364) instead of toDayOfYear((tpep_pickup_datetime)) - 1, simpler and more elegant. In order to avoid problems when there is no data, we added throwIf(prev_count == 0) which raises and exception and stops the query. We can also use first(sql) instead of json(sql) to fetch the first row. Lots of stuff can be done in order to speed up the query, but we will talk about that in a future post.

Quick wins to improve the model

There are some quick wins that could be added, for example:

  • Take into account special dates like July, 4th or Dec, 25th. As outliers, they deserve some special treatment. Actually, it would be better to analyze when the model is not working properly, analyze why and fix the model for that.
  • Calculate the growth ratio using all the data up to date, instead of just using the last month.

Other interesting analysis

Knowing the number of pickups is not really interesting, it would have more value if we would do a more specific analysis like calculating the number of pick-ups per taxi zone per hour, the estimated fare rate per taxi zone per hour, or the estimated number of people who will be taking a taxi in an area at a specific time. Give them a try! We will cover those in following blog posts.

Other resources you might be interested in