Apr 21, 2019

Simple Time Series Prediction Modeling Using Tinybird

Time series predictions are one of the most common use cases you can find. Predicting the future, enables you to get ready for it (and act accordingly).
Javier Santana
Co-founder

Time series predictions are one of the most common use cases you can find. Predicting the future, enables you to get ready for it (and act accordingly) so, as you would expect, it is something every company would love to do. Good news is that there are many methods to do it: from sophisticated Machine Learning algorithms or advanced forecasting libraries like prophet                  to simpler approaches based on simpler statistical foundations, like the one we describe below.

In this case, we are going to use the famous NYC taxi dataset                  to predict the number of pick-ups for a specific day around the same area.

Importing the data

The data consists of a set of CSV files                  with "yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts".

First thing, we will load the data from 2017 and 2018 using the Tinybird Datasource API. We will be using pick-ups in 2017 to "train" the model and the ones in 2018 to validate it. This will load around 220M records in your account.

Remember that you'll need to grab your DATASOURCE:CREATE from your tokens page

The method

The simplest method we can use is to assume that the number of pickups is exactly the same than one year ago, but taking into account the average growth for the year, in other words:

So let's start by predicting the total pick-ups number for August.

Calculating the average growth

It can be done per day, per week, per month or using any other time range, but we will try with a month. It’s a good balance between having enough samples to avoid outliers, and not too many that it does not pick up the tendency.

Some notes: Instead of using the pipe name, the underscore (_) symbol is used. In Tinybird _ is a shortcut to avoid typing pipe names all the time. It substitutes the pipe name used when instantiating the nyc_taxi_17_18_pipe object. This also lets you reuse queries when changing pipe names, for example, when changing the environment from staging to production. toYYYYMM                  function returns an integer of a datetime. It could be replaced by tpep_pickup_datetime between ‘2018-03-01’ and ‘2018-03-31’ but it’s shorter and we don’t have to deal with different month lengths.

As a side note, we always run benchmarks and examples on the smallest machine we can get from our infrastructure provider. Better than showing how fast Tinybird can be, we prefer to show you the baseline, which we can always scale up linearly.

console.log(`this query took ${res.statistics.elapsed} seconds on the smallest tinybird account`)

Fetching previous year info

As mentioned above, we are going to calculate the number of pick-ups based on the previous year's data. To get the same day, but one year before, we will use the addYears function:

In order to match the day of the week, we are going to use a nice trick from Xoel López. It involves getting the date 364 days before, instead of 365 (this will not work on a leap year but let's simplify to check if the model works).

Now, let's evaluate how our model performs by calculating it for every day and comparing the predicted value with the actual one (2018).

Not bad! almost everything under 10% can be considered pretty good taking into account the simplicity of our model.

The final function

Next, once we have checked that the model works, we should create a function to predict data for a given day. It would look like

Some comments: In this case we are using addDays(date, -364) instead of toDayOfYear((tpep_pickup_datetime)) - 1, simpler and more elegant. In order to avoid problems when there is no data, we added throwIf(prev_count == 0) which raises and exception and stops the query. We can also use first(sql) instead of json(sql) to fetch the first row. Lots of stuff can be done in order to speed up the query, but we will talk about that in a future post.

Quick wins to improve the model

There are some quick wins that could be added, for example:

  • Take into account special dates like July, 4th or Dec, 25th. As outliers, they deserve some special treatment. Actually, it would be better to analyze when the model is not working properly, analyze why and fix the model for that.
  • Calculate the growth ratio using all the data up to date, instead of just using the last month.

Other interesting analysis

Knowing the number of pickups is not really interesting, it would have more value if we would do a more specific analysis like calculating the number of pick-ups per taxi zone per hour, the estimated fare rate per taxi zone per hour, or the estimated number of people who will be taking a taxi in an area at a specific time. Give them a try! We will cover those in following blog posts.

Other resources you might be interested in

Do you like this post?

Related posts

Real-time Data Visualization: How to build faster dashboards
A new way to create intermediate Data Sources in Tinybird
Tinybird
Team
Jun 15, 2023
Export data from Tinybird to Amazon S3 with the S3 Sink
Tinybird
Team
Mar 21, 2024
Tinybird: A ksqlDB alternative when stateful stream processing isn't enough
To the limits of SQL... and beyond
Automating data workflows with plaintext files and Git
Chatting GraphQL with Jamie Barton of Grafbase
Tinybird
Team
Apr 24, 2023
What it takes to build a real-time recommendation system
We launched an open source ClickHouse Knowledge Base
Tinybird
Team
Oct 11, 2022
The definition of real-time data

Build fast data products, faster.

Try Tinybird and bring your data sources together and enable engineers to build with data in minutes. No credit card required, free to get started.
Need more? Contact sales for Enterprise support.