Static applications have a lot of advantages: they run entirely in the browser, you don’t need a full-stack developer to create an app, they are fast, they scale well, they are easy to deploy with tools like Zeit, and frameworks like Next.js or Gatsby make it easy to build more complex ones.

The main problem with these kinds of apps is that you often need to get data from different services. Nowadays there are services for almost everything, and you can talk to them directly from your static app, through a small backend, or with a bunch of lambda functions (also provided by Zeit). Ideally, the fewer pieces between your app and the services providing the data, the better: less latency, fewer points of failure, fewer things to maintain.

Tinybird is one of those services: it provides a way to query large amounts of data in real time. In this post we build a static application that shows some stats over more than 50M GitHub events from January 2020.

Let’s start. We assume you already have Zeit and Tinybird accounts.

Loading raw events from GitHub

GitHub provides access to the raw public events produced on its platform in JSON format, one file per hour. We will load the data for January 2020 as an example.

Before loading it we need to select which fields from the JSON events we want to analyze and transform them to CSV. In this case we just pipe the data through jq to transform it to CSV and push it to Tinybird using the API. We could load all the event fields, but let’s keep only some of them for the sake of simplicity.

$ parallel -j 6 curl ::: https://data.gharchive.org/2020-01-{01..31}-{0..23}.json.gz | \
  gzip -d  | \
  jq -r "[.id, .type, .created_at, .actor.login, .repo.name] | @csv" | \
  parallel -j 2 --pipe -N500000 \
  curl -F csv=@- \
  "https://api.tinybird.co/v0/datasources?mode=append&name=github_actions&token=${TB_IMPORT_TOKEN}"

Let me explain the command.

First, we use parallel to download with 6 curl workers. Notice the bash brace expansion trick to cover every hour of the whole month:

$ echo {1..10}
1 2 3 4 5 6 7 8 9 10

After that, we decompress with gzip -d. Then, using jq, we select the fields we want to analyze from the JSON events and transform them to CSV. We batch the data with GNU parallel using two workers; -N500000 means we send batches of 500,000 rows. Finally, we pipe those batches to curl, which pushes them to Tinybird as a multipart upload. Notice we use append mode and a Data Source named github_actions. You don’t need to define a schema: Tinybird guesses everything for you.
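The jq filter simply picks five fields from each event and quotes them the way @csv does. In JavaScript terms (purely illustrative — this code is not part of the pipeline), it does roughly this:

```javascript
// Illustrative JavaScript equivalent of the jq filter
// '[.id, .type, .created_at, .actor.login, .repo.name] | @csv':
// pick five fields from one raw GitHub event and emit a CSV row.
function eventToCsvRow(event) {
  const fields = [
    event.id,
    event.type,
    event.created_at,
    event.actor.login,
    event.repo.name
  ];
  // @csv wraps string values in double quotes and doubles inner quotes
  return fields
    .map(v => typeof v === 'string' ? `"${v.replace(/"/g, '""')}"` : v)
    .join(',');
}

// Example event, trimmed to the fields we use (values are made up)
const row = eventToCsvRow({
  id: '11185376329',
  type: 'PushEvent',
  created_at: '2020-01-01T00:00:00Z',
  actor: { login: 'octocat' },
  repo: { name: 'octocat/hello-world' }
});
// row === '"11185376329","PushEvent","2020-01-01T00:00:00Z","octocat","octocat/hello-world"'
```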

That ${TB_IMPORT_TOKEN} variable is a token with import permissions; you can grab it from the dashboard of your Tinybird account.

Creating the static application

To show how to query that data, we created a pretty simple Vue application. It could be done with any other framework (React, Angular) or just vanilla JavaScript. The application shows the top users and repositories for each event type. Obviously, a real application would have a build pipeline, testing and so on; for this example a single HTML file is fine.

Fetching the whole dataset (1.5 GB compressed) is out of the question: it would crash the browser. So in this case we need to aggregate the data. Tinybird provides APIs to fetch the data in different forms; here the app fetches aggregated data in JSON format through the query API, using plain SQL.

curl "https://api.tinybird.co/v0/sql?q=select+count()+from+github_actions&token=${TB_READ_TOKEN}"
55305106

If you don’t like SQL, it’s also possible to create an endpoint (/v0/pipes/github_actions_tops.json) using Tinybird’s UI.

curl "https://api.tinybird.co/v0/pipes/github_actions_tops.json?token=p.eyJ1IjogIjI0OTA1NjBmLWJkYTEtNDE0OC1iZmViLTNmYWEzODMzZGEzMyIsICJpZCI6ICI1MjkyMTYxNi1jZTY1LTRjNjYtODQ1My05MDNmMWNmZWJjNTgifQ.4ln-8XdZisZypRUOpbmwdcpke_ubaRB30jY-LTn5-BQ"

...
    "data":
    [
        {
            "top_users": ["dependabot-preview[bot]","LombiqBot","dependabot[bot]","pull[bot]","direwolf-github","renovate[bot]","lihkg-boy","greenkeeper[bot]","svc-software-factory","codeserver-test1"],
            "top_repos": ["lihkg-backup\/thread","latinwordnet\/latinwordnet-archive","itengying-maker\/itengying","getselfstudy\/builder-course-5e2f8181d31c99000946f33f","otiny\/up","jdcloudcom\/MKTContent","qinwang13\/sqlite","eret1k\/chromiumos","2537463005\/codecombat.github.io","lihaoyi\/test"],
            "top_days": ["2020-01-13","2020-01-28","2020-01-14","2020-01-09","2020-01-15","2020-01-21","2020-01-22","2020-01-16","2020-01-07","2020-01-08"],
            "event_count": 55305106
        }
    ]
...

For this example, let’s just use the SQL interface.

The application is really simple:

  • When the application starts, the different event types are requested using the tinyb.query method from tinybird.js:
// get different kind of events to populate the select
tinyb.query('select distinct(type) t from github_actions').then(r => {
  this.options = r.data.map(v => { 
    return { text: v.t, value: v.t } 
  })
  this.selected = this.options[0].text
})
  • When the user selects an event type, the app fetches the top users, top repos and event count:
getStats(event_type) {
  // get top ten user repos and total event count
  const sql = `
    select 
      topK(10)(user) top_users,
      topK(10)(repo_name) top_repos,
      count() event_count
    from github_actions 
  where type = '${event_type}'
  `
  tinyb.query(sql).then(r => {
    var row = r.data[0]
    this.top_users = row.top_users
    this.top_repos = row.top_repos
    this.event_count = row.event_count
  })
}
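The tinybird.js helper itself is not shown in the snippets above. A minimal sketch exposing the same tinyb.query(sql) shape could look like this — the endpoint URL matches the curl example earlier, but everything else (the buildUrl helper, the error handling) is an assumption, not the real library:

```javascript
// Minimal sketch of a tinybird.js-style helper (an assumption; the real
// library may differ). tinybird(token).query(sql) sends the SQL to the
// /v0/sql endpoint and resolves with the parsed JSON response, which
// contains the "data" array used by the app above.
function tinybird(token) {
  const API = 'https://api.tinybird.co/v0/sql';
  // Exposed separately so the URL construction is easy to test
  const buildUrl = sql =>
    `${API}?q=${encodeURIComponent(sql)}&token=${encodeURIComponent(token)}`;
  const query = sql =>
    fetch(buildUrl(sql)).then(r => {
      if (!r.ok) throw new Error(`Tinybird query failed: ${r.status}`);
      return r.json();
    });
  return { query, buildUrl };
}
```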

This is the code for the full app (see the complete source code on GitHub):

var app = new Vue({
  el: '#app',
  data: {
    selected: '',
    options: [],
    top_users: [],
    top_repos: [], 
    event_count: 0
  }, 
  mounted() {
    const READ_TOKEN = 'p.eyJ1IjogIjI0OTA1NjBmLWJkYTEtNDE0OC1iZmViLTNmYWEzODMzZGEzMyIsICJpZCI6ICIzOGM5OTFjNy1hOWM3LTRjZTMtOGE0Ny0yYjI2NTY4M2I2MzIifQ.SZsRrUuopEMnlYJeuF_3rXWJtUTYuj60IRtIi5on9dE'
    this.tinyb = tinybird(READ_TOKEN)
    this.tinyb.query('select distinct(type) t from github_actions').then(r => {
      this.options = r.data.map(v => { 
        return { text: v.t, value: v.t } 
      })
      this.selected = this.options[0].text
      this.getStats(this.selected)
    })
  },
  methods: {
    getStats(event_type) {
      const sql = `
        select 
          topK(10)(user) top_users,
          topK(10)(repo_name) top_repos, 
          topK(10)(toDate(created_at)) top_days,
          count() event_count
        from github_actions 
      where type = '${event_type}'
      `
      this.tinyb.query(sql).then(r => {
        var row = r.data[0]
        this.top_users = row.top_users
        this.top_repos = row.top_repos
        this.event_count = row.event_count
      })
    },
    onChangeType (e) {
      this.getStats(e.target.value)
    }
  }
})

The deploy

Thanks to Zeit, this is as easy as running the now command in the project directory.

You can run the application here.

Does my data need to be public?

In this particular case the GitHub data is public, so keeping it private makes no sense, but it’s possible to use access tokens. In the source code, a token is passed as an argument when the tinybird object is created. In this case that token is hardcoded in the HTML, but it could be provided by a backend behind a login.

mounted() {
    const READ_TOKEN = 'p.eyJ1IjogIjI0OTA1NjBmLWJkYTEtNDE0OC1iZmViLTNmYWEzODMzZGEzMyIsICJpZCI6ICIzOGM5OTFjNy1hOWM3LTRjZTMtOGE0Ny0yYjI2NTY4M2I2MzIifQ.SZsRrUuopEMnlYJeuF_3rXWJtUTYuj60IRtIi5on9dE'
    this.tinyb = tinybird(READ_TOKEN)
    ...
}

You could generate tokens for each user and even add restrictions, like filtering to a single event type or a date range.

Sending raw SQL from the client application is not insecure!

If queries like DROP TABLE or DELETE FROM were allowed, that would be a very big issue, but that’s not the case: only the queries permitted by the token in use are run, so you can’t remove data or access data you don’t have permission to read.
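Tokens limit the damage, but note that getStats interpolates event_type into the SQL string verbatim. Here the value comes from a select box the app populated itself, so it is fine; if it ever came from free-form user input, escaping the string literal would be a cheap precaution. An illustrative helper (not part of tinybird.js, and the exact escaping rules depend on the SQL dialect) might look like:

```javascript
// Illustrative helper (not part of tinybird.js): turn a value into a
// quoted SQL string literal, doubling single quotes and escaping
// backslashes so user input can't break out of the literal.
function sqlStringLiteral(value) {
  return `'${String(value).replace(/\\/g, '\\\\').replace(/'/g, "''")}'`;
}

// Usage inside getStats:
//   const sql = `... where type = ${sqlStringLiteral(event_type)}`;
```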

You could also disallow raw SQL entirely and generate regular endpoints that accept input as parameters, like:

curl "https://api.tinybird.co/v0/pipes/github_actions_tops.json?event_type=IssueCommentEvent"

How fast are the queries?

The average query time on the small account we use in this example is about 300 ms. Timings are reported in the response headers and also in the JSON payload. There are no intermediate caches involved, and no extra optimizations, such as indices, were created.

We are opening new accounts, request early access.