In recent years, the entire development lifecycle has come to revolve around the version control system. If you want continuous integration and testing, healthy release workflows, automatic security checks, linters, links to tickets, alerts… you use a tool or a service that runs them for you when something happens in your repo, like a commit, a pull request or a merge.
The thing is, developers are privileged: they work with source code, a line-based text format, which is exactly what most SCMs are built to handle. I honestly don't know how other industries manage without tools like these.
When designing Tinybird, one of the things we had in mind was: analytics data projects are code, and code should live in a repo, like any other part of the application. That's why we decided to expose every resource as a simple text-based format, along with a way to serialize and deserialize it to and from our service.
Most SaaS products don't let you mirror your project or metadata to a repo, which makes it impossible to apply the good practices I mentioned in the first paragraph.
The design
Our data model is simple: we have just two kinds of resources, datasources and data transformation pipes, which store and process data respectively. You can access both through a regular API that returns JSON, but JSON is not the best format for a human to edit or, in general, to work with. So we decided to also serialize them as regular text files.
After some tests, we finally went with the simplest possible design rather than tying it to an existing format. We wanted to maximize how easy it is to write one of those files in a code editor. As I said, the same resources are exposed as JSON if you want to automate anything, so you don't need to write a parser for those files. Machines and people need different interfaces.
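To give an idea of the machine-facing side, here is a sketch of how fetching resources as JSON might look from a script. The endpoint path and the `$TB_TOKEN` variable are illustrative placeholders, not a statement of the documented API:

```shell
# Hypothetical sketch: list datasources as JSON over the HTTP API.
# Host, path and $TB_TOKEN are placeholders for illustration only.
curl -s -H "Authorization: Bearer $TB_TOKEN" \
  "https://api.tinybird.co/v0/datasources"
```

Automation scripts consume this JSON; humans edit the text files described below.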
We chose a file format similar to a Dockerfile: easy to parse, easy to write and organize, one that lets you resolve merge conflicts without much hassle, and one that most developers more or less already know how to deal with.
To be clear, we were not clever enough to think of all this up front: we went through several data analytics projects, and after a few iterations we landed on a format that was handy.
So, for example, you define a datasource like this:
```
# test.datasource

VERSION 0

SCHEMA >
    timestamp DateTime,
    user_id Int32

SORTING_KEY timestamp
```
And you push it to our platform with our CLI tool, built specifically to work with these files:
```shell
$ tb push test.datasource
```
That's it. You can do the same with every single resource in a project, so you can keep using your favorite version control system on any provider of your choice, together with the code editor you use every single day.
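Pipes, the other resource type, follow the same conventions. As a sketch (the file name, node name and SQL are illustrative, building on the `test.datasource` example above):

```
# top_users.pipe

NODE top_users
SQL >
    SELECT user_id, count() AS events
    FROM test
    GROUP BY user_id
    ORDER BY events DESC
```

A file like this would be pushed the same way: `tb push top_users.pipe`.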
Of course, you can pull files as well:
```shell
$ tb pull
```
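A typical loop, then, is to pull the serialized files and version them like any other code. The repo layout and commit message here are illustrative:

```shell
# Sync the project files from the service, then version them as code.
tb pull
git add -- '*.datasource' '*.pipe'
git commit -m "Sync data project files"
```

From here on, branches, reviews and CI apply to your data project just like they do to application code.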
The benefits
Being able to serialize the project as text files and store them in GitHub lets us do several things with our data pipelines:
- Run data tests
- Test the API endpoints you can expose with a pipe (this means exposing the result of a SQL query as an API)
- Push new data workflows to production
- Replicate the same project across several environments (local/dev/staging/production)
- Use all the available tooling: merge requests, GitHub Actions, GitLab CI/CD…
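For instance, a CI job covering the "push to production" point could boil down to a few CLI calls on every merge. This is a hypothetical sketch, not a recommended pipeline; only `tb push` comes from this post, and how you select the target environment's token depends on your setup:

```shell
#!/bin/sh
# Hypothetical CI step: push every serialized resource after a merge.
# Assumes the CI runner is already authenticated against the target
# environment (e.g. via an environment-specific token).
set -e
for f in *.datasource *.pipe; do
  tb push "$f"
done
```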
We just wanted to introduce these concepts here; we will write a lot more about them in future blog posts. You can subscribe to receive updates.