From day one, Tinybird supported CSV as the main ingestion format. CSV is natively supported by databases and applications as an exchange format, so with a one-time development effort Tinybird could interoperate "seamlessly" with other systems.
The main problem with CSV is that there is no standard specification, only a loose set of guidelines on how to serialize data into a text string.
Those guidelines leave most of what's involved in reading and writing CSV files application-dependent, completely breaking the promise of interoperability. To name just a few issues:
- Encoding issues (mojibake).
- No standard column separator; CSV actually stands for "Character Separated Values".
- Multiple character-escaping conventions.
- A header row that might or might not be present.
- Stray header rows anywhere when you read a CSV file exported in chunks.
- Newlines anywhere, including inside values.
- Headers with a different number of elements than the rows.
- Rows with different numbers of elements.
- Lines that end with a trailing separator.
- Untyped values: no standard way to store nested structures or to differentiate a boolean from a string or an integer.
- Empty value or null value?
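That last ambiguity takes one line to demonstrate with Python's standard csv module (a minimal illustration, not Tinybird's parser):

```python
import csv
import io

# An empty field and a "null" field serialize to exactly the same bytes,
# so the reader cannot tell them apart: both come back as empty strings.
row = next(csv.reader(io.StringIO("a,,c\n")))
print(row)  # ['a', '', 'c']: was the middle value empty, or missing?
```

Every downstream consumer has to pick its own convention here, which is precisely the interoperability problem.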
The reality is that CSV is so ill-defined that its name, "Comma Separated Values", is not even accurate.
At Tinybird we've had to ingest thousands of different user-defined CSV files "seamlessly". We make a best guess on at least all of the issues mentioned above, and we can tell you it is not a trivial task but a genuinely challenging one, especially when you are doing it at scale in real-time scenarios.
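To give a flavour of what "best guess" means, here is a toy sketch of dialect detection using Python's `csv.Sniffer` (illustrative only; this is not Tinybird's inference code):

```python
import csv
import io

# Two serializations of the same record that differ only in a choice the CSV
# "guidelines" leave open: the column delimiter.
samples = [
    "id,city,temp\n1,Madrid,21\n",   # comma-separated
    "id;city;temp\n1;Madrid;21\n",   # semicolon-separated, common in EU locales
]

parsed = []
for text in samples:
    # csv.Sniffer inspects a sample and guesses the dialect; real-world files
    # routinely defeat it, which is why production ingestion needs fallbacks
    # and a quarantine path for rows that do not fit.
    dialect = csv.Sniffer().sniff(text)
    rows = list(csv.reader(io.StringIO(text), dialect))
    parsed.append((dialect.delimiter, rows))

# Both samples decode to the same logical table once the dialect is known.
assert parsed[0][1] == parsed[1][1]
```

Guessing the delimiter is only the first of the problems listed above; encoding, quoting, headers and types each need their own heuristics.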
The era of JSON data analytics
JSON is the de facto standard for data communication on the web. IoT sensors, server and security logs, real-time advertising, click-stream apps, social media, etc. all operate with JSON data, and that's one of the reasons we support JSON natively: from a Kafka stream or from local or remote NDJSON files (and very soon in other flavours).
As opposed to CSV, JSON is a semi-structured, standard format: less ambiguous, less application-dependent in its interpretation, easier to map to data structures or nested data, typed, and relatively easy to parse.
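The contrast with CSV is easy to see with a couple of NDJSON lines (illustrative events, not a real Tinybird stream):

```python
import json

ndjson = (
    '{"user": "ada", "clicks": 3, "premium": true}\n'
    '{"user": "alan", "clicks": 7, "premium": false}\n'
)

# One independent, complete JSON document per line: types survive the round
# trip, so a boolean stays a boolean and an integer stays an integer.
events = [json.loads(line) for line in ndjson.splitlines()]
assert events[0]["premium"] is True
assert isinstance(events[1]["clicks"], int)
```

The line-delimited framing is also what makes NDJSON easy to stream and to split for parallel ingestion.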
There are still some very valid criticisms of JSON. Mainly, if human readability is not a requirement for your use case, there are more efficient alternatives, such as Apache Avro (which we do support) or Protobuf. Schemaless JSON is also a pain, but hey: when you've been able to ingest tweets embedded in CSV files, everything else is easy as pie.
While CSV is far from being dead and continues to be a very common and useful exchange format, this is the era of JSON data analytics and we are ready for it.
In our quest to build a delightful developer experience, what matters more than the nuances of parsing CSV or JSON is identifying the critical patterns to design the best ingestion framework for our users: one that is format- and transport-agnostic.
When working with JSON in Tinybird we use the same framework we designed for CSV, adding some improvements for a better ingestion experience. It is an API-centric framework, integrated into our dashboard and CLI, that lets you:
- Get the best possible guess at attributes and data types when creating new Data Sources.
- Stream JSON events from a Kafka topic or from NDJSON files, faster than CSV even with the overhead of JSON attribute names.
- Avoid broken ingestion processes thanks to quarantine.
- Monitor and trace ingestion with service data sources.
- Handle schema changes on read, so you can evolve your analyses as new data comes in.
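To make the first point concrete, here is a deliberately naive sketch of guessing attributes and types from an NDJSON sample (a hypothetical `guess_schema` helper; Tinybird's real inference is more sophisticated, but the shape of the problem is the same):

```python
import json

def guess_schema(lines):
    """Infer a flat column -> type-name map from a sample of NDJSON events."""
    schema = {}
    for line in lines:
        for key, value in json.loads(line).items():
            t = type(value).__name__
            # Widen on conflict: a column seen with mixed types falls back
            # to string, the safest representation.
            if schema.setdefault(key, t) != t:
                schema[key] = "str"
    return schema

sample = [
    '{"user": "ada", "clicks": 3}',
    '{"user": "alan", "clicks": 7, "premium": true}',
]
print(guess_schema(sample))
# {'user': 'str', 'clicks': 'int', 'premium': 'bool'}
```

Note how the second event introduces a column the first one lacks; handling exactly this kind of drift is what "schema changes on read" refers to.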
While we always challenge our assumptions, this framework guides the way we are ingesting data at Tinybird, deeply focused on simplicity, speed and developer experience.
What are your main challenges when dealing with large quantities of data? Tell us about them or sign up to Tinybird and get started on solving them right away.