Column- and row-oriented databases

The relational databases that many software engineers are most familiar with, e.g. Postgres or MySQL, are row-oriented: they store data row-by-row. For a given row, its columns are stored next to each other on disk. This means that fetching all the data from a row is fast.

visit_idipdurationbrowser
11.2.3.4100chrome
22.3.4.55firefox
33.4.5.630firefox

However, for some queries over very large data sets, this can become inefficient. Say that you have a table of visits to your website, and you want to find the average time spent per session. If you’re dealing with billions of rows, row-oriented databases need to do a lot of work: the storage engine needs to perform many disk I/O operations to read data from the rows, even if it only needs to read a few columns from each.

Column-oriented databases like Clickhouse or InfluxDB lay out data in a different way, such that all the values for a given column are contiguous on disk. This means that analytical queries and aggregations have to jump around the disk far less.

visit_id123
ip1.2.3.42.3.4.53.4.5.6
duration100530
browserchromefirefoxfirefox

Because columns often contain many repeated instances of the same value, there are also opportunities for compressing the data in each column to use less disk space.

As always, there are tradeoffs: writes (i.e. creates, updates, and deletes) are generally slower for columnar databases, and more work is needed to fetch the data from a single row.