Achieving fast upserts for Apache Druid

Druid’s data hierarchy

  1. The topmost unit of data in Druid is called a data source, this is the thing you can actually name and query. There are many types of data sources but for the purpose of this post, we will focus on tables.
  2. Each table is divided into time chunks (or intervals), which allow Druid to quickly filter queries to specific times. Time-chunks can be divided into varying granularities such as a month, week, day, hour, and so on.
  3. Druid’s smallest unit of data and processing are segment files and each time-chunk may contain multiple of those. Segment files are columnar storage units that utilize reverse dictionaries, bitmaps, and compression to provide quick queries and filtration.
Druid servers architecture taken from https://druid.apache.org/docs/latest/design/processes.html
Druid servers architecture. Taken from https://druid.apache.org/docs/latest/design/processes.html

Data in Singular’s pipeline

  1. Customer — the Singular account to which the data belongs.
  2. Date — the date associated with the data points.
  3. Source — the source from which the data is pulled.

Data in Singular’s Druid

How we used to load data into Druid

  1. Reingest and re-index the whole Druid partition including data from sources that weren’t updated.
  2. Calculate a diff between the source’s old data to its new data and load it into Druid without replacing existing data.

How we load data to Druid now

Small segments

  1. New incremental updates go into an append table using appendToExisting
  2. As more new updates come in we automatically delete unused segment files
  3. Once every 24 hours we reingest all updates and their respective Druid partitions into the primary table
  4. Existing data in the append table was rendered unneeded and will be automatically deleted

Filter complexity

(
table_type='Append' AND (
(source='Facebook' AND date='2021-01-01' AND version='ver1') OR
(source='Facebook' AND date='2021-01-02' AND version='ver2') OR
(source='Adwords' AND date='2021-01-01' AND version='ver3')
)
) OR
(
table_type='Primary' AND (
date not in ('2021-01-01', '2021-01-02') OR
(date='2021-01-01' and source not in ('Facebook', 'Adword')) OR
(date='2021-01-02' and source not in ('Facebook',))
)
)

Version state

Query performance

Results

Closing thoughts

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Singular Engineering Blog

Singular Engineering Blog

Singular’s engineering blog. We post here about tech topics we encounter and solutions we build. For more info go to → www.singular.net :)