Druid is a fast column-oriented distributed data store. It allows you to
execute queries, in particular OLAP-style queries, via a JSON-based query
language.
Druid can be loaded in batch mode or continuously; one of Druid’s key
differentiators is its ability to load from a streaming source such as Kafka
and have the data available for query within milliseconds.
Calcite’s Druid adapter allows you to query the data using SQL,
combining it with data in other Calcite schemas.
First, we need a model definition.
The model gives Calcite the necessary parameters to create an instance
of the Druid adapter.
A basic example of a model file is given below:
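A minimal sketch of such a model, written as a Python script that emits the JSON. The factory class names are those of Calcite's Druid adapter; the URLs, the interval, and the shortened dimension and metric lists are assumptions chosen for illustration, so adjust them to match your Druid installation.

```python
import json

# Sketch of a Calcite model for the Druid adapter.
# ASSUMPTIONS: the broker/coordinator URLs, the interval, and the
# (abbreviated) dimension and metric lists are illustrative values.
model = {
    "version": "1.0",
    "defaultSchema": "wiki",
    "schemas": [
        {
            "type": "custom",
            "name": "wiki",
            "factory": "org.apache.calcite.adapter.druid.DruidSchemaFactory",
            "operand": {
                "url": "http://localhost:8082",
                "coordinatorUrl": "http://localhost:8081",
            },
            "tables": [
                {
                    "name": "wiki",
                    "factory": "org.apache.calcite.adapter.druid.DruidTableFactory",
                    "operand": {
                        "dataSource": "wikiticker",
                        "interval": "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z",
                        "timestampColumn": "time",
                        "dimensions": ["channel", "countryName", "page", "user"],
                        "metrics": [
                            {"name": "count", "type": "count"},
                            {"name": "added", "type": "longSum", "fieldName": "added"},
                        ],
                    },
                }
            ],
        }
    ],
}

# Serialize the model; the result can be saved as a .json model file.
model_json = json.dumps(model, indent=2)
print(model_json)
```

The schema-level operand tells Calcite where Druid lives, while each table-level operand describes one Druid data source.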
This file is stored as druid/src/test/resources/druid-wiki-model.json,
so you can connect to Druid via sqlline as follows:
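As a rough sketch, the following script assembles the sqlline commands for such a session. The `jdbc:calcite:model=` URL form is Calcite's way of pointing at a model file; the `admin` credentials and the exact query text are assumptions, the latter reconstructed from the surrounding description of the query.

```python
# Path of the model file, as stated in the text.
MODEL_PATH = "druid/src/test/resources/druid-wiki-model.json"


def connect_url(model_path: str) -> str:
    """Build a Calcite JDBC URL that loads the given model file."""
    return f"jdbc:calcite:model={model_path}"


# ASSUMPTION: query text reconstructed from the prose (top 5 countries
# of origin by number of edits); table and column names are illustrative.
TOP_COUNTRIES_SQL = (
    'select "countryName", cast(count(*) as integer) as c\n'
    'from "wiki"\n'
    'group by "countryName"\n'
    'order by c desc limit 5;'
)

# ASSUMPTION: the user name and password "admin" are placeholders.
print(f"sqlline> !connect {connect_url(MODEL_PATH)} admin admin")
print(f"sqlline> {TOP_COUNTRIES_SQL}")
```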
That query shows the top 5 countries of origin of wiki page edits
on 2015-09-12 (the date covered by the wikiticker data set).
Now let’s see how the query was evaluated:
That plan shows that Calcite was able to push down the GROUP BY
part of the query to Druid, including the COUNT(*) function,
but not the ORDER BY ... LIMIT. (We plan to lift this restriction;
see [CALCITE-1206].)
Complex Metrics
Druid has special metrics that produce quick but approximate results.
Currently there are two types:
hyperUnique - HyperLogLog data sketch, used to estimate the cardinality of a dimension
thetaSketch - Theta sketch, also used to estimate the cardinality of a dimension,
but which can be used to perform set operations as well.
In the model definition, there is an array of strings called complexMetrics that declares
the alias for each complex metric defined. The alias is used in SQL, but the metric's real
column name is used when Calcite generates the JSON query for Druid.
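To illustrate the aliasing, here is a small sketch. It assumes, as one plausible reading of the text, that each alias in complexMetrics matches the fieldName of a complex metric and that the metric's name is the real column sent to Druid. The metric names user_unique and user_id and the helper resolve_alias are hypothetical.

```python
# Hypothetical fragment of a table operand in the model.
operand = {
    "metrics": [
        {"name": "count", "type": "count"},
        {"name": "user_unique", "type": "hyperUnique", "fieldName": "user_id"},
    ],
    "complexMetrics": ["user_id"],
}

# The two complex metric types named in the text.
COMPLEX_TYPES = {"hyperUnique", "thetaSketch"}


def resolve_alias(operand: dict, alias: str) -> str:
    """Return the real metric column for a complex-metric alias.

    ASSUMPTION: an alias is matched against each metric's fieldName.
    """
    if alias not in operand.get("complexMetrics", []):
        raise KeyError(f"{alias!r} is not declared in complexMetrics")
    for metric in operand["metrics"]:
        if metric["type"] in COMPLEX_TYPES and metric.get("fieldName") == alias:
            return metric["name"]
    raise KeyError(f"no complex metric is backed by field {alias!r}")


# SQL would refer to "user_id"; the Druid JSON query would use the result.
print(resolve_alias(operand, "user_id"))
```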
Foodmart data set
The test VM also includes a data set that denormalizes
the sales, product and customer tables of the Foodmart schema
into a single Druid data set called “foodmart”.
You can access it via the
druid/src/test/resources/druid-foodmart-model.json model.
Simplifying the model
If you provide less metadata in the model, the Druid adapter can discover
the rest automatically from Druid. Here is a schema equivalent to the previous one
but with dimensions, metrics and timestampColumn removed:
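A sketch of that simplification: starting from a table operand that spells everything out, drop the three keys named above. The concrete values and the helper simplify are assumptions for illustration; only the data source name and interval are left for you to supply.

```python
import json

# ASSUMPTION: illustrative table operand; all values are placeholders.
full_operand = {
    "dataSource": "wikiticker",
    "interval": "1900-01-09T00:00:00.000Z/2992-01-10T00:00:00.000Z",
    "timestampColumn": "time",
    "dimensions": ["channel", "countryName", "page", "user"],
    "metrics": [{"name": "count", "type": "count"}],
}

# Keys the adapter can discover from Druid, per the text.
DISCOVERABLE = ("dimensions", "metrics", "timestampColumn")


def simplify(operand: dict) -> dict:
    """Return a copy of the operand without the discoverable keys."""
    return {k: v for k, v in operand.items() if k not in DISCOVERABLE}


print(json.dumps(simplify(full_operand), indent=2))
```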
Calcite dispatches a segmentMetadataQuery to Druid to discover the columns
of the table.
Now, let’s take out the tables element:
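A sketch of the resulting model, again emitted from Python. With no tables element, the schema carries only the factory and its two operands; the URLs are assumed local defaults, not values taken from the original model.

```python
import json

# Model with the "tables" element removed: tables are discovered from Druid.
# ASSUMPTION: the broker and coordinator URLs are illustrative local defaults.
model = {
    "version": "1.0",
    "defaultSchema": "wiki",
    "schemas": [
        {
            "type": "custom",
            "name": "wiki",
            "factory": "org.apache.calcite.adapter.druid.DruidSchemaFactory",
            "operand": {
                "url": "http://localhost:8082",
                "coordinatorUrl": "http://localhost:8081",
            },
        }
    ],
}

print(json.dumps(model, indent=2))
```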
Calcite discovers the “wikiticker” data source via the
/druid/coordinator/v1/metadata/datasources
REST call. Now that the “wiki” table element is removed, the table is called
“wikiticker”. Any other data sources present in Druid will also appear as
tables.
Our model is now a single schema based on a custom schema factory with only two
operands, so we can dispense with the model and supply the operands as part of
the connect string:
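As a sketch, the connect string can be assembled from the factory name and the operands, using Calcite's `schemaFactory` and `schema.`-prefixed connection properties. The helper connect_string is hypothetical and the URLs are assumed local defaults.

```python
FACTORY = "org.apache.calcite.adapter.druid.DruidSchemaFactory"

# ASSUMPTION: illustrative broker and coordinator URLs.
OPERANDS = {
    "url": "http://localhost:8082",
    "coordinatorUrl": "http://localhost:8081",
}


def connect_string(factory: str, operands: dict) -> str:
    """Build a model-free Calcite JDBC connect string.

    Each operand becomes a "schema.<name>=<value>" property.
    """
    parts = [f"schemaFactory={factory}"]
    parts += [f"schema.{name}={value}" for name, value in operands.items()]
    return "jdbc:calcite:" + "; ".join(parts)


print(connect_string(FACTORY, OPERANDS))
```

The resulting string takes the place of the `jdbc:calcite:model=...` URL when connecting.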