Summarize and sketch
Summarized tables
Apache Druid is a speedy GROUP BY engine thanks to the way that it stores data and how it parallelizes query operations.
First, work through the GROUP BY notebook on learn-druid to explore the less obvious variations of the GROUP BY statement in Apache Druid.
Open JupyterLab in your learn-druid environment.
Work through the GROUP BY (local | source) notebook in the query section.
But GROUP BY is not just useful at query time. You can also use GROUP BY at ingestion time to summarize your incoming data.
If you're working with streaming data, you may want to check out the notebook on rollup at ingestion time (local | source).
Before you move on, take what you have learned in the notebook above and apply it to some of your own data. Use GROUP BY to summarize your data, applying a date and time function to truncate the primary timestamp, and adding aggregation functions such as COUNT and MAX to output metrics.
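As a starting point for that exercise, here is a minimal sketch of such a query, built as a string in Python so that you can paste it into a notebook cell. The table name "my_events" and the column "bytes_sent" are hypothetical placeholders for your own data, and the hourly period 'PT1H' is just one possible truncation.

```python
# Hypothetical names: "my_events" and "bytes_sent" stand in for your own
# table and metric column.
def hourly_summary_sql(table: str, metric_column: str) -> str:
    """Build a Druid SQL query that summarizes a table into hourly buckets."""
    return (
        "SELECT\n"
        # Truncate the primary timestamp to the hour.
        "  TIME_FLOOR(__time, 'PT1H') AS time_bucket,\n"
        # Aggregation functions that produce the output metrics.
        "  COUNT(*) AS row_count,\n"
        f'  MAX("{metric_column}") AS max_value\n'
        f'FROM "{table}"\n'
        # Group by the first SELECT expression (the truncated timestamp).
        "GROUP BY 1\n"
        "ORDER BY 1"
    )


query = hourly_summary_sql("my_events", "bytes_sent")
print(query)
```

Adding further columns to both the SELECT list and the GROUP BY clause gives you one summary row per time bucket and combination of those columns.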
Finally, make sure that you understand how the GROUP BY approach you have seen in SQL-based ingestion relates to "queryGranularity", "rollup", and "metricsSpec" in JSON-based ingestion.
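To make that relationship concrete, the sketch below builds the relevant fragment of a JSON-based ingestion spec as a Python dictionary. It is a fragment for illustration, not a complete spec: the column name "bytes_sent" and the metric names are hypothetical, and "segmentGranularity" is included only because it sits in the same "granularitySpec" object.

```python
import json

# Fragment of a dataSchema for JSON-based ingestion.
# "bytes_sent", "row_count", and "max_bytes_sent" are hypothetical names.
data_schema_fragment = {
    "granularitySpec": {
        "segmentGranularity": "day",
        # Truncates the primary timestamp, like TIME_FLOOR(__time, 'PT1H').
        "queryGranularity": "hour",
        # Turns summarization on, like adding a GROUP BY to the ingestion query.
        "rollup": True,
    },
    # The aggregates to compute for each group, like COUNT(*) and MAX(...).
    "metricsSpec": [
        {"type": "count", "name": "row_count"},
        {"type": "longMax", "name": "max_bytes_sent", "fieldName": "bytes_sent"},
    ],
}

print(json.dumps(data_schema_fragment, indent=2))
```

With rollup enabled, rows that share the same truncated timestamp and the same dimension values are combined into a single stored row that carries the aggregated metrics.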
Approximation
Apache Druid includes numerous query execution engines and functions that help you eke out the maximum performance for your queries. When you're ingesting or querying large amounts of data, it's especially important to know about these techniques.
Complete the following notebooks to see what these techniques are and how you can apply them, both at query time and as part of your ingestion.
Open JupyterLab in your learn-druid environment.
Work through the following notebooks in the query section:
- TopN approximation (local | source)
- Approximate COUNT DISTINCT (local | source) with HyperLogLog and Theta sketches
- Approximate data distribution functions (local | source) with Quantiles sketches
Work through the sketch generation (local | source) notebook from the ingestion section to learn how to create sketches at ingestion time as part of a summarized table.
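For orientation before you open that notebook, here is a rough sketch of what a summarized table with sketch columns can look like in SQL-based ingestion. The table and column names ("events", "events_rollup", "country", "user_id", "latency_ms") are hypothetical, and the notebook may use different functions or settings, so treat this as an illustration rather than the notebook's method.

```python
# Hypothetical source and target tables and columns throughout.
def rollup_with_sketches_sql(source: str, target: str) -> str:
    """Build an ingestion query that stores sketches alongside a row count."""
    return (
        f'REPLACE INTO "{target}" OVERWRITE ALL\n'
        "SELECT\n"
        "  TIME_FLOOR(__time, 'PT1H') AS __time,\n"
        '  "country",\n'
        '  COUNT(*) AS "row_count",\n'
        # HLL sketch for approximate distinct counts.
        '  DS_HLL("user_id") AS "users_hll",\n'
        # Theta sketch for approximate distinct counts with set operations.
        '  DS_THETA("user_id") AS "users_theta",\n'
        # Quantiles sketch for approximate distribution queries.
        '  DS_QUANTILES_SKETCH("latency_ms") AS "latency_sketch"\n'
        f'FROM "{source}"\n'
        "GROUP BY 1, 2\n"
        "PARTITIONED BY DAY"
    )


ingest_query = rollup_with_sketches_sql("events", "events_rollup")
print(ingest_query)
```

At query time, a stored sketch column like "users_hll" would then be passed to the matching approximate function, for example APPROX_COUNT_DISTINCT_DS_HLL("users_hll"), instead of the raw "user_id" column.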
When you are done, you will know how to leverage Apache DataSketches inside Apache Druid, and will know how to switch between approximate and non-approximate modes of query execution. You will also see how to pre-load your tables with Apache DataSketches for greater efficiency.
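One way to switch between the two modes is through the query context. The sketch below builds, but does not send, the JSON body for a Druid SQL API request, assuming the "useApproximateCountDistinct" and "useApproximateTopN" context parameters. The table "my_events" and the column "user_id" are hypothetical.

```python
import json


def sql_request_body(query: str, approximate: bool) -> dict:
    """Build the JSON body for a Druid SQL request with approximation on or off."""
    return {
        "query": query,
        "context": {
            # When False, COUNT(DISTINCT ...) is computed exactly.
            "useApproximateCountDistinct": approximate,
            # When False, the planner avoids the approximate TopN engine.
            "useApproximateTopN": approximate,
        },
    }


# Hypothetical table and column names.
distinct_users = 'SELECT COUNT(DISTINCT "user_id") AS "users" FROM "my_events"'

approx_body = sql_request_body(distinct_users, approximate=True)
exact_body = sql_request_body(distinct_users, approximate=False)

print(json.dumps(exact_body, indent=2))
```

The same SQL text is used in both cases and only the context changes, which makes it easy to compare the speed and the results of the two modes on your own data. Functions that name a sketch explicitly, such as APPROX_COUNT_DISTINCT_DS_THETA, remain approximate regardless of these settings.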
Tables
UNION ALL
Apache Druid lets you combine the results of queries with UNION ALL, but only in specific ways. This can help you to bring together different result sets into a single result set.
Open JupyterLab in your learn-druid environment.
Work through the UNION notebook (local | source) in the query section.
This notebook is important as it walks you through the types of UNION ALL operations that are possible in Druid, and helps you to understand where they can be applied.
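As a preview, the sketch below builds the two shapes of UNION ALL that you are most likely to meet: one that concatenates the results of two complete queries, and one that unions tables inside the FROM clause so that a single GROUP BY runs over the combined rows. The table names "events_2023" and "events_2024" and the column "country" are hypothetical.

```python
# Hypothetical tables "events_2023" and "events_2024" with a "country" column.
def top_level_union_sql(first: str, second: str) -> str:
    """Concatenate the results of two independent summary queries."""
    return (
        f'SELECT "country", COUNT(*) AS "row_count" FROM "{first}" GROUP BY 1\n'
        "UNION ALL\n"
        f'SELECT "country", COUNT(*) AS "row_count" FROM "{second}" GROUP BY 1'
    )


def table_level_union_sql(first: str, second: str) -> str:
    """Union two tables first, then summarize the combined rows once."""
    return (
        'SELECT "country", COUNT(*) AS "row_count"\n'
        "FROM (\n"
        f'  SELECT "country" FROM "{first}"\n'
        "  UNION ALL\n"
        f'  SELECT "country" FROM "{second}"\n'
        ")\n"
        "GROUP BY 1"
    )


top_level = top_level_union_sql("events_2023", "events_2024")
table_level = table_level_union_sql("events_2023", "events_2024")
print(top_level)
print(table_level)
```

The first form can return the same country twice, once per input query, while the second returns one row per country across both tables.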
Learn more
To get ready for the exam, take time to delve into the documentation. In particular, review:
- The Apache DataSketches extension and the functions that are available for each type of sketch.
- TopN queries.
- The UNION ALL operator.
- How to roll up (summarize) data in native JSON-based ingestion (used in streaming), especially how to truncate the timestamp and where to specify the aggregates (metrics) that you want to produce.
- Both the SQL and native aggregate functions that are available to you.