Streaming data sources and schemas
Streaming data sources
Learn more
Take a look at these resources to learn more:
- Check the latest available streaming types.
- Dig into the supervisor spec, particularly the ioConfig section.
- Apache® Kafka supervisor spec specifics, including how to ingest from multiple topics.
- Amazon Kinesis supervisor spec specifics.
- Review the data server configuration options, noticing druid.worker.capacity for each running Middle Manager; a sample setting appears after this list.
- Read more about the Apache Druid autoscaler.
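As a point of reference, worker capacity is set in each Middle Manager's runtime.properties file. A minimal sketch, with an illustrative value rather than a recommendation:

```properties
# Middle Manager runtime.properties (illustrative value only)
# Each running ingestion task occupies one worker slot, so this caps how many
# ingestion tasks this Middle Manager can run concurrently.
druid.worker.capacity=4
```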
Exercises
The ingestion specification contains the dataSchema, ioConfig, and tuningConfig components. These components determine how Druid connects to, interprets, and writes data into a table.
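As a point of reference, here is a minimal sketch of a Kafka supervisor spec showing where the three components sit. The data source name, column names, topic, and broker address are all placeholders:

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "example-table",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["user_id", "city"] },
      "granularitySpec": { "segmentGranularity": "hour" }
    },
    "ioConfig": {
      "topic": "example-topic",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "localhost:9092" }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

Submitting a spec like this to the supervisor API starts a long-running supervisor that creates and manages the ingestion tasks on your behalf.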
Now it's time to turn to the learn-druid repository of notebooks. These notebooks contain useful reference material, so even if you don't run them in the provided Docker image, it's worth looking at them in the source repository.
To start a streaming ingestion job using a supervisor, check out the following notebook:
Quickstart for streaming with Druid [local | source]
Druid can ingest data from multiple streams into the same table simultaneously. To see this in action, try the following notebook with sample data:
Multi-topic Kafka ingestion in Druid [local | source]
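For reference, here is a sketch of how multi-topic ingestion looks in the ioConfig, assuming a Druid version that supports the topicPattern property. The regular expression is a placeholder, and this fragment would slot into the supervisor spec sketched earlier in place of a single topic:

```json
"ioConfig": {
  "topicPattern": "events-.*",
  "inputFormat": { "type": "json" },
  "consumerProperties": { "bootstrap.servers": "localhost:9092" }
}
```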
Equipped with what you have learned, why not spin up a quickstart of your own? Try connecting to your own Amazon Kinesis or Apache Kafka-compatible source and running some simple queries on the data as it arrives.
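If you choose Kinesis, the ioConfig identifies the stream by name and region rather than by topic and broker, and the top-level supervisor type and tuningConfig type become kinesis. A sketch with placeholder values:

```json
"ioConfig": {
  "stream": "example-stream",
  "endpoint": "kinesis.us-east-1.amazonaws.com",
  "inputFormat": { "type": "json" },
  "useEarliestSequenceNumber": true
}
```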
Streaming data schema
Learn more
To learn more, take a look at these resources:
- Druid table schemas, noting how the primary timestamp is used.
- Understand strategies for adding secondary timestamps; see the sketch after this list.
- Read about segments, including the optimizations that Druid applies automatically.
- Consult the documentation on dimension specs.
- Read about strategies for schema changes.
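To make the timestamp discussion concrete, here is a sketch of a dataSchema in which one column serves as the primary timestamp and a second, hypothetical event-time column is kept as an ordinary long dimension so that it remains queryable:

```json
"dataSchema": {
  "dataSource": "example-table",
  "timestampSpec": { "column": "server_time", "format": "iso" },
  "dimensionsSpec": {
    "dimensions": [
      { "type": "string", "name": "user_id" },
      { "type": "long", "name": "client_time" }
    ]
  },
  "granularitySpec": { "segmentGranularity": "day" }
}
```

Druid partitions segments by the primary timestamp, which is why the choice of column in the timestampSpec matters more than it might first appear.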
Exercises
The dataSchema component of an ingestion specification defines how Druid parses data and configures the resulting schema.
You can define the table schema manually or turn on automatic schema detection. To see both automatic and manual schema definition in a JSON ingestion specification, check out the "Defining table schemas in native Ingestion" notebook in the learn-druid repository [local | source].
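As a quick contrast, and with hypothetical column names, the two approaches differ only in the dimensionsSpec. Automatic detection:

```json
"dimensionsSpec": { "useSchemaDiscovery": true }
```

Manual definition:

```json
"dimensionsSpec": {
  "dimensions": [
    { "type": "string", "name": "city" },
    { "type": "long", "name": "session_length" }
  ]
}
```

With discovery enabled, Druid infers column names and types from the incoming data; with a manual list, only the named columns are ingested.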
You may want to use your new knowledge to:
- Read from your own stream, experimenting with both manual and automatic schema detection.
- Sample your own data to identify different timestamps and run experiments to see how Druid partitions the data.