Druid (Data Store)

From Bauman National Library
This page was last modified on 22 June 2016, at 15:50.
Druid
Druiddb.png
Developer(s) Community
Repository {{#property:P1324}}
Development status Active
Written in Java
Platform Cross-platform
Type distributed, real-time, column-oriented data store
License Apache license 2.0 [1]
Website druid.io

Druid is a column-oriented open-source distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, making that data immediately available to queries. This is sometimes referred to as real-time data.

General

Druid is an open-source analytics data store designed for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is most commonly used to power user-facing analytic applications.

Key Features

Sub-second OLAP Queries Druid’s column orientation and inverted indexes enable complex multi-dimensional filtering and scanning exactly what is needed for a query. Aggregate and filter on data in milliseconds.

Real-time Streaming Ingestion Typical analytics databases ingest data via batches. Ingesting an event at a time is often accompanied with transactional locks and other overhead that slows down the ingestion rate. Druid employs lock-free ingestion of append-heavy data sets to allow for simultaneous ingestion and querying of 10,000+ events per second per node. Simply put, the latency between when an event happens and when it is visible is limited only by how quickly the event can be delivered to Druid.

Power Analytic Applications Druid has numerous features built in for multi-tenancy. Power user-facing analytic applications designed to be used by thousands of concurrent users.

Cost Effective Druid is extremely cost effective at scale and has numerous features built in for cost reduction. Trade off cost and performance with simple configuration knobs.

Highly Available Druid is used to back SaaS implementations that need to be up all the time. Druid supports rolling updates so your data is still available and queryable during software updates. Scale up or down without data loss.

Scalable Existing Druid deployments handle trillions of events, petabytes of data, and thousands of queries every second.

Concepts

Indexing the Data

Druid gets its speed in part from how it stores data. Borrowing ideas from search infrastructure, Druid creates immutable snapshots of data, stored in data structures highly optimized for analytic queries.

Druid is a column store, which means each individual column is stored separately. Only the columns that pertain to a query are used in that query, and Druid is pretty good about only scanning exactly what it needs for a query. Different columns can also employ different compression methods. Different columns can also have different indexes associated with them.

Druid indexes data on a per shard (segment) level.

Loading the Data

Druid has two means of ingestion, real-time and batch. Real-time ingestion in Druid is best effort. Exactly once semantics are not guaranteed with real-time ingestion in Druid, although we have it on our roadmap to support this. Batch ingestion provides exactly once guarantees and segments created via batch processing will accurately reflect the ingested data. One common approach to operating Druid is to have a real-time pipeline for recent insights, and a batch pipeline for the accurate copy of the data.

Querying the Data

Druid's native query language is JSON over HTTP, although the community has contributed query libraries in numerous languages, including SQL.

Druid is designed to perform single table operations and does not currently support joins. Many production setups do joins at ETL because data must be denormalized before loading into Druid.

The Druid Cluster

A Druid Cluster is composed of several different types of nodes. Each node is designed to do a small set of things very well.

  • Historical Nodes Historical nodes commonly form the backbone of a Druid cluster. Historical nodes download immutable segments locally and serve queries over those segments. The nodes have a shared nothing architecture and know how to load segments, drop segments, and serve queries on segments.
  • Broker Nodes Broker nodes are what clients and applications query to get data from Druid. Broker nodes are responsible for scattering queries and gathering and merging results. Broker nodes know what segments live where.
  • Coordinator Nodes Coordinator nodes manage segments on historical nodes in a cluster. Coordinator nodes tell historical nodes to load new segments, drop old segments, and move segments to load balance.
  • Real-time Processing Real-time processing in Druid can currently be done using standalone realtime nodes or using the indexing service. The real-time logic is common between these two services. Real-time processing involves ingesting data, indexing the data (creating segments), and handing segments off to historical nodes. Data is queryable as soon as it is ingested by the realtime processing logic. The hand-off process is also lossless; data remains queryable throughout the entire process.

High Availability Characteristics

Druid is designed to have no single point of failure. Different node types are able to fail without impacting the services of the other node types. To run a highly available Druid cluster, you should have at least 2 nodes of every node type running.

Links

References

Cite error: Invalid <references> tag; parameter "group" is allowed only.

Use <references />, or <references group="..." />
  1. "Apache license".