Apache Cassandra

From Bauman National Library
This page was last modified on 20 December 2018, at 10:07.
</td></tr>
Apache Cassandra
Cassandra logo
Original author(s) Avinash Lakshman, Prashant Malik
Developer(s) Apache Software Foundation
Initial release 2008
Stable release
3.11.3 / August 1, 2018; 3 years ago (2018-08-01), testing version 4.0
Repository {{#property:P1324}}
Development status Active
Written in Java
Operating system Cross-platform
Available in English
Type Database
License Apache License 2.0
Website cassandra.apache.org

Apache Cassandra is a massively scalable open source non-relational database that offers continuous availability, linear scale performance, operational simplicity and easy data distribution across multiple data centers and cloud availability zones. Cassandra was originally developed at Facebook, was open sourced in 2008, and became a top-level Apache project in 2010.

CAP theorem [1]

In theoretical computer science, the CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
Cassandra CAP.jpg

Apache Cassandra implements A (Availability) and P (Partition tolerance) and gives strong control on C (Consistency) while executing operations.

Key Cassandra Features and Benefits

Cassandra provides a number of key features and benefits for those looking to use it as the underlying database for modern online applications:

  • Massively scalable architecture – a masterless design where all nodes are the same, which provides operational simplicity and easy scale-out.
  • Active everywhere design – all nodes may be written to and read from.
  • Linear scale performance – the ability to add nodes without going down produces predictable increases in performance.
  • Continuous availability – offers redundancy of both data and node function, which eliminate single points of failure and provide constant uptime.
  • Transparent fault detection and recovery – nodes that fail can easily be restored or replaced.
  • Flexible and dynamic data model – supports modern data types with fast writes and reads.
  • Strong data protection – a commit log design ensures no data loss and built in security with backup/restore keeps data protected and safe.
  • Tunable data consistency – support for strong or eventual data consistency across a widely distributed cluster.
  • Multi-data center replication – cross data center (in multiple geographies) and multi-cloud availability zone support for writes/reads.
  • Data compression – data compressed up to 80% without performance overhead.
  • CQL (Cassandra Query Language) – an SQL-like language that makes moving from a relational database very easy.

Top Use Cases

While Cassandra is a general purpose non-relational database that can be used for a variety of different applications, there are a number of use cases where the database excels over most any other option. These include:

  • Internet of things applications – Cassandra is perfect for consuming lots of fast incoming data from devices, sensors and similar mechanisms that exist in many different locations.
  • Product catalogs and retail apps – Cassandra is the database of choice for many retailers that need durable shopping cart protection, fast product catalog input and lookups, and similar retail app support.
  • User activity tracking and monitoring – many media and entertainment companies use Cassandra to track and monitor the activity of their users’ interactions with their movies, music, website and online applications.
  • Messaging – Cassandra serves as the database backbone for numerous mobile phone and messaging providers’ applications.
  • Social media analytics and recommendation engines – many online companies, websites, and social media providers use Cassandra to ingest, analyze, and provide analysis and recommendations to their customers.
  • Other time-series-based applications – because of Cassandra’s fast write capabilities, wide-row design, and ability to read only the columns needed to satisfy queries, it is well suited time series based applications.

Architecture Overview

Cassandra Ring.png

Cassandra’s architecture is responsible for its ability to scale, perform, and offer continuous uptime. Rather than using a legacy master-slave or a manual and difficult-to-maintain sharded design, Cassandra has a masterless “ring” architecture that is elegant, easy to set up, and easy to maintain.

In Cassandra, all nodes play an identical role; there is no concept of a master node, with all nodes communicating with each other via a distributed, scalable protocol called "gossip." Cassandra’s built-for-scale architecture means that it is capable of handling large amounts of data and thousands of concurrent users or operations per second—​even across multiple data centers—​as easily as it can manage much smaller amounts of data and user traffic. To add more capacity, you simply add new nodes to an existing cluster without having to take it down first. Cassandra’s architecture also means that, unlike other master-slave or sharded systems, it has no single point of failure and therefore is capable of offering true continuous availability and uptime.

Writing and Reading Data

Cassandra is well known for its impressive performance in both reading and writing data. Data is written to Cassandra in a way that provides both full data durability and high performance. Data written to a Cassandra node is first recorded in an on-disk commit log and then written to a memory-based structure called a memtable. When a memtable’s size exceeds a configurable threshold, the data is written to an immutable file on disk called an SSTable. Buffering writes in memory in this way allows writes always to be a fully sequential operation, with many megabytes of disk I/O happening at the same time, rather than one at a time over a long period. This architecture gives Cassandra its legendary write performance.

Cassandra Write-path.png

Because of the way Cassandra writes data, many SSTables can exist for a single Cassandra logical data table. A process called compaction occurs on a periodic basis, coalescesing multiple SSTables into one for faster read access. Reading data from Cassandra involves a number of processes that can include various memory caches and other mechanisms designed to produce fast read response times. For a read request, Cassandra consults an in-memory data structure called a Bloom filter that checks the probability of an SSTable having the needed data. The Bloom filter can tell very quickly whether the file probably has the needed data, or certainly does not have it. If answer is a tenative yes, Cassandra consults another layer of in-memory caches, then fetches the compressed data on disk. If the answer is no, Cassandra doesn’t trouble with reading that SSTable at all, and moves on to the next.

Cassandra Read-path.png

Data Distribution and Replication

While the prior section provides a general overview of read and write operations in Cassandra in the context of a single node, the cluster-wide I/O activity that occurs is somewhat more sophisticated due to the database’s distributed, masterless architecture. Two concepts that impact read and write activity are the data distribution and replication models.

Automatic Data Distribution

Relational databases and some NoSQL systems require manual, developer-driven methods for distributing data across the multiple machines of a database cluster. These techniques are commonly referred to by the term "sharding." Sharding is an old technique that has seen some success in the industry, but is beset by inherent design and operational challenges. In contrast to this legacy architecture, Cassandra automatically distributes and maintains data across a cluster, freeing developers and architects to direct their energies into value-creating application features. Cassandra has an internal component called a partitioner, which determines how data is distributed across the nodes that make up a database cluster. In short, a partitioner is a hashing mechanism that takes a table row’s primary key, computes a numerical token for it, and then assigns it to one of the nodes in a cluster in a way that is predictable and consistent. While the partitioner is a configurable property of a Cassandra cluster, the default partitioner is one that randomizes data across a cluster and ensures an even distribution of all data. Cassandra also automatically maintains the balance of data across a cluster even when existing nodes are removed or new nodes are added to a system.

Replication Basics

Unlike many other databases, Cassandra features a replication mechanism that is very straightforward and easy to configure and maintain. Most Cassandra users agree that its replication model is one of the features that makes the database stand out from other relational or non-relational options. A Cassandra cluster can have one or more keyspaces, which are analogous to Microsoft SQL Server and MySQL databases or Oracle schemas. Replication is configured at the keyspace level, allowing different keyspaces to have different replication models. Cassandra is able to replicate data to multiple nodes in a cluster, which helps ensure reliability, continuous availability, and fast I/O operations. The total number of data copies that are replicated is referred to as the replication factor. A replication factor of 1 means that there is only one copy of each row in a cluster, whereas a replication factor of 3 means three copies of the data are stored across the cluster. Once a keyspace and its replication have been created, Cassandra automatically maintains that replication even when nodes are removed, added, or fail.

Multi-Data Center and Cloud Support

A popular aspect of Cassandra’s replication is its support for multiple data centers and cloud availability zones. Many users deploy Cassandra in this manner to ensure constant uptime for their applications and to supply fast read/write data access in localized regions. You can easily set up replication so that data is replicated across geographically diverse data centers, with users being able to read and write to any data center they choose and the data being automatically synchronized across all locations. You can also choose how many copies of your data exist in each data center (e.g. two copies in data center one; three copies in data center two, etc). Hybrid deployments of part on-premise data centers and part cloud are also supported.

Cassandra Multi-data-center.png

Data Model Overview

This section provides a brief introduction to Cassandra’s data model, along with the data structures and query language Cassandra uses to manage data.

Data Model Overview

Cassandra is a wide-row-store database that uses a highly denormalized model designed to capture and query data performantly. Although Cassandra has objects that resemble a relational database (e.g., tables, primary keys, indexes, etc.), Cassandra data modeling techniques necessarily depart from the relational tradition. For example, the legacy entity-relationship-attribute data modeling paradigm is not appropriate in Cassandra the way it is with a relational database. Success with Cassandra almost always comes down to getting two things right: the data model and the selected hardware—especially the storage subsystem. The topic of data modeling is of prime importance. Unlike a relational database that penalizes the use of many columns in a table, Cassandra is highly performant with tables that have thousands or even tens of thousands of columns. Cassandra provides helpful data modeling abstractions to make this paradigm approachable for the developer.

Cassandra Data Objects

The basic objects you will use in Cassandra include:

  • Keyspace – a container for data tables and indexes; analogous to a database in many relational databases. It is also the level at which replication is defined.
  • Table – somewhat like a relational table, but capable of holding vastly large volumes of data. A table is also able to provide very fast row inserts and column level reads.
  • Primary key – used to identity a row uniquely in a table and also distribute a table’s rows across multiple nodes in a cluster.
  • Index – similar to a relational index in that it speeds some read operations; also different from relational indices in important ways.

Transaction Management

While Cassandra does not support ACID transactions like most legacy relational databases, it does offer the “AID” portion of ACID. Writes to Cassandra are atomic, isolated, and durable. The “C” of ACID—​consistency—​does not apply to Cassandra, as there is no concept of referential integrity or foreign keys. Cassandra offers tunable data consistency across a database cluster. This means you can decide whether you want strong or eventual consistency for a particular transaction. You might want a particular request to complete if just one node responds, or you might want to wait until all nodes respond. Tunable data consistency is supported across single or multiple data centers, and you have a number of different consistency options from which to choose. Consistency is configurable on a per-query basis, meaning you can decide how strong or eventual consistency should be per SELECT, INSERT, UPDATE, and DELETE operation. For example, if you need a particular transaction to be available on all nodes throughout the world, you can specify that all nodes must respond before a transaction is marked complete. On the other hand, a less critical piece of data (e.g., a social media update) may only need to be propagated eventually, so in that case, the consistency requirement can be greatly relaxed. Cassandra also supplies lightweight transactions, or a compare-and-set mechanism. Using and extending the Paxos consensus protocol (which allows a distributed system to agree on proposed data modifications with a quorum-based algorithm, and without the need for any one "master" database or two-phase commit), Cassandra offers a way to ensure a transaction isolation level similar to the serializable level offered by relational database.

From Author

I use some other articles to compose this. I thank authors of these articles. Here the list:

  • A Brief Introduction to Apache Cassandra [2]
  • cassandra.apache.org [3]
  • What is Apache Cassandra? [4]

References

  1. Apache Cassandra [Internet] :[https://en.wikipedia.org/wiki/CAP_theorem CAP Theorem
  2. Apache Cassandra [Internet] :https://academy.datastax.com/demos/brief-introduction-apache-cassandra A Brief Introduction to Apache Cassandra
  3. Apache Cassandra [Internet] :http://cassandra.apache.org/ cassandra.apache.org
  4. Apache Cassandra [Internet] :http://www.planetcassandra.org/what-is-apache-cassandra/ What is Apache Cassandra?