NoSQL: Cassandra & CQL

Cassandra: an extensible record (wide column) store

General idea...

• NoSQL
– Different types
– Scale up vs. scale out

• Key features

– Flexible data model

– High availability & Scalability

• Amazon DynamoDB
– Data model, partition & sort key

– Data types (string, number, set, map, list)

– Consistent hashing

• Apache Cassandra

– Write & read path

– Upsert
– Minor & major compaction

• Apache Hive
– HiveQL: SQL-like language

– Analyze data stored in HDFS
– Queries compiled into MapReduce jobs

• Cassandra & DynamoDB
– Key-based (~ OLTP)
– Processing a small amount of data per query

• Hive
– Analytical workload (~ OLAP)
– A query may need to process terabytes of data

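Data model

Cassandra uses the data model of Google's BigTable. Unlike a row-oriented relational database or a plain key-value store, Cassandra is a wide column store: each row is uniquely identified by a row key and can hold up to 2 billion columns, each column is identified by a column key, and each column key maps to one or more values.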

This model can be viewed as a two-dimensional key-value store: the whole data model is defined as something like
map<key1, map<key2, value>>.
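
A minimal Python sketch of this nested-map view. The names `table`, `put` and `get` are made up for illustration and are not Cassandra APIs; the point is only that different rows can carry different sets of columns.

```python
# Hypothetical sketch: a wide-column table as map<row_key, map<column_key, value>>.
table = {}

def put(row_key, column_key, value):
    """Store one value under (row key, column key)."""
    table.setdefault(row_key, {})[column_key] = value

def get(row_key, column_key):
    """Return the value, or None if the row or the column does not exist."""
    return table.get(row_key, {}).get(column_key)

put("Smith", "age", 35)
put("Smith", "city", "LA")
put("Jones", "email", "j@example.com")  # a row with a different set of columns
```

Here `get("Smith", "city")` returns `"LA"`, while `get("Jones", "age")` returns `None` because that row never stored an `age` column.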

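Interaction

Newer versions of Cassandra use CQL, a SQL-like language, to define the data model and to read and write data.
desc keyspaces;
create keyspace xxx with replication = {'class':'SimpleStrategy','replication_factor':1};
drop keyspace xxx;
create table xxx (name type primary key, name type);
-- "create columnfamily" is an older synonym for "create table"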

SELECT * FROM users WHERE lastname= 'Smith';

insert into users (lastname, age, city, firstname) values ('Smith', 35, 'LA', 'John');
-- Note: an insert does not check the content of existing SSTables.

insert into users (lastname, age) values ('Smith', 25);
-- This insert is actually an update (upsert): the new age supersedes the value already stored in an SSTable.

update users set city = 'SFO' where lastname = 'Smith';

Upsert

• Both update and insert are implemented as upsert
• Update: update if the row exists; otherwise insert (similar to upsert in MongoDB)
• Insert: insert if the row does not exist yet; otherwise update
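
The upsert behaviour can be sketched in Python as a single write routine shared by insert and update. This is an illustration only: `rows`, `upsert` and `read` are hypothetical names, and the counter stands in for real write timestamps.

```python
import itertools

_clock = itertools.count(1)   # stand-in for write timestamps (assumption)
rows = {}                     # row key -> {column name: (value, timestamp)}

def upsert(key, **columns):
    """Write the given columns without checking whether the row already exists."""
    ts = next(_clock)
    row = rows.setdefault(key, {})
    for name, value in columns.items():
        row[name] = (value, ts)   # a newer write replaces the older value

def read(key):
    """Return the current value of every column of the row."""
    return {name: value for name, (value, _) in rows.get(key, {}).items()}

upsert("Smith", age=35, city="LA", firstname="John")  # insert
upsert("Smith", age=25)       # an "insert" that behaves as an update
upsert("Smith", city="SFO")   # update
upsert("Lee", city="NYC")     # an "update" of a missing row behaves as an insert
```

After these calls `read("Smith")` gives `{'age': 25, 'city': 'SFO', 'firstname': 'John'}`: only the columns named in each write were touched.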

Delete
delete age from users where lastname = 'Smith';
– Deletes a specific column
delete from users where lastname = 'Smith';
– The entire row will be removed!

Secondary Index
• create index age_idx on users(age);
– drop index age_idx;
• select * from users where age = 25;
– This now works

Range or inequality queries on the partition key, and queries on non-key attributes (without a secondary index), are not supported. No joins, no foreign keys.

Compound key

• A primary key that contains multiple columns
• 1st column is the partition key
– Decides how rows are distributed among nodes
• Remaining columns are clustering columns (sort key in DynamoDB)
– Decide the order in which rows with the same partition key are stored
– Default: ascending

CREATE TABLE playlists ( id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order)
);

Change default order

CREATE TABLE playlists ( id uuid,
song_order int,
song_id uuid,
title text,
album text,
artist text,
PRIMARY KEY (id, song_order)
) WITH CLUSTERING ORDER BY (song_order DESC);
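
A rough Python sketch of what the compound key implies for storage: rows are grouped by partition key and kept sorted by the clustering column inside each partition. The names `partitions`, `insert` and `scan` are hypothetical, and the rows are reduced to a title for brevity.

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering key, title), kept sorted by clustering key
partitions = defaultdict(list)

def insert(playlist_id, song_order, title):
    rows = partitions[playlist_id]
    keys = [k for k, _ in rows]
    i = bisect.bisect_left(keys, song_order)
    if i < len(rows) and rows[i][0] == song_order:
        rows[i] = (song_order, title)        # same primary key: upsert
    else:
        rows.insert(i, (song_order, title))  # keep the partition sorted

def scan(playlist_id, descending=False):
    """Read one partition in clustering order (ascending by default)."""
    rows = partitions[playlist_id]
    return list(reversed(rows)) if descending else list(rows)

insert("p1", 3, "C")
insert("p1", 1, "A")
insert("p1", 2, "B")
insert("p2", 1, "Z")
```

`scan("p1")` returns the songs in `song_order` 1, 2, 3 regardless of insertion order, and `descending=True` mimics `CLUSTERING ORDER BY (song_order DESC)`.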

Structure

Columns are grouped into column families (tables)

Each row belongs to a column family

Rows are stored on disk in SSTables (sorted string tables)

Sorted string table (SSTable)

Rows are stored sorted by row key

Each row starts with a row key, followed by a list of columns sorted by column name / timestamp

Each column contains: 1. column name, 2. column value, 3. timestamp

Immutable: once created, no overwrites & no random writes

2 ways to create an SSTable:

1. Flush the in-memory data stored in the memtable (minor compaction):

Memtable: in-memory structure holding new data & updates
1 memtable per column family
Minor compaction: the memtable is flushed to disk as a new SSTable (when its size exceeds a threshold), which releases buffer pages & shrinks memory usage

2. Major compaction: merge a set of SSTables of the same column family. The merge is efficient since rows are sorted by key. Afterwards old data are removed & disk space is reclaimed
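
A sketch of why merging sorted SSTables is cheap, in Python. It assumes a simplified layout in which each SSTable is a list of `(row_key, timestamp, value)` tuples sorted by row key with one entry per key; `major_compaction` is a hypothetical name, not a Cassandra function.

```python
import heapq
from itertools import groupby

def major_compaction(sstables):
    """Merge sorted SSTables into one, keeping only the newest version per key."""
    # Every input is already sorted by key, so a streaming k-way merge suffices.
    merged = heapq.merge(*sstables, key=lambda entry: entry[0])
    result = []
    for _, versions in groupby(merged, key=lambda entry: entry[0]):
        result.append(max(versions, key=lambda entry: entry[1]))  # newest timestamp wins
    return result

older = [("a", 1, "x"), ("b", 1, "y"), ("d", 1, "w")]
newer = [("b", 2, "y2"), ("c", 2, "z")]
compacted = major_compaction([older, newer])
```

`compacted` is again sorted by key, and the stale version of `"b"` from the older SSTable is gone, which is where the disk space is reclaimed.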

Each SSTable has an index

– Efficient lookup of row content from row key
NOTE:
The index structure has 2 parts:
1. Bloom filter: no false negatives, but false positives are possible
If the Bloom filter says no, the row key is definitely not in the SSTable
2. B+ tree index
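
A small Bloom filter sketch in Python to make the "no false negatives" property concrete. The sizes, the use of SHA-256 and the class itself are assumptions for illustration; they are not how Cassandra implements its filters.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False: the key was never added. True: the key was probably added.
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for row_key in ["Smith", "Jones", "Lee"]:
    bf.add(row_key)
```

Every added key answers `True`, so a `False` answer lets a read skip the SSTable without touching disk; a `True` answer still has to be confirmed through the index.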

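A column family in BigTable is what Cassandra calls a table, similar to a table in a relational database. In Cassandra/BigTable, 1. the row key and 2. the column key together form the primary key.

Cassandra's row key decides which nodes store the row, so rows are placed by the hash of the row key and cannot be scanned or read sequentially. The column keys within one row are stored in sorted order, so ordered scans and range lookups are possible (clustering columns, like the sort key in DynamoDB).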

Write: insert/delete/update

A log entry is appended to a commit log file
Write data to memtable & acknowledge completion to client
When memtable is full, flush it as a new SSTable & purge corresponding entries from commit log (minor compaction)
Periodically, merge SSTables of the same column family (major compaction)
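
The write steps above can be sketched in Python as follows. `WriteNode` and its fields are hypothetical, the memtable limit counts rows rather than bytes, and the flush runs synchronously here, whereas a real node flushes in the background after the client has been acknowledged.

```python
class WriteNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only log, for durability
        self.memtable = {}        # in-memory structure with the newest data
        self.sstables = []        # immutable sorted runs "on disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. append a log entry
        self.memtable[key] = value             # 2. write to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. memtable full: flush
        return "ack"                           # acknowledge completion to the client

    def flush(self):
        # Write the memtable as a new SSTable sorted by key, then purge the
        # commit-log entries that the SSTable now covers.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.commit_log.clear()

node = WriteNode(memtable_limit=2)
node.write("b", 1)
node.write("a", 2)   # reaches the limit and triggers a flush
node.write("c", 3)   # stays in the memtable and the commit log
```

Note that no step reads existing data: a write is an append plus an in-memory update.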

Read:
Content of row is distributed among Memtable & Multiple SSTables
=> Read is more expensive than write & may require:
– disk access (to locate SSTables that contain fragments of row)
– merging (row content in mem-table & SSTables)
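
A Python sketch of the merging step, assuming every source (memtable or SSTable) maps a row key to `{column: (timestamp, value)}` fragments. `read_row` is a hypothetical helper; the sample data mirrors the `users` example.

```python
def read_row(key, memtable, sstables):
    """Assemble a row from its fragments; the newest timestamp wins per column."""
    merged = {}
    for source in [*sstables, memtable]:
        for column, (ts, value) in source.get(key, {}).items():
            if column not in merged or ts > merged[column][0]:
                merged[column] = (ts, value)
    return {column: value for column, (ts, value) in merged.items()}

sstable_1 = {"Smith": {"age": (1, 35), "city": (1, "LA")}}
sstable_2 = {"Smith": {"age": (2, 25)}}
memtable = {"Smith": {"city": (3, "SFO")}}

row = read_row("Smith", memtable, [sstable_1, sstable_2])
```

`row` ends up as `{'age': 25, 'city': 'SFO'}`, built from three fragments; the more SSTables hold pieces of a row, the more work a read has to do.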

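Storage

Unlike BigTable and its imitator HBase, Cassandra does not store data in a distributed file system such as GFS or HDFS; it stores data directly on local disk.

Like BigTable, Cassandra is a log-structured database. Newly written data is kept in an in-memory memtable and made durable through an on-disk commit log. When memory fills up, the data is written in key order to a read-only SSTable file. Every read has to look up and merge data from all SSTables and from memory. Such a system writes faster than it reads, because a write is a sequential append to the commit log and needs no random disk reads or searches.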

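System architecture

Cassandra's system architecture is similar to Dynamo's: it is based on consistent hashing, and each row is hashed to decide which nodes store it. The cluster has no master; all nodes play the same role, which avoids the instability caused by a single point of failure.

Every node stores data locally and accepts requests from clients.

For each request, the client picks a random node in the cluster. The node that receives the request locates the key on the consistent-hashing ring, forwards the request to the responsible nodes, and returns the replies of those nodes to the client.

On consistency, availability, and partition tolerance (CAP), Cassandra is as flexible as Dynamo.
Each Cassandra keyspace (a database in an RDBMS, a container for column families) can configure how many nodes a row is written to (call this number N) through its replication strategy.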

Simple replication strategy (SimpleStrategy)
– All replicas are in the same data center
– 1st replica on a node decided by consistent hashing
– Additional replicas on the next nodes clockwise in the ring
– Not rack-aware
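
A Python sketch of this placement rule. It assumes one token per node and uses SHA-256 as a stand-in hash; `token`, `Ring` and the node names are hypothetical and do not reflect Cassandra's actual partitioner.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (stand-in hash, an assumption)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)   # nodes in ring order
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, row_key, n):
        # 1st replica: first node at or after the key's token, wrapping around.
        start = bisect.bisect_left(self.tokens, token(row_key)) % len(self.ring)
        # Additional replicas: the next nodes clockwise on the ring.
        count = min(n, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
replicas = ring.replicas("Smith", 3)   # replication factor N = 3
```

Any node can run this computation for any key, which is why a request can be sent to an arbitrary node and forwarded without a master.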

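Comparison with HBase

HBase, which began as a subproject of Apache Hadoop, is a clone of Google BigTable. Like Cassandra, it uses BigTable's column-family data model, but:
Cassandra has only one kind of node, whereas HBase has several roles: besides the region servers that handle reads and writes, it is built on top of a full HDFS distributed file system and needs ZooKeeper to synchronize cluster state. Cassandra is therefore simpler to deploy.
Cassandra's consistency policy is configurable: choose strong consistency or higher-performance eventual consistency. HBase is always strongly consistent.

Cassandra uses consistent hashing to decide which nodes store a row and relies on probabilistic averaging for load balancing.
In HBase each segment of data (region) is served by a single node. The master dynamically decides whether a region has grown large enough to be split in two, and it reassigns regions from overheated nodes to less-loaded nodes, which gives dynamic load balancing. Since each region is served by only one node at a time, if that node stops responding its data cannot be read or written until the system has moved all of its regions to other nodes. There is also only one master, and failover to a standby master takes time, so HBase has a single-point problem to some degree, whereas Cassandra does not.
Cassandra's read and write performance is better than HBase's.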
