NoSQL and AWS DynamoDB Practices

NoSQL and DynamoDB
Relational databases:
• Mature and stable
• Feature-rich, versatile query language (SQL)
• ACID transactions

NoSQL databases:
• Flexible data model
• Scale out easily
• Data replicated over multiple servers
• Weak consistency, high availability

Scale up vs. scale out
• Scale up (vertical): add more resources (e.g., CPU, memory) to a single machine
• Scale out (horizontal): add more machines to the cluster

Types of NoSQL databases
• Key-value / tuple (row) stores
– E.g., Redis (key-value), DynamoDB (row)

• Document stores
– Store JSON / XML documents, e.g., MongoDB

• Extensible record (column) stores
– E.g., Bigtable, Cassandra, HBase

• Graph stores
– E.g., Spark's graph library for distributed computation over graph data

Extensible record store: similar to a relational database, with rows & columns; columns may be grouped into column families.
Different rows may have different columns.
Also called a wide-column store.

DynamoDB
• Schema-less: no predefined schema
• A table consists of a set of items; each item contains a set of attributes

• Items correspond to rows in a relational DB
– Different items may have different sets of attributes
– Max size of an item: 400 KB
• No concept of columns in DynamoDB

Each item is uniquely identified by a primary key.
The primary key consists of:
– a partition key
– an (optional) sort key
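
As a concrete illustration (the table name "Users" and its attributes are made up, not part of the notes), inserting items with boto3 might look like the sketch below; the two items carry different attribute sets, but each supplies the primary key:

```python
# Minimal sketch with boto3 (AWS SDK for Python); assumes a table "Users"
# already exists with partition key "UserId" (string).
import boto3

users = boto3.resource("dynamodb").Table("Users")

# Schema-less: the two items below have different attributes,
# but each must include the primary key.
users.put_item(Item={"UserId": "u1", "Name": "Maria Kelly", "Age": 30})
users.put_item(Item={"UserId": "u2", "Name": "Rod Stewart", "Country": "UK"})
```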

Partition key
– Used to partition (by hashing) the data across hosts for scalability & availability
• Pick an attribute with a wide range of values & evenly distributed access patterns as the partition key, e.g., user ID
• If the partition key is, e.g., artist name, the hash function may still put "Rod Stewart" and "Maria Kelly" in the same partition

Sort key
• Allows searching within a partition, e.g., year
– So primary key = artist + year
• This allows searching for CDs by a specific artist that were produced in certain years (see the query sketch below)
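
For illustration only (a "Music" table with partition key "Artist" and sort key "Year" is assumed), a boto3 query over that composite key could look like:

```python
# Sketch of a range query against a composite primary key
# (partition key "Artist", sort key "Year"); names are assumptions.
import boto3
from boto3.dynamodb.conditions import Key

music = boto3.resource("dynamodb").Table("Music")

# CDs by one artist produced in a certain range of years
resp = music.query(
    KeyConditionExpression=Key("Artist").eq("Rod Stewart") & Key("Year").between(1990, 1999)
)
for item in resp["Items"]:
    print(item)
```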

DynamoDB is not good for...
• Ad-hoc queries
– Since it does not have a query language like SQL & does not support joins
• OLAP
– Requires joining of fact and dimension tables
• BLOB (binary large object) storage
– E.g., images, videos; these are better stored in S3

Consistent hashing 
A hash function h(x) = y 
– x: a value of arbitrary size / length, e.g., a string of characters 
– y: a fixed-size / fixed-range value, e.g., 128 bits, [0, n-1], [0, 1] 
– Example: h(s) = (sum of character codes in string s) % 11
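
The toy hash above can be written directly in Python:

```python
def h(s: str) -> int:
    # Sum of character codes, folded into the fixed range [0, 10]
    return sum(ord(c) for c in s) % 11

print(h("Rod Stewart"), h("Maria Kelly"))
```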

Partitioning by hashing
Items are stored on different servers based on the hash values of their partition keys: h(k).
Suppose there are n nodes in a cluster. h(k) is typically a very big number (e.g., 128 bits), so assign the item with key k to node h(k) % n.
Problem when scaling out (the number of servers n grows):
• Key k is now assigned to node h(k) % (n+1), which may differ from h(k) % n
• Consequence: almost all items (keys) need to be moved (reassigned) to different servers (see the sketch below)
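
A quick simulation (toy key names and server count chosen for illustration) shows how severe the reshuffling is with modulo placement:

```python
import hashlib

def h(key: str) -> int:
    # A "very big" hash value, here 128-bit MD5, just for illustration
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user-{i}" for i in range(100_000)]
n = 10
moved = sum(1 for k in keys if h(k) % n != h(k) % (n + 1))
print(f"{moved / len(keys):.0%} of keys change servers")  # ~90% when n=10
```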

Consistent hashing
• Hash each key to a value in a fixed range, say [0, 1]
– E.g., h'(k) = h(k) / (maximum possible value of h)
• Assign each server to a point in the same range
– E.g., by hashing the machine's serial number into [0, 1]
• Assign each key to the first machine with a larger hash value (see the sketch below)
– If no such machine exists, wrap around and use the first machine from the beginning of the range
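
A minimal sketch of such a hash ring in Python (the server names and the choice of MD5 are assumptions for illustration):

```python
import bisect
import hashlib

def h01(x: str) -> float:
    # Hash a string into the fixed range [0, 1): 128-bit MD5 / 2^128
    return int(hashlib.md5(x.encode()).hexdigest(), 16) / 2**128

class ConsistentHashRing:
    def __init__(self, servers):
        # Each server is placed at a point in [0, 1) by hashing its name/serial
        self.points = sorted((h01(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        # Assign the key to the first server with a larger hash value,
        # wrapping around to the start of the range if needed
        i = bisect.bisect(self.points, (h01(key), ""))
        return self.points[i % len(self.points)][1]

ring = ConsistentHashRing(["server-A", "server-B", "server-C"])
print(ring.lookup("Rod Stewart"), ring.lookup("Maria Kelly"))
```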

How much improvement?
• m = # of keys, n = # of servers
• With consistent hashing, only about m/n keys need to be moved on average when a server is added
• Typically m/n << m for large n, so increasing n reduces the fraction of keys that move (see the sketch below)
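
A toy simulation of adding one server to a 50-server ring (re-deriving the ring lookup inline so the snippet stands alone) illustrates how little moves compared with the modulo scheme above:

```python
import bisect
import hashlib

def h01(x: str) -> float:
    return int(hashlib.md5(x.encode()).hexdigest(), 16) / 2**128

def assign(key, points):
    i = bisect.bisect(points, (h01(key), ""))
    return points[i % len(points)][1]

keys = [f"user-{i}" for i in range(100_000)]
old = sorted((h01(f"server-{j}"), f"server-{j}") for j in range(50))
new = sorted(old + [(h01("server-new"), "server-new")])

moved = sum(1 for k in keys if assign(k, old) != assign(k, new))
# Expect on the order of 1/(n+1) ≈ 2% (exact value varies with hash positions),
# versus ~90% with modulo placement
print(f"{moved / len(keys):.1%} of keys moved")
```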
