NoSQL and AWS DynamoDB Practices
Relational databases:
– Mature and stable
– Feature-rich, versatile query language (SQL)
– ACID transactions
NoSQL:
– Flexible schema
– Scales out easily
– Data replicated over multiple servers
– Weak consistency, high availability
Scale up vs. scale out
– Scale up: vertical scaling, e.g., adding more CPU/memory to a machine
– Scale out: horizontal scaling, adding more computers
Types of NoSQL databases
– Key-value / tuple (row) stores: Redis (key-value), DynamoDB (row)
– Document stores: JSON / XML documents, e.g., MongoDB
– Extensible record (column) stores: Bigtable, Cassandra, HBase
– Graph stores: e.g., Spark's graph library for distributed computation on graph data
Extensible record store: similar to a relational database, with rows & columns; columns may be grouped into column families
– Different rows may have different columns
– Also called a wide-column store
DynamoDB
• Schema-less: no predefined schema
– A database contains tables; each table consists of a set of items, and each item contains a set of attributes
– Items correspond to rows in a relational DB
– Different items may have different sets of attributes
– Max size of an item: 400 KB
– There is no concept of columns in DynamoDB
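To see this flexibility concretely, here is a minimal sketch using boto3 (the AWS SDK for Python); the Music table and its items are made up for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Music")  # hypothetical table, assumed to already exist

# Two items in the same table carrying different sets of attributes
table.put_item(Item={
    "Artist": "Rod Stewart",
    "Year": 1971,
    "Album": "Every Picture Tells a Story",
})
table.put_item(Item={
    "Artist": "Maria Kelly",
    "Year": 2020,
    "Format": "Digital",  # an attribute the first item does not have
    "Tracks": 12,
})
```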
Each item is uniquely identified by a primary key
Primary key consists of
– partition key
– (optional) sort key
Partition key
– Partitions (by hashing) the data across hosts for scalability & availability
• Pick an attribute with a wide range of values & evenly distributed access patterns as the partition key, e.g., user ID
• E.g., with artist name as the partition key, the hash function may put "Rod Stewart" and "Maria Kelly" in the same partition
Sort key
• Allows searching within a partition, e.g., year
– So primary key = artist + year
• This allows searching for CDs by a specific artist that were produced in certain years, as in the sketch below
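A minimal sketch with boto3, assuming the hypothetical Music table from above: the composite primary key is Artist (partition key) + Year (sort key), and a query then searches one artist's partition for a range of years:

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.create_table(
    TableName="Music",
    KeySchema=[
        {"AttributeName": "Artist", "KeyType": "HASH"},   # partition key
        {"AttributeName": "Year",   "KeyType": "RANGE"},  # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "Artist", "AttributeType": "S"},
        {"AttributeName": "Year",   "AttributeType": "N"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Search within one partition: CDs by one artist from certain years
resp = table.query(
    KeyConditionExpression=Key("Artist").eq("Rod Stewart")
                           & Key("Year").between(1970, 1979)
)
for item in resp["Items"]:
    print(item)
```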
DynamoDB is not good for...
• Ad-hoc queries
– Since it does not have a query language like SQL & does not support joins
• OLAP
– Requires joining of fact and dimension tables
• BLOB (binary large object) storage
– E.g., images, videos: these are better kept in S3 (see the sketch below)
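A common workaround (a sketch; the bucket, file, and attribute names are hypothetical): store the binary object in S3 and keep only its location in the DynamoDB item.

```python
import boto3

# Upload the large binary to S3...
s3 = boto3.client("s3")
s3.upload_file("cover.jpg", "my-media-bucket", "covers/rod-stewart-1971.jpg")

# ...and store only a pointer to it in DynamoDB
dynamodb = boto3.resource("dynamodb")
dynamodb.Table("Music").update_item(
    Key={"Artist": "Rod Stewart", "Year": 1971},
    UpdateExpression="SET CoverImage = :uri",
    ExpressionAttributeValues={
        ":uri": "s3://my-media-bucket/covers/rod-stewart-1971.jpg"
    },
)
```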
Consistent hashing
A hash function h(x) = y
– x: a value of arbitrary size / length, e.g., a string of characters
– y: a fixed-size / fixed-range value, e.g., 128 bits, [0, n-1], or [0, 1]
– Toy example: h(s) = (sum of values of characters in string s) % 11
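A minimal Python sketch of this toy hash function, assuming a character's "value" is its ASCII/Unicode code point:

```python
def h(s: str) -> int:
    """Toy hash: sum of character values in s, modulo 11."""
    return sum(ord(c) for c in s) % 11

print(h("Rod Stewart"), h("Maria Kelly"))  # both land in [0, 10]
```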
Partitioning by hashing
• Items are stored on different servers based on the hash values of their partition keys: h(k)
• Suppose there are n nodes in a cluster; h(k) is typically a very big number, e.g., 128 bits
– Assign the item with key k to node: h(k) % n
Problem in scaling out: # of servers (n) grows
• Key k is now assigned to h(k) % (n+1), which may be different from h(k) % n
• Consequence: almost all items (keys) need to be moved (reassigned) to different servers, as the sketch below shows
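The sketch below quantifies this; MD5 stands in for the 128-bit hash, and the key set is made up. Each key's assignment is compared under n and n+1 servers:

```python
import hashlib

def h(key: str) -> int:
    # 128-bit hash of the key (MD5 used only for illustration)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

keys = [f"user-{i}" for i in range(100_000)]
n = 10

# Count keys whose server assignment changes when n grows to n+1
moved = sum(1 for k in keys if h(k) % n != h(k) % (n + 1))
print(f"{moved / len(keys):.0%} of keys move")  # roughly 90% for n=10
```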
Consistent hashing
• Hash each key to a value in a fixed range, say [0, 1]
– E.g., h'(k) = h(k) / max(h(k))
• Assign each server to a point in the same range
– E.g., by hashing the machine's serial number into [0, 1]
• Assign each key to the first machine with a larger hash value
– If over the end of the range, wrap around and use the first machine from the beginning of the range (see the sketch below)
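A minimal consistent-hash ring over [0, 1), with hypothetical server names; lookup finds the first server at or past the key's point and wraps around at the end of the range:

```python
import bisect
import hashlib

def point(value: str) -> float:
    """Hash a string to a point in [0, 1)."""
    x = int(hashlib.md5(value.encode()).hexdigest(), 16)
    return x / 2**128  # normalize the 128-bit hash into [0, 1)

class Ring:
    def __init__(self, servers):
        # Each server is assigned a point on the ring by hashing its name
        self.points = sorted((point(s), s) for s in servers)

    def lookup(self, key: str) -> str:
        # First server with a point >= the key's point;
        # wrap to the beginning of the range if we run off the end
        i = bisect.bisect_right(self.points, (point(key), ""))
        return self.points[i % len(self.points)][1]

ring = Ring(["server-a", "server-b", "server-c"])
print(ring.lookup("Rod Stewart"))  # one of the three servers
```

With this scheme, a newly added server only takes over the keys between its own point and its predecessor's point; all other keys stay where they are.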
How much improvement?
• m = # of keys, n = # of servers
• m/n = # of keys to be moved on average when a server is added
• Typically, m/n << m (for large n): increasing n reduces the movement
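As an illustrative calculation: with m = 1,000,000 keys and n = 100 servers, consistent hashing moves only about m/n = 10,000 keys when a server is added, whereas modulo hashing would reassign nearly all 1,000,000.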