Hadoop--MapReduce review
Hadoop: large-scale distributed batch processing infrastructure
Origins: Google's GFS, Bigtable, and MapReduce
Hadoop architecture
HDFS (high-reliability storage)
MapReduce (a simplified programming model for parallel, distributed computation)
HDFS
distributes data among multiple DataNodes
data is replicated (typically 2 or 3 replicas)
a read request can go to any replica
1. One NameNode, storing metadata:
the namespace hierarchy: an inode per directory and file
attributes of directories and files: permissions, access time, modification time
the mapping of files to blocks on the DataNodes
2. n DataNodes, storing the contents (blocks) of files:
E.g., file /user/aaron/foo consists of blocks 1, 2, and 4; block 1 is stored on DataNodes 1 and 3
3. One Secondary NameNode
keeps checkpoints of the NameNode's metadata, for recovery
Note: in a single-machine setup, all of the above nodes run on one machine.
Block size: 64 MB (vs. a 4 KB disk block):
reduces metadata (far fewer blocks per file)
fast streaming reads (each block is laid out sequentially on disk)
good for largely sequential reads
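These settings can be checked from a client via the FileSystem API; a minimal sketch (BlockInfo is a hypothetical class name, assuming a Hadoop 2.x client with the cluster's config on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/user/aaron/foo");          // the example file from above
        System.out.println("block size:  " + fs.getDefaultBlockSize(p));   // e.g. 64 MB = 67108864
        System.out.println("replication: " + fs.getDefaultReplication(p)); // e.g. 3
    }
}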
Reading:
1. the client (C) asks the NameNode (NN)
2. the NN returns to C the closest DataNode (DN) holding each block
3. C reads from that DN directly
Writing:
blocks are written one at a time, pipelined through all the target DNs
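Through the Java client API the read steps look like this; a minimal sketch (HdfsRead is a hypothetical name, reusing the /user/aaron/foo example):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NN for block locations; the returned stream
        // then reads each block from the closest DN directly
        try (FSDataInputStream in = fs.open(new Path("/user/aaron/foo"));
             BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}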
Scales from a single server to thousands of servers.
The NameNode manages and maintains the HDFS directory tree and controls file read/write operations.
Many DataNodes store the data; a large cluster can have thousands of nodes.
Scalable
Hadoop distributes both computation and storage; to add capacity or compute power, just add DataNodes.
Economical
runs on commodity-grade servers
Flexible
a relational database requires a table schema (a fixed structure for each collection of data objects);
Hadoop stores schema-less, unstructured data, in any form and from different data sources
Reliable
distributed architecture with replicas
highly fault-tolerant: failures are detected and recovered from automatically
Streaming data access
HDFS targets batch processing, not real-time interactive use
Data pipelining:
each block is split into 64 KB packets, which are streamed through the DN pipeline
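The packet size is a client-side setting; a minimal sketch that prints it (assuming Hadoop 2.x, where the property is dfs.client-write-packet-size, default 64 KB; PacketSize is a hypothetical class name):

import org.apache.hadoop.conf.Configuration;

public class PacketSize {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 65536 bytes unless overridden in hdfs-site.xml
        System.out.println(conf.getInt("dfs.client-write-packet-size", 64 * 1024));
    }
}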
MapReduce:
Map (data transformation) -> Reduce (combines the results of the map)
(Internally) Shuffle sends each mapper's output to the right reducer
= the only communication between map and reduce tasks
MapReduce in Hadoop
1. A node may run multiple map/reduce tasks.
2. One map task per input split (a chunk of data; a split is typically 64 MB).
3. One reduce task per partition of the map output, e.g., partitioned by key range or by hashing.
– The map function is invoked once per input k-v pair; the reduce function once per distinct key.
4. Shuffling: distributes the intermediate k-v pairs to the right reducer.
• Begins as soon as a map task completes on a node.
• All intermediate k-v pairs with the same key are sent to the same reducer.
• The partitioning method is given by the Partitioner class (by default, hash the key).
5. Sorting: the k-v pairs from each mapper are sorted by key, so the framework groups the mappers' output k-v pairs by key.
6. Merging:
• The framework merges the sorted, partitioned map outputs: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Reducer.html
– Recall merge-sort and external sorting; the difference here is that the merge is by key.
7. Combiner: a local reducer; requires the reduce function to be commutative and associative.
E.g. Sum, Max, Min; non-e.g. Count, Average, Median.
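For reference, these steps line up with the standard tutorial WordCount (the same class the Execute commands below compile and run); sum is commutative and associative, so the reducer doubles as the combiner:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // map: called once per input k-v pair (here: byte offset, line of text)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }
    // reduce: called once per key, with all values for that key (after shuffle/sort/merge)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local reducer, step 7
        // default Partitioner is HashPartitioner (hashes the key, step 4)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}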
Execute:
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
bin/hadoop jar wc.jar WordCount input-hello output-hello
InputFormat defines a RecordReader (RR) for reading the map input
OutputFormat defines a RecordWriter for writing the job output
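For instance, the stock text formats can be set on a job explicitly; a minimal sketch (FormatConfig and withTextFormats are hypothetical names):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfig {
    public static Job withTextFormats(Configuration conf) throws Exception {
        Job job = Job.getInstance(conf, "text-formats");
        // TextInputFormat's RecordReader yields (byte offset, line) pairs
        job.setInputFormatClass(TextInputFormat.class);
        // TextOutputFormat's RecordWriter writes "key <TAB> value" lines
        job.setOutputFormatClass(TextOutputFormat.class);
        return job;
    }
}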
Join
Each relation is stored as a text file in its own input directory (R and S).
Dangling tuples: a key whose values come from only one relation: (a, [('R', b)]) => left dangling; (a, [('S', c)]) => right dangling.
Execute: bin/hadoop jar join.jar R S output
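A minimal sketch of this reduce-side join, assuming each relation is a CSV of (join-key, payload) lines; each tuple is tagged with the name of its input directory, and Join, TagMapper, and JoinReducer are illustrative names:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Join {
    // Tag each tuple with its relation, taken from the input directory name ("R" or "S").
    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String rel = ((FileSplit) ctx.getInputSplit()).getPath().getParent().getName();
            String[] f = value.toString().split(",", 2); // assumed layout: join-key,payload
            ctx.write(new Text(f[0]), new Text(rel + "," + f[1]));
        }
    }
    // Pair every R payload with every S payload for each key;
    // a dangling key leaves one list empty, so the nested loop emits nothing.
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> r = new ArrayList<>(), s = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                (f[0].equals("R") ? r : s).add(f[1]);
            }
            for (String b : r)
                for (String c : s)
                    ctx.write(key, new Text(b + "," + c));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "join");
        job.setJarByClass(Join.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0])); // R directory
        FileInputFormat.addInputPath(job, new Path(args[1])); // S directory
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}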