Hadoop--MapReduce review

Hadoop: large-scale distributed batch processing infrastructure
Modeled on Google's GFS, BigTable, and MapReduce


Scales from a single server to thousands of servers.
The NameNode manages and maintains the HDFS directory tree and controls file read/write operations.
Multiple DataNodes store the data; a large cluster can have thousands of nodes.

Scalable
Hadoop distributes both computation and storage; to expand capacity or compute power, just add DataNodes.

Economical
Runs on ordinary commodity-grade servers.

Flexible
A relational database requires a schema for each table (a collection of data objects).
Data stored in Hadoop is schema-less and unstructured: it can take any form and come from different data sources.

Reliable
Distributed architecture with replicas.
Highly fault-tolerant: failures are detected and recovered from automatically.

Streaming data access
HDFS is built for batch processing, not real-time interactive access.



Hadoop architecture
HDFS (high reliability)
MapReduce (a simplified programming model for parallel, distributed computation)

HDFS
distributes data among multiple DataNodes
data is replicated (2 or 3 copies)
a read request can go to any replica



1. 1 NameNode, storing metadata:
the namespace hierarchy: an inode per directory and file
attributes of directories and files: permissions, access time, modification time
the mapping of files to blocks on DataNodes

2. n DataNodes, storing the contents (blocks) of files:
E.g., file /user/aaron/foo consists of blocks 1, 2, and 4; block 1 is stored on DataNodes 1 and 3

3. 1 Secondary NameNode
takes checkpoints of the NameNode's state, for recovery
Note: in a single-machine setup, all nodes run on one machine


Block size: 64MB (vs. a 4KB disk block)

Large blocks:
reduce the amount of metadata
enable fast streaming reads (each block is laid out sequentially on disk)
good for largely sequential reads

Reading:
1. client (C) asks the NameNode (NN) for the file's block locations
2. NN returns to C the closest DataNode (DN) holding each block
3. C reads from the DN directly
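
A minimal sketch of this read path with the HDFS Java client API; the path reuses the /user/aaron/foo example above, and the block-location lookup against the NN happens inside fs.open():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);       // client-side handle to HDFS
        // open() asks the NN for block locations (steps 1-2)
        FSDataInputStream in = fs.open(new Path("/user/aaron/foo"));
        // bytes are then streamed directly from the DNs (step 3)
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}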

Writing:
blocks are written one at a time, in pipeline fashion through all target DNs

Data pipelining:
each block is split into 64KB packets, which are sent along the DN pipeline
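
A matching sketch of the write path (the file name is hypothetical): the client sees a plain output stream, while the DFS client code underneath splits the data into 64KB packets and forwards them down the DN pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/aaron/bar"));
        out.writeUTF("hello hdfs");  // buffered, packetized, and pipelined internally
        out.close();                 // flushes remaining packets through the pipeline
        fs.close();
    }
}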



MapReduce:

Map (data transformation) + Reduce (combines the results of the maps)
(Internally) Shuffle sends each mapper's output to the right reducer;
the shuffle is the only communication between map and reduce tasks
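
A minimal sketch of the classic WordCount map and reduce functions, following the standard Hadoop MapReduce API (class names are mine): the mapper transforms each line into (word, 1) pairs, the shuffle routes all pairs with the same word to one reducer call, and the reducer combines them into a count.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFns {
    // Map: transform each input line into (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // emitted pairs enter the shuffle
            }
        }
    }
    // Reduce: the shuffle delivers (word, [1, 1, ...]) to one call per key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}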






MapReduce in Hadoop
1. A node may run multiple map/reduce tasks.
2. One map task per input split (chunk of data; a split is typically 64MB).
3. One reduce task per partition of the map output, e.g., partitioned by key range or by hashing.
The reduce function is invoked once per distinct key; the map function once per input k-v pair.
4. Shuffling (distributes intermediate k-v pairs to the right reducer):
begins as soon as a map task completes on a node;
all intermediate k-v pairs with the same key are sent to the same reducer;
the partitioning method lives in the Partitioner class (default: hash the key).
5. Sorting: the k-v pairs from each mapper are sorted by key.
The framework groups the mapper output k-v pairs by key.
6. Merging:
the framework merges the sorted runs https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Reducer.html
Recall merge-sort and external sorting.
Difference: here the merge is by key.

7. Combiner: a local reducer; applicable when the reduce function is commutative and associative.
Examples: sum, max, min. Non-examples: count, average, median (a combiner's partial counts would have to be summed, not re-counted, and partial averages/medians don't compose); see the sketch after this list.
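
A hedged sketch of items 4 and 7: a custom Partitioner that hashes the key (mirroring what the default HashPartitioner already does) plus the combiner wiring, reusing IntSumReducer from the WordCount sketch above since sum is commutative and associative. Class names are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // same key => same partition => same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver:
//   job.setPartitionerClass(WordPartitioner.class);
//   job.setCombinerClass(WordCountFns.IntSumReducer.class);  // local pre-aggregation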




Execute: 
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
bin/hadoop jar wc.jar WordCount input-hello output-hello 
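
The commands above assume a WordCount driver class; a sketch of what that main() typically looks like with the standard Job API, reusing the mapper/reducer sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountFns.TokenizerMapper.class);
        job.setCombinerClass(WordCountFns.IntSumReducer.class);  // sum is comm. + assoc.
        job.setReducerClass(WordCountFns.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., input-hello
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g., output-hello
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}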



InputFormat defines an instance of RecordReader (RR)

FileInputFormat
Takes paths to files
Read all files in the paths
Divide each file into one or more InputSplits
Subclasses:
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat 
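
Selecting one of these subclasses is a single driver call; a sketch (method name is mine) switching from the default TextInputFormat, which yields (byte offset, line) pairs, to KeyValueTextInputFormat, which splits each line at the first tab into a (Text, Text) pair:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatChoice {
    static void configure(Job job) {
        // each input line "key\tvalue" becomes one (Text key, Text value) record
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}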

OutputFormat defines a RecordWriter

Defined in the Java interface OutputFormat.
All Reducers write to the same directory; each writes a separate file, named part-r-nnnnn
r: output from a Reducer; nnnnn: partition id associated with the reduce task
E.g., with two reduce tasks, two output files:
part-r-00000 for partition 00000 and part-r-00001 for partition 00001
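
The number of part-r-nnnnn files equals the number of reduce tasks; a one-line sketch forcing the two-file layout above:

import org.apache.hadoop.mapreduce.Job;

public class OutputLayout {
    static void configure(Job job) {
        job.setNumReduceTasks(2);  // produces part-r-00000 and part-r-00001
    }
}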





Join
Dangling tuples: k-v pairs coming from only one relation: (a, [('R', b)]) => left dangling; (a, [('S', c)]) => right dangling


Each relation is stored as a text file in its own input directory (R and S); see the sketch below.
Execute: bin/hadoop jar join.jar R S output
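
A hedged sketch of the reduce-side join these notes describe: each mapper tags a tuple with the name of its input directory ('R' or 'S'), the reducer collects both tag lists per join key and emits their cross product, and dangling tuples (one list empty) produce no output. The comma-separated, key-first record layout is an assumption.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {
    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // tag = parent directory of this split's file: "R" or "S"
            String tag = ((FileSplit) ctx.getInputSplit())
                    .getPath().getParent().getName();
            String[] f = value.toString().split(",", 2);  // (joinKey, rest-of-tuple)
            ctx.write(new Text(f[0]), new Text(tag + "," + f[1]));
        }
    }
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> rs = new ArrayList<>(), ss = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                (f[0].equals("R") ? rs : ss).add(f[1]);
            }
            // if either list is empty (dangling tuple), nothing is emitted
            for (String b : rs)
                for (String c : ss)
                    ctx.write(key, new Text(b + "," + c));
        }
    }
}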
