Hadoop--MapReduce review

Hadoop: large-scale distributed batch processing infrastructure
Modeled on Google's GFS, BigTable, and MapReduce


Scales from a single server to thousands of servers.
The NameNode manages and maintains the HDFS directory tree and controls file read/write operations.
Multiple DataNodes store the data; a large cluster can have thousands of nodes.

Scalable
Hadoop distributes both computation and storage; to expand capacity or compute power, just add DataNodes.

Economical
Runs on ordinary commodity-grade servers.

Flexible
A relational database requires a schema for each table (a collection of data objects).
Data stored in Hadoop is schema-less and unstructured: it can take any form and come from different data sources.

Reliable
Distributed architecture with replicas.
Highly fault-tolerant: failures are detected and recovered from automatically.

Streaming data access
HDFS is built for batch processing, not real-time interactive access.



Hadoop architecture
HDFS (high reliability)
MapReduce (a simplified programming model for parallel, distributed computation)

HDFS
distributes data among multiple DataNodes
data is replicated (2 or 3 copies)
a read request can go to any replica



1. 1 NameNode, storing metadata:
the namespace hierarchy: an inode per directory and file
attributes of directories and files: permissions, access time, modification time
the mapping of files to blocks on DataNodes

2. n DataNodes, storing the contents (blocks) of files:
E.g., file /user/aaron/foo consists of blocks 1, 2, and 4; block 1 is stored on DataNodes 1 and 3

3. 1 Secondary NameNode
takes checkpoints of the NameNode's state, for recovery
Note: in a single-machine setup, all nodes run on one machine


Block size: 64MB (vs. a 4KB disk block)

Large blocks:
reduce the amount of metadata
enable fast streaming reads (each block is laid out sequentially on disk)
good for largely sequential reads

Reading:
1. client (C) asks the NameNode (NN) for the file's block locations
2. NN returns to C the closest DataNode (DN) holding each block
3. C reads from the DN directly
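
A minimal sketch of this read path with the HDFS Java client API; the path reuses the /user/aaron/foo example above, and the block-location lookup against the NN happens inside fs.open():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc.
        FileSystem fs = FileSystem.get(conf);       // client-side handle to HDFS
        // open() asks the NN for block locations (steps 1-2)
        FSDataInputStream in = fs.open(new Path("/user/aaron/foo"));
        // bytes are then streamed directly from the DNs (step 3)
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();
        fs.close();
    }
}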

Writing:
blocks are written one at a time, in pipeline fashion through all target DNs

Data pipelining:
each block is split into 64KB packets, which are sent along the DN pipeline
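
A matching sketch of the write path (the file name is hypothetical): the client sees a plain output stream, while the DFS client code underneath splits the data into 64KB packets and forwards them down the DN pipeline:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/aaron/bar"));
        out.writeUTF("hello hdfs");  // buffered, packetized, and pipelined internally
        out.close();                 // flushes remaining packets through the pipeline
        fs.close();
    }
}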



MapReduce:

Map (data transformation) + Reduce (combines the results of the maps)
(Internally) Shuffle sends each mapper's output to the right reducer;
the shuffle is the only communication between map and reduce tasks
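
A minimal sketch of the classic WordCount map and reduce functions, following the standard Hadoop MapReduce API (class names are mine): the mapper transforms each line into (word, 1) pairs, the shuffle routes all pairs with the same word to one reducer call, and the reducer combines them into a count.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountFns {
    // Map: transform each input line into (word, 1) pairs
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // emitted pairs enter the shuffle
            }
        }
    }
    // Reduce: the shuffle delivers (word, [1, 1, ...]) to one call per key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }
}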






MapReduce in Hadoop
1. A node may run multiple map/reduce tasks.
2. One map task per input split (chunk of data; a split is typically 64MB).
3. One reduce task per partition of the map output, e.g., partitioned by key range or by hashing.
The reduce function is invoked once per distinct key; the map function once per input k-v pair.
4. Shuffling (distributes intermediate k-v pairs to the right reducer):
begins as soon as a map task completes on a node;
all intermediate k-v pairs with the same key are sent to the same reducer;
the partitioning method lives in the Partitioner class (default: hash the key).
5. Sorting: the k-v pairs from each mapper are sorted by key.
The framework groups the mapper output k-v pairs by key.
6. Merging:
the framework merges the sorted runs https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Reducer.html
Recall merge-sort and external sorting.
Difference: here the merge is by key.

7. Combiner: a local reducer; applicable when the reduce function is commutative and associative.
Examples: sum, max, min. Non-examples: count, average, median (a combiner's partial counts would have to be summed, not re-counted, and partial averages/medians don't compose); see the sketch after this list.
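
A hedged sketch of items 4 and 7: a custom Partitioner that hashes the key (mirroring what the default HashPartitioner already does) plus the combiner wiring, reusing IntSumReducer from the WordCount sketch above since sum is commutative and associative. Class names are illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // same key => same partition => same reducer
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

// In the driver:
//   job.setPartitionerClass(WordPartitioner.class);
//   job.setCombinerClass(WordCountFns.IntSumReducer.class);  // local pre-aggregation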




Execute: 
bin/hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
bin/hadoop jar wc.jar WordCount input-hello output-hello 
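
The commands above assume a WordCount driver class; a sketch of what that main() typically looks like with the standard Job API, reusing the mapper/reducer sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountFns.TokenizerMapper.class);
        job.setCombinerClass(WordCountFns.IntSumReducer.class);  // sum is comm. + assoc.
        job.setReducerClass(WordCountFns.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g., input-hello
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g., output-hello
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}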



InputFormat defines an instance of RecordReader (RR)

FileInputFormat
Takes paths to files
Read all files in the paths
Divide each file into one or more InputSplits
Subclasses:
TextInputFormat
KeyValueTextInputFormat
SequenceFileInputFormat 
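
Selecting one of these subclasses is a single driver call; a sketch (method name is mine) switching from the default TextInputFormat, which yields (byte offset, line) pairs, to KeyValueTextInputFormat, which splits each line at the first tab into a (Text, Text) pair:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatChoice {
    static void configure(Job job) {
        // each input line "key\tvalue" becomes one (Text key, Text value) record
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}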

OutputFormat defines a RecordWriter

Defined in the Java interface OutputFormat.
All Reducers write to the same directory; each writes a separate file, named part-r-nnnnn
r: output from a Reducer; nnnnn: partition id associated with the reduce task
E.g., with two reduce tasks, two output files:
part-r-00000 for partition 00000 and part-r-00001 for partition 00001
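
The number of part-r-nnnnn files equals the number of reduce tasks; a one-line sketch forcing the two-file layout above:

import org.apache.hadoop.mapreduce.Job;

public class OutputLayout {
    static void configure(Job job) {
        job.setNumReduceTasks(2);  // produces part-r-00000 and part-r-00001
    }
}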





Join
Dangling tuples: k-v pairs coming from only one relation: (a, [('R', b)]) => left dangling; (a, [('S', c)]) => right dangling


Each relation is stored as a text file in its own input directory (R and S); see the sketch below.
Execute: bin/hadoop jar join.jar R S output
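
A hedged sketch of the reduce-side join these notes describe: each mapper tags a tuple with the name of its input directory ('R' or 'S'), the reducer collects both tag lists per join key and emits their cross product, and dangling tuples (one list empty) produce no output. The comma-separated, key-first record layout is an assumption.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {
    public static class TagMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // tag = parent directory of this split's file: "R" or "S"
            String tag = ((FileSplit) ctx.getInputSplit())
                    .getPath().getParent().getName();
            String[] f = value.toString().split(",", 2);  // (joinKey, rest-of-tuple)
            ctx.write(new Text(f[0]), new Text(tag + "," + f[1]));
        }
    }
    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> rs = new ArrayList<>(), ss = new ArrayList<>();
            for (Text v : values) {
                String[] f = v.toString().split(",", 2);
                (f[0].equals("R") ? rs : ss).add(f[1]);
            }
            // if either list is empty (dangling tuple), nothing is emitted
            for (String b : rs)
                for (String c : ss)
                    ctx.write(key, new Text(b + "," + c));
        }
    }
}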
