Hadoop Ecosystem

Resources

Apache Hadoop Ecosystem

This course is fairly easy for anyone who has already taken 551 and 553; the assignments are also covered by 551/553.
However, 551 and 553 focus more on actually running programs, while this course extends the overall knowledge framework a little.
The underlined parts are topics I am interested in but have not practiced yet.

A few extracts from the above resources




Hadoop stack 
Move computation to the data 
Schema-on-read style
Apache framework basic modules (listed below)

Reliability of applications
Pig and Hive expose high-level interfaces on top of MapReduce 
Jobs can also be written in Java or C++, or driven from the command line with shell scripts 

Apache Sqoop 
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases 
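A minimal sketch of what such a transfer can look like, assuming a MySQL source database and the standard `sqoop import` flags; the connection string, table name, and HDFS path below are placeholders, not from the course:

```python
import subprocess

# Hypothetical example: pull the `orders` table from a MySQL database into HDFS.
# Assumes sqoop is installed and on the PATH; all connection details are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",  # source relational database
        "--username", "etl_user",
        "-P",                                      # prompt for the password interactively
        "--table", "orders",                       # table to import
        "--target-dir", "/user/hadoop/orders",     # destination directory in HDFS
        "--num-mappers", "4",                      # parallel map tasks doing the copy
    ],
    check=True,
)
```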



HBASE 
  • Column-oriented database management system
  • Key-value store
  • Based on Google Bigtable
  • Can hold extremely large data
  • Dynamic data model
  • Not a Relational DBMS
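As a sketch of the key-value / column-family data model, here is what reads and writes might look like through the happybase Python client; the host, table name, and column family `cf` are assumptions, and the table is assumed to already exist:

```python
import happybase

# Connect to an HBase Thrift server (host is a placeholder).
connection = happybase.Connection("hbase-host")
table = connection.table("users")  # assumes a 'users' table with column family 'cf'

# Write: each cell is addressed by (row key, column family:qualifier).
table.put(b"user#1001", {b"cf:name": b"Ada", b"cf:city": b"London"})

# Read a single row back by key -- the typical key-value access pattern.
row = table.row(b"user#1001")
print(row[b"cf:name"])

# The data model is dynamic: a new column qualifier needs no schema change.
table.put(b"user#1001", {b"cf:signup_year": b"2016"})
```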
PIG 
High-level programming on top of Hadoop MapReduce
Expresses data analysis problems as data flows
The language: Pig Latin
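A sketch of what such a data flow looks like: the classic word-count script in Pig Latin, wrapped in a small Python helper that writes it to disk and runs Pig in local mode. The file names and input path are assumptions for illustration:

```python
import subprocess

# The Pig Latin script: each statement defines one step of the data flow.
WORDCOUNT_PIG = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

# Save the script and run it with Pig in local mode (assumes `pig` is installed).
with open("wordcount.pig", "w") as f:
    f.write(WORDCOUNT_PIG)
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```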

Apache Hive 
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage (HDFS) 
Provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL 
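A minimal sketch of issuing a HiveQL query from Python, assuming a HiveServer2 endpoint and the PyHive client; the host, table, and column names here are hypothetical:

```python
from pyhive import hive

# Connect to HiveServer2 (host/port/username are placeholders).
conn = hive.Connection(host="hive-host", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# HiveQL reads like SQL; Hive compiles the query into jobs over data stored in HDFS.
cursor.execute(
    "SELECT city, COUNT(*) AS visits "
    "FROM page_views "
    "GROUP BY city "
    "ORDER BY visits DESC "
    "LIMIT 10"
)
for city, visits in cursor.fetchall():
    print(city, visits)
```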


Oozie 
Workflow scheduler system to manage Apache Hadoop jobs 
Oozie Coordinator jobs are recurrent workflow jobs triggered by time (frequency) and data availability 
Supports MapReduce, Pig, Apache Hive, Sqoop, etc. 

Zookeeper 
Provides operational services for a Hadoop cluster 
Centralized service for:
1. maintaining configuration information 
2. naming 
3. providing distributed synchronization 
4. providing group services 
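A small sketch of these services through the kazoo Python client: storing a piece of shared configuration under a znode and using a lock recipe for distributed synchronization. The host and paths are assumptions:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Configuration / naming: keep a small piece of shared state under a well-known path.
zk.ensure_path("/app/config")
zk.create("/app/config/db_url", b"jdbc:mysql://dbhost/sales")
value, stat = zk.get("/app/config/db_url")
print(value, stat.version)

# Distributed synchronization: a lock recipe built on znodes.
lock = zk.Lock("/app/locks/rebuild-index", "worker-1")
with lock:  # blocks until this client holds the lock
    pass    # do the work that must not run concurrently

zk.stop()
```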

Flume 
Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data 

Impala (additional Cloudera Hadoop component)
Cloudera's open source massively parallel processing (MPP) SQL query engine for Apache Hadoop 


Spark Benefits 
Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications 
Allows user programs to load data into a cluster's memory and query it repeatedly 
Well-suited to machine learning!!! 
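A minimal PySpark sketch of the "load into memory and query repeatedly" pattern; the input path is a placeholder and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load a dataset and pin it in cluster memory.
logs = spark.read.text("hdfs:///data/logs/*.log")  # placeholder path
logs.cache()

# Repeated queries now hit the in-memory copy instead of re-reading from disk.
errors = logs.filter(logs.value.contains("ERROR")).count()
warnings = logs.filter(logs.value.contains("WARN")).count()
print(errors, warnings)

spark.stop()
```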




Apache framework basic modules

  • Hadoop Common: The common utilities that support the other Hadoop modules. 
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
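To make the MapReduce module concrete, here is a toy, single-process simulation of the word-count data flow (map, shuffle/sort, reduce). It only illustrates the programming model; on a real cluster the map and reduce functions run as YARN-scheduled tasks over HDFS blocks, and the example data below is made up:

```python
from collections import defaultdict

def map_fn(_, line):
    # map: one input record -> a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce: one key plus all of its values -> one output record
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: every input record becomes a list of (key, value) pairs.
intermediate = []
for doc_id, doc in enumerate(documents):
    intermediate.extend(map_fn(doc_id, doc))

# Shuffle phase: group all values that share a key (the framework does this between phases).
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: each key and its grouped values produce one output record.
for word, counts in sorted(grouped.items()):
    print(reduce_fn(word, counts))
```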

Other Hadoop-related projects

  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables. 
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.


Extensions

  • Ad-hoc query: http://www.learn.geekinterview.com/data-warehouse/dw-basics/what-is-an-ad-hoc-query.html 
  • “An Ad-Hoc Query is a query that cannot be determined prior to the moment the query is issued. It is created in order to get information when the need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. This is in contrast to any query which is predefined and performed routinely.”
  • DynamoDB is not good for... 
• Ad-hoc queries – since it does not have a query language like SQL and does not support joins
• OLAP – requires joining fact and dimension tables
• BLOB (binary large object) storage, e.g. images and videos – better suited to Amazon S3
  • Dimension tables: https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
  • EXTRACTS:
  • “A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
    In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures. The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labelling.
    These functions are often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data.”

    “In data warehousing, a dimension table is one of the set of companion tables to a fact table.
    1) The fact table contains business facts (or measures), and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
    2) Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text). These attributes are designed to serve two critical purposes: query constraining and/or filtering, and query result set labeling.”
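As a concrete illustration of the fact/dimension split and of "slice and dice" (and of the kind of join-heavy OLAP query that a key-value store like DynamoDB is not designed for), here is a tiny star schema in SQLite; the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: descriptive attributes used for filtering, grouping and labelling.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT)")

# Fact table: numeric measures plus foreign keys into the dimension tables.
cur.execute("""
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    )
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "2016-01-01", "2016-01"), (2, "2016-02-01", "2016-02")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 1100.0), (2, 1, 300.0)])

# "Slice" (filter on a dimension) and "dice" (group by dimensions), joining fact to dimensions.
cur.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date    d ON f.date_id    = d.date_id
    WHERE p.category = 'Electronics'     -- slice
    GROUP BY p.category, d.month         -- dice
""")
print(cur.fetchall())
```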














