Hadoop Ecosystem

Resources

Apache Hadoop Ecosystem

This course is fairly easy for anyone who has already taken 551 and 553; the assignments are also covered by 551/553.
However, 551 and 553 focus more on actually running programs, while this course extends the overall knowledge framework a little.
The underlined parts are topics I am interested in but have not practiced yet.

A few extracts from the above resources




Hadoop stack 
Move computation to the data 
Schema-on-read style
Apache framework basic modules (listed below)

Reliability of applications
Pig and Hive expose high-level interfaces on top of MapReduce 
Jobs can also be written in Java or C++, or driven from the command line with shell scripts 

Apache Sqoop 
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases 
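A minimal sketch of what such a transfer can look like, assuming a MySQL source database and the standard `sqoop import` flags; the connection string, table name, and HDFS path below are placeholders, not from the course:

```python
import subprocess

# Hypothetical example: pull the `orders` table from a MySQL database into HDFS.
# Assumes sqoop is installed and on the PATH; all connection details are placeholders.
subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",  # source relational database
        "--username", "etl_user",
        "-P",                                      # prompt for the password interactively
        "--table", "orders",                       # table to import
        "--target-dir", "/user/hadoop/orders",     # destination directory in HDFS
        "--num-mappers", "4",                      # parallel map tasks doing the copy
    ],
    check=True,
)
```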



HBASE 
  • Column-oriented database management system
  • Key-value store
  • Based on Google Bigtable
  • Can hold extremely large data
  • Dynamic data model
  • Not a Relational DBMS
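As a sketch of the key-value / column-family data model, here is what reads and writes might look like through the happybase Python client; the host, table name, and column family `cf` are assumptions, and the table is assumed to already exist:

```python
import happybase

# Connect to an HBase Thrift server (host is a placeholder).
connection = happybase.Connection("hbase-host")
table = connection.table("users")  # assumes a 'users' table with column family 'cf'

# Write: each cell is addressed by (row key, column family:qualifier).
table.put(b"user#1001", {b"cf:name": b"Ada", b"cf:city": b"London"})

# Read a single row back by key -- the typical key-value access pattern.
row = table.row(b"user#1001")
print(row[b"cf:name"])

# The data model is dynamic: a new column qualifier needs no schema change.
table.put(b"user#1001", {b"cf:signup_year": b"2016"})
```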
PIG 
High-level programming on top of Hadoop MapReduce
Expresses data analysis problems as data flows
The language: Pig Latin
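A sketch of what such a data flow looks like: the classic word-count script in Pig Latin, wrapped in a small Python helper that writes it to disk and runs Pig in local mode. The file names and input path are assumptions for illustration:

```python
import subprocess

# The Pig Latin script: each statement defines one step of the data flow.
WORDCOUNT_PIG = """
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
"""

# Save the script and run it with Pig in local mode (assumes `pig` is installed).
with open("wordcount.pig", "w") as f:
    f.write(WORDCOUNT_PIG)
subprocess.run(["pig", "-x", "local", "wordcount.pig"], check=True)
```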

Apache Hive 
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage (HDFS) 
Provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL 
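A minimal sketch of issuing a HiveQL query from Python, assuming a HiveServer2 endpoint and the PyHive client; the host, table, and column names here are hypothetical:

```python
from pyhive import hive

# Connect to HiveServer2 (host/port/username are placeholders).
conn = hive.Connection(host="hive-host", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

# HiveQL reads like SQL; Hive compiles the query into jobs over data stored in HDFS.
cursor.execute(
    "SELECT city, COUNT(*) AS visits "
    "FROM page_views "
    "GROUP BY city "
    "ORDER BY visits DESC "
    "LIMIT 10"
)
for city, visits in cursor.fetchall():
    print(city, visits)
```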


Oozie 
Workflow scheduler system to manage Apache Hadoop jobs 
Oozie Coordinator jobs are recurrent workflow jobs triggered by time (frequency) and data availability 
Supports MapReduce, Pig, Apache Hive, Sqoop, etc. 

Zookeeper 
Provides operational services for a Hadoop cluster 
Centralized service for:
1. maintaining configuration information 
2. naming 
3. providing distributed synchronization 
4. providing group services 
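A small sketch of these services through the kazoo Python client: storing a piece of shared configuration under a znode and using a lock recipe for distributed synchronization. The host and paths are assumptions:

```python
from kazoo.client import KazooClient

# Connect to a ZooKeeper ensemble (address is a placeholder).
zk = KazooClient(hosts="zk-host:2181")
zk.start()

# Configuration / naming: keep a small piece of shared state under a well-known path.
zk.ensure_path("/app/config")
zk.create("/app/config/db_url", b"jdbc:mysql://dbhost/sales")
value, stat = zk.get("/app/config/db_url")
print(value, stat.version)

# Distributed synchronization: a lock recipe built on znodes.
lock = zk.Lock("/app/locks/rebuild-index", "worker-1")
with lock:  # blocks until this client holds the lock
    pass    # do the work that must not run concurrently

zk.stop()
```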

Flume 
Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data 

Impala (additional Cloudera Hadoop component)
Cloudera's open source massively parallel processing (MPP) SQL query engine for Apache Hadoop 


Spark Benefits 
Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications 
Allows user programs to load data into a cluster's memory and query it repeatedly 
Well-suited to machine learning!!! 
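A minimal PySpark sketch of the "load into memory and query repeatedly" pattern; the input path is a placeholder and a local Spark installation is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load a dataset and pin it in cluster memory.
logs = spark.read.text("hdfs:///data/logs/*.log")  # placeholder path
logs.cache()

# Repeated queries now hit the in-memory copy instead of re-reading from disk.
errors = logs.filter(logs.value.contains("ERROR")).count()
warnings = logs.filter(logs.value.contains("WARN")).count()
print(errors, warnings)

spark.stop()
```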




Apache framework basic modules

  • Hadoop Common: The common utilities that support the other Hadoop modules. 
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
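To make the MapReduce module concrete, here is a toy, single-process simulation of the word-count data flow (map, shuffle/sort, reduce). It only illustrates the programming model; on a real cluster the map and reduce functions run as YARN-scheduled tasks over HDFS blocks, and the example data below is made up:

```python
from collections import defaultdict

def map_fn(_, line):
    # map: one input record -> a list of (key, value) pairs
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # reduce: one key plus all of its values -> one output record
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: every input record becomes a list of (key, value) pairs.
intermediate = []
for doc_id, doc in enumerate(documents):
    intermediate.extend(map_fn(doc_id, doc))

# Shuffle phase: group all values that share a key (the framework does this between phases).
grouped = defaultdict(list)
for word, count in intermediate:
    grouped[word].append(count)

# Reduce phase: each key and its grouped values produce one output record.
for word, counts in sorted(grouped.items()):
    print(reduce_fn(word, counts))
```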

Other Hadoop-related projects

  • Cassandra™: A scalable multi-master database with no single points of failure.
  • Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
  • ZooKeeper™: A high-performance coordination service for distributed applications.
  • HBase™: A scalable, distributed database that supports structured data storage for large tables. 
  • Pig™: A high-level data-flow language and execution framework for parallel computation.
  • Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.


Extensions

  • Ad-hoc query: http://www.learn.geekinterview.com/data-warehouse/dw-basics/what-is-an-ad-hoc-query.html 
  • “An Ad-Hoc Query is a query that cannot be determined prior to the moment the query is issued. It is created in order to get information when the need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. This is in contrast to any query which is predefined and performed routinely.”
  • DynamoDB is not good for... 
• Ad-hoc queries – since it does not have a query language like SQL and does not support joins
• OLAP – requires joining fact and dimension tables
• BLOB (binary large object) storage, e.g. images and videos – better suited to Amazon S3
  • Dimension tables: https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
  • EXTRACTS:
  • “A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
    In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures. The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labelling.
    These functions are often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data.”

    “In data warehousing, a dimension table is one of the set of companion tables to a fact table.
    1) The fact table contains business facts (or measures), and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.
    2) Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text). These attributes are designed to serve two critical purposes: query constraining and/or filtering, and query result set labeling.”
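As a concrete illustration of the fact/dimension split and of "slice and dice" (and of the kind of join-heavy OLAP query that a key-value store like DynamoDB is not designed for), here is a tiny star schema in SQLite; the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: descriptive attributes used for filtering, grouping and labelling.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT)")

# Fact table: numeric measures plus foreign keys into the dimension tables.
cur.execute("""
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id    INTEGER REFERENCES dim_date(date_id),
        amount     REAL
    )
""")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                [(1, "2016-01-01", "2016-01"), (2, "2016-02-01", "2016-02")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 1200.0), (1, 2, 1100.0), (2, 1, 300.0)])

# "Slice" (filter on a dimension) and "dice" (group by dimensions), joining fact to dimensions.
cur.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date    d ON f.date_id    = d.date_id
    WHERE p.category = 'Electronics'     -- slice
    GROUP BY p.category, d.month         -- dice
""")
print(cur.fetchall())
```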














