Hadoop Ecosystem
Resources
Apache Hadoop Ecosystem
This course is fairly easy for anyone who has taken 551 and 553, and the assignments are also covered by 551 and 553.
But 551 and 553 lean more toward program execution; this course adds a modest extension of the knowledge framework.
The underlined parts are topics I am interested in but have not practiced.
Hadoop stack
Move computation to data
Schema-on-read style
Apache framework basic modules
Reliability of applications: the framework is designed to detect and handle failures at the application layer
Pig and Hive expose high-level interfaces on top of MapReduce
Interfaces: Java, C++, the command line, and shell scripts
Apache Sqoop
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases
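As a rough illustration (my addition, not from the course), a minimal sketch of a Sqoop import driven from Java, assuming Sqoop 1.x's `org.apache.sqoop.Sqoop.runTool` entry point; the connection string, credentials, table, and paths are all hypothetical. The same arguments work verbatim with the `sqoop import` command line.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` from the shell.
        // Connection string, credentials, table, and paths are hypothetical.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db.example.com/sales",
            "--username", "etl_user",
            "--table", "orders",              // source table in the RDBMS
            "--target-dir", "/data/orders",   // destination directory in HDFS
            "--num-mappers", "4"              // parallel map tasks for the import
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}
```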
HBASE
- Column-oriented database management system
- Key-value store
- Based on Google Big Table
- Can hold extremely large data
- Dynamic data model
- Not a Relational DBMS
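A minimal sketch (my addition) of the key-value access pattern through the HBase Java client; the table, column family, and row key are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                          Bytes.toBytes("Los Angeles"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```

The dynamic data model shows up here: new column qualifiers can be written at any time without declaring a schema first.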
PIG
High-level programming on top of Hadoop MapReduce
The language: Pig Latin
Expresses data analysis problems as data flows
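A minimal word-count sketch (my addition) using Pig's embedded Java API, PigServer; the input file and aliases are hypothetical. Each registered statement is Pig Latin, and Pig compiles the data flow into MapReduce jobs:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCountSketch {
    public static void main(String[] args) throws Exception {
        // Run locally; use ExecType.MAPREDUCE to execute on a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each statement below is Pig Latin describing one step of the data flow.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        pig.store("counts", "wordcount_out");  // writes the result directory
    }
}
```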
Apache Hive
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage (HDFS)
SQL-like language!
Mechanism to project structure onto this data and query it using a SQL-like language called HiveQL
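A rough sketch (my addition) of issuing HiveQL from Java over JDBC against HiveServer2; the host, user, table, and columns are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL but runs as jobs over data in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT city, COUNT(*) AS n FROM page_views GROUP BY city");
            while (rs.next()) {
                System.out.println(rs.getString("city") + "\t" + rs.getLong("n"));
            }
        }
    }
}
```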
Oozie
Workflow scheduler system to manage Apache Hadoop jobs
Oozie Coordinator jobs trigger recurring workflows by time (frequency) and data availability!
Supports MapReduce, Pig, Apache Hive, and Sqoop, etc.
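A minimal submission sketch (my addition) using the Oozie Java client; the server URL and HDFS application path are hypothetical, and the referenced workflow.xml is assumed to exist:

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Oozie server URL and HDFS paths are hypothetical.
        OozieClient oozie = new OozieClient("http://oozie.example.com:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // Points at a workflow.xml that chains MapReduce/Pig/Hive/Sqoop actions.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/etl/workflows/daily");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        String jobId = oozie.run(conf);  // submit and start the workflow
        System.out.println("Submitted workflow job: " + jobId);
    }
}
```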
Zookeeper
Provides operational services for a Hadoop cluster
Centralized service for:
1. maintaining configuration information
2. naming
3. providing distributed synchronization
4. providing group services
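A minimal sketch (my addition) of the centralized-configuration role using the ZooKeeper Java client; the ensemble address, znode path, and value are assumptions:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is an assumption).
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> {});
        String path = "/app-config";  // a znode holding shared configuration
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch-size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        // Every node in the cluster can read (and watch) the same value,
        // which is what "centralized configuration" means in practice.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```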
Flume
Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
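A minimal sketch (my addition) of pushing one log event to a Flume agent via its RPC client API; the host and port are assumptions, and in practice log data usually arrives through sources configured in the agent (exec, syslog, etc.):

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeAppendSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a Flume agent's Avro source (host/port are assumptions).
        RpcClient client = RpcClientFactory.getDefaultInstance("flume.example.com", 41414);
        try {
            Event event = EventBuilder.withBody(
                "2024-01-01 12:00:00 INFO app started", StandardCharsets.UTF_8);
            client.append(event);  // deliver one log event to the agent
        } finally {
            client.close();
        }
    }
}
```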
Impala (additional Cloudera Hadoop component)
Cloudera's open source massively parallel processing (MPP) SQL query engine for Apache Hadoop
Spark Benefits
Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
Allows user programs to load data into a cluster's memory and query it repeatedly
Well-suited to machine learning!!!
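A minimal sketch (my addition) of the load-once, query-repeatedly pattern in Spark's Java API; the HDFS path is hypothetical:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load the data set once into cluster memory...
            JavaRDD<String> logs = sc.textFile("hdfs:///data/logs").cache();
            // ...then query it repeatedly without re-reading from disk.
            long errors = logs.filter(line -> line.contains("ERROR")).count();
            long warnings = logs.filter(line -> line.contains("WARN")).count();
            System.out.println(errors + " errors, " + warnings + " warnings");
        }
    }
}
```

This reuse of an in-memory data set is exactly why iterative machine-learning workloads benefit so much.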
Apache framework basic modules
- Hadoop Common: The common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
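To make the MapReduce module concrete, here is the canonical WordCount job (essentially the standard example from the Hadoop documentation, lightly trimmed):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation on mappers
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note how "move computation to data" plays out: the mapper code is shipped to the nodes holding the HDFS blocks rather than the data being shipped to the code.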
Other Hadoop-related projects
- Cassandra™: A scalable multi-master database with no single points of failure.
- Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Extensions
- Ad-hoc query: http://www.learn.geekinterview.com/data-warehouse/dw-basics/what-is-an-ad-hoc-query.html
- “An Ad-Hoc Query is a query that cannot be determined prior to the moment the query is issued. It is created in order to get information when the need arises, and it consists of dynamically constructed SQL, usually built by desktop-resident query tools. This is in contrast to any query which is predefined and performed routinely.”
- DynamoDB is not good for...
• Ad-hoc queries – since it does not have a query language like SQL and does not support joins
• OLAP – requires joining of fact and dimension tables
• BLOB (binary large object) storage – e.g., images, videos (better suited for Amazon S3)
- Dimension tables: https://en.wikipedia.org/wiki/Dimension_(data_warehouse)
- EXTRACTS:
- “A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions. Commonly used dimensions are people, products, place and time.
In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures. The dimension is a data set composed of individual, non-overlapping data elements. The primary functions of dimensions are threefold: to provide filtering, grouping and labelling.
These functions are often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data.”
“In data warehousing, a dimension table is one of the set of companion tables to a fact table. 1) The fact table contains business facts (or measures), and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables. 2) Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text). These attributes are designed to serve two critical purposes: query constraining and/or filtering, and query result set labeling.”
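To make the fact/dimension split concrete, a hypothetical star-schema query (all table and column names invented), written here as it might be issued from Java:

```java
public class StarSchemaSketch {
    public static void main(String[] args) {
        // The fact table `sales` holds the measure (amount) plus a foreign
        // key; the dimension `dim_product` supplies the descriptive labels.
        String starJoin =
            "SELECT d.category, SUM(f.amount) AS revenue "
            + "FROM sales f "                                        // fact table
            + "JOIN dim_product d ON f.product_key = d.product_key " // FK -> dimension
            + "WHERE d.category IN ('books', 'music') "              // slicing (filter)
            + "GROUP BY d.category";                                 // dicing (group)
        System.out.println(starJoin);  // would be run via JDBC against Hive, etc.
    }
}
```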
Worth searching for SMACK (the Spark, Mesos, Akka, Cassandra, Kafka stack).