Hadoop Ecosystem
by yunfei yangzhao
1. An Oozie Bundle provides a way to package multiple coordinator and workflow jobs and to manage the lifecycle of those jobs
2. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs), specifying a sequence of actions to execute; a Workflow job has to wait until its time or data-availability triggers are met before it runs (see Oozie, below)
3. Hive
3.1. SQL-like querying
3.2. A combiner can be used to optimize reducer performance (see the WordCount sketch under MapReduce)
3.3. Structured data warehousing
3.4. Partition columns, rather than indexes, prune the data scanned (see the query sketch below)
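As a rough sketch of what the SQL-like querying looks like from Java, here is a query sent over HiveServer2's JDBC interface. The host, port, credentials, and the page_views table (partitioned by dt) are all assumptions for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");  // older driver jars need an explicit load
        // HiveServer2 endpoint; host, port, and database are assumptions.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            // Restricting on the (hypothetical) partition column dt lets Hive
            // prune partitions instead of consulting an index.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views " +
                "WHERE dt = '2014-01-01' GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```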
4. Pig
4.1. Scripting for Hadoop
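A minimal sketch of driving Pig from Java through its PigServer API; the access.log path and its field layout are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigScriptExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE runs on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("counts = FOREACH by_ip GENERATE group AS ip, COUNT(logs) AS hits;");
        pig.store("counts", "hit_counts");  // triggers execution and writes the result
    }
}
```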
5. HBase
5.1. Non-relational
5.2. Column-family store
5.3. Transactional (atomic single-row) lookups
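A minimal sketch of a put-then-get against HBase through its standard client API (HBase 1.x style); the users table, info column family, and row key are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Single-row operations are atomic, which is what makes
            // low-latency "transactional lookups" possible.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                    Bytes.toBytes("ada@example.com"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
        }
    }
}
```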
6. Flume
6.1. Log collector
6.2. Integrates into Hadoop
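A minimal sketch of handing a log event to a Flume agent through its RPC client API; it assumes an agent running an Avro source on localhost:41414.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Host and port of the agent's Avro source are assumptions.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // The agent's channels and sinks move the event on into Hadoop.
            client.append(EventBuilder.withBody("app started", StandardCharsets.UTF_8));
        } finally {
            client.close();
        }
    }
}
```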
7. Oozie
7.1. Workflow processing
7.2. Links jobs
7.3. Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability.
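A minimal sketch of submitting a Workflow job through Oozie's Java client; the server URL, HDFS application path, and cluster endpoints are assumptions. Coordinator and Bundle jobs are submitted the same way, using OozieClient.COORDINATOR_APP_PATH or OozieClient.BUNDLE_APP_PATH.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // Points at a workflow.xml directory on HDFS (hypothetical path).
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/me/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");
        String jobId = oozie.run(conf);  // submits and starts the workflow DAG
        System.out.println(jobId + " -> " + oozie.getJobInfo(jobId).getStatus());
    }
}
```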
8. Avro
8.1. Data parsing
8.2. Binary data serialization
8.3. RPC
8.4. Language-neutral
8.5. Optional code generation
8.6. Schema evolution
8.7. Untagged data
8.8. Dynamic typing
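A minimal sketch tying several of these points together: a schema parsed at runtime (no code generation), a generic record (dynamic typing), and an untagged binary encoding whose structure comes entirely from the schema. The User schema is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Generic records need no generated classes (optional codegen).
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // The binary encoding carries no field tags; readers resolve the
        // bytes against a (possibly evolved) schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        System.out.println(out.size() + " bytes");  // compact, untagged payload
    }
}
```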
9. Mahout
9.1. Machine learning
9.2. Algorithms implemented on MapReduce
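A minimal sketch using Mahout's Taste recommender API; this variant runs on a single machine, while Mahout also provides MapReduce implementations of many algorithms for cluster-scale data. The ratings.csv file (userID,itemID,rating per line) is hypothetical.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity sim = new PearsonCorrelationSimilarity(model);
        UserNeighborhood nbh = new NearestNUserNeighborhood(10, sim, model);
        Recommender rec = new GenericUserBasedRecommender(model, nbh, sim);
        // Top three item recommendations for user 1.
        List<RecommendedItem> items = rec.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```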
10. Sqoop
10.1. Connects non-Hadoop stores (RDBMS)
10.2. Moves data between RDBMSs and Hadoop
10.3. Auto-generates Java InputFormat code for data access
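A minimal sketch of embedding a Sqoop 1 import in Java, mirroring the `sqoop import` command line; the JDBC URL, credentials, table, and target directory are assumptions.

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Pulls the orders table into HDFS with four parallel map tasks.
        int exit = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://dbhost/sales",
            "--username", "etl",
            "--table", "orders",
            "--target-dir", "/user/etl/orders",
            "--num-mappers", "4"
        });
        System.exit(exit);
    }
}
```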
11. MapReduce
11.1. Distributed compute
11.2. Maps a query across the nodes holding the data
11.3. Reduces the intermediate results into the final answer
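The canonical word-count job, as a sketch of the map/reduce split: mappers emit (word, 1) pairs on the nodes holding the input splits, and reducers aggregate them into the final counts. Note the reducer doubling as a combiner, the optimization mentioned under Hive above.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregates map output
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```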
12. Ambari
12.1. Cluster deployment and admin
12.2. Driven by Hortonworks
13. ZooKeeper
13.1. Coordinator of shared state between apps
13.2. Naming, configuration, and synchronization services
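A minimal sketch of using ZooKeeper for named, shared configuration; the ensemble address and the /app/config znode are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Simplified: releases the latch on the first event, normally SyncConnected.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000,
                event -> connected.countDown());
        connected.await();

        // Named, shared state: any client in the cluster can read this znode.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app/config", "maxConns=10".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/app/config", false, null)));
        zk.close();
    }
}
```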
14. YARN
14.1. Cluster management
14.2. Introduced in Hadoop 2
14.3. Resource manager
14.4. Job scheduler
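A minimal sketch of asking the ResourceManager for its running applications through the YarnClient API; it assumes yarn-site.xml is on the classpath.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnListApps {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());  // reads yarn-site.xml
        yarn.start();
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getName()
                    + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```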
15. BigTop
15.1. Packages the Hadoop ecosystem
15.2. Tests the packaged Hadoop ecosystem
16. Related Apache Ecosystems
17. HDFS
17.1. Distributed storage
17.2. Java-based filesystem
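A minimal sketch of writing and reading a file through the Java FileSystem API; the path is hypothetical, and fs.defaultFS from core-site.xml determines which cluster it hits.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");  // blocks are replicated across datanodes
        }
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
    }
}
```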
18. Spark
19. Impala
19.1. SQL query engine
19.2. Queries data stored in HDFS and HBase
19.3. Real-time (low-latency) queries
20. Cascading
20.1. Higher-level abstraction over MapReduce
20.2. Creates Flows that assemble into MapReduce jobs (see the sketch below)
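A rough sketch of a Cascading (2.x-style) word count: taps for source and sink, one pipe assembly, and a Flow that the planner compiles into MapReduce jobs. The input and output paths are hypothetical.

```java
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "input/docs");
        Tap sink = new Hfs(new TextDelimited(true, "\t"), "output/wordcount");

        // One logical assembly; no hand-written map or reduce code.
        Pipe assembly = new Pipe("wordcount");
        assembly = new Each(assembly, new Fields("line"),
                new RegexGenerator(new Fields("word"), "\\S+"));
        assembly = new GroupBy(assembly, new Fields("word"));
        assembly = new Every(assembly, new Count(new Fields("count")));

        Flow flow = new HadoopFlowConnector(new Properties())
                .connect("word-count", source, sink, assembly);
        flow.complete();  // plans and runs the underlying MapReduce jobs
    }
}
```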