Tuesday, March 14, 2017

Apache YARN: Knowing the Basics

Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads, such as batch processing (MapReduce), interactive processing (Hive, Tez, Spark) and real-time processing (Storm). These applications can all co-exist on YARN and share a single data center cost-effectively, while the platform takes care of resource management, isolation and multi-tenancy.

In this article, we will use Apache Hadoop YARN provided by Hortonworks in the discussion.

Apache Hadoop YARN Platform
Figure 1. Apache Hadoop YARN (Hortonworks)

Apache Hadoop YARN


HDFS and YARN form the data management layer of Apache Hadoop. As the architectural center of Hadoop, YARN enhances a Hadoop compute cluster in the following ways:[1]
  • Multi-tenancy
  • Cluster utilization
  • Scalability
  • Compatibility
For example, YARN takes into account all the available compute resources on each machine in the cluster. Based on the available resources, YARN will negotiate resource requests from applications (such as MapReduce or Spark) running in the cluster. YARN then provides processing capacity to each application by allocating Containers.
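
To make the allocation idea concrete, here is a minimal sketch in Python. It is a toy model rather than YARN code: the `Node` class, the `allocate` function, the first-fit placement rule and the node sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Free capacity on one machine (hypothetical toy model, not a YARN class)."""
    name: str
    free_mem_mb: int
    free_vcores: int


def allocate(nodes, mem_mb, vcores):
    """Grant a container on the first node that can hold the request.

    Returns the name of the chosen node, or None when no node has
    enough free memory and vcores. First-fit is an assumption here;
    real schedulers use richer placement rules.
    """
    for node in nodes:
        if node.free_mem_mb >= mem_mb and node.free_vcores >= vcores:
            # Reserve the resources so later requests see the reduced capacity.
            node.free_mem_mb -= mem_mb
            node.free_vcores -= vcores
            return node.name
    return None


cluster = [Node("node-1", 4096, 2), Node("node-2", 8192, 4)]
first = allocate(cluster, 3072, 2)    # fits on node-1
second = allocate(cluster, 3072, 2)   # node-1 is now too full, so node-2 is used
third = allocate(cluster, 16384, 1)   # larger than any single node, so not granted
```

The point of the sketch is only that every grant is checked against, and deducted from, the resources known to be free on a specific machine.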

YARN runs processes on a cluster (two or more hosts connected by a high-speed local network), much as an operating system runs processes on a standalone computer. Here we will present YARN from two perspectives:
  • Resource management
  • Process management
Figure 2. Basic Entities in YARN

Resource Management


In a Hadoop cluster, it’s vital to balance the usage of RAM, CPU and disk so that processing is not constrained by any one of these cluster resources.  YARN is the foundation of the new generation of Hadoop.

In its design, YARN splits the responsibilities of the old JobTracker and TaskTracker (see the diagram in [2]) into separate entities:
  • A global ResourceManager (instead of a cluster manager)
  • A per-application master process, the ApplicationMaster (instead of a dedicated and short-lived JobTracker)
    • Started on a container by the ResourceManager's launcher
  • A per-node worker process, the NodeManager (instead of the TaskTracker)
  • Per-application Containers
    • Controlled by the NodeManagers
    • Can run different kinds of tasks (including ApplicationMasters)
    • Can be configured to have different sizes (e.g., RAM, CPU)

The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

Containers are an important YARN concept. You can think of a container as a request to hold resources on the YARN cluster. Currently, a container hold request consists of vcores and memory, as shown in Figure 3 (left). Once a hold has been granted on a node, the NodeManager launches a process called a task, which runs in the container. The right side of Figure 3 shows the task running as a process inside a container.

Figure 3.  Resource-to-Process Mapping from Container's Perspective
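
Because a container request has two dimensions (memory and vcores), the number of containers a node can hold is capped by whichever dimension runs out first. The sketch below illustrates this. The `Resource` class, the function names and the node size (64 GB, 16 vcores) are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A (memory, vcores) pair: the two dimensions of a container request."""
    memory_mb: int
    vcores: int


def containers_that_fit(node: Resource, request: Resource) -> int:
    """How many identical containers one node can hold at the same time."""
    return min(node.memory_mb // request.memory_mb,
               node.vcores // request.vcores)


def limiting_resource(node: Resource, request: Resource) -> str:
    """Name the dimension that runs out first for this container size."""
    by_memory = node.memory_mb // request.memory_mb
    by_vcores = node.vcores // request.vcores
    if by_memory < by_vcores:
        return "memory"
    if by_vcores < by_memory:
        return "vcores"
    return "balanced"


node = Resource(memory_mb=65536, vcores=16)   # hypothetical worker node
small = Resource(memory_mb=2048, vcores=1)    # memory allows 32, vcores allow 16
large = Resource(memory_mb=8192, vcores=1)    # memory allows 8, vcores allow 16

small_count = containers_that_fit(node, small)
large_count = containers_that_fit(node, large)
small_limit = limiting_resource(node, small)
large_limit = limiting_resource(node, large)
```

With the small containers the node runs out of vcores while half its memory sits idle; with the large ones memory is exhausted first. This is the kind of imbalance that container sizing is meant to avoid.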

Process Management


Let us see how YARN manages its processes/tasks from two perspectives:
  • YARN scheduler
    • YARN currently ships with three schedulers:
      • FIFO, Capacity and Fair
  • YARN model of computation
    • We will use Apache Spark to illustrate how a Spark job fits into the YARN model of computation
Figure 4.  Application Submission in YARN

YARN Scheduler[6]


The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application.
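
As a rough illustration of how scheduling policy changes the outcome, the sketch below compares a FIFO-style policy with a fair-share-style policy over a fixed pool of identical containers. It is a deliberately simplified model and not the logic of the real YARN schedulers, which also weigh queues, user limits, memory, vcores and data locality. The application names and demands are made up.

```python
def fifo_allocate(demands, capacity):
    """Serve applications strictly in submission order."""
    grants = {}
    for app, wanted in demands.items():
        granted = min(wanted, capacity)
        grants[app] = granted
        capacity -= granted
    return grants


def fair_allocate(demands, capacity):
    """Hand out one container at a time, round-robin, until either
    the capacity or every application's demand is exhausted."""
    grants = {app: 0 for app in demands}
    while capacity > 0:
        progressed = False
        for app, wanted in demands.items():
            if capacity > 0 and grants[app] < wanted:
                grants[app] += 1
                capacity -= 1
                progressed = True
        if not progressed:
            break  # every application already has all it asked for
    return grants


# Containers wanted per application, in submission order (hypothetical values).
demands = {"etl": 8, "adhoc": 2, "report": 4}

fifo_result = fifo_allocate(demands, capacity=10)
fair_result = fair_allocate(demands, capacity=10)
```

Under the FIFO-style policy the first application takes most of the cluster and the last one waits; under the fair-share-style policy every application makes progress at once.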

An application (e.g., a Spark application) is a YARN client program that is made up of one or more tasks. For each running application, a special piece of code called an ApplicationMaster helps coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the application starts.

Here is the sequence of events when an application starts (see Figure 4):
  1. The application starts and talks to the ResourceManager for the cluster
    • The ResourceManager makes a single container request on behalf of the application
  2. The ApplicationMaster starts running within that container
  3. The ApplicationMaster requests subsequent containers from the ResourceManager, which are allocated to run tasks for the application
  4. The ApplicationMaster launches tasks in those containers and coordinates the execution of all tasks within the application
    • Those tasks do most of the status communication with the ApplicationMaster
  5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated from the cluster.
  6. The application client exits
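
The six steps above can be traced with a small simulation. This is only a sketch of the ordering of events. The function name, the container IDs and the message strings are invented for illustration, and no real YARN API is involved.

```python
def run_application(num_tasks):
    """Return (events, containers still held) for a toy application run."""
    events = []
    containers = []

    # Step 1: the client contacts the ResourceManager, which requests
    # a single container on behalf of the application.
    events.append("client: submit application to ResourceManager")
    containers.append("container-0")
    events.append("ResourceManager: allocate container-0 for ApplicationMaster")

    # Step 2: the ApplicationMaster starts inside that first container.
    events.append("ApplicationMaster: start in container-0")

    # Steps 3 and 4: the ApplicationMaster asks for more containers
    # and launches one task in each.
    for i in range(1, num_tasks + 1):
        cid = f"container-{i}"
        containers.append(cid)
        events.append(f"ApplicationMaster: request {cid} from ResourceManager")
        events.append(f"ApplicationMaster: launch task-{i} in {cid}")

    # Tasks report back to the ApplicationMaster and give up their containers.
    for i in range(1, num_tasks + 1):
        containers.remove(f"container-{i}")
        events.append(f"task-{i}: report completion to ApplicationMaster")

    # Step 5: the ApplicationMaster exits and the last container is released.
    containers.remove("container-0")
    events.append("ApplicationMaster: exit, container-0 released")

    # Step 6: the client exits.
    events.append("client: exit")
    return events, containers


events, leftover = run_application(num_tasks=2)
for line in events:
    print(line)
```

Note that the ApplicationMaster's own container is the first one allocated and the last one released, which matches steps 1 and 5.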

Figure 5.  A Spark Job Consists of Multiple Tasks



YARN Model of Computation


In the Spark paradigm, an application consists of Map tasks, Reduce tasks, etc.[7] Spark tasks align very cleanly with YARN tasks.

At the process level, a Spark application is run as:
  • A single driver process 
    • Manages the job flow and schedules tasks and is available the entire time the application is running
    • Could be the same as the client process (i.e., YARN-client mode) or run in the cluster (i.e., YARN-cluster mode; see Figure 6)
  • A set of executor processes 
    • Scattered across nodes on the cluster (see Figure 5)
    • Are responsible for executing the work in the form of tasks, as well as for storing any data that the user chooses to cache
    • Multiple tasks can run within the same executor
Deploying these processes on the cluster is up to the cluster manager in use (YARN in this article), but the driver and executor themselves exist in every Spark application.
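
The driver/executor layout can be summarized in a few lines of Python. The sketch rests on two assumptions that come from my understanding of Spark on YARN rather than from the text above: each executor occupies one container, with one further container for the ApplicationMaster in both modes, and each task uses one core unless configured otherwise. The class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SparkOnYarnLayout:
    """Toy description of where a Spark application's processes run."""
    num_executors: int
    cores_per_executor: int
    deploy_mode: str  # "client" or "cluster"

    def driver_location(self) -> str:
        """Where the single driver process lives."""
        if self.deploy_mode == "cluster":
            # Assumption: in YARN-cluster mode the driver runs inside
            # the ApplicationMaster's container.
            return "ApplicationMaster container"
        if self.deploy_mode == "client":
            return "client process"
        raise ValueError(f"unknown deploy mode: {self.deploy_mode}")

    def yarn_containers(self) -> int:
        """Assumption: one container per executor, plus one for the ApplicationMaster."""
        return self.num_executors + 1

    def concurrent_tasks(self, cpus_per_task: int = 1) -> int:
        """Tasks that can run at once, since several tasks share an executor."""
        return self.num_executors * (self.cores_per_executor // cpus_per_task)


layout = SparkOnYarnLayout(num_executors=4, cores_per_executor=2, deploy_mode="cluster")
where = layout.driver_location()
containers = layout.yarn_containers()
slots = layout.concurrent_tasks()
```

Under these assumptions, four executors with two cores each give eight task slots but only five YARN containers, which shows that Spark tasks and YARN containers are not in a one-to-one relationship.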

Recap


Tying the two aspects (i.e., resource and process management) together, we conclude that:
  • A global ResourceManager runs as a master daemon, usually on a dedicated machine, that arbitrates the available cluster resources among various competing applications. 
    • The ResourceManager tracks how many live nodes and resources are available on the cluster and coordinates what applications submitted by users should get these resources and when. 
    • The ResourceManager is the single process that has the above information so it can make its allocation (or rather, scheduling) decisions in a shared, secure, and multi-tenant manner (for instance, according to an application priority, a queue capacity, ACLs, data locality, etc.)
  • Each ApplicationMaster (i.e., a per-application master process) has responsibility for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. 
  • The NodeManager is the per-node worker process, which is responsible for launching the applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager.
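
The reporting relationship between NodeManagers and the ResourceManager can be sketched as a simple heartbeat tracker. This is a toy model: the `LivenessTracker` class, its method names and the 10-second expiry are assumptions made for illustration, not YARN's actual classes or defaults.

```python
class LivenessTracker:
    """Toy stand-in for the ResourceManager's view of its NodeManagers."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}     # node name -> time of the latest heartbeat
        self.free_mem_mb = {}   # node name -> free memory in that heartbeat

    def heartbeat(self, node, now, free_mem_mb):
        """Record a status report sent by a NodeManager."""
        self.last_seen[node] = now
        self.free_mem_mb[node] = free_mem_mb

    def live_nodes(self, now):
        """Nodes whose latest heartbeat is recent enough to count as alive."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen <= self.expiry_seconds)

    def available_mem_mb(self, now):
        """Memory the scheduler may hand out: only live nodes count."""
        return sum(self.free_mem_mb[node] for node in self.live_nodes(now))


tracker = LivenessTracker(expiry_seconds=10)
tracker.heartbeat("node-1", now=0, free_mem_mb=4096)
tracker.heartbeat("node-2", now=0, free_mem_mb=8192)
tracker.heartbeat("node-1", now=8, free_mem_mb=2048)   # node-2 goes silent

live = tracker.live_nodes(now=12)
available = tracker.available_mem_mb(now=12)
```

Because a single process holds this view, a node that stops reporting simply drops out of the pool the scheduler allocates from.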
Figure 6 presents the consolidated view of how a Spark application is run in YARN-cluster mode:

Figure 6.  Spark in YARN-cluster mode[3]
