Tuesday, June 21, 2016

Kafka: Knowing the Basics

When learning a new piece of software or a new system, it is best to start with a high-level view of it.

In this article, we will introduce you to the basics of Apache Kafka (written in Scala; does not use JMS). From here, you may continue to explore, say, how to configure Kafka components, how to monitor Kafka performance metrics, etc.



What is Kafka?


In the last few years, there has been significant growth in the adoption of Apache Kafka. Current users of Kafka include Uber, Twitter, Netflix, LinkedIn, Yahoo, Cisco, Goldman Sachs, etc.[12]

Kafka is a message bus that achieves a high level of parallelism. It also decouples data producers from data consumers, which makes its architecture more flexible and adaptable to change.

Key Concepts


Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a pub-sub model. It is organized around a few key terms:
  • Topics
  • Producers
  • Consumers
  • Messages
  • Brokers
Communication between all components of Kafka is done via a simple, high-performance binary API over TCP.
  • Kafka
    • Topics
      • Kafka maintains feeds of messages in categories called topics.
      • All Kafka messages are organized into topics.
    • Cluster
      • As a distributed system, Kafka runs in a cluster.
      • Each node in the cluster is called a Kafka broker.
        • Each broker holds a number of partitions, and each of these partitions can be either a leader or a replica for a topic.
        • Brokers load balance by partition.
    • Clients
      • Producers
        • Send (or push) messages to a specific topic
      • Consumers
        • Read (or pull) messages from a specific topic

    Messages


    Each specific message in a Kafka cluster can be uniquely identified by a tuple consisting of the message’s
    • Topic
    • Partition
      • Topics are broken up into ordered commit logs called partitions
    • Offset (within the partition)
    Kafka offers 4 guarantees about data consistency and availability:[3]
    1. Messages sent to a topic partition will be appended to the commit log in the order they are sent,
    2. A single consumer instance will see messages in the order they appear in the log,
    3. A message is ‘committed’ when all in-sync replicas have applied it to their log, and
    4. Any committed message will not be lost, as long as at least one in-sync replica is alive.
    These guarantees hold as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers. Finally, message ordering is preserved for each partition, but not the entire topic.
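
To make the (topic, partition, offset) addressing concrete, here is a minimal shell sketch. It is a toy model and not how Kafka is implemented: each partition is assumed to be a plain text file, an offset is assumed to be a zero-based line index, and the names LOG_DIR, append_message, and read_message are made up for this illustration.

```shell
# Toy model only: each partition is an append-only text file named
# <topic>-<partition>.log, and a message's offset is its zero-based line index.
LOG_DIR=$(mktemp -d)

# Append a message to one partition and print the offset it was assigned.
append_message() {   # usage: append_message TOPIC PARTITION MESSAGE
    log_file="$LOG_DIR/$1-$2.log"
    printf '%s\n' "$3" >> "$log_file"
    # The new message is the last line, so its offset is (line count - 1).
    echo $(( $(wc -l < "$log_file") - 1 ))
}

# Print the message stored at (topic, partition, offset).
read_message() {     # usage: read_message TOPIC PARTITION OFFSET
    sed -n "$(( $3 + 1 ))p" "$LOG_DIR/$1-$2.log"
}

append_message clicks 0 "first"             # prints 0
append_message clicks 0 "second"            # prints 1
append_message clicks 1 "other partition"   # prints 0: offsets are per partition
read_message clicks 0 1                     # prints second
```

Because every partition has its own file and its own offsets, order is defined only within a partition, which mirrors the point that ordering is preserved per partition but not across the whole topic.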

    ZooKeeper


    Kafka servers require ZooKeeper. Depending on the Kafka version, brokers and clients use ZooKeeper to manage and share state.

    However, the way ZooKeeper is used differs between Kafka 0.8 and Kafka 0.9.[11] So, the first thing you need to know is which version of Kafka you are running. To find out the version of Kafka, do:
    • cd $KAFKA_HOME
    • find ./libs/ -name '*kafka_*' | head -1 | grep -o 'kafka_.*'
    For example, if the above command line prints:
    • kafka_2.10-0.9.0.2.4.2.0-258-javadoc.jar

    It means that the following versions of products are installed:
    • Scala version
      • 2.10
    • Kafka version
      • 0.9.0.2.4.2.0-258
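
The same split can be done with plain shell parameter expansion. The sketch below assumes the jar name follows the kafka_<scala version>-<kafka version>[-classifier].jar pattern seen above; the function name parse_kafka_jar and the list of classifiers it strips are assumptions made for this example.

```shell
# Split a jar name of the assumed form kafka_<scala>-<kafka>[-classifier].jar
# into its Scala and Kafka version parts.
parse_kafka_jar() {   # usage: parse_kafka_jar JAR_PATH
    name=${1##*/}               # drop any leading directory
    name=${name%.jar}           # drop the .jar extension
    name=${name#kafka_}         # drop the kafka_ prefix
    scala_version=${name%%-*}   # text before the first hyphen
    kafka_version=${name#*-}    # text after the first hyphen
    # Drop a trailing classifier, if present (this list is an assumption).
    case $kafka_version in
        *-javadoc|*-sources|*-scaladoc|*-test)
            kafka_version=${kafka_version%-*} ;;
    esac
    echo "scala=$scala_version kafka=$kafka_version"
}

parse_kafka_jar kafka_2.10-0.9.0.2.4.2.0-258-javadoc.jar
# prints: scala=2.10 kafka=0.9.0.2.4.2.0-258
```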

    In summary, Kafka uses Zookeeper for the following:[5]
    1. Electing a controller
      • The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
      • When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away.
      • Zookeeper is used to elect a controller, make sure there is only one and elect a new one if it crashes.
    2. Cluster membership
      • Tells which brokers are alive and are still part of the cluster
    3. Topic configuration
      • Tells which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic
    4. Quotas and ACLs (0.9.0)
      • Quotas - how much data each client is allowed to read and write
      • ACLs - who is allowed to read and write to which topic
    5. Consumer group state (old high-level consumer)
      • Tells which consumer groups exist, who their members are, and what the latest offset is that each group got from each partition.
      • This functionality is going away
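
The controller election in item 1 can be pictured with a small shell analogy. This is only a sketch: a temporary file stands in for the controller entry that Kafka keeps in ZooKeeper, and ZK_DIR, try_become_controller, and controller_crashed are hypothetical names. In a real cluster the entry goes away on its own when the controller's ZooKeeper session ends; here it is removed by hand.

```shell
# Toy model only: a file in ZK_DIR stands in for the controller entry in ZooKeeper.
ZK_DIR=$(mktemp -d)

# Try to claim the controller role, then print the id of whoever holds it.
try_become_controller() {   # usage: try_become_controller BROKER_ID
    # With noclobber (set -C), ">" fails if the file already exists,
    # so only the first broker to get here can create it.
    ( set -C; echo "$1" > "$ZK_DIR/controller" ) 2>/dev/null || true
    cat "$ZK_DIR/controller"
}

# Simulate the current controller going away.
controller_crashed() {
    rm -f "$ZK_DIR/controller"
}

try_become_controller 1   # prints 1: broker 1 is elected
try_become_controller 2   # prints 1: the role is already taken
controller_crashed
try_become_controller 2   # prints 2: a new controller is elected
```

The point of the analogy is that the claim is a create-if-absent operation, so there is at most one controller at a time, and a new one can be chosen only after the old entry is gone.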

    Diagram Credit

    • www.michael-noll.com

    References

    1. Apache Kafka
    2. Configuration of Kafka
    3. Kafka in a Nutshell
    4. Why do Kafka consumers connect to zookeeper, and producers get metadata from brokers
    5. What is the actual role of ZooKeeper in Kafka?
    6. How to choose the number of topics/partitions in a Kafka cluster?
    7. Message Hub Kafka Java API
    8. Introduction to Apache Kafka
    9. Kafka Controller Redesign
    10. Log4j Appender 
    11. Apache Kafka 0.8 Basic Training (Michael G. Noll, Verisign)
        • ZooKeeper
          • v0.8: used by brokers and consumers, but not by producers
          • v0.9: used by brokers only
            • Consumers will use special topics instead of ZooKeeper
            • Will substantially reduce the load on ZooKeeper for large deployments
        • G1 Tuning (JDK 7u51 or later; slide 66)
          • java -Xms4g -Xmx4g -XX:PermSize=48m -XX:MaxPermSize=48m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
            • Note that PermGen has been removed in JDK 8
    12. The value of Apache Kafka in Big Data ecosystem
    13. Kafka Security Specific Features (06/03/2014)
    14. Kafka FAQ
    15. Kafka Operations
    16. Kafka System Tools
    17. Kafka Replication Tools
    18. Monitoring Kafka performance metrics
    19. Collecting Kafka performance metrics
    20. Spark Streaming + Kafka Integration Guide (Spark 1.6.1)

            Sunday, June 5, 2016

            How to Enable Software Collections (SCL) on RHEL

            A package is the basic unit used for the distribution, management, and update of software customized for a Linux system. Collections of packages stored in different software repositories (e.g., YUM repositories) can be accessed locally or over a network connection. Metadata is combined with the packages themselves to determine (and resolve, if possible) dependencies among the packages.

            In this article, we will look at the following entities/concepts in more detail:
            • RPM 
              • May refer to the .rpm file format, files in this format, software packaged in such files, or the package manager itself
              • RPM's dependency processing is based on knowing what capabilities are provided by a package and what capabilities a package requires.
              • It automatically determines which shared libraries a package requires.
            • YUM 
              • One of the several front-ends to RPM, which ease the process of obtaining and installing RPMs from repositories and help in resolving their dependencies.
              • YUM Repository
                • YUM repository metadata is structured as a series of XML files that organize packages and their associated package groups for installation.
                • To make a repository available over the network, you need to set up Apache, nginx, or some other web server and point it at the base directory of the repository.
            • Red Hat Software Collections
              • Is a prescribed set of content intended for use in Red Hat Enterprise Linux production environments

            RPM


            The RPM Package Manager (RPM) is a package management system that
            • Facilitates the distribution, management and update of software customized for a Linux system
              • Works with standard package management tools (e.g., Yum or PackageKit) to install, reinstall, remove, upgrade and verify RPM packages
            • Provides metadata to describe packages, installation instructions, and so on
              • Each RPM package includes metadata that describes the package's components, version, release, size, project URL, installation instructions, and so on
            • Separates source and binary packages
              • In source packages, you have the pristine sources along with any patches that were used, plus complete build instructions.
            • Allows you to use the database of installed packages to query and verify packages
            • Allows you to add your package to a Yum repository
            • Allows you to digitally sign your packages (e.g., using a GPG signing key)

            Red Hat Software Collections


            Red Hat Software Collections is a prescribed set of content intended for use in Red Hat Enterprise Linux production environments. Through Red Hat Software Collections, you can choose the runtime versions best suited for your projects, preserve application stability, and deploy your applications with confidence.

            Software Collections Functionality


            Software collections functionality (hereafter referred to simply as "Software Collections"), not to be confused with Red Hat Software Collections, has been available in earlier Red Hat Enterprise Linux distributions. Software Collections provide a structural definition, independent of the operating system, for applications or tools.

            With Software Collections, you can build and concurrently install multiple versions of the same software components on your system. Software Collections have no impact on the system versions of the packages installed by any of the conventional RPM package management utilities.

            To summarize, Software Collections have the following characteristics:
            • Do not overwrite system files
            • Are designed to avoid conflicts with system files
              • Software Collections make use of a special file system hierarchy to avoid possible conflicts between a single Software Collection and the base system installation.
            • Require no changes to the RPM package manager
            • Need only minor changes to the spec file
              • To convert a conventional package to a single Software Collection, you only need to make minor changes to the package spec file.
            • Allow you to build a conventional package and a Software Collection package with a single spec file
            • Uniquely name all included packages
            • Do not conflict with updated packages
            • Can depend on other Software Collections 
              • Because one Software Collection can depend on another, you can define multiple levels of dependencies.
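
The idea of installing several versions side by side without touching system files can be sketched in shell. Everything here is a stand-in: SCL_ROOT is a temporary directory playing the role of the collections' install prefix, mytool is a hypothetical program, and run_with_collection only imitates, roughly, what running a command with one collection enabled amounts to, namely putting that collection's own bin directory at the front of PATH.

```shell
# Toy model only: SCL_ROOT stands in for the directory that holds collections.
SCL_ROOT=$(mktemp -d)

# "Install" two versions of a hypothetical tool, each under its own root,
# so neither overwrites the other or anything in the system directories.
for version in 1.0 2.0; do
    bin_dir="$SCL_ROOT/mytool-$version/root/usr/bin"
    mkdir -p "$bin_dir"
    printf '#!/bin/sh\necho "mytool %s"\n' "$version" > "$bin_dir/mytool"
    chmod +x "$bin_dir/mytool"
done

# Run a command with one collection's bin directory placed first on PATH.
run_with_collection() {   # usage: run_with_collection COLLECTION COMMAND [ARGS...]
    collection=$1
    shift
    # The subshell keeps the PATH change from leaking into the caller.
    ( PATH="$SCL_ROOT/$collection/root/usr/bin:$PATH"; "$@" )
}

run_with_collection mytool-1.0 mytool   # prints: mytool 1.0
run_with_collection mytool-2.0 mytool   # prints: mytool 2.0
```

Both versions stay installed at the same time, and the caller's own environment is unchanged once each command returns.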


            Enabling and Building Software Collections


            To enable support for Software Collections on your system so that you can enable and build Software Collections, you need to have the scl-utils and scl-utils-build packages installed.
