Insights on Oracle & Tech: Kafka: Knowing the Basics

When learning a new software system, it is best to start with a high-level view of it.

In this article, we introduce the basics of Apache Kafka (written in Scala; does not use JMS). From here, you can go on to explore topics such as how to configure Kafka components and how to monitor Kafka performance metrics.

What is Kafka?

In the last few years, there has been significant growth in the adoption of Apache Kafka. Current users of Kafka include Uber, Twitter, Netflix, LinkedIn, Yahoo, Cisco, Goldman Sachs, etc.^[12]

Kafka is a message bus that achieves a high level of parallelism.

It also decouples data producers from data consumers, which makes its architecture more flexible and adaptable to change.

Key Concepts

Kafka is a distributed messaging system providing fast, highly scalable and redundant messaging through a pub-sub model. It is organized around a few key terms:

Topics
Producers
Consumers
Messages
Brokers

All Kafka components communicate through a simple, high-performance binary API over TCP.


Topics

Kafka maintains feeds of messages in categories called topics.

All Kafka messages are organized into topics.

Cluster

As a distributed system, Kafka runs in a cluster.
Each node in the cluster is called a Kafka broker.

Each broker holds a number of partitions, and for each of these partitions the broker acts as either the leader or a replica (follower).
Load is balanced across brokers by partition.
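To make the leader/replica split concrete, here is a minimal Python sketch of how the partitions of one topic might be spread over three brokers. The function name `assign_replicas` and the simple round-robin placement are assumptions made for illustration; Kafka's real assignment logic is more involved.

```python
def assign_replicas(broker_ids, num_partitions, replication_factor):
    """Return {partition: [broker, ...]}; the first broker listed acts as leader.

    Hypothetical round-robin placement, for illustration only.
    """
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    n = len(broker_ids)
    assignment = {}
    for p in range(num_partitions):
        # Start at a different broker for each partition, then take the next ones.
        assignment[p] = [broker_ids[(p + r) % n] for r in range(replication_factor)]
    return assignment


layout = assign_replicas(broker_ids=[0, 1, 2], num_partitions=3, replication_factor=2)
for partition, replicas in layout.items():
    print(f"partition {partition}: leader=broker {replicas[0]}, followers={replicas[1:]}")
```

In this toy layout every broker leads one partition and follows another, which is the sense in which load is balanced by partition.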

Clients

Producers

Send (or push) messages to a specific topic

Consumers

Read (or pull) messages from a specific topic

Messages

Each specific message in a Kafka cluster can be uniquely identified by a tuple consisting of the message’s

Topic
Partition

Topics are broken up into ordered commit logs called partitions.

Offset (within the partition)
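The (topic, partition, offset) tuple can be illustrated with a small in-memory model. This is only a sketch: `PartitionedLog` and `MessageId` are hypothetical names, not part of any Kafka client API, and the "log" is just a Python list per partition.

```python
from collections import defaultdict, namedtuple

MessageId = namedtuple("MessageId", ["topic", "partition", "offset"])


class PartitionedLog:
    """Toy in-memory stand-in for a cluster's logs (not a Kafka API)."""

    def __init__(self):
        # (topic, partition) -> append-only list of message payloads
        self._logs = defaultdict(list)

    def append(self, topic, partition, payload):
        log = self._logs[(topic, partition)]
        log.append(payload)
        # The offset is the message's position within its partition.
        return MessageId(topic, partition, len(log) - 1)

    def read(self, message_id):
        key = (message_id.topic, message_id.partition)
        return self._logs[key][message_id.offset]


log = PartitionedLog()
first = log.append("clicks", 0, b"a")
second = log.append("clicks", 0, b"b")
other = log.append("clicks", 1, b"c")
print(first, second, other)
```

Note that offsets restart in each partition: `first` and `other` both have offset 0, and only the full tuple tells them apart.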

Kafka offers 4 guarantees about data consistency and availability:^[3]

Messages sent to a topic partition will be appended to the commit log in the order they are sent,
A single consumer instance will see messages in the order they appear in the log,
A message is ‘committed’ when all in-sync replicas have applied it to their log, and
Any committed message will not be lost, as long as at least one in-sync replica is alive.

These guarantees hold as long as you are producing to one partition and consuming from one partition. All guarantees are off if you are reading from the same partition using two consumers or writing to the same partition using two producers. Finally, message ordering is preserved for each partition, but not the entire topic.
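The per-partition ordering rule can be seen in a short simulation. The sketch below assumes messages are routed to a partition by hashing a key; `zlib.crc32` is a stand-in chosen for illustration, since real Kafka clients ship their own partitioner, and all names here are hypothetical.

```python
import zlib

NUM_PARTITIONS = 2


def partition_for(key):
    # Stand-in hash; real Kafka clients use their own partitioner.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def sequence_for(key, partitions):
    # Sequence numbers for one key, in the order its partition stored them.
    return [seq for k, seq in partitions[partition_for(key)] if k == key]


partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2), ("user-a", 3)]
for key, seq in events:
    # Each message is appended to exactly one partition's log.
    partitions[partition_for(key)].append((key, seq))

for index, partition_log in enumerate(partitions):
    print(index, partition_log)
print(sequence_for("user-a", partitions), sequence_for("user-b", partitions))
```

Because each key always maps to the same partition, its messages keep their send order. A consumer reading several partitions of the topic, however, has no defined order between messages that landed in different partitions.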

Zookeeper

Kafka servers require Zookeeper. Brokers, producers, and consumers use Zookeeper to manage and share state.

However, the way Zookeeper is used differs between Kafka 0.8 and Kafka 0.9.^[11] So, the first thing you need to know is which version of Kafka you are running. To find out the version of Kafka, run:

cd "$KAFKA_HOME"
find ./libs/ -name '*kafka_*' | head -1 | grep -o 'kafka_.*'

For example, if the above command line prints:

kafka_2.10-0.9.0.2.4.2.0-258-javadoc.jar

then the following versions are installed:

Scala version: 2.10
Kafka version: 0.9.0.2.4.2.0-258
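The same split can be done programmatically. The following Python sketch parses a jar file name of the form shown above; the function name `parse_kafka_jar` and the list of trailing classifiers (such as `-javadoc`) are assumptions for illustration, so extend the pattern if your distribution names its jars differently.

```python
import re

# Optional trailing classifiers; this list is an assumption, extend as needed.
_CLASSIFIERS = r"(?:-(?:javadoc|sources|scaladoc|test))?"
_JAR_NAME = re.compile(r"kafka_(\d+\.\d+(?:\.\d+)?)-(.+?)" + _CLASSIFIERS + r"\.jar$")


def parse_kafka_jar(filename):
    """Return (scala_version, kafka_version), or None if the name does not match."""
    match = _JAR_NAME.search(filename)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_kafka_jar("kafka_2.10-0.9.0.2.4.2.0-258-javadoc.jar"))
```

For the example file name, this prints `('2.10', '0.9.0.2.4.2.0-258')`.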

In summary, Kafka uses Zookeeper for the following:^[5]

Electing a controller

The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away.
Zookeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
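The controller's job when a broker goes away can be sketched as follows. This is a simplified model, not Kafka's implementation: the data layout, the function name `reassign_leaders`, and the rule of promoting the first surviving in-sync replica are all assumptions made for illustration.

```python
def reassign_leaders(partitions, failed_broker):
    """Pick new leaders for partitions whose leader was on `failed_broker`.

    `partitions` maps partition id -> {"leader": broker, "isr": [brokers]}.
    Hypothetical data layout; returns the partitions left without a leader.
    """
    offline = []
    for pid, state in partitions.items():
        # The failed broker can no longer count as an in-sync replica.
        state["isr"] = [b for b in state["isr"] if b != failed_broker]
        if state["leader"] == failed_broker:
            if state["isr"]:
                # Promote a surviving in-sync replica to leader.
                state["leader"] = state["isr"][0]
            else:
                state["leader"] = None
                offline.append(pid)
    return offline


cluster = {
    0: {"leader": 1, "isr": [1, 2]},
    1: {"leader": 2, "isr": [2, 3]},
    2: {"leader": 1, "isr": [1]},
}
offline = reassign_leaders(cluster, failed_broker=1)
print(cluster, offline)
```

Partition 2 in this example had no other in-sync replica, so it is left without a leader, which mirrors the guarantee that committed messages survive only while at least one in-sync replica is alive.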

Cluster membership

Tells which brokers are alive and are still part of the cluster

Topic configuration

Tells which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic

(0.9.0)

Quotas - how much data each client is allowed to read and write
ACLs - who is allowed to read from and write to which topic

(old high level consumer)

Tells which consumer groups exist, who their members are, and what the latest offset is that each group got from each partition.
This functionality is going away.