Apache Kafka for Beginners

Explore Apache Kafka in this comprehensive tutorial. Learn how to get started with Apache Kafka, its key features, its advantages, and its core components.

Introduction to Apache Kafka

Have you ever wondered how popular applications like Netflix or Uber process massive amounts of real-time data seamlessly? The answer is Apache Kafka, a robust distributed streaming platform.

Apache Kafka is a popular distributed event streaming platform designed to efficiently manage real-time data feeds. It was originally developed at LinkedIn and open-sourced in 2011, later becoming a top-level Apache project. Apache Kafka has gained immense popularity due to its impressive features, including scalability, fault tolerance, and high performance.

In this tutorial, we’ll discuss how to get started with Apache Kafka, the advantages it offers, and its core components.

Why use Kafka?

Apache Kafka offers several key advantages, including:

  • Real-Time Data Processing: Kafka allows for the processing of real-time data streams, enabling businesses to make decisions quickly.
  • Scalability: Kafka is highly scalable and can manage a large volume of data without impacting performance.
  • Fault Tolerance: Kafka is fault-tolerant, ensuring that data is not lost even in case of hardware failure.
  • High Throughput: Kafka can process a large amount of data with low latency, making it suitable for applications that require real-time processing.

Core components of Apache Kafka

Apache Kafka consists of several core components that work together to process and store data:

  • Kafka Clusters: Kafka clusters are distributed systems that consist of multiple Kafka brokers working together to handle and process real-time data streams.
  • Brokers: Brokers are the core of the Kafka cluster. They receive messages from producers, store them in partitions, and deliver them to consumers.
  • Topics: Topics are the channels through which data is organized and categorized. They can be divided into multiple partitions for better scalability and performance.
  • Partitions: Partitions are the fundamental unit of data storage in Kafka. A topic's partitions are distributed across the brokers in the cluster.
  • Producers: Producers help in publishing data to Kafka topics. They send messages to specific topics within the Kafka cluster.
  • Consumers: Consumers subscribe to topics and receive messages from them. They can process the received messages, store them, or perform other actions.
  • ZooKeeper: ZooKeeper acts as a distributed coordination service that manages the Kafka cluster. It stores metadata about the cluster, such as the list of brokers, topics, and partitions.
  • Offsets: Offsets are unique identifiers that represent the position of a message within a specific partition of a topic. They are crucial for tracking the progress of consumers within a topic.

In current versions of Apache Kafka, ZooKeeper is no longer a mandatory component. Starting with version 2.8.0, Kafka can instead run in KRaft mode: KRaft (Kafka Raft) is a consensus protocol that manages cluster metadata within Kafka itself, providing a more integrated and efficient architecture. KRaft became production-ready in Kafka 3.3, and ZooKeeper support was removed entirely in Kafka 4.0.
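
To see what this looks like in practice, here is roughly how a single-node KRaft broker (no ZooKeeper) is started on Linux/macOS with a Kafka 3.x release, where the sample KRaft configuration ships under config/kraft/ (exact paths may differ between versions):

bin/kafka-storage.sh random-uuid
bin/kafka-storage.sh format -t <uuid-from-previous-command> -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties

The first command generates a cluster ID, the second formats the storage directories with it, and the third starts the broker from the KRaft configuration.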

Now that we are familiar with the components that Apache Kafka includes, let’s discuss how data flows across these components.

How data flows across Apache Kafka components

Apache Kafka utilizes a publish-subscribe messaging model for managing data streams, making it a popular choice for stream processing and event-driven architectures.

Here is a diagram that shows how data flows across Kafka components in this model:

[Diagram: how data flows across Kafka components]

In this model:

  • Producers first send messages to specific topics within a Kafka cluster.
  • These topics are then split into partitions and distributed across the brokers within the cluster.
  • Consumers, part of a consumer group, subscribe to these topics and read messages from their partitions delivered by the brokers.
  • Each consumer tracks its progress using offsets, so it can resume where it left off without reprocessing messages (exactly-once processing requires additional configuration beyond the defaults; see the example command after this list).
  • ZooKeeper (or KRaft in newer versions) coordinates this entire process, ensuring a smooth and reliable data flow operation in the Apache Kafka ecosystem.
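
For example, once a cluster is up, you can inspect a consumer group's offsets with the kafka-consumer-groups tool (Linux/macOS form shown; my-group is a placeholder group name):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group

For each partition, the output shows the group's current offset, the partition's log-end offset, and the lag between the two.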

Next, let’s learn how to get started with Apache Kafka.

How to get started with Apache Kafka

The process of setting up Apache Kafka is relatively straightforward.

Before setting up Apache Kafka, install the following tools, which it needs to run correctly:

  • WSL (only required on Windows, since Linux and macOS already provide a Unix-style terminal)
  • Java

To install WSL (Windows Subsystem for Linux), open Command Prompt or PowerShell as an administrator and run the following command (a restart may be required afterwards):

wsl --install
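
After the restart, you can confirm that WSL is working and list the installed Linux distributions with:

wsl --list --verbose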

To install Java, go to the official Java download page and download the installer for the latest version.

After the download is complete, open the installer and go through the on-screen prompts to install Java on your machine.
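
To verify that Java is installed and available on your system's PATH, run the following in a terminal:

java -version

This prints the installed Java version. The minimum Java version Kafka needs depends on the Kafka release, so check the release notes for the version you download.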

Once you are done installing the tools, follow these step-by-step instructions to set up Apache Kafka on your machine:

Download Kafka: Go to the official downloads page for Apache Kafka and download the latest version as an archive (a file used for storing multiple files/folders) from the Binary downloads section.

After downloading Kafka, go to the directory where the archive was downloaded. Then, extract the contents of the archive into a new folder and rename that folder to kafka.

Next, open the command prompt / PowerShell on your machine and run the following command to navigate to the newly created folder kafka:

cd kafka

Start ZooKeeper: Execute the following command to start ZooKeeper, which comes bundled with Apache Kafka releases up to the 3.x series (Kafka 4.0 and later ship without ZooKeeper and run only in KRaft mode):

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

On Linux/macOS, run this command in the terminal instead:

bin/zookeeper-server-start.sh config/zookeeper.properties

Start Apache Kafka: Start the Kafka broker by running this command in a new command prompt / PowerShell window:

.\bin\windows\kafka-server-start.bat .\config\server.properties

On Linux/macOS, the command would be:

bin/kafka-server-start.sh config/server.properties
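
To quickly confirm that the broker is up and accepting connections, you can ask it which API versions it supports from another command prompt / PowerShell window:

.\bin\windows\kafka-broker-api-versions.bat --bootstrap-server localhost:9092

On Linux/macOS:

bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092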

Create a Topic: Execute the following command in a new command prompt / PowerShell instance to create a topic named MyTopic:

.\bin\windows\kafka-topics.bat --create --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --create --topic MyTopic --bootstrap-server localhost:9092

In this command:

  • --bootstrap-server: Specifies the Kafka broker (server) to connect to. Apache Kafka requires a broker to manage the topic and perform operations like publishing or consuming messages.
  • localhost:9092: The address of the Kafka broker. localhost indicates that the broker is running on the same machine where the command is executed, and 9092 is the default port Kafka listens on for incoming connections. If the broker runs on a different machine or port, update this value accordingly.
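
The create command also accepts optional flags. For instance, if you were creating the topic from scratch, you could set the partition count and replication factor explicitly instead of accepting the broker defaults (the values below are just examples):

.\bin\windows\kafka-topics.bat --create --topic MyTopic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

On Linux/macOS:

bin/kafka-topics.sh --create --topic MyTopic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

Note that on a single-broker setup like this one, the replication factor cannot exceed 1.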

Start Kafka Producer: Start the Kafka console producer by running another command in a new command prompt / PowerShell instance:

.\bin\windows\kafka-console-producer.bat --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-console-producer.sh --topic MyTopic --bootstrap-server localhost:9092

Start Kafka Consumer: Execute another command in a new command prompt / PowerShell instance to start the Kafka console consumer:

.\bin\windows\kafka-console-consumer.bat --topic MyTopic --from-beginning --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-console-consumer.sh --topic MyTopic --from-beginning --bootstrap-server localhost:9092

Testing: In the producer window, type a message at the > prompt and press Enter:

This is a message.

After you press Enter, the consumer window reads and prints the same message:

This is a message.

And there it is - you have produced and consumed your first Kafka message!

In the next section, we will cover some basic operations you can perform on topics.

Performing basic operations in Apache Kafka

In this section, you will learn how to perform basic operations in Apache Kafka, such as listing topics, creating partitions, and deleting topics.

Firstly, to list the topics on the Kafka server, run this command in the command prompt / PowerShell on Windows (older tutorials use a --zookeeper flag here, but it was removed in Kafka 3.0 in favor of --bootstrap-server):

.\bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Here is the output:

MyTopic
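
To see more detail about a topic, such as its partition count and replica assignments, use the --describe flag:

.\bin\windows\kafka-topics.bat --describe --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --describe --topic MyTopic --bootstrap-server localhost:9092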

Next, to increase the number of partitions in a topic (MyTopic in this case), run this command in the command prompt / PowerShell:

.\bin\windows\kafka-topics.bat --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic

On Linux/macOS, the command would be:

bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic

This command expands MyTopic to 3 partitions. Note that Kafka only allows the partition count to be increased, never decreased.

Finally, to delete a topic (MyTopic in this case) from the Kafka server, run this command in the command prompt / PowerShell:

.\bin\windows\kafka-topics.bat --bootstrap-server localhost:9092 --delete --topic MyTopic

On Linux/macOS, the command would be:

bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic MyTopic

Now, if you list the topics on the Kafka server again, the command will output nothing, because you have deleted the only topic (MyTopic) on the server. (Topic deletion works only if the broker setting delete.topic.enable is true, which is the default.)

Applications of Apache Kafka

Here are a few real-world applications of Apache Kafka:

  • Data Ingestion: One of the most common use cases of Apache Kafka is data ingestion. Companies use Kafka to collect and aggregate large volumes of data from various sources in real-time.
  • Real-Time Analytics: Kafka can be used to stream data for real-time analytics. By processing and analyzing data, companies can make quicker and more informed decisions.
  • Log Aggregation: Kafka is often used for log aggregation, allowing companies to centralize their log data from multiple sources and easily analyze and monitor system logs.
  • Metrics Monitoring: Kafka is also used for monitoring metrics in real-time. Companies can stream metrics data to Kafka and use it to monitor the health and performance of their systems.

Conclusion

In this tutorial, we learned about Apache Kafka and its features, its advantages, its core components, and the step-by-step process of getting started with it.

Apache Kafka is a powerful platform used by organizations worldwide for real-time data processing and analytics. Kafka can also be deployed in cloud environments, for example via Confluent Cloud or on Kubernetes, to handle tasks like data ingestion, replication, and integration. By mastering Apache Kafka, you can take your data processing abilities to the next level.

If you want to learn more about data analysis in Python, check out the Analyze Data with Python course on Codecademy.

Author

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
