Apache Kafka Tutorial: A Complete Guide for Beginners

What is Apache Kafka?

Apache Kafka is a well-known distributed event streaming platform designed to efficiently manage real-time data feeds. It was originally developed at LinkedIn and open-sourced in 2011. Apache Kafka has gained immense popularity due to its impressive features, including scalability, fault tolerance, and high performance.

At its core, Kafka provides three essential capabilities:

  • Allows applications to publish and subscribe to event streams
  • Stores those streams durably for as long as needed
  • Processes them in real time or in batch

This combination makes it much more than just a messaging system—it’s a foundational technology for building real-time data pipelines and event-driven applications.

Now that we know what Apache Kafka is and what it offers, let’s look at the core components that make up Kafka.

Core components of Apache Kafka

Apache Kafka consists of several core components that work together to process and store data:

  • Kafka clusters: A Kafka cluster is a distributed system made up of multiple Kafka brokers working together to handle and process real-time data streams.
  • Brokers: Brokers are the core servers of the Kafka cluster. They receive messages from producers, store them in partitions, and deliver them to consumers.
  • Topics: Topics are the channels through which data is organized and categorized. A topic can be divided into multiple partitions for better scalability and performance.
  • Partitions: Partitions are the fundamental unit of data storage in Kafka. A topic’s partitions are distributed across the brokers in the cluster.
  • Producers: Producers publish data to Kafka topics. They send messages to specific topics within the Kafka cluster.
  • Consumers: Consumers subscribe to topics and receive messages from them. They can process the received messages, store them, or perform other actions.
  • ZooKeeper: ZooKeeper acts as a distributed coordination service that manages the Kafka cluster. It stores metadata about the cluster, such as the list of brokers, topics, and partitions.
  • Offsets: Offsets are unique, sequential identifiers that mark the position of each message within a partition of a topic. They are crucial for tracking the progress of consumers within a topic.

In newer versions of Apache Kafka (2.8.0 and beyond), ZooKeeper is no longer a mandatory component, as Kafka is moving towards a ZooKeeper-less architecture built on KRaft. KRaft (Kafka Raft) is a consensus protocol that provides a more integrated and efficient way to manage metadata within the Kafka cluster itself.

With core components covered, let’s discuss how data flows across these Kafka components.

How data flows across Apache Kafka components

Apache Kafka utilizes a publish-subscribe messaging model for managing data streams, making it a popular choice for stream processing and event-driven architectures.

Here is a diagram that shows how data flows across Kafka components in this model:

[Diagram: data flow across Kafka components]

In this model:

  • Producers first send messages to specific topics within a Kafka cluster.
  • These topics are split into partitions, which are distributed across the brokers in the cluster.
  • Consumers, organized into consumer groups, subscribe to these topics and read messages from the partitions that the brokers deliver to them.
  • Each consumer tracks its progress through a partition using an offset, so it can resume where it left off instead of reprocessing old messages (see the sketch after this list).
  • ZooKeeper (or KRaft in newer versions) coordinates this entire process, ensuring a smooth and reliable flow of data through the Apache Kafka ecosystem.
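
To make this model concrete, here is a minimal consumer sketch that joins a consumer group and tracks offsets. It assumes the third-party kafka-python client (install it with pip install kafka-python), which is not part of Kafka itself, and a hypothetical group name:

from kafka import KafkaConsumer  # third-party client: pip install kafka-python

# Join the consumer group "demo-group" and subscribe to MyTopic.
# Kafka stores this group's offsets per partition, so a restarted
# consumer resumes where the group left off.
consumer = KafkaConsumer(
    "MyTopic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",          # hypothetical group name for illustration
    auto_offset_reset="earliest",   # start from the oldest message on the first run
)

for record in consumer:
    # Each record carries its partition and offset alongside the payload.
    print(record.partition, record.offset, record.value.decode("utf-8"))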

Next, let’s learn how to get started with Apache Kafka.

How to get started with Apache Kafka

Setting up Apache Kafka is relatively straightforward.

Before you begin, you need to install these tools, which are required for Apache Kafka to work correctly:

  • WSL (only required on Windows, since Linux and macOS already provide a suitable terminal)
  • Java

To install WSL (Windows Subsystem for Linux), open the command prompt / PowerShell on your machine and run this command:

wsl --install

To install Java, go to the official Java download page and download the installer for the latest version.

After the download is complete, open the installer and go through the on-screen prompts to install Java on your machine.
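
To confirm that Java was installed correctly, run this command in the command prompt / PowerShell or terminal:

java -version

If the installation succeeded, it prints the installed Java version.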

Once you are done installing the tools, follow these step-by-step instructions to set up Apache Kafka on your machine:

1. Download Kafka: Go to the official downloads page for Apache Kafka and download the latest version as an archive (a file used for storing multiple files/folders) from the Binary downloads section.

2. After downloading Kafka, go to the directory where the archive was downloaded. Then, extract the contents of the archive into a new folder and rename that folder to kafka.

3. Next, open the command prompt / PowerShell on your machine and run this command to navigate to the newly created folder kafka:

cd kafka

4. Start ZooKeeper: ZooKeeper comes bundled with Apache Kafka. Execute this command to start it on your machine:

.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

On Linux/macOS, run this command in the terminal instead:

bin/zookeeper-server-start.sh config/zookeeper.properties

5. Start Apache Kafka: Start Apache Kafka on your machine by running this command in a new command prompt / PowerShell window:

.\bin\windows\kafka-server-start.bat .\config\server.properties

On Linux/macOS, the command would be:

bin/kafka-server-start.sh config/server.properties
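
Alternatively, if you downloaded Kafka 3.x or later, you can skip ZooKeeper entirely and run the broker in KRaft mode. The exact file paths can vary between releases, but on Linux/macOS the general pattern from the official quickstart looks like this (the KAFKA_CLUSTER_ID variable name is just for illustration):

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties

The rest of this tutorial works the same either way, because the client commands below only talk to the broker.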

6. Create a topic: Execute this command in a new command prompt / PowerShell instance to create a topic named MyTopic:

.\bin\windows\kafka-topics.bat --create --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --create --topic MyTopic --bootstrap-server localhost:9092

In this command:

  • --bootstrap-server: Specifies the Kafka broker (server) to connect to (localhost:9092). Kafka requires a broker to manage the topic and perform operations like publishing or consuming messages.
  • localhost:9092: The address of the Kafka broker.
  • localhost: Indicates that the broker is running on the same machine where the command is executed.
  • 9092: The default port on which Kafka listens for incoming connections. If the broker is running on a different machine or port, update this value accordingly.
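
To verify that the topic was created, you can describe it:

.\bin\windows\kafka-topics.bat --describe --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --describe --topic MyTopic --bootstrap-server localhost:9092

This prints the topic’s partition count, replication factor, and the broker that leads each partition.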

7. Start Kafka producer: Start a Kafka console producer by running this command in a new command prompt / PowerShell instance:

.\bin\windows\kafka-console-producer.bat --topic MyTopic --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-console-producer.sh --topic MyTopic --bootstrap-server localhost:9092

8. Start Kafka consumer: Execute this command in a new command prompt / PowerShell instance to start a Kafka console consumer:

.\bin\windows\kafka-console-consumer.bat --topic MyTopic --from-beginning --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-console-consumer.sh --topic MyTopic --from-beginning --bootstrap-server localhost:9092

9. Testing: Enter a message in the Kafka producer and press Enter:

This is a message.

After hitting Enter, the Kafka consumer reads the message and prints it:

This is a message.

And there it is - you have produced and consumed your first message on your first Kafka topic.
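
The console tools are handy for testing, but applications usually produce messages programmatically. Here is a minimal producer sketch, again assuming the third-party kafka-python client rather than anything bundled with Kafka:

from kafka import KafkaProducer  # third-party client: pip install kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# send() is asynchronous; flush() blocks until the message is delivered.
producer.send("MyTopic", b"This is a message.")
producer.flush()

Run this while the console consumer from step 8 is open, and the message appears there just as it did when typed into the console producer.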

Now, it’s time to discuss some basic operations that we can perform in Apache Kafka.

Performing basic operations in Apache Kafka

In this section, you will learn how to perform basic operations in Apache Kafka, such as listing topics, creating partitions, and deleting topics.

Firstly, to list topics in the Kafka server, run this command in the command prompt / PowerShell on Windows:

.\bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092

On Linux/macOS, the command would be:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Here is the output:

MyTopic

Next, to increase the number of partitions in a topic (MyTopic in this case), run this command in the command prompt / PowerShell:

.\bin\windows\kafka-topics.bat --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic

On Linux/macOS, the command would be:

bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic

This command increases the partition count of MyTopic to three.

Finally, to delete a topic (MyTopic in this case) from the Kafka server, run this command in the command prompt / PowerShell:

.\bin\windows\kafka-topics.bat --bootstrap-server localhost:9092 --delete --topic MyTopic

On Linux/macOS, the command would be:

bin/kafka-topics.sh --bootstrap-server localhost:9092 --delete --topic MyTopic

Now, if you list the topics in the Kafka server again, it will not output anything because you have deleted the only topic (MyTopic) from the server.
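
These operations can also be scripted. Here is a sketch using the admin API of the third-party kafka-python client (the method names below belong to that client, not to Kafka’s own tooling):

from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Recreate MyTopic with three partitions, mirroring the CLI steps above.
admin.create_topics([NewTopic(name="MyTopic", num_partitions=3, replication_factor=1)])

print(admin.list_topics())  # list topic names, like kafka-topics --list

admin.delete_topics(["MyTopic"])  # delete it again, like kafka-topics --delete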

Now that we’ve covered the basic Apache Kafka operations, let’s discuss why you should use Kafka in your workflow.

Why use Kafka?

There are several reasons for using Kafka:

  • Real-time data processing: Kafka allows for the processing of real-time data streams, enabling businesses to make decisions quickly.
  • Scalability: Kafka is highly scalable and can manage a large volume of data without impacting performance.
  • Fault tolerance: Kafka is fault-tolerant, ensuring that data is not lost even in case of hardware failure.
  • High throughput: Kafka can process a large amount of data with low latency, making it suitable for applications that require real-time processing.

Finally, let’s explore the real-world applications of Apache Kafka.

Real-world applications of Apache Kafka

Here are some real-world applications of Apache Kafka:

  • Data ingestion: One of the most common use cases of Apache Kafka is data ingestion. Companies use Kafka to collect and aggregate large volumes of data from various sources in real time.
  • Real-time analytics: Kafka is used to stream data for real-time analytics. By processing and analyzing data as it arrives, companies can make quicker and more informed decisions.
  • Log aggregation: Kafka is often used for log aggregation, allowing companies to centralize their log data from multiple sources and easily analyze and monitor system logs.
  • Metrics monitoring: Kafka is also used for monitoring metrics in real time. Companies can stream metrics data to Kafka and use it to monitor the health and performance of their systems.

Conclusion

In this Kafka tutorial, we had a detailed discussion on Apache Kafka, covering what it is, its core components, and how to get started with it. We also explored basic Apache Kafka operations, the reasons for using it, and some real-world applications that prove its popularity in the world of big data.

Apache Kafka is a powerful platform used by organizations worldwide for real-time data processing and analytics. Kafka can also be deployed in cloud environments, such as with Confluent Cloud or on Kubernetes, for tasks like data ingestion, data replication, data integration, and more. By mastering Apache Kafka, you can take your data processing abilities to the next level.

If you want to learn more about data analysis in Python, check out the Analyze Data with Python course on Codecademy.

Frequently asked questions

1. What is Apache Kafka used for?

Apache Kafka is used for creating real-time data pipelines and streaming applications. It helps organizations capture, store, and process event streams such as logs, user activity, transactions, and sensor data.

2. Is Apache Kafka tough to learn?

Apache Kafka has a learning curve, especially for beginners unfamiliar with distributed systems and event streaming concepts. However, once you understand its core components—topics, producers, consumers, and brokers—it becomes much easier.

3. Is Apache Kafka a tool or library?

Apache Kafka is a distributed platform (or system), not just a tool or library. It provides messaging, storage, and stream processing capabilities, making it more comprehensive than a simple message queue or library.

4. Does Apache Kafka require coding?

Using Apache Kafka at a basic level (like creating topics, producing, and consuming messages) doesn’t require heavy coding—you can use command-line tools. However, for integrating Kafka into applications, developing stream processing logic, or building event-driven systems, you will need to write code in languages like Java, Python, or Scala.

5. Does Apache Kafka use a queue?

Apache Kafka is often compared to message queues, but it works differently. Instead of a traditional queue where each message is consumed by one receiver, Kafka uses a publish-subscribe model with topics and partitions, allowing multiple consumers to read the same data independently.
