Apache Kafka for Beginners
Introduction to Apache Kafka
Have you ever wondered how popular applications like Netflix or Uber process massive amounts of real-time data seamlessly? A big part of the answer is Apache Kafka, a robust distributed event streaming platform.
Apache Kafka is designed to efficiently manage real-time data feeds. It was originally developed at LinkedIn and open-sourced in 2011, later becoming a top-level Apache project. Kafka has gained immense popularity thanks to its scalability, fault tolerance, and high throughput.
In this tutorial, we’ll discuss how to get started with Apache Kafka, the advantages it offers, and its core components.
Why use Kafka?
Apache Kafka offers several key advantages, including:
- Real-Time Data Processing: Kafka allows for the processing of real-time data streams, enabling businesses to make decisions quickly.
- Scalability: Kafka is highly scalable and can manage a large volume of data without impacting performance.
- Fault Tolerance: Kafka is fault-tolerant, ensuring that data is not lost even in case of hardware failure.
- High Throughput: Kafka can process a large amount of data with low latency, making it suitable for applications that require real-time processing.
Core components of Apache Kafka
Apache Kafka consists of several core components that work together to process and store data:
- Kafka Clusters: Kafka clusters are distributed systems that consist of multiple Kafka brokers working together to handle and process real-time data streams.
- Brokers: Brokers are the core of the Kafka cluster. They receive messages from producers, store them in partitions, and deliver them to consumers.
- Topics: Topics are the channels through which data is organized and categorized. They can be divided into multiple partitions for better scalability and performance.
- Partitions: Partitions are the fundamental unit of data storage in Kafka. Topics that are divided into multiple partitions are distributed across the brokers in the cluster.
- Producers: Producers publish data to Kafka, sending messages to specific topics within the cluster.
- Consumers: Consumers subscribe to topics and receive messages from them. They can process the received messages, store them, or perform other actions.
- ZooKeeper: ZooKeeper acts as a distributed coordination service that manages the Kafka cluster. It stores metadata about the cluster, such as the list of brokers, topics, and partitions.
- Offsets: Offsets are unique identifiers that represent the position of a message within a specific partition of a topic. They are crucial for tracking the progress of consumers within a topic.
In current versions of Apache Kafka, ZooKeeper is no longer a mandatory component. Starting with version 2.8.0, Kafka introduced KRaft (Kafka Raft) mode, a consensus protocol that provides a more integrated and efficient way to manage metadata within the Kafka cluster itself. KRaft became production-ready in Kafka 3.3, and ZooKeeper support was removed entirely in Kafka 4.0.
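As a rough sketch, here is how a single-node KRaft cluster can be started using the scripts bundled with Kafka 3.x binary distributions (which ship a ready-made config/kraft/server.properties file). First, generate a unique cluster ID:
bin/kafka-storage.sh random-uuid
Then format the storage directory, replacing <uuid> with the ID printed by the previous command:
bin/kafka-storage.sh format -t <uuid> -c config/kraft/server.properties
Finally, start the broker in KRaft mode; no ZooKeeper process is needed:
bin/kafka-server-start.sh config/kraft/server.properties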
Now that we are familiar with the components that Apache Kafka includes, let’s discuss how data flows across these components.
How data flows across Apache Kafka components
Apache Kafka utilizes a publish-subscribe messaging model for managing data streams, making it a popular choice for stream processing and event-driven architectures.
In this model, data flows across the Kafka components as follows:
- Producers first send messages to specific topics within a Kafka cluster.
- These topics are then split into partitions and distributed across the brokers within the cluster.
- Consumers, organized into consumer groups, subscribe to these topics and read messages from the partitions assigned to them by the brokers.
- Each consumer tracks its progress through a partition using offsets, allowing it to resume from where it left off after a restart or failure.
- ZooKeeper (or KRaft in newer versions) coordinates this entire process, ensuring a smooth and reliable data flow operation in the Apache Kafka ecosystem.
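To see this model in action from the command line, you can run two console consumers in the same consumer group; Kafka will then split the topic's partitions between them. This sketch assumes a broker on localhost:9092 and a multi-partition topic named MyTopic (both are set up later in this tutorial), and the group name my-group is just an example:
bin/kafka-console-consumer.sh --topic MyTopic --group my-group --bootstrap-server localhost:9092
Running the same command in a second terminal adds a second consumer to my-group, and each consumer then receives messages only from the partitions assigned to it.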
Next, let’s learn how to get started with Apache Kafka.
How to get started with Apache Kafka
The process of setting up Apache Kafka is relatively straightforward.
Before setting up Apache Kafka, install the following tools, which it needs to run correctly:
- WSL (required on Windows only; Linux and macOS already provide a suitable terminal)
- Java (a JDK, version 8 or newer)
To install WSL (Windows Subsystem for Linux), open the command prompt / PowerShell on your machine and run the command:
wsl --install
To install Java, go to the official Java download page and download the installer for the latest version.
After the download is complete, open the installer and go through the on-screen prompts to install Java on your machine.
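You can verify that Java is installed correctly by running this command, which prints the installed Java version:
java -version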
Once you are done installing the tools, follow these step-by-step instructions to set up Apache Kafka on your machine:
Download Kafka: Go to the official downloads page for Apache Kafka and download the latest version as an archive (a file used for storing multiple files/folders) from the Binary downloads section.
After downloading Kafka, go to the directory where the archive is downloaded. Then, extract the contents from the archive to a new folder and rename it to kafka.
Next, open the command prompt / PowerShell on your machine and run the following command to navigate to the newly created folder kafka:
cd kafka
Start ZooKeeper: ZooKeeper comes bundled with Apache Kafka. Execute the following command to start it (skip this step if your Kafka version runs in KRaft mode):
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties
On Linux/macOS, run this command in the terminal instead:
bin/zookeeper-server-start.sh config/zookeeper.properties
Start Apache Kafka: Start Apache Kafka on your machine by running this command in a new command prompt / PowerShell window:
.\bin\windows\kafka-server-start.bat .\config\server.properties
On Linux/macOS, the command would be:
bin/kafka-server-start.sh config/server.properties
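Optionally, you can confirm that the broker is up by querying the API versions it supports. This assumes the default listener on localhost:9092; on Windows, use the matching .bat script under bin\windows:
bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092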
Create a Topic: Execute the following command in a new command prompt / PowerShell instance to create a topic named MyTopic:
.\bin\windows\kafka-topics.bat --create --topic MyTopic --bootstrap-server localhost:9092
On Linux/macOS, the command would be:
bin/kafka-topics.sh --create --topic MyTopic --bootstrap-server localhost:9092
In this command:
- --bootstrap-server: Specifies the Kafka broker (server) to connect to (localhost:9092). Apache Kafka requires a broker to manage the topic and perform operations like publishing or consuming messages.
- localhost:9092: The address of the Kafka broker.
- localhost: Indicates that the broker is running on the same machine where the command is executed.
- 9092: The default port that Apache Kafka uses to listen for incoming connections. If the broker is running on a different machine or port, update this value accordingly.
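By default, the topic is created with the broker's default partition count (one, unless configured otherwise). You can also set the partition count and replication factor explicitly at creation time; the topic name OrdersTopic and the values below are just illustrative:
bin/kafka-topics.sh --create --topic OrdersTopic --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092
With a single broker, the replication factor must be 1, since each replica of a partition has to live on a different broker.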
Start Kafka Producer: Start the Kafka producer by running another command in a new command prompt / PowerShell instance:
.\bin\windows\kafka-console-producer.bat --topic MyTopic --bootstrap-server localhost:9092
On Linux/macOS, the command would be:
bin/kafka-console-producer.sh --topic MyTopic --bootstrap-server localhost:9092
Start Kafka Consumer: Execute another command in a new command prompt / PowerShell instance to start the Kafka consumer:
.\bin\windows\kafka-console-consumer.bat --topic MyTopic --from-beginning --bootstrap-server localhost:9092
On Linux/macOS, the command would be:
bin/kafka-console-consumer.sh --topic MyTopic --from-beginning --bootstrap-server localhost:9092
Testing: Enter a message in the Kafka producer and press Enter:
This is a message.
After you hit Enter, the Kafka consumer reads the message and prints the same text:
This is a message.
And there it is - you have just produced and consumed your first Kafka message!
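As an optional experiment, the console producer can also send keyed messages; a message's key determines which partition it is written to. This sketch uses the producer's parse.key and key.separator properties, and the sample input user1:hello is arbitrary:
bin/kafka-console-producer.sh --topic MyTopic --bootstrap-server localhost:9092 --property parse.key=true --property key.separator=:
With this setup, typing user1:hello sends a message with key user1 and value hello.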
In the next section, we will discuss how to perform basic operations in Apache Kafka.
Performing basic operations in Apache Kafka
In this section, you will learn how to perform basic operations in Apache Kafka, such as listing topics, creating partitions, and deleting topics.
Firstly, to list the topics on the Kafka server, run this command in the command prompt / PowerShell on Windows. Note that older tutorials pass a --zookeeper flag to these tools; that option was removed in Kafka 3.0, so modern versions use --bootstrap-server instead:
.\bin\windows\kafka-topics.bat --list --bootstrap-server localhost:9092
On Linux/macOS, the command would be:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
Here is the output:
MyTopic
Next, to increase the number of partitions in a topic (MyTopic in this case), run this command in the command prompt / PowerShell:
.\bin\windows\kafka-topics.bat --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic
On Linux/macOS, the command would be:
bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --partitions 3 --topic MyTopic
This command increases MyTopic to 3 partitions. Note that Kafka only allows the partition count to be increased, never decreased.
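You can verify the change with the --describe option, which prints each partition along with its leader and replicas:
bin/kafka-topics.sh --describe --topic MyTopic --bootstrap-server localhost:9092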
Finally, to delete a topic (MyTopic in this case) from the Kafka server, run this command in the command prompt / PowerShell:
.\bin\windows\kafka-topics.bat --delete --topic MyTopic --bootstrap-server localhost:9092
On Linux/macOS, the command would be:
bin/kafka-topics.sh --delete --topic MyTopic --bootstrap-server localhost:9092
Now, if you list the topics on the Kafka server again, it will not output anything, because you have deleted the only topic (MyTopic) from the server.
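Another common operation is inspecting consumer groups. The kafka-consumer-groups tool lists the groups known to the broker and, for a given group, shows per-partition offsets and lag (the group name my-group here is just an example):
bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092
bin/kafka-consumer-groups.sh --describe --group my-group --bootstrap-server localhost:9092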
Applications of Apache Kafka
Here are a few real-world applications of Apache Kafka:
- Data Ingestion: One of the most common use cases of Apache Kafka is data ingestion. Companies use Kafka to collect and aggregate large volumes of data from various sources in real-time.
- Real-Time Analytics: Kafka can be used to stream data for real-time analytics. By processing and analyzing data, companies can make quicker and more informed decisions.
- Log Aggregation: Kafka is often used for log aggregation, allowing companies to centralize their log data from multiple sources and easily analyze and monitor system logs.
- Metrics Monitoring: Kafka is also used for monitoring metrics in real-time. Companies can stream metrics data to Kafka and use it to monitor the health and performance of their systems.
Conclusion
In this tutorial, we learned about Apache Kafka: its advantages, its core components, and the step-by-step process of getting started with it.
Apache Kafka is a powerful platform used by organizations worldwide for real-time data processing and analytics. Kafka can also run in cloud environments, such as Confluent Cloud or on Kubernetes, for tasks like data ingestion, data replication, and data integration. By mastering Apache Kafka, you can take your data processing abilities to the next level.