
What is Apache Kafka? An introductory guide

In this article, we break down what Apache Kafka is, compare it to other systems, show how it works using code, and discuss challenges and best practices.

Apr 15, 2024 • 7 Minute Read

  • Software Development
  • Data
  • Learning & Development

Welcome to the world of Apache Kafka, a powerful tool reshaping how we handle real-time data. Today, we'll uncover what Kafka is and why it's becoming a cornerstone in modern data processing.

Imagine a bustling city where information is constantly flowing. Apache Kafka is like the central nervous system of this city, designed to handle this massive flow of data with ease and efficiency. Originally developed at LinkedIn and open-sourced in 2011, and later a top-level project under the Apache Software Foundation, Kafka is a distributed streaming platform.

But what does that mean? In simple terms, Kafka allows you to publish and subscribe to streams of records, store those records reliably, and process them as they occur. It's like having a super-efficient post office that never sleeps, continuously sorting and delivering messages to where they need to go.

In our data-driven world, the ability to handle real-time data efficiently is crucial. Kafka excels in this arena. It's used by thousands of companies, including giants like Netflix, Uber, and Twitter, to process streaming data for real-time analytics, monitoring, and many other applications.

Kafka's robustness, scalability, and fault tolerance make it an indispensable tool in handling large streams of data, ensuring that businesses can make data-driven decisions quickly and effectively.

Kafka Architecture and Main Elements

Let's dive into the architecture of Apache Kafka. We'll explore its key components and understand how they work together.

At its core, Kafka is designed to be robust, scalable, and fault-tolerant. It's built on a distributed architecture, meaning its components are spread across different machines, ensuring high availability and resilience.

The Kafka ecosystem comprises several critical components:

  1. Producers: These are the data sources that publish records to Kafka topics. Think of them as the senders in a messaging system.
  2. Consumers: They subscribe to topics and process the published records. They are the receivers in our analogy.
  3. Brokers: These are the heart of Kafka. A Kafka cluster consists of multiple brokers to maintain load balance and manage data. Two things worth noting about brokers: they are stateful, and they are the unit of scalability in Kafka.
  4. Topics: A topic is a category or feed to which records are published. Topics in Kafka are split into partitions for scalability and parallel processing.
  5. Partitions: Each partition is an ordered, immutable sequence of records that is continually appended to. Partitions allow Kafka to parallelize processing, as each partition can be consumed independently.

Imagine Kafka as a highly efficient mail system. Producers are like senders dropping off letters (messages) at the post office (broker). Each letter is sorted into specific PO boxes (topics), further organized into compartments (partitions). Consumers then collect letters from these compartments, ensuring efficient and orderly processing.

Let's look at how you can set up a basic Kafka environment using Docker. We'll deploy a Kafka cluster with two brokers and Zookeeper, which Kafka uses for cluster management and coordination.

# docker-compose.yml for Kafka
version: '3'
services:
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: kafka
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
  kafka2:
    image: wurstmeister/kafka
    ports:
      - "9093:9092"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: kafka2
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper

This `docker-compose.yml` file sets up a basic Kafka environment. We have Zookeeper, essential for managing the Kafka cluster, and two Kafka brokers for handling messages. Each broker is exposed on a different host port (9092 and 9093), and both are configured to connect to the Zookeeper service.

With this setup, you have a miniature version of what companies use to manage vast streams of data. Kafka's architecture is designed to handle high throughput and low latency, making it perfect for real-time data processing.
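If you want to confirm that both brokers joined the cluster and pre-create a topic with multiple partitions, the Java AdminClient can do both. The snippet below is a minimal sketch, assuming the compose file above is running and the brokers' advertised host names resolve from wherever you run it; the class name, partition count, and replication factor are illustrative choices, not part of the setup above.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Print the broker nodes that have joined the cluster
            admin.describeCluster().nodes().get()
                 .forEach(node -> System.out.println("Broker: " + node));

            // Create a 'test' topic with 2 partitions, replicated across both brokers
            NewTopic topic = new NewTopic("test", 2, (short) 2);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}

With the topic split into two partitions, the cluster can spread those partitions (and their replicas) across both brokers, which is exactly the parallelism described earlier.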

Kafka vs Other Systems

In the world of data streaming and message brokering, Apache Kafka, Apache Pulsar, and RabbitMQ are prominent players. Let's briefly compare them.

Kafka vs. Pulsar

Kafka is renowned for high throughput and reliability, ideal for large-scale message processing. Apache Pulsar, on the other hand, offers similar capabilities but with a stronger emphasis on multi-tenancy and geo-replication, making it suitable for complex distributed systems.

Kafka vs. RabbitMQ

RabbitMQ, widely known for its simplicity and ease of use, excels in traditional messaging and queueing, often favored for smaller-scale or less complex applications. Kafka, with its distributed nature and high durability, is more suited for large-scale event streaming and logging.

Going to the Code: Producers and Consumers

Let's dive into the practical side of Apache Kafka with basic Java code examples for a producer and a consumer. We'll also touch on key configuration settings.

Kafka Producer

A Kafka producer sends records to topics. Here's a simple Java example (you could just as easily use Node.js or Python):

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);

        try {
            // send() is asynchronous; close() flushes any buffered records before exiting
            producer.send(new ProducerRecord<>("test", "Hello, Kafka!"));
        } finally {
            producer.close();
        }
    }
}

This producer connects to Kafka running on localhost:9092, sending a simple message to the 'test' topic. Key configurations include server details and serializers for keys and values.
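In practice, producers usually attach a key, so related records land on the same partition, and check the outcome of each send. The variation below is a sketch along those lines; the key, the acks=all setting, and the class name are illustrative additions rather than part of the example above.

import org.apache.kafka.clients.producer.*;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // wait for all in-sync replicas to acknowledge
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always go to the same partition
            ProducerRecord<String, String> record =
                new ProducerRecord<>("test", "user-42", "Hello again, Kafka!");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Stored at partition %d, offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        }
    }
}

The callback fires once the broker acknowledges (or rejects) the write, which is how producers confirm delivery without blocking on every send.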

Kafka Consumer

Now, let's look at a Kafka consumer, which reads messages from a topic:

import org.apache.kafka.clients.consumer.*;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Consumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));

        try {
            // Poll once for demonstration; real consumers typically poll in a loop
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value());
            }
        } finally {
            consumer.close();
        }
    }
}

This consumer connects to the same Kafka cluster and listens to the 'test' topic. Key configurations here include the server details, consumer group ID, and deserializers for keys and values. The consumer group concept allows Kafka to distribute message consumption across multiple consumers.

Understanding Configuration Settings

In both examples, bootstrap.servers specifies the Kafka brokers to connect to. In a production environment this would be a list of multiple brokers for fault tolerance.

The key.serializer and value.serializer in the producer and their counterpart deserializers in the consumer handle the conversion of data to and from bytes, as Kafka stores and transmits messages in byte arrays.

The group.id in the consumer configures which consumer group this consumer belongs to. Kafka uses it for load balancing: within a group, each partition is assigned to exactly one consumer, so every message is delivered to only one member of the group.

These settings are just the tip of the iceberg. Kafka offers a myriad of configurations for fine-tuning performance, security, and reliability, allowing it to be tailored to a wide array of use cases.
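To give a feel for that iceberg, the sketch below collects a few settings that are commonly tuned beyond the defaults; the values and helper class are illustrative, not recommendations.

import java.util.Properties;

public class TunedConfigs {
    // A few settings that often get tuned beyond the defaults (illustrative values)
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all");        // durability: wait for all in-sync replicas
        props.put("retries", 3);         // retry transient send failures
        props.put("linger.ms", 5);       // small batching delay to improve throughput
        return props;
    }

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test-group");
        props.put("auto.offset.reset", "earliest"); // where to start with no committed offset
        props.put("enable.auto.commit", "false");   // commit manually for at-least-once control
        return props;
    }
}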

Kafka 3 Highlights

Apache Kafka 3.0 introduces a groundbreaking shift with its move towards a Zookeeper-less architecture, known as KRaft (Kafka Raft), which became production-ready in Kafka 3.3. This transition marks a pivotal moment in Kafka's evolution, promising a more streamlined, efficient, and scalable platform.

KRaft eliminates Kafka's historical dependence on Zookeeper, a separate service used for managing cluster metadata and coordination. By integrating the Raft consensus protocol directly into Kafka, KRaft simplifies the overall architecture and enhances operational efficiency.

Source: https://developer.confluent.io/learn/kraft/

This shift brings several key benefits:

  1. Simplified Operations: Without Zookeeper, Kafka's deployment and management become more straightforward, reducing the complexity of running a Kafka cluster.
  2. Improved Performance: KRaft leads to lower latencies in controller operations, as it removes the need to communicate with an external Zookeeper cluster.
  3. Enhanced Scalability and Reliability: Direct control within Kafka enhances the platform's scalability and reliability, laying the groundwork for future enhancements that are not constrained by external dependencies.

In traditional Kafka setups using Zookeeper, metadata management is an external process, adding complexity and potential points of failure. KRaft mode, however, internalizes this process, leading to a leaner, more cohesive system. While Zookeeper-based Kafka has been robust and well-tested over time, KRaft promises to streamline the Kafka ecosystem, making it more agile and adaptable to future demands.
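As a rough illustration of the difference, a node running in KRaft mode is configured with a handful of properties instead of a Zookeeper connection string. The snippet below is a sketch based on the standard KRaft settings; the node ID, ports, and paths are placeholders.

# server.properties for a single node in KRaft mode (illustrative values)
# This node acts as both broker and controller
process.roles=broker,controller
node.id=1
# The controller quorum: node.id@host:port for each voter
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
log.dirs=/tmp/kraft-combined-logs

Notice that there is no zookeeper.connect entry at all: the metadata quorum is described entirely in terms of Kafka nodes.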

Challenges and Best Practices

Implementing Kafka, especially for beginners, can present challenges such as managing data consistency, understanding partitioning strategies, and handling cluster scalability. Best practices include:

  • Careful Planning: Understanding your data and how it flows through Kafka topics is crucial.

  • Monitoring and Management: Regularly monitor Kafka's performance and manage resources effectively (see the offset-checking sketch after this list).

  • Backups and Disaster Recovery: Always have a plan for data backup and disaster recovery.
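For the monitoring point above, one concrete starting place is checking which offsets a consumer group has committed, which is the raw material for lag monitoring. The sketch below uses the Java AdminClient for this; the group ID and broker address match the earlier examples but are otherwise placeholders.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class GroupOffsetCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for each partition the group has consumed
            admin.listConsumerGroupOffsets("test-group")
                 .partitionsToOffsetAndMetadata()
                 .get()
                 .forEach((partition, offset) ->
                         System.out.println(partition + " -> " + offset.offset()));
        }
    }
}

Comparing these committed offsets against the latest offsets in each partition gives you consumer lag, one of the most useful health signals in a Kafka deployment.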

Conclusion

Apache Kafka is not just a messaging system but a comprehensive platform for handling real-time data streams. Its use in critical sectors like health and banking underscores its reliability and efficiency. Kafka 3, with its move towards a Zookeeper-less architecture, marks a significant step towards an even more streamlined and efficient future.

For beginners stepping into Kafka, remember, the journey is as rewarding as the destination. With Kafka's growing community and rich documentation, mastering this powerful tool is an achievable goal.

Further Learning

To deepen your Kafka knowledge, explore these resources:

Axel Sirota

Axel Sirota is a Microsoft Certified Trainer with a deep interest in Deep Learning and Machine Learning Operations. He has a Master's degree in Mathematics, and after researching Probability, Statistics, and Machine Learning optimization, he works as an AI and Cloud Consultant as well as an author and instructor at Pluralsight, Develop Intelligence, and O'Reilly Media.
