Home Engineering

Kafka for Beginners: How It Works and Why It Exists

03 March 2026 · 22 min read
Table of contents

Most explanations of Kafka start with “Kafka is a distributed event streaming platform.” That sentence is accurate and tells you nothing useful. This post starts from the problem Kafka was built to solve, then explains each piece of the system from why it exists, not from what it is called.


Part 1: Why Kafka Exists

The problem with direct service calls

Suppose you have an e-commerce system. When a user places an order, several things must happen: the inventory service deducts stock, the notification service sends a confirmation email, the analytics service records the sale, and the fraud detection service runs a check.

The naive approach: the order service calls each downstream service directly.

Order Service
    |
    |-- HTTP --> Inventory Service
    |-- HTTP --> Notification Service
    |-- HTTP --> Analytics Service
    |-- HTTP --> Fraud Detection Service

This works until it does not. What breaks:

Tight coupling. The order service must know about every downstream consumer. When the analytics team wants to add a new event, they modify the order service. When fraud detection changes its API, the order service changes too. Every new consumer is a new dependency the order service has to manage.

Availability chain. If the notification service is down, the order service either fails the entire order or implements its own retry logic per downstream service. Four consumers means four different failure modes to handle.

Speed mismatch. Fraud detection might take 300ms. The customer should not wait 300ms for their order to complete just because one downstream service is slow. The order service either blocks on all of them or manages parallel calls with timeouts for each.

No replay. If analytics goes down for two hours and misses 10,000 events, those events are gone. There is no way to re-send them.

What Kafka changes

Kafka introduces a broker that sits between producers and consumers. The order service publishes one event. Every downstream service reads that event independently.

Order Service --> Kafka --> Inventory Service
                       --> Notification Service
                       --> Analytics Service
                       --> Fraud Detection Service

Now:

  • The order service does not know or care who reads the event. Adding a new consumer requires zero changes to the producer.
  • If notification is down, its events accumulate in Kafka. When it recovers, it reads from where it left off.
  • If analytics misses two hours, it replays those two hours from Kafka’s stored log.
  • Fraud detection runs independently at its own pace without slowing down the order response.

Trade-off: You have traded synchronous simplicity for asynchronous complexity. You no longer get an immediate confirmation that every downstream service succeeded. The order is accepted, but whether the email was sent is a separate question answered asynchronously. This is the right trade-off for high-throughput systems. It is the wrong trade-off when you need an immediate, synchronous response from a downstream service before you can proceed.


Part 2: Topics and Partitions

Topics: named channels for events

A topic is a named log of events. Producers write to a topic; consumers read from a topic. Think of a topic as a table in a database, except you can only append to it and reads do not remove records.

// A producer writing to the "orders" topic
ProducerRecord<String, String> record = new ProducerRecord<>("orders", orderId, orderJson);
producer.send(record);

Why partitions exist

A single topic can receive millions of events per second. One machine cannot write or read that fast. Kafka splits a topic into partitions, each an independent ordered log stored on a different broker.

Topic: "orders"

Partition 0: [msg0] [msg3] [msg6] [msg9] ...
Partition 1: [msg1] [msg4] [msg7] [msg10] ...
Partition 2: [msg2] [msg5] [msg8] [msg11] ...

Within a single partition, order is guaranteed. Across partitions, there is no ordering guarantee. This is the fundamental trade-off: parallelism in exchange for global ordering.

What breaks without partitions: If a topic had only one partition, one consumer could process it and throughput would be bounded by a single machine. With 10 partitions, 10 consumers can read in parallel, multiplying throughput by roughly 10.

How Kafka decides which partition a message goes to:

// If you provide a key, Kafka hashes it to choose a partition
// All messages with the same key always go to the same partition
// -> ordering is guaranteed per key
ProducerRecord<String, String> record = new ProducerRecord<>(
    "orders",
    customerId,   // key: same customer always goes to same partition
    orderJson     // value
);

// If no key, Kafka distributes round-robin across partitions
ProducerRecord<String, String> record = new ProducerRecord<>("orders", orderJson);

Production example: An order service sends events keyed by customerId. All events for customer 42 land on partition 3 in sequence. The fraud detection consumer reading partition 3 sees all of customer 42’s events in order, which is necessary to detect velocity patterns (three orders in one minute).

Drag · Scroll to zoom

Trade-off: The number of partitions is set when the topic is created and is hard to change later (increasing partitions changes the hash mapping for keys, breaking per-key ordering for existing consumers). Start with more partitions than you need today. A common starting point for a medium-traffic topic is 12 to 24 partitions.

The hot partition problem: If your partition key has one dominant value, all traffic concentrates on one partition. A payment system keyed by paymentMethod where 90% of users pay by credit card sends 90% of messages to one partition. One consumer handles that partition while the others sit mostly idle. The fix is to choose a high-cardinality key, customerId or orderId, that distributes load evenly across partitions. If no natural key exists, use a random or round-robin strategy and accept that per-key ordering is not guaranteed.


Part 3: Brokers and the Cluster

A Kafka cluster is a group of broker servers. Each broker stores some partitions. Kafka replicates each partition across multiple brokers so that if one broker fails, the data is not lost.

Broker 1: Partition 0 (leader), Partition 1 (replica)
Broker 2: Partition 1 (leader), Partition 2 (replica)
Broker 3: Partition 2 (leader), Partition 0 (replica)

Every partition has one leader broker, which handles all reads and writes for that partition. The other brokers holding copies are followers, which replicate from the leader.

What breaks without replication: If Broker 1 holds the only copy of Partition 0 and Broker 1 crashes, all messages in Partition 0 are gone and the producer cannot write new ones until the broker recovers. With replication factor 3, Broker 1 failing is not a crisis: Broker 3 (which holds a replica) is promoted to leader automatically.

The replication factor controls how many copies exist. Replication factor 1 means no redundancy. Replication factor 3 means the data survives the loss of any one broker. Most production topics use replication factor 3.


Part 4: Producers

What a producer does

A producer is any code that writes messages to Kafka. The Java producer client handles connection management, serialization, partition selection, batching, and retry automatically.

// Minimal producer setup
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);

// Send is asynchronous by default
Future<RecordMetadata> future = producer.send(
    new ProducerRecord<>("orders", customerId, orderJson)
);

// Always close to flush buffered records and release connections
producer.close();

Acknowledgment levels

The acks setting controls how many brokers must confirm a write before Kafka tells the producer it succeeded. This is the most important producer configuration for reliability.

// acks=0: fire and forget
// Producer does not wait for any acknowledgment
// Fastest, but messages can be lost if the broker crashes immediately after receiving them
props.put(ProducerConfig.ACKS_CONFIG, "0");

// acks=1: leader acknowledges
// Broker leader confirms the write to its local log
// Message is lost if the leader crashes before replicating to followers
props.put(ProducerConfig.ACKS_CONFIG, "1");

// acks=all (or -1): all in-sync replicas acknowledge
// Message is durable as long as at least one replica survives
// Slowest, but no data loss under any single-broker failure
props.put(ProducerConfig.ACKS_CONFIG, "all");

Production rule: Use acks=all for any data you care about losing. Use acks=1 for high-volume metrics or logs where occasional loss is acceptable. Never use acks=0 unless you are explicitly benchmarking throughput and can accept message loss.

Batching

The producer does not send one network request per message. It buffers records and sends them in batches, which dramatically reduces network overhead.

// How long the producer waits to fill a batch before sending
props.put(ProducerConfig.LINGER_MS_CONFIG, "5"); // wait up to 5ms

// Maximum batch size in bytes
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // 16KB default

// Enable snappy compression on batches
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

With linger.ms=5, the producer waits 5ms before sending. If 1,000 messages arrive in those 5ms, they go out in one network call instead of 1,000. The trade-off: every message waits up to 5ms before being sent. For latency-sensitive paths, keep linger.ms at 0 or 1.


Part 5: Consumers and Consumer Groups

Offsets: how Kafka tracks what you have read

Every message in a partition has a sequential number called an offset. Offset 0 is the first message ever written, offset 1 the second, and so on. When a consumer reads messages, it commits its current offset back to Kafka. If the consumer crashes and restarts, it resumes from the last committed offset.

Partition 0 offsets:
  0       1       2       3       4       5
[msg0] [msg1] [msg2] [msg3] [msg4] [msg5]
                      ^
              committed offset = 2
              (consumer has processed up to msg2)
              next read starts at offset 3

What breaks without offset commits: If you read messages but never commit the offset, every restart re-reads from the beginning of the topic. If a topic has 6 months of order history, every restart reprocesses all of it.

What breaks with auto-commit: Kafka’s default setting (enable.auto.commit=true) commits the offset every 5 seconds regardless of whether your code successfully processed the messages. If your code reads a batch, the auto-commit fires, then your processing crashes before finishing: those messages are marked as done even though they were not processed. You lose them.

// Safer: disable auto-commit and commit manually after processing
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    processOrder(record.value()); // process first
}
consumer.commitSync(); // then commit: only reached if processing succeeded

Consumer groups: parallel reading

A consumer group is a set of consumers that cooperate to read a topic. Kafka assigns each partition to exactly one consumer in the group. No two consumers in the same group read the same partition at the same time.

Topic "orders" with 3 partitions
Consumer Group "fraud-detection" with 3 consumers:

  Partition 0 --> Consumer A
  Partition 1 --> Consumer B
  Partition 2 --> Consumer C

Each consumer processes its assigned partitions independently. To double throughput, add a consumer to the group (up to the number of partitions). To have two independent systems read the same topic, use two separate consumer groups: each group gets a full, independent copy of every message.

Topic "orders" with 3 partitions

Group "fraud-detection":         Group "analytics":
  Partition 0 --> Consumer A       Partition 0 --> Consumer X
  Partition 1 --> Consumer B       Partition 1 --> Consumer Y
  Partition 2 --> Consumer C       Partition 2 --> Consumer Z

What happens if you add more consumers than partitions: Extra consumers sit idle. A topic with 3 partitions can use at most 3 consumers in the same group for parallel processing.

Drag · Scroll to zoom

Rebalancing

When a consumer joins or leaves a group, Kafka reassigns partitions among the remaining consumers. This is called a rebalance. During a rebalance, all consumers in the group pause reading until the new assignment is complete.

What breaks without handling rebalances: If your consumer holds an open database transaction or an in-flight HTTP request when a rebalance starts, Kafka revokes the partition mid-processing. The next consumer picks up from the last committed offset and reprocesses the same messages. Without idempotency in your processing logic, you get duplicate inserts, duplicate emails, and duplicate charges.

consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before partitions are taken away
        // Finish in-flight work and commit current offsets
        consumer.commitSync(currentOffsets);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after new partitions are assigned
        // Load any per-partition state for the newly assigned partitions
    }
});

Part 6: Delivery Guarantees

Kafka provides three delivery semantics. Understanding which one your system uses by default is not optional.

At-most-once

Messages may be lost but are never processed twice. Achieved by committing the offset before processing.

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
consumer.commitSync(); // commit first
for (ConsumerRecord<String, String> record : records) {
    processOrder(record.value()); // if this crashes, message is skipped
}

Use case: metrics, logs, and any data where occasional loss is acceptable and duplicates are expensive.

At-least-once

Messages are never lost but may be processed more than once. Achieved by committing the offset only after processing.

ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
    processOrder(record.value()); // if this crashes, message is retried from last committed offset
}
consumer.commitSync(); // commit after: no loss, but reprocessing possible on crash

This is the Kafka default for most production systems. Your processing logic must be idempotent: calling it twice with the same message must produce the same result as calling it once.

// Idempotent: safe to call multiple times with the same order
public void processOrder(Order order) {
    // Use INSERT ... ON CONFLICT DO NOTHING or INSERT ... ON CONFLICT DO UPDATE
    // to ensure duplicate events do not create duplicate records
    orderRepository.upsert(order);
}

Exactly-once

Messages are processed exactly once. Kafka supports this through transactions, which atomically write to Kafka and commit the consumer offset together. It is more complex to implement and has higher latency.

// Producer side: enable idempotence and transactions
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1");

producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("processed-orders", key, value));
    // Commit consumer offset atomically with the produce
    producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);
    producer.commitTransaction();
} catch (Exception e) {
    producer.abortTransaction();
}

Use exactly-once when you are transforming one Kafka topic into another and duplicates are genuinely unacceptable. For most consumer use cases (writing to a database, sending an email), at-least-once with idempotent processing is simpler and sufficient.


Part 7: A Complete Spring Boot Example

// Producer configuration
@Configuration
public class KafkaProducerConfig {

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        return new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(props));
    }
}

// Publishing an order event
@Service
public class OrderService {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ObjectMapper objectMapper;

    public Order placeOrder(OrderRequest request) {
        Order order = createOrder(request);
        orderRepository.save(order);

        try {
            String payload = objectMapper.writeValueAsString(order);
            // key = orderId: all events for the same order land on the same partition
            kafkaTemplate.send("orders", order.getId(), payload);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Failed to serialize order event", e);
        }

        return order;
    }
}
// Consumer configuration
@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, String> consumerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        return new DefaultKafkaConsumerFactory<>(props);
    }

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory());
        // Manual offset commit: Spring commits only after the listener method returns successfully
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.RECORD);
        return factory;
    }
}

// Processing order events
@Service
public class FraudDetectionConsumer {

    @KafkaListener(topics = "orders", groupId = "fraud-detection")
    public void onOrder(ConsumerRecord<String, String> record) {
        Order order = deserialize(record.value());

        FraudResult result = fraudEngine.evaluate(order);

        if (result.isSuspicious()) {
            alertRepository.save(new FraudAlert(order.getId(), result.getReason()));
        }
        // Spring commits the offset after this method returns without exception
        // If an exception is thrown, the offset is not committed and the message is retried
    }
}

application.yml for the consumer:

spring:
  kafka:
    bootstrap-servers: localhost:9092
    consumer:
      group-id: fraud-detection
      auto-offset-reset: earliest
      enable-auto-commit: false
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
    listener:
      ack-mode: record

Part 8: Retention and Replay

Kafka stores messages on disk for a configurable period. The default is 7 days. During that window, any consumer can replay from any offset.

# Create a topic with 30-day retention
kafka-topics.sh --create \
  --topic orders \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=2592000000  # 30 days in milliseconds

# View topic configuration
kafka-topics.sh --describe --topic orders

Production example of replay: The analytics service crashes and loses its state. It resets its committed offset back to 7 days ago and replays everything. The order service (the producer) needs no changes and does not even know this happened.

// Reset a consumer group offset to the beginning (in code, for a specific use case)
consumer.subscribe(List.of("orders"));
consumer.poll(Duration.ofMillis(0)); // trigger partition assignment
consumer.seekToBeginning(consumer.assignment());

Trade-off: Long retention means more disk usage on the brokers. A topic receiving 1 GB per hour with 30-day retention uses 720 GB per partition replica. Size your broker disks accordingly or use tiered storage (available in Kafka 3.6+) to offload older data to object storage like S3.

Consumer lag: knowing when consumers fall behind

Consumer lag is the difference between the latest offset in a partition and the last committed offset of a consumer group. A lag of 0 means the consumer is caught up. A lag of 50,000 means the consumer is 50,000 messages behind the producer.

# Check lag for a consumer group
kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group fraud-detection

# Output:
# GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# fraud-detection orders  0          10420           10420           0
# fraud-detection orders  1          9876            10001           125
# fraud-detection orders  2          10100           10100           0

Partition 1 has a lag of 125, meaning the consumer is 125 messages behind. A small, stable lag is normal. A lag that keeps growing means the consumer cannot keep up with the producer’s throughput.

# Expose lag metrics via Micrometer in Spring Boot
management:
  metrics:
    enable:
      kafka.consumer.fetch-latency-avg: true
# Alert when lag exceeds a threshold for more than 2 minutes
# lag > 10000 for 2m -> consumer is falling behind, investigate processing speed

The two causes of growing lag: the consumer is too slow (processing takes longer than messages arrive), or the consumer keeps crashing and restarting without making progress. Check processing time per message first, then check for repeated exceptions in the logs.


Part 9: Common Beginner Mistakes

Mistake 1: Not setting auto.offset.reset explicitly

# Default is "latest": new consumer groups start reading from NOW
# Messages that arrived before the consumer started are skipped

# For a new consumer that should read all historical messages:
auto-offset-reset: earliest

# For a new consumer that should only read new messages going forward:
auto-offset-reset: latest

If you deploy a new analytics consumer and wonder why it has no data from the past week, auto.offset.reset=latest is almost always the reason.

Mistake 2: Fewer partitions than consumers

Topic with 3 partitions, consumer group with 5 consumers:
  Partition 0 --> Consumer A  (active)
  Partition 1 --> Consumer B  (active)
  Partition 2 --> Consumer C  (active)
  (no partition) Consumer D  (idle)
  (no partition) Consumer E  (idle)

Consumers D and E consume memory and CPU but process nothing. You cannot increase parallelism beyond the partition count without repartitioning the topic.

Mistake 3: Using auto-commit with non-idempotent processing

Auto-commit fires on a timer. If your processing takes longer than auto.commit.interval.ms (default 5 seconds), Kafka commits offsets for messages that your code is still processing. A crash after the commit but before processing finishes causes silent message loss.

Always use enable.auto.commit=false and commit explicitly.

Mistake 4: Long-running operations inside the poll loop

Kafka has a session timeout (session.timeout.ms, default 45 seconds) and a maximum poll interval (max.poll.interval.ms, default 5 minutes). If your consumer does not call poll() within that interval, the broker considers it dead and triggers a rebalance.

// WRONG: slow processing inside the poll loop
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records) {
        externalApiClient.process(record.value()); // 10 seconds per call
        // If you have 100 records, poll is not called for 1000 seconds -> rebalance
    }
}

// CORRECT: process asynchronously or increase max.poll.interval.ms
// and reduce max.poll.records to keep the batch size manageable
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000); // 1 minute

Mistake 5: Not handling deserialization errors

If a malformed message lands in the topic and your deserializer throws an exception, most consumer loops will retry that message indefinitely, blocking all subsequent messages in the partition.

// Spring Kafka: configure a dead-letter topic for messages that fail after retries
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, String> template) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template,
            (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));

    // Retry up to 3 times with 1-second backoff, then send to DLT
    return new DefaultErrorHandler(recoverer,
        new FixedBackOff(1000L, 3));
}

Part 10: Dead Letter Queues

When a consumer fails to process a message, it has two choices: keep retrying (blocking the entire partition) or skip it (losing the message). Dead letter queues give you a third option: park the bad message in a separate topic and keep moving. The original consumer stays unblocked, and you have a record of every failed message to inspect and replay later.

Why a consumer fails to process a message

Processing failures fall into a few categories. Deserialization errors happen when the message format does not match what the consumer expects, for example a producer switched to a new schema and the consumer has not been updated yet. Validation errors happen when the data is technically parseable but logically wrong, such as a negative quantity or a missing required field that your business logic cannot handle. Downstream failures happen when your consumer calls an external service (a database, a payment gateway, a third-party API) and that service is down or returns an unexpected error. Poison pills are messages that consistently cause an unhandled exception regardless of retries, often due to a bug introduced in a recent deployment.

Transient failures (downstream service temporarily down) are worth retrying. Permanent failures (deserialization errors, poison pills) will never succeed no matter how many retries you attempt. A good DLQ strategy handles both: retry a few times with a backoff, then park the message if it still fails.

The failure loop problem

Without a DLQ, a single bad message can stop an entire consumer group. If your consumer retries indefinitely on a deserialization error or a downstream service being down, no other messages in that partition make progress until the bad message is resolved.

Order events topic
─────────────────────────────────────────────
msg 1 ✓  msg 2 ✓  msg 3 ✗ (bad)  msg 4 ...
           Consumer stuck here, retrying
           msg 4 and beyond never processed

A DLQ breaks this loop. After N retries the consumer publishes the failed message to a dedicated dead letter topic and commits the offset, allowing the partition to keep moving.

Spring Kafka DLQ setup

@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
    DeadLetterPublishingRecoverer recoverer =
        new DeadLetterPublishingRecoverer(template,
            // Routes order-events failures to order-events.DLT
            (record, ex) -> new TopicPartition(
                record.topic() + ".DLT",
                record.partition()
            )
        );

    FixedBackOff backOff = new FixedBackOff(1000L, 3); // 3 retries, 1s apart
    return new DefaultErrorHandler(recoverer, backOff);
}

DeadLetterPublishingRecoverer is the Spring Kafka built-in that handles publishing to the dead letter topic. After 3 retries, the original message (along with exception headers) is written to order-events.DLT and the consumer moves on.

What lands in the DLT

Spring Kafka enriches dead letter messages with headers that tell you exactly what went wrong:

kafka_dlt-original-topic:      order-events
kafka_dlt-original-partition:  2
kafka_dlt-original-offset:     10421
kafka_dlt-exception-fqcn:      com.fasterxml.jackson.databind.exc.MismatchedInputException
kafka_dlt-exception-message:   Cannot deserialize value of type `OrderEvent` from ...
kafka_dlt-exception-stacktrace: (full stack trace)

These headers let you route different exception types differently, or build a dashboard that shows the most common failure causes.

Reprocessing dead letter messages

A DLT is only useful if you act on it. Common patterns:

Alert and inspect: Set a consumer lag alert on the .DLT topic. Any message landing there triggers investigation. Engineers inspect the message, fix the bug, then replay.

Replay after a fix: Once the root cause is fixed, a replay consumer reads from the DLT and publishes messages back to the original topic:

@KafkaListener(topics = "order-events.DLT", groupId = "dlt-replayer")
public void replayDeadLetter(ConsumerRecord<String, byte[]> record) {
    // Strip DLT headers before republishing
    kafkaTemplate.send("order-events", record.key(), record.value());
}

Permanent archive: Some teams treat the DLT as a forensics log and never replay automatically. They fix the consumer, then decide case by case whether to replay or discard.

When not to use a DLT

DLTs work well for transient failures (downstream service down, network blip) and data quality issues (malformed messages, schema mismatches). They are not the right tool when the failure is expected: a payment declined, a stock item out of stock. Those are valid business outcomes, not processing errors. Handle them with explicit branching in your consumer logic, not by routing to a DLT.


Summary

ConceptWhat it doesKey setting
TopicNamed log of eventsretention.ms
PartitionUnit of parallelism and orderingCount set at creation, hard to change
ProducerWrites to a topicacks, linger.ms
ConsumerReads from a topicenable.auto.commit, auto.offset.reset
Consumer groupParallel readers sharing a topicgroup.id
OffsetPosition in a partitionCommitted after successful processing
ReplicationDurability across broker failuresreplication.factor

Mental model: Kafka is an append-only log, not a queue. Messages are not deleted when read. They expire after the retention window. Multiple independent consumers read the same messages at their own pace. Every consumer group has its own offset per partition. Adding a consumer group never affects another group. This is the core behavior that makes Kafka useful: producers and consumers are completely decoupled in time and deployment.