Kafka for Beginners: How It Works and Why It Exists
Table of contents
Most explanations of Kafka start with “Kafka is a distributed event streaming platform.” That sentence is accurate and tells you nothing useful. This post starts from the problem Kafka was built to solve, then explains each piece of the system from why it exists, not from what it is called.
Part 1: Why Kafka Exists
The problem with direct service calls
Suppose you have an e-commerce system. When a user places an order, several things must happen: the inventory service deducts stock, the notification service sends a confirmation email, the analytics service records the sale, and the fraud detection service runs a check.
The naive approach: the order service calls each downstream service directly.
Order Service
|
|-- HTTP --> Inventory Service
|-- HTTP --> Notification Service
|-- HTTP --> Analytics Service
|-- HTTP --> Fraud Detection Service
This works until it does not. What breaks:
Tight coupling. The order service must know about every downstream consumer. When the analytics team wants to add a new event, they modify the order service. When fraud detection changes its API, the order service changes too. Every new consumer is a new dependency the order service has to manage.
Availability chain. If the notification service is down, the order service either fails the entire order or implements its own retry logic per downstream service. Four consumers means four different failure modes to handle.
Speed mismatch. Fraud detection might take 300ms. The customer should not wait 300ms for their order to complete just because one downstream service is slow. The order service either blocks on all of them or manages parallel calls with timeouts for each.
No replay. If analytics goes down for two hours and misses 10,000 events, those events are gone. There is no way to re-send them.
What Kafka changes
Kafka introduces a broker that sits between producers and consumers. The order service publishes one event. Every downstream service reads that event independently.
Order Service --> Kafka --> Inventory Service
--> Notification Service
--> Analytics Service
--> Fraud Detection Service
Now:
- The order service does not know or care who reads the event. Adding a new consumer requires zero changes to the producer.
- If notification is down, its events accumulate in Kafka. When it recovers, it reads from where it left off.
- If analytics misses two hours, it replays those two hours from Kafka’s stored log.
- Fraud detection runs independently at its own pace without slowing down the order response.
Trade-off: You have traded synchronous simplicity for asynchronous complexity. You no longer get an immediate confirmation that every downstream service succeeded. The order is accepted, but whether the email was sent is a separate question answered asynchronously. This is the right trade-off for high-throughput systems. It is the wrong trade-off when you need an immediate, synchronous response from a downstream service before you can proceed.
Part 2: Topics and Partitions
Topics: named channels for events
A topic is a named log of events. Producers write to a topic; consumers read from a topic. Think of a topic as a table in a database, except you can only append to it and reads do not remove records.
// A producer writing to the "orders" topic
ProducerRecord<String, String> record = new ProducerRecord<>("orders", orderId, orderJson);
producer.send(record);
Why partitions exist
A single topic can receive millions of events per second. One machine cannot write or read that fast. Kafka splits a topic into partitions, each an independent ordered log stored on a different broker.
Topic: "orders"
Partition 0: [msg0] [msg3] [msg6] [msg9] ...
Partition 1: [msg1] [msg4] [msg7] [msg10] ...
Partition 2: [msg2] [msg5] [msg8] [msg11] ...
Within a single partition, order is guaranteed. Across partitions, there is no ordering guarantee. This is the fundamental trade-off: parallelism in exchange for global ordering.
What breaks without partitions: If a topic had only one partition, one consumer could process it and throughput would be bounded by a single machine. With 10 partitions, 10 consumers can read in parallel, multiplying throughput by roughly 10.
How Kafka decides which partition a message goes to:
// If you provide a key, Kafka hashes it to choose a partition
// All messages with the same key always go to the same partition
// -> ordering is guaranteed per key
ProducerRecord<String, String> record = new ProducerRecord<>(
"orders",
customerId, // key: same customer always goes to same partition
orderJson // value
);
// If no key, Kafka distributes round-robin across partitions
ProducerRecord<String, String> record = new ProducerRecord<>("orders", orderJson);
Production example: An order service sends events keyed by customerId. All events for customer 42 land on partition 3 in sequence. The fraud detection consumer reading partition 3 sees all of customer 42’s events in order, which is necessary to detect velocity patterns (three orders in one minute).
Trade-off: The number of partitions is set when the topic is created and is hard to change later (increasing partitions changes the hash mapping for keys, breaking per-key ordering for existing consumers). Start with more partitions than you need today. A common starting point for a medium-traffic topic is 12 to 24 partitions.
Part 3: Brokers and the Cluster
A Kafka cluster is a group of broker servers. Each broker stores some partitions. Kafka replicates each partition across multiple brokers so that if one broker fails, the data is not lost.
Broker 1: Partition 0 (leader), Partition 1 (replica)
Broker 2: Partition 1 (leader), Partition 2 (replica)
Broker 3: Partition 2 (leader), Partition 0 (replica)
Every partition has one leader broker, which handles all reads and writes for that partition. The other brokers holding copies are followers, which replicate from the leader.
What breaks without replication: If Broker 1 holds the only copy of Partition 0 and Broker 1 crashes, all messages in Partition 0 are gone and the producer cannot write new ones until the broker recovers. With replication factor 3, Broker 1 failing is not a crisis: Broker 3 (which holds a replica) is promoted to leader automatically.
The replication factor controls how many copies exist. Replication factor 1 means no redundancy. Replication factor 3 means the data survives the loss of any one broker. Most production topics use replication factor 3.
Part 4: Producers
What a producer does
A producer is any code that writes messages to Kafka. The Java producer client handles connection management, serialization, partition selection, batching, and retry automatically.
// Minimal producer setup
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092,kafka-2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Send is asynchronous by default
Future<RecordMetadata> future = producer.send(
new ProducerRecord<>("orders", customerId, orderJson)
);
// Always close to flush buffered records and release connections
producer.close();
Acknowledgment levels
The acks setting controls how many brokers must confirm a write before Kafka tells the producer it succeeded. This is the most important producer configuration for reliability.
// acks=0: fire and forget
// Producer does not wait for any acknowledgment
// Fastest, but messages can be lost if the broker crashes immediately after receiving them
props.put(ProducerConfig.ACKS_CONFIG, "0");
// acks=1: leader acknowledges
// Broker leader confirms the write to its local log
// Message is lost if the leader crashes before replicating to followers
props.put(ProducerConfig.ACKS_CONFIG, "1");
// acks=all (or -1): all in-sync replicas acknowledge
// Message is durable as long as at least one replica survives
// Slowest, but no data loss under any single-broker failure
props.put(ProducerConfig.ACKS_CONFIG, "all");
Production rule: Use acks=all for any data you care about losing. Use acks=1 for high-volume metrics or logs where occasional loss is acceptable. Never use acks=0 unless you are explicitly benchmarking throughput and can accept message loss.
Batching
The producer does not send one network request per message. It buffers records and sends them in batches, which dramatically reduces network overhead.
// How long the producer waits to fill a batch before sending
props.put(ProducerConfig.LINGER_MS_CONFIG, "5"); // wait up to 5ms
// Maximum batch size in bytes
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // 16KB default
// Enable snappy compression on batches
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
With linger.ms=5, the producer waits 5ms before sending. If 1,000 messages arrive in those 5ms, they go out in one network call instead of 1,000. The trade-off: every message waits up to 5ms before being sent. For latency-sensitive paths, keep linger.ms at 0 or 1.
Part 5: Consumers and Consumer Groups
Offsets: how Kafka tracks what you have read
Every message in a partition has a sequential number called an offset. Offset 0 is the first message ever written, offset 1 the second, and so on. When a consumer reads messages, it commits its current offset back to Kafka. If the consumer crashes and restarts, it resumes from the last committed offset.
Partition 0 offsets:
0 1 2 3 4 5
[msg0] [msg1] [msg2] [msg3] [msg4] [msg5]
^
committed offset = 2
(consumer has processed up to msg2)
next read starts at offset 3
What breaks without offset commits: If you read messages but never commit the offset, every restart re-reads from the beginning of the topic. If a topic has 6 months of order history, every restart reprocesses all of it.
What breaks with auto-commit: Kafka’s default setting (enable.auto.commit=true) commits the offset every 5 seconds regardless of whether your code successfully processed the messages. If your code reads a batch, the auto-commit fires, then your processing crashes before finishing: those messages are marked as done even though they were not processed. You lose them.
// Safer: disable auto-commit and commit manually after processing
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
processOrder(record.value()); // process first
}
consumer.commitSync(); // then commit: only reached if processing succeeded
Consumer groups: parallel reading
A consumer group is a set of consumers that cooperate to read a topic. Kafka assigns each partition to exactly one consumer in the group. No two consumers in the same group read the same partition at the same time.
Topic "orders" with 3 partitions
Consumer Group "fraud-detection" with 3 consumers:
Partition 0 --> Consumer A
Partition 1 --> Consumer B
Partition 2 --> Consumer C
Each consumer processes its assigned partitions independently. To double throughput, add a consumer to the group (up to the number of partitions). To have two independent systems read the same topic, use two separate consumer groups: each group gets a full, independent copy of every message.
Topic "orders" with 3 partitions
Group "fraud-detection": Group "analytics":
Partition 0 --> Consumer A Partition 0 --> Consumer X
Partition 1 --> Consumer B Partition 1 --> Consumer Y
Partition 2 --> Consumer C Partition 2 --> Consumer Z
What happens if you add more consumers than partitions: Extra consumers sit idle. A topic with 3 partitions can use at most 3 consumers in the same group for parallel processing.
Rebalancing
When a consumer joins or leaves a group, Kafka reassigns partitions among the remaining consumers. This is called a rebalance. During a rebalance, all consumers in the group pause reading until the new assignment is complete.
What breaks without handling rebalances: If your consumer holds an open database transaction or an in-flight HTTP request when a rebalance starts, Kafka revokes the partition mid-processing. The next consumer picks up from the last committed offset and reprocesses the same messages. Without idempotency in your processing logic, you get duplicate inserts, duplicate emails, and duplicate charges.
consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// Called before partitions are taken away
// Finish in-flight work and commit current offsets
consumer.commitSync(currentOffsets);
}
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// Called after new partitions are assigned
// Load any per-partition state for the newly assigned partitions
}
});
Part 6: Delivery Guarantees
Kafka provides three delivery semantics. Understanding which one your system uses by default is not optional.
At-most-once
Messages may be lost but are never processed twice. Achieved by committing the offset before processing.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
consumer.commitSync(); // commit first
for (ConsumerRecord<String, String> record : records) {
processOrder(record.value()); // if this crashes, message is skipped
}
Use case: metrics, logs, and any data where occasional loss is acceptable and duplicates are expensive.
At-least-once
Messages are never lost but may be processed more than once. Achieved by committing the offset only after processing.
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
processOrder(record.value()); // if this crashes, message is retried from last committed offset
}
consumer.commitSync(); // commit after: no loss, but reprocessing possible on crash
This is the Kafka default for most production systems. Your processing logic must be idempotent: calling it twice with the same message must produce the same result as calling it once.
// Idempotent: safe to call multiple times with the same order
public void processOrder(Order order) {
// Use INSERT ... ON CONFLICT DO NOTHING or INSERT ... ON CONFLICT DO UPDATE
// to ensure duplicate events do not create duplicate records
orderRepository.upsert(order);
}
Exactly-once
Messages are processed exactly once. Kafka supports this through transactions, which atomically write to Kafka and commit the consumer offset together. It is more complex to implement and has higher latency.
// Producer side: enable idempotence and transactions
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1");
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("processed-orders", key, value));
// Commit consumer offset atomically with the produce
producer.sendOffsetsToTransaction(offsets, consumerGroupMetadata);
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
}
Use exactly-once when you are transforming one Kafka topic into another and duplicates are genuinely unacceptable. For most consumer use cases (writing to a database, sending an email), at-least-once with idempotent processing is simpler and sufficient.
Part 7: A Complete Spring Boot Example
// Producer configuration
@Configuration
public class KafkaProducerConfig {
@Bean
public KafkaTemplate<String, String> kafkaTemplate() {
Map<String, Object> props = new HashMap<>();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, 3);
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
return new KafkaTemplate<>(new DefaultKafkaProducerFactory<>(props));
}
}
// Publishing an order event
@Service
public class OrderService {
private final KafkaTemplate<String, String> kafkaTemplate;
private final ObjectMapper objectMapper;
public Order placeOrder(OrderRequest request) {
Order order = createOrder(request);
orderRepository.save(order);
try {
String payload = objectMapper.writeValueAsString(order);
// key = orderId: all events for the same order land on the same partition
kafkaTemplate.send("orders", order.getId(), payload);
} catch (JsonProcessingException e) {
throw new RuntimeException("Failed to serialize order event", e);
}
return order;
}
}
// Consumer configuration
@Configuration
public class KafkaConsumerConfig {
@Bean
public ConsumerFactory<String, String> consumerFactory() {
Map<String, Object> props = new HashMap<>();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fraud-detection");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
return new DefaultKafkaConsumerFactory<>(props);
}
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
// Manual offset commit: Spring commits only after the listener method returns successfully
factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.RECORD);
return factory;
}
}
// Processing order events
@Service
public class FraudDetectionConsumer {
@KafkaListener(topics = "orders", groupId = "fraud-detection")
public void onOrder(ConsumerRecord<String, String> record) {
Order order = deserialize(record.value());
FraudResult result = fraudEngine.evaluate(order);
if (result.isSuspicious()) {
alertRepository.save(new FraudAlert(order.getId(), result.getReason()));
}
// Spring commits the offset after this method returns without exception
// If an exception is thrown, the offset is not committed and the message is retried
}
}
application.yml for the consumer:
spring:
kafka:
bootstrap-servers: localhost:9092
consumer:
group-id: fraud-detection
auto-offset-reset: earliest
enable-auto-commit: false
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
listener:
ack-mode: record
Part 8: Retention and Replay
Kafka stores messages on disk for a configurable period. The default is 7 days. During that window, any consumer can replay from any offset.
# Create a topic with 30-day retention
kafka-topics.sh --create \
--topic orders \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=2592000000 # 30 days in milliseconds
# View topic configuration
kafka-topics.sh --describe --topic orders
Production example of replay: The analytics service crashes and loses its state. It resets its committed offset back to 7 days ago and replays everything. The order service (the producer) needs no changes and does not even know this happened.
// Reset a consumer group offset to the beginning (in code, for a specific use case)
consumer.subscribe(List.of("orders"));
consumer.poll(Duration.ofMillis(0)); // trigger partition assignment
consumer.seekToBeginning(consumer.assignment());
Trade-off: Long retention means more disk usage on the brokers. A topic receiving 1 GB per hour with 30-day retention uses 720 GB per partition replica. Size your broker disks accordingly or use tiered storage (available in Kafka 3.6+) to offload older data to object storage like S3.
Part 9: Common Beginner Mistakes
Mistake 1: Not setting auto.offset.reset explicitly
# Default is "latest": new consumer groups start reading from NOW
# Messages that arrived before the consumer started are skipped
# For a new consumer that should read all historical messages:
auto-offset-reset: earliest
# For a new consumer that should only read new messages going forward:
auto-offset-reset: latest
If you deploy a new analytics consumer and wonder why it has no data from the past week, auto.offset.reset=latest is almost always the reason.
Mistake 2: Fewer partitions than consumers
Topic with 3 partitions, consumer group with 5 consumers:
Partition 0 --> Consumer A (active)
Partition 1 --> Consumer B (active)
Partition 2 --> Consumer C (active)
(no partition) Consumer D (idle)
(no partition) Consumer E (idle)
Consumers D and E consume memory and CPU but process nothing. You cannot increase parallelism beyond the partition count without repartitioning the topic.
Mistake 3: Using auto-commit with non-idempotent processing
Auto-commit fires on a timer. If your processing takes longer than auto.commit.interval.ms (default 5 seconds), Kafka commits offsets for messages that your code is still processing. A crash after the commit but before processing finishes causes silent message loss.
Always use enable.auto.commit=false and commit explicitly.
Mistake 4: Long-running operations inside the poll loop
Kafka has a session timeout (session.timeout.ms, default 45 seconds) and a maximum poll interval (max.poll.interval.ms, default 5 minutes). If your consumer does not call poll() within that interval, the broker considers it dead and triggers a rebalance.
// WRONG: slow processing inside the poll loop
while (true) {
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
for (ConsumerRecord<String, String> record : records) {
externalApiClient.process(record.value()); // 10 seconds per call
// If you have 100 records, poll is not called for 1000 seconds -> rebalance
}
}
// CORRECT: process asynchronously or increase max.poll.interval.ms
// and reduce max.poll.records to keep the batch size manageable
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);
props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 60000); // 1 minute
Mistake 5: Not handling deserialization errors
If a malformed message lands in the topic and your deserializer throws an exception, most consumer loops will retry that message indefinitely, blocking all subsequent messages in the partition.
// Spring Kafka: configure a dead-letter topic for messages that fail after retries
@Bean
public DefaultErrorHandler errorHandler(KafkaTemplate<String, String> template) {
DeadLetterPublishingRecoverer recoverer =
new DeadLetterPublishingRecoverer(template,
(record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));
// Retry up to 3 times with 1-second backoff, then send to DLT
return new DefaultErrorHandler(recoverer,
new FixedBackOff(1000L, 3));
}
Summary
| Concept | What it does | Key setting |
|---|---|---|
| Topic | Named log of events | retention.ms |
| Partition | Unit of parallelism and ordering | Count set at creation, hard to change |
| Producer | Writes to a topic | acks, linger.ms |
| Consumer | Reads from a topic | enable.auto.commit, auto.offset.reset |
| Consumer group | Parallel readers sharing a topic | group.id |
| Offset | Position in a partition | Committed after successful processing |
| Replication | Durability across broker failures | replication.factor |
Mental model: Kafka is an append-only log, not a queue. Messages are not deleted when read. They expire after the retention window. Multiple independent consumers read the same messages at their own pace. Every consumer group has its own offset per partition. Adding a consumer group never affects another group. This is the core behavior that makes Kafka useful: producers and consumers are completely decoupled in time and deployment.