Kafka Consumer Deadlock from Poll/Timeout Ratio

The Problem

An AI coding agent at a fintech client reported the following error pattern in production. Their Kafka consumer was getting kicked from the consumer group every ~45 seconds, then rejoining, then immediately failing again. The agent attempted three different approaches autonomously — none worked. Confidence dropped to 38%, and the protocol escalated to a human teacher.

# Agent log — autonomous attempt #3
ERROR [Consumer clientId=app-1, groupId=batch-processor]
  CommitFailedException: Offset commit cannot be completed since
  the consumer is not part of an active group for auto partition
  assignment; the consumer instance may have been kicked out of
  the group.

ERROR [Consumer clientId=app-1] Heartbeat thread failed due to
  unexpected error: org.apache.kafka.common.errors.RebalanceInProgressException

# Agent confidence: 0.38 — escalating to 0hiring HITL
# Matched Sara K. (kafka, distributed systems) in 22s

⚠ Common misdiagnosis

Most AI agents (and many humans) attempt to "fix" this by lowering session.timeout.ms, increasing max.poll.interval.ms, or adding more retry logic. None of these work. The actual fix is much simpler — see below.

Root Cause

The consumer is configured to poll up to 500 records per batch (max.poll.records=500). Each record takes ~120ms to process because of a downstream API call. That is 60 seconds of work per batch, but the session timeout is only 30 seconds.

The math:

500 records × 120ms = 60,000ms per poll
session.timeout.ms = 30,000ms

→ Consumer dies before commit fires.
→ Rebalance triggers.
→ Records re-delivered to next consumer.
→ Same problem. Infinite loop.
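The arithmetic above can be checked in a few lines. This is only a sketch using the figures quoted in this section; the function and constant names are illustrative and are not part of any Kafka client.

```python
# Figures quoted in the text above.
MAX_POLL_RECORDS = 500
PER_RECORD_MS = 120
SESSION_TIMEOUT_MS = 30_000


def batch_processing_ms(max_poll_records: int, per_record_ms: float) -> float:
    """Time needed to work through one full poll batch, in milliseconds."""
    return max_poll_records * per_record_ms


batch_ms = batch_processing_ms(MAX_POLL_RECORDS, PER_RECORD_MS)
exceeds_timeout = batch_ms > SESSION_TIMEOUT_MS

print(f"batch work: {batch_ms:.0f} ms, timeout: {SESSION_TIMEOUT_MS} ms, "
      f"exceeds timeout: {exceeds_timeout}")
```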

Why agents miss this

This is a class of bug where each individual config looks reasonable in isolation. max.poll.records=500 is a recommended starting point in many Kafka tutorials. session.timeout.ms=30000 is a sensible default. The issue only appears when you compare the total processing time of a batch (records per poll × per-record latency) against the timeout, and per-record latency is something the agent doesn't observe at config time.
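Because per-record latency is only visible at runtime, one option is to measure it directly around the processing call. The sketch below assumes a hypothetical `process` callable and an in-memory list of records; `timed_process` is an illustrative helper, not something the source describes.

```python
import time
from statistics import mean


def timed_process(process, records):
    """Call process() on each record and return per-record durations in ms."""
    durations_ms = []
    for record in records:
        start = time.perf_counter()
        process(record)
        durations_ms.append((time.perf_counter() - start) * 1000)
    return durations_ms


def fake_process(record):
    """Stand-in for the real handler; sleeps briefly to simulate work."""
    time.sleep(0.001)


durations = timed_process(fake_process, ["a", "b", "c"])
print(f"mean per-record latency: {mean(durations):.2f} ms")
```

Feeding the observed mean (or a high percentile) into the batch-size calculation gives a figure based on real behavior rather than a guess made at config time.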

The Fix

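Reduce max.poll.records to a value where max.poll.records × per-record processing time (ms) < session.timeout.ms × 0.7, which leaves a 30% safety margin. With the settings below, 50 × 120 ms = 6,000 ms, well under 0.7 × 45,000 ms = 31,500 ms. Then disable auto-commit and commit manually after each batch.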
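The sizing rule can be turned into a small helper that returns the largest batch size still inside the budget. This is a sketch under the section's own assumptions (a fixed per-record latency and a 0.7 margin); `safe_max_poll_records` is a hypothetical name, not a Kafka API.

```python
import math


def safe_max_poll_records(per_record_ms: float,
                          session_timeout_ms: int,
                          margin: float = 0.7) -> int:
    """Largest record count n with n * per_record_ms < session_timeout_ms * margin."""
    budget_ms = session_timeout_ms * margin
    # ceil(...) - 1 keeps the inequality strict even when the division is exact.
    limit = math.ceil(budget_ms / per_record_ms) - 1
    return max(limit, 1)  # never return less than one record per poll


upper_bound = safe_max_poll_records(per_record_ms=120, session_timeout_ms=45_000)
print(f"upper bound on records per poll: {upper_bound}")
```

Any value at or below the returned bound satisfies the rule; choosing a much smaller number leaves extra headroom for latency spikes.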

 # consumer_config.py
 config = {
     'bootstrap.servers': 'kafka:9092',
     'group.id': 'batch-processor',
-    'max.poll.records': 500,
-    'session.timeout.ms': 30000,
+    'max.poll.records': 50,
+    'session.timeout.ms': 45000,
+    'enable.auto.commit': False,  # manual commit per batch
 }

Then in the consumer loop:

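while True:
    batch = consumer.poll(timeout_ms=1000)  # assumes kafka-python: {TopicPartition: [records]}
    for records in batch.values():
        for message in records:
            process(message)    # ~120ms per record
    if batch:
        consumer.commit()       # manual, after batch completes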

✓ Verified by Sara K.

After applying this fix in production at a fintech client (40k events/sec), no rebalance loops have occurred for 14 months. The 0.7 safety margin handles GC pauses and downstream API latency spikes.

How AI Agents Use This Pattern

// integration with your agent

Agent encounters Kafka rebalance loop

Agent's local knowledge graph queries 0hiring API: GET /v1/patterns/match?error_signature=CommitFailedException&tech=kafka

Pattern matched with 91% similarity

0hiring returns this pattern's structured fix. Agent automatically computes per-record latency from logs and generates the correct config.

USDC royalty fires automatically

$0.04 USDC streams from the consuming agent's wallet to Sara K.'s wallet on Base L2. Settlement: 1.9 seconds.

Agent's confidence updates locally

For this problem class, agent's confidence: 38% → 94%. Next time this exact issue appears, no human escalation needed.
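The lookup in the first step can be sketched as a URL builder. Only the path and query parameters come from the text above; the host name is a placeholder assumption, and no request is actually sent here.

```python
from urllib.parse import urlencode

# Hypothetical host: the source gives only the path and query string.
BASE_URL = "https://api.example.invalid"


def build_pattern_match_url(error_signature: str, tech: str,
                            base_url: str = BASE_URL) -> str:
    """Build the GET URL for a pattern lookup from an error signature and tech tag."""
    query = urlencode({"error_signature": error_signature, "tech": tech})
    return f"{base_url}/v1/patterns/match?{query}"


url = build_pattern_match_url("CommitFailedException", "kafka")
print(url)
```

Using `urlencode` rather than string concatenation keeps the query valid if an error signature ever contains spaces or reserved characters.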

Where This Pattern Came From

An agent at a fintech client got stuck in production on March 15, 2026. Confidence dropped below the human_threshold of 0.75. The 0hiring protocol matched the agent to Sara Kim — a distributed systems specialist with 12 years of Kafka experience, including 4 at Confluent — in 22 seconds.

Sara diagnosed the issue in 4 minutes 12 seconds, including the safety margin reasoning. The protocol auto-formatted her solution into this structured pattern, signed it with her key, stored it on IPFS, and made it available to all consuming agents in the network.

Within 30 days, this pattern had been reused 1,204 times across 47 companies. None of those agents had to call a human. Sara has earned $48.16 in streaming royalties, while the agents collectively saved an estimated 800+ engineering hours.

Related Patterns

If you're seeing this issue, check these related patterns: