import ComparisonTable from '../../components/ComparisonTable.astro';
Kafka and SQS are both messaging systems, but they’re designed for fundamentally different use cases. Kafka is a distributed event log built for streaming, replay, and high throughput. SQS is a managed queue built for task distribution and decoupling. Choosing the wrong one is a common architectural mistake.
Quick Verdict
Choose Kafka if you need: high-throughput event streaming, log aggregation, event replay, multiple consumer types reading the same events, or an event-driven architecture.
Choose SQS if you need: simple task distribution, AWS-native workloads, at-least-once delivery for background jobs (exactly-once processing with FIFO queues), or zero infrastructure management.
Feature Comparison
<ComparisonTable
  headers={["Feature", "Apache Kafka", "Amazon SQS"]}
  rows={[
    ["Primary model", "Event log / stream", "Task queue"],
    ["Message retention", "Configurable (days/forever)", "Up to 14 days"],
    ["Replay messages", "Yes (seek to offset)", "No (deleted = gone)"],
    ["Consumer groups", "Yes (independent offsets)", "No (competing consumers)"],
    ["Ordering", "Per partition", "Best-effort (standard); strict per message group (FIFO)"],
    ["Max message size", "1MB (default, configurable)", "256KB"],
    ["Throughput", "Millions/sec (horizontal scale)", "Nearly unlimited (standard); 300 calls/sec or 3,000 msg/sec batched (FIFO, default)"],
    ["Delivery guarantee", "At-least-once / exactly-once", "At-least-once (standard) / exactly-once processing (FIFO)"],
    ["Infrastructure", "Self-hosted or managed (Confluent, MSK)", "Fully managed (AWS)"],
    ["Operational burden", "Significant", "None"],
    ["Cost model", "Infrastructure + licensing", "Per request + data transfer"],
    ["DLQ support", "Manual (separate topic)", "Native dead-letter queue"],
    ["Message filtering", "Consumer-side", "Native (SNS + SQS)"],
  ]}
/>
The Core Difference: Log vs. Queue
This distinction matters more than any feature comparison:
SQS is a queue:
```
Producer → [Queue] → Consumer
                   → Consumer (competing)
                   → Consumer (competing)

After a consumer processes and deletes a message:
- The message is gone (repeated failures move it to the DLQ instead)
- Other consumers cannot see it
- No replay possible
- No historical view
```
Kafka is a log:
```
Producer → [Topic: orders] → [Partition 0] → Consumer Group A (offset 1000)
                           → [Partition 1] → Consumer Group B (offset 850)
                                           → Consumer Group C (offset 1200)

Events are retained. Each consumer group maintains its own offset.
Consumer Group B can replay from offset 0 tomorrow.
New consumer groups can read the entire history.
```
This difference drives everything else.
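The contrast can be made concrete with a toy in-memory model. This is only a sketch: `ToyQueue` and `ToyLog` are hypothetical names, receive and delete are collapsed into one step, and nothing here reflects how SQS or Kafka are implemented internally.

```python
from collections import deque


class ToyQueue:
    """Queue semantics: a message handed to one consumer is gone for the rest."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Simplification: receive and delete happen in a single step.
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Log semantics: records stay put; each group tracks its own offset."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # group name -> next offset to read

    def append(self, record):
        self._records.append(record)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._records):
            return None
        self._offsets[group] = offset + 1
        return self._records[offset]

    def seek(self, group, offset):
        # Rewinding a group's offset is what makes replay possible.
        self._offsets[group] = offset


queue, log = ToyQueue(), ToyLog()
for event in ["order-1", "order-2"]:
    queue.send(event)
    log.append(event)

queue_reads = [queue.receive(), queue.receive(), queue.receive()]
billing_reads = [log.poll("billing"), log.poll("billing")]
analytics_reads = [log.poll("analytics")]  # unaffected by billing's progress
log.seek("billing", 0)
replayed = log.poll("billing")
```

Once the queue is drained, nothing is left to read. The log still holds both records, so a second group starts from the beginning and the first group can rewind.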
Kafka Architecture
Topic, partition, and offset:
```
Topic: "orders"
  Partition 0: [msg1, msg2, msg3, msg4, msg5, ...]
  Partition 1: [msg1, msg2, msg3, msg4, ...]
  Partition 2: [msg1, msg2, msg3, msg4, msg5, msg6, ...]

Consumer Group "billing":
  Consumer 0 → Partition 0 (at offset 4)
  Consumer 1 → Partition 1 (at offset 3)
  Consumer 2 → Partition 2 (at offset 5)

Consumer Group "analytics":
  Consumer 0 → Partition 0 (at offset 2)  # Behind billing, OK
  Consumer 1 → Partition 1 (at offset 2)
  Consumer 2 → Partition 2 (at offset 3)

Consumer Group "new-audit-service":
  Consumer 0 → Partition 0 (starting from offset 0!)
  # New service can replay entire history
```
Producer:
```python
import json

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'acks': 'all',               # Wait for all in-sync replicas
    'retries': 3,
    'enable.idempotence': True,  # No duplicates from producer retries (not end-to-end exactly-once)
})

def delivery_report(err, msg):
    """Invoked once per message when poll() or flush() serves callbacks."""
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}')

def order_placed(order: dict):
    producer.produce(
        topic='orders',
        key=str(order['customer_id']),  # Same customer → same partition
        value=json.dumps(order).encode('utf-8'),
        callback=delivery_report,
    )
    producer.poll(0)  # Serve delivery callbacks without blocking
```
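The "same customer → same partition" behavior comes from hashing the message key. The sketch below illustrates the idea only: `partition_for` is a hypothetical helper, and `zlib.crc32` is a stand-in hash, so the partition numbers it produces will not match what a real Kafka client assigns.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition deterministically.

    zlib.crc32 is a stand-in hash for illustration; real clients ship
    their own partitioner, so actual partition numbers will differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


first = partition_for("customer-42", 6)
second = partition_for("customer-42", 6)  # same key, same partition
```

Because the result depends on `num_partitions`, adding partitions to a topic later can move a key to a different partition, which breaks per-key ordering across the change.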
Consumer:
```python
import json

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'billing-service',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # Manual commit for at-least-once
})
consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            handle_error(msg.error())  # application-defined
            continue
        order = json.loads(msg.value().decode('utf-8'))
        try:
            process_billing(order)  # application-defined
            # Commit synchronously, only after successful processing
            consumer.commit(message=msg, asynchronous=False)
        except Exception as e:
            # The offset is not committed, so the message is re-read after a
            # restart or rebalance. A later successful commit would skip past
            # it, so seek back or publish it to a dead-letter topic if it
            # must not be lost.
            log_error(e)  # application-defined
finally:
    consumer.close()
```
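Kafka has no built-in dead-letter queue, so a message that keeps failing is usually handled by hand. The following is a minimal sketch under stated assumptions: `process_with_dead_letter`, the `orders.dlq` topic name, and the retry count are hypothetical, and `producer` is anything exposing a `produce(topic, value=..., headers=...)` method like the `confluent_kafka` producer.

```python
import json


def process_with_dead_letter(msg_value: bytes, handler, producer,
                             dead_letter_topic: str = "orders.dlq",
                             max_attempts: int = 3) -> bool:
    """Try the handler up to max_attempts; on repeated failure publish the
    raw message to the dead-letter topic.

    Returns True if the handler succeeded, False if the message was
    dead-lettered. Either way the caller can commit the offset afterwards.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(json.loads(msg_value.decode("utf-8")))
            return True
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            last_error = exc
    producer.produce(
        dead_letter_topic,
        value=msg_value,
        headers=[("error", str(last_error).encode("utf-8"))],
    )
    return False
```

In the consumer loop, the call would replace the inner `try`/`except`, and the offset would be committed after it returns. That keeps one poison message from blocking the partition while still preserving it for later inspection.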
SQS Architecture
Standard Queue:
```python
import json

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/orders'

# Producer
def send_order(order: dict):
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(order),
        MessageAttributes={
            'OrderType': {
                'DataType': 'String',
                'StringValue': order['type']
            }
        }
    )
    return response['MessageId']

# Consumer
def process_orders():
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,    # Long polling
            VisibilityTimeout=30   # 30 sec to process before re-visible
        )
        messages = response.get('Messages', [])
        for msg in messages:
            try:
                order = json.loads(msg['Body'])
                process_order(order)  # application-defined
                # Delete after successful processing
                sqs.delete_message(
                    QueueUrl=queue_url,
                    ReceiptHandle=msg['ReceiptHandle']
                )
            except Exception as e:
                # Don't delete: the message becomes visible again after VisibilityTimeout
                log_error(e)  # application-defined
```
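Because a standard queue delivers at least once, the same message can arrive twice, so handlers are usually written to tolerate duplicates. A minimal sketch, assuming an in-memory set of seen IDs (a real service would need a durable store); `IdempotentHandler` is a hypothetical name.

```python
class IdempotentHandler:
    """Wrap a handler so a redelivered message is processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory; production needs a durable store

    def handle(self, message_id: str, body: dict) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(body)
        # Record the ID only after success so a failed attempt can be retried.
        self._seen.add(message_id)
        return True


charged = []
handler = IdempotentHandler(charged.append)
first = handler.handle("msg-1", {"order_id": 7})
duplicate = handler.handle("msg-1", {"order_id": 7})  # redelivery is ignored
```

The `MessageId` field of a received message can serve as the ID for redeliveries. A business key such as the order ID is often the safer choice, since a producer that retries a send creates a new `MessageId` for the same order.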
FIFO Queue (ordered, exactly-once):
```python
# FIFO queue for ordered processing
fifo_queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/orders.fifo'

sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody=json.dumps(order),
    MessageGroupId=str(order['customer_id']),      # Orders per customer are ordered
    MessageDeduplicationId=str(order['order_id'])  # Prevent duplicates
)
```
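When a message has no natural unique ID, the deduplication ID can be derived from its content instead. The sketch below hashes a canonical JSON form of the payload; `deduplication_id` is a hypothetical helper, and the canonicalization rules (sorted keys, compact separators) are an assumption made so that equal payloads always produce equal IDs.

```python
import hashlib
import json


def deduplication_id(order: dict) -> str:
    """Derive a stable ID from the payload so identical orders collapse."""
    canonical = json.dumps(order, sort_keys=True, separators=(",", ":"))
    # A SHA-256 hex digest is 64 characters, within the 128-character limit.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


same_a = deduplication_id({"order_id": 1, "total": 20})
same_b = deduplication_id({"total": 20, "order_id": 1})  # key order is irrelevant
different = deduplication_id({"order_id": 2, "total": 20})
```

SQS can also do this itself when content-based deduplication is enabled on the queue, in which case it hashes the message body and no explicit ID is needed.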
Dead Letter Queue:
```python
# After N failed processing attempts, message moves to DLQ
# Configure in SQS console or via boto3 (the orders-dlq queue must already exist):
sqs.create_queue(
    QueueName='orders',
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789:orders-dlq',
            'maxReceiveCount': '3'  # Move to DLQ after 3 failures
        })
    }
)
# Monitor DLQ for processing failures
# Reprocess DLQ messages when bug is fixed
```
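Reprocessing the DLQ after a fix can be scripted with the same client calls used above. This is a rough sketch, not a production tool: `redrive` is a hypothetical helper, it copies only the message body (not attributes), and it assumes a standard target queue, since a FIFO target would also need a `MessageGroupId`.

```python
def redrive(sqs, dlq_url: str, target_url: str, max_batches: int = 100) -> int:
    """Move messages from the DLQ back to the source queue; return the count moved."""
    moved = 0
    for _ in range(max_batches):
        response = sqs.receive_message(
            QueueUrl=dlq_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=1,
        )
        messages = response.get("Messages", [])
        if not messages:
            break  # DLQ looks empty
        for msg in messages:
            # Send first, delete second: a crash in between duplicates
            # the message rather than losing it.
            sqs.send_message(QueueUrl=target_url, MessageBody=msg["Body"])
            sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
            moved += 1
    return moved
```

With the client from the earlier examples this would be called as `redrive(sqs, dlq_url, queue_url)`. The SQS console also offers a built-in DLQ redrive, which is usually the simpler option for one-off recoveries.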
Performance and Cost
Kafka throughput:
Production Kafka cluster:
- 3 brokers, 6 partitions per topic
- Throughput: 500,000+ messages/second
- Latency: <10ms end-to-end
- Storage: Retain last 7 days (or unlimited)
Kafka cost (3 broker cluster on AWS):
- 3x r5.2xlarge: ~$1,100/month
- EBS storage (10TB): ~$800/month
- OR Confluent Cloud: ~$0.08/GB + compute
- Total: $1,500-3,000/month at scale
SQS throughput and cost:
SQS queues:
- Throughput: nearly unlimited (Standard); 300 API calls/sec per action, or 3,000 msg/sec with batching (FIFO, without high-throughput mode)
- Latency: milliseconds
- No infrastructure to manage
SQS pricing:
- First 1M requests/month: Free
- $0.40/million requests after (Standard)
- $0.50/million requests (FIFO)
- Data transfer: $0.09/GB
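Example at 100M messages/month:
- SQS cost: ~$40/month if each message is a single request; send + receive + delete without batching is roughly 300M requests (~$120/month), while batching 10 messages per request brings that down to ~$12/month
- Kafka cost: $1,500+ (infrastructure)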
SQS is dramatically cheaper for low-medium volume. Kafka’s per-message cost approaches zero at very high volume.
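The SQS side of this comparison is simple arithmetic on request counts. A small sketch using the Standard queue prices listed above; `sqs_monthly_cost` is a hypothetical helper, and the figure of three requests per message (send, receive, delete) is an assumption about an unbatched workload.

```python
def sqs_monthly_cost(messages: int,
                     requests_per_message: float = 3.0,
                     price_per_million: float = 0.40,
                     free_requests: int = 1_000_000) -> float:
    """Estimate the monthly request cost in USD for a Standard queue."""
    requests = messages * requests_per_message
    billable = max(0.0, requests - free_requests)
    return billable / 1_000_000 * price_per_million


unbatched = sqs_monthly_cost(100_000_000)       # send + receive + delete per message
batched = sqs_monthly_cost(100_000_000, 0.3)    # same calls, 10 messages per request
```

Under these assumptions the unbatched workload comes to roughly $120/month and the batched one to roughly $12/month, both far below the fixed cost of a Kafka cluster at this volume. Data transfer is not included.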
Common Patterns
Kafka: Fan-out to multiple services:
Order placed event → Kafka topic: "orders"
├── Consumer Group: billing (charges card)
├── Consumer Group: inventory (reserves stock)
├── Consumer Group: fulfillment (creates shipment)
├── Consumer Group: analytics (updates dashboard)
└── Consumer Group: notifications (sends email)
Each service reads the same event independently.
If analytics is slow, it doesn't affect billing.
If a new notification service is added, it reads from offset 0.
SQS: Work queue for background jobs:
```
Web request → SQS queue: "image-resize-jobs"
                → Worker 1 (processes job, deletes message)
                → Worker 2 (processes job, deletes message)
                → Worker 3 (scales up during traffic spike)
                → DLQ (failed jobs after 3 attempts)

Auto-scaling group adds workers when queue depth > 100.
```
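The scaling rule can be expressed as a target backlog per worker. This is a sketch of the calculation only: `desired_workers` and its bounds are hypothetical, and in practice the queue depth would come from a metric such as CloudWatch's `ApproximateNumberOfMessagesVisible`.

```python
import math


def desired_workers(queue_depth: int,
                    target_backlog_per_worker: int = 100,
                    min_workers: int = 1,
                    max_workers: int = 20) -> int:
    """Size the worker pool so each worker has at most the target backlog."""
    needed = math.ceil(queue_depth / target_backlog_per_worker)
    # Clamp so the pool never scales to zero or grows without bound.
    return max(min_workers, min(max_workers, needed))


idle = desired_workers(0)
busy = desired_workers(250)
spike = desired_workers(100_000)
```

The upper bound matters as much as the rule itself: a sudden backlog should not launch more workers than the downstream systems can absorb.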
Hybrid (Kafka + SQS):
```
Common production pattern:

Events → Kafka (streaming, replay, analytics)
           → Kafka consumer → SQS (work distribution to Lambda)
                                → Lambda workers (serverless processing)

Kafka for the event backbone.
SQS for distributing specific work to serverless.
```
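The bridge in the middle of this pattern is a Kafka consumer that republishes each event to SQS. A minimal sketch, assuming a `confluent_kafka`-style consumer and a boto3-style SQS client are passed in; `forward_batch` is a hypothetical name, and error handling is reduced to skipping messages that carry an error.

```python
def forward_batch(consumer, sqs, queue_url: str, max_messages: int = 10) -> int:
    """Forward up to max_messages from Kafka to SQS; return the count forwarded."""
    forwarded = 0
    for _ in range(max_messages):
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            break  # nothing available right now
        if msg.error():
            continue  # simplification: skip; real code should log or raise
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=msg.value().decode("utf-8"),
        )
        # Commit only after SQS accepted the message, so a crash in between
        # causes a duplicate send rather than a lost event.
        consumer.commit(message=msg, asynchronous=False)
        forwarded += 1
    return forwarded
```

Since the bridge is at-least-once, the workers behind the queue should tolerate the occasional duplicate.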
Managed Kafka Options
If you want Kafka without infrastructure management:
| Option | Provider | Pricing / positioning | Notes |
|---|---|---|---|
| Confluent Cloud | Confluent | $0.08/GB + compute | Most features, best managed experience |
| Amazon MSK | AWS | EC2 + storage cost | Native AWS integration |
| Redpanda Cloud | Redpanda | Kafka-compatible, lower overhead | Positioned as simpler and faster than Kafka |
| Aiven for Kafka | Aiven | Managed, multi-cloud | Good for multi-cloud |
When to Choose Each
Choose Kafka:
- Event streaming with multiple consumer types
- Need to replay historical events (audit, new service catchup)
- High throughput: hundreds of thousands of events/second or more
- Event-driven microservices architecture
- Log aggregation (application logs, clickstream, metrics)
- Change data capture (CDC) from databases
Choose SQS:
- Simple background job queues (email sending, image processing)
- AWS Lambda triggers (native integration)
- Low-to-medium volume task distribution
- At-least-once delivery (or exactly-once processing with FIFO) without replay needs
- Teams who want zero infrastructure operations
- Cost-sensitive workloads at moderate volume
Bottom Line
Kafka and SQS are not direct competitors — they solve different problems. The key question: do you need a queue (task distribution, consumed and gone) or a log (streaming, replay, multiple readers)? SQS is the right queue for most task distribution needs in AWS. Kafka is the right log for event streaming, event-driven architecture, and any workload where replay or multiple independent consumers matter. Using Kafka as a simple task queue adds unnecessary complexity. Using SQS for event streaming creates painful limitations. Know which problem you’re solving.