Our Pick: Apache Kafka. Kafka is the right choice for high-throughput event streaming, log aggregation, and replay requirements. SQS is the right choice for simple task queues, AWS-native workloads, and teams who want zero infrastructure to manage. They solve different problems, so the 'winner' depends entirely on your use case.
Apache Kafka vs Amazon SQS

import ComparisonTable from '../../components/ComparisonTable.astro';

Kafka and SQS are both message systems, but they’re designed for fundamentally different use cases. Kafka is a distributed event log built for streaming, replay, and high throughput. SQS is a managed queue built for task distribution and decoupling. Choosing the wrong one is a common architectural mistake.

Quick Verdict

Choose Kafka if: High-throughput event streaming, log aggregation, event replay requirements, multiple consumer types reading the same events, or building an event-driven architecture.

Choose SQS if: Simple task distribution, AWS-native workloads, at-least-once delivery for background jobs (or exactly-once processing with FIFO queues), or teams who want zero infrastructure management.


Feature Comparison

<ComparisonTable headers={["Feature", "Apache Kafka", "Amazon SQS"]} rows={[ ["Primary model", "Event log / stream", "Task queue"], ["Message retention", "Configurable (days/forever)", "Up to 14 days"], ["Replay messages", "Yes (seek to offset)", "No (consumed = gone)"], ["Consumer groups", "Yes (independent offsets)", "No (competing consumers)"], ["Ordering", "Per partition", "FIFO queue (optional)"], ["Max message size", "1MB (default, configurable)", "256KB"], ["Throughput", "Millions/sec (horizontal scale)", "Nearly unlimited (standard); 300 msg/sec FIFO, 3,000 with batching"], ["Delivery guarantee", "At-least-once / exactly-once", "At-least-once / FIFO exactly-once"], ["Infrastructure", "Self-hosted or Confluent", "Fully managed (AWS)"], ["Operational burden", "Significant", "None"], ["Cost model", "Infrastructure + licensing", "Per request + data transfer"], ["DLQ support", "Manual (separate topic)", "Native dead-letter queue"], ["Message filtering", "Consumer-side", "Native (SNS + SQS)"], ]} />


The Core Difference: Log vs. Queue

This distinction matters more than any feature comparison:

SQS is a queue:

Producer → [Queue] → Consumer
                   → Consumer (competing)
                   → Consumer (competing)

After a consumer receives and deletes a message:
- Message is gone (or moved to the DLQ after repeated failed attempts)
- Other consumers cannot see it
- No replay possible
- No historical view

Kafka is a log:

Producer → [Topic: orders] → [Partition 0] → Consumer Group A (offset 1000)
                           → [Partition 1] → Consumer Group B (offset 850)
                                          → Consumer Group C (offset 1200)

Events are retained. Each consumer group maintains its own offset.
Consumer Group B can replay from offset 0 tomorrow.
New consumer groups can read the entire history.

This difference drives everything else.
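
To make the contrast concrete, here is a minimal in-memory sketch of the two models. `ToyQueue` and `ToyLog` are hypothetical names invented for illustration, not part of any SQS or Kafka client, and the queue is simplified: real SQS hides a received message for a visibility timeout and removes it only when the consumer deletes it.

```python
from collections import deque


class ToyQueue:
    """Queue semantics: a message is handed to one consumer, then it is gone."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Whoever calls first gets the message; no other consumer will see it.
        return self._messages.popleft() if self._messages else None


class ToyLog:
    """Log semantics: events are retained; each group tracks its own offset."""

    def __init__(self):
        self._events = []
        self._offsets = {}  # group name -> next offset to read

    def append(self, event):
        self._events.append(event)

    def poll(self, group):
        offset = self._offsets.get(group, 0)
        if offset >= len(self._events):
            return None
        self._offsets[group] = offset + 1
        return self._events[offset]

    def seek(self, group, offset):
        # Replay: rewind one group without affecting any other group.
        self._offsets[group] = offset


queue, log = ToyQueue(), ToyLog()
for order in ("order-1", "order-2"):
    queue.send(order)
    log.append(order)

queue_reads = [queue.receive(), queue.receive(), queue.receive()]
billing_reads = [log.poll("billing"), log.poll("billing")]
analytics_reads = [log.poll("analytics")]  # independent of billing's position
log.seek("billing", 0)
replayed = log.poll("billing")  # billing sees the first event again
```

The third `receive()` on the queue returns `None` because both messages are already consumed, while the log still serves the first event to a rewound or brand-new group.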


Kafka Architecture

Topic, partition, and offset:

Topic: "orders"
  Partition 0: [msg1, msg2, msg3, msg4, msg5, ...]
  Partition 1: [msg1, msg2, msg3, msg4, ...]
  Partition 2: [msg1, msg2, msg3, msg4, msg5, msg6, ...]

Consumer Group "billing":
  Consumer 0 → Partition 0 (at offset 4)
  Consumer 1 → Partition 1 (at offset 3)
  Consumer 2 → Partition 2 (at offset 5)

Consumer Group "analytics":
  Consumer 0 → Partition 0 (at offset 2)  # Behind billing, OK
  Consumer 1 → Partition 1 (at offset 2)
  Consumer 2 → Partition 2 (at offset 3)

Consumer Group "new-audit-service":
  Consumer 0 → Partition 0 (starting from offset 0!)
  # New service can replay entire history

Producer:

import json

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'acks': 'all',  # Wait for all in-sync replicas
    'retries': 3,
    'enable.idempotence': True,  # No duplicates from producer retries (not end-to-end exactly-once)
})

def order_placed(order: dict):
    producer.produce(
        topic='orders',
        key=str(order['customer_id']),  # Same customer → same partition
        value=json.dumps(order).encode('utf-8'),
        callback=delivery_report
    )
    producer.poll(0)  # Serve delivery callbacks; call producer.flush() before shutdown

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}')
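
The "same customer, same partition" behavior comes from hashing the message key. The sketch below illustrates the idea only: `pick_partition` is a hypothetical helper, and CRC32 is a stand-in hash chosen because it is in the standard library. Real Kafka clients apply their own partitioner, so actual partition numbers will differ.

```python
import zlib

NUM_PARTITIONS = 3  # assumed: a three-partition "orders" topic


def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # CRC32 is deterministic across processes, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("customer-42", "order placed"),
    ("customer-7", "order placed"),
    ("customer-42", "order paid"),
]

# Route every event to a partition based only on its key.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in events:
    partitions[pick_partition(key)].append((key, value))

# All events for one customer sit in one partition, in the order produced.
customer_42 = [
    value
    for key, value in partitions[pick_partition("customer-42")]
    if key == "customer-42"
]
```

Because ordering is guaranteed only within a partition, choosing the key is effectively choosing the unit of ordering. Changing the partition count changes where keys land.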

Consumer:

import json

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'billing-service',
    'auto.offset.reset': 'earliest',
    'enable.auto.commit': False,  # Manual commit for at-least-once
})

consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            handle_error(msg.error())
            continue
        
        order = json.loads(msg.value().decode('utf-8'))
        try:
            process_billing(order)
            consumer.commit(msg)  # Only commit after successful processing
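        except Exception as e:
            # Not committing does not trigger redelivery by itself: the message is
            # re-read only after a restart or rebalance, and a later successful
            # commit moves the offset past it. Retry or dead-letter it here.
            log_error(e)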
finally:
    consumer.close()
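
At-least-once delivery means the same order can reach the handler twice, for example when a consumer crashes after processing but before committing. One common defense is an idempotent handler. The sketch below is illustrative only: `process_billing_once`, the in-memory `processed_order_ids` set, and the `amount` field are assumptions, and a production service would keep the processed IDs in durable storage.

```python
processed_order_ids = set()  # hypothetical store; real code would persist this
charges = []                 # stands in for the side effect (charging a card)


def process_billing_once(order: dict) -> bool:
    """Apply the charge only the first time a given order_id is seen."""
    order_id = order["order_id"]
    if order_id in processed_order_ids:
        return False  # duplicate delivery: skip the side effect
    charges.append((order_id, order["amount"]))
    processed_order_ids.add(order_id)
    return True


deliveries = [
    {"order_id": "o-1", "amount": 25},
    {"order_id": "o-2", "amount": 40},
    {"order_id": "o-1", "amount": 25},  # redelivered after a crash before commit
]
results = [process_billing_once(order) for order in deliveries]
```

The duplicate delivery of `o-1` is acknowledged without charging again, so the commit can still advance safely.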

SQS Architecture

Standard Queue:

import json

import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/orders'

# Producer
def send_order(order: dict):
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(order),
        MessageAttributes={
            'OrderType': {
                'DataType': 'String',
                'StringValue': order['type']
            }
        }
    )
    return response['MessageId']

# Consumer
def process_orders():
    while True:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # Long polling
            VisibilityTimeout=30  # 30 sec to process before re-visible
        )
        
        messages = response.get('Messages', [])
        for msg in messages:
            try:
                order = json.loads(msg['Body'])
                process_order(order)
                
                # Delete after successful processing
                sqs.delete_message(
                    QueueUrl=queue_url,
                    ReceiptHandle=msg['ReceiptHandle']
                )
            except Exception as e:
                # Don't delete — message becomes visible again after VisibilityTimeout
                log_error(e)
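
Since SQS bills per request, sending in batches is a common refinement of the producer above. The helper below is a sketch: `build_batches` is a hypothetical name, and it only prepares the `Entries` payload (at most 10 per call) without contacting AWS. The commented lines show roughly how a real boto3 client would consume it.

```python
import json
from typing import Iterator

MAX_BATCH_SIZE = 10  # SQS accepts at most 10 entries per SendMessageBatch call


def build_batches(orders: list[dict]) -> Iterator[list[dict]]:
    """Yield lists of entries shaped for send_message_batch(Entries=...)."""
    for start in range(0, len(orders), MAX_BATCH_SIZE):
        chunk = orders[start:start + MAX_BATCH_SIZE]
        yield [
            # Id only has to be unique within a single batch request.
            {"Id": str(index), "MessageBody": json.dumps(order)}
            for index, order in enumerate(chunk)
        ]


orders = [{"order_id": n, "type": "standard"} for n in range(25)]
batches = list(build_batches(orders))

# With a real client, each batch would be sent roughly like this:
#   response = sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
#   failed = response.get("Failed", [])  # entries that need a retry
```

A batch call can partially fail, so checking the `Failed` list in the response matters more here than with single sends.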

FIFO Queue (ordered, exactly-once):

# FIFO queue for ordered processing
fifo_queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789/orders.fifo'

sqs.send_message(
    QueueUrl=fifo_queue_url,
    MessageBody=json.dumps(order),
    MessageGroupId=str(order['customer_id']),  # Orders per customer are ordered
    MessageDeduplicationId=str(order['order_id'])  # Duplicates within the 5-minute window are dropped
)
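
How the two FIFO parameters interact can be modeled in a few lines. This is a toy simulation, not the SQS implementation: `ToyFifoQueue` is a hypothetical class, time is passed in explicitly as seconds, and the only behavior modeled is per-group ordering plus the 5-minute deduplication interval.

```python
DEDUP_WINDOW_SECONDS = 300  # FIFO deduplication interval (5 minutes)


class ToyFifoQueue:
    def __init__(self):
        self._seen = {}   # deduplication id -> time it was first accepted
        self.groups = {}  # message group id -> bodies in arrival order

    def send(self, body, group_id, dedup_id, now):
        """Return True if the message was enqueued, False if deduplicated."""
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False  # same dedup id inside the window: not enqueued again
        self._seen[dedup_id] = now
        self.groups.setdefault(group_id, []).append(body)
        return True


fifo = ToyFifoQueue()
first = fifo.send("order-1 placed", "customer-42", "order-1", now=0)
retry = fifo.send("order-1 placed", "customer-42", "order-1", now=30)
second = fifo.send("order-2 placed", "customer-42", "order-2", now=31)
late = fifo.send("order-1 placed", "customer-42", "order-1", now=400)
```

The retry at 30 seconds is suppressed, but the same ID sent after the window has passed is treated as a new message, so consumers that need stronger guarantees still benefit from idempotent handling.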

Dead Letter Queue:

# After N failed processing attempts, message moves to DLQ
# Configure in SQS console or via boto3:

sqs.create_queue(
    QueueName='orders',
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': 'arn:aws:sqs:us-east-1:123456789:orders-dlq',
            'maxReceiveCount': '3'  # Move to DLQ after 3 failures
        })
    }
)

# Monitor DLQ for processing failures
# Reprocess DLQ messages when bug is fixed
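
Reprocessing can be as simple as moving messages from the DLQ back to the source queue once the fix is deployed. The sketch below assumes a hypothetical `redrive` helper that takes the client as a parameter. It uses only `receive_message`, `send_message`, and `delete_message`, and it copies the body only (message attributes are not carried over). `FakeSqs` is an in-memory stand-in so the example runs without AWS credentials.

```python
def redrive(sqs, dlq_url: str, target_url: str, max_messages: int = 10) -> int:
    """Move up to max_messages (SQS allows 10 per receive) back to target_url."""
    response = sqs.receive_message(
        QueueUrl=dlq_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=0,
    )
    moved = 0
    for msg in response.get("Messages", []):
        sqs.send_message(QueueUrl=target_url, MessageBody=msg["Body"])
        # Delete from the DLQ only after the re-send has succeeded.
        sqs.delete_message(QueueUrl=dlq_url, ReceiptHandle=msg["ReceiptHandle"])
        moved += 1
    return moved


class FakeSqs:
    """In-memory stand-in for the three client calls used above."""

    def __init__(self, queues):
        self.queues = queues  # queue url -> list of message bodies

    def receive_message(self, QueueUrl, MaxNumberOfMessages=1, WaitTimeSeconds=0):
        bodies = self.queues[QueueUrl][:MaxNumberOfMessages]
        return {"Messages": [{"Body": b, "ReceiptHandle": b} for b in bodies]}

    def send_message(self, QueueUrl, MessageBody):
        self.queues[QueueUrl].append(MessageBody)
        return {"MessageId": "fake"}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.queues[QueueUrl].remove(ReceiptHandle)


fake = FakeSqs({"orders-dlq": ["bad-order-1", "bad-order-2"], "orders": []})
moved = redrive(fake, "orders-dlq", "orders")
```

With a real client, the same function would be called as `redrive(sqs, dlq_url, queue_url)` in a loop until it returns 0.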

Performance and Cost

Kafka throughput:

Production Kafka cluster:
- 3 brokers, 6 partitions per topic
- Throughput: 500,000+ messages/second
- Latency: <10ms end-to-end
- Storage: Retain last 7 days (or unlimited)

Kafka cost (3 broker cluster on AWS):
- 3x r5.2xlarge: ~$1,100/month
- EBS storage (10TB): ~$800/month
- OR Confluent Cloud: ~$0.08/GB + compute
- Total: $1,500-3,000/month at scale

SQS throughput and cost:

SQS Standard Queue:
- Throughput: nearly unlimited (Standard); 300 msg/sec per API action for FIFO, or 3,000 msg/sec with batching
- Latency: milliseconds
- No infrastructure to manage

SQS pricing:
- First 1M requests/month: Free
- $0.40/million requests after (Standard)
- $0.50/million requests (FIFO)
- Data transfer: $0.09/GB

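Example at 100M messages/month:
- SQS cost: ~$40/month at one request per message; closer to $120/month if every message needs its own send, receive, and delete request, and as low as ~$12/month with batches of 10
- Kafka cost: $1,500+ (infrastructure)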

SQS is dramatically cheaper for low-medium volume. Kafka’s per-message cost approaches zero at very high volume.
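
The arithmetic behind this comparison is easy to rerun for your own volume. The sketch below uses the Standard-queue prices and the low-end Kafka estimate listed above. The function name and the three-requests-per-message assumption (send, receive, delete, all batched at the same size) are illustrative, and data transfer and large payloads are ignored.

```python
import math

FREE_REQUESTS = 1_000_000    # free requests per month
PRICE_PER_MILLION = 0.40     # USD, Standard queue
KAFKA_FIXED_MONTHLY = 1_500  # USD, low end of the cluster estimate


def sqs_monthly_cost(messages: int, batch_size: int = 1) -> float:
    # Assumption: each message needs a send, a receive, and a delete call,
    # and all three are batched at batch_size (SQS allows up to 10).
    requests = 3 * math.ceil(messages / batch_size)
    billable = max(0, requests - FREE_REQUESTS)
    return billable / 1_000_000 * PRICE_PER_MILLION


unbatched = sqs_monthly_cost(100_000_000)               # about $119.60
batched = sqs_monthly_cost(100_000_000, batch_size=10)  # about $11.60

# Monthly volume at which unbatched SQS spend matches the Kafka estimate.
break_even_messages = (
    KAFKA_FIXED_MONTHLY / PRICE_PER_MILLION * 1_000_000 + FREE_REQUESTS
) / 3
```

Under these assumptions the break-even lands at roughly 1.25 billion messages per month without batching, and ten times that with full batches, which is why the crossover only shows up at very high volume.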


Common Patterns

Kafka: Fan-out to multiple services:

Order placed event → Kafka topic: "orders"
                   ├── Consumer Group: billing (charges card)
                   ├── Consumer Group: inventory (reserves stock)
                   ├── Consumer Group: fulfillment (creates shipment)
                   ├── Consumer Group: analytics (updates dashboard)
                   └── Consumer Group: notifications (sends email)

Each service reads the same event independently.
If analytics is slow, it doesn't affect billing.
If a new notification service is added, it reads from offset 0.

SQS: Work queue for background jobs:

Web request → SQS queue: "image-resize-jobs"
           → Worker 1 (processes job, deletes message)
           → Worker 2 (processes job, deletes message)  
           → Worker 3 (scales up during traffic spike)
           → DLQ (failed jobs after 3 attempts)

Auto-scaling group adds workers when queue depth > 100.
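
The scaling rule can be sketched as a pure function of queue depth. `desired_workers` and its default thresholds are assumptions for illustration. In practice the depth would come from the queue's `ApproximateNumberOfMessages` attribute, and the scaling itself would usually be configured in the auto-scaling service rather than hand-rolled.

```python
import math


def desired_workers(queue_depth: int, messages_per_worker: int = 100,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """One worker per messages_per_worker of backlog, clamped to the limits."""
    wanted = math.ceil(queue_depth / messages_per_worker)
    return max(min_workers, min(max_workers, wanted))


# With a real boto3 client the depth could be read roughly like this:
#   attrs = sqs.get_queue_attributes(
#       QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"])
#   depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

plan = {depth: desired_workers(depth) for depth in (0, 100, 101, 950, 5000)}
```

The upper clamp matters as much as the trigger: without it, a burst of poison messages can scale the fleet up while every worker fails on the same jobs.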

Hybrid (Kafka + SQS):

Common production pattern:
Events → Kafka (streaming, replay, analytics)
       → Kafka consumer → SQS (work distribution to Lambda)
                       → Lambda workers (serverless processing)

Kafka for the event backbone.
SQS for distributing specific work to serverless.

Managed Kafka Options

If you want Kafka without infrastructure management:

| Option | Provider | Cost | Notes |
|---|---|---|---|
| Confluent Cloud | Confluent | $0.08/GB + compute | Most features, best managed experience |
| Amazon MSK | AWS | EC2 + storage cost | Native AWS integration |
| Redpanda Cloud | Redpanda | Kafka-compatible, lower overhead | Simpler, faster than Kafka |
| Aiven for Kafka | Aiven | Managed, multi-cloud | Good for multi-cloud |

When to Choose Each

Choose Kafka:

  • Event streaming with multiple consumer types
  • Need to replay historical events (audit, new service catchup)
  • High throughput: hundreds of thousands of events/second and up
  • Event-driven microservices architecture
  • Log aggregation (application logs, clickstream, metrics)
  • Change data capture (CDC) from databases

Choose SQS:

  • Simple background job queues (email sending, image processing)
  • AWS Lambda triggers (native integration)
  • Low-to-medium volume task distribution
  • At-least-once delivery without replay needs
  • Teams who want zero infrastructure operations
  • Cost-sensitive workloads at moderate volume

Bottom Line

Kafka and SQS are not direct competitors — they solve different problems. The key question: do you need a queue (task distribution, consumed and gone) or a log (streaming, replay, multiple readers)? SQS is the right queue for most task distribution needs in AWS. Kafka is the right log for event streaming, event-driven architecture, and any workload where replay or multiple independent consumers matter. Using Kafka as a simple task queue adds unnecessary complexity. Using SQS for event streaming creates painful limitations. Know which problem you’re solving.