Apache Pulsar vs Apache Kafka
A look at Apache Pulsar as a modern alternative to Kafka, comparing its architecture, features, and performance.
For over a decade, Apache Kafka has been the undisputed king of the event streaming world. Born at LinkedIn to solve the problem of high-throughput data ingestion, it revolutionized how we think about logs, streams, and real-time pipelines. However, as many of us who have managed large-scale Kafka clusters at companies like Uber or Netflix can attest, Kafka is not without its architectural burdens. The operational complexity of rebalancing partitions, the tight coupling of storage and compute, and the challenges of multi-tenancy have led many engineering teams to seek a more modern alternative.
Enter Apache Pulsar. Originally developed at Yahoo to consolidate various internal messaging systems, Pulsar was designed from the ground up to address the specific pain points that Kafka users have grumbled about for years. This article provides an exhaustive technical analysis of Pulsar versus Kafka, moving beyond the marketing fluff to examine the underlying architectural differences that impact scalability, reliability, and operational overhead.
The Coupled vs Decoupled Dilemma
The fundamental difference between Kafka and Pulsar lies in their storage architecture. Kafka follows a monolithic architecture where the broker that handles client requests also manages the storage of the data on its local disks. In Kafka, a partition is the atomic unit of parallelism and storage. If a partition grows too large for a single disk, or if a broker becomes a bottleneck, you must move the entire partition to a new broker.
As documented in various engineering post-mortems from companies like New Relic, this rebalancing process is a significant operational hazard. When you add a new broker to a Kafka cluster, it starts empty. To utilize it, you must trigger a partition reassignment. This involves copying massive amounts of data across the network from existing brokers to the new one. During this time, the cluster experiences increased CPU and network utilization, which can lead to increased tail latency for producers and consumers.
Pulsar, by contrast, adopts a tiered, segment-centric architecture. It separates the serving layer (Brokers) from the storage layer (Bookies, powered by Apache BookKeeper).
This separation of concerns is the crux of the design: Pulsar brokers are stateless. They do not store any data locally. When a message arrives, the broker writes it to a set of Bookies in the storage layer. This decoupling is the "secret sauce" of Pulsar's scalability. Because brokers are stateless, scaling the serving layer is as simple as spinning up a new container. There is no data to migrate. If you need more storage capacity or IOPS, you add more Bookies. The new Bookies are immediately available to accept new segments of data without requiring a manual rebalance of existing data.
Deep Dive into Segment-Centric Storage
To understand why Pulsar handles scaling better, we must look at how it manages data. In Kafka, a partition is a continuous append-only log stored on a specific set of brokers. In Pulsar, a partition is broken down into segments (ledgers). These segments are distributed across the BookKeeper ensemble.
When a segment reaches a certain size or time limit, it is closed, and a new one is opened. This allows for much more granular data distribution. If a Bookie fails, only the segments stored on that specific node need to be replicated from other Bookies. This process happens in the background at the storage layer, completely transparent to the brokers and the clients.
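To make the idea concrete, here is a toy TypeScript model of segment-centric storage. It is an illustration of the concept only, not BookKeeper's actual data structures or placement algorithm: a partition is a chain of bounded segments, and each new segment can be placed on any subset of the available bookies, including ones that just joined the cluster.
type Segment = { id: number; bookies: string[]; entries: string[] };
class SegmentedLog {
  private segments: Segment[] = [];
  constructor(
    private allBookies: string[],
    private ensembleSize: number,
    private maxEntriesPerSegment: number,
  ) {}
  append(entry: string): void {
    let current = this.segments[this.segments.length - 1];
    if (!current || current.entries.length >= this.maxEntriesPerSegment) {
      // Roll over: "close" the current segment and pick a fresh ensemble.
      // A bookie added to the cluster a moment ago is immediately eligible.
      current = { id: this.segments.length, bookies: this.pickEnsemble(), entries: [] };
      this.segments.push(current);
    }
    current.entries.push(entry);
  }
  private pickEnsemble(): string[] {
    // Naive random placement; BookKeeper uses pluggable placement policies.
    return [...this.allBookies].sort(() => Math.random() - 0.5).slice(0, this.ensembleSize);
  }
}
// Usage: four bookies, ensembles of three, roll over every 1,000 entries.
const log = new SegmentedLog(['bk-1', 'bk-2', 'bk-3', 'bk-4'], 3, 1000);
log.append('order-created');
Note how adding 'bk-5' to the bookie list would require no data movement at all; it simply becomes a candidate for the next ensemble.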
This architecture solves the "hot partition" problem that plagues Kafka. In Kafka, if one partition receives a disproportionate amount of traffic, the broker hosting that partition can become overwhelmed. In Pulsar, because the data is striped across many Bookies, the load is naturally balanced across the storage layer.
Architectural Comparison: Kafka vs Pulsar
The following table outlines the technical trade-offs between the two systems based on architectural first principles.
| Feature | Apache Kafka | Apache Pulsar |
| --- | --- | --- |
| Architecture | Coupled (storage and compute on the same node) | Decoupled (stateless Brokers, stateful Bookies) |
| Storage unit | Partition (monolithic log) | Segment (distributed ledgers) |
| Scaling | Slow (requires data rebalancing/copying) | Near-instant (stateless brokers, granular storage) |
| Multi-tenancy | Difficult (requires separate clusters or complex ACLs) | Native (tenants, namespaces, resource quotas) |
| Message consumption | Pull-based (consumer polls) | Unified (supports both push and pull) |
| Tiered storage | Post-facto (added later, often complex) | Native (first-class support for S3, GCS, Azure) |
| Replication | ISR (In-Sync Replicas) model | Quorum-based (Apache BookKeeper) |
The Quorum-Based Replication Advantage
Kafka uses a leader-follower replication model with an In-Sync Replica (ISR) set. The leader handles all reads and writes, and followers pull data from the leader. If the leader fails, a follower from the ISR is elected as the new leader. This model is simple but can lead to data loss or unavailability if the ISR shrinks or if the leader fails before followers have caught up.
Pulsar utilizes the Apache BookKeeper replication protocol, which is a quorum-based system. When a broker writes a message, it sends it to multiple Bookies simultaneously. The write is considered successful once a "write quorum" of Bookies acknowledges receipt. This is more robust than Kafka's ISR model because it does not rely on a single leader for storage. Any Bookie in the ensemble can serve a read request for a confirmed segment.
This quorum approach also significantly improves tail latency. In Kafka, if a follower is slow, it might drop out of the ISR, but while it is struggling, it can slow down the leader's ability to commit messages. In Pulsar, as long as the quorum is met, the write succeeds. The system can tolerate a "slow" Bookie without impacting the overall latency of the producer.
Multi-tenancy and Isolation
In a modern enterprise environment, providing a "Streaming-as-a-Service" platform for multiple teams is a common requirement. Doing this in Kafka is notoriously difficult. You often end up with "cluster sprawl," where every team runs its own Kafka cluster because isolating workloads on a single cluster is nearly impossible. One rogue consumer performing a massive backfill can saturate the network interface of a broker, impacting every other producer and consumer on that node.
Pulsar was built for multi-tenancy. It introduces a hierarchical structure: Tenant (historically called a "property") -> Namespace -> Topic.
Pulsar allows you to apply resource quotas, rate limiting, and storage policies at the namespace level. This means you can give the Marketing team and the Finance team their own namespaces on the same cluster, ensuring that a spike in Marketing's data ingestion does not starve the Finance team's critical processing pipelines. Splunk, for example, moved to Pulsar to take advantage of these multi-tenancy features, allowing them to manage thousands of customers on shared infrastructure with strict isolation.
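As a sketch of what provisioning looks like, the snippet below creates a tenant and a namespace and applies a dispatch rate limit through Pulsar's v2 admin REST API. Treat the localhost URL, the tenant/namespace names, and the request body fields as illustrative assumptions; check the admin API reference for your version (the pulsar-admin CLI offers equivalent commands).
// Hypothetical provisioning script against a local standalone cluster.
const ADMIN = 'http://localhost:8080/admin/v2';
async function provisionTenant() {
  // Create the tenant (the cluster name "standalone" is an assumption).
  await fetch(`${ADMIN}/tenants/marketing`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ allowedClusters: ['standalone'] }),
  });
  // Create a namespace under the tenant.
  await fetch(`${ADMIN}/namespaces/marketing/campaigns`, { method: 'PUT' });
  // Throttle dispatch so this namespace cannot starve its neighbors
  // (field names assumed from the DispatchRate policy object).
  await fetch(`${ADMIN}/namespaces/marketing/campaigns/dispatchRate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      dispatchThrottlingRateInMsg: 10000,     // messages/sec
      dispatchThrottlingRateInByte: 10485760, // 10 MiB/sec
      ratePeriodInSecond: 1,
    }),
  });
}
provisionTenant().catch(console.error);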
Unified Messaging: Queuing and Streaming
One of the most compelling aspects of Pulsar is its ability to act as both a high-throughput streaming platform (like Kafka) and a traditional message queue (like RabbitMQ).
Kafka is strictly a streaming platform. It uses a cursor-based consumption model where the consumer tracks its offset in the log. This is excellent for replayability and stream processing but poor for "work queue" patterns where you want multiple consumers to compete for individual messages and acknowledge them independently.
Pulsar supports four different subscription modes (a minimal Shared-subscription consumer sketch follows this list):
Exclusive: Only one consumer can subscribe.
Failover: Multiple consumers can subscribe, but only one receives messages. If it fails, the next one takes over.
Shared: Multiple consumers receive messages in a round-robin fashion. This is the classic work queue pattern.
Key_Shared: Messages with the same key are delivered to the same consumer.
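To show the queuing side in practice, here is a minimal Shared-subscription worker using the Node.js pulsar-client package; the topic name is reused from the producer example later in this article. Run several instances and messages are spread across them, each acknowledged independently.
import Pulsar from 'pulsar-client';
async function runWorker(workerId: string) {
  const client = new Pulsar.Client({ serviceUrl: 'pulsar://localhost:6650' });
  // A Shared subscription turns the topic into a work queue: messages are
  // distributed across all consumers attached to the same subscription name.
  const consumer = await client.subscribe({
    topic: 'persistent://public/default/order-events',
    subscription: 'order-workers',
    subscriptionType: 'Shared',
  });
  for (let i = 0; i < 10; i++) {
    const msg = await consumer.receive();
    console.log(`[${workerId}] processing`, msg.getData().toString());
    consumer.acknowledge(msg); // each message is acked independently
  }
  await consumer.close();
  await client.close();
}
runWorker('worker-1').catch(console.error);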
This flexibility allows engineering teams to consolidate their infrastructure. Instead of maintaining a Kafka cluster for streaming and a RabbitMQ cluster for task distribution, you can use Pulsar for both.
Real-World Evidence: Tencent's Billing System
Tencent, one of the world's largest technology conglomerates, provides a powerful case study for Pulsar. Their billing system handles millions of transactions per second. In their early architecture, they used Kafka, but they faced significant challenges with data consistency and operational overhead during peak events like the Chinese New Year.
The primary issue was the "stop the world" effect during Kafka rebalances. When traffic spiked and they needed to scale the cluster, the resulting rebalance would cause latency spikes that were unacceptable for a financial system. By migrating to Pulsar, they leveraged the decoupled storage to scale brokers and bookies independently. They reported that Pulsar's quorum-based writes provided the strong consistency required for financial transactions while maintaining high availability even during node failures.
Implementation Blueprint: Building a Pulsar Producer
To demonstrate the developer experience, let's look at a basic producer implementation in TypeScript using the pulsar-client package. Pulsar's API is intuitive and handles many of the complexities of connection management and batching under the hood.
import Pulsar from 'pulsar-client';

async function runProducer() {
  // Create a client instance.
  // The serviceUrl can point to a Pulsar Proxy or a broker.
  const client = new Pulsar.Client({
    serviceUrl: 'pulsar://localhost:6650',
    operationTimeoutSeconds: 30,
  });

  // Create a producer on a topic named with the tenant/namespace hierarchy.
  const producer = await client.createProducer({
    topic: 'persistent://public/default/order-events',
    sendTimeoutMs: 30000,
    batchingEnabled: true,
    batchingMaxMessages: 1000,
    compressionType: 'LZ4', // efficient compression for high throughput
  });

  const message = {
    orderId: 'ORD-12345',
    amount: 99.99,
    timestamp: Date.now(),
  };

  try {
    // Pulsar handles batching and background sending.
    await producer.send({
      data: Buffer.from(JSON.stringify(message)),
      properties: { region: 'us-east-1' },
      partitionKey: 'ORD-12345', // ensures ordering for this key
    });
    console.log('Message sent successfully');
  } catch (error) {
    console.error('Failed to send message', error);
  }

  await producer.flush();
  await producer.close();
  await client.close();
}

runProducer().catch(console.error);
In this snippet, we see several key features. The topic string follows the hierarchical naming convention (persistent://tenant/namespace/topic). We enable batching and compression at the producer level, which is critical for performance. The partitionKey ensures that all messages for a specific order are routed to the same partition, maintaining strict ordering—a requirement for many stateful applications.
Common Implementation Pitfalls
Even with a superior architecture, Pulsar is not a silver bullet. Senior engineers should be aware of several common mistakes:
Ignoring the Proxy: In large, dynamic environments (like Kubernetes), clients should connect via the Pulsar Proxy rather than directly to brokers. This simplifies network configuration and improves security, as the proxy handles authentication and authorization.
Misconfiguring BookKeeper Quorums: The settings for Ensemble Size (E), Write Quorum (Qw), and Ack Quorum (Qa) are vital. A common mistake is setting Qw and Qa too high, which increases latency, or too low, which risks data loss. A typical robust configuration is E=3, Qw=3, Qa=2 (see the sketch after this list).
Over-partitioning: Just because Pulsar handles partitions better than Kafka doesn't mean you should have millions of them. Each partition adds metadata overhead to ZooKeeper (or the newer configuration store). Aim for a sensible number of partitions based on your throughput requirements.
Neglecting Ledger Rollover Policies: If ledgers (segments) are allowed to grow too large, the benefits of granular distribution are lost. If they are too small, you create too much metadata. Monitoring and tuning ledger rollover is a key operational task.
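Here is a minimal sketch of applying those quorum settings at the namespace level via the v2 admin REST API. The endpoint path and body field names are my assumptions based on the PersistencePolicies object; the CLI equivalent is pulsar-admin namespaces set-persistence.
// Assumed endpoint: POST /admin/v2/namespaces/{tenant}/{namespace}/persistence
async function setQuorums() {
  const res = await fetch(
    'http://localhost:8080/admin/v2/namespaces/public/default/persistence',
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        bookkeeperEnsemble: 3,    // E: bookies each ledger is spread across
        bookkeeperWriteQuorum: 3, // Qw: copies written for every entry
        bookkeeperAckQuorum: 2,   // Qa: acks needed before a write succeeds
        managedLedgerMaxMarkDeleteRate: 0,
      }),
    },
  );
  if (!res.ok) throw new Error(`persistence update failed: ${res.status}`);
}
setQuorums().catch(console.error);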
The Operational Reality: ZooKeeper and Metadata
One of the historical criticisms of both Kafka and Pulsar is their dependency on Apache ZooKeeper. Kafka has recently moved toward KRaft to remove this dependency, simplifying the architecture. Pulsar still relies on a metadata store (ZooKeeper is the default, but it also supports etcd or other pluggable backends).
While Kafka's move to KRaft is a significant improvement for small to medium clusters, Pulsar's use of ZooKeeper is arguably less of a burden because of how it is used. In Pulsar, ZooKeeper stores metadata about the segments and their locations. The heavy lifting of data storage is handled by BookKeeper. Because Pulsar is designed for massive scale (millions of topics), the metadata management is highly optimized.
Sequence of a Message Write
To truly appreciate the reliability of Pulsar, we must understand the sequence of events when a message is produced.
When a producer publishes, the broker that owns the topic writes the entry to the Write Quorum of Bookies in parallel and acknowledges the producer as soon as the Ack Quorum has confirmed; it does not wait for every Bookie. This parallel write path is why Pulsar can often achieve better tail latencies than Kafka, especially in environments where disk I/O can be jittery (like public cloud instances).
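The essence of that ack-quorum behavior fits in a few lines. The sketch below is a toy model, not BookKeeper's actual protocol: it fires all bookie writes in parallel and resolves as soon as ackQuorum of them succeed, so one slow replica never sits on the critical path.
// Toy model of quorum acknowledgement (illustration only).
function quorumWrite(bookieWrites: Promise<void>[], ackQuorum: number): Promise<void> {
  return new Promise((resolve, reject) => {
    let acks = 0;
    let failures = 0;
    const tolerable = bookieWrites.length - ackQuorum; // failures we can absorb
    for (const write of bookieWrites) {
      write.then(
        () => { if (++acks === ackQuorum) resolve(); },
        () => { if (++failures > tolerable) reject(new Error('quorum lost')); },
      );
    }
  });
}
// Simulated usage: two fast bookies, one slow one, Qa=2. The write is
// acknowledged after ~8ms; the 500ms straggler never blocks the producer.
const delay = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
quorumWrite([delay(5), delay(8), delay(500)], 2).then(() =>
  console.log('acknowledged without waiting for the slow bookie'),
);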
Tiered Storage: The Cost Efficiency Play
In the modern data stack, we often want to keep data for long periods for backfilling models or auditing. In Kafka, keeping months of data on expensive SSDs attached to brokers is cost-prohibitive. Most teams end up building a separate process to offload Kafka data to S3.
Pulsar has tiered storage built into its core. You can configure a policy that automatically moves closed segments from BookKeeper to S3 or Google Cloud Storage once they reach a certain age. The beauty of this implementation is that it is transparent to the consumer. A consumer can read from a topic, and Pulsar will seamlessly fetch data from BookKeeper for recent messages and from S3 for older messages. The consumer uses the same API and the same offset management regardless of where the data is physically stored.
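From the client's perspective, that transparency looks like the sketch below: a Reader positioned at the earliest message walks the whole topic with one API, regardless of whether a given segment still lives on Bookies or has been offloaded. The topic name is reused from the earlier examples, and this assumes offload policies are already configured on the namespace.
import Pulsar from 'pulsar-client';
async function replayTopic() {
  const client = new Pulsar.Client({ serviceUrl: 'pulsar://localhost:6650' });
  // Start from the very first retained message, wherever it is stored.
  const reader = await client.createReader({
    topic: 'persistent://public/default/order-events',
    startMessageId: Pulsar.MessageId.earliest(),
  });
  while (reader.hasNext()) {
    const msg = await reader.readNext();
    console.log(msg.getData().toString()); // same API for hot and offloaded data
  }
  await reader.close();
  await client.close();
}
replayTopic().catch(console.error);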
Nutanix, for example, utilizes Pulsar's tiered storage to manage massive amounts of log data, significantly reducing their storage costs while keeping the data accessible for long term analysis without manual intervention.
Strategic Considerations for Your Team
Choosing between Kafka and Pulsar is a strategic decision that depends on your organization's specific needs and existing expertise.
Choose Apache Kafka if:
You have a relatively small, well-defined data volume.
Your team already has deep expertise in managing Kafka and its ecosystem (Connect, Streams).
You rely heavily on specific integrations that are only available or more mature in the Kafka ecosystem.
You do not require strict multi-tenancy or complex queuing patterns.
Choose Apache Pulsar if:
You are building a multi-tenant platform for many different teams or customers.
You need to scale storage and compute independently (e.g., high data volume but low processing needs, or vice versa).
You require very low tail latency and high availability during cluster scaling.
You want to consolidate your messaging infrastructure (streaming + queuing).
You need long term data retention and want to leverage cost-effective tiered storage natively.
The Future of Event Streaming
The "Kafka vs Pulsar" debate is often framed as a zero-sum game, but the reality is more nuanced. Kafka is evolving, adding features like KRaft and tiered storage to address its shortcomings. Pulsar is also maturing, with its ecosystem growing and its community expanding.
However, from an architectural standpoint, Pulsar's layered approach is fundamentally more aligned with the "cloud-native" philosophy of separating state from logic. As we move toward more serverless and dynamic infrastructure, the ability to spin up stateless brokers and rely on a distributed, self-healing storage layer becomes increasingly valuable.
The operational pain of a Kafka rebalance is a high price to pay for a monolithic design. For senior engineers tasked with building systems that will last the next decade, the architectural elegance and operational flexibility of Apache Pulsar make it a compelling choice for the next generation of data platforms.
TL;DR (Too Long; Didn't Read)
Architecture: Kafka couples storage and compute on brokers. Pulsar decouples them using stateless brokers and a dedicated storage layer (Apache BookKeeper).
Scalability: Pulsar scales instantly without the "rebalance pain" of Kafka because data is stored in granular segments rather than monolithic partitions.
Multi-tenancy: Pulsar has native support for tenants and namespaces with resource isolation, whereas Kafka often requires separate clusters to achieve the same level of safety.
Messaging Patterns: Pulsar is a hybrid that supports both high-throughput streaming and traditional work queues (Shared subscriptions), potentially replacing both Kafka and RabbitMQ.
Reliability: Pulsar uses a quorum-based replication model that offers better consistency and more predictable tail latency than Kafka's ISR model.
Cost: Pulsar's native tiered storage allows for seamless offloading of old data to S3/GCS, significantly reducing long-term retention costs.
Verdict: Kafka is the industry standard with a massive ecosystem, but Pulsar is the superior architectural choice for large-scale, multi-tenant, cloud-native environments.