🧩 What Exactly Is a Partition in Kafka?
A partition is the fundamental unit of storage, parallelism, and scalability in Kafka. Concretely, a partition is:
- An ordered, append‑only log
- Stored on a single broker (leader replica)
- Replicated to other brokers (follower replicas)
- The unit of parallelism for producers and consumers
- The unit of fault tolerance (via replication)
📜 A Partition Is an Ordered Log
Inside a partition, messages are stored in strict order:
offset 0 → offset 1 → offset 2 → offset 3 → ...
Kafka guarantees ordering only within a partition, not across partitions.
📦 A Topic Is Split Into Multiple Partitions
Example:
topic: payments
partitions: 3
You get:
- payments‑0
- payments‑1
- payments‑2
Each is an independent log.
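Which of these partitions a record lands in is determined by the record key. A rough sketch of the idea (Kafka's real producer hashes keys with murmur2; md5 is used here only to keep the example dependency-free and deterministic):

```python
import hashlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition, Kafka-style: hash(key) mod N.
    Kafka's default partitioner uses murmur2; md5 is a stand-in here."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All records with the same key land in the same partition,
# which is what preserves per-key ordering.
assert pick_partition(b"customer-42", 3) == pick_partition(b"customer-42", 3)
```

Because the mapping is a pure function of the key, per-key ordering survives as long as the partition count doesn't change.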
🔁 Replication Happens Per Partition
If replication factor = 3, then each partition has 3 copies:
payments-0 → leader on broker 1, followers on brokers 2 and 3
payments-1 → leader on broker 2, followers on brokers 3 and 1
payments-2 → leader on broker 3, followers on brokers 1 and 2
This is why we say:
“A replication factor of 3 means each partition has 3 replicas.”
Not the topic — the partition.
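The round-robin placement shown above can be sketched as follows — a simplified version of Kafka's default assignment (the real algorithm also randomizes the starting broker and staggers follower offsets across partitions):

```python
def assign_replicas(num_partitions: int, brokers: list[int], rf: int) -> dict[int, list[int]]:
    """Ring-style replica placement: partition p's leader is brokers[p % n],
    and its followers are the next rf-1 brokers in ring order.
    A simplified sketch, not Kafka's exact algorithm."""
    n = len(brokers)
    return {p: [brokers[(p + r) % n] for r in range(rf)] for p in range(num_partitions)}

# payments with 3 partitions, brokers 1-3, replication factor 3;
# the first broker in each list is the leader:
print(assign_replicas(3, [1, 2, 3], 3))
# {0: [1, 2, 3], 1: [2, 3, 1], 2: [3, 1, 2]}
```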
⚙️ Why Partitions Matter
1. Scalability
More partitions → more parallelism.
- Producers can write in parallel
- Consumers in a group can read in parallel
- Throughput scales roughly linearly with partition count, up to broker, disk, and network limits
2. Fault Tolerance
Replication is done per partition, so Kafka can survive broker failures.
3. Ordering Guarantees
Kafka only guarantees ordering within a partition, not across the topic.
🧠 A Simple Visual
Topic: orders
Partition 0: [msg1, msg2, msg3]
Partition 1: [msg4, msg5]
Partition 2: [msg6, msg7, msg8, msg9]
Each partition is its own log, stored on its own leader broker, with follower replicas elsewhere.
🚀 Want to Go Deeper?
Beyond the basics, there is plenty more to cover:
- How partitions map to segment files on disk
- How page cache interacts with partition reads
- How leader/follower replication works at the protocol level
- How partition count affects consumer group rebalancing
- How to choose the right number of partitions for a topic
- How partitioning interacts with exactly‑once semantics
Let's start with the first of these: how partitions map to segment files on disk. Kafka's storage engine is simple and mechanically sympathetic, and understanding its on-disk layout is the key to understanding Kafka's performance model.
🧩 How Kafka Partitions Map to Segment Files on Disk
A partition is stored as a directory on disk.
Inside that directory, Kafka stores the partition’s data as a sequence of segment files.
📁 1. Partition → Directory
For a topic named orders with partition 2, Kafka creates:
/var/lib/kafka/data/orders-2/
This directory contains all data for that partition.
📂 2. A Partition Is Split Into Multiple Segment Files
Kafka does not store the entire partition in one huge file.
Instead, it splits the log into segments, each capped at 1 GiB by default (the segment.bytes setting).
Example contents of orders-2:
00000000000000000000.log
00000000000000000000.index
00000000000000000000.timeindex
00000000000001000000.log
00000000000001000000.index
00000000000001000000.timeindex
00000000000002000000.log
00000000000002000000.index
00000000000002000000.timeindex
Each segment is identified by its base offset.
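The naming convention is easy to reproduce: each filename is simply the segment's base offset, zero-padded to 20 digits.

```python
def segment_filename(base_offset: int, suffix: str = "log") -> str:
    """Kafka names each segment file after its base offset,
    zero-padded to 20 digits."""
    return f"{base_offset:020d}.{suffix}"

assert segment_filename(0) == "00000000000000000000.log"
assert segment_filename(1_000_000, "index") == "00000000000001000000.index"
```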
🧱 3. What’s Inside a Segment?
Each segment consists of three files:
a) .log file
The actual message data (binary records).
b) .index file
Maps relative offsets → byte positions inside the .log file.
c) .timeindex file
Maps timestamps → offsets for time‑based lookups.
Kafka uses sparse indexing, meaning it doesn’t index every message — only periodic entries.
This keeps index files tiny.
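A minimal sketch of how such a sparse index could be built — one entry whenever at least index.interval.bytes (default 4096) of data has been appended since the last entry. The record sizes here are hypothetical; real Kafka writes these entries as it appends batches:

```python
def build_sparse_index(record_sizes: list[int], interval_bytes: int = 4096) -> list[tuple[int, int]]:
    """Build a sparse index of (relative_offset, byte_position) pairs:
    add an entry each time at least interval_bytes have been appended
    since the last entry. Mirrors the spirit of index.interval.bytes."""
    index = [(0, 0)]
    pos = 0
    since_last = 0
    for rel_offset, size in enumerate(record_sizes):
        if since_last >= interval_bytes:
            index.append((rel_offset, pos))
            since_last = 0
        pos += size
        since_last += size
    return index

# 30 records of 100 bytes each, indexed every ~1000 bytes:
print(build_sparse_index([100] * 30, interval_bytes=1000)[:3])
# [(0, 0), (10, 1000), (20, 2000)]
```

With the 4096-byte default, one index entry covers dozens of typical records, which is why index files stay tiny.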
🔄 4. How Kafka Uses Segments
Appending
Kafka always writes to the active segment (the last one).
When the segment reaches the configured size (e.g., 1 GB), Kafka:
- closes it
- creates a new segment with a new base offset
- continues writing
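The roll logic above can be sketched like this, modeling only the size trigger (segment.bytes); real Kafka also rolls on segment age and index fullness:

```python
class PartitionLog:
    """Minimal sketch of appending with segment rolls.
    Only the size-based roll is modeled."""

    def __init__(self, segment_bytes: int = 1024):
        self.segment_bytes = segment_bytes
        self.next_offset = 0
        # The last segment in the list is the active one.
        self.segments = [{"base_offset": 0, "bytes": 0}]

    def append(self, record_size: int) -> int:
        active = self.segments[-1]
        if active["bytes"] + record_size > self.segment_bytes:
            # Close the active segment; the new segment's base offset
            # is the offset of the next record to be written.
            self.segments.append({"base_offset": self.next_offset, "bytes": 0})
            active = self.segments[-1]
        active["bytes"] += record_size
        offset = self.next_offset
        self.next_offset += 1
        return offset

log = PartitionLog(segment_bytes=250)
for _ in range(5):
    log.append(100)
print([s["base_offset"] for s in log.segments])  # [0, 2, 4]
```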
Reading
Consumers read sequentially:
- Use the .index file to find the byte position
- Jump to that position in the .log file
- Scan forward sequentially
This is extremely efficient because:
- sequential disk access
- page cache
- OS read‑ahead
- memory‑mapped index files
⚙️ 5. Why Kafka Uses Segment Files
This design gives Kafka:
✔ Infinite log size
Old segments can be deleted or compacted without touching newer ones.
✔ Fast recovery
On restart, Kafka only loads segment metadata, not the whole log.
✔ Efficient retention
Kafka can delete entire segment files when retention is exceeded.
✔ High throughput
Sequential writes + page cache + memory‑mapped indexes = near‑disk‑limit performance.
🧠 6. How Offsets Map to Segments
Offsets are global within a partition, not per segment.
Example:
- Segment 0: offsets 0–999,999
- Segment 1: offsets 1,000,000–1,999,999
- Segment 2: offsets 2,000,000–2,999,999
Kafka finds the right segment by comparing the offset to the segment’s base offset.
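That comparison is a classic floor lookup over the sorted base offsets — for example, via binary search:

```python
import bisect

def find_segment(base_offsets: list[int], target: int) -> int:
    """Return the base offset of the segment containing `target`:
    the largest base offset <= target. base_offsets must be sorted,
    as segment filenames on disk naturally are."""
    i = bisect.bisect_right(base_offsets, target) - 1
    if i < 0:
        raise ValueError("offset precedes the earliest retained segment")
    return base_offsets[i]

assert find_segment([0, 1_000_000, 2_000_000], 1_234_567) == 1_000_000
```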
🧩 7. How This Relates to Replication
Each replica (leader or follower) stores its own copy of the same segment files.
Replication is simply:
- follower fetches from leader
- appends to its own segment files
- updates its own indexes
This is why replication is fast and cheap.
🔥 8. Why This Design Matters
If you think in OS‑internals terms — page cache, mmap, disk I/O, log‑structured storage, distributed messaging — you’ll appreciate that Kafka’s design is essentially:
A distributed, replicated, append‑only log built on top of OS page cache and memory‑mapped segment files.
It’s elegant because it lets the OS do the heavy lifting.
🚀 Want to go deeper?
There is still more under the hood:
- How Kafka uses memory‑mapped files
- How page cache affects throughput
- How compaction rewrites segments
- How index lookups work internally
- How segment deletion interacts with consumer offsets
- How to tune segment size for performance
To make the lookup path concrete, let’s walk through, step by step, exactly how Kafka finds a record by offset.
Suppose a consumer asks for offset 1573.
This is a perfect example to illustrate how Kafka finds data inside a partition that is split across multiple segment files.
🧩 Step 1 — Kafka lists all segment base offsets
Suppose the partition directory contains these segment files:
00000000000000000000.log (base offset = 0)
00000000000000100000.log (base offset = 100000)
00000000000000200000.log (base offset = 200000)
But for simplicity, let’s use smaller numbers:
Segment A → base offset 0
Segment B → base offset 1000
Segment C → base offset 2000
These base offsets are encoded in the filenames.
Kafka loads only the base offsets, not the entire file.
🧠 Step 2 — Kafka chooses the correct segment
Kafka applies this rule:
Pick the segment whose base offset is ≤ target offset, and whose next segment’s base offset is > target offset.
For offset 1573:
- Segment A: base = 0 → 0 ≤ 1573 ✔
- Segment B: base = 1000 → 1000 ≤ 1573 ✔
- Segment C: base = 2000 → 2000 ≤ 1573 ✘ (too high)
So Kafka picks:
✅ Segment B (base offset 1000)
Because:
- 1000 ≤ 1573
- Next segment starts at 2000, which is > 1573
This is exactly how Kafka narrows down the search.
📂 Step 3 — Kafka opens the .index file for Segment B
Each segment has:
00000000000000001000.log
00000000000000001000.index
00000000000000001000.timeindex
(base offset 1000, zero‑padded to 20 digits)
Kafka uses the .index file to map:
relative offset → byte position in the .log file
Relative offset = target offset − base offset
So:
1573 − 1000 = 573
Kafka now needs to find relative offset 573 inside the index.
🧭 Step 4 — Kafka uses sparse index lookup
Kafka’s index is sparse, meaning it doesn’t store every offset.
Example index entries:
relative offset → file position
0 → 0 bytes
200 → 8123 bytes
400 → 16200 bytes
600 → 24100 bytes
Kafka finds the largest relative offset ≤ 573.
That is:
400 → 16200 bytes
So Kafka jumps to byte 16200 in the .log file.
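This floor lookup over the sparse index can be sketched with the same numbers as above:

```python
import bisect

# Hypothetical sparse index for the segment with base offset 1000:
# (relative_offset, byte_position) pairs, sorted by relative offset.
index = [(0, 0), (200, 8123), (400, 16200), (600, 24100)]

def floor_entry(index, rel_offset):
    """Return the index entry with the largest relative offset <= rel_offset."""
    i = bisect.bisect_right([e[0] for e in index], rel_offset) - 1
    return index[i]

assert floor_entry(index, 573) == (400, 16200)
```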
📖 Step 5 — Kafka reads sequentially from that byte
Now Kafka scans forward from byte 16200 (index entry 400 corresponds to absolute offset 1000 + 400 = 1400):
- read record at offset 1400
- read record at offset 1401
- …
- read record at offset 1573 ← target found
Kafka stops when it reaches the desired offset.
This sequential scan is extremely fast because:
- the log is append‑only
- records are stored contiguously
- the OS page cache + read‑ahead optimize the scan
🧨 Putting it all together
Here’s the full flow in one shot:
1. List segment base offsets → 0, 1000, 2000
2. Pick the segment where base ≤ 1573 < next base → Segment B (base 1000)
3. Compute the relative offset → 1573 − 1000 = 573
4. Use the sparse index to find the nearest entry ≤ 573 → relative offset 400 → byte position 16200
5. Jump to byte 16200 in the .log file → scan sequentially until offset 1573
This is exactly how Kafka performs fast lookups without loading entire files.
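All five steps can be condensed into one small simulation; the segment layout and index entries below are hypothetical stand-ins for what Kafka reads from the .index files:

```python
import bisect

def locate(segments: dict[int, list[tuple[int, int]]], target_offset: int) -> tuple[int, int]:
    """End-to-end lookup sketch. `segments` maps base_offset -> sparse index
    of (relative_offset, byte_position) pairs. Returns (base_offset,
    scan_start_byte): which segment to open and where to start scanning."""
    bases = sorted(segments)
    base = bases[bisect.bisect_right(bases, target_offset) - 1]   # steps 1-2: pick segment
    rel = target_offset - base                                    # step 3: relative offset
    idx = segments[base]
    rel_offsets = [e[0] for e in idx]
    _, byte_pos = idx[bisect.bisect_right(rel_offsets, rel) - 1]  # step 4: sparse floor lookup
    return base, byte_pos                                         # step 5: scan from byte_pos

segments = {
    0:    [(0, 0), (200, 8000), (400, 16000), (600, 24000), (800, 32000)],
    1000: [(0, 0), (200, 8123), (400, 16200), (600, 24100), (800, 32050)],
    2000: [(0, 0)],
}
print(locate(segments, 1573))  # (1000, 16200): open segment 1000, scan from byte 16200
```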
🚀 Going further, related topics include:
- How sparse indexing is built
- How time‑based lookups use .timeindex
- How compaction rewrites segments while preserving offsets
- How page cache + mmap make this blazing fast
- How retention deletes segments without breaking offsets