Understanding Data Partitioning
Data partitioning (also known as sharding) is the process of breaking a large dataset into smaller, more manageable pieces called partitions or shards. Each partition contains a subset of the data and can be stored on a separate database server or node.
Why Partition Data?
- Scalability: Overcome hardware limitations by distributing data across multiple machines
- Performance: Improve query performance through parallel processing across partitions
- Availability: Enhance system resilience by isolating failures to specific partitions
- Manageability: Make maintenance operations like backups and index rebuilds more efficient
- Data Locality: Store data close to the users or services that access it most frequently
Partitioning vs. Replication
It’s important to distinguish between partitioning and replication:
- Partitioning: Divides data into distinct, non-overlapping subsets
- Replication: Creates redundant copies of the same data
These techniques are often used together: data is partitioned across multiple nodes, and each partition is replicated for fault tolerance.
Partitioning:
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ User IDs 1-1M │ │ User IDs 1M-2M│ │ User IDs 2M-3M│
└───────────────┘ └───────────────┘ └───────────────┘
Replication:
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Primary Node │ │ Replica 1 │ │ Replica 2 │
│ All User Data │→ │ All User Data │→ │ All User Data │
└───────────────┘ └───────────────┘ └───────────────┘
Partitioning + Replication:
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Node 1 │ │ Node 2 │ │ Node 3 │
│ Users 1-1M │ │ Users 1M-2M │ │ Users 2M-3M │
└───────┬───────┘ └───────┬───────┘ └───────┬───────┘
│ │ │
┌───────▼───────┐ ┌───────▼───────┐ ┌───────▼───────┐
│ Replica of │ │ Replica of │ │ Replica of │
│ Node 1 │ │ Node 2 │ │ Node 3 │
└───────────────┘ └───────────────┘ └───────────────┘
Horizontal vs. Vertical Partitioning
There are two fundamental approaches to partitioning data: horizontal and vertical.
Horizontal Partitioning (Sharding)
Horizontal partitioning divides a table by rows, with each partition containing a subset of the rows based on a partitioning key.
Original Table (Users)
┌────┬──────────┬─────────┬─────────────────┐
│ ID │ Username │ Country │ Email │
├────┼──────────┼─────────┼─────────────────┤
│ 1 │ alice │ US │ [email protected] │
│ 2 │ bob │ UK │ [email protected] │
│ 3 │ charlie │ CA │ [email protected]│
│ 4 │ david │ AU │ [email protected] │
└────┴──────────┴─────────┴─────────────────┘
Horizontally Partitioned (by ID range)
Partition 1 (IDs 1-2) Partition 2 (IDs 3-4)
┌────┬──────────┬─────┬─────┐ ┌────┬──────────┬─────┬─────┐
│ ID │ Username │ ... │ ... │ │ ID │ Username │ ... │ ... │
├────┼──────────┼─────┼─────┤ ├────┼──────────┼─────┼─────┤
│ 1 │ alice │ ... │ ... │ │ 3 │ charlie │ ... │ ... │
│ 2 │ bob │ ... │ ... │ │ 4 │ david │ ... │ ... │
└────┴──────────┴─────┴─────┘ └────┴──────────┴─────┴─────┘
Implementation Example (PostgreSQL)
-- Create parent table
CREATE TABLE users (
id SERIAL PRIMARY KEY,
username VARCHAR(50) NOT NULL,
country VARCHAR(2) NOT NULL,
email VARCHAR(100) NOT NULL
);
-- Create partitions
CREATE TABLE users_1_1000000 PARTITION OF users
FOR VALUES FROM (1) TO (1000000);
CREATE TABLE users_1000001_2000000 PARTITION OF users
FOR VALUES FROM (1000001) TO (2000000);
CREATE TABLE users_2000001_3000000 PARTITION OF users
FOR VALUES FROM (2000001) TO (3000000);
Implementation Example (MongoDB)
// Enable sharding for database
sh.enableSharding("mydb")
// Create a sharded collection with a shard key on user_id
sh.shardCollection("mydb.users", { user_id: 1 })
// Add shards
sh.addShard("shard1/server1:27017")
sh.addShard("shard2/server2:27017")
sh.addShard("shard3/server3:27017")
Vertical Partitioning
Vertical partitioning divides a table by columns, with each partition containing a subset of the columns.
Original Table (Users)
┌────┬──────────┬─────────┬─────────────────┬────────────┬───────────┐
│ ID │ Username │ Country │ Email │ Bio │ Settings │
├────┼──────────┼─────────┼─────────────────┼────────────┼───────────┤
│ 1 │ alice │ US │ [email protected] │ Lorem... │ {...} │
│ 2 │ bob │ UK │ [email protected] │ Ipsum... │ {...} │
└────┴──────────┴─────────┴─────────────────┴────────────┴───────────┘
Vertically Partitioned
Core User Data User Profile Data
┌────┬──────────┬─────────┐ ┌────┬────────────┬───────────┐
│ ID │ Username │ Email │ │ ID │ Bio │ Settings │
├────┼──────────┼─────────┤ ├────┼────────────┼───────────┤
│ 1 │ alice │ [email protected] │ │ 1 │ Lorem... │ {...} │
│ 2 │ bob │ [email protected] │ │ 2 │ Ipsum... │ {...} │
└────┴──────────┴─────────┘ └────┴────────────┴───────────┘
Implementation Example (SQL)
-- Create tables for vertical partitioning
CREATE TABLE user_core (
id SERIAL PRIMARY KEY,
username VARCHAR(50) NOT NULL,
country VARCHAR(2) NOT NULL,
email VARCHAR(100) NOT NULL
);
CREATE TABLE user_profile (
user_id INTEGER PRIMARY KEY REFERENCES user_core(id),
bio TEXT,
settings JSONB
);
Partitioning Strategies
The effectiveness of data partitioning depends largely on the strategy used to distribute data across partitions. Let’s explore the most common strategies:
1. Range Partitioning
Range partitioning divides data based on ranges of a partition key value.
How It Works
Data is partitioned based on ranges of a key value, such as:
- User IDs 1-1,000,000 in partition 1
- User IDs 1,000,001-2,000,000 in partition 2
- And so on
Implementation Example (Cassandra)
CREATE TABLE users (
user_id UUID,
username TEXT,
email TEXT,
created_at TIMESTAMP,
PRIMARY KEY (user_id)
) WITH CLUSTERING ORDER BY (created_at DESC)
AND PARTITIONING STRATEGY (range)
AND PARTITION KEYS (user_id);