Understanding Data Partitioning

Data partitioning (also known as sharding) is the process of breaking a large dataset into smaller, more manageable pieces called partitions or shards. Each partition contains a subset of the data and can be stored on a separate database server or node.

Why Partition Data?

  1. Scalability: Overcome hardware limitations by distributing data across multiple machines
  2. Performance: Improve query performance through parallel processing across partitions
  3. Availability: Enhance system resilience by isolating failures to specific partitions
  4. Manageability: Make maintenance operations like backups and index rebuilds more efficient
  5. Data Locality: Store data close to the users or services that access it most frequently

Partitioning vs. Replication

It’s important to distinguish between partitioning and replication:

  • Partitioning: Divides data into distinct, non-overlapping subsets
  • Replication: Creates redundant copies of the same data

These techniques are often used together: data is partitioned across multiple nodes, and each partition is replicated for fault tolerance.

Partitioning:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Node 1        │  │ Node 2        │  │ Node 3        │
│ User IDs 1-1M │  │ User IDs 1M-2M│  │ User IDs 2M-3M│
└───────────────┘  └───────────────┘  └───────────────┘

Replication:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Primary Node  │  │ Replica 1     │  │ Replica 2     │
│ All User Data │→ │ All User Data │→ │ All User Data │
└───────────────┘  └───────────────┘  └───────────────┘

Partitioning + Replication:
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Node 1        │  │ Node 2        │  │ Node 3        │
│ Users 1-1M    │  │ Users 1M-2M   │  │ Users 2M-3M   │
└───────┬───────┘  └───────┬───────┘  └───────┬───────┘
        │                  │                  │
┌───────▼───────┐  ┌───────▼───────┐  ┌───────▼───────┐
│ Replica of    │  │ Replica of    │  │ Replica of    │
│ Node 1        │  │ Node 2        │  │ Node 3        │
└───────────────┘  └───────────────┘  └───────────────┘

Horizontal vs. Vertical Partitioning

There are two fundamental approaches to partitioning data: horizontal and vertical.

Horizontal Partitioning (Sharding)

Horizontal partitioning divides a table by rows, with each partition containing a subset of the rows based on a partitioning key.

Original Table (Users)
┌────┬──────────┬─────────┬─────────────────┐
│ ID │ Username │ Country │ Email           │
├────┼──────────┼─────────┼─────────────────┤
│ 1  │ alice    │ US      │ [email protected]  │
│ 2  │ bob      │ UK      │ [email protected]    │
│ 3  │ charlie  │ CA      │ [email protected]│
│ 4  │ david    │ AU      │ [email protected]  │
└────┴──────────┴─────────┴─────────────────┘

Horizontally Partitioned (by ID range)
Partition 1 (IDs 1-2)          Partition 2 (IDs 3-4)
┌────┬──────────┬─────┬─────┐  ┌────┬──────────┬─────┬─────┐
│ ID │ Username │ ... │ ... │  │ ID │ Username │ ... │ ... │
├────┼──────────┼─────┼─────┤  ├────┼──────────┼─────┼─────┤
│ 1  │ alice    │ ... │ ... │  │ 3  │ charlie  │ ... │ ... │
│ 2  │ bob      │ ... │ ... │  │ 4  │ david    │ ... │ ... │
└────┴──────────┴─────┴─────┘  └────┴──────────┴─────┴─────┘

Implementation Example (PostgreSQL)

-- Create parent table
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    country VARCHAR(2) NOT NULL,
    email VARCHAR(100) NOT NULL
);

-- Create partitions
CREATE TABLE users_1_1000000 PARTITION OF users
    FOR VALUES FROM (1) TO (1000000);
    
CREATE TABLE users_1000001_2000000 PARTITION OF users
    FOR VALUES FROM (1000001) TO (2000000);
    
CREATE TABLE users_2000001_3000000 PARTITION OF users
    FOR VALUES FROM (2000001) TO (3000000);

Implementation Example (MongoDB)

// Enable sharding for database
sh.enableSharding("mydb")

// Create a sharded collection with a shard key on user_id
sh.shardCollection("mydb.users", { user_id: 1 })

// Add shards
sh.addShard("shard1/server1:27017")
sh.addShard("shard2/server2:27017")
sh.addShard("shard3/server3:27017")

Vertical Partitioning

Vertical partitioning divides a table by columns, with each partition containing a subset of the columns.

Original Table (Users)
┌────┬──────────┬─────────┬─────────────────┬────────────┬───────────┐
│ ID │ Username │ Country │ Email           │ Bio        │ Settings  │
├────┼──────────┼─────────┼─────────────────┼────────────┼───────────┤
│ 1  │ alice    │ US      │ [email protected]  │ Lorem...   │ {...}     │
│ 2  │ bob      │ UK      │ [email protected]    │ Ipsum...   │ {...}     │
└────┴──────────┴─────────┴─────────────────┴────────────┴───────────┘

Vertically Partitioned
Core User Data                 User Profile Data
┌────┬──────────┬─────────┐   ┌────┬────────────┬───────────┐
│ ID │ Username │ Email   │   │ ID │ Bio        │ Settings  │
├────┼──────────┼─────────┤   ├────┼────────────┼───────────┤
│ 1  │ alice    │ [email protected] │   │ 1  │ Lorem...   │ {...}     │
│ 2  │ bob      │ [email protected] │   │ 2  │ Ipsum...   │ {...}     │
└────┴──────────┴─────────┘   └────┴────────────┴───────────┘

Implementation Example (SQL)

-- Create tables for vertical partitioning
CREATE TABLE user_core (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) NOT NULL,
    country VARCHAR(2) NOT NULL,
    email VARCHAR(100) NOT NULL
);

CREATE TABLE user_profile (
    user_id INTEGER PRIMARY KEY REFERENCES user_core(id),
    bio TEXT,
    settings JSONB
);

Partitioning Strategies

The effectiveness of data partitioning depends largely on the strategy used to distribute data across partitions. Let’s explore the most common strategies:

1. Range Partitioning

Range partitioning divides data based on ranges of a partition key value.

How It Works

Data is partitioned based on ranges of a key value, such as:

  • User IDs 1-1,000,000 in partition 1
  • User IDs 1,000,001-2,000,000 in partition 2
  • And so on

Implementation Example (Cassandra)

CREATE TABLE users (
    user_id UUID,
    username TEXT,
    email TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id)
) WITH CLUSTERING ORDER BY (created_at DESC)
  AND PARTITIONING STRATEGY (range)
  AND PARTITION KEYS (user_id);