Understanding Databases — Storage Engines

Frazy Nondo
27 min read · May 1, 2024

TABLE OF CONTENTS

- Introduction

- The Scope of Our Discussion: The Storage Engine

- Log-Structured Storage Engines: The LSM-Tree Approach

- Page-Oriented Storage Engines: The B-Tree Approach

- Trade-offs and Considerations

- Conclusion

Introduction

Databases are the cornerstone of most modern applications, providing a robust and efficient method for storing and retrieving data. At a basic level, a database allows for the storage of data, which can later be retrieved as needed. However, the underlying complexity of databases involves multiple interacting components, all orchestrated by a software system known as a Database Management System (DBMS).

A DBMS is a software suite designed to define, manipulate, retrieve, and manage data within a database. It acts as an intermediary between the database and its end-users or application programs, facilitating the creation, update, and extraction of information. The DBMS also ensures data integrity, security, and performance by managing access to the database, enforcing data consistency, and optimizing query execution.

To understand how the DBMS works, let’s first examine its architecture. While modern databases often employ distributed architectures to enhance scalability and availability, this article focuses on the storage engine layer of a DBMS. Specifically, we compare and contrast two common types of storage engines: log-structured merge (LSM) trees and page-oriented B-trees. By examining the core principles, strengths, and trade-offs of each, we aim to provide insights into how these storage engines influence database performance, scalability, and efficiency.

A typical DBMS operates on a client-server model. Here, the database server functions as the core of the system, supporting data storage, manipulation, and retrieval, while client applications interact with it through an API layer.

Key Components of a DBMS Architecture include:

App Client:

These are the client applications that send requests to and receive responses from the database server. They can range from web interfaces to desktop applications.

API Layer Server:

Serving as the crucial interface between client applications and the database, the API Layer handles communications, typically through SQL queries or API calls via protocols like HTTP, REST, or GraphQL. Its responsibilities include, but are not limited to:

  • Request Parsing and Routing: Decoding client requests and directing them to the correct database server/services.
  • Authentication and Authorization: Ensuring that only legitimate and authorized requests can access the database.
  • Performance Enhancements: Implementing caching strategies and load balancing to optimize the system’s responsiveness and reliability.

By performing these tasks, the API Layer Server plays an important role in enabling efficient, secure, and scalable communication between client applications and the database system.

Database Server: The Core of the DBMS:

The database server is the heart of the Database Management System (DBMS), composed of several crucial sub-components/elements that collectively manage data storage, process queries, and ensure data integrity and consistency. Understanding these components provides insight into the complex processes behind database management:

Transport Layer:

This layer ensures reliable and efficient data transmission between clients and servers using protocols like TCP/IP. It establishes and maintains network connections, manages connection pooling, handles data packet flows, and provides secure data transfer. It also implements load balancing and congestion control while facilitating error detection and recovery.

Query Processor:

The query processor evaluates queries to devise the most efficient execution strategy, analyzing the query structure, available indexes, and data distribution. It plays a key role in optimizing queries before they are handed to the execution engine.

Execution Engine:

Once the query is planned, the Execution Engine processes the data by scanning tables, filtering data, and joining tables as needed. It interacts with the Storage Engine, and through it the physical storage media, to retrieve or modify data.

Storage Engine:

Situated at the lower level of the DBMS architecture, the Storage Engine plays an important role in managing how data is physically stored. It uses in-memory and on-disk storage methods to optimize performance and ensure data persistence. The Storage Engine interacts with storage media like HDDs and SSDs and collaborates with the Operating System to manage resources, access control, and data retrieval.

The Scope of Our Discussion: The Storage Engine

In this article, we will focus our attention on the Storage Engine, a critical component of Database Management Systems (DBMS). Modern databases usually use a hybrid storage approach, combining in-memory storage for quick access with on-disk storage for stability and capacity. This strategy not only facilitates quick access to frequently used data via in-memory techniques but also ensures data integrity and supports larger datasets on disk.

We will explore two common families of storage engines used in contemporary databases today: log-structured merge (LSM) trees and page-oriented B-trees. These storage engines differ in how they manage data in memory and on disk, each presenting unique advantages and challenges. Understanding the strengths and weaknesses of LSM trees and B-trees is crucial for designing scalable storage solutions, as the choice of storage engine greatly impacts a database’s performance, scalability, and reliability.

Log-Structured Storage Engines: The LSM-Tree Approach

Databases with log-structured storage engines, such as Apache HBase, Cassandra, and RocksDB, use a data structure called a Log-Structured Merge (LSM) Tree. This design is specifically optimized to handle write-heavy workloads using several strategic techniques:

In-Memory Buffering:

Incoming key-value pairs are initially written to an in-memory buffer known as a memtable, which serves as a staging area. By absorbing writes in memory first, the system can process write operations very quickly (O(1) for a simple hash-based buffer like the one in the example below), which significantly reduces the latency typically associated with disk I/O and accelerates data ingestion rates.

Periodic Flushing to Disk:

When the memtable reaches its capacity limit, its contents are flushed to disk as an immutable, sorted file called a Sorted String Table (SSTable). This file format stores key-value pairs in sorted order, which facilitates efficient key lookups during read operations with a time complexity of O(log n). This process not only keeps the memtable small but also persists the data on disk, thereby enhancing its durability.

Merging SSTables:

As SSTables accumulate on disk, the storage engine performs a merge operation to maintain efficient data access and organization. This process combines multiple SSTables into a single, larger SSTable, with two key benefits:

  • Data Compaction: The merging of SSTables helps remove duplicate and outdated key-value pairs, which reduces the required storage space and enhances read performance by minimizing the number of files the system must search. This process includes the deletion of expired or superseded keys, ensuring that only relevant data is retained. By eliminating redundant and obsolete data, the merge process optimizes storage utilization and boosts query efficiency.
  • Log Structure Preservation: The merge operation preserves the sorted order of data across SSTables, ensuring that the log-structured organization remains intact as new data is continuously added. This is crucial for maintaining the efficiency of range queries and sequential access patterns, which are essential for many database workloads.

The time complexity of the merge operation is O(n), where ’n’ represents the total number of key-value pairs in the SSTables being merged. This linear time complexity means the operation’s duration increases directly with the size of the dataset. The merge process involves iterating through all key-value pairs in the SSTables, comparing and consolidating them as needed. Although this process requires significant computational overhead, it results in long-term benefits, including improved query performance, reduced storage requirements, and better data organization.
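To make the compaction step concrete, here is a minimal sketch (separate from the LSM-Tree example below, and purely illustrative) that merges two sorted SSTable-like lists of (key, value) pairs, keeping only the newest value when a key appears in both:

def merge_sstables(older, newer):
    # Merge two sorted lists of (key, value) pairs; when the same key appears
    # in both inputs, the value from the newer SSTable wins and the stale
    # entry is dropped, which is how compaction reclaims space.
    merged = []
    i = j = 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            merged.append(older[i])
            i += 1
        elif older[i][0] > newer[j][0]:
            merged.append(newer[j])
            j += 1
        else:
            merged.append(newer[j])  # same key: keep the newest version
            i += 1
            j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return merged

# Example: the stale value for 'b' is discarded during the merge
print(merge_sstables([("a", 1), ("b", 2)], [("b", 9), ("c", 3)]))
# [('a', 1), ('b', 9), ('c', 3)]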

LSM-Tree: A Python Example

Let’s look at a simple Python implementation to illustrate how an LSM-Tree-based key-value store works in practice:

class MemTable:
    """In-memory table for storing database entries before flushing to SSTable."""

    def __init__(self):
        """Initialize the MemTable with an empty dictionary."""
        self.data = {}

    def put(self, key, value):
        """Store the key-value pair in the dictionary."""
        self.data[key] = value

    def get(self, key):
        """Retrieve a value by key from the dictionary. Returns None if key not found."""
        return self.data.get(key)

    def flush(self):
        """Convert in-memory data to an SSTable and return it."""
        return SSTable(self.data)


class SSTable:
    """Disk-based table storing data from MemTable in sorted order for efficient querying."""

    def __init__(self, data):
        """Initialize the SSTable with sorted data."""
        self.data = sorted(data.items())

    def get(self, key):
        """Perform binary search to find the key in the sorted data. Return None if not found."""
        left, right = 0, len(self.data) - 1
        while left <= right:
            mid = (left + right) // 2
            mid_key, mid_value = self.data[mid]
            if mid_key == key:
                return mid_value
            elif mid_key < key:
                left = mid + 1
            else:
                right = mid - 1
        return None


class LSMTree:
    """Log-Structured Merge-tree for efficient writes and reads by leveraging MemTable and SSTables."""

    def __init__(self):
        """Initialize the LSMTree with a new MemTable and an empty list of SSTables."""
        self.memtable = MemTable()
        self.sstables = []

    def put(self, key, value):
        """Insert or update a key-value pair. Flushes MemTable to SSTable if size exceeds 1000."""
        self.memtable.put(key, value)
        if len(self.memtable.data) > 1000:
            self.sstables.append(self.memtable.flush())
            self.memtable = MemTable()

    def get(self, key):
        """Retrieve a value by key from MemTable first, then from the newest SSTable to oldest."""
        value = self.memtable.get(key)
        if value is not None:
            return value
        for sstable in reversed(self.sstables):
            value = sstable.get(key)
            if value is not None:
                return value
        return None

Here’s a step-by-step walkthrough of the LSM-Tree operations:

  1. When you perform a put(key, value) operation, the data is first added to the in-memory MemTable.
  2. Once the MemTable reaches a certain size (e.g., 1000 entries), it is flushed to disk as a new SSTable file.
  3. The SSTable files are sorted by key, allowing for efficient binary search lookups.
  4. When you perform a get(key) operation, the system first checks the in-memory MemTable, then searches through the sorted SSTable files in reverse order (newest to oldest) to find the value.
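Putting these steps together, here is a brief usage sketch of the classes above (the 1000-entry threshold comes from the put method):

db = LSMTree()

# Enough writes to trigger a couple of memtable flushes
for i in range(2500):
    db.put(f"key{i}", f"value{i}")

print(db.get("key42"))    # found in an older SSTable -> "value42"
print(db.get("missing"))  # not present anywhere -> None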

Pros of LSM-Trees

LSM-Trees excel in environments with high write workloads, thanks to their efficient write mechanisms. They are especially suited for applications that require high write throughput, offering the following advantages:

Batched Compaction:

Periodic compaction of SSTable files reduces the overhead of index maintenance, improving write performance and reducing the cost of index updates.

Improved Read Performance:

The sorted nature of SSTable files enables the use of binary search techniques, which significantly enhance the efficiency of read operations. This method minimizes disk seeks and thereby improves query performance. Additionally, the integration of Bloom filters and caching mechanisms further boosts read efficiency. These tools rapidly assess whether a key is likely within an SSTable file, reducing unnecessary disk seeks by helping to determine if a deeper search into the files is needed.

Bloom Filters and Caching

A Bloom filter is a probabilistic data structure that quickly determines whether a key is potentially in an SSTable file. When searching for a key, the Bloom filter is consulted first to check if the key might be present. If the Bloom filter indicates that the key is not present, the storage engine avoids searching the SSTable file altogether, which cuts down on disk seeks and enhances read performance. On the other hand, if the Bloom filter suggests that the key might be present, the storage engine proceeds to search the SSTable file to verify whether the key actually exists.

How Bloom Filters Work

A Bloom filter determines whether a key is potentially in an SSTable file by using a series of hash functions to map the key to a set of bits in a bit array.

Here’s a simplified breakdown of how Bloom filters work:

  1. Initialization: A Bloom filter starts as a bit array of a fixed size, with all bits set to 0. For example, a Bloom filter might have 100 bits.

  2. Adding Keys: When a key is added, it is hashed using multiple hash functions, typically three to five. Each hash function generates a distinct index in the bit array, and the bit at that index is set to 1. For instance, if the hash functions produce the indices 10, 30, and 70, those bits are set to 1.

  3. Searching Keys: To check for a key's presence:

  • The key is hashed using the same hash functions.
  • The filter checks the bits at the generated indices.
  • If all bits are set to 1, the Bloom filter suggests that the key might be in the SSTable file.
  • If any bit at these indices is 0, the key is definitively not in the file.

Efficiency of Bloom Filters

Bloom filters are particularly time-efficient with an operational time complexity of O(k), where ‘k’ is the number of hash functions. This efficiency means that the time to add a key or check for membership remains constant, irrespective of the size of the data set. They are also space-efficient, requiring fewer bits per element than other data structures. However, Bloom filters trade off accuracy for efficiency by allowing a certain probability of false positives — where a key seems to be present but is not. This characteristic makes them well-suited for applications where such a trade-off is acceptable.
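For a rough sense of that accuracy trade-off, the standard approximation for the false-positive rate of a Bloom filter with m bits, k hash functions, and n inserted keys can be computed directly (the numbers below are just an illustration):

import math

def false_positive_rate(m_bits, k_hashes, n_keys):
    # Standard approximation: p ≈ (1 - e^(-k*n/m))^k
    return (1 - math.exp(-k_hashes * n_keys / m_bits)) ** k_hashes

# e.g. 100 bits, 3 hash functions, 10 keys -> roughly a 1.7% false-positive rate
print(round(false_positive_rate(100, 3, 10), 3))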

Here’s some simple Python code to illustrate how Bloom filters work:

import hashlib

class BloomFilter:
"""
Attributes:
size (int): The size of the bit array.
bit_array (list): A list of bits (0s and 1s) representing the filter.
"""

def __init__(self, size):
"""
Initialize the Bloom filter with a given size.

Args:
size (int): The size of the bit array.
"""
# Step 1: Initialize the size attribute with the given size
self.size = size

# Step 2: Create a list of zeros with the given size, representing the bit array
self.bit_array = [0] * size

def add(self, key):
"""
Add a key to the Bloom filter.

Args:
key (str): The key to add to the filter.
"""
# Step 1: Define a list of hash functions to use (MD5, SHA1, and SHA256)
hash_functions = [hashlib.md5, hashlib.sha1, hashlib.sha256]

# Step 2: Iterate over each hash function
for hash_func in hash_functions:
# Step 3: Hash the key using the current hash function
hash_result = hash_func(key.encode()).hexdigest()

# Step 4: Calculate the index in the bit array using the hash result and the size
index = int(hash_result, 16) % self.size

# Step 5: Set the bit at the calculated index to 1
self.bit_array[index] = 1

def might_contain(self, key):
"""
Check if a key might be present in the Bloom filter.

Args:
key (str): The key to check.

Returns:
bool: True if the key might be present, False otherwise.
"""
# Step 1: Define a list of hash functions to use (MD5, SHA1, and SHA256)
hash_functions = [hashlib.md5, hashlib.sha1, hashlib.sha256]

# Step 2: Iterate over each hash function
for hash_func in hash_functions:
# Step 3: Hash the key using the current hash function
hash_result = hash_func(key.encode()).hexdigest()

# Step 4: Calculate the index in the bit array using the hash result and the size
index = int(hash_result, 16) % self.size

# Step 5: Check if the bit at the calculated index is 0
if self.bit_array[index] == 0:
# If any of the bits are 0, return False (key is definitely not present)
return False

# If all bits are 1, return True (key might be present)
return True

# Example usage:
bf = BloomFilter(100)
bf.add(b"key1")
bf.add(b"key2")

print(bf.might_contain(b"key1")) # True
print(bf.might_contain(b"key2")) # True
print(bf.might_contain(b"key3")) # False

High Write Throughput:

LSM-Trees are designed to handle large volumes of write operations efficiently, making them ideal for write-heavy applications.

Improved Overall Performance:

LSM-Trees offer improved performance, reduced disk seeks, and efficient write and read operations, making them a popular choice for modern storage systems.

Trade-offs and Considerations

While LSM-Trees provide significant benefits for write-heavy workloads, they also come with certain trade-offs that need consideration:

Read Performance:

Read operations in LSM-Trees may be slower compared to those in B-tree indexes due to the necessity of checking multiple SSTable files to locate the most recent version of a key. However, techniques like Bloom filters and caching help mitigate this issue by reducing unnecessary disk seeks.

Compaction Overhead:

The periodic compaction process, which merges multiple SSTable files into a single larger file, can be resource-intensive and may temporarily impact system performance.

Increased Storage Requirements:

LSM-Trees typically require more disk space than B-tree indexes because they can store multiple versions of the same key across different SSTable files until these files are compacted.

Write Amplification:

LSM-Trees experience write amplification, where the number of writes required to store data exceeds the number of writes initiated by the application. This is due to the multi-level indexing structure of LSM-Trees, where each level maintains its own SSTable files. For example, suppose an application writes 100 bytes of data to an LSM-Tree. The LSM-Tree might require 500 bytes of writes to store this data, including:

  • 100 bytes to write the data to the memtable
  • 100 bytes to write the data to the Level 0 SSTable file
  • 100 bytes to write the metadata for the Level 0 SSTable file
  • 100 bytes to write the merged data to the Level 1 SSTable file
  • 100 bytes to write the metadata for the Level 1 SSTable file

In this example, the write amplification factor is 5 (500 bytes of writes required to store 100 bytes of data). This means that the LSM-Tree requires 5 times more writes than the application to store the same amount of data.
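As a quick sanity check on that arithmetic:

app_bytes = 100                 # data the application asked to write
storage_bytes = 100 * 5         # memtable + L0 data and metadata + L1 data and metadata
write_amplification = storage_bytes / app_bytes
print(write_amplification)      # 5.0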

Disk Seeks in LSM-trees:

In LSM-trees, updates are managed by writing new entries rather than directly modifying existing ones, which results in multiple copies of the same key at different stages of update. During read operations, LSM-trees must navigate through multiple SSTable files to locate the most recent version of the requested data. Optimizations such as Bloom filters, compaction, and caching are employed to streamline this process:

  • Bloom Filters help by quickly verifying the presence of a key in an SSTable file, thereby avoiding unnecessary reads.
  • Compaction significantly cuts down the number of SSTable files required for read operations by merging several smaller files into one.
  • Caching keeps frequently accessed data readily available in memory, reducing the need for repeated disk accesses/disk seeks.

Despite these optimizations, LSM-trees may still fall short in read-heavy scenarios or environments requiring a balanced mix of reads and writes compared to B-trees. B-trees, with their efficient indexing structure, provide quicker data location capabilities, making them better suited for such workloads.

Overview of LSM-tree Suitability

In essence, while LSM-trees offer optimized performance for high-write environments and effectively reduce the number of disk seeks through intelligent data structures like Bloom filters and caching, they require careful consideration of their trade-offs, particularly for read performance and during compaction. These characteristics make LSM-trees ideal for specific applications where write throughput and latency are more critical than immediate read efficiency.

While LSM-trees excel in write-heavy scenarios due to their optimized data ingestion techniques, they can sometimes fall short in terms of read efficiency and data compaction. B-trees, on the other hand, excel in read-heavy scenarios with their ability to maintain a balanced structure and offer consistent read performance across different scales. The following section delves into the B-tree architecture, highlighting how its design principles make it suited for applications requiring fast read access, efficient indexing, and effective management of large datasets.

Page-Oriented Storage Engines: The B-Tree Approach

Page oriented B-Tree Storage

Page-oriented storage engines, widely used in traditional relational databases like MySQL, PostgreSQL, and Oracle, rely on the efficient B-tree data structure to organize and index data. B-trees excel at handling fast read and write operations while maintaining a balanced structure, making them ideal for complex database operations. With a logarithmic time complexity of O(log n) for both search (read) and insert/delete (write) operations, B-trees facilitate rapid data retrieval and management, even in large datasets.

B-Tree Structure

A B-tree is a self-balancing tree data structure that efficiently stores and manages sorted data in a page-oriented storage engine. It enables rapid search, sequential access, insertion, and deletion operations. In a B-tree, data is stored in fixed-size pages (commonly 4 KB), also called nodes, each containing a sorted array of key-value pairs and pointers to child pages, ensuring efficient data retrieval and management.

In-Memory Buffering in B-trees

B-trees use in-memory buffering to optimize write operations and minimize disk I/O. When data is inserted or updated in a B-tree, it first goes into an in-memory buffer, known as the “buffer pool” or “cache,” which is a portion of the database’s main memory dedicated to holding frequently accessed data.

Buffer Pool Operations

When a write operation occurs, data is written to the buffer pool, and the corresponding page is marked as “dirty,” meaning it has been modified but not yet written to disk. Data remains in the buffer pool until one of the following occurs:

  • The buffer pool reaches its capacity and needs to evict some data pages.
  • A checkpoint operation triggers the writing of all dirty pages to disk.
  • A flush command explicitly instructs the database system to write dirty pages to disk.

By keeping data in memory, B-trees can significantly improve write performance by allowing data to be manipulated faster than directly writing to disk for each operation.
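To illustrate the idea, here is a minimal, purely illustrative sketch of a buffer pool that tracks dirty pages in memory and only writes them out when flushed (real engines add eviction policies such as LRU, write-ahead logging, and much more):

class BufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}       # page_id -> page contents currently held in memory
        self.dirty = set()    # page_ids modified since the last flush

    def write(self, page_id, contents):
        # Modify the page in memory and mark it dirty; no disk I/O happens yet
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            self.flush()          # make room: write dirty pages out, then evict
            self.pages.clear()
        self.pages[page_id] = contents
        self.dirty.add(page_id)

    def flush(self):
        # A checkpoint or explicit flush writes all dirty pages to disk, then clears the set
        for page_id in sorted(self.dirty):
            print(f"writing page {page_id} to disk")
        self.dirty.clear()

# Usage: pages 1..3 are modified in memory; the flush writes them out in one go
pool = BufferPool(capacity=8)
for page_id in (1, 2, 3):
    pool.write(page_id, b"...")
pool.flush()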

Checkpoint Operations

The checkpoint process is crucial for maintaining data integrity in B-trees. By periodically writing dirty pages from the buffer pool to disk, the database system ensures data is safely stored even in the event of a system failure. This helps maintain data integrity and allows for faster recovery after unexpected shutdowns.

Optimizing Buffer Pool Performance

Optimizing in-memory buffering involves techniques such as:

  • Buffer Pool Sizing: Adjusting the buffer pool size based on available memory and workload to maximize cache hits and minimize disk I/O.
  • Asynchronous I/O: Performing disk writes asynchronously in the background while other operations continue.
  • Write-Ahead Logging (WAL): Maintaining a log of all write operations before applying them to the B-tree for faster recovery and data consistency (a minimal sketch follows this list).
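As a minimal illustration of the write-ahead idea (the file name and record format here are arbitrary), each write is appended and synced to a log file before the in-memory structures are touched, so the log can be replayed after a crash:

import os

class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, key, value):
        # The record reaches durable storage *before* the B-tree pages are modified
        with open(self.path, "a") as log:
            log.write(f"{key}\t{value}\n")
            log.flush()
            os.fsync(log.fileno())

    def replay(self):
        # Rebuild in-memory state after a crash by re-applying every logged write
        state = {}
        if os.path.exists(self.path):
            with open(self.path) as log:
                for line in log:
                    key, value = line.rstrip("\n").split("\t", 1)
                    state[key] = value
        return state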

In summary, in-memory buffering significantly enhances B-trees’ write performance by reducing disk I/O overhead. However, it’s important to balance the buffer pool size with available memory resources for optimal performance.

Data Storage and Retrieval

In a page-oriented storage engine, data is organized into fixed-size pages, usually 4KB or 8KB in size. Each page contains a header, which stores metadata about the page, and a set of key-value pairs. When data is inserted or updated, the B-tree:

  • Locates the appropriate page based on the key.
  • Performs the necessary operation (insert, update, or delete).
  • Rebalances the tree as needed to maintain its properties.

For data retrieval, the storage engine traverses the B-tree from the root node to the leaf nodes, comparing the search key with the keys stored in each node and following the appropriate child pointer. Because the tree stays balanced and each node holds many keys, this traversal has a time complexity of O(log n), where n is the number of keys in the B-tree.
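To see why that logarithmic depth matters in practice, assume (purely as an illustration) that each page holds pointers to about 500 children; the number of addressable keys grows exponentially with the height of the tree:

branching_factor = 500  # assumed number of child pointers per page

for height in range(1, 5):
    print(f"height {height}: up to {branching_factor ** height:,} keys")
# height 4: up to 62,500,000,000 keys, so even very large tables
# can be reached in just a handful of page reads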

Image of B-tree Index from “Designing Data Intensive Applications”

Clustered and Non-Clustered Indexes

B-trees can have clustered or non-clustered indexes:

  • Clustered Indexes: The engine performs a sequential scan of the data pages, enabling it to read the data in a continuous sequence without jumping across the disk.
  • Non-Clustered Indexes: The engine follows the pointers from the index pages to the corresponding data pages, allowing it to retrieve the requested data.

Let’s look at both clustered and non-clustered indexes in more detail.

Non-Clustered Index:

A non-clustered index in a database can be pictured as two separate structures:

  1. Data Pages: These store the actual table records. Each page contains rows with data like "ID," "Name," "Age," and "Department." These pages are not ordered by any particular column.
  2. Non-Clustered Index: The non-clustered index is built on a specific column or set of columns and points to records based on the indexed value. It contains a sorted list of the values from the indexed column, with each value appearing only once, and alongside each index key it stores a pointer to the corresponding data row.

How It Works:

  1. Query Execution: The database engine receives a query that involves the indexed column (e.g., “Age”).
  2. Index Search: The engine searches the non-clustered index to find the matching keys using a binary search.
  3. Row Pointer Retrieval: Once the matching keys are found, the engine retrieves the corresponding row pointers from the index.
  4. Data Page Retrieval: The engine uses the row pointers to locate the relevant data pages.
  5. Data Retrieval: The engine retrieves the required data from the data pages.
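The following small sketch mimics that flow with plain Python dictionaries (the table, column names, and "row pointers" are hypothetical stand-ins for real pages and slots):

# Unordered data pages, addressed by a row pointer
data_pages = {
    101: {"ID": 1, "Name": "Alice", "Age": 34, "Department": "Sales"},
    102: {"ID": 2, "Name": "Bob",   "Age": 28, "Department": "IT"},
    103: {"ID": 3, "Name": "Carol", "Age": 34, "Department": "HR"},
}

# Non-clustered index on "Age": index keys mapping to row pointers
age_index = {
    28: [102],
    34: [101, 103],
}

def find_by_age(age):
    # Steps 1-3: search the index for the key (a real engine walks a B-tree, not a dict)
    pointers = age_index.get(age, [])
    # Steps 4-5: follow each row pointer to its data page and fetch the row
    return [data_pages[ptr] for ptr in pointers]

print(find_by_age(34))  # rows for Alice and Carol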

Clustered Index:

A clustered index determines the physical order of data in a table. Unlike non-clustered indexes, which store a separate copy of the indexed columns and row pointers, a clustered index is directly associated with the actual data pages.

Conceptually, it involves:

  • Data Pages: The data pages themselves are sorted according to the index key values (1, 2, 3, 4).
  • Index Tree Structure: The index tree represents the clustered index, where each leaf node points to the corresponding data page.

How Clustered Indexes Work:

When a table has a clustered index, the data rows are physically reordered according to the index key. This means that the database rearranges the actual data pages to match the index, making it more efficient for range queries.

  1. Query Execution: The engine receives a query involving the clustered index key, such as “Find all names starting with ‘S’.”
  2. Index Traversal: The engine traverses the index tree to locate the relevant data pages.
  3. Data Retrieval: With data physically ordered, the engine can efficiently scan consecutive pages to retrieve the desired records.

Time Complexity of Clustered Index Operations:

  1. Point Queries:
  • Description: For queries like "Find the employee with ID 1234."
  • Time Complexity: O(log n) for index traversal, then O(1) for data retrieval.

  2. Range Queries:
  • Description: For queries like "Find all employees with a salary between 60,000 and 80,000."
  • Time Complexity: O(log n) for index traversal, plus O(m) for data retrieval, where m is the number of matching rows read from consecutive pages.

  3. Inserting and Deleting Records:
  • Description: Inserting or deleting a record involves updating the index and rearranging data pages.
  • Time Complexity: O(log n + m), where m accounts for the pages rewritten during page splitting or merging.

Clustered indexes provide an efficient way to organize and retrieve data based on the index key. By physically reordering the data pages according to the index key, they enable fast data access, particularly for range queries and sequential scans.
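Before the fuller B-tree example below, here is a tiny, self-contained sketch of why physical ordering helps: once the start of the range is found, the matching rows can be read from consecutive positions (the sample data is hypothetical):

import bisect

# Clustered data: rows stored physically sorted by their key (an ID here)
rows = [(1, "Alice"), (2, "Bob"), (3, "Carol"), (4, "Dave"), (5, "Eve"), (6, "Frank")]
keys = [key for key, _ in rows]

def range_query(low, high):
    # Locate the first matching key (the index traversal), then scan
    # physically adjacent rows until the key exceeds the upper bound.
    start = bisect.bisect_left(keys, low)
    result = []
    for key, name in rows[start:]:
        if key > high:
            break
        result.append((key, name))
    return result

print(range_query(2, 4))  # [(2, 'Bob'), (3, 'Carol'), (4, 'Dave')]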

To see how these concepts translate into code, let’s explore a Python example demonstrating how a B-tree-based storage engine operates:

B-Tree: A Python Example

Here’s a simplified Python implementation (insertion and search only, without deletion or persistence) to demonstrate how a B-tree-based storage engine organizes data:

class BTreeNode:
    def __init__(self, order):
        self.order = order      # Maximum number of children per node (max keys = order - 1)
        self.keys = []          # Sorted list of keys in the node
        self.values = []        # Values associated with the keys
        self.children = []      # List of child nodes (empty for leaf nodes)
        self.is_leaf = True     # Flag indicating if the node is a leaf

    def insert(self, key, value):
        # Insert into the subtree rooted at this node.
        # The caller guarantees this node is not full (see BTree.insert).
        index = self._find_index(key)

        # If the key already exists in this node, update its value in place
        if index < len(self.keys) and self.keys[index] == key:
            self.values[index] = value
            return

        if self.is_leaf:
            # If the node is a leaf, insert the key-value pair in sorted position
            self.keys.insert(index, key)
            self.values.insert(index, value)
        else:
            # If the child we are about to descend into is full, split it first
            if len(self.children[index].keys) == self.order - 1:
                self._split_child(index)
                if key > self.keys[index]:
                    index += 1
            self.children[index].insert(key, value)

    def search(self, key):
        index = self._find_index(key)

        if index < len(self.keys) and self.keys[index] == key:
            # If the key is found in the current node, return its value
            return self.values[index]
        elif self.is_leaf:
            # If the key is not found and the node is a leaf, return None
            return None
        else:
            # Otherwise, recursively search the appropriate child node
            return self.children[index].search(key)

    def _find_index(self, key):
        # Find the index where the given key should be inserted or searched
        index = 0
        while index < len(self.keys) and self.keys[index] < key:
            index += 1
        return index

    def _split_child(self, index):
        # Split a full child node into two and promote its middle key into this node
        child = self.children[index]
        mid = len(child.keys) // 2
        new_node = BTreeNode(self.order)
        new_node.is_leaf = child.is_leaf

        # Move the upper half of the keys and values to the new node
        new_node.keys = child.keys[mid + 1:]
        new_node.values = child.values[mid + 1:]
        if not child.is_leaf:
            new_node.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]

        # Promote the middle key and value into this (parent) node
        self.keys.insert(index, child.keys[mid])
        self.values.insert(index, child.values[mid])
        self.children.insert(index + 1, new_node)

        # Keep the lower half in the original child
        child.keys = child.keys[:mid]
        child.values = child.values[:mid]


class BTree:
    def __init__(self, order):
        self.order = order
        self.root = BTreeNode(order)

    def insert(self, key, value):
        # If the root is full, grow the tree by one level before inserting
        if len(self.root.keys) == self.order - 1:
            new_root = BTreeNode(self.order)
            new_root.is_leaf = False
            new_root.children = [self.root]
            new_root._split_child(0)
            self.root = new_root
        self.root.insert(key, value)

    def search(self, key):
        # Search for a key in the B-tree and return its associated value
        return self.root.search(key)
In this implementation:

  1. BTreeNode Class: The BTreeNode class represents a node in the B-tree. It contains the following attributes and methods:

→ Attributes:

  • order: The maximum number of children a node may have; each node holds at most order - 1 keys.
  • keys: A list of keys stored in the node.
  • values: A list of values associated with the keys.
  • children: A list of child nodes.
  • is_leaf: A flag indicating whether the node is a leaf node.

→ Methods:

  • insert(key, value): Handles key-value pair insertion (updating the value in place if the key already exists). If the node is a leaf, the key-value pair is inserted directly in sorted position. Otherwise, the appropriate child node is located; if that child is already full, it is split before the insertion descends into it.
  • search(key): Searches for a given key and returns its value. If the key is found, its associated value is returned. If the key is not found and the node is a leaf, None is returned. Otherwise, the appropriate child node is searched recursively.
  • _find_index(key): Finds the index where a given key should be inserted or searched in the node.
  • _split_child(index): Splits a full child node into two, promoting the middle key and value into the current (parent) node and distributing the remaining keys, values, and children between the two halves.

2. BTree Class: The BTree class represents the B-tree itself. It contains a reference to the root node.

→ Attributes:

  • root: Reference to the root node.

→ Methods:

  • insert(key, value): Inserts a key-value pair into the B-tree, first splitting the root (and growing the tree by one level) if the root is already full.
  • search(key): Searches for a key in the B-tree.
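A brief usage sketch of these classes (order 4 means each node holds at most three keys):

tree = BTree(order=4)
for key, value in [(10, "a"), (20, "b"), (5, "c"), (6, "d"), (12, "e"), (30, "f")]:
    tree.insert(key, value)   # the root splits once the fourth key arrives

print(tree.search(6))    # "d"
print(tree.search(99))   # None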

Pros of B-Trees

B-trees offer several advantages for page-oriented storage engines:

  1. Efficient Search: B-trees provide logarithmic-time search complexity, making them efficient for large-scale data retrieval. The balanced tree structure ensures consistent search times regardless of dataset size.
  • Example: A search in a dataset with millions of records is quick due to the logarithmic structure.

  2. Range Queries: B-trees excel at handling range queries, where data needs to be retrieved within a specific key range. The sorted nature of keys allows for efficient sequential data access.
  • Example: A database query that retrieves all records between two dates.

  3. Updates and Concurrency: B-trees support efficient updates and deletions while maintaining balance. They are well-suited for concurrent access, allowing readers to traverse the tree while writers perform updates without contention.
  • Example: A multi-user database where some users are reading while others are updating data.

Trade-offs and Considerations

While B-trees are a robust and widely-used data structure for page-oriented storage engines, they also have some trade-offs:

1. Write Amplification: B-trees may experience write amplification, where a single write operation can trigger multiple page writes due to node splitting and tree rebalancing. For example, in a write-heavy workload, this can lead to reduced write performance.

Let’s look at an example to illustrate write amplification in B-trees. Suppose we have a B-tree with a root node and two child nodes, each with two keys. We want to insert a new key that belongs to the second child node.

              +----------+
              |   Root   |
              +----------+
             /            \
   +-----------+      +-----------+
   |   Node    |      |   Node    |
   | (2 keys)  |      | (2 keys)  |
   +-----------+      +-----------+

To insert the new key, we need to split the second child node, which requires writing two new nodes and updating the parent node.

              +----------+
              |   Root   |
              +----------+
             /            \
   +-----------+      +-----------+
   |   Node    |      |   Node    |
   | (2 keys)  |      |  (1 key)  |
   +-----------+      +-----------+
                      /
               +-----------+
               | New Node  |
               |  (1 key)  |
               +-----------+

This results in three write operations: two for the new nodes and one for the updated parent node. Write amplification occurs because a single insert operation requires multiple write operations to update the tree structure.

Here’s an example of Python code that demonstrates write amplification in a B-tree:

import bisect
import random

class BTreeNode:
    # Initialize a B-tree node
    def __init__(self, leaf=True):
        # Whether this node is a leaf node
        self.leaf = leaf
        # Sorted list of keys in this node
        self.keys = []
        # List of child nodes (only for non-leaf nodes)
        self.child_nodes = []

    # Insert a key into the subtree rooted at this node
    def insert(self, key):
        # If this is a leaf node, insert the key in sorted order
        if self.leaf:
            bisect.insort(self.keys, key)
            return

        # Find the appropriate child node to insert into
        child_index = bisect.bisect_right(self.keys, key)

        # If the child node is full, split it and rebalance the tree
        if len(self.child_nodes[child_index].keys) == 3:
            self.split_child(child_index)
            if key > self.keys[child_index]:
                child_index += 1

        # Recursively insert into the child node
        self.child_nodes[child_index].insert(key)

    # Split a full child node and update the parent
    def split_child(self, index):
        # Get the child node to split
        child_node = self.child_nodes[index]
        split_index = len(child_node.keys) // 2

        # Split the child node into two smaller nodes.
        # This is where write amplification occurs: one logical insert
        # now causes two new node writes plus a parent update.
        left_node = BTreeNode(leaf=child_node.leaf)
        right_node = BTreeNode(leaf=child_node.leaf)

        # Distribute the keys and child nodes between the two new nodes,
        # keeping the middle key aside for promotion
        left_node.keys = child_node.keys[:split_index]
        right_node.keys = child_node.keys[split_index + 1:]
        if not child_node.leaf:
            left_node.child_nodes = child_node.child_nodes[:split_index + 1]
            right_node.child_nodes = child_node.child_nodes[split_index + 1:]

        # Update the parent node to reflect the split
        self.keys.insert(index, child_node.keys[split_index])
        self.child_nodes[index] = left_node
        self.child_nodes.insert(index + 1, right_node)

        # Rebalance the tree by updating the parent nodes
        # This may require additional write operations
        self.rebalance()

    # Rebalance the tree after a node split (left as a stub in this sketch)
    def rebalance(self):
        pass


root = BTreeNode(leaf=True)

# Insert 10 random keys into the B-tree (a single leaf in this small example;
# in a larger tree, inserts into full nodes would trigger split_child)
for _ in range(10):
    root.insert(random.randint(0, 100))

Two further trade-offs, covered in more detail below, are disk seeks and space overhead: traversing a B-tree may require multiple disk seeks, introducing latency, particularly for deep trees or slow storage devices, and B-trees require additional space for internal nodes and pointers, which matters in storage-constrained environments.

Despite these limitations, B-trees remain a fundamental data structure in many database systems due to their efficient search, update, and concurrency characteristics. They are widely used in:

  • Relational databases like MySQL (InnoDB), PostgreSQL, Oracle, and SQL Server
  • Embedded databases such as SQLite and Berkeley DB
  • File systems and storage systems that require efficient indexing and retrieval of data

2. Disk Seeks: B-trees organize data into fixed-size pages, with the tree structure helping navigate and locate data. When a query is executed, the B-tree index is traversed from the root node to the leaf nodes, following the appropriate child pointers at each level. This traversal requires disk seeks to read the index pages.

The number of disk seeks required in B-trees depends on the height of the tree and the location of the desired data. In the worst case, the number of disk seeks is equal to the height of the tree.

For example, suppose we have a B-tree with a root node, two child nodes, and two leaf nodes. The root node contains a key that points to one of the child nodes, which in turn contains a key that points to one of the leaf nodes. To retrieve data from the leaf node, the query engine needs to read the following pages:

  • Root node (page 1)
  • Child node (page 2)
  • Leaf node (page 3)

In this example, the query engine needs to perform three disk seeks to read the required pages.

Latency

Disk seeks introduce latency, which can be influenced by:

  • Tree Depth: A deeper tree requires more disk seeks.
  • Storage Speed: Slower storage devices (e.g., hard drives) increase page read times.

Optimizations in B-trees

B-trees employ various techniques to minimize disk seeks:

  • Keeping the tree balanced: B-trees maintain a balanced structure, ensuring that the height of the tree remains logarithmic to the number of keys. This limits the number of disk seeks required to reach any specific data.
  • Node size: B-tree pages hold many keys per node (a high branching factor), which keeps the tree shallow; a shallower tree means fewer levels to traverse and, consequently, fewer disk seeks on the path from root to leaf.
  • Cache utilization: B-trees leverage caching mechanisms to keep frequently accessed nodes in memory, reducing the need for disk seeks.

3. Space Overhead: B-trees require additional space for storing internal nodes and pointers, which can result in higher storage overhead than other data structures.

Despite these trade-offs, B-trees excel in fast read operations due to their balanced structure, sequential data storage, and caching mechanisms.

Conclusion

In this article, we explored the architecture and operational characteristics of LSM-trees and B-trees, highlighting how these two storage engines address different database requirements. We discussed how LSM-trees are optimized for write-heavy workloads, offering high write throughput, with compaction reclaiming disk space and Bloom filters keeping reads manageable. However, we also noted that LSM-trees may face challenges in read-heavy scenarios and can carry higher storage overhead than B-trees until compaction runs.

On the other hand, we highlighted how B-trees provide exceptional read performance, making them suitable for applications that prioritize fast query execution and real-time data access. Their balanced tree structure and efficient indexing mechanism enable B-trees to handle complex queries and support concurrent access effectively.

Choosing the right storage engine should align with the application’s specific needs. For systems with high write throughput and latency sensitivity, LSM-trees offer an optimized solution. For applications that prioritize fast read access, data consistency, and efficient indexing, B-trees are often the better choice.

Ultimately, the choice of storage engine will depend on the unique needs of your application and its data management challenges.

I hope this helps!
