
Data Storage

DBMS stores the following:

  • Relations (tables created)
  • System catalog (data dictionary): metadata about relations
    • Relation schemas: structure of relations, constraints, triggers
    • View definitions
    • Indexes: derived information to speed up access to relations
    • Statistical information about relations for use by query optimizer
  • Log files: information maintained for data recovery

Memory Hierarchy

Memory can be classified into primary, secondary and tertiary memory.

  • Primary Memory: registers, static RAM (cache), dynamic RAM (physical memory)
  • Secondary Memory: magnetic disks (HDD), solid-state disks (SSD)
  • Tertiary Memory: optical disks, tapes

Some tradeoffs between the different types of memory include capacity, cost, access speed and volatility. For more info, see CS2100 notes.

The focus will be on the interaction between main memory and the disk. During the processing of a transaction/query, data is retrieved from disk and stored in main memory.

DBMS Storage

The DBMS stores data on non-volatile disk for persistence and processes data in main memory (RAM). Disk access operations include read (transfer data from disk to RAM) and write (transfer data from RAM to disk).

dbmsStorage

Magnetic Hard-Disk Drive (HDD)

magneticHDD

There is a collection of disk platters. On each surface there are tracks, each of which is further divided into sectors. The smallest unit read is some number of sectors.

A cylinder is the collection of all tracks with the same track number. When the arm moves (all the arms move together) and the head is positioned on, e.g., track 7, all the heads will be positioned on track 7, allowing the whole cylinder to be read more efficiently.

Disk Access Time

The disk access time consists of the following:

  • Command processing time: interpreting access command by disk controller (usually negligible)
  • Seek time: moving arms to position disk head on track (move from existing location to desired track)
  • Rotational delay: waiting for block to rotate under head
  • Transfer time: actually moving data to/from disk surface (time for head to read sector and copy data to main memory)

Access time = seek time + rotational delay + transfer time

Note that command processing time is considered negligible.

Response time for disk access = queueing delay + access time

Components of Disk Access Time

DAT

Assuming the disk rotates anticlockwise and the disk arm is positioned on the blue sector, with the goal being the red sector, the system has to:

  • Seek: move the disk arm to the outer track where the red sector resides
  • Rotational latency: wait for the disk to rotate the red sector under the disk head
  • Read the desired sector

Seek Time

The average seek time is 5-6 ms.

Rotational Delay

Aka rotational latency, it depends on the rotation speed, measured in rotations per minute (RPM). The average delay is typically assumed to be the time for 1/2 revolution, e.g. a 10000 RPM disk has an average delay of 0.5 × (60/10000) s, i.e. 3 ms.

Transfer Time

Transfer time = n * (time for 1 revolution / number of sectors per track)

where n is the number of requested sectors on the same track. The average transfer time per sector is 100-200 μs.
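The formulas above can be put together in a short sketch. The disk parameters below (seek time, RPM, sectors per track) are assumed values for illustration, chosen to match the figures in these notes:

```python
# Hypothetical disk parameters (assumptions, matching the notes' examples).
AVG_SEEK_MS = 5.5          # average seek time (5-6 ms range)
RPM = 10_000               # rotation speed
SECTORS_PER_TRACK = 500    # assumed track geometry

def avg_rotational_delay_ms(rpm: int) -> float:
    """Average rotational delay = time for half a revolution."""
    return 0.5 * (60_000 / rpm)   # 60,000 ms per minute

def transfer_time_ms(n_sectors: int, rpm: int, sectors_per_track: int) -> float:
    """Time to read n contiguous sectors on the same track."""
    one_revolution_ms = 60_000 / rpm
    return n_sectors * (one_revolution_ms / sectors_per_track)

def access_time_ms(n_sectors: int) -> float:
    """Access time = seek time + rotational delay + transfer time."""
    return (AVG_SEEK_MS
            + avg_rotational_delay_ms(RPM)
            + transfer_time_ms(n_sectors, RPM, SECTORS_PER_TRACK))

print(avg_rotational_delay_ms(10_000))  # 3.0, the 10000 RPM example above
```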

Sequential vs Random I/O

For sequential I/O, if the sectors of a magnetic disk are laid out in an optimal manner, the number of seeks and rotational delays needed can be minimised.

Random I/O means reading sectors scattered randomly throughout the disk; every sector read requires a new seek, which is much slower!
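A rough back-of-the-envelope comparison makes the gap concrete. The per-component times below are assumed values (5.5 ms seek, 3 ms rotational delay, 12 μs per-sector transfer):

```python
# Assumed per-component costs in ms (hypothetical disk).
SEEK_MS, ROT_MS, SECTOR_MS = 5.5, 3.0, 0.012

# Reading 100 sectors sequentially: one seek, one rotational delay.
sequential = SEEK_MS + ROT_MS + 100 * SECTOR_MS

# Reading 100 randomly scattered sectors: seek + delay for every sector.
random_io = 100 * (SEEK_MS + ROT_MS + SECTOR_MS)

print(round(sequential, 1))  # 9.7 ms
print(round(random_io, 1))   # 851.2 ms, roughly 90x slower
```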

Solid-State Drive (SSD)

  • Built with NAND flash memory without any mechanical/moving parts
  • Random I/O performance: 100X faster than HDD
  • Sequential I/O: slightly faster than HDD
  • Lower power consumption

Some disadvantages: updating a page requires an erasure cycle of ≈ 5 ms before the page can be overwritten (the multiple pages that form a block must be read and erased first, so write cost >> read cost), and there is a limited number of times a page can be erased (≈ 10^5 - 10^6).

Storage Manager Components

Data is stored and retrieved in units called disk blocks/pages, where each block is a sequence of 1 or more contiguous sectors. The storage manager consists of the following components:

  • File and access methods layer (file layer): deals with organization and retrieval of data
  • Buffer manager: controls the reading/writing of disk pages; manages the retrieval of disk pages into main memory
  • Disk space manager: keeps track of the pages used by the file layer. Every table in a relational database is stored as a file (or more than 1)

Buffer Manager

The buffer pool is the main memory allocated for the DBMS (a sequence of main memory pages allocated when the server is started, for the purpose of storing pages read from disk), and is partitioned into block-sized pages called frames (it can be thought of as an array of pages). Every time a disk block is read, it has to be read into some page/frame in the buffer pool.

Clients (functions that use the buffer manager) of the buffer pool can request for a disk page to be fetched into the buffer pool or release a disk page in the buffer pool (it is important to release pages, or the frames they occupy cannot be reused). A page in the buffer is dirty if it has been modified but not yet updated on disk.

When multiple clients connect to the server, each of them is connected to its own individual backend server process. For these processes to coordinate, they need to access data in shared memory. The buffer pool is a data structure stored in the shared memory. bufferPool

There are 2 variables maintained for each frame in buffer pool:

  • Pin count: number of clients using the page (initialized to 0). A positive pin count means the page in the buffer frame is being used by an active transaction and should not be replaced.
  • Dirty flag: whether page is dirty (initialized to false)

When handling a request for a page p, the buffer manager follows this flowchart:

pageHandle

  1. First, the buffer manager checks if the page has already been fetched into the buffer pool
  2. If not found, the buffer manager invokes a replacement algorithm to choose which frame to replace (its pin count must be 0)

Incrementing the pin count is called pinning the requested page in its frame, while decrementing the pin count is called unpinning the page. When unpinning a page, its dirty flag should be set to true if the client modified the page.
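The pin/unpin bookkeeping can be sketched as below. This is a minimal illustration, not any real DBMS's implementation; the disk I/O calls are stubs and the victim-selection policy is a placeholder:

```python
# Minimal sketch of buffer-manager bookkeeping (hypothetical, simplified).
class Frame:
    def __init__(self):
        self.page_id = None
        self.pin_count = 0      # number of clients using the page
        self.dirty = False      # modified in memory but not yet written to disk

class BufferManager:
    def __init__(self, n_frames):
        self.frames = [Frame() for _ in range(n_frames)]
        self.page_table = {}    # page_id -> frame index

    def pin(self, page_id):
        """Request a page: return its frame, fetching from disk if needed."""
        if page_id in self.page_table:          # already in buffer pool
            frame = self.frames[self.page_table[page_id]]
        else:
            idx = self._choose_victim()         # replacement policy (pin count 0)
            frame = self.frames[idx]
            if frame.dirty:
                self._write_to_disk(frame)      # flush dirty victim first
            if frame.page_id is not None:
                del self.page_table[frame.page_id]
            frame.page_id, frame.dirty = page_id, False
            self._read_from_disk(frame)
            self.page_table[page_id] = idx
        frame.pin_count += 1                    # pinning
        return frame

    def unpin(self, page_id, modified):
        """Release a page; the client reports whether it modified it."""
        frame = self.frames[self.page_table[page_id]]
        frame.pin_count -= 1                    # unpinning
        if modified:
            frame.dirty = True

    def _choose_victim(self):
        # Placeholder policy: first frame with pin count 0.
        for i, f in enumerate(self.frames):
            if f.pin_count == 0:
                return i
        raise RuntimeError("all frames pinned")

    def _read_from_disk(self, frame): pass      # stub: real I/O omitted
    def _write_to_disk(self, frame): pass       # stub: real I/O omitted
```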

Replacement Policies

A page in buffer can be replaced only when its pin count is 0. Before replacing a buffer page, it needs to be written back to disk if its dirty flag is true. Buffer manager coordinates with transaction manager to ensure data correctness and recoverability. Some replacement policies include:

  • Random
  • First In First Out (FIFO)
  • Most Recently Used (MRU)
  • Least Recently Used (LRU): Uses a queue of pointers to frames with pin count = 0
  • Clock (variant of LRU): a current variable points to some buffer frame, where each frame has a reference bit which is turned on when its pin count drops to 0. An on reference bit means the page was just in use, so don't replace it yet (this approximates recency of use).
  • A page is replaced when its reference bit is off and its pin count = 0.

Clock Replacement Policy

The clock replacement policy can be thought of with the flowchart below where n is the number of frames in buffer pool

crp

As seen, the first encounter with an on reference bit turns it off, so the frame can be used at the next encounter.
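The flowchart can be sketched as a small function. This is a simplified illustration (frames are plain dicts, and the hand position is passed in and returned rather than stored in the buffer manager):

```python
# Sketch of the clock replacement policy.
def clock_choose(frames, hand):
    """frames: list of dicts with 'pin_count' and 'ref' (reference bit).
    Returns (victim_index, new_hand); raises if all frames are pinned."""
    n = len(frames)
    for _ in range(2 * n):                 # at most two sweeps are needed
        f = frames[hand]
        if f["pin_count"] == 0:
            if f["ref"]:                   # recently unpinned: second chance
                f["ref"] = False           # turn bit off, usable next encounter
            else:                          # unpinned and ref off: replace it
                return hand, (hand + 1) % n
        hand = (hand + 1) % n              # move the clock hand
    raise RuntimeError("all frames pinned")
```

Each sweep either returns a victim or turns off at least one reference bit, so with any unpinned frame present, two sweeps suffice.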

Here is an example of how it works:

  1. Pin count > 0 (currently being used): move the clock hand.
  2. Pin count = 0, reference bit on: need to turn it off.
  3. Turn off the reference bit, move to the next frame.
  4. Pin count > 0: move the clock hand.
  5. Pin count = 0, reference bit off: return this frame to satisfy the page request.
  6. Set pin count = 1.
  7. Request to unpin frame 3.
  8. Set pin count = 0, turn on the reference bit.
  9. Pin count > 0: move the clock hand.
  10. Pin count = 0, reference bit on: need to turn it off.
  11. Turn off the reference bit, move to the next frame.
  12. Pin count = 0, reference bit off.
  13. Set pin count = 1.

Files

File abstraction:

Each relation is a file of records, and each record has a unique record identifier RID (or TID). Some common file operations include create/delete file, insert record, delete/get a record by RID, and scan all records.

File organisation is a method of arranging the data records of a file stored on disk. There are different types of files as follows:

  • Heap file: unordered file
  • Sorted file: records sorted on some search key
  • Hashed file: records located in blocks via hash function

Suppose a number of records have already been inserted into a table and stored in a file, and we want to insert a new record into the table. Which page can we use for the insert? (We have to look for a page with free space.) Here are 2 ways heap files can be implemented:

linkedList

Here, each rectangle represents a list page and each pointer is an address. This is a doubly-linked list of pages with free space and full pages. When inserting a record, traverse the grey list; if the last page is reached and there is still no space, a new page is allocated. Pages move to the respective linked list when they become full or regain free space.

Records can be fixed length (int, short) or variable length (VARCHAR).

pageDir

Pages are organised into a directory structure. Every directory entry stores some information about the amount of free space in a data page. When searching for free space, first check the first directory page and look for a data entry with sufficient space; otherwise read the next directory page. In the worst case, all directory pages and 1 data page need to be read.
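The free-space search can be sketched as follows. The representation is an assumption for illustration: each directory page is modelled as a list of (data page id, free bytes) entries:

```python
# Sketch of free-space search over a page directory (structure assumed:
# each directory page holds (data_page_id, free_bytes) entries).
def find_page_with_space(directory_pages, record_size):
    """Scan directory pages in order; return the first data page that fits.
    Worst case: read all directory pages, then 1 data page."""
    for dir_page in directory_pages:
        for data_page_id, free_bytes in dir_page:
            if free_bytes >= record_size:
                return data_page_id
    return None  # no space anywhere: caller allocates a new data page

directory = [[(1, 0), (2, 80)], [(3, 400)]]
print(find_page_with_space(directory, 100))  # 3: first page with >= 100 free bytes
```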

Page Formats

Records are organised within a page by their RID consisting of page id and slot number.

There are fixed-length records, for which there are two organizations:

  • Packed organization: store records in contiguous slots. If 4 records are stored, the next record goes in slot 5, which allows the starting offset of each slot to be computed directly. However, deleting a record violates the contiguous layout: the last record must be moved into the empty slot, changing that record's RID.
  • Unpacked organization: uses a bit array to keep track of free slots

The page is divided into fixed-length slots according to the size of the fixed-length records (e.g. if a record is 100 bytes, each slot is 100 bytes long).

pageFormat
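The offset arithmetic for packed fixed-length slots can be sketched as below. The page header size and record size are assumed values for illustration:

```python
# Offset arithmetic for fixed-length slots (packed organization),
# assuming a hypothetical layout: page header, then equal-sized slots.
HEADER_SIZE = 24     # assumed page-header size in bytes
RECORD_SIZE = 100    # each slot is exactly one record long

def slot_offset(slot_no: int) -> int:
    """Starting byte of slot `slot_no`, computed directly (no scan needed)."""
    return HEADER_SIZE + slot_no * RECORD_SIZE

print(slot_offset(4))  # 424: with 4 records stored, the next one starts here
```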

Variable-length records use the slotted page organisation: each slot entry stores the start address (offset) and size of a record, and the page keeps a pointer to the start of the free space together with the size of the free space.

varLength

PostgreSQL

PostgreSQL's slotted page organization works as follows:

postgres

Records

Fixed length: the sizes of the attributes are already stored in the system catalog, so the bytes can simply be stored; the system can figure out the offsets and easily extract the relevant fields.

Variable length: when the schema is created, it is not known how long each value will be. Two approaches:

  1. Delimiter: a special character (e.g. a $ symbol) separates the fields; read until the symbol is hit.

  2. Explicitly store the starting byte offset of each field; no need to scan for a delimiter to locate each attribute value.
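The second approach can be sketched as below. The encoding (2-byte little-endian offsets, with one extra end offset so each field spans a half-open range) is an assumption for illustration, not any particular system's format:

```python
# Sketch: one variable-length record encoded with an offset array
# (approach 2) so each field is located without scanning for delimiters.
def encode_record(fields: list) -> bytes:
    offsets, body = [], b""
    for f in fields:
        offsets.append(len(body))    # starting byte offset of this field
        body += f
    offsets.append(len(body))        # end offset: field i spans [o[i], o[i+1])
    header = b"".join(o.to_bytes(2, "little") for o in offsets)
    return header + body

def decode_field(record: bytes, n_fields: int, i: int) -> bytes:
    header_len = 2 * (n_fields + 1)
    o = [int.from_bytes(record[2*j:2*j+2], "little") for j in range(n_fields + 1)]
    return record[header_len + o[i] : header_len + o[i + 1]]

rec = encode_record([b"alice", b"singapore"])
print(decode_field(rec, 2, 1))  # b'singapore'
```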