File System Abstraction

We need file systems because data in physical memory is volatile. We want our files to persist, so we use external storage to hold persistent information. Direct access to the storage media is not portable, since it depends on the hardware specification and organization.

A file system provides an abstraction on top of the physical media (so you can e.g. drag and drop files rather than deal with the specific location). As an extension, being a high level resource management scheme, it provides protection and enables sharing between processes and users.

General Criteria

A file system should be self-contained: the information stored on the media is enough to describe its entire organization. It should be "plug-and-play", in the sense that everything required to access the media is already available on the media itself. You should not need to provide extra information.

Of course, it should also be persistent beyond the lifetime of the OS and processes. We don't want to lose our files!

It should also be efficient, meaning easy to access in the manner you are used to. It should manage free and unused space well, with minimum overhead for bookkeeping information.

Memory Management vs File Management

[Figure: memory management vs file management]

Abstraction

[Figure: ls output with colour]

If you run ls --color=tty on your terminal, it adds colour to directory/files etc!

The OS exposes many of the things the file system does to users through system calls. But first, what is a file?

File

A file represents a logical unit of information created by a process. It is an abstraction - essentially an abstract data type (underneath, it is just 0s and 1s) with a set of common operations and various possible implementations.

A file contains data (information structured in some way) and metadata (additional information associated with the file, aka file attributes). The file attributes are basically what you see when you look up file information. If you want more information, you can also use stat filename!
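To make the data/metadata split concrete, here is a minimal sketch that reads a few attributes through Python's os.stat, which reports much of what the stat command prints. The helper name describe and the choice of fields are just for illustration.

```python
import os
import stat
import tempfile
import time


def describe(path):
    """Return a few metadata fields for path (illustrative helper)."""
    info = os.stat(path)  # reads metadata only; the file contents are untouched
    return {
        "size_bytes": info.st_size,
        "permissions": stat.filemode(info.st_mode),
        "owner_uid": info.st_uid,
        "last_modified": time.ctime(info.st_mtime),
        "is_regular_file": stat.S_ISREG(info.st_mode),
    }


if __name__ == "__main__":
    # Use a throwaway file so the example does not depend on existing paths.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("hello")
    try:
        for key, value in describe(f.name).items():
            print(f"{key}: {value}")
    finally:
        os.remove(f.name)
```

The data here is the five bytes "hello"; everything describe returns is metadata kept alongside it.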

Metadata

[Figure: file metadata]

As seen, a lot of information needs to be kept in order to access information about the file.

Name

Different file systems have different naming rules, as the system has to understand how to interpret a file name and determine whether it is valid. Some common naming rules include:

Length of file name
Case sensitivity
Allowed special symbols
File extension
- Usually is Name.Extension
- On some FS, Extension is used to indicate the file type

File Type

An OS commonly supports a number of file types. Each type has an associated set of operations and possibly a specific program for processing. Common file types include:

Regular files
- contains user information
Directories
- system files for FS structure
Special files
- character/block oriented
- like for keyboards, read in streams of bytes

Regular files we are familiar with include ASCII files (e.g. text files, program source code etc.), which can be displayed or printed as is. There are also binary files (executables, Java class files, pdf, mp3/4, png/jpeg etc.). These usually have a predefined internal structure that can be processed by a specific program (the JVM for Java class files, a PDF reader for pdf files etc.).

To distinguish file types, we can refer to the file extension (used by Windows), such as .docx for a Word document. A change of extension implies a change in file type. We can also use information embedded in the file itself (used by UNIX), which is usually stored at the beginning of the file and commonly known as a magic number.
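As a rough sketch of the magic number idea, the snippet below guesses a file's type from its first few bytes and ignores the extension entirely. The table MAGIC_NUMBERS and the function sniff_type are made up for this example, and the table lists only a handful of well-known signatures.

```python
import os
import tempfile

# A few well-known signatures (leading bytes) and the type they suggest.
# Illustrative only: real tools know far more signatures, and some
# signatures are shared by more than one format.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"\x7fELF", "ELF executable"),
    (b"\xca\xfe\xba\xbe", "Java class file"),
]


def sniff_type(path):
    """Guess a file's type from its first bytes, ignoring its extension."""
    with open(path, "rb") as f:
        head = f.read(8)  # the longest signature above is 8 bytes
    for signature, description in MAGIC_NUMBERS:
        if head.startswith(signature):
            return description
    return "unknown"


if __name__ == "__main__":
    # A file named like a text file but starting with the PDF signature.
    path = os.path.join(tempfile.mkdtemp(), "notes.txt")
    with open(path, "wb") as f:
        f.write(b"%PDF-1.7\n")
    print(sniff_type(path))  # prints "PDF document"
```

Renaming the file changes nothing here, which is exactly the difference from the extension-based approach.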

File Protection

Since we store and access files, we need a good understanding of what we are allowed to do with a file. Protection helps us control access to the information stored in the file. The types of access include:

Read: Retrieves information from the file
Write: (Re)Write the file
Execute: Load file into memory and execute it
Append: Add new information to the end of the file
Delete: Remove the file from FS
List: Read metadata of a file

Normally for file access, there is some kind of access control built into the system. This is where OS and file systems start "talking" closely. It is the OS that determines your file access, not the file system.

Usually, the most common approach is to restrict access based on the user identity - role-based access control (RBAC). We add a role and say what the user is allowed to do: what/who the user is, what the user can do, and what the files allow them to do. A general scheme is to have an Access Control List, which is a list of user identities and their allowed access types. This is very customizable, but it attaches a lot of information to each file.

A common condensed file protection scheme is to classify the users into 3 classes.

Owner: User who created the file
Group: Set of users who need similar access to a file
Universe: All other users in the system

We can control which user has what type of access.

[Figure: ls -l output]

Running ls -l shows the permission bits for a file.

In Unix, the Access Control List (ACL) can be a Minimal ACL (same as the permission bits) or an Extended ACL (with added named users/groups). To see a file's access control list, you can run getfacl.
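The same permission bits can be read programmatically. The sketch below assumes a Unix-like system; permission_string and class_permissions are illustrative helper names that split the nine bits into the owner, group and universe classes described above.

```python
import os
import stat
import tempfile


def permission_string(path):
    """Return the ls -l style permission string, e.g. '-rw-r-----'."""
    return stat.filemode(os.stat(path).st_mode)


def class_permissions(path):
    """Split the nine permission bits into owner, group and universe triples."""
    bits = permission_string(path)[1:]  # drop the leading file-type character
    return {"owner": bits[0:3], "group": bits[3:6], "universe": bits[6:9]}


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(path, 0o640)  # owner: read+write, group: read, universe: none
        print(permission_string(path))  # prints "-rw-r-----"
        print(class_permissions(path))
    finally:
        os.remove(path)
```

Each triple is read, write, execute in that order, with a dash where the access type is not granted.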

Operations on Metadata

Operations on file metadata include renaming a file, changing its attributes (file access permissions, dates, ownership etc.) and reading its attributes (e.g. file creation time).

File Data Structure

A file is just a bunch of bytes. But how do we actually access them? Most files are an array of bytes, where each byte has a unique offset (distance) from the start of the file. This is one way we can access the bytes. An array of bytes gives the advantage of O(1) lookup, so most files adopt this random access (unless the bytes have to be accessed sequentially).

With fixed-length records, we have an array of records which can grow or shrink. We can jump to any record easily, since the offset of the Nth record = size of record * (N - 1). Variable-length records are more flexible, but it is harder to locate a record.
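The offset formula for fixed-length records can be tried out directly. This is a small sketch with a hypothetical 16-byte record layout (a 4-byte id plus a 12-byte name); RECORD, record_offset and read_record are names invented for the example.

```python
import os
import struct
import tempfile

# Hypothetical record layout: 4-byte unsigned id + 12-byte name = 16 bytes.
RECORD = struct.Struct("<I12s")


def record_offset(n):
    """Byte offset of the Nth record, counting records from 1."""
    return RECORD.size * (n - 1)


def read_record(path, n):
    """Jump straight to record n and decode it, without reading earlier ones."""
    with open(path, "rb") as f:
        f.seek(record_offset(n))
        record_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return record_id, raw_name.rstrip(b"\x00").decode("ascii")


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            for i, name in enumerate([b"alice", b"bob", b"carol"], start=1):
                f.write(RECORD.pack(i, name))  # short names are padded with zeros
        print(record_offset(3))      # prints 32
        print(read_record(path, 3))  # prints (3, 'carol')
    finally:
        os.remove(path)
```

With variable-length records there is no such formula, so finding record N means scanning from the start or keeping a separate index.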

Access Methods

Sequential Access

Data is read in order starting from the beginning. You cannot skip but it is possible to rewind.

Random Access

Data can be read in any order. This can be provided in 2 ways:

Read(offset): Every read operation explicitly states the position to be accessed
Seek(offset): A special operation is provided to move to a new location in the file
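On Unix-like systems, both styles are available on a file descriptor. The sketch below maps Read(offset) to os.pread and Seek(offset) to os.lseek followed by os.read; the wrapper names read_at and seek_then_read are only for illustration.

```python
import os
import tempfile


def read_at(fd, offset, length):
    """Read(offset) style: the position is passed along with the read."""
    return os.pread(fd, length, offset)  # does not move the file pointer


def seek_then_read(fd, offset, length):
    """Seek(offset) style: move the file pointer, then do an ordinary read."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, length)  # advances the file pointer by the bytes read


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        print(read_at(fd, 4, 3))         # prints b'456'
        print(seek_then_read(fd, 4, 3))  # prints b'456'
    finally:
        os.close(fd)
        os.remove(path)
```

Both calls return the same bytes. The difference is the side effect: after seek_then_read the file pointer sits at offset 7, while read_at leaves it where it was.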

Direct Access

This is used for files containing fixed-length records and allows random access to any record directly. This is very useful when there is a large number of records. The basic random access method can be viewed as a special case where each record == 1 byte.

Generic Operations

[Figure: generic file operations]

System Calls

The OS provides file operations as system calls. It provides protection as well as concurrent and efficient access. It also maintains the relevant information. Note that when you make a system call, you can't use the file system/OS for other stuff in the meantime.

The information kept for an open file includes the file pointer (tracks the current location in the file), the disk location (tracks the actual file location on disk) and the open count (tracks how many processes have the file open, useful to determine when to remove the entry from the table). Basically: which process, what file and where in the file.

Tracking

Since several processes can open the same file, and several different files can be open at the same time, we need a good way to organize the open-file information.

System-wide open-file table
- 1 entry per unique file
- Tells which files are open in the system; a process that has the file open can look up the table to find information about the file
Per-process open-file table
- 1 entry per file used in the process
- Each entry points to the system-wide table

Each process holds its own information about the files it has open. When a process makes a system call, the return value is a descriptor for whatever entity it is going to use in the system. For a file, it is a file descriptor (FD). When you open a file, you get this descriptor as a handle on the file. The file descriptor points to the information about the open file in the file table.

This information is stored and can be accessed independently, so it can be shared.

Both processes have their own "version" of the file (they are looking at different parts of the file, at offsets 5000 and 2000).

When you call fork(), this information is shared. Both parent and child share file descriptors (pointing to the same file, doing the same thing).
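The independent positions can be reproduced with two separate open calls on the same file. For simplicity this sketch does both opens inside one process as a stand-in for two processes; the function name independent_positions is made up, and the offsets 5000 and 2000 are taken from the example above.

```python
import os
import tempfile


def independent_positions(path):
    """Open the same file twice and move each descriptor separately."""
    first = os.open(path, os.O_RDONLY)
    second = os.open(path, os.O_RDONLY)
    try:
        os.lseek(first, 5000, os.SEEK_SET)
        os.lseek(second, 2000, os.SEEK_SET)
        # Query each current position: neither seek affected the other.
        return (os.lseek(first, 0, os.SEEK_CUR),
                os.lseek(second, 0, os.SEEK_CUR))
    finally:
        os.close(first)
        os.close(second)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 8192)  # large enough to contain both offsets
        print(independent_positions(path))  # prints (5000, 2000)
    finally:
        os.close(fd)
        os.remove(path)
```

Each open call gets its own file pointer, so the two descriptors track their positions separately even though they refer to the same file on disk.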
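By contrast, a descriptor inherited across fork() shares its file pointer with the parent. The sketch below assumes a Unix-like system where os.fork is available; parent_read_after_child is an illustrative name. The child reads 5 bytes and exits, and the parent's next read continues from where the child stopped.

```python
import os
import tempfile


def parent_read_after_child(path):
    """Let a forked child read 5 bytes, then return what the parent reads next."""
    fd = os.open(path, os.O_RDONLY)
    try:
        pid = os.fork()
        if pid == 0:
            # Child: advance the shared file pointer, then exit immediately
            # so that none of the parent's remaining code runs in the child.
            try:
                os.read(fd, 5)
            finally:
                os._exit(0)
        os.waitpid(pid, 0)  # make sure the child's read has happened
        return os.read(fd, 5)
    finally:
        os.close(fd)


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"0123456789")
        os.close(fd)
        print(parent_read_after_child(path))  # prints b'56789'
    finally:
        os.remove(path)
```

The parent never read the first five bytes itself, yet its read starts at offset 5, because parent and child refer to the same open-file entry.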