- Linux supports a wide variety of file systems:
- The traditional ext2 file system
- Various native UNIX file systems, such as the Minix, System V, and BSD file systems
- Microsoft's FAT, FAT32 and NTFS
- The ISO 9660 CD-ROM file system
- Apple Macintosh's HFS
- A range of network file systems including Sun's widely used NFS, IBM and Microsoft's SMB, Novell's NCP, and the Coda file system developed at CMU
- A range of journaling file systems including ext3, ext4, Reiserfs, JFS, XFS and Btrfs
The file system types currently known by the kernel can be viewed through # cat /proc/filesystems
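As a minimal sketch (Linux-specific, since it reads the /proc pseudo-file system), the same information can be read programmatically:

```python
# List file system types known to the running kernel by reading the
# /proc/filesystems pseudo-file (Linux-specific).
with open("/proc/filesystems") as f:
    for line in f:
        # Each line is "nodev\t<name>" for virtual file systems,
        # or "\t<name>" for file systems that require a block device.
        nodev, _, name = line.rstrip("\n").partition("\t")
        kind = "virtual" if nodev == "nodev" else "block-backed"
        print(f"{name:12s} ({kind})")
```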
File system structure
The basic unit for allocating space in a file system is a logical block, which is some multiple of contiguous physical blocks on the disk device on which the file system resides. For example, the logical block size on ext2 is 1024, 2048 or 4096 bytes.
A disk is divided into one or more partitions, and each partition may hold a single file system. A file system contains the following components:
Boot block: This is always the first block in a file system. The boot block is not used by the file system; rather, it contains information used to boot the operating system. Although only one boot block is needed by the operating system, all file systems have a boot block (most of which are unused).
Superblock: This is a single block, immediately following the boot block, which contains parameter information about the file system, including:
- the size of the i-node table;
- the size of logical blocks in this file system;
- the size of the file system in logical blocks.
Different file systems residing on the same physical device can be of different types and sizes, and have different parameter settings (e.g., block size). This is one of the reasons for splitting a disk into multiple partitions.
i-node table: Each file or directory in the file system has a unique entry in the i-node table. This entry records various information about the file. The i-node table is sometimes also called the i-list.
Data blocks: The great majority of space in a file system is used for the blocks of data that form the files and directories residing in the file system.
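Some of the superblock parameters just described can be retrieved for a mounted file system with the statvfs() system call. A short sketch (the path "/" is an arbitrary choice; any path on the file system of interest works):

```python
import os

# statvfs() reports parameters that the kernel derives from the
# superblock of the mounted file system containing the given path.
sv = os.statvfs("/")
print("logical block size:", sv.f_bsize, "bytes")
print("total size:", sv.f_blocks * sv.f_frsize, "bytes")
print("free space:", sv.f_bfree * sv.f_frsize, "bytes")
```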
Device Special Files (Devices)
A device special file corresponds to a device on the system. Within the kernel, each device type has a corresponding device driver, which handles all I/O requests for the device. A device driver is a unit of kernel code that implements a set of operations that normally correspond to input and output actions on an associated piece of hardware. The API provided by device drivers is fixed, and includes operations corresponding to the system calls open(), close(), read(), write(), mmap() and ioctl(). The fact that each device driver provides a consistent interface, hiding the differences in operation of individual devices, allows for universality of I/O.
Some devices are real, such as mice, disks and tape drives. Others are virtual, meaning that there is no corresponding hardware; rather, the kernel provides through the device driver an abstract device with an API that is the same as a real device.
Devices can be divided into two types:
Character devices: Handle data on a character-by-character basis. Terminals and keyboards are examples of character devices.
Block devices: Handle data a block at a time. The size of a block depends on the type of device, but is typically some multiple of 512 bytes. Disks are a common example of block devices.
Device files appear within the file system, just like other files, usually under the /dev directory. The superuser can create a device file using the mknod command, and the same task can be performed in a privileged (CAP_MKNOD) program using the mknod() system call. mknod stands for "make file system i-node".
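Creating a device file requires CAP_MKNOD, but mknod() can also create a FIFO (named pipe) i-node, which any process may do. The following hedged sketch uses that unprivileged case to illustrate the call (the temporary path and file name are arbitrary):

```python
import os
import stat
import tempfile

# mknod() needs CAP_MKNOD to create device files, but any process
# may use it to create a FIFO i-node.
path = os.path.join(tempfile.mkdtemp(), "myfifo")
os.mknod(path, mode=stat.S_IFIFO | 0o600)

st = os.stat(path)
print("is FIFO:", stat.S_ISFIFO(st.st_mode))   # True
os.remove(path)
```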
In earlier versions of Linux, /dev contained entries for all possible devices on the system, even if such devices were not actually connected to the system. This meant that /dev could contain literally thousands of unused entries, slowing the task of programs that needed to scan the contents of that directory, and making it impossible to use the contents of the directory as a means of discovering which devices were actually present on the system. In Linux 2.6, these problems are solved by the udev program. The udev program relies on the sysfs file system, which exports information about devices and other kernel objects into user space through a pseudo-file system mounted under /sys.
Device IDs:
Each device file has a major ID number and a minor ID number. The major ID identifies the general class of device, and is used by the kernel to look up the appropriate driver for this type of device. The minor ID uniquely identifies a particular device within a general class. The major and minor IDs of a device file are displayed by the ls -l command.
A device's major and minor IDs are recorded in the i-node for the device file. Each device driver registers its association with a specific major device ID, and this association provides the connection between the device special file and the device driver. The name of the device file has no relevance when the kernel looks for the device driver.
On Linux 2.4 and earlier, the total number of devices on the system is limited by the fact that device major and minor IDs are each represented using just 8 bits. The fact that major device IDs are fixed and centrally assigned (by www.lanana.org) further exacerbates this limitation. Linux 2.6 eases this limitation by using more bits to hold the major and minor device IDs (respectively 12 and 20 bits).
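The major and minor IDs recorded in a device file's i-node can also be read with stat(). A small sketch using /dev/null (assumed here only to exist and to be a character device, as it is on any standard Linux system):

```python
import os
import stat

# stat() on a device file exposes its device IDs via st_rdev;
# os.major()/os.minor() split that value into the two components.
st = os.stat("/dev/null")
print("character device:", stat.S_ISCHR(st.st_mode))
print("major:", os.major(st.st_rdev), "minor:", os.minor(st.st_rdev))
```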
I-nodes
A file system's i-node table contains one i-node for each file residing in the file system. An i-node, short for index node, is identified numerically by its sequential location in the i-node table. The i-node number of a file is the first field displayed by the ls -li command. The information maintained in an i-node includes the following:
- File type (e.g., regular file, directory, symbolic link, character device).
- Owner (also referred to as the User ID or UID) for the file.
- Group (also referred to as the Group ID or GID) for the file
- Access permissions for three categories of user: owner, group, and others.
- Three timestamps: time of last access to the file (shown by ls -lu), time of last modification of the file (ls -l, the default), and time of last status change of the i-node (ls -lc). Notably, as on most other UNIX implementations, most Linux file systems don't record the creation time of a file.
- Number of hard links to the file.
- Size of the file in bytes.
- Number of blocks actually allocated to the file, measured in units of 512-byte blocks. There may not be a simple correspondence between this number and the size of the file in bytes, since a file can contain holes, and thus require fewer allocated blocks than would be expected according to its nominal size in bytes.
- Pointers to the data blocks of the file.
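Most of these i-node fields can be inspected with stat(). The sketch below also demonstrates a file with a hole: seeking past end-of-file before writing leaves a gap for which no data blocks need be allocated, so the allocated space (st_blocks, in 512-byte units) is typically far less than the nominal size (st_size). The temporary path is arbitrary, and the space saving assumes a file system that supports sparse files, as most native Linux file systems do:

```python
import os
import tempfile

# Create a sparse file: seek 1,000,000 bytes past the start, then
# write 3 bytes. The gap is a hole with no allocated data blocks.
path = os.path.join(tempfile.mkdtemp(), "sparse")
fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
os.lseek(fd, 1_000_000, os.SEEK_SET)
os.write(fd, b"end")
os.close(fd)

st = os.stat(path)
print("i-node number:", st.st_ino)
print("hard links:   ", st.st_nlink)
print("size (bytes): ", st.st_size)          # 1000003
print("allocated:    ", st.st_blocks * 512)  # typically far less
os.remove(path)
```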
Virtual File System (VFS)
Each of the file systems available on Linux differs in the details of its implementation. Such differences include, for example, the way in which the blocks of a file are allocated and the manner in which directories are organized. If every program that worked with files needed to understand the specific details of each file system, the task of writing programs that worked with all of the different file systems would be nearly impossible. The virtual file system (VFS) is a kernel feature that resolves this problem by creating an abstraction layer for file system operations.
The ideas behind the VFS are:
- The VFS defines a generic interface for file system operations. All programs that work with files specify their operations in terms of this generic interface.
- Each file system provides an implementation for the VFS interface.
Under this scheme, programs need to understand only the VFS interface and can ignore details of individual file system implementations.
The VFS interface includes operations corresponding to all of the usual system calls for working with file systems and directories, such as open(), read(), write(), lseek(), close(), truncate(), stat(), mount(), umount(), mmap(), mkdir(), link(), unlink(), symlink(), and rename().
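To make the abstraction concrete, here is a minimal sketch using the syscall-level interface (via Python's os module): the identical sequence of open(), write(), lseek(), read(), and close() calls works unchanged whatever file system the path happens to reside on. The temporary path is arbitrary:

```python
import os
import tempfile

# The same VFS-level calls work regardless of the underlying
# file system implementation.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
os.write(fd, b"hello, VFS")
os.lseek(fd, 0, os.SEEK_SET)   # rewind to the start of the file
data = os.read(fd, 100)
os.close(fd)
print(data)                    # b'hello, VFS'
os.remove(path)
```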
Journaling File Systems
The ext2 file system is a good example of a traditional UNIX file system, and suffers from a classic limitation of such file systems: after a system crash, a file system consistency check (fsck) must be performed on reboot in order to ensure the integrity of the file system. This is necessary because at the time of the system crash a file update may have been only partially completed, and the file system metadata (directory entries, i-node information, and file data block pointers) may be in an inconsistent state, so that the file system might be further damaged if these inconsistencies are not repaired. A file system consistency check ensures the consistency of the file system metadata. Where possible, repairs are performed; otherwise, information that is not retrievable (possibly including file data) is discarded.
The problem is that a consistency check requires examining the entire file system. On a small file system, this may take anything from several seconds to a few minutes. On a large file system, this may require several hours, which is a serious problem for systems that must maintain high availability.
Journaling file systems eliminate the need for lengthy file system consistency checks after a system crash. A journaling file system logs (journals) all metadata updates to a special on-disk journal file before they are actually carried out. The updates are logged in groups of related metadata updates (transactions). In the event of a system crash in the middle of a transaction, on system reboot, the log can be used to rapidly redo any incomplete updates and bring the file system back to a consistent state. (To borrow database parlance, we can say that a journaling file system ensures that file metadata transactions are always committed as a complete unit.) Even very large journaling file systems can typically be available within seconds after a system crash, making them very attractive for systems with high availability requirements.
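The log-then-apply idea can be illustrated with a toy write-ahead journal (a deliberately simplified sketch, not how any real file system implements journaling; all file names and the key/value "transactions" are invented for illustration):

```python
import json
import os
import tempfile

tmp = tempfile.mkdtemp()
JOURNAL = os.path.join(tmp, "journal.log")  # the on-disk journal
DATA = os.path.join(tmp, "data.json")       # the "file system" state

def apply_updates(updates):
    # Applying is idempotent, so replaying an already-applied
    # transaction during recovery is harmless.
    state = json.load(open(DATA)) if os.path.exists(DATA) else {}
    state.update(updates)
    with open(DATA, "w") as d:
        json.dump(state, d)
        d.flush()
        os.fsync(d.fileno())

def commit(updates):
    # 1. Durably log the whole transaction BEFORE applying it...
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(updates) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. ...then apply it to the main data file.
    apply_updates(updates)

def recover():
    # After a crash, redo every logged transaction.
    if os.path.exists(JOURNAL):
        for line in open(JOURNAL):
            apply_updates(json.loads(line))

commit({"a": 1})
commit({"b": 2})
recover()                       # safe even when there was no crash
print(json.load(open(DATA)))    # {'a': 1, 'b': 2}
```

Because the journal is fsync'ed before the data file is touched, a crash at any point leaves either a complete log record to redo or no record at all, which is the essence of committing metadata transactions as a unit.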
The most notable disadvantage of journaling is that it adds time to file updates, though good design can make this overhead low.
Some journaling file systems ensure only the consistency of file metadata. Because they don't log file data, data may still be lost in the event of a crash. The ext3, ext4, and Reiserfs file systems provide options for logging data updates, but depending on the workload this may result in lower file I/O performance.
The journaling file systems available for Linux include the following:
- Reiserfs was the first of the journaling file systems to be integrated into the kernel (in version 2.4.1). Reiserfs provides a feature called tail packing (or tail merging): small files (and the final fragment of large files) are packed into the same disk blocks as the file metadata. Because many systems have (and some applications create) large numbers of small files, this can save a significant amount of disk space.
- The ext3 file system was the result of a project to add journaling to ext2 with minimal impact. The migration path from ext2 to ext3 is very easy (no backup and restore are required), and it is possible to migrate in the reverse direction as well. The ext3 file system was integrated into the kernel in version 2.4.15.
- JFS was developed at IBM. It was integrated into the 2.4.20 kernel.
- XFS was originally developed by Silicon Graphics (SGI) in the early 1990s for Irix, its proprietary UNIX implementation. In 2001, XFS was ported to Linux and made available as a free software project. XFS was integrated into the 2.4.24 kernel.
Support for the various file systems is enabled using kernel options that are set under the File systems menu when configuring the kernel.
Work has been in progress on two other file systems that provide journaling and a range of other advanced features:
- The ext4 file system is the successor to ext3. The first pieces of the implementation were added in kernel 2.6.19, with further features added in later kernel versions. Among the features of ext4 are extents (reservation of contiguous blocks of storage) and other allocation features that aim to reduce file fragmentation, online file system defragmentation, faster file system checking, and support for nanosecond timestamps.
- Btrfs (B-tree FS) is a new file system designed from the ground up to provide a range of modern features, including extents, writable snapshots (which provide functionality equivalent to metadata and data journaling), checksums on data and metadata, online file system checking, online file system defragmentation, space-efficient packing of small files, and space-efficient indexed directories. It was integrated into the kernel in version 2.6.29.
Mounting and Unmounting File Systems
The mount() and umount() system calls allow a privileged (CAP_SYS_ADMIN) process to mount and unmount file systems. Most UNIX implementations provide versions of these system calls. However, they are not standardized by SUSv3, and their operation varies both across UNIX implementations and across file systems.

