diff options
Diffstat (limited to 'Documentation/filesystems/ext4/ondisk/inodes.rst')
-rw-r--r-- | Documentation/filesystems/ext4/ondisk/inodes.rst | 575 |
1 files changed, 0 insertions, 575 deletions
diff --git a/Documentation/filesystems/ext4/ondisk/inodes.rst b/Documentation/filesystems/ext4/ondisk/inodes.rst deleted file mode 100644 index 655ce898f3f5..000000000000 --- a/Documentation/filesystems/ext4/ondisk/inodes.rst +++ /dev/null @@ -1,575 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Index Nodes ------------ - -In a regular UNIX filesystem, the inode stores all the metadata -pertaining to the file (time stamps, block maps, extended attributes, -etc), not the directory entry. To find the information associated with a -file, one must traverse the directory files to find the directory entry -associated with a file, then load the inode to find the metadata for -that file. ext4 appears to cheat (for performance reasons) a little bit -by storing a copy of the file type (normally stored in the inode) in the -directory entry. (Compare all this to FAT, which stores all the file -information directly in the directory entry, but does not support hard -links and is in general more seek-happy than ext4 due to its simpler -block allocator and extensive use of linked lists.) - -The inode table is a linear array of ``struct ext4_inode``. The table is -sized to have enough blocks to store at least -``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the -block group containing an inode can be calculated as -``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the -group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There -is no inode 0. - -The inode checksum is calculated against the FS UUID, the inode number, -and the inode structure itself. - -The inode table entry is laid out in ``struct ext4_inode``. - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - i\_mode - - File mode. See the table i_mode_ below. - * - 0x2 - - \_\_le16 - - i\_uid - - Lower 16-bits of Owner UID. - * - 0x4 - - \_\_le32 - - i\_size\_lo - - Lower 32-bits of size in bytes. - * - 0x8 - - \_\_le32 - - i\_atime - - Last access time, in seconds since the epoch. However, if the EA\_INODE - inode flag is set, this inode stores an extended attribute value and - this field contains the checksum of the value. - * - 0xC - - \_\_le32 - - i\_ctime - - Last inode change time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the lower 32 bits of the attribute value's - reference count. - * - 0x10 - - \_\_le32 - - i\_mtime - - Last data modification time, in seconds since the epoch. However, if the - EA\_INODE inode flag is set, this inode stores an extended attribute - value and this field contains the number of the inode that owns the - extended attribute. - * - 0x14 - - \_\_le32 - - i\_dtime - - Deletion Time, in seconds since the epoch. - * - 0x18 - - \_\_le16 - - i\_gid - - Lower 16-bits of GID. - * - 0x1A - - \_\_le16 - - i\_links\_count - - Hard link count. Normally, ext4 does not permit an inode to have more - than 65,000 hard links. This applies to files as well as directories, - which means that there cannot be more than 64,998 subdirectories in a - directory (each subdirectory's '..' entry counts as a hard link, as does - the '.' entry in the directory itself). With the DIR\_NLINK feature - enabled, ext4 supports more than 64,998 subdirectories by setting this - field to 1 to indicate that the number of hard links is not known. - * - 0x1C - - \_\_le32 - - i\_blocks\_lo - - Lower 32-bits of “block” count. If the huge\_file feature flag is not - set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks - on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in - ``inode.i_flags``, then the file consumes ``i_blocks_lo + (i_blocks_hi - << 32)`` 512-byte blocks on disk. If huge\_file is set and - EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``, then this file - consumes (``i_blocks_lo + i_blocks_hi`` << 32) filesystem blocks on - disk. - * - 0x20 - - \_\_le32 - - i\_flags - - Inode flags. See the table i_flags_ below. - * - 0x24 - - 4 bytes - - i\_osd1 - - See the table i_osd1_ for more details. - * - 0x28 - - 60 bytes - - i\_block[EXT4\_N\_BLOCKS=15] - - Block map or extent tree. See the section “The Contents of inode.i\_block”. - * - 0x64 - - \_\_le32 - - i\_generation - - File version (for NFS). - * - 0x68 - - \_\_le32 - - i\_file\_acl\_lo - - Lower 32-bits of extended attribute block. ACLs are of course one of - many possible extended attributes; I think the name of this field is a - result of the first use of extended attributes being for ACLs. - * - 0x6C - - \_\_le32 - - i\_size\_high / i\_dir\_acl - - Upper 32-bits of file/directory size. In ext2/3 this field was named - i\_dir\_acl, though it was usually set to zero and never used. - * - 0x70 - - \_\_le32 - - i\_obso\_faddr - - (Obsolete) fragment address. - * - 0x74 - - 12 bytes - - i\_osd2 - - See the table i_osd2_ for more details. - * - 0x80 - - \_\_le16 - - i\_extra\_isize - - Size of this inode - 128. Alternately, the size of the extended inode - fields beyond the original ext2 inode, including this field. - * - 0x82 - - \_\_le16 - - i\_checksum\_hi - - Upper 16-bits of the inode checksum. - * - 0x84 - - \_\_le32 - - i\_ctime\_extra - - Extra change time bits. This provides sub-second precision. See Inode - Timestamps section. - * - 0x88 - - \_\_le32 - - i\_mtime\_extra - - Extra modification time bits. This provides sub-second precision. - * - 0x8C - - \_\_le32 - - i\_atime\_extra - - Extra access time bits. This provides sub-second precision. - * - 0x90 - - \_\_le32 - - i\_crtime - - File creation time, in seconds since the epoch. - * - 0x94 - - \_\_le32 - - i\_crtime\_extra - - Extra file creation time bits. This provides sub-second precision. - * - 0x98 - - \_\_le32 - - i\_version\_hi - - Upper 32-bits for version number. - * - 0x9C - - \_\_le32 - - i\_projid - - Project ID. - -.. _i_mode: - -The ``i_mode`` value is a combination of the following flags: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - S\_IXOTH (Others may execute) - * - 0x2 - - S\_IWOTH (Others may write) - * - 0x4 - - S\_IROTH (Others may read) - * - 0x8 - - S\_IXGRP (Group members may execute) - * - 0x10 - - S\_IWGRP (Group members may write) - * - 0x20 - - S\_IRGRP (Group members may read) - * - 0x40 - - S\_IXUSR (Owner may execute) - * - 0x80 - - S\_IWUSR (Owner may write) - * - 0x100 - - S\_IRUSR (Owner may read) - * - 0x200 - - S\_ISVTX (Sticky bit) - * - 0x400 - - S\_ISGID (Set GID) - * - 0x800 - - S\_ISUID (Set UID) - * - - - These are mutually-exclusive file types: - * - 0x1000 - - S\_IFIFO (FIFO) - * - 0x2000 - - S\_IFCHR (Character device) - * - 0x4000 - - S\_IFDIR (Directory) - * - 0x6000 - - S\_IFBLK (Block device) - * - 0x8000 - - S\_IFREG (Regular file) - * - 0xA000 - - S\_IFLNK (Symbolic link) - * - 0xC000 - - S\_IFSOCK (Socket) - -.. _i_flags: - -The ``i_flags`` field is a combination of these values: - -.. list-table:: - :widths: 1 79 - :header-rows: 1 - - * - Value - - Description - * - 0x1 - - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented) - * - 0x2 - - This file should be preserved, should undeletion be desired - (EXT4\_UNRM\_FL). (not implemented) - * - 0x4 - - File is compressed (EXT4\_COMPR\_FL). (not really implemented) - * - 0x8 - - All writes to the file must be synchronous (EXT4\_SYNC\_FL). - * - 0x10 - - File is immutable (EXT4\_IMMUTABLE\_FL). - * - 0x20 - - File can only be appended (EXT4\_APPEND\_FL). - * - 0x40 - - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL). - * - 0x80 - - Do not update access time (EXT4\_NOATIME\_FL). - * - 0x100 - - Dirty compressed file (EXT4\_DIRTY\_FL). (not used) - * - 0x200 - - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used) - * - 0x400 - - Do not compress file (EXT4\_NOCOMPR\_FL). (not used) - * - 0x800 - - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was - EXT4\_ECOMPR\_FL (compression error), which was never used. - * - 0x1000 - - Directory has hashed indexes (EXT4\_INDEX\_FL). - * - 0x2000 - - AFS magic directory (EXT4\_IMAGIC\_FL). - * - 0x4000 - - File data must always be written through the journal - (EXT4\_JOURNAL\_DATA\_FL). - * - 0x8000 - - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4) - * - 0x10000 - - All directory entry data should be written synchronously (see - ``dirsync``) (EXT4\_DIRSYNC\_FL). - * - 0x20000 - - Top of directory hierarchy (EXT4\_TOPDIR\_FL). - * - 0x40000 - - This is a huge file (EXT4\_HUGE\_FILE\_FL). - * - 0x80000 - - Inode uses extents (EXT4\_EXTENTS\_FL). - * - 0x200000 - - Inode stores a large extended attribute value in its data blocks - (EXT4\_EA\_INODE\_FL). - * - 0x400000 - - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL). - (deprecated) - * - 0x01000000 - - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline) - * - 0x04000000 - - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in - mainline) - * - 0x08000000 - - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in - mainline) - * - 0x10000000 - - Inode has inline data (EXT4\_INLINE\_DATA\_FL). - * - 0x20000000 - - Create children with the same project ID (EXT4\_PROJINHERIT\_FL). - * - 0x80000000 - - Reserved for ext4 library (EXT4\_RESERVED\_FL). - * - - - Aggregate flags: - * - 0x4BDFFF - - User-visible flags. - * - 0x4B80FF - - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and - EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's - EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of - these flags in a special manner and they are masked out of the set of - flags that are saved directly to i\_flags. - -.. _i_osd1: - -The ``osd1`` field has multiple meanings depending on the creator: - -Linux: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - l\_i\_version - - Inode version. However, if the EA\_INODE inode flag is set, this inode - stores an extended attribute value and this field contains the upper 32 - bits of the attribute value's reference count. - -Hurd: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - h\_i\_translator - - ?? - -Masix: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le32 - - m\_i\_reserved - - ?? - -.. _i_osd2: - -The ``osd2`` field has multiple meanings depending on the filesystem creator: - -Linux: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - l\_i\_blocks\_high - - Upper 16-bits of the block count. Please see the note attached to - i\_blocks\_lo. - * - 0x2 - - \_\_le16 - - l\_i\_file\_acl\_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). See the Extended Attributes section below. - * - 0x4 - - \_\_le16 - - l\_i\_uid\_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - \_\_le16 - - l\_i\_gid\_high - - Upper 16-bits of the GID. - * - 0x8 - - \_\_le16 - - l\_i\_checksum\_lo - - Lower 16-bits of the inode checksum. - * - 0xA - - \_\_le16 - - l\_i\_reserved - - Unused. - -Hurd: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - h\_i\_reserved1 - - ?? - * - 0x2 - - \_\_u16 - - h\_i\_mode\_high - - Upper 16-bits of the file mode. - * - 0x4 - - \_\_le16 - - h\_i\_uid\_high - - Upper 16-bits of the Owner UID. - * - 0x6 - - \_\_le16 - - h\_i\_gid\_high - - Upper 16-bits of the GID. - * - 0x8 - - \_\_u32 - - h\_i\_author - - Author code? - -Masix: - -.. list-table:: - :widths: 1 1 1 77 - :header-rows: 1 - - * - Offset - - Size - - Name - - Description - * - 0x0 - - \_\_le16 - - h\_i\_reserved1 - - ?? - * - 0x2 - - \_\_u16 - - m\_i\_file\_acl\_high - - Upper 16-bits of the extended attribute block (historically, the file - ACL location). - * - 0x4 - - \_\_u32 - - m\_i\_reserved2[2] - - ?? - -Inode Size -~~~~~~~~~~ - -In ext2 and ext3, the inode structure size was fixed at 128 bytes -(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of -128 bytes. Starting with ext4, it is possible to allocate a larger -on-disk inode at format time for all inodes in the filesystem to provide -space beyond the end of the original ext2 inode. The on-disk inode -record size is recorded in the superblock as ``s_inode_size``. The -number of bytes actually used by struct ext4\_inode beyond the original -128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each -inode, which allows struct ext4\_inode to grow for a new kernel without -having to upgrade all of the on-disk inodes. Access to fields beyond -EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified to be within -``i_extra_isize``. By default, ext4 inode records are 256 bytes, and (as -of October 2013) the inode structure is 156 bytes -(``i_extra_isize = 28``). The extra space between the end of the inode -structure and the end of the inode record can be used to store extended -attributes. Each inode record can be as large as the filesystem block -size, though this is not terribly efficient. - -Finding an Inode -~~~~~~~~~~~~~~~~ - -Each block group contains ``sb->s_inodes_per_group`` inodes. Because -inode 0 is defined not to exist, this formula can be used to find the -block group that an inode lives in: -``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode -can be found within the block group's inode table at -``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte -address within the inode table, use -``offset = index * sb->s_inode_size``. - -Inode Timestamps -~~~~~~~~~~~~~~~~ - -Four timestamps are recorded in the lower 128 bytes of the inode -structure -- inode change time (ctime), access time (atime), data -modification time (mtime), and deletion time (dtime). The four fields -are 32-bit signed integers that represent seconds since the Unix epoch -(1970-01-01 00:00:00 GMT), which means that the fields will overflow in -January 2038. For inodes that are not linked from any directory but are -still open (orphan inodes), the dtime field is overloaded for use with -the orphan list. The superblock field ``s_last_orphan`` points to the -first inode in the orphan list; dtime is then the number of the next -orphaned inode, or zero if there are no more orphans. - -If the inode structure size ``sb->s_inode_size`` is larger than 128 -bytes and the ``i_inode_extra`` field is large enough to encompass the -respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime -inode fields are widened to 64 bits. Within this “extra” 32-bit field, -the lower two bits are used to extend the 32-bit seconds field to be 34 -bit wide; the upper 30 bits are used to provide nanosecond timestamp -accuracy. Therefore, timestamps should not overflow until May 2446. -dtime was not widened. There is also a fifth timestamp to record inode -creation time (crtime); this field is 64-bits wide and decoded in the -same manner as 64-bit [cma]time. Neither crtime nor dtime are accessible -through the regular stat() interface, though debugfs will report them. - -We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)). -In other words: - -.. list-table:: - :widths: 20 20 20 20 20 - :header-rows: 1 - - * - Extra epoch bits - - MSB of 32-bit time - - Adjustment for signed 32-bit to 64-bit tv\_sec - - Decoded 64-bit tv\_sec - - valid time range - * - 0 0 - - 1 - - 0 - - ``-0x80000000 - -0x00000001`` - - 1901-12-13 to 1969-12-31 - * - 0 0 - - 0 - - 0 - - ``0x000000000 - 0x07fffffff`` - - 1970-01-01 to 2038-01-19 - * - 0 1 - - 1 - - 0x100000000 - - ``0x080000000 - 0x0ffffffff`` - - 2038-01-19 to 2106-02-07 - * - 0 1 - - 0 - - 0x100000000 - - ``0x100000000 - 0x17fffffff`` - - 2106-02-07 to 2174-02-25 - * - 1 0 - - 1 - - 0x200000000 - - ``0x180000000 - 0x1ffffffff`` - - 2174-02-25 to 2242-03-16 - * - 1 0 - - 0 - - 0x200000000 - - ``0x200000000 - 0x27fffffff`` - - 2242-03-16 to 2310-04-04 - * - 1 1 - - 1 - - 0x300000000 - - ``0x280000000 - 0x2ffffffff`` - - 2310-04-04 to 2378-04-22 - * - 1 1 - - 0 - - 0x300000000 - - ``0x300000000 - 0x37fffffff`` - - 2378-04-22 to 2446-05-10 - -This is a somewhat odd encoding since there are effectively seven times -as many positive values as negative values. There have also been -long-standing bugs decoding and encoding dates beyond 2038, which don't -seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels -incorrectly use the extra epoch bits 1,1 for dates between 1901 and -1970. At some point the kernel will be fixed and e2fsck will fix this -situation, assuming that it is run before 2310. |