XFS Optimization and Troubleshooting

cpx May 25, 2026 8 min read File Systems xfs

XFS — A Reference Architecture

A practical, layered view of the XFS filesystem: what it is, how it is structured on disk, and the tradeoffs that fall out of those design choices. Aimed at engineers and architects choosing a filesystem for production workloads — not at kernel hackers, but deep enough to be useful when something goes wrong.


1. Overview

XFS is a 64-bit journaling filesystem originally developed by SGI for IRIX in 1993 and ported to Linux in 2001. It has been the default filesystem in RHEL/CentOS/Rocky/Alma since RHEL 7 and is a first-class option across enterprise Linux. Its design priorities, in order, are scalability, parallelism, and metadata performance — particularly for large files and large filesystems.

It is mature, well-supported, and conservative in ways that matter for production: the on-disk format is stable, recovery semantics are well understood, and the tooling around it (xfs_repair, xfs_db, xfs_io) is unusually capable.


2. Design Principles

Principle | What it means in practice
Parallelism via Allocation Groups | The filesystem is partitioned into independent regions, each with its own metadata, allowing concurrent allocation and I/O
B+ trees everywhere | Free space, inode allocation, directory entries, extent maps, reverse maps, reference counts: all O(log n) lookups
Extent-based allocation | Variable-length runs of contiguous blocks rather than per-block bitmaps
Metadata journaling | Only metadata is journalled; data integrity relies on write ordering, fsync(), and the application or device layer
Delayed allocation | Block allocation deferred until writeback to maximise contiguity
64-bit throughout | Block addresses, inode numbers, file offsets: all 64-bit, no scaling cliffs
Online operations | Grow, defragment, partial repair without unmount

3. Top-Level Filesystem Layout

An XFS filesystem is composed of three logical regions:

Region | Required | Purpose
Data section | Yes | Allocation Groups (AGs): all user data and most metadata
Log | Yes | Metadata write-ahead log; internal (within the data section) or external (separate device)
Real-time section | No | Optional secondary region with deterministic allocation and no AG overhead

The data section dominates. The log is small (typically tens to hundreds of MiB). The real-time section is rare outside specialised media workloads.


4. Allocation Groups

The Allocation Group is XFS’s fundamental architectural unit. The data section is divided into N AGs, each between 16 MiB and 1 TiB in size. mkfs.xfs picks an AG count that balances parallelism against overhead: by default, typically 4 AGs for filesystems up to about 4 TiB and roughly one AG per TiB beyond that, with more AGs when striped (RAID) storage is detected.

Each AG is self-contained: it has its own free-space tracking, its own inode allocation, and its own copy of the superblock. Independent threads can allocate from different AGs without contending on global structures. This is the single design choice that most defines XFS’s behaviour.
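
As a rough illustration of the arithmetic only, the sketch below splits a data section into allocation groups of a chosen size. The function name and sample sizes are hypothetical, and mkfs.xfs applies further heuristics (stripe geometry, a minimum size for the final AG) that are not modelled here.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

def split_into_ags(data_bytes, ag_bytes):
    """Return (ag_count, last_ag_bytes) for a data section of data_bytes.

    Hypothetical helper: every AG has ag_bytes except possibly the last,
    which holds whatever remains.
    """
    if not 16 * MIB <= ag_bytes <= 1 * TIB:
        raise ValueError("AG size must be between 16 MiB and 1 TiB")
    ag_count = -(-data_bytes // ag_bytes)          # integer ceiling division
    last_ag_bytes = data_bytes - (ag_count - 1) * ag_bytes
    return ag_count, last_ag_bytes

# A 2 TiB data section with 512 GiB AGs yields four equal AGs.
count, last = split_into_ags(2 * TIB, 512 * GIB)
print(count, last // GIB)
```

Each of the resulting AGs would carry its own headers and B+ trees, which is what allows allocations in different AGs to proceed independently.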

4.1 AG Headers

Every AG begins with a small set of headers occupying the first few blocks:

Header | Purpose
Superblock (SB) | Primary copy lives in AG 0; every other AG holds a backup copy used by xfs_repair
AGF | Root pointers for the free-space B+ trees
AGI | Root pointers for the inode B+ trees
AGFL | Free list: a small pool of pre-reserved blocks for B+ tree operations, so tree splits never deadlock waiting for free space

4.2 AG-Local B+ Trees

Tree | Indexed by | Purpose
bnobt | Starting block number | Find free extents adjacent to a target location
cntbt | Extent length | Find a free extent of at least N blocks
inobt | Inode number | Locate any inode chunk
finobt | Inode number (free chunks only) | Fast allocation of new inodes (v5)
rmapbt | Starting block number (then owner and offset) | Reverse mapping: given any block, find what owns it (v5)
refcountbt | Block number | Reference counts for reflink-shared extents (v5)

The two free-space trees (bnobt and cntbt) index the same data in two different ways — one for spatial locality, one for size matching — so the allocator can answer both “place this near block X” and “find me a free extent of at least N blocks” without scanning.
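
The dual-index idea can be sketched in a few lines. This is a toy model, not the kernel's allocator: sorted Python lists stand in for the two B+ trees, and the class and method names are invented for illustration.

```python
import bisect

class FreeSpaceIndex:
    """Toy model of one AG's free extents, indexed two ways."""

    def __init__(self, extents):
        extents = list(extents)                              # (start_block, length) pairs
        self.by_block = sorted(extents)                      # bnobt-like: keyed by start block
        self.by_size = sorted((l, s) for s, l in extents)    # cntbt-like: keyed by length

    def nearest(self, target_block):
        """Free extent whose start block is closest to target_block."""
        i = bisect.bisect_left(self.by_block, (target_block, 0))
        candidates = self.by_block[max(i - 1, 0): i + 1]     # neighbours on either side
        return min(candidates, key=lambda e: abs(e[0] - target_block), default=None)

    def at_least(self, nblocks):
        """Smallest free extent with length >= nblocks, as (start, length)."""
        i = bisect.bisect_left(self.by_size, (nblocks, 0))
        if i == len(self.by_size):
            return None                                      # nothing large enough
        length, start = self.by_size[i]
        return (start, length)

idx = FreeSpaceIndex([(100, 8), (500, 64), (900, 16)])
print(idx.nearest(520))    # locality query
print(idx.at_least(10))    # size query
```

Both queries are binary searches over the same set of extents; neither scans the whole list, which is the property the two trees provide on disk.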


5. Inodes

Inodes are allocated dynamically, in chunks of 64, as files are created. A fresh filesystem carries no inode tax; maximum file count is bounded by free space rather than by a mkfs-time decision.

Property | Value
Default inode size | 512 bytes (v5)
Allocation unit | Chunks of 64 inodes
Inode number | Encodes AG number + AG-relative offset
Forks per inode | Data fork (always), attribute fork, optional CoW fork (v5 reflink)
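
A minimal sketch of how an inode number can be split back into its parts, assuming the usual layout of AG number in the high bits, then the block within the AG, then the inode's slot within that block. The parameter names mirror the superblock fields sb_agblklog and sb_inopblog; the sample values are hypothetical.

```python
def make_inode_number(ag_number, ag_block, slot, agblklog, inopblog):
    """Pack (AG number, block within AG, slot within block) into one integer."""
    return (ag_number << (agblklog + inopblog)) | (ag_block << inopblog) | slot

def split_inode_number(ino, agblklog, inopblog):
    """Inverse of make_inode_number: return (ag_number, ag_block, slot)."""
    slot = ino & ((1 << inopblog) - 1)
    ag_block = (ino >> inopblog) & ((1 << agblklog) - 1)
    ag_number = ino >> (agblklog + inopblog)
    return ag_number, ag_block, slot

# Assumed geometry: 4 KiB blocks with 512-byte inodes gives 8 inodes per
# block (inopblog = 3); agblklog = 16 is an arbitrary example value.
ino = make_inode_number(2, 24, 5, agblklog=16, inopblog=3)
print(ino, split_inode_number(ino, agblklog=16, inopblog=3))
```

Because the AG number is recoverable from the inode number alone, locating an inode never requires a global lookup table.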

5.1 Inode Formats

An inode’s data fork can take one of three formats depending on file size and type:

Format | Used when | Storage
Local | Short symlinks, small directories (regular file data is not stored inline) | Data lives inline in the inode itself
Extents | Up to roughly 19 extents | Extent records packed into the inode
B+ tree | More than fits inline | Inode holds the root of an extent B+ tree

The “local” format is why XFS handles symlinks and small directories with no extra block reads — they fit entirely within the inode that describes them.
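
The choice between the three formats comes down to what fits in the space left after the inode core. The sketch below models that decision under stated assumptions: a 512-byte inode, a 176-byte v5 inode core, 16-byte extent records, and local format reserved for symlinks and directories. The function and its arguments are hypothetical.

```python
INODE_SIZE = 512      # assumed default v5 inode size, in bytes
INODE_CORE = 176      # assumed size of the v5 inode core, in bytes
EXTENT_RECORD = 16    # assumed size of one packed extent record, in bytes

def data_fork_format(kind, payload_bytes, extent_count, attr_fork_bytes=0):
    """Pick 'local', 'extents', or 'btree' for a data fork (toy model).

    attr_fork_bytes is space reserved for the attribute fork, which
    shrinks what the data fork can use.
    """
    fork_space = INODE_SIZE - INODE_CORE - attr_fork_bytes
    if kind in ("symlink", "directory") and payload_bytes <= fork_space:
        return "local"
    if extent_count * EXTENT_RECORD <= fork_space:
        return "extents"
    return "btree"

print(data_fork_format("symlink", 20, 0))
print(data_fork_format("file", 10**9, 19, attr_fork_bytes=32))
print(data_fork_format("file", 10**9, 20, attr_fork_bytes=32))
```

Under these assumptions the inline extent limit moves with the attribute fork: with no attribute fork 21 records fit, and reserving 32 bytes for attributes brings the limit down to 19.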


6. Extents and Block Mapping

XFS does not maintain per-block allocation bitmaps. Instead, files and free space are described as extents: (start_block, length, offset, flags) tuples. A 1 GiB contiguous file uses one extent record; a heavily fragmented 1 GiB file might use thousands. With delayed allocation and healthy free space, extent counts stay low and metadata overhead is small.

Reading or writing a sparse 4 KiB region of a 100 GiB file involves a B+ tree lookup in the inode’s extent map and then the I/O. No bitmap walk, no indirect-block chain.
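
The lookup can be sketched as a binary search over extent records sorted by file offset. This is a simplified model: a sorted list stands in for the inode's extent B+ tree, and the Extent type and lookup function are illustrative names.

```python
import bisect
from typing import NamedTuple, Optional

class Extent(NamedTuple):
    file_offset: int   # first file block covered by this extent
    start_block: int   # first disk block backing it
    length: int        # number of contiguous blocks

def lookup(extents, file_block) -> Optional[int]:
    """Map a file block to a disk block, or None if it falls in a hole.

    extents must be sorted by file_offset and must not overlap.
    """
    # A real B+ tree keeps its keys ordered; the list is rebuilt here for brevity.
    offsets = [e.file_offset for e in extents]
    i = bisect.bisect_right(offsets, file_block) - 1
    if i < 0:
        return None                      # before the first extent
    e = extents[i]
    if file_block < e.file_offset + e.length:
        return e.start_block + (file_block - e.file_offset)
    return None                          # past the end of the nearest extent: a hole

sparse_file = [Extent(0, 1000, 4), Extent(100, 5000, 10)]
print(lookup(sparse_file, 105), lookup(sparse_file, 50))
```

A sparse file simply has gaps between its extent records, so a read of an unmapped region is answered by the lookup itself without touching disk.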


7. The Log

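XFS uses a write-ahead metadata journal that combines logical and physical logging. Each transaction records the new contents of the metadata it modified (changed regions of buffers, plus logical records for items such as inodes); no pre-images are kept. On crash, the log is replayed to bring metadata to a consistent state.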

Aspect | Detail
Type | Metadata only; data blocks are never logged
Location | Internal (default, in the data section) or external (separate device, -l logdev=)
In-memory buffers | 8 by default (logbufs=), 32 KiB each (logbsize=)
Sizing | mkfs auto-sizes (typically ~0.1% of the filesystem, capped); larger logs help metadata-heavy workloads
Recovery | Automatic on mount; replays committed transactions
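
Replaying committed transactions can be illustrated with a toy redo log. This is a conceptual sketch, not the XFS log format: the record shapes and the metadata key are invented, and the model records only new values.

```python
def replay(log, metadata):
    """Return the metadata state after replaying committed transactions.

    log is a list of records in write order:
      ("change", txid, key, new_value)  - a logged modification
      ("commit", txid)                  - marks the transaction as durable
    Changes from transactions without a commit record are ignored.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = dict(metadata)
    for rec in log:
        if rec[0] == "change" and rec[1] in committed:
            _, _txid, key, value = rec
            state[key] = value
    return state

log = [
    ("change", 1, "free_blocks", 90),
    ("commit", 1),
    ("change", 2, "free_blocks", 80),   # crash before commit: discarded
]
print(replay(log, {"free_blocks": 100}))
```

The uncommitted second transaction leaves no trace after recovery, which is what keeps metadata consistent even though the crash interrupted it.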

An external log on a fast device (NVMe, NVRAM, or a battery-backed write-cache) reduces metadata commit latency dramatically for workloads with synchronous fsync() patterns — databases, mail servers, message brokers.


8. Delayed Allocation

Writes land in the page cache without block allocation. Allocation is deferred until writeback, at which point the allocator sees the full size of the dirty region and can pick a contiguous extent.

This is the primary reason XFS produces highly contiguous files on healthy filesystems, but it widens the window in which a power failure can leave a recently written file as zero-length (the inode was created but blocks were never allocated). Applications that need durability must call fsync(). This was the source of the infamous “XFS zeroes my files” complaints of the 2000s. The same hazard is why ext4, which also uses delayed allocation, shipped auto_da_alloc to paper over naive write-then-rename patterns.
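
A minimal sketch of the durable write-then-rename pattern on a POSIX system, using only standard-library calls. The function name and the ".tmp" suffix are illustrative choices; real applications may also want unique temporary names and error cleanup.

```python
import os

def durable_replace(path, data):
    """Atomically replace path with data, surviving a crash at any point."""
    tmp = path + ".tmp"                      # hypothetical temp-file naming
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                            # push Python's buffer to the kernel
        os.fsync(f.fileno())                 # force data blocks to stable storage
    os.replace(tmp, path)                    # atomic rename over the old file
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The fsync() before the rename is the step naive code omits: without it, the rename can reach the log before the data reaches disk, and a crash leaves the new name pointing at an empty file.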


9. v5 (CRC) Features

Filesystems formatted with crc=1 (the mkfs.xfs default since xfsprogs 3.2.3 in 2015) use the v5 on-disk format, which adds:

Feature | Benefit
Per-metadata-block CRC32c | Detects silent corruption in all metadata blocks
Per-block UUIDs | Catches blocks accidentally written to the wrong filesystem
Block ownership in headers | Catches metadata-block confusion
Sparse inodes (spinodes) | Allocates partial 64-inode chunks when free space is fragmented
finobt | Fast free-inode lookup; accelerates create-heavy workloads
rmapbt | Reverse mapping; foundation for online scrub and repair
Reflink | Copy-on-write data sharing between files (built on refcountbt); powers cp --reflink, snapshots, dedupe
Bigtime (kernel ≥ 5.10) | Timestamps extend beyond 2038
inobtcount, nrext64 | Faster mount; larger per-inode extent counts

If you are formatting a filesystem today, v5 is the only sensible choice. The CRC overhead is negligible; the corruption-detection gain is substantial.
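
To make the corruption-detection idea concrete, the sketch below seals a toy metadata block with a CRC32c and then detects a single flipped bit. Python's standard library has no CRC32c, so a slow bitwise implementation is included. The block layout (a 4-byte CRC field at offset 4) is invented for the example and is not the XFS header format.

```python
CRC_OFFSET = 4   # hypothetical position of the CRC field in this toy block

def crc32c(data):
    """Bitwise CRC32c (Castagnoli polynomial, reflected form 0x82F63B78)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def seal(block):
    """Compute the CRC with the CRC field zeroed, then store it in place."""
    block[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00" * 4
    block[CRC_OFFSET:CRC_OFFSET + 4] = crc32c(bytes(block)).to_bytes(4, "little")

def verify(block):
    """Recompute the CRC over a copy with the field zeroed and compare."""
    stored = int.from_bytes(block[CRC_OFFSET:CRC_OFFSET + 4], "little")
    scratch = bytearray(block)
    scratch[CRC_OFFSET:CRC_OFFSET + 4] = b"\x00" * 4
    return crc32c(bytes(scratch)) == stored

block = bytearray(b"TOY1" + b"\x00" * 4 + b"metadata payload".ljust(56, b"\x00"))
seal(block)
print(verify(block))     # intact block
block[20] ^= 0x01        # simulate a single-bit flip on disk
print(verify(block))     # corruption detected
```

The check runs on every metadata read, so a damaged block is rejected at the point of use rather than silently propagating into the filesystem's structures.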


10. Operational Tooling

Tool | Purpose
mkfs.xfs | Format-time configuration: block size, AG count/size, log size/location, inode size, feature flags
xfs_info | Display the geometry of a mounted filesystem
xfs_growfs | Online grow (XFS cannot shrink)
xfs_repair | Offline check and repair; the filesystem must be unmounted
xfs_scrub | Online metadata verification (kernel + userspace, v5 only)
xfs_db | Read/write debug access to on-disk structures; forensic, dangerous
xfs_io | Userspace harness for I/O syscalls: fallocate, hole punch, reflink, direct I/O
xfs_quota | User/group/project quota management
xfs_fsr | Online filesystem reorganiser (defragmenter)
xfs_bmap | Show a file’s extent map
xfsdump / xfsrestore | Native backup with full attribute and ACL fidelity

11. Performance Topology

Knob | Effect
AG count | Parallelism ceiling: too few causes lock contention, too many wastes overhead
Log size and location | Bigger log helps metadata-heavy workloads; external log helps fsync-heavy workloads
allocsize mount option | Speculative preallocation size; larger reduces fragmentation but holds free space
logbsize, logbufs | Log buffer size and count; tune for high metadata throughput
inode64 | Allow inode allocation across the whole filesystem (default on modern kernels)
noatime / relatime | Skip atime updates; reduces metadata writes
Real-time subvolume | Bypass AG metadata path for predictable latency (rare in general workloads)

A useful rule of thumb: workloads that bottleneck on metadata operations want more AGs and a larger log. Workloads that bottleneck on streaming I/O want fewer, larger AGs and aligned stripe parameters (sunit / swidth) matching the RAID geometry underneath.
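
The stripe arithmetic can be sketched as follows, assuming the stripe unit equals the RAID chunk size and the stripe width spans the data-bearing disks only. The helper is hypothetical; it reports both the byte/multiplier form (su, sw) and the 512-byte-sector form (sunit, swidth), and the result should be checked against man 8 mkfs.xfs for the xfsprogs version in use.

```python
SECTOR = 512   # sunit and swidth are expressed in 512-byte units

def stripe_geometry(chunk_bytes, total_disks, parity_disks):
    """Derive stripe parameters from RAID geometry (illustrative helper)."""
    if chunk_bytes % SECTOR:
        raise ValueError("chunk size must be a multiple of 512 bytes")
    data_disks = total_disks - parity_disks
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = chunk_bytes // SECTOR
    return {
        "su_bytes": chunk_bytes,          # stripe unit in bytes
        "sw": data_disks,                 # stripe width as a multiple of su
        "sunit": sunit,                   # stripe unit in 512-byte sectors
        "swidth": sunit * data_disks,     # full stripe in 512-byte sectors
    }

# Example: a 10-disk RAID 6 (two parity disks) with a 256 KiB chunk.
print(stripe_geometry(256 * 1024, total_disks=10, parity_disks=2))
```

Aligning allocations to a full stripe lets the RAID layer write parity without a read-modify-write cycle, which is where most of the streaming-write benefit comes from.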


12. When XFS Fits — and When It Doesn’t

Good fit | Poor fit
Large files (media, scientific datasets, VM images) | Many tiny files in a tiny filesystem (overhead dominates)
Large filesystems (multi-TB, multi-PB) | Workloads that need shrink
High-parallelism workloads (databases, file servers, container hosts) | Write-once-then-delete with deep nesting and churn
Streaming and sequential I/O | Workloads needing native filesystem-level encryption (use dm-crypt below XFS)
RHEL-family environments where it is the default | Workloads needing native transparent compression (use Btrfs or ZFS)
Reflink-based snapshots and dedupe | Workloads where COW everywhere is desired (use Btrfs or ZFS)

13. Security Posture

Concern | XFS position
Silent corruption | Mitigated by v5 CRC for metadata; data blocks rely on the application or the device layer for integrity
Encryption | No native fscrypt. Encrypt at the block layer with LUKS / dm-crypt, or at the application layer
Quotas | User, group, and project (directory-tree) quotas; useful for multi-tenant containers and shared storage
Per-inode flags | append-only, immutable, no-dump, sync; supported via chattr
Xattrs and ACLs | Full POSIX ACL and extended attribute support
Audit | Metadata operations are journalled, but the log is for crash consistency, not audit retention; pair with auditd for security audit
Reflink and integrity | Reflink shares blocks via the reference-count B+ tree; a write to one of the sharing files triggers CoW and severs the share, so reflink itself does not expose data between files
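
The reflink isolation property can be modelled with a toy reference-counted block store. This is a conceptual sketch with invented names, one block per file for brevity; a dictionary stands in for the reference-count B+ tree.

```python
class ToyReflinkStore:
    """Files share blocks until one of them writes (copy-on-write)."""

    def __init__(self):
        self.blocks = {}     # block id -> contents
        self.refcount = {}   # block id -> number of files pointing at it
        self.files = {}      # file name -> block id
        self._next_id = 0

    def _alloc(self, data):
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 1
        return block_id

    def create(self, name, data):
        self.files[name] = self._alloc(data)

    def reflink(self, src, dst):
        block_id = self.files[src]
        self.refcount[block_id] += 1          # share, do not copy
        self.files[dst] = block_id

    def write(self, name, data):
        block_id = self.files[name]
        if self.refcount[block_id] > 1:       # shared: copy on write
            self.refcount[block_id] -= 1
            self.files[name] = self._alloc(data)
        else:                                 # exclusive: overwrite in place
            self.blocks[block_id] = data

    def read(self, name):
        return self.blocks[self.files[name]]

store = ToyReflinkStore()
store.create("a", b"original")
store.reflink("a", "b")
store.write("b", b"changed")
print(store.read("a"), store.read("b"))
```

After the write, the two files no longer share a block, and the original file's contents are untouched.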

14. Architectural Diagram

[Figure: XFS Reference Architecture]

15. Further Reading

  • XFS Algorithms & Data Structures: Dave Chinner et al., the canonical on-disk format reference, maintained in the xfs-documentation repository; related design documents live in the kernel tree under Documentation/filesystems/xfs/
  • man 5 xfs and man 8 mkfs.xfs — succinct and accurate
  • xfsprogs source — xfs_db in particular is the fastest way to develop intuition about the on-disk structures
  • The fstests (xfstests) suite — the regression battery used to qualify every XFS change

Reference architecture, May 2026. Verify version-specific behaviour against the kernel and xfsprogs versions in your environment.
