XFS Optimization and Troubleshooting

● cpx May 25, 2026 8 min read File Systems xfs

XFS — A Reference Architecture

A practical, layered view of the XFS filesystem: what it is, how it is structured on disk, and the tradeoffs that fall out of those design choices. Aimed at engineers and architects choosing a filesystem for production workloads — not at kernel hackers, but deep enough to be useful when something goes wrong.

1. Overview

XFS is a 64-bit journaling filesystem originally developed by SGI for IRIX in 1993 and ported to Linux in 2001. It has been the default filesystem in RHEL/CentOS/Rocky/Alma since RHEL 7 and is a first-class option across enterprise Linux. Its design priorities, in order, are scalability, parallelism, and metadata performance — particularly for large files and large filesystems.

It is mature, well-supported, and conservative in ways that matter for production: the on-disk format is stable, recovery semantics are well understood, and the tooling around it (xfs_repair, xfs_db, xfs_io) is unusually capable.

2. Design Principles

Principle	What it means in practice
Parallelism via Allocation Groups	The filesystem is partitioned into independent regions, each with its own metadata, allowing concurrent allocation and I/O
B+ trees everywhere	Free space, inode allocation, directory entries, extent maps, reverse maps, reference counts — all O(log n) lookups
Extent-based allocation	Variable-length runs of contiguous blocks rather than per-block bitmaps
Metadata journaling	Only metadata is journalled; data integrity relies on ordering, CRC, and the application
Delayed allocation	Block allocation deferred until writeback to maximise contiguity
64-bit throughout	Block addresses, inode numbers, file offsets — all 64-bit, no scaling cliffs
Online operations	Grow, defragment, partial repair without unmount

3. Top-Level Filesystem Layout

An XFS filesystem is composed of three logical regions:

Region	Required	Purpose
Data section	Yes	Allocation Groups (AGs) — all user data and most metadata
Log	Yes	Metadata write-ahead log; internal (within the data section) or external (separate device)
Real-time section	No	Optional secondary region with deterministic allocation and no AG overhead

The data section dominates. The log is small (typically tens to hundreds of MiB). The real-time section is rare outside specialised media workloads.

4. Allocation Groups

The Allocation Group is XFS’s fundamental architectural unit. The data section is divided into N AGs, each between 16 MiB and 1 TiB in size. mkfs.xfs picks an AG count that balances parallelism against overhead — typically 4 AGs for filesystems up to ~16 TiB, more for larger devices.

Each AG is self-contained: it has its own free-space tracking, its own inode allocation, and its own copy of the superblock. Independent threads can allocate from different AGs without contending on global structures. This is the single design choice that most defines XFS’s behaviour.

4.1 AG Headers

Every AG begins with a small set of headers occupying the first few blocks:

Header	Purpose
Superblock (SB)	Primary copy lives in AG 0; every other AG holds a backup copy used by `xfs_repair`
AGF	Root pointers for the free-space B+ trees
AGI	Root pointers for the inode B+ trees
AGFL	Free list — a small pool of pre-reserved blocks for B+ tree operations, so tree splits never deadlock waiting for free space

4.2 AG-Local B+ Trees

Tree	Indexed by	Purpose
bnobt	Starting block number	Find free extents adjacent to a target location
cntbt	Extent length	Find a free extent of at least N blocks
inobt	Inode number	Locate any inode chunk
finobt	Inode number (free chunks only)	Fast allocation of new inodes (v5)
rmapbt	Owner	Reverse mapping — given any block, find what owns it (v5)
refcountbt	Block number	Reference counts for reflink-shared extents (v5)

The two free-space trees (bnobt and cntbt) index the same data in two different ways — one for spatial locality, one for size matching — so the allocator can answer both “place this near block X” and “find me a free extent of at least N blocks” without scanning.

5. Inodes

Inodes are allocated dynamically, in chunks of 64, as files are created. A fresh filesystem carries no inode tax; maximum file count is bounded by free space rather than by a mkfs-time decision.

Property	Value
Default inode size	512 bytes (v5)
Allocation unit	Chunks of 64 inodes
Inode number	Encodes AG number + AG-relative offset
Forks per inode	Data fork (always), attribute fork, optional CoW fork (v5 reflink)

5.1 Inode Formats

An inode’s data fork can take one of three formats depending on file size and type:

Format	Used when	Storage
Local	Small files, short symlinks, small directories	Data lives inline in the inode itself
Extents	Up to roughly 19 extents	Extent records packed into the inode
B+ tree	More than fits inline	Inode holds the root of an extent B+ tree

The “local” format is why XFS handles symlinks and small directories with no extra block reads — they fit entirely within the inode that describes them.

6. Extents and Block Mapping

XFS does not maintain per-block allocation bitmaps. Instead, files and free space are described as extents — (start_block, length, offset, flags) tuples. A 1 GiB contiguous file uses one extent record; a heavily fragmented 1 GiB file might use thousands. With delayed allocation and healthy free space, extent counts stay low and metadata overhead is small.

Reading or writing a sparse 4 KiB region of a 100 GiB file involves a B+ tree lookup in the inode’s extent map and then the I/O. No bitmap walk, no indirect-block chain.

7. The Log

XFS uses a physical metadata journal. Every metadata change writes a pre-image and post-image to the log; on crash, the log is replayed to bring metadata to a consistent state.

Aspect	Detail
Type	Metadata only — data blocks are never logged
Location	Internal (default, in the data section) or external (separate device, `-l logdev=`)
In-memory buffers	8 by default (`logbufs=`), 32 KiB each (`logbsize=`)
Sizing	`mkfs` auto-sizes (typically ~0.1% of the filesystem, capped); larger logs help metadata-heavy workloads
Recovery	Automatic on mount; replays committed transactions

An external log on a fast device (NVMe, NVRAM, or a battery-backed write-cache) reduces metadata commit latency dramatically for workloads with synchronous fsync() patterns — databases, mail servers, message brokers.

8. Delayed Allocation

Writes land in the page cache without block allocation. Allocation is deferred until writeback, at which point the allocator sees the full size of the dirty region and can pick a contiguous extent.

This is the primary reason XFS produces highly contiguous files on healthy filesystems — but it widens the window in which a power failure can leave a recently written file as zero-length (the inode was created but blocks were never allocated). Applications that need durability must call fsync(). This was the source of the infamous “XFS zeroes my files” complaints from a decade ago, and is the reason ext4 shipped auto_da_alloc to mimic some of XFS’s behaviour while papering over naive write-then-rename patterns.

9. v5 (CRC) Features

Filesystems formatted with crc=1 (default since 2014) use the v5 on-disk format, which adds:

Feature	Benefit
Per-metadata-block CRC32c	Detects silent corruption in all metadata blocks
Per-block UUIDs	Catches blocks accidentally written to the wrong filesystem
Block ownership in headers	Catches metadata-block confusion
Sparse inodes (`spinodes`)	Allocates partial 64-inode chunks when free space is fragmented
`finobt`	Fast free-inode lookup; accelerates create-heavy workloads
`rmapbt`	Reverse mapping; foundation for online repair and reflink
Reflink	Copy-on-write data sharing between files — powers `cp --reflink`, snapshots, dedupe
Bigtime (kernel ≥ 5.10)	Timestamps extend beyond 2038
`inobtcount`, `nrext64`	Faster mount; larger per-inode extent counts

If you are formatting a filesystem today, v5 is the only sensible choice. The CRC overhead is negligible; the corruption-detection gain is substantial.

10. Operational Tooling

Tool	Purpose
`mkfs.xfs`	Format-time configuration: block size, AG count/size, log size/location, inode size, feature flags
`xfs_info`	Display the geometry of a mounted filesystem
`xfs_growfs`	Online grow (XFS cannot shrink)
`xfs_repair`	Offline check and repair — the filesystem must be unmounted
`xfs_scrub`	Online metadata verification (kernel + userspace, v5 only)
`xfs_db`	Read/write debug access to on-disk structures — forensic, dangerous
`xfs_io`	Userspace harness for I/O syscalls — `fallocate`, hole punch, reflink, direct I/O
`xfs_quota`	User/group/project quota management
`xfs_fsr`	Online filesystem reorganiser — defragmenter
`xfs_bmap`	Show a file’s extent map
`xfsdump` / `xfsrestore`	Native backup with full attribute and ACL fidelity

11. Performance Topology

Knob	Effect
AG count	Parallelism ceiling — too few causes lock contention, too many wastes overhead
Log size and location	Bigger log helps metadata-heavy workloads; external log helps `fsync`-heavy workloads
`allocsize` mount option	Speculative preallocation size — larger reduces fragmentation but holds free space
`logbsize`, `logbufs`	Log buffer count and size — tune for high metadata throughput
`inode64`	Allow inode allocation across the whole filesystem (default on modern kernels)
`noatime` / `relatime`	Skip atime updates — reduces metadata writes
Real-time subvolume	Bypass AG metadata path for predictable latency (rare in general workloads)

A useful rule of thumb: workloads that bottleneck on metadata operations want more AGs and a larger log. Workloads that bottleneck on streaming I/O want fewer, larger AGs and aligned stripe parameters (sunit / swidth) matching the RAID geometry underneath.

12. When XFS Fits — and When It Doesn’t

Good fit	Poor fit
Large files (media, scientific datasets, VM images)	Many tiny files in a tiny filesystem (overhead dominates)
Large filesystems (multi-TB, multi-PB)	Workloads that need shrink
High-parallelism workloads (databases, file servers, container hosts)	Write-once-then-delete with deep nesting and churn
Streaming and sequential I/O	Workloads needing native filesystem-level encryption (use dm-crypt below XFS)
RHEL-family environments where it is the default	Workloads needing native transparent compression (use Btrfs or ZFS)
Reflink-based snapshots and dedupe	Workloads where COW everywhere is desired (use Btrfs or ZFS)

13. Security Posture

Concern	XFS position
Silent corruption	Mitigated by v5 CRC for metadata; data blocks rely on the application or the device layer for integrity
Encryption	No native `fscrypt`. Encrypt at the block layer with LUKS / dm-crypt, or at the application layer
Quotas	User, group, and project (directory-tree) quotas — useful for multi-tenant containers and shared storage
Per-inode flags	`append-only`, `immutable`, `no-dump`, `sync` — supported via `chattr`
Xattrs and ACLs	Full POSIX ACL and extended attribute support
Audit	Metadata operations are journalled, but the log is for crash consistency, not audit retention; pair with `auditd` for security audit
Reflink and integrity	Reflink shares blocks via the reference-count B+ tree; a write to one of the sharing files triggers CoW and severs the share, so reflink itself does not expose data between files

14. Architectural Diagram

Xfs Reference Architecture

15. Further Reading

XFS Algorithms & Data Structures — Dave Chinner et al., the canonical on-disk format reference, maintained in the kernel tree under Documentation/filesystems/xfs/
man 5 xfs and man 8 mkfs.xfs — succinct and accurate
xfsprogs source — xfs_db in particular is the fastest way to develop intuition about the on-disk structures
The fstests (xfstests) suite — the regression battery used to qualify every XFS change

Reference architecture, May 2026. Verify version-specific behaviour against the kernel and xfsprogs versions in your environment.

0 0 votes

Article Rating

0 Comments

Oldest

Newest Most Voted