Stratis Description

Dennis Keefe, Stratis Team

Stratis is a tool to easily configure pools and filesystems with enhanced storage functionality that works within the existing Linux storage management stack. To achieve this, Stratis prioritizes a straightforward command-line experience, a rich API, and a fully automated approach to storage management. It builds upon elements of the existing storage stack as much as possible. Specifically, Stratis uses device-mapper, LUKS, XFS, and Clevis. Stratis may also incorporate additional technologies in the future.

Stratis 3.7.2 Release Notes

mulhern, Stratis Team

Stratis 3.7.2, which consists of stratisd 3.7.2 and stratis-cli 3.7.0, includes one significant enhancement, several minor enhancements, and a number of small improvements.

Most significantly, Stratis 3.7.2 extends its functionality to allow a user to revert a snapshot, i.e., to overwrite a Stratis filesystem with a previously taken snapshot of that filesystem. The process of reverting requires two steps. First, a snapshot must be scheduled for revert. Second, the pool must be started, since the revert can only take place when a pool is started. This can be done while stratisd is running, by stopping and then restarting the pool. A revert may also be occasioned by a reboot of the system stratisd is running on. Restarting stratisd will also cause a scheduled revert to occur, so long as the pool containing the filesystem to be reverted has already been stopped. To support this functionality, stratis-cli includes two new filesystem subcommands, schedule-revert and cancel-revert.

Some additional functionality has been added to support this revert functionality. First, a filesystem's origin field is now included among its D-Bus properties and updated as appropriate. stratis-cli displays an origin value in its newly introduced filesystem detail view. stratisd also supports a new filesystem D-Bus method which returns the filesystem metadata. The filesystem debug commands in stratis-cli now include a get-metadata option which will display the filesystem metadata for a given pool or filesystem. Equivalent functionality has been introduced for the pool metadata as well.

stratisd also includes a considerable number of dependency version bumps, minor fixes and additional testing, while stratis-cli includes improvements to its command-line parsing implementation.

stratisd 3.6.7 Release Notes

mulhern, Stratis Team

stratisd 3.6.7 contains two bug fixes. The first stops a file descriptor from being closed too soon after it is opened; the premature close prevented the user from specifying a passphrase via the --capture-key option of the stratis-min pool start command. This bug was introduced in stratisd 3.6.6. The second corrects an error in the stratis-fstab-setup script where the pool UUID was not properly supplied to the stratis-min pool is-encrypted command.

stratisd 3.6.6 includes a number of changes. It now defines two workspaces, one for itself and one for stratisd_proc_macros, mostly to simplify packaging downstream. It increases the lower bounds of many of its dependencies; its bindgen dependency lower bound is now increased to 0.69.0. It includes a restriction on the size of any String value in the Stratis pool-level metadata. It ensures that the UserInfo values on devices conform to the same restrictions as filesystem names and pool names. It fixes a bug in lock file handling where it would be possible for the lock file to contain some extra digits at the end of the running stratisd process's id.

Both releases contain many minor fixes and improvements.

stratisd 3.6.5 Release Notes

mulhern, Stratis Team

stratisd 3.6.5 includes a modification to its internal locking mechanism which allows a lock which does not conflict with a currently held lock to precede a lock that does. This change relaxes a fairness restriction that gave precedence to locks based solely on the order in which they had been placed on a wait queue. This release also includes a number of housekeeping commits and minor improvements.

stratisd 3.6.4 Release Notes

mulhern, Stratis Team

This post includes release notes for the prior patch releases in this minor release.

stratisd 3.6.4 includes a fix for stratisd-min handling of the start command sent by stratis-min to unencrypted pools. It also captures and logs error messages emitted by the thin_check or mkfs.xfs executables.

stratisd 3.6.3 explicitly sets the nrext64 option to 0 when invoking mkfs.xfs. A recent version of XFS changed the default for nrext64 to 1. Explicitly setting the value to 0 prevents stratisd from creating XFS filesystems that are unmountable on earlier kernels.
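
As a rough illustration of pinning the option, the sketch below builds an mkfs.xfs command line with nrext64 set explicitly. The helper name and the device path are hypothetical, and this is not how stratisd itself constructs the invocation; it only assumes that nrext64 is passed as an inode (-i) suboption, as in current xfsprogs. An mkfs.xfs that predates the option may reject it, which this sketch does not handle.

```python
from typing import List


def mkfs_xfs_argv(device: str, nrext64: int = 0) -> List[str]:
    """Build an mkfs.xfs command line that sets nrext64 explicitly.

    Hypothetical helper for illustration only; it does not run anything.
    """
    if nrext64 not in (0, 1):
        raise ValueError("nrext64 must be 0 or 1")
    # Passing the value explicitly means the result does not depend on
    # whichever default the installed mkfs.xfs happens to use.
    return ["mkfs.xfs", "-i", f"nrext64={nrext64}", device]


if __name__ == "__main__":
    # "/dev/example" is a placeholder, not a real device.
    print(" ".join(mkfs_xfs_argv("/dev/example")))
```

The resulting list could be handed to subprocess.run on a system where the target device really is meant to be formatted.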

stratisd 3.6.2 includes a fix in the way thin devices are allocated in order to avoid misalignment of distinct sections of the thin data device. Such misalignments may result in a performance degradation.

stratisd 3.6.1 includes a fix to correct a problem where stratisd would fail to unlock a pool if the pool was encrypted using both Clevis and the kernel keyring methods but the key in the kernel keyring was unavailable.

All releases include a number of housekeeping and maintenance updates.

Stratis 3.6.0 Release Notes

mulhern, Stratis Team

Stratis 3.6.0 includes one significant enhancement as well as several smaller improvements.

Most significantly, Stratis 3.6.0 extends its functionality to allow a user to set a limit on the size of a filesystem. The limit can be set when the filesystem is created, or at a later time.

In addition, Stratis 3.6.0 allows the user to stop a pool by specifying the pool to stop either by UUID or by name, and allows better management of partially constructed pools.

A new --only option was added to stratis-dumpmetadata, to allow it to print only the pool-level metadata.

stratis-min, the minimal CLI for Stratis, was extended with bind, unbind, and rebind commands.

The devicemapper dependency lower bound is increased to 0.34.0 which includes an enhancement to check for the presence of the udev daemon. stratisd and stratisd-min now exit on startup if the udev daemon is not present.

The libcryptsetup-rs dependency lower bound is increased to 0.9.1 and a direct dependency is introduced on libcryptsetup-rs-sys 0.3.0 to allow registering callbacks with libcryptsetup.

The nix dependency lower bound is increased to 0.26.3, to avoid compilation errors induced by a fix to a lifetime bug in a function in nix's public API.

The serde_derive dependency lower bound is increased to 1.0.185 to avoid vendoring the serde_derive executable included in some prior versions of the package.

stratisd also contains sundry internal improvements, error message enhancements, and so forth.

The stratis-cli command-line interface has been extended with an additional option to set the filesystem size limit on creation and two new filesystem commands, set-size-limit and unset-size-limit, to set or unset the filesystem size limit after a filesystem has been created.

stratis-cli now incorporates password verification when it is used to set a key in the kernel keyring via manual entry.

stratis-cli now allows specifying a pool by name or by UUID when stopping a pool.

stratis-cli also contains sundry internal improvements, and enforces a python requirement of at least 3.9 in its package configuration.

stratisd 3.5.8 Release Notes

mulhern, Stratis Team

stratisd 3.5.8 principally contains changes to make handling of partially set up or torn down pools more robust. It also fixes a few errors and omissions in the management of stratisd's D-Bus layer, including supplying some previously missing D-Bus property change signals and removing D-Bus object paths to partially torn down pools which had in some cases persisted past the point when the pool should be considered stopped. In addition, it removes the dracut subpackage's dependency on plymouth.

Stratis root filesystem installation with stratify.py

Bryn Reeves, Stratis team

Support for using Stratis as the root filesystem was added in version 2.4.0, but without support in distribution installers, it can be tricky for users to build systems for testing.

This blog post will look at a quick method for installing systems with Stratis as the root filesystem using the Fedora Live ISO, kickstart, and a Python script to simplify and automate the process.

stratisd filesystem as root filesystem on Fedora

John Baublitz, Stratis Team

Based on recent questions, we wanted to develop a specific guide for additional steps that need to be taken on Fedora to enable Stratis as the root filesystem for a Fedora install.

If you have not already looked at the guide for root filesystem work, please read that first. It is a prerequisite.

For a little bit of background, stratisd provides an additional subpackage for our dracut modules that we use to set up the root filesystem during early boot. This package installs the necessary modules for dracut to automate the setup. However there are some steps that may not be obvious to users to get this all to work. We'll cover them below.

Steps:

  1. Install the stratisd-dracut package. This is the subpackage mentioned above.
  2. Optional: if using Clevis for unlocking encrypted pools, add the following configuration under /etc/dracut.conf.d/99-stratisd.conf:
     add_dracutmodules+=" stratis-clevis "
  3. Test your configuration or ensure you have a rescue kernel and initramfs in case the update of the initramfs renders your install unbootable.
  4. Once you've verified that everything works as expected, run dracut --force --kver=[KERNEL_VERSION]

stratisd 3.5.2 Release Notes

mulhern, Stratis Team

stratisd 3.5.2 includes three significant enhancements as well as a bug fix.

The enhancements are:

  • stratisd 3.5.2 is the first stratisd release to include a subpackage, stratisd-tools, which incorporates stratis-dumpmetadata, an application which may be used for troubleshooting.
  • stratisd 3.5.2 now depends on devicemapper-rs 0.33.1, which includes support for synchronization between udev and devicemapper. See the devicemapper-rs changelog and stratisd pr 3069 for additional details.
  • stratisd 3.5.2 modifies the way takeover by stratisd from stratisd-min is managed during early boot. See stratisd pr 3269 for further details.

stratisd 3.5.2 also fixes a bug in a script used by the stratisd-dracut subpackage. This fix was included in the stratisd 3.5.1 release. See stratisd pr 3256 for further details.

Stratis 3.5.0 Release Notes

mulhern, Stratis Team

Stratis 3.5.0 includes one significant enhancement as well as several smaller improvements.

Most significantly, Stratis 3.5.0 extends its functionality to allow a user to add a cache to an encrypted pool. The cache devices are each encrypted with the same mechanism as the data devices; consequently the cache itself is encrypted.

Stratis 3.5.0 also fixes a few bugs:

  • It extends the thin metadata device more eagerly, and responds to thin metadata low water mark devicemapper events. This fix was included in the stratisd 3.4.2 release.
  • It makes the pool name field in the Stratis LUKS2 metadata optional; this prevents a failure to start an encrypted pool when upgrading from a previous stratisd version to stratisd 3.4.0. This fix was included in the stratisd 3.4.3 release.
  • It requires a new version of the Stratis devicemapper-rs library, which contains a fix which eliminates undefined behavior in the management of ioctls with large result values. This fix was included in the stratisd 3.4.4 release.
  • It requires a new version of the Stratis libblkid-rs library, which fixes a memory leak in the get_tag_value method used by stratisd. This fix is not included in any previous release.

This release also reduces the problem of repetitive log messages and modifies the D-Bus API to eliminate the redundancy parameter previously required by the CreatePool D-Bus method.

Stratis 3.4.0 Release Notes

mulhern, Stratis Team

Stratis 3.4.0 includes one significant enhancement as well as several smaller improvements.

Most significantly, Stratis 3.4.0 extends its functionality to allow users to specify a pool by its name when starting a stopped pool. Previously it was only possible to identify a stopped pool by its UUID.

In addition, stratisd enforces some checks on the compatibility of the block devices which make up a pool. It now takes into account the logical and physical sector sizes of the individual block devices when creating a pool, adding a cache, or extending the data or cache tier with additional devices.

The stratis pool start command has been modified to accept either a UUID or a name option, while the stratis pool list --stopped command now displays the pool name if it is available.

This release also includes improvements to stratisd's internal locking mechanism.

Stratis 3.3.0 Release Notes

mulhern, Stratis Team

Stratis 3.3.0 includes one significant enhancement and several smaller enhancements as well as a number of stability and efficiency improvements.

Most significantly, Stratis 3.3.0 extends its functionality to allow users to instruct stratisd to include additional space that may have become available on a component data device in the space that is available to the device's pool. The most typical use case for this is when a RAID device which presents as a single device to stratisd is expanded.

stratis supports these changes with a new command stratis pool extend-data that allows the user to specify that the pool should make use of additional space on its devices. The stratis pool list command has been extended to show an alert if a pool's device has changed in size. The stratis blockdev list command will display two device sizes if the size that stratisd has on record differs from a device's detected size.

A less user-visible change is an improvement to the way that stratisd allocates space for its thin pool metadata and data devices from the backing store. The new approach is less precise but always more conservative when allocating space for the thin pool metadata device and will consistently reduce possible fragmentation of the thin pool metadata device over the backing store.

Checks for Clevis executables occur whenever a Clevis executable that is depended on by stratisd needs to be invoked to complete a user's command. Previously, the check occurred only once, when stratisd was started. We believe that this change will be more convenient for users who may install needed Clevis executables after stratisd has already been started.

Stratis 3.2.0 Release Notes

mulhern, Stratis Team

Stratis 3.2.0 includes one significant enhancement, one bug fix, and a number of more minor improvements.

Most significantly, Stratis 3.2.0 extends its functionality to allow users to stop and start a pool.

Stopping a pool consists of tearing down its storage stack in an orderly way, but not destroying the pool metadata. It is a pool destroy operation without the final step of wiping the Stratis metadata. Starting a pool is setting up a pool according to the information stored in the pool level metadata of the devices associated with a pool. Whether a pool is stopped or started is stored in the pool-level metadata, with the consequence that users can control whether a pool is automatically started when stratisd is started up, or whether startup of the pool is deferred until explicitly requested.

stratis supports these changes with new commands to start and to stop a pool. It includes an additional debug refresh command which allows a user to request that the state of all pools be refreshed. The pool list command has been extended to allow a detailed view of individual pools and to allow the user to examine stopped pools. The pool unlock command has been removed in favor of the pool start command.

Other changes include a fix to the algorithm for determining the size of data and metadata devices that make up a thinpool device, the elimination of all uses of udevadm settle in the stratisd engine, and general improvements to the RPC layers used by stratis-min and stratisd-min.

In addition, the stratisd-min service now requires the systemd-udevd service to ensure that Stratis filesystem symlinks are created when stratisd-min sets up a Stratis filesystem.

Stratis 3.1.0 Release Notes

mulhern, Stratis Team

Stratis 3.1.0 includes significant improvements to the management of the thin-provisioning layers, as well as a number of other user-visible enhancements and bug fixes.

Please see this post for a detailed discussion of the thin-provisioning changes. To support these changes the Stratis CLI has been enhanced to:

  • allow specifying whether or not a pool may be overprovisioned on creation
  • allow changing whether or not a pool may be overprovisioned while it is running
  • allow increasing the filesystem limit for a given pool
  • display whether or not a pool is overprovisioned in the pool list view

Users of the Stratis CLI may also observe the following changes:

  • A debug subcommand has been added to the pool, filesystem, and blockdev subcommands. Debug commands are not fully supported and may change or be removed at any time.
  • The --redundancy option is no longer available when creating a pool. This option had only one permitted value so specifying it never had any effect.

stratisd 3.1.0 includes one additional user-visible change:

  • The minimum size of a Stratis filesystem is increased to 512 MiB.

stratisd 3.1.0 also includes a number of internal improvements:

  • The size of any newly created MDV is increased to 512 MiB.
  • A pool's MDV is mounted in a private mount namespace and remains mounted while the pool is in operation.
  • Improved handling of udev events on device removal.
  • The usual and customary improvements to log messages.

Thin provisioning redesign

John Baublitz, Stratis Team

Overview

For a while, we've bumped into a number of problems with our thin provisioning implementation around reliability and safety for users. After gathering a lot of feedback on our thin provisioning layer, we put together a proposal for improvements to how we currently handle allocations.

The changes can largely be divided up into three areas of improvement:

  • Predictability
  • Safety
  • Reliability

Predictability

We made two notable changes to make behavior in the thin provisioning layer well-defined and predictable for users. Both parts relate to an existing thin provisioning tool, thin_metadata_size. This tool allows users to calculate the amount of metadata needed for a thin pool with a given size and number of thin devices (filesystems and snapshots in the case of stratisd). We have started taking advantage of thin_metadata_size to make our metadata space reservation more precise. Instead of our previous approach of allocating a fixed fraction of the available space, we now calculate the exact amount of space required for a given pool size and number of filesystems and snapshots.

The second change is a switch to lazy allocation. Previously, we allocated greedily, which meant that every time a device was added, we would allocate a certain amount of space for data and metadata regardless of the individual user's requirements. We now delay allocation and allocate block device storage on an as-needed basis, allowing users to develop different requirements and adjust accordingly. For example, a user may realize that they need more filesystems than they originally planned for. With lazy allocation, assuming there is unallocated space on the pool, the user can now redirect that unallocated space from data to metadata space so there is enough room for a greater number of filesystems than was originally anticipated.
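
To make the thin_metadata_size calculation more concrete, the sketch below builds a command line for the tool from a block size, a pool size, and a maximum number of thin devices. The helper name and the example sizes are hypothetical, and the flags (-b, -s, -m, -u, -n) are assumed from the thin-provisioning-tools man page; they should be checked against the installed version. It is not a description of how stratisd invokes the tool.

```python
from typing import List


def thin_metadata_size_argv(
    block_size_bytes: int, pool_size_bytes: int, max_thins: int
) -> List[str]:
    """Build a thin_metadata_size command line; sizes are given in bytes.

    Hypothetical helper for illustration only; it does not run the tool.
    """
    if min(block_size_bytes, pool_size_bytes, max_thins) <= 0:
        raise ValueError("all arguments must be positive")
    return [
        "thin_metadata_size",
        "-b", f"{block_size_bytes}b",  # thin pool data block size (assumed "b" = bytes)
        "-s", f"{pool_size_bytes}b",   # total size of the thin pool
        "-m", str(max_thins),          # thin devices: filesystems plus snapshots
        "-u", "b",                     # ask for the result in bytes
        "-n",                          # numeric-only output
    ]


if __name__ == "__main__":
    one_mib = 1024 * 1024
    one_tib = 1024 ** 4
    # Example figures are arbitrary: 1 MiB blocks, a 1 TiB pool, 100 thin devices.
    print(" ".join(thin_metadata_size_argv(one_mib, one_tib, 100)))
```

Raising the third argument models a larger filesystem limit, which is why increasing that limit can require more metadata space.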

This change resulted in two API modifications. One is filesystem limits; to appropriately ensure that we never exceed the allocated metadata limit, we set a filesystem limit per pool. This limit can be increased through the API, triggering a new allocation for metadata space. The other API change is related to the switch to lazy allocation. There is now information available that reports the amount of space that has been allocated. Previously we only concerned ourselves with used and total space, but with lazy allocation, it is now also important to report space that has been allocated but may not be in use yet.

Safety

A key drawback of thin provisioning is often the failure cases. When overprovisioning a storage stack, the stack can get into a bad state when the pool becomes full due to the filesystem being far larger than the pool backing it. We have added in two safety features to help users cope with this.

One measure is the addition of a mode to disable overprovisioning. This ensures that the size of all filesystems on the pool does not exceed the available physical storage provided by the pool. This feature is not necessarily useful for all users, particularly with heavy snapshot usage because even if storage is shared between a snapshot and a filesystem, this mode will treat them as entirely independent entities in terms of storage cost. This ensures that copy-on-write operations will not accidentally fill the pool if the shared storage diverges between the two, but puts a rather strict limit on snapshot capacity. For users that use Stratis for critical applications or the root filesystem, this mode prevents certain failure cases that can be challenging to recover from.
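
The accounting rule described above can be sketched in a few lines. This is a simplified model, not stratisd's implementation: the function name and the byte figures are hypothetical, and it ignores metadata overhead. The point it illustrates is that a snapshot is charged its full logical size even when it shares blocks with its origin.

```python
from typing import Iterable

GIB = 1024 ** 3


def fits_without_overprovisioning(
    logical_sizes: Iterable[int], pool_usable_bytes: int
) -> bool:
    """Return True if the summed logical sizes fit in the pool.

    Every filesystem and every snapshot counts in full; shared
    copy-on-write blocks earn no discount in this mode.
    """
    return sum(logical_sizes) <= pool_usable_bytes


if __name__ == "__main__":
    pool = 10 * GIB
    filesystem = 4 * GIB
    # One filesystem plus one snapshot of it: 8 GiB charged, fits.
    print(fits_without_overprovisioning([filesystem, filesystem], pool))
    # A second snapshot brings the charge to 12 GiB, which no longer fits.
    print(fits_without_overprovisioning([filesystem] * 3, pool))
```

This is why the mode puts a strict limit on snapshot capacity: each snapshot consumes budget as if none of its blocks were shared.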

When overprovisioning is enabled, we have also introduced a new API signal to notify the user when physical storage has been fully allocated. This does not necessarily mean that the pool has run out of space but serves as a warning to the user that once the remaining free space fills up, Stratis has no space left to extend to. This gives users time to provide more storage from which to allocate space before reaching a failure case.

Reliability

For a while, we've gotten bug reports about the reliability of filesystem extension. In certain cases, Stratis was not able to handle filesystem extension smoothly or at all. Between the per-pool locking and the thin provisioning redesign, we have now resolved some of the previous issues with filesystem extension. The approach we've taken attacks the problem from a few different angles.

Earlier filesystem extension

Stratis used to wait until several gigabytes were left to extend the filesystem. If Stratis didn't resize the filesystem quickly enough, the filesystem would run out of space before the extension could complete. While this would eventually resolve itself once the filesystem was extended, it would cause some unnecessary IO errors. We now extend the filesystem at 50% usage to ensure that users always have a large buffer of free space available for even very IO-heavy usage patterns.

Parallelized filesystem extension operations

Stratis could previously only iterate sequentially through pools. Now stratisd can handle filesystem extension on two separate pools in parallel, reducing the latency between the point where high usage is detected and the extension operation being performed.

Periodic checks for filesystem usage

Checking filesystem usage used to be a devicemapper event-dependent operation. This led to some problems around filesystem extension. A devicemapper event would be generated periodically as the filesystem filled up, but if the filesystem failed to extend a few times, devicemapper events would no longer be generated once the pool filled up and users would be left with a filesystem that couldn't be extended. We've removed our dependency on devicemapper events for filesystem monitoring and use devicemapper events for pool handling exclusively. Instead, we run periodic checks in the background on filesystems to ensure that even if filesystem extension fails multiple times, once the filesystem is ready to be extended, stratisd can perform the operation in the background, so that we don't leave users in a state where their filesystem can't be extended.
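
A minimal sketch of the periodic approach, combined with the 50% threshold mentioned earlier, might look like the following. All names here (filesystems_to_extend, monitor, the read_usage and extend callbacks) are hypothetical, and the real daemon is written in Rust, so this only models the idea: the check runs on a timer, and a failed extension is retried on the next pass without relying on any event.

```python
import asyncio
from typing import Callable, Dict, List, Tuple

EXTEND_THRESHOLD = 0.5  # extend once a filesystem is at least 50% used


def filesystems_to_extend(usage: Dict[str, Tuple[int, int]]) -> List[str]:
    """Return the names of filesystems at or above the usage threshold.

    usage maps a filesystem name to (used_bytes, size_bytes).
    """
    return sorted(
        name
        for name, (used, size) in usage.items()
        if size > 0 and used >= EXTEND_THRESHOLD * size
    )


async def monitor(
    read_usage: Callable[[], Dict[str, Tuple[int, int]]],
    extend: Callable[[str], None],
    interval_s: float,
    rounds: int,
) -> None:
    """Check usage every interval_s seconds for a fixed number of rounds."""
    for _ in range(rounds):
        for name in filesystems_to_extend(read_usage()):
            try:
                extend(name)
            except OSError:
                # No event is needed to retry: the next pass checks again.
                pass
        await asyncio.sleep(interval_s)
```

A caller would supply read_usage and extend and run the loop with asyncio.run(monitor(read_usage, extend, interval_s, rounds)); a long-running service would loop indefinitely rather than for a fixed number of rounds.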

Migration and backwards compatibility

There are two types of changes that require migrations from older versions of stratisd: metadata changes and allocation scheme changes.

Metadata changes

The changes we made required some schema changes in our MDA, the metadata region outside of the superblock that records longer form JSON about the specifics of the pool topology. The migration should be invisible to the user and will be performed the first time the new version of stratisd detects legacy pools. The migration adds some additional devicemapper information, information about filesystem limits on a pool, and other bookkeeping information.

Allocation scheme changes

As mentioned above, the previous metadata allocation scheme was less precise and allocated a larger segment for metadata space than was necessary for the amount of data space present. Migration for old pools will cause stratisd to detect that the metadata device is already larger than it needs to be and no additional metadata device growth will occur until the data device size becomes large enough to require additional metadata space.

Future work

We hope to eventually provide some smarter allocation strategies for our data and metadata allocations to maximize contiguous allocation extents.

stratisd 3.0.4 Release Notes

mulhern, Stratis Team

stratisd 3.0.4 contains two fixes to bugs in its D-Bus API.

The D-Bus property changed signal sent on a change to the LockedPools property of the "org.storage.stratis3.Manager.r0" interface misidentified the interface as the "org.storage.stratis3.pool.r0" interface; the interface being sent with the signal is now correct.

The introspection data obtained via the "org.freedesktop.DBus.Introspectable" interface's "Introspect" method was not correct for the "GetManagedObjects" method of the "org.freedesktop.DBus.ObjectManager" D-Bus interface; it did not include the specification of the out argument. This has been corrected.

stratisd 3.0.3 Release Notes

mulhern, Stratis Team

stratisd 3.0.3 contains internal improvements and several bug fixes.

Most significantly, it includes an enhancement to stratisd's original multi-threading model to allow locking individual pools.

A change was made to the conditions under which the stratis dracut module is included in the initramfs.

Under some conditions, a change in pool size did not result in a corresponding property changed signal for the relevant D-Bus property change; this has been fixed.

Addition of per-pool locking

John Baublitz, Stratis Team

Overview

Recently, we've merged a PR that completes our work on improved concurrency in stratisd. Previously, we had made some changes to the IPC layer to provide the ability for stratisd to handle incoming requests in parallel which you can read about here. This work allowed IPC requests to each be handled in a separate tokio task, but the Stratis engine, the part of our code that handles all of the storage stack operations, could still only be accessed sequentially.

Motivation

After having conversations with the LVM team, it seemed like sequential access to storage operations was not entirely necessary. While modifying multiple layers of the pool stack at once can cause problems, modifying independent pools in parallel is safe, and we wanted to take advantage of the potential for increased concurrency. A large part of this is due to how we handle D-Bus properties. Our D-Bus properties expose aspects of the storage stack that sometimes require querying the device-mapper stack for information. With sequential accesses, this would mean that even two list operations on any two pools could not run in parallel, a restriction that causes a bad user experience and is not technically necessary.

Requirements

Despite the motivation being clear, the solution turned out to be more complicated. One of the major problems that we bumped into when trying to achieve more granular concurrency was the interaction between standard Rust synchronization structures and the API for listing D-Bus objects.

Our initial idea was to wrap the data structure containing the record of all of the pools in a read-write lock. This had a few notable drawbacks. For one, you could not acquire mutable access to two independent pools at a time even though this is a completely safe operation.

This led us to the idea of wrapping each pool in a read-write lock. Unfortunately, this also had some major drawbacks. One notable example of this was the behavior of our list operation with this solution. A list operation would require a read lock on every single pool and this means that the time that it would take to list all of the pools or filesystems would increase proportionally with the number of pools on the system. Because locking is relatively expensive, we noticed a significant slowdown when listing larger numbers of pools and filesystems.

Our ideal scenario was to have the benefits of a read-write lock so that list operations could run in parallel but to provide an ability to either lock single pools or all pools in one operation so that locking all pools would take the same amount of time no matter how many were present on the system.

Design

After determining that no locking data structure like this appeared to exist in tokio, we took some time to look into how tokio implements its locking data structures. The API for most of the locking data structures appeared to be a lock acquisition method that returned a future. This future would poll the state of the lock and either update the internal data structures to indicate that the lock had been acquired or put itself to sleep until it was ready to poll again. The drop method on the data structure returned by the future would trigger waking up a task to poll again. This seemed perfectly workable with a more granular read-write lock. The only difference would be that we would need to keep track of locks on individual pools as well as locks on the entire collection. The proper locking conflict rules would need to be checked:

  • WriteAll conflicts with all other operations.
  • ReadAll conflicts with WriteAll and Write on any pool.
  • Write conflicts with WriteAll and Read or Write on the same pool.
  • Read conflicts with WriteAll and Write on the same pool.

Any attempt to acquire two conflicting locks would queue one of the tasks to be woken up once the conflicting lock was dropped.
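
The conflict rules listed above can be written as a small symmetric predicate. This is a sketch in Python rather than the Rust used by stratisd, and the Lock type and its field names are hypothetical; only the four lock kinds and their conflicts come from the list.

```python
from typing import NamedTuple, Optional


class Lock(NamedTuple):
    kind: str                   # "Read", "Write", "ReadAll" or "WriteAll"
    pool: Optional[str] = None  # pool identifier for Read/Write, None otherwise


def conflicts(a: Lock, b: Lock) -> bool:
    """Return True if the two lock requests cannot be held together."""
    kinds = {a.kind, b.kind}
    if "WriteAll" in kinds:
        # WriteAll conflicts with every other operation.
        return True
    if "ReadAll" in kinds:
        # ReadAll tolerates readers but not a Write on any pool.
        return "Write" in kinds
    if a.pool != b.pool:
        # Per-pool locks on different pools are independent.
        return False
    # Same pool: only Read with Read is compatible.
    return "Write" in kinds


if __name__ == "__main__":
    print(conflicts(Lock("Write", "pool-a"), Lock("Write", "pool-b")))  # False
    print(conflicts(Lock("ReadAll"), Lock("Write", "pool-a")))          # True
```

Writing the check as a function of two requests keeps it symmetric, so the order in which two tasks arrive does not change whether they conflict.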

Notable design choices

We chose to implement our lock as a starvation-free lock. Implementing a lock that allows ReadAll to bypass Write* requests that are queued when another ReadAll request has already acquired the lock leads to behavior where Write* requests could block indefinitely. This behavior could cause list operations to block filesystem extension handling indefinitely, potentially leading to IO errors and a full filesystem. A starvation-free locking approach puts a task in a FIFO queue if any are already queued in front of it. The notable downside of this is slightly more latency for handling locking requests, but the benefits seemed to outweigh this.

Because tokio can cause spurious wake ups for tasks, we assign a unique integer ID to each future responsible for polling the lock for readiness. In the case where there is both a legitimate and spurious wake up at the same time, this allows our lock to differentiate between the two woken tasks to determine which one should be given priority and which should be put to sleep. This prevents spurious wake ups from acquiring the lock before they are scheduled to.

Because tokio does not currently allow lifetimes shorter than 'static when passing a reference across thread boundaries, our locking data structure heavily uses atomic reference counting (Arc). This enables shared access between multiple threads and the ability to pass an acquired lock handle to a separate thread after acquisition. Without the use of Arc, the pool would have to be operated on in the same task as the lock acquisition which would prevent passing lock handles to separate tasks to process them in parallel.

Optimizations

After our initial implementation of the write-all lock, we bumped into an issue where we could not pass all pool lock handles into separate threads to handle them all in parallel. This was particularly problematic for our implementation of background devicemapper event handling. Our solution for this was to allow acquiring all locks at once to avoid the penalty of locking each pool individually and then converting that lock handle to a set of individual locks that can all be released when they are no longer needed. This addressed both the issue of parallelization and constant time locking for all pools nicely.

Originally we also only woke one queued task at a time when a lock was released. This proved to be less performant. If two ReadAll tasks were queued, these could both be woken up in parallel and acquire the lock with no conflict. The solution to this was to factor out the part of the code that tests for conflicts and traverse the queue and wake up all tasks until a conflicting task is found. This allowed waking up a batch of queued tasks that could all operate in parallel without also waking up a conflicting task that would immediately be put back to sleep.
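The batch wake-up described above can be sketched as a queue traversal that stops at the first conflicting request. This is a simplified model under stated assumptions: `LockRequest`, `conflicts`, and `wake_batch` are hypothetical names, pools are identified by a `u32`, and the queue holds plain values instead of sleeping tasks.

```rust
use std::collections::VecDeque;

/// A queued lock request. The `u32` is a hypothetical pool identifier.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LockRequest {
    ReadAll,
    WriteAll,
    Read(u32),
    Write(u32),
}

/// The conflict test, factored out so that it can be reused when waking.
pub fn conflicts(a: LockRequest, b: LockRequest) -> bool {
    use LockRequest::*;
    match (a, b) {
        (WriteAll, _) | (_, WriteAll) => true,
        (ReadAll, Write(_)) | (Write(_), ReadAll) => true,
        (Write(p), Write(q)) | (Write(p), Read(q)) | (Read(p), Write(q)) => p == q,
        _ => false,
    }
}

/// Removes requests from the front of the FIFO queue until one conflicts
/// with a lock that is still held or with a request already in this batch.
/// The conflicting request and everything behind it stay queued.
pub fn wake_batch(held: &[LockRequest], queue: &mut VecDeque<LockRequest>) -> Vec<LockRequest> {
    let mut woken = Vec::new();
    while let Some(&next) = queue.front() {
        let blocked = held
            .iter()
            .chain(woken.iter())
            .any(|&other| conflicts(other, next));
        if blocked {
            break;
        }
        woken.push(next);
        queue.pop_front();
    }
    woken
}
```

Stopping at the first conflict, instead of skipping over it, is what keeps the lock starvation-free: a later compatible request never overtakes an earlier one that is still waiting.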

Future work

Recently, we discovered that we should be able to provide even more parallelization for filesystem background operations. While we cannot perform multiple pool mutation operations in parallel, the filesystems on top of the pool can be modified independently in parallel. We expect to change the way background checks on filesystem usage are handled by spawning each filesystem extension in its own tokio task so that, for pools with many filesystems, the filesystem extension will be more responsive. Rather than iterating through hundreds of filesystems, stratisd will be able to handle multiple filesystem extensions in parallel, speeding up the checking process if there is more than one filesystem that needs to be extended at once. This will benefit IO performance by ensuring that the filesystems are extended in a timely manner to avoid cases where the filesystem is filled before it can be extended.

Final notes

We've added extensive debugging for the locking data structure in case users run into issues. To enable these logs and see the state of the per-pool locking data structure over time, simply enable trace logs in stratisd!

Stratis 3.0.0 Release Notes

mulhern, Stratis Team

Stratis 3.0.0 includes many internal improvements, bug fixes, and user-visible changes.

Users of the Stratis CLI may observe the following changes:

  • It is now possible to set the filesystem logical size when creating a filesystem.
  • It is possible to rebind a pool using a Clevis tang server or with a key in the kernel keyring.
  • Filesystem and pool list output have been extended and improved. The pool listing includes an Alerts column. Currently this column is used to indicate whether the pool is in a restricted operation mode. A new subcommand, stratis pool explain, which provides a fuller explanation of the codes displayed in the Alerts column, has been added. The filesystem listing now displays a filesystem's logical size.
  • With encrypted pools it was previously possible for the display of block device paths to change format if stratisd was restarted after an encrypted pool had been created. Now the display of the block device paths is consistent across stratisd restarts.

In stratisd 3.0.0 the D-Bus API has undergone a revision and the prior interfaces are all removed. The FetchProperties interfaces that were supported by all objects have been removed. The values that were previously obtainable via the FetchProperties methods are now conventional D-Bus properties. The possible values of error codes returned by the D-Bus methods have been reduced to 0 and 1, with the usual interpretation.

stratisd 3.0.0 includes a number of significant internal improvements and a few bug fixes.

stratisd bug fixes:

  • Previously the Stratis release included a dracut.conf.d file which made the Stratis dracut modules required in the initramfs. The consequence of this was that the initramfs could not be built unless all files required for the Stratis modules were present; if the initramfs is not built a reboot will fail. That file has been removed in this release.
  • The --prompt option was not passed to stratis-min in the stratis-fstab-setup script; this prevented the user from entering the password necessary to unlock an encrypted pool during boot. This is no longer the case.
  • Previously, stratisd did not increase the amount of space allocated to its spare metadata device when its in-use thinpool metadata device was extended. In some situations, when setting up a pool, stratisd might attempt a repair operation on the thinpool metadata device; if the space allocated for the spare metadata device was not large enough to accommodate all the metadata, then the repair operation would fail. Now the space allocated for the spare metadata device is increased whenever the metadata device is extended.
  • stratisd was not immediately updating the devicemapper device stack when a cache was initialized with the result that the cache was not immediately put in use. This is no longer the case.
  • stratisd was not immediately updating the Clevis encryption info associated with a pool on a command to bind an encrypted pool with Clevis. This problem has been corrected.
  • stratisd was sending an incorrect D-Bus signal on a pool name change; this has been fixed.
  • Previously, when stratisd-min, which runs during boot before D-Bus functionality is available, gave way to stratisd when the D-Bus had been set up, it was possible for inconsistencies to arise if the Stratis engine was performing an operation which required invoking a distinct executable. The executable might be terminated during its execution, and stratisd-min would take the action appropriate to the command failure before exiting. Now, systemd is instructed to send a kill signal only to stratisd-min and not to any of stratisd-min's child processes when shutting down stratisd-min.
  • Previously, if the same device was specified using two different paths when creating or extending a pool, the different paths would be interpreted as two different devices and an error would be returned when stratisd attempted to initialize the device a second time. Now, the different paths are canonicalized eagerly and converted into a single canonical representation of the device, so stratisd initializes the device only once and no error is returned.
  • Previously, stratisd did not report all existing object paths in the result of a D-Bus Introspect() call. This was due to a bug in versions 0.9.1 and earlier of stratisd's dbus-tree dependency. stratisd now requires dbus-tree 0.9.2, so all nodes are reported.
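The eager canonicalization fix in the list above can be illustrated with the standard library. This is a sketch of the general technique and not the stratisd code: `unique_devices` is a hypothetical helper, and it relies on std::fs::canonicalize, which resolves symlinks and relative components and requires that each path exist.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Canonicalizes every path up front and drops duplicates, preserving the
/// order in which each distinct device first appears.
pub fn unique_devices<P: AsRef<Path>>(paths: &[P]) -> io::Result<Vec<PathBuf>> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for path in paths {
        let canonical = fs::canonicalize(path)?;
        // `insert` returns false if the canonical path was already present.
        if seen.insert(canonical.clone()) {
            unique.push(canonical);
        }
    }
    Ok(unique)
}
```

With this approach two spellings of the same device, for example a /dev/disk/by-id symlink and the device node it points to, collapse into one entry before any initialization is attempted.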

Other stratisd improvements:

  • Previously, stratisd relied entirely on udev information when deciding whether a storage device was not in use by another application and could safely be overwritten with Stratis metadata. Now it performs a supplementary check using libblkid and exits with an error if libblkid reports that the device is in use.
  • Handling of errors returned by internal methods is improved; a chaining mechanism has been introduced and the error chains can be scrutinized programmatically to identify expected scenarios like rollback failures.
  • A set of states indicating that a pool has reduced capability has been added internally and is published on the D-Bus. A pool's capability is reduced on an error being returned internally which contains, somewhere in its chain, the appropriate identifying error variant.
  • The code used to roll back failed encryption operations on a list of pool devices has been refactored and generalized. It is now capable of returning an error that can be used to identify a restricted pool capability due to a rollback failure.
  • stratisd uses sha-256 instead of sha-1 for Clevis-related encryption operations to conform with Clevis's own usage.
  • stratisd exits more elegantly and less frequently if it encounters an error during execution of the distinct tasks that are assigned to the individual threads that it manages internally.
  • In preparation for edition 2021 of the Rust language, stratisd source code has been updated to conform entirely to edition 2018 recommendations.

stratisd 3.0.0 Release Announcement

mulhern, Stratis Team

The next version of stratisd will be 3.0.0.

We have already decided on two breaking changes for this release:

  • We will collapse all the non-zero error codes returned over the D-Bus on an engine error into a single error code, 1.
  • We will remove all the backwards compatible D-Bus interfaces corresponding to stratisd 2 and will supply just a single D-Bus interface for stratisd 3. Subsequent minor releases of stratisd 3 will retain their backwards compatible interfaces as described in the DBus API Reference. Each interface will have a simplified naming convention, always specifying the major version and using the stratisd minor version in the revision number. For example, the name of the filesystem D-Bus interface under the new system for stratisd 3.0.0 is org.storage.stratis3.filesystem.r0.
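Both changes are simple enough to state as code. The sketch below is illustrative only: `interface_name` and `return_code` are hypothetical helpers, and the `u16` return type is an assumption about how the code is represented on the D-Bus.

```rust
/// Builds an interface name from the stratisd major version, the kind of
/// object, and the stratisd minor version used as the revision number.
pub fn interface_name(major: u32, object: &str, minor: u32) -> String {
    format!("org.storage.stratis{}.{}.r{}", major, object, minor)
}

/// Collapsed return code: 0 on success, 1 on any engine error.
pub fn return_code<T, E>(result: &Result<T, E>) -> u16 {
    if result.is_ok() {
        0
    } else {
        1
    }
}
```

For stratisd 3.0.0, `interface_name(3, "filesystem", 0)` yields the name given in the example above, org.storage.stratis3.filesystem.r0.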

The motivation for both these changes is the most typical of all: the implementation of stratisd will become unwieldy and bug-ridden if we try to maintain backwards compatibility in the D-Bus layer while simultaneously doing necessary redesign, re-implementation, and enhancement of the stratisd engine. In particular, changes to the way errors are managed internally will not allow us to ensure consistency of error codes returned over the D-Bus with the ones that were previously used.

Since we are increasing the major version, dropping the stratisd 2 D-Bus API is an obvious next step.

We are reviewing other possible API changes at this time in order to minimize the number of subsequent major version increases that we will be obliged to do.

stratisd 2.4.2 Release Notes

mulhern, Stratis Team

stratisd 2.4.2 is a bug fix release. It specifies two additional command-line dependencies for the stratis dracut module. stratisd and stratisd-min both require these dependencies to be available in order to start up.

Packaging for stratisd 2.4.1

mulhern, Stratis Team

For Fedora packaging, we have decided to split out the dracut support into a separate subpackage, stratisd-dracut. This package must be installed in order to support booting from a Stratis filesystem. All other functionality is included in the stratisd package.

The motivation for this change is to allow users greater flexibility and robustness. We understand that some users may choose to use Stratis but not to use Stratis for their root filesystem. These users may choose to install only the stratisd package.

Other users may prefer to use a Stratis root filesystem. They should install the stratisd-dracut package, which has a hard dependency on the stratisd package. The stratisd-dracut package also includes a hard dependency on dracut itself and on plymouth. plymouth is used in order to obtain a password to unlock an encrypted Stratis root filesystem. Please consult the post Stratis filesystems as the root filesystem for further information about Stratis support for a root filesystem.

We decided to implement this division due to a problem which would ensue if stratisd was installed but plymouth was not. In that case, the regeneration of the initramfs on kernel updates would fail and render the system unbootable with the new kernel.

The solution of adding plymouth as a hard requirement for stratisd would place an unnecessary dependency burden on a user who did not choose to maintain a Stratis root filesystem. However, without such a requirement a user who had stratisd but not plymouth installed would eventually end up with an unbootable system.

We believe that a separate subpackage is the most robust and flexible solution; it is one which requires no manual intervention by the user.

To properly construct the stratisd-dracut subpackage, it is imperative that the stratisd source code be compiled twice: once with the default features, in order to build stratisd itself, and again with a different set of features, in order to correctly build supporting scripts for the dracut module.

We recommend that other downstream packagers adopt a similar scheme.

Stratis 2.4.1 Release Notes

mulhern, Stratis Team

Stratis 2.4.1 is a bug fix release, which addresses a flaw in the multi-threading implementation.

A user could observe the behavior caused by the flaw when CLI commands would either take far longer to complete than normal or the D-Bus connection would eventually time out.

The cause of the observed problem was that stratisd was accepting a GetManagedObjects call on the D-Bus but not returning the result. This could occur when numerous object paths, i.e., several hundred, were being supported on the D-Bus, generally due to the creation of many filesystems.

We have addressed this problem by:

  • modifying the D-Bus message handling implementation
  • implementing a custom ObjectManager class in the D-Bus layer

In addition, we have refined the method by which individual threads are terminated when stratisd receives a shutdown signal to better terminate the D-Bus message handling thread.

The stratisd 2.4.1 release includes one additional fix: the signals associated with r4 D-Bus interfaces were not being sent appropriately; now they are.

In addition, stratisd 2.4.1 includes logging, at the trace level, of lock acquisitions and releases and additional logging in the systemd generators included with the release.

The stratis-cli 2.4.1 release includes:

  • an improvement to the listing of block devices
  • a new report with key managed_objects_report

Stratis 2.4.0 Release Notes

mulhern, Stratis Team

Stratis 2.4.0 includes two major user-visible changes:

  • All the functionality required to boot from a Stratis-managed root filesystem. See the prior post Stratis filesystems as the root filesystem for a more detailed discussion.
  • An enhancement to existing encryption support that allows the user to create a pool with encryption managed either by the kernel keyring or Clevis, and to subsequently bind an already encrypted pool using either mechanism. Previously, the user could create an encrypted pool using the kernel keyring only, and could bind or unbind using Clevis only.

More minor user-visible changes are:

  • An enhancement to the FetchProperties D-Bus interface in order to disclose more information about sets of encrypted devices.
  • The engine_state_report key in the report interface has been stabilized and is guaranteed to be supported in future releases.
  • A new executable, stratis-predict-usage, which predicts free space on a newly created pool, is distributed with stratisd.

This release of Stratis also includes a number of significant but less visible changes:

  • Support for multi-threading in stratisd. The new multi-threading implementation replaces the prior event-loop implementation. See the prior post Multi-threading Support in stratisd for a detailed discussion.
  • The management of Stratis filesystem symlinks has been simplified. Determining the filesystem and pool name that comprise the symlink path no longer requires communication with stratisd over the D-Bus; it is now accomplished via standard udev-based mechanisms.
  • stratisd now emits a log message at the info level in connection with every mutating D-Bus method call that it completes without an error.

The support for migrating symlinks introduced in Stratis 2.2.0 is no longer included in this release.

The ongoing and perpetual but entirely routine work of improvements to individual log and error messages continues.

Stratis filesystems as the root filesystem

John Baublitz, Stratis Team

While Stratis unencrypted pools could previously be used as the root filesystem for a Linux installation with proper customization of the initramfs, our most recent feature provides all of the plumbing to fully support Stratis filesystems as the root filesystem of a Linux installation.

IPC

Stratis relies on D-Bus for interprocess communication between the daemon and the client. D-Bus does not currently ship in the initramfs so the first order of business was to choose an alternate form of IPC. Because we require the ability to pass a file descriptor from the client to the daemon, Unix sockets were the only reasonable transport mechanism. After evaluating several JSON RPC libraries with Unix socket support, we decided to write a minimal RPC interface ourselves. This was due to a few constraints:

  • JSON RPC libraries with Unix socket support did not support setting the ancillary data for packets required to send file descriptors.
  • The libraries had relatively complicated threading models, and we preferred to take a simple approach of processing each new request using a Tokio task.
  • It was relatively trivial to serialize full data structures as JSON thanks to serde_json.

This approach proved successful and we were able to implement an IPC mechanism that left our internal stratisd API unchanged. The alternate IPC mechanism is distributed in a separate pair of executables, stratis-min and stratisd-min.

Support in the initramfs

The next step was to integrate stratisd-min into the initramfs. This involved quite a bit of configuration for dracut and systemd.

Our current model uses systemd generators that are enabled by passing information on the kernel command line at boot to properly set up and unlock any devices that need to be unlocked. We aim to do this in as user-friendly a way as possible by leveraging existing tools like Plymouth to handle prompting users for a passphrase on the splash screen.

To set up the generators and necessary dependencies, we wrote two dracut modules: stratis and stratis-clevis. stratis-clevis depends on stratis and is required for automated unlocking using Clevis.

The required information on the kernel command line is:

  • root=[STRATIS_FS_SYMLINK]: The symlink under /dev/stratis that corresponds to the desired Stratis filesystem. This is required by dracut.
  • stratis.rootfs.pool_uuid=[POOL_UUID]: The UUID of the pool that contains the root filesystem.

If the user requires networking (for example, unlocking a pool using Tang), the parameter rd.neednet=1 is required as well.
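As an illustration of how these parameters combine, the following sketch assembles them into a single string. `stratis_root_args` is a hypothetical helper, not part of Stratis, and the values passed to it must come from the actual system.

```rust
/// Assembles the kernel command line parameters needed to boot from a
/// Stratis root filesystem. `fs_symlink` is the symlink under /dev/stratis,
/// and `need_net` should be true when unlocking requires networking.
pub fn stratis_root_args(fs_symlink: &str, pool_uuid: &str, need_net: bool) -> String {
    let mut args = vec![
        format!("root={}", fs_symlink),
        format!("stratis.rootfs.pool_uuid={}", pool_uuid),
    ];
    if need_net {
        args.push("rd.neednet=1".to_string());
    }
    args.join(" ")
}
```

The resulting string would be appended to the existing kernel command line in the bootloader configuration.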

Testing on Fedora using Anaconda

This process was relatively simple once it came time to test with Fedora. Anaconda provides the parameter --dirinstall which allows the user to install into a path with the mounted Stratis filesystem as the root directory. It requires a bit more configuration after the fact (manual /etc/fstab or .mount file configuration) but works quite well.

/etc/fstab or .mount files

We now also provide a systemd service to manage setting up non-root filesystems in /etc/fstab. For devices that require a passphrase or are critical for a working system, the following line can be used:

/dev/stratis/[STRATIS_SYMLINK] [MOUNT_POINT] xfs defaults,x-systemd.requires=stratis-fstab-setup@[POOL_UUID].service,x-systemd.after=stratis-fstab-setup@[POOL_UUID].service 0 2

The absence of nofail here is due to the fact that nofail causes the boot to proceed prior to a successful mount. This means that passphrase prompts will not work properly, and most users will want critical system partitions to be mounted successfully or else have the boot fail.

For devices that do not require interaction to set up, such as unencrypted devices or devices that have Clevis bindings, and are not critical for a working system, the following line can be optionally used:

/dev/stratis/[STRATIS_SYMLINK] [MOUNT_POINT] xfs defaults,x-systemd.requires=stratis-fstab-setup@[POOL_UUID].service,x-systemd.after=stratis-fstab-setup@[POOL_UUID].service,nofail 0 2

The addition of nofail here will cause mounting of this device to proceed independently from the boot which can speed up boot times. The set up process will continue running in the background until it either succeeds or fails.

Because the root filesystem is mostly set up in the initramfs, the entry is slightly different and does not require the stratis-fstab-setup service. It should be:

/dev/stratis/[STRATIS_SYMLINK] / xfs defaults 0 1
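The two non-root forms shown above differ only in the trailing nofail option, which the following sketch makes explicit. `fstab_entry` is a hypothetical helper written for illustration; it generates the non-root lines only, since the root entry does not use the stratis-fstab-setup service.

```rust
/// Builds an /etc/fstab line for a non-root Stratis filesystem.
/// `symlink` is the full /dev/stratis/... path of the filesystem.
pub fn fstab_entry(symlink: &str, mount_point: &str, pool_uuid: &str, nofail: bool) -> String {
    let unit = format!("stratis-fstab-setup@{}.service", pool_uuid);
    let mut options = format!(
        "defaults,x-systemd.requires={unit},x-systemd.after={unit}",
        unit = unit
    );
    // Only add nofail for filesystems that need no interaction to set up
    // and are not critical for a working system.
    if nofail {
        options.push_str(",nofail");
    }
    format!("{} {} xfs {} 0 2", symlink, mount_point, options)
}
```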

Recovery console

While we mention above that stratisd could previously be used in the initramfs, there was one major caveat: it could be started but commands could not be issued. As a result, users could not interact with stratisd in the recovery console. D-Bus is not available in the recovery console, so the move to stratis-min and stratisd-min now allows users to perform recovery actions in the emergency console by starting stratisd-min and running the necessary commands using stratis-min. This will make rescuing systems that do not boot significantly easier moving forward.

Scope of dracut modules and systemd service files

While our dracut modules and systemd service files are meant to work for almost all users, they may not meet the requirements of everyone using them. We encourage those with more advanced configurations to design their own configurations and reach out for guidance as needed. Our configuration is also meant as a template that you can build on!

Conclusion

While it took quite a bit of effort to put all of the pieces together, the Linux boot utilities had all of the features we needed to accomplish this. We're excited for future work with other teams to make using Stratis as the root filesystem for Linux installations even easier!

Release version

All of the utilities required for booting from a Stratis filesystem as the root filesystem will be included in stratisd 2.4.0.

Multi-threading Support in stratisd

mulhern, Stratis Team

Introducing Support for Multi-threading in stratisd

stratisd is an entirely single-threaded application; it is a daemon with a single event loop that consults a list of possible event sources in a prescribed order, handling the events on each event source before proceeding to the next. The event sources are udev, device-mapper, and D-Bus events, which are handled in that order. stratisd can also be terminated cleanly by an interrupt signal, which it checks for on every loop iteration.

Because stratisd is single threaded, every action taken by stratisd must be completed before another action is performed. For example, if a client issues a D-Bus message to create a filesystem, that command will be processed, the engine will create a filesystem, and a response will be transmitted on the D-Bus before any other action can be taken. If another D-Bus message is received before the first is completed, that D-Bus message will then be processed. The engine will continue to process D-Bus messages until none are left, preventing it from handling any other categories of signals or events while any D-Bus messages remain.

For this reason, stratisd itself cannot parallelize long-running operations. It is well known that, for example, filesystem creation can be time-consuming, as it is necessary to write the filesystem metadata when creating the filesystem. Ideally, stratisd would be able to run such time-consuming operations in parallel, initiating one operation and then proceeding to initiate another before the first operation completes.

Additionally, as in the example above, if stratisd is continually receiving D-Bus messages, it will not proceed to deal with a device-mapper event, even if the device-mapper event is urgent and not in any conflict with the D-Bus messages, for example, if it is associated with a different pool than any D-Bus messages.
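The single-threaded behavior described above can be modeled in a few lines. This is a toy model and not the stratisd event loop: the `Source` type and `run_iteration` function are hypothetical, and events are reduced to integers held in in-memory queues.

```rust
use std::collections::VecDeque;

/// The three event sources, in the order in which the loop consults them.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Source {
    Udev,
    DeviceMapper,
    DBus,
}

/// One iteration of a single-threaded event loop: every pending event on a
/// source is handled before the next source is consulted.
pub fn run_iteration(
    udev: &mut VecDeque<u32>,
    device_mapper: &mut VecDeque<u32>,
    dbus: &mut VecDeque<u32>,
) -> Vec<(Source, u32)> {
    let mut handled = Vec::new();
    let sources = [
        (Source::Udev, udev),
        (Source::DeviceMapper, device_mapper),
        (Source::DBus, dbus),
    ];
    for (source, queue) in sources {
        while let Some(event) = queue.pop_front() {
            handled.push((source, event));
        }
    }
    handled
}
```

In this model, a device-mapper event that arrives while the D-Bus queue is being drained is not seen until the next iteration, no matter how urgent it is. That is the limitation that multi-threaded event handling is meant to remove.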

For these reasons, the next release introduces multi-threading capabilities into stratisd. These capabilities do not solve all the problems that multi-threading is intended to solve, but lay the essential foundation for multi-threaded event handling.

We have chosen to implement multi-threading using the Rust tokio crate. The alternative is to use operating system threads explicitly via the Rust standard library thread module. We have chosen tokio in order to get the benefits of code reuse from the tokio runtime, and because we expect that this choice will allow stratisd to operate efficiently while consuming fewer operating system resources.

We have also made use of the newest version of the dbus crate, which includes support for multi-threading via the dbus-tokio crate.

Nomenclature

The following words have a precise definition in the context of multi-threading:

  • object - an instance of a Rust struct and the methods implemented for it.
  • task - a task is a program which has been designed so that it can be run concurrently with its fellow tasks. The multi-threaded incarnation of stratisd consists of a set of tasks, some code to facilitate interactions between the tasks, and the tokio runtime.
  • context - the context of a process is all the information required to begin running the process when it resumes after having been suspended by the operating system scheduler. A context switch is the action of storing this information for the process being suspended and loading the information for the process being resumed.
  • thread - a thread is a sub-division of a process. When an operating system switches processes, a context switch is required. When an operating system switches threads, the new thread shares much of its process' context with the previous thread. Consequently, switching between threads, as opposed to processes, may be an order of magnitude less expensive. All threads belonging to a process share the same memory and may communicate via this shared memory.
  • runtime - the tokio runtime manages scheduling of tasks.
  • block - a task is said to block on an operation if the task must wait for the operation to complete and is not able to be replaced in the same thread by another task until the operation is completed. Examples of typical operations are I/O or network interactions. Another sort of operation is the acquisition of a shared resource via a mutex or other synchronization primitive.
  • blocking - a blocking task is a task that may block.
  • non-blocking - a non-blocking task is a task that does not need to wait for any operation to complete; if it initiates an operation that may take a while to complete, it is able to yield to another task, and may be resumed later from the instruction where it yielded.
  • mutex - a synchronization primitive which enforces mutual exclusion. With tokio the exclusion is enforced on a particular object. If a task obtains a mutex, it has exclusive use of the object until it releases the mutex. If a mutex is already held by another task, a task requesting the mutex may block or it may yield until the mutex can be obtained.
  • read/write lock - a mutex which is relaxed insofar as it allows multiple tasks to share an object if none mutate the object. If a task mutates the object then it must obtain exclusive possession of the mutex.
  • lock - to enter a mutex guarding an object is synonymous with locking an object.
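The difference between a mutex and a read/write lock can be observed directly. The sketch below uses the standard library's blocking primitives, not tokio's, and the non-blocking try_* methods so that the outcome can be inspected without waiting; `lock_behavior` is a hypothetical function written for this illustration.

```rust
use std::sync::{Mutex, RwLock};

/// Returns three observations, in order:
/// whether a second reader is admitted while a read lock is held,
/// whether a writer is admitted while a read lock is held,
/// whether a second holder is admitted while a mutex is held.
pub fn lock_behavior() -> (bool, bool, bool) {
    let shared = RwLock::new(vec![1, 2, 3]);
    let exclusive = Mutex::new(0u32);

    let first_reader = shared.read().unwrap();
    let second_reader_admitted = shared.try_read().is_ok();
    let writer_admitted = shared.try_write().is_ok();
    drop(first_reader);

    let holder = exclusive.lock().unwrap();
    let second_holder_admitted = exclusive.try_lock().is_ok();
    drop(holder);

    (second_reader_admitted, writer_admitted, second_holder_admitted)
}
```

Readers share the read/write lock with each other but exclude a writer, while the mutex excludes every other holder regardless of what that holder intends to do.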

Design

stratisd divides its work among a number of tasks which handle different event sources. Some tasks are non-blocking, others are blocking tasks. Non-blocking tasks may yield, and can share a single thread with other non-blocking tasks. The tasks communicate using two unbounded MPSC (Multi-Producer Single-Consumer) channels; a channel for udev events and a channel for D-Bus updates.

Termination Variable

One boolean variable, should_exit, is shared among some of the tasks. It is set to true only if SIGINT is detected. It is observed only by the udev task, which checks its value on every iteration of its loop, and immediately returns if the value is true. For all other tasks, termination is handled by tokio constructs. The udev task requires special handling because it does not yield and because it contains a non-terminating loop.

The dbus tree

The dbus tree is a data structure which contains the state of the D-Bus layers. Access to the dbus tree is controlled by a read/write lock.

The stratisd engine

The stratisd engine is the core of the stratisd daemon. It manages all the essential functionality of stratisd. Access to the engine is controlled by a mutex.

The dbus channel

The dbus channel is an unbounded multi-producer, single-consumer channel. It carries messages instructing the DbusTreeHandler how to update the dbus tree. The DbusTreeHandler task is the unique consumer of the messages. The DbusConnectionHandler, which processes D-Bus messages sent by the client, and the DbusUdevHandler, which handles udev events, may both place messages on the dbus channel.
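The shape of this channel, two producers and one consumer, can be sketched with the standard library's std::sync::mpsc channel, which is also unbounded. This is an analogy and not the stratisd code: stratisd uses tokio, and the `TreeUpdate` message type and `collect_updates` function here are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical update message; not the stratisd message type.
#[derive(Debug, PartialEq, Eq)]
pub enum TreeUpdate {
    Add(String),
}

/// Two producers place one message each on the channel; the single consumer
/// then drains the channel and returns everything it received.
pub fn collect_updates() -> Vec<TreeUpdate> {
    let (sender, receiver) = mpsc::channel();
    let connection_sender = sender.clone();
    let udev_sender = sender;

    // Stand-in for the DbusConnectionHandler.
    let connection = thread::spawn(move || {
        connection_sender
            .send(TreeUpdate::Add("filesystem".to_string()))
            .unwrap();
    });
    // Stand-in for the DbusUdevHandler.
    let udev = thread::spawn(move || {
        udev_sender.send(TreeUpdate::Add("pool".to_string())).unwrap();
    });
    connection.join().unwrap();
    udev.join().unwrap();

    // Both senders have been dropped, so the iterator ends once the
    // buffered messages have been consumed.
    let updates: Vec<TreeUpdate> = receiver.iter().collect();
    updates
}
```

The order in which the two messages arrive depends on thread scheduling, so a consumer must not rely on it.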

The udev channel

The udev channel is an unbounded multi-producer, single-consumer channel. It carries messages about udev events to the DbusUdevHandler. There is only one producer for this channel, the udev event handling task, which monitors udev events and places those events on the channel.

signal handling task

The signal handling task is a non-blocking task which waits for SIGINT. If it receives the signal it sets should_exit to true and finishes.

device-mapper event task

The device-mapper event task loops forever waiting for a device-mapper event. On receipt of any event, it locks the stratisd engine, and processes the event. It yields when waiting for a new device-mapper event or when waiting for a lock on the engine.

udev event handling task

The udev event handling task uses a polling mechanism to detect udev events. If a udev event is detected it places a message on the stratisd udev channel. It reads should_exit after every udev event or, if no udev event has occurred, after a designated time interval. If should_exit is true when read it returns immediately.
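The polling pattern described above, where the termination variable is read after every event or after a timeout, can be sketched as follows. This is a model under stated assumptions: an in-memory channel stands in for the udev monitor, events are plain integers, the 10 ms interval is arbitrary, and `poll_events` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{Receiver, RecvTimeoutError, Sender};
use std::sync::Arc;
use std::time::Duration;

/// Forwards events from `source` to `udev_channel`, checking `should_exit`
/// after every event and after every timeout.
pub fn poll_events(source: Receiver<u32>, udev_channel: Sender<u32>, should_exit: Arc<AtomicBool>) {
    loop {
        match source.recv_timeout(Duration::from_millis(10)) {
            // An event was detected: place it on the udev channel.
            Ok(event) => {
                let _ = udev_channel.send(event);
            }
            // No event within the designated interval: fall through to the check.
            Err(RecvTimeoutError::Timeout) => {}
            // The event source is gone, so there is nothing left to poll.
            Err(RecvTimeoutError::Disconnected) => return,
        }
        if should_exit.load(Ordering::SeqCst) {
            return;
        }
    }
}
```

The timeout is what guarantees that the loop notices should_exit even when no events arrive.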

D-Bus Tasks

The management of the D-Bus layer is handled by several cooperating tasks. The dbus crate supplies one task, which detects D-Bus messages and places them on its own unbounded channel. The stratisd tasks are the DbusTreeHandler task, the DbusConnectionHandler task, and the DbusUdevHandler task.

DbusTreeHandler task

stratisd defines a DbusTreeHandler task which updates the dbus tree and may also handle emitting D-Bus signals. It is the unique receiver on the stratisd dbus channel and the only task which obtains a write lock on the dbus tree. It is a non-blocking task.

DbusConnectionHandler task

stratisd defines a DbusConnectionHandler task which spawns a new task for every D-Bus method call. Each spawned task obtains a read lock on the dbus tree before it begins to process the D-Bus method call, and may also lock the engine. If it locks the engine, it blocks on the lock. Each spawned task may place messages on the stratisd dbus channel. Each task is responsible for sending replies to its D-Bus message on the D-Bus. This is the only part of the implementation where new tasks can be spawned during stratisd's regular operation.
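The locking discipline of the spawned tasks can be sketched with standard library threads and locks in place of tokio tasks. The `Tree` and `Engine` types below are hypothetical placeholders, as is `handle_calls`; only the order of lock acquisition follows the description above.

```rust
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Hypothetical stand-in for the dbus tree.
pub struct Tree {
    pub object_paths: Vec<String>,
}

/// Hypothetical stand-in for the engine.
pub struct Engine {
    pub calls_handled: u32,
}

/// Spawns one worker per method call and returns the number of calls the
/// engine has handled once all workers have finished.
pub fn handle_calls(tree: Arc<RwLock<Tree>>, engine: Arc<Mutex<Engine>>, calls: u32) -> u32 {
    let workers: Vec<_> = (0..calls)
        .map(|_| {
            let tree = Arc::clone(&tree);
            let engine = Arc::clone(&engine);
            thread::spawn(move || {
                // A read lock on the tree is shared among all workers.
                let tree_guard = tree.read().unwrap();
                let _known_paths = tree_guard.object_paths.len();
                // The engine lock is exclusive; the worker blocks here until
                // no other worker holds it.
                let mut engine_guard = engine.lock().unwrap();
                engine_guard.calls_handled += 1;
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }

    let handled = engine.lock().unwrap().calls_handled;
    handled
}
```

Because every worker takes only a read lock on the tree, the workers never exclude each other there; they serialize only on the engine.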

DbusUdevHandler task

stratisd defines a DbusUdevHandler task which removes udev event information from the stratisd udev channel, allows the engine to process it, and puts any messages that may be necessary as a result of the engine processing the udev event on the stratisd dbus channel. Currently, a udev event may result in a pool being set up; when that happens an add message must be placed on the dbus channel for every filesystem or block device belonging to the pool, as well as an add message for the pool itself. The DbusUdevHandler locks the engine when processing a udev event, but does not block on the lock.

Properties and Consequences

Unbounded Channels

Both the stratisd udev channel and the dbus channel are "unbounded channels". These "unbounded" channels are actually bounded, but the bound on the number of messages allowed on the channel is the maximum value of the Rust usize type. It is assumed that other machine limits will be encountered before the number of messages on the channel reaches that limit. Because both channels are unbounded, tasks do not block when placing a message on the channel; sending always succeeds.

We chose to make the dbus channel unbounded, as there exist two situations where a large number of messages may be placed on the channel. When a pool is constructed, the number of messages placed on the channel is proportional to the number of devices in the pool. On startup, when stratisd sets up a pool from its constituent devices, the number of messages is proportional to the number of devices and to the number of filesystems that the pool supports. We prefer to use an unbounded channel rather than to bound the number of filesystems by the channel size.

Generally speaking, we expect the number of messages on the channel, except on the occasion of pool creation or setup, to be no greater than 1; no other action currently implemented requires more than one message to be sent to the DbusTreeHandler. Messages will be rapidly consumed by the DbusTreeHandler, as it is the only task that takes a write lock on the dbus tree, and a task waiting for a write lock takes precedence over one waiting for a read lock.

The choice of unbounded channels also eliminates one possible source of deadlock.
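
The non-blocking send described above can be illustrated with a minimal sketch. It uses Python's standard queue module as a stand-in for the Rust channel; the function name and object paths are hypothetical and are not taken from stratisd.

```python
import queue

# maxsize=0 gives a queue without a size limit, so put() never blocks.
dbus_channel = queue.Queue(maxsize=0)


def announce_pool(pool_path, filesystem_paths, blockdev_paths):
    """Queue one add message for the pool and one per object belonging to it."""
    dbus_channel.put(("add", pool_path))
    for path in filesystem_paths + blockdev_paths:
        dbus_channel.put(("add", path))


# Setting up a pool with many filesystems queues many messages at once,
# yet the sender is never made to wait for the receiver.
announce_pool("/pool/0", [f"/fs/{i}" for i in range(1000)], ["/dev/0", "/dev/1"])
```

The number of queued messages grows with the number of filesystems and devices, which is the situation that motivated the choice of an unbounded channel.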

Bounded Number of Blocking Threads

We have accepted, at this time, the tokio default for the number of blocking threads, which is 512. Because the DbusConnectionHandler's generated tasks are blocking, this places an upper bound on the number of distinct D-Bus messages that can be handled concurrently. Note that it is quite possible for 512 D-Bus messages to be handled by just one thread, as each task may be run in sequence on a single thread if the tasks complete rapidly.
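
As a rough illustration of the last point, the sketch below uses a Python thread pool as a stand-in for the tokio blocking thread pool, with a limit of 4 workers instead of 512. The handler function is hypothetical; it only reports which thread ran it.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def handle(message_id):
    # Stand-in for handling one D-Bus message; returns the id of the worker thread.
    return threading.get_ident()


# 512 short tasks are submitted, but at most 4 threads ever exist.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle, range(512)))

thread_ids = set(results)
```

The pool size limits how many tasks run at the same instant, not how many tasks can be handled overall; when tasks finish quickly, a small number of threads handles all of them.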

We do not believe that this restriction will prove important in practice. The dbus crate's message channel is unbounded, so D-Bus messages cannot be dropped, although they may be handled very slowly if there is a backlog. Depending on the client's configuration, this may cause the client to hang indefinitely waiting for a response, or the client may receive a message indicating that no response was transmitted in the allotted time. However, this situation can only arise if many messages require long-running actions to be taken and if these messages are sent in parallel.

In any case, the improvement with respect to a single-threaded approach is obvious. In the existing single-threaded design, stratisd would be unable to handle any other events until all the D-Bus messages had been handled. With the multi-threaded design, udev and device-mapper events can be handled when they arrive, interspersed with the handling of the D-Bus messages.

One Task per D-Bus Message Model

In the single-threaded design, every D-Bus message is handled completely before handling of the next D-Bus message is begun. In our multi-threaded design, multiple D-Bus message handling tasks may be processed at the same time if the tokio scheduler allocates the message handling tasks to separate threads.

Each such task must:

  1. Acquire a read lock on the dbus tree.
  2. Query the tree in order to find the necessary information to call the engine method.
  3. Enter a mutex on the stratisd engine.
  4. Operate on the stratisd engine.
  5. Place any required messages on the dbus channel.
  6. Exit the mutex.
  7. Relinquish the read lock.
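
The seven steps above can be sketched as follows. This is a simplified model in Python rather than the stratisd implementation: the ReadWriteLock class, the handle_rename function, and the sample data are all hypothetical, and the draining function only stands in for the DbusTreeHandler.

```python
import queue
import threading


class ReadWriteLock:
    """Minimal reader-writer lock: many readers or a single writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()


tree_lock = ReadWriteLock()
engine_lock = threading.Lock()
dbus_channel = queue.Queue()                      # unbounded
dbus_tree = {"/pool/0": "uuid-0", "/pool/1": "uuid-1"}
engine_pools = {"uuid-0": "alpha", "uuid-1": "beta"}


def handle_rename(object_path, new_name):
    tree_lock.acquire_read()                      # step 1
    try:
        pool_uuid = dbus_tree[object_path]        # step 2
        with engine_lock:                         # steps 3 and 6
            engine_pools[pool_uuid] = new_name    # step 4
            dbus_channel.put(("rename", object_path, new_name))  # step 5
    finally:
        tree_lock.release_read()                  # step 7


def drain_tree_updates():
    """Stand-in for the tree handler: take the write lock for each update."""
    applied = []
    while not dbus_channel.empty():
        message = dbus_channel.get()
        tree_lock.acquire_write()
        try:
            applied.append(message)
        finally:
            tree_lock.release_write()
    return applied


tasks = [
    threading.Thread(target=handle_rename, args=("/pool/0", "gamma")),
    threading.Thread(target=handle_rename, args=("/pool/1", "delta")),
]
for task in tasks:
    task.start()
for task in tasks:
    task.join()
updates = drain_tree_updates()
```

Both tasks may hold the read lock at the same time, but only one at a time can be inside the engine mutex, so the order of the queued messages depends on which task enters the mutex first.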

While processing of each message will be started precisely in the sequence in which the messages arrive, the order in which messages complete may not be the same, because a later task may enter the engine mutex before an earlier task.

The motivation for this design is obvious, although the benefits are not yet realized in this preliminary multi-threading implementation. In future, we expect to relax the requirement that each task have exclusive access to the entire engine and lock only the relevant parts of the engine. With that extension two non-interfering D-Bus commands may be run separately. The same general advantage from this proposed enhancement will also be gained in the matter of, for example, handling device-mapper events while simultaneously handling a D-Bus method.

This change introduces a relaxation of certain properties that held in the single-threaded case.

  1. If a D-Bus method that mutates state and requires an update to the dbus tree is invoked, the changes to the dbus tree resulting from that method call are not visible until some time after the call has returned, because updates to the dbus tree can only occur after the method has completed. A client can observe this if it invokes a second method immediately after the first has returned. For example, if the client invokes the CreatePool method and then immediately invokes the GetManagedObjects() method, some object paths corresponding to the pool or its devices may not yet be present in the tree. The opposite behavior can also be observed: if the client invokes the DestroyPool method, some object paths belonging to the destroyed pool may still be found by a subsequent GetManagedObjects() invocation.

  2. If two D-Bus methods are invoked in separate processes, the same behaviors described in (1) are somewhat easier to observe.

  3. We believe that we have made it impossible to incorrectly update the tree by returning rich result types from the engine methods.

Given two distinct mutating D-Bus methods running in separate threads there is a possibility of a situation rather analogous to a race-condition arising. Two tasks may read the dbus tree, update the internal engine state, and then send update messages on the dbus channel. It is uncertain which task will acquire the engine mutex. This is partially analogous to the classic race-condition where two processes read a single variable, and then both update that variable in an undetermined order.

What makes this analogy only partial is the interposition of the engine, which restricts the updates that may be requested of the DbusTreeHandler by the tasks that the DbusConnectionHandler spawns. The engine methods invoked by the D-Bus layer return a result which sufficiently distinguishes the actions actually taken by the engine so that conflicting updates to the dbus tree cannot be requested. Thus the updates are constrained to be correct.

For example, consider that two conflicting commands may be handled at the same time: one command to delete a filesystem and the other to rename the same filesystem. If both commands are being handled in separate threads, each will read the same data based on the filesystem object. Then, either one may enter the engine mutex. If the rename task enters the mutex first, it will be the first to place a message on the dbus channel. The DbusTreeHandler will first take the rename message off the channel, and then the remove message, which is placed on the channel after the remove request completes. Clearly, this order of processing cannot result in an error. With the other order, the remove message will be placed on the dbus channel before the rename is attempted. In this case, however, the engine method will return a result indicating that no rename could occur, because the filesystem could not be found. Consequently, no rename message will be put on the dbus channel, and so the DbusTreeHandler will receive the remove message only. Thus, no incorrect update is performed on the dbus tree.
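
The two orderings can be played through in a small sketch. It is written in Python for brevity and is not the stratisd code: the Engine class, the boolean return values (a stand-in for the richer result types), and the identifiers are all hypothetical.

```python
import queue


class Engine:
    """Toy engine holding a single filesystem, keyed by a made-up UUID."""

    def __init__(self):
        self.filesystems = {"fs-uuid": "data"}

    def rename(self, uuid, new_name):
        # Report whether a rename actually took place.
        if uuid not in self.filesystems:
            return False
        self.filesystems[uuid] = new_name
        return True

    def remove(self, uuid):
        # Report whether a filesystem was actually removed.
        return self.filesystems.pop(uuid, None) is not None


def tree_messages(order):
    """Run the two commands in the given order; return the queued messages."""
    engine = Engine()
    channel = queue.Queue()
    for command in order:
        # A message is queued only if the engine reports that it acted.
        if command == "rename" and engine.rename("fs-uuid", "archive"):
            channel.put(("rename", "fs-uuid", "archive"))
        elif command == "remove" and engine.remove("fs-uuid"):
            channel.put(("remove", "fs-uuid"))
    messages = []
    while not channel.empty():
        messages.append(channel.get())
    return messages


rename_first = tree_messages(["rename", "remove"])
remove_first = tree_messages(["remove", "rename"])
```

In the first ordering both messages are queued, rename before remove. In the second, only the remove message is queued, because the engine reports that the rename did nothing.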

Error Behavior

stratisd exits if any task returns an error, using the same mechanism and general procedure that it uses on receipt of SIGINT. Causes of error may include:

  • an error when polling for udev events
  • an error when polling for device-mapper events
  • failure to properly set up a D-Bus connection on startup
  • an error when consuming a message on one of the stratisd channels

A properly handled error within the stratisd engine will not result in the termination of any tasks. In the case of a D-Bus method call, for example, an error result is interpreted by the D-Bus layer, and some representation of that error is then incorporated into the message returned on the D-Bus.

We have taken great care to avoid panics within stratisd. Nonetheless, it is reasonable to discuss possible behavior on any panic which may occur.

If one of the tasks dynamically spawned by the DbusConnectionHandler experiences a panic while executing, stratisd will not be terminated. Only the currently running task will fail to complete. When a new D-Bus message is received, a new task will be spawned and will execute as usual.

However, a panic that occurs during the execution of a task like the udev event handling task, of which there is only one spawned when stratisd is started, will cause stratisd to exit.

Ensuring a Clean and Prompt Exit

On SIGINT, stratisd should exit promptly and cleanly. This is ensured by:

  1. Having a separate signal handling task that waits on SIGINT. The tokio scheduler will ensure that this task is run regularly; thus the signal can not be ignored. Note that in the single-threaded case it is possible for the signal handling code never to be reached.
  2. Causing asynchronous tasks to terminate at their next synchronization point when the signal handling task terminates.
  3. Having the udev event handling loop check the flag set by the signal handling task on every iteration, and terminate if the flag is true.
  4. Allowing each distinct D-Bus method processing task to run to completion, so that every action that it has begun can be completed.
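
The third and fourth items can be modelled with a shared flag, as in the following sketch. It uses Python threads instead of tokio tasks, and every name in it is hypothetical; setting the flag directly stands in for the arrival of SIGINT.

```python
import threading
import time


def simulate_shutdown():
    should_exit = threading.Event()  # flag set by the signal handling task
    finished = []                    # list.append is thread-safe in CPython

    def udev_loop():
        # Check the flag on every iteration and stop once it is set.
        while not should_exit.is_set():
            time.sleep(0.01)         # stand-in for polling udev with a timeout
        finished.append("udev loop")

    def dbus_method_task():
        # Never interrupted, so the action it has begun runs to completion.
        time.sleep(0.05)             # stand-in for a long-running engine action
        finished.append("method call")

    workers = [
        threading.Thread(target=udev_loop),
        threading.Thread(target=dbus_method_task),
    ]
    for worker in workers:
        worker.start()
    should_exit.set()                # stand-in for the arrival of SIGINT
    for worker in workers:
        worker.join()
    return sorted(finished)
```

Calling simulate_shutdown() returns once both workers have stopped: the loop because it saw the flag, the method task because it finished its work.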

Statistics

Using tokio increases the size of the stratisd executable by about 1 MiB, which at stratisd's current size is an increase of approximately 20%.

Remarks

Preliminary multi-threading support will be included in the next stratisd release, 2.4.0.

Stratis 2.3.0 Release Notes

mulhern, Stratis Team

Stratis 2.3.0 adds additional flexibility to its encryption support via Clevis.

stratis 2.3.0

This release extends the pool unlock command, and adds two new commands, pool bind and pool unbind.

The pool bind command establishes an alternative mechanism for unlocking a pool. The user may select either the "tang" mechanism, which implements NBDE (Network-Bound Disk Encryption) by means of a Tang server, or the "tpm2" mechanism, which uses TPM 2.0 (Trusted Platform Module) encryption. Binding the devices in a pool to a supplementary Clevis encryption policy does not remove the primary encryption mechanism, which uses a key in the kernel keyring.

The pool unbind command simply unbinds a previously added encryption policy from all the devices in the specified pool.

In the pool unlock command it is now necessary to specify the mechanism. Use clevis to make use of the Clevis unlocking policy previously specified for the devices in the pool. Use keyring to make use of the mechanism that uses a key in the kernel keyring, which was introduced in Stratis 2.1.0. Note that the pool unlock command unlocks all currently locked pools.

stratisd 2.3.0

This release introduces two D-Bus interface revisions, which differ in the following way from the previous revisions.

org.storage.stratis2.Manager.r3 modifies the UnlockPool method to take an additional parameter, unlock_method, which may be keyring or clevis.

org.storage.stratis2.pool.r3 adds two new methods: Bind and Unbind. The Bind method takes two arguments, pin and json. The pin argument designates the Clevis pin as a string, and the json argument encodes a Clevis configuration appropriate to the designated pin. The configuration is a JSON object. Besides Clevis information, it may include Stratis-specific keys that encode configuration decisions that Stratis may implement. At present there is just one such key: stratis:tang:trust_url. The Unbind method reverses a Bind action.
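
A minimal sketch of how a client might assemble the two Bind arguments for the "tang" pin is shown below. The helper function and the server address are hypothetical. The "url" key follows the Clevis tang pin configuration; treating stratis:tang:trust_url as a boolean is an assumption here, so consult the D-Bus API Reference for the exact value type.

```python
import json


def tang_bind_arguments(url, trust_url=False):
    """Return the (pin, json) pair for a Bind call using the tang pin."""
    config = {"url": url}  # Clevis tang configuration: address of the Tang server
    if trust_url:
        # Stratis-specific key; the boolean value is an assumption.
        config["stratis:tang:trust_url"] = True
    return "tang", json.dumps(config)


pin, config_json = tang_bind_arguments("http://tang.example.com", trust_url=True)
```

Both values are strings, so the JSON object travels on the D-Bus in serialized form and stratisd can separate the Stratis-specific keys from the Clevis ones.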

Remarks

The Bind method may be called with any Clevis pin and configuration; we expect that any valid Clevis pin and configuration can be used to bind the devices in a pool. However, the Stratis project officially supports only the "tang" and "tpm2" pins, as those are the pins that may be designated via stratis. Support for additional Clevis policies may be introduced into stratis in later releases.

When binding a supplementary encryption policy to the devices in a pool using Clevis, the primary key, which is the key in the kernel keyring that was originally used to encrypt each device, must be supplied. stratisd obtains the appropriate key from the kernel keyring in order to provide it to the Clevis binding mechanism. The correct key must be present in the keyring for the bind operation to succeed. It is not necessary for the user to specify the key; stratisd obtains the necessary information from the LUKS2 metadata on the devices in the pool.

In general, it is unwise to write a key consisting of arbitrary binary data to a keyfile. An accidental newline character in the data may cause the contents of the file to be truncated at the newline when read in one context while all the data may be read from the file in some other context.
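
The hazard can be reproduced with a few lines of Python. The key bytes below are made up for illustration; the two reads stand in for two programs that consume the same keyfile in different ways.

```python
import os
import tempfile

# Arbitrary binary key material that happens to contain a newline byte.
key = b"\x8f\x02secret\n\x10tail"

with tempfile.NamedTemporaryFile(delete=False) as keyfile:
    keyfile.write(key)
    path = keyfile.name

# One consumer reads the whole file.
with open(path, "rb") as keyfile:
    whole = keyfile.read()

# Another consumer reads a single line and strips the terminator.
with open(path, "rb") as keyfile:
    first_line = keyfile.readline().rstrip(b"\n")

os.remove(path)
```

Here whole equals the original key, while first_line contains only the bytes before the newline, so the two consumers end up with different keys from the same file.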

We are not aware that such a mistake would result in any error in Stratis's operation when Stratis is used in the way that we recommend. We explicitly acknowledge that it might be possible, through some direct interaction with the stratisd D-Bus API, or by, e.g., setting a key in the kernel keyring without using stratis, to manufacture a situation where stratisd could not bind the devices in a pool, even when the correct key is set in the kernel keyring. We would not treat such a situation as evidence of a bug in Stratis.

Stratis 2.2.1 Release Notes

mulhern, Stratis Team

Stratis 2.2.1 is a bug fix release. It fixes the following bugs:

  • It was possible to cause stratisd to hang by leaving open a D-Bus connection when setting a key in the kernel keyring.
  • stratis would pass as arguments on the D-Bus and stratisd would accept relative, rather than absolute, path names to specify devices.
  • Pool and filesystem names that included characters that would be escaped by udev when constructing filesystem symlinks were permitted.
  • The man page entry for the key list command was missing.

Other general improvements were made, and several crate version requirements were increased.

Stratis 2.2.0 Release Notes

mulhern, Stratis Team

Stratis 2.2.0 now places Stratis filesystem symlinks in /dev/stratis, rather than /stratis. Stratis creates and maintains the symlinks by means of udev rules, rather than directly via stratisd as previously. The /stratis directory is neither created nor used by stratisd 2.2.0.

This release places management of the terminal setting for interactive encryption-key entry in stratisd rather than in stratis-cli.

This release also includes enhancements to the stratisd D-Bus interface, various bug fixes, and a change in the stratisd CLI specification for log levels.

stratisd 2.2.0

This release creates and maintains Stratis filesystem symlinks in /dev/stratis by means of udev rules. It includes a small Rust script, stratis_uuids_to_names, which is invoked by the Stratis udev rule that sets the Stratis filesystem symlinks.

In the case where stratisd is updated in place, some filesystem symlinks may remain in /stratis. This release includes a shell script, stratis_migrate_symlinks.sh, which may be used to clean up the /stratis directory and ensure that the correct symlinks exist in /dev/stratis. The script removes the /stratis directory once it has completed without error. The shell script relies on a small Rust script, stratis_dbusquery_version, which is included with this version of stratisd.

This release also extends the D-Bus interface in a few ways:

  • It sends org.freedesktop.DBus.ObjectManager.InterfacesAdded and org.freedesktop.DBus.ObjectManager.InterfacesRemoved signals on the D-Bus whenever a D-Bus object is added to or removed from the D-Bus interface.
  • It adds a new D-Bus property, PhysicalPath, for the org.storage.stratis2.blockdev.r2 interface. This property is principally useful for encrypted Stratis block devices; it identifies the block device on which the Stratis LUKS2 metadata resides.
  • It adds a new key, LockedPools, to the org.storage.stratis2.FetchProperties.r2 interface for objects that implement the org.storage.stratis2.Manager interface. This key returns a D-Bus object that maps the UUIDs of locked pools to their corresponding key descriptions.

Please consult the D-Bus API Reference for the precise D-Bus specification.

Stratis 2.1.0 Release Notes

mulhern, Stratis Team

Stratis 2.1.0 introduces support for encryption.

It supports per-pool encryption of the devices that form a pool's data tier. A pool may be encrypted, or its constituent encrypted devices may be activated, by means of a key stored in the kernel keyring.

stratisd 2.1.0

This release implements encryption support and adds several new D-Bus interfaces to administer or monitor that support.

It implements encryption support in the following way:

  • A single instance of stratisd can support both encrypted and unencrypted pools.
  • The choice to encrypt a pool must be made at the time a pool is created.
  • At present, the use of a cache and of encryption are mutually exclusive; if the pool is created with encryption enabled, then it is not possible to create a cache.
  • Each pool may be encrypted by means of a key in the kernel keyring; each encrypted pool may make use of a different key, but all devices in a pool are encrypted with a single key.
  • Any additional devices that are added to an encrypted pool's data tier will be encrypted using the key that was specified when the pool was initialized.

stratisd 2.1.0 supplies several new D-Bus interfaces:

  • org.storage.stratis2.Manager.r1: This interface supplies an extended CreatePool method, to support an optional argument for encryption. In addition, it supplies a number of methods for key management.
  • org.storage.stratis2.pool.r1: This interface supports explicit initialization of a cache tier. Previously, a cache was initialized as a side-effect of the addition of the first device to the cache tier. It also supports the new Encrypted property.
  • org.storage.stratis2.FetchProperties.r1: This interface supports an additional HasCache property.
  • org.storage.stratis2.Report.r1: This interface supports a set of ad-hoc reports about Stratis. The interface is unstable; the names by which the reports can be accessed are not guaranteed to remain stable, and the format of any report is only guaranteed to be valid JSON.

Please consult the D-Bus API Reference for the precise D-Bus specification.

The following are significant implementation details:

  • Each block device in an encrypted pool's data tier is encrypted with a distinct, randomly chosen MEK (Media Encryption Key) on initialization.
  • All devices belonging to a single encrypted pool share a single passphrase, supplied via the kernel keyring.
  • The release requires cryptsetup version 2.3.

We would like to thank our external contributor GuillaumeGomez for further work on metadata refactoring (stratisd issue 1573).

stratis-cli 2.1.0

This release requires stratisd 2.1.0. The user will observe the following changes:

  • The pool create command has been extended to allow encryption.
  • There is a new pool init_cache command, for initializing a cache.
  • There is a new subcommand, key, for key management tasks.
  • There is a new subcommand, report, which allows the display of certain reports generated by stratisd.
  • The output of pool list now includes a Properties column; each entry in the column is a string encoding the following properties of the pool:
    • whether or not it has a cache
    • whether or not it is encrypted

All commands now verify that stratis is communicating with a compatible version of stratisd and will fail with an appropriate error if stratisd is found to have an incompatible version.

Usage

To create an encrypted pool, a user must first ensure that a key is placed in the kernel keyring. We strongly encourage using the commands available via the stratis key subcommand for this task. This key, which is secret, has a corresponding key description, which is public.

An encrypted pool is then created by specifying the key description when using the pool create command.

It is necessary that the correct key and corresponding key description be set in the kernel keyring in order to set up a previously encrypted pool. Setting up a previously encrypted pool requires an explicit pool unlock command from the user. This command will attempt to unlock the devices belonging to any previously encrypted pool; it can only unlock all devices if a key for every encrypted pool is in the keyring. Once the devices belonging to a previously encrypted pool have been unlocked, the pool will be set up, and can be used in exactly the same manner as an unencrypted pool.

Cryptsetup Rust bindings release

John Baublitz, Stratis Team

One major focus in the Stratis project recently has been adding an encryption layer for data in Stratis pools. Cryptsetup provides a library backend for programmatically setting up device encryption, so we decided to write Rust bindings to access the existing Cryptsetup functionality in Rust.

While designing the bindings, we took every opportunity to make use of Rust's type system, leveraging features like reference lifetimes and type parameters to ensure that as much of our public API as possible can be validated by the compiler.

Though these bindings were designed with Stratis in mind, they are intended to be general-purpose, and so we encourage others to try them out. The license is MPLv2, but it becomes effectively GPL when linked with libcryptsetup. As a result, any project using our bindings will also need to be GPL or GPL-compatible.

If you're interested in seeing more, you can find the repository here.

Stratis 2.0.0 Release Notes

mulhern, Stratis Team

Stratis 2.0 is a significant update for both the daemon and the CLI. The changes to the daemon are covered first, followed by the changes to the CLI.

stratisd 2.0.0

This release makes the D-Bus API more robust, reliable, predictable, and extensible. There are several significant changes:

  • The set of D-Bus properties has been reduced to a core set of fundamental and stable properties. Other filesystem, pool, or block device properties are now obtainable via methods in the FetchProperties interface. This change increases the robustness of the D-Bus interface to failures occurring in any particular pool, filesystem, or block device, and decreases the computational cost of most operations requested by the Stratis CLI. Several properties, formerly returned as D-Bus properties, are now unavailable by means of the D-Bus. In every case, the reason for removing the property was that it did not represent a well-defined value. See project issue 52 for further details.

  • All D-Bus method calls are idempotent. This should make writing scripts using the D-Bus API much simpler and make reasoning about the behavior of the engine more straightforward. Henceforth, we will treat as a bug any non-idempotent behavior in the D-Bus API. See project issue 51 for further details.

  • All D-Bus size values are now returned in bytes. Again, this should make writing scripts against the D-Bus more straightforward, since it will be unnecessary for the script writer to change their interpretation of the number returned on the D-Bus depending on the value that it represents. See stratisd issue 1243 for further details.

Future enhancements to the D-Bus API will be implemented by means of additional versioned interfaces.

Please consult the D-Bus API Reference for the precise D-Bus specification.

stratis-cli 2.0.0

This release requires stratisd 2.0.0. The user will observe the following significant improvements:

  • The CLI is significantly more robust. Previously, there was a category of error conditions in pools, filesystems, and block devices that would make the CLI virtually unusable; this problem has now been entirely resolved. See project issue 52 for further details.

  • The CLI now reports errors consistently in conditions where a human user would generally expect an error to be reported. Previously, many commands in the CLI were idempotent, to facilitate scripting. Now there is a clear distinction between the CLI behavior and the stratisd D-Bus API behavior: the CLI behavior is designed strictly according to the expectations of a human user, while the stratisd D-Bus API is the programmable interface. See project issue 51 for further details.
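
The difference between the two behaviors can be sketched as follows. This is an illustrative Python model, not the stratisd or stratis-cli code: the Engine class, the CliError exception, and the boolean "changed" result are hypothetical, and the sketch ignores the case where an existing pool name is requested with different devices.

```python
class Engine:
    """Toy model of an idempotent API: repeating a call is not an error."""

    def __init__(self):
        self.pools = {}

    def create_pool(self, name, devices):
        # Return True if the pool was created, False if it already existed.
        if name in self.pools:
            return False
        self.pools[name] = frozenset(devices)
        return True


class CliError(Exception):
    """Error reported to a human user of the command-line tool."""


def cli_pool_create(engine, name, devices):
    # A human expects an error when asking to create something that exists.
    if not engine.create_pool(name, devices):
        raise CliError(f"pool {name} already exists")


engine = Engine()
first = engine.create_pool("pool1", ["/dev/vdb"])
second = engine.create_pool("pool1", ["/dev/vdb"])  # same request, no error
```

A script calling the API can repeat the request safely and inspect the result, whereas cli_pool_create(engine, "pool1", ["/dev/vdb"]) would now raise CliError.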

As always, anyone wishing to implement a program that uses Stratis for storage management is strongly advised to make use of the stratisd D-Bus API rather than the CLI.

stratis-cli 1.1.0 Release Notes

mulhern, Stratis Team

With this release, stratis now recognizes an environment variable, STRATIS_DBUS_TIMEOUT. This environment variable controls the timeout for any individual D-Bus call that stratis makes. You may want to set it to a higher value than the default, which is 120 seconds, if you are running tests or otherwise scripting via stratis, and wish to avoid spurious errors resulting from slow operations in your testing environment. See stratis-cli issue 252 for further details.

This release also introduces simplified and more complete error-reporting. For stratis, it constitutes an error if any command issued results in a Python stack trace. If you experience any such incident, please report it in a GitHub issue, including the full stack trace and the circumstances that led up to the incident.

stratisd 1.0.6 Release Notes

mulhern, Stratis Team

This release includes one significant bug fix and a substantial refactoring.

The bug was caused by an inconsistency in the metadata handling which led to a failure to properly update the Stratis metadata if stratisd was restarted in an environment where the system clock indicated a time earlier than when it had previously been running. See stratisd issue 1509 for further details.

Stratis 1.0 Release Notes

Friday September 28, 2018

New Features

Initial Stable Stratis Release

Stratis is a Linux local storage management tool that aims to enable easy use of advanced storage features such as thin provisioning, snapshots, and pool-based management and monitoring.

After two years of development, Stratis 1.0 has stabilized its on-disk metadata format and command-line interface, and is ready for more widespread testing and evaluation by potential users. Stratis is implemented as a daemon – stratisd – as well as a command-line configuration tool called stratis, and works with Linux kernel versions 4.14 and up.

Stratis 0.5 Release Notes

March 8, 2018

This release is suitable for developers and early testers. It should not be used with valuable data, and pools created with this release will not be supported in Stratis 1.0, due to upcoming on-disk format changes.