MOKSHA-2026-0018: HA Timeout Manipulation via Pool.other_config (Split-Brain/Blindness)

Advisory ID: MOKSHA-2026-0018
Semantic ID: PLOC-2
Published: 2026-04-24
CVSS 3.1: 7.6 High
CVSS 3.1 Vector: AV:N/AC:L/PR:H/UI:N/S:C/C:N/I:N/A:H
CVSS 4.0: 8.2 High
CVSS 4.0 Vector: AV:N/AC:L/AT:N/PR:H/UI:N/VC:N/VI:N/VA:H/SC:N/SI:N/SA:H
XAPI Object: Pool
XAPI Field: other_config:default_ha_timeout
Entry Role: pool-operator
Researcher: Jakob Wolffhechel, Moksha

Affected Products

Vendor                         Product                        Versions
Citrix / Cloud Software Group  XenServer / Citrix Hypervisor  all versions (shared XAPI codebase)
Vates                          XCP-ng                         8.3.0

Summary

A pool-operator can manipulate the High Availability (HA) timeout by setting Pool.other_config:default_ha_timeout to an arbitrary integer. The value is read at xapi_ha.ml:278-279 and parsed with int_of_string, with no range check. A timeout of 1 second causes spurious HA fencing: healthy hosts are marked dead, which triggers cascading false fencing across the pool (a split-brain condition). A timeout of 999999 seconds effectively disables HA: real host failures go undetected for about 11.5 days, leaving HA-protected VMs without failover protection. Both outcomes affect every HA-protected VM in the pool.

Vulnerability Description

Pool.other_config is the highest-scope other_config field in the XAPI data model. The default_ha_timeout key overrides the default HA heartbeat timeout used to determine whether a host is alive or dead.

Data Flow

pool-operator calls Pool.add_to_other_config(pool, "default_ha_timeout", "1")
  -> xapi_ha.ml:278-279 reads default_ha_timeout via int_of_string
  -> No range validation performed
  -> HA subsystem uses 1-second timeout for heartbeat monitoring
  -> Normal network latency exceeds 1 second -> all hosts marked dead
  -> Cascading fencing events: hosts reboot each other (split-brain)
pool-operator calls Pool.add_to_other_config(pool, "default_ha_timeout", "999999")
  -> HA subsystem uses ~11.5-day timeout
  -> Host failures not detected for days
  -> HA-protected VMs not restarted after actual host failure
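
The read path in the data flow above can be sketched in a few lines of OCaml. This is a minimal model and not the code at xapi_ha.ml:278-279: the function name read_timeout_unchecked, the ~default parameter (with 60 as a placeholder value), and the association-list representation of other_config are assumptions made for illustration. It shows only that int_of_string accepts any well-formed integer, including negative ones, so a lookup of this shape passes 1 or 999999 straight through.

```ocaml
(* Hypothetical model of an unvalidated timeout lookup.
   Not the actual xapi code; names and the default are assumptions. *)
let key = "default_ha_timeout"

(* other_config is modeled as a (string * string) association list. *)
let read_timeout_unchecked ~default other_config =
  match List.assoc_opt key other_config with
  | None -> default
  | Some v -> int_of_string v (* parses the string, performs no range check *)

let () =
  List.iter
    (fun v ->
      (* 60 is a placeholder fallback, not the real product default. *)
      let t = read_timeout_unchecked ~default:60 [ (key, v) ] in
      Printf.printf "%s -> %d seconds\n" v t)
    [ "1"; "999999"; "-5" ]
```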

Two Attack Modes

Mode 1 - Split-brain (timeout too low): Setting default_ha_timeout=1 causes the HA daemon to declare hosts dead after 1 second of missed heartbeats. Normal network jitter exceeds this threshold, triggering false fencing events. Multiple hosts simultaneously fence each other, causing a cascading reboot loop. All HA-protected VMs restart repeatedly.

Mode 2 - HA blindness (timeout too high): Setting default_ha_timeout=999999 makes HA unable to detect actual host failures for approximately 11.5 days. During this window, a failed host's HA-protected VMs are not restarted on surviving hosts.
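
Both modes reduce to one comparison between how long a host has been silent and the configured timeout. The sketch below is a deliberately simplified model under that assumption; the real HA daemon's liveness logic is more involved, and declared_dead and seconds_to_days are hypothetical helper names. It also reproduces the roughly 11.5-day figure.

```ocaml
(* Simplified liveness model: a host is declared dead once it has been
   silent for at least the configured timeout. Hypothetical helper. *)
let declared_dead ~timeout_s ~silent_for_s = silent_for_s >= timeout_s

let seconds_to_days s = float_of_int s /. 86_400.0

let () =
  (* Mode 1: a healthy host whose heartbeat is 2 s late is declared dead. *)
  Printf.printf "timeout=1, 2 s late: dead=%b\n"
    (declared_dead ~timeout_s:1 ~silent_for_s:2);
  (* Mode 2: a host that has been silent for a full day still counts as alive. *)
  Printf.printf "timeout=999999, 1 day silent: dead=%b\n"
    (declared_dead ~timeout_s:999_999 ~silent_for_s:86_400);
  Printf.printf "999999 s = %.2f days\n" (seconds_to_days 999_999)
```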

Root Causes

  1. Missing RBAC protection. Pool.other_config has zero map_keys_roles entries for infrastructure keys. The default_ha_timeout key is writable by pool-operator.

  2. No range validation. xapi_ha.ml uses int_of_string with no bounds check. Any integer value is accepted, including values that make HA non-functional.

  3. Pool-wide blast radius. The HA timeout applies to the entire pool. A single key write affects the failure detection behavior for every host and HA-protected VM.

  4. Immediate effect. The changed timeout is read on the next HA monitoring cycle, with no confirmation or cooldown period.
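
Root cause 1 can be illustrated with a toy per-key role lookup. This is not xapi's RBAC implementation: the fallback rule (a key with no per-key entry inherits the role guarding the map field as a whole), the exact-match lookup, and the function names required_role and can_write are assumptions for illustration. The role names are the standard XAPI roles, listed from least to most privileged.

```ocaml
(* Toy RBAC model; not xapi's implementation. Roles are declared from least
   to most privileged so that structural comparison orders them. *)
type role =
  | Read_only
  | Vm_operator
  | Vm_admin
  | Vm_power_admin
  | Pool_operator
  | Pool_admin

(* Assumption: a key with no per-key entry falls back to the role that
   guards the map field as a whole. Only exact key matches are modeled. *)
let required_role ~field_default map_keys_roles key =
  match List.assoc_opt key map_keys_roles with
  | Some r -> r
  | None -> field_default

let can_write ~caller ~field_default map_keys_roles key =
  caller >= required_role ~field_default map_keys_roles key

let () =
  let key = "default_ha_timeout" in
  let check label entries =
    Printf.printf "%s: pool-operator may write = %b\n" label
      (can_write ~caller:Pool_operator ~field_default:Pool_operator entries key)
  in
  check "no per-key entry" [];
  check "entry requiring pool-admin" [ (key, Pool_admin) ]
```

In this model a single per-key entry is enough to move the key out of pool-operator's reach without changing access to the rest of the map.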

Affected Systems

Directly Affected

Indirectly Affected

Exploitation Scenarios

Scenario: Split-brain (timeout=1)
  Impact: Cascading fencing, all hosts reboot repeatedly
  Pre-conditions: HA enabled
  Status: Modeled (code-traced: int_of_string with no range check at xapi_ha.ml:278-279)

Scenario: HA blindness (timeout=999999)
  Impact: Host failures undetected for days
  Pre-conditions: HA enabled
  Status: Modeled (code-traced)

Scenario: Storage corruption on split-brain
  Impact: Concurrent access violations on shared storage during fencing
  Pre-conditions: HA + shared storage
  Status: Modeled

Scenario: BOC-1 chain
  Impact: vm-admin escalates to pool-operator via BOC-1 S3, then manipulates HA timeout
  Pre-conditions: BOC-1 available
  Status: Modeled (two-step chain)

Detection

Remediation

Short-Term Mitigations

Long-Term Fix

RBAC restriction. Add a map_keys_roles entry for default_ha_timeout in datamodel.ml that requires _R_POOL_ADMIN.

Range validation. Validate default_ha_timeout at write time. Enforce a reasonable range (e.g., 10-600 seconds) and reject values outside it.

Write-time type checking. Validate that the value is a valid integer at write time, not just at read time.
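
The range and type checks could be combined into one write-time validator along the following lines. This is a sketch under stated assumptions, not the upstream patch: validate_ha_timeout is a hypothetical name, and the bounds simply reuse the 10-600 second example range suggested above.

```ocaml
(* Hypothetical write-time validator. The bounds follow the 10-600 s example
   range and are not confirmed values from the upstream patch. *)
let min_timeout_s = 10
let max_timeout_s = 600

let validate_ha_timeout (raw : string) : (int, string) result =
  match int_of_string_opt raw with
  | None -> Error (Printf.sprintf "not an integer: %S" raw)
  | Some n when n < min_timeout_s || n > max_timeout_s ->
      Error
        (Printf.sprintf "%d is outside %d-%d seconds" n min_timeout_s
           max_timeout_s)
  | Some n -> Ok n

let () =
  List.iter
    (fun raw ->
      match validate_ha_timeout raw with
      | Ok n -> Printf.printf "%S accepted as %d s\n" raw n
      | Error msg -> Printf.printf "%S rejected: %s\n" raw msg)
    [ "1"; "60"; "999999"; "sixty" ]
```

Returning a result rather than raising lets the caller reject the write with a descriptive error before the value is stored. Note that int_of_string_opt also accepts non-decimal literals such as 0x3C, so a stricter check may want to require decimal digits only.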

Upstream patches exist. They are held privately pending coordinated disclosure.

Disclosure

References

Credits

Discovered and reported by Jakob Wolffhechel, Moksha.

Jakob Wolffhechel · Moksha · Copenhagen
jakob@wolffhechel.dk · +45 3170 7337
Published 2026-04-24 08:00 CEST · cna.moksha.dk · shittrix.moksha.dk