May 22, 2026

VCF 5.2: Adding Hosts to a Stretched Cluster Fails with HOST_NETWORK_VALIDATION_FAILED

WARNING: Do not try this at home. Call VMware Support! Only the lucky, brave and immortal should consider reading on.

Background


The idea behind VMware Cloud Foundation is to tie all of VMware's datacenter products closer together into a single solution that just works. There is a GUI where you can do many of the most common things, but sometimes you need to dig a few layers deeper to get what you need. 

VMware Cloud Foundation (VCF) 5.2 supports stretched vSAN clusters across two availability zones (AZs). Expanding a stretched cluster with new hosts is one of those tasks that you can't do from the GUI; you have to make REST calls against the API instead.

When expanding such a cluster by adding newly prepared hosts, you use the POST /v1/clusters/{clusterId}/expand API endpoint. Broadcom's official documentation describes the required JSON spec, but as we discovered, it's missing a critical field that will cause the operation to fail every time in certain environments.

Environment

  • VCF Version: 5.2.2.0 Build 24936865
  • Cluster type: vSAN Stretched Cluster across two AZs
  • AZ1 hosts: Management VLAN 100, Network Pool CORP-AZ1
  • AZ2 hosts: Management VLAN 200, Network Pool CORP-AZ2
  • Fault domains: CORP-CLUSTER01_primary-az-faultdomain (AZ1) and CORP-CLUSTER01_secondary-az-faultdomain (AZ2)
  • Goal: Add 2 hosts to each fault domain (4 hosts total)

What the documentation says


{
  "clusterExpansionSpec": {
    "hostSpecs": [
      {
        "id": "<host-uuid>",
        "licenseKey": "<license-key>",
        "azName": "CORP-CLUSTER01_primary-az-faultdomain",
        "hostNetworkSpec": {
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      }
    ]
  }
}
Straightforward enough. We followed this exactly, matching each host to its correct fault domain based on VLAN. It failed immediately.

The error

HOST_NETWORK_VALIDATION_FAILED
Host corpesxi01.corp.local VlanId 100 is not same as availability zones VlanId 200
The log from ValidateExpandStretchHostNetworkAction told the full story:

ERROR  Host corpesxi01.corp.local VlanId 100 is not same as availability zones VlanId 200
DEBUG  Getting service credential with entity ID <AZ2-host-uuid>
INFO   Host params: ip: 10.10.200.11, username: svc-vcf-corpesxi03
DEBUG  Connecting to https://corpesxi03.corp.local:443/sdk
The validator reported a failure for an AZ1 host, but immediately connected to an AZ2 host (corpesxi03) to perform the check. It was reading the expected VLAN from an existing host in the wrong availability zone, so it compared an AZ1 host (VLAN 100) against the VLAN it read from an AZ2 host (VLAN 200).

This seems to be a bug in ValidateExpandStretchHostNetworkAction.java in VCF 5.2.2. When no networkPoolId is provided in the spec, the validator attempts to determine the expected management VLAN by connecting to an existing host already in the target fault domain. Due to a scoping issue, it sometimes selects a host from the wrong fault domain, causing the VLAN comparison to fail regardless of how correct your azName mapping is.

Solution: Add networkPoolId to hostNetworkSpec

The solution is to explicitly provide the networkPoolId in each host's hostNetworkSpec. This gives the validator a direct reference to look up the expected VLAN from the network pool definition, bypassing the broken host-lookup logic entirely.

First, retrieve your network pool IDs:
$headers = @{ Authorization = "Bearer $token" }
$pools = Invoke-RestMethod -Method Get `
    -Uri "https://sddc-manager.corp.local/v1/network-pools" `
    -Headers $headers

$pools.elements | ForEach-Object {
    Write-Host "Pool: $($_.name)  ID: $($_.id)"
}
Example output:
Pool: CORP-AZ1  ID: aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb
Pool: CORP-AZ2  ID: cccccccc-4444-5555-6666-dddddddddddd
Then verify which hosts belong to which pool before building your spec:
Invoke-RestMethod -Method Get `
    -Uri "https://sddc-manager.corp.local/v1/hosts?status=UNASSIGNED_USEABLE" `
    -Headers $headers |
    Select-Object -ExpandProperty elements |
    Select-Object id, fqdn,
        @{N="vlan"; E={($_.networks | Where-Object {$_.type -eq "MANAGEMENT"}).vlanId}},
        @{N="pool"; E={$_.networkpool.name}} |
    Format-Table -AutoSize
Example output:
id                                   fqdn                  vlan  pool
--                                   ----                  ----  ----
aaaaaaaa-0001-0001-0001-000000000001  corpesxi05.corp.local  100  CORP-AZ1
aaaaaaaa-0001-0001-0001-000000000002  corpesxi06.corp.local  100  CORP-AZ1
aaaaaaaa-0001-0001-0001-000000000003  corpesxi07.corp.local  200  CORP-AZ2
aaaaaaaa-0001-0001-0001-000000000004  corpesxi08.corp.local  200  CORP-AZ2
Now build the corrected spec with networkPoolId included:
{
  "clusterExpansionSpec": {
    "hostSpecs": [
      {
        "id": "aaaaaaaa-0001-0001-0001-000000000001",
        "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
        "azName": "CORP-CLUSTER01_primary-az-faultdomain",
        "hostNetworkSpec": {
          "networkPoolId": "aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb",
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      },
      {
        "id": "aaaaaaaa-0001-0001-0001-000000000002",
        "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
        "azName": "CORP-CLUSTER01_primary-az-faultdomain",
        "hostNetworkSpec": {
          "networkPoolId": "aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb",
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      },
      {
        "id": "aaaaaaaa-0001-0001-0001-000000000003",
        "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
        "azName": "CORP-CLUSTER01_secondary-az-faultdomain",
        "hostNetworkSpec": {
          "networkPoolId": "cccccccc-4444-5555-6666-dddddddddddd",
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      },
      {
        "id": "aaaaaaaa-0001-0001-0001-000000000004",
        "licenseKey": "XXXXX-XXXXX-XXXXX-XXXXX-XXXXX",
        "azName": "CORP-CLUSTER01_secondary-az-faultdomain",
        "hostNetworkSpec": {
          "networkPoolId": "cccccccc-4444-5555-6666-dddddddddddd",
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      }
    ]
  }
}
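With four hosts the spec is already repetitive, and a copy-paste slip in a networkPoolId is easy to make. As a minimal sketch, the spec could be generated from one line per host. The helper names host_spec and build_spec are hypothetical, the vmnic and VDS names are the example values used above, and the input format (host ID, license key, azName, pool ID separated by spaces) is an assumption made for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: print one hostSpecs entry.
# Arguments: host UUID, license key, fault domain (azName), network pool UUID.
# vmnic IDs and the VDS name are the example values from this post.
host_spec() {
    local host_id="$1" license="$2" az_name="$3" pool_id="$4"
    cat <<EOF
      {
        "id": "${host_id}",
        "licenseKey": "${license}",
        "azName": "${az_name}",
        "hostNetworkSpec": {
          "networkPoolId": "${pool_id}",
          "vmNics": [
            { "id": "vmnic0", "vdsName": "CORP-VDS001" },
            { "id": "vmnic2", "vdsName": "CORP-VDS001" }
          ]
        }
      }
EOF
}

# Hypothetical helper: read "hostId licenseKey azName poolId" lines from stdin
# and print a complete expansion spec.
build_spec() {
    local sep="" host_id license az_name pool_id
    printf '{\n  "clusterExpansionSpec": {\n    "hostSpecs": [\n'
    while read -r host_id license az_name pool_id; do
        [[ -z "$host_id" ]] && continue
        # Command substitution strips the trailing newline, so the comma
        # separator lands directly after the closing brace of the previous entry.
        printf '%s%s' "$sep" "$(host_spec "$host_id" "$license" "$az_name" "$pool_id")"
        sep=$',\n'
    done
    printf '\n    ]\n  }\n}\n'
}

# Example: one host per fault domain, using the placeholder IDs from this post.
build_spec <<'HOSTS'
aaaaaaaa-0001-0001-0001-000000000001 XXXXX-XXXXX-XXXXX-XXXXX-XXXXX CORP-CLUSTER01_primary-az-faultdomain aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb
aaaaaaaa-0001-0001-0001-000000000003 XXXXX-XXXXX-XXXXX-XXXXX-XXXXX CORP-CLUSTER01_secondary-az-faultdomain cccccccc-4444-5555-6666-dddddddddddd
HOSTS
```

Redirecting the output to a file gives you something to review line by line before it goes anywhere near SDDC Manager.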

Key takeaway

The mapping rule is simple:
Host VLAN  Network Pool  azName                    networkPoolId
---------  ------------  ------                    -------------
VLAN 100   CORP-AZ1      primary-az-faultdomain    AZ1 pool UUID
VLAN 200   CORP-AZ2      secondary-az-faultdomain  AZ2 pool UUID
Always verify VLAN <-> pool <-> fault domain mapping before submitting by querying /v1/hosts?status=UNASSIGNED_USEABLE and cross-referencing with /v1/network-pools. The networkPoolId field is not documented as required in the Broadcom docs for VCF 5.2.2, but omitting it triggers a validator bug that causes the operation to fail with a misleading VLAN mismatch error.
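The cross-check can also be done mechanically before anything is submitted. The following is only a sketch: the function names expected_for_pool and check_host are hypothetical, and the hard-coded VLANs, pool names and fault domain names are the example values from this post, so they would need to be replaced with your own.

```shell
#!/bin/bash
# Hypothetical pre-flight check: confirm that a host's VLAN, network pool and
# fault domain agree with the mapping table before the host goes into the spec.

# Print "<vlan> <azName>" for a known pool name; fail for unknown pools.
expected_for_pool() {
    case "$1" in
        CORP-AZ1) echo "100 CORP-CLUSTER01_primary-az-faultdomain" ;;
        CORP-AZ2) echo "200 CORP-CLUSTER01_secondary-az-faultdomain" ;;
        *) return 1 ;;
    esac
}

# check_host <fqdn> <vlan> <pool> <azName>
# Returns 0 and prints "OK ..." only when VLAN and azName both match the pool.
check_host() {
    local fqdn="$1" vlan="$2" pool="$3" az_name="$4" expected
    if ! expected=$(expected_for_pool "$pool"); then
        echo "MISMATCH $fqdn: unknown network pool '$pool'"
        return 1
    fi
    if [[ "$vlan $az_name" != "$expected" ]]; then
        echo "MISMATCH $fqdn: got '$vlan $az_name', expected '$expected'"
        return 1
    fi
    echo "OK $fqdn"
}

check_host corpesxi05.corp.local 100 CORP-AZ1 CORP-CLUSTER01_primary-az-faultdomain
check_host corpesxi07.corp.local 200 CORP-AZ2 CORP-CLUSTER01_secondary-az-faultdomain
```

The VLAN and pool values would come from the /v1/hosts query shown earlier, and the azName from what you intend to put in the spec.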

Cleaning up stuck tasks before trying again

When this process fails you will need to clean up stuck subtasks before trying again with a JSON file as described above. You will get ~69 subtasks that show up as Scheduled in the SDDC Manager GUI. It *is* possible to clean them up manually, but after repeatedly trying to fix this issue I wrote a small shell script to speed up the process:
#!/bin/bash
# vcf-cleanup-stuck-tasks.sh
# Cleans up INITIALIZED (stuck) processing_task records in VCF domainmanager DB
# Usage: ./vcf-cleanup-stuck-tasks.sh [execution_id]
#   If no execution_id is provided, cleans up the latest EXPAND_STRETCHED_CLUSTER run

set -euo pipefail

PGHOST="localhost"
PGUSER="postgres"
PGDB="domainmanager"
PSQL="psql -h $PGHOST -U $PGUSER -d $PGDB -t -A"

RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
CYAN='\033[0;36m'
NC='\033[0m'

log()  { echo -e "${CYAN}[INFO]${NC}  $*"; }
ok()   { echo -e "${GREEN}[OK]${NC}    $*"; }
warn() { echo -e "${YELLOW}[WARN]${NC}  $*"; }
err()  { echo -e "${RED}[ERROR]${NC} $*"; exit 1; }

# ── Resolve execution ID ──────────────────────────────────────────────────────
if [[ $# -ge 1 ]]; then
    EXEC_ID="$1"
    log "Using provided execution ID: $EXEC_ID"
else
    log "No execution ID provided — looking up latest EXPAND_STRETCHED_CLUSTER..."
    EXEC_ID=$($PSQL -c "
        SELECT id FROM execution
        WHERE name = 'EXPAND_STRETCHED_CLUSTER'
        ORDER BY start_time DESC
        LIMIT 1;
    " 2>/dev/null | head -1)
    [[ -z "$EXEC_ID" ]] && err "No EXPAND_STRETCHED_CLUSTER execution found in DB."
    log "Found latest execution: $EXEC_ID"
fi

# ── Verify execution exists and is in a terminal state ───────────────────────
EXEC_STATUS=$($PSQL -c "
    SELECT execution_status FROM execution WHERE id = '$EXEC_ID';
" 2>/dev/null | head -1)

[[ -z "$EXEC_STATUS" ]] && err "Execution ID '$EXEC_ID' not found in database."

if [[ "$EXEC_STATUS" == "IN_PROGRESS" ]]; then
    err "Execution $EXEC_ID is still IN_PROGRESS — will not cancel subtasks of a running workflow."
fi

log "Execution status: $EXEC_STATUS"

# ── Show current task status breakdown ───────────────────────────────────────
echo ""
log "Current processing_task status breakdown:"
$PSQL -c "
    SELECT status, COUNT(*) as count
    FROM processing_task
    WHERE execution_id = '$EXEC_ID'
    GROUP BY status
    ORDER BY count DESC;
" 2>/dev/null | column -t -s '|'
echo ""

# ── Count INITIALIZED tasks ───────────────────────────────────────────────────
INIT_COUNT=$($PSQL -c "
    SELECT COUNT(*) FROM processing_task
    WHERE execution_id = '$EXEC_ID'
      AND status = 'INITIALIZED';
" 2>/dev/null | head -1)

if [[ "$INIT_COUNT" -eq 0 ]]; then
    ok "No INITIALIZED tasks found — nothing to clean up."
    exit 0
fi

warn "Found $INIT_COUNT INITIALIZED (stuck) tasks for execution $EXEC_ID"

# ── Confirm ───────────────────────────────────────────────────────────────────
read -r -p "$(echo -e "${YELLOW}Cancel these $INIT_COUNT tasks and restart domainmanager? [y/N]:${NC} ")" CONFIRM
[[ "${CONFIRM,,}" != "y" ]] && { warn "Aborted by user."; exit 0; }

# ── Cancel stuck tasks ────────────────────────────────────────────────────────
log "Cancelling $INIT_COUNT INITIALIZED tasks..."
UPDATED=$($PSQL -c "
    UPDATE processing_task
    SET status = 'CANCELLED'
    WHERE execution_id = '$EXEC_ID'
      AND status = 'INITIALIZED';
" 2>/dev/null | grep -oP '\d+' | head -1 || echo "0")

ok "Cancelled $UPDATED tasks."

# ── Verify ────────────────────────────────────────────────────────────────────
echo ""
log "Updated status breakdown:"
$PSQL -c "
    SELECT status, COUNT(*) as count
    FROM processing_task
    WHERE execution_id = '$EXEC_ID'
    GROUP BY status
    ORDER BY count DESC;
" 2>/dev/null | column -t -s '|'
echo ""

# ── Restart domainmanager ─────────────────────────────────────────────────────
log "Restarting domainmanager.service..."
systemctl restart domainmanager.service

log "Waiting 30 seconds for service to come up..."
sleep 30

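# is-active exits non-zero when the service is not active; "|| true" keeps
# set -e from aborting the script before the status check below can report it.
STATUS=$(systemctl is-active domainmanager.service || true)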
if [[ "$STATUS" == "active" ]]; then
    ok "domainmanager.service is running."
else
    err "domainmanager.service failed to start (status: $STATUS). Check: journalctl -u domainmanager.service -n 50"
fi

echo ""
ok "Cleanup complete. Verify the SDDC Manager UI task list is clear before resubmitting."
Good luck!

June 3, 2024

Hosts out of sync after restoring vCenter

 Background

After working with VMware Support on a case we were asked to install a special patch on the vCenter server. It turned out this patch broke some unrelated functionality we needed (remounting RDM disks on a VM that already had 35 RDM disks). The script that does the remounting runs at night, so the next day we decided to roll vCenter back to the previous day's backup, which was taken just before the patch was installed.

Problem

Some of our ESXi hosts started showing symptoms of being out of sync: all the stats became blank and no alarms were triggered, just two blue info messages. Trying to reconfigure HA would, however, trigger alarms.

Cannot synchronize host servername
Quick stats on servername is not up-to-date



The password for the vpxuser changes every 30 days and with many hosts in your vCenter it can potentially affect quite a few hosts depending on the time frame between the backup and the rollback. VMware has a list of things to consider when doing a restore, but the problem we experienced is not on the list.  

When you have 150 ESXi hosts in your vCenter it can be time consuming to manually go through each host to find which ones have been affected by the rollback. The current rotation interval can be read from vCenter with PowerCLI:
Get-AdvancedSetting -Entity ($DefaultVIServer).Name -Name VirtualCenter.VimPasswordExpirationInDays

Searching through the logs of one of the affected hosts revealed little to indicate that it was having problems.

Solution

By looking at the logs through Splunk we could find a log entry from vCenter that blew up after the restore:

Exception occurred during host sync; Got method fault

Now we could use Splunk to give us a list of the affected servers.

Then we could right-click each server in the vSphere Client and choose Disconnect and then Connect again.

Disconnect + Connect

After having reconnected the hosts things were working fine again and the ongoing error messages we had in Splunk stopped coming.

December 28, 2023

Jumpstart plugin late-filesystem activation failed

 Problem

When installing ESXi you sometimes get this error message after pressing F11 to accept the EULA:





Solution

The reason you're seeing this problem is that the Alt key has magically become stuck within your iDRAC/iLO/RSA session, so you're seeing tty11 instead of tty2. Pressing Alt+F2 brings you back to the EULA page. Press Alt by itself once, and F11 will then let you continue the installer.




April 1, 2023

CloudBuilder fails to deploy vCenter during initial deployment

Background 
When deploying VCF 4.5 you should be able to do that in an air gapped environment that has no access to the internet. In such cases you will need to get updates and such into the environment manually, but it's still a supported solution.


Problem
When running the initial bring up process, deployment will fail with the message: "vCenter installation failed. Check logs for more details.". The vcf-bringup.log file will tell you that the vCenter appliance was deployed and started, but that there was a problem with the time on this appliance.

The NTP parameters you specified in your spreadsheet have been populated correctly in /etc/ntp.conf on the Cloud Builder appliance, but the logs show that it's trying to connect to Google's NTP servers.


Workaround

The only solution we've found so far is to either impersonate Google's NTP entries in DNS or to open the firewall and let Cloud Builder communicate with these external servers. Cloud Builder is only used during bring up, so these workarounds can be reverted once the environment is up and running.

March 1, 2023

LPe12000 and other old Emulex cards are unsupported if you patch ESXi 7.0U3

 Background

In January 2020 Broadcom announced that a series of Emulex cards would soon go End Of Life. They have however worked fine in VMware ESXi until recently, including 7.0U3d. 

Problem

If you patch your ESXi 7 host with the latest patches the lpfc driver will be replaced by one that doesn't support these old cards and you will no longer see your FC LUNs (vmfs datastores & RDM disks). The driver will be upgraded from 14.0.169.25 to 14.0.543.0. We've also found that installing ESXi 7.0U3j comes with a non-working driver.
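To see which side of that driver change a host is on, the installed lpfc version can be compared against the two versions mentioned above. This is a sketch under assumptions: version_lt and lpfc_check are hypothetical helper names, the comparison relies on GNU sort -V (so it is meant for an admin workstation, not the ESXi shell), and treating 14.0.543.0 as the cut-off is based only on the two versions observed in this post.

```shell
#!/bin/bash
# First lpfc version we saw without support for the old cards (from this post).
# Versions between 14.0.169.25 and this one were not tested; that is an assumption.
FIRST_BROKEN="14.0.543.0"

# Return 0 when version $1 is strictly older than version $2.
# Uses GNU sort -V for dotted version ordering.
version_lt() {
    [[ "$1" != "$2" ]] &&
        [[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" == "$1" ]]
}

# Print a one-line verdict for an installed lpfc driver version.
lpfc_check() {
    if version_lt "$1" "$FIRST_BROKEN"; then
        echo "lpfc $1: older than $FIRST_BROKEN"
    else
        echo "lpfc $1: $FIRST_BROKEN or newer, legacy HBAs may disappear"
    fi
}

lpfc_check 14.0.169.25
lpfc_check 14.0.543.0
```

The installed version itself can be read on the host with esxcli software vib list and filtering for lpfc.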

Solution

Using supported hardware is always recommended. Swapping these old cards with newer ones would be optimal.

Workaround

Installing an older driver that still supports the old hardware is possible, and you will then see your LUNs again.

Detection

In order to identify where this problem will occur before patching I used the following PowerCLI script:

$vmhosts = Get-VMHost | Sort-Object

foreach ($vmhost in $vmhosts) {
    # Keep only HBAs whose model matches one of the affected cards
    $devices = Get-VMHostHba -VMHost $vmhost.Name |
        Where-Object { $_.Model -match "3530C|LPe1605|LPe12004|LPe12000|LPe12002|SN1000E" }

    foreach ($device in $devices) {
        Write-Output "$vmhost - $($device.Model) device with WWN $($device.PortWorldWideName)"
    }
}

This script will check the HBAs of all of your ESXi hosts and you'll get a listing similar to this:

Reflection
It's highly unusual for a device to become unsupported while patching a version of ESXi. As far as I can recall we have only seen devices being discontinued between major or minor versions of ESXi, not while installing non-critical patches.



May 4, 2022

Horizon Client 8.5 crashing on Linux

Background

After upgrading from version 8.4 the Horizon Client was unable to launch correctly. Launching it from the command line showed a segmentation fault:



I'm using Ubuntu 20.04 LTS, but other related distros may also be affected.


Solution

It turned out that Reddit user Zixyar had already found that you could solve this problem by editing the file /etc/pam.d/lightdm and uncommenting the line:

#session required pam_loginuid.so


After rebooting I was able to use the Horizon client 8.5 (2203-8.5.0-19586897) without problems.
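If you need to make the same change on several machines, the edit can be scripted. This is a minimal sketch: the function name enable_pam_loginuid is hypothetical, it takes the file path as an argument so it can be tried on a copy first, and it assumes the commented line looks like the one shown above.

```shell
#!/bin/bash
# Hypothetical helper: uncomment the "session required pam_loginuid.so" line
# in the given PAM file. A backup of the original is kept as <file>.bak.
enable_pam_loginuid() {
    local pam_file="$1"
    sed -i.bak -E \
        's/^#[[:space:]]*(session[[:space:]]+required[[:space:]]+pam_loginuid\.so)/\1/' \
        "$pam_file"
}

# Usage (as root), followed by a reboot as described above:
#   enable_pam_loginuid /etc/pam.d/lightdm
```

Lines that are already uncommented, or that don't match, are left untouched, so running it twice is harmless.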

May 17, 2021

Priority tagging of vSAN traffic

 Background

Cisco defines CoS as follows: "Class of Service (CoS) or Quality of Service (QoS) is a way to manage multiple traffic profiles over a network by giving certain types of traffic priority over others."

Note that there's also a similar technology called DSCP that can be used in more or less the same way.

When using a vSphere Distributed Switch it's possible to configure this and create fairly granular rules per Port Group. It's not at all limited to vSAN traffic even though that was our use case.

Task

I was asked by the networking guys if we could enable this functionality for vSAN traffic by setting COS=3.

Solution

Identify the port group associated with the vmkernel adapter used by vSAN, choose Edit Settings / Advanced and enable Traffic filtering and marking.


At configure level for the port group you will need to create the rule as outlined in the following steps:










Now that it was turned on it was instantly visible to the networking guys as they started seeing traffic within UC3 (Priority Group 3).
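For anyone wondering where the value 3 ends up on the wire: the CoS tag is the 3-bit priority field (PCP) in the 802.1Q VLAN tag, which sits in front of the 1-bit DEI flag and the 12-bit VLAN ID. The arithmetic can be sketched as follows. The function names are hypothetical and VLAN 100 is just an example value, not the VLAN we used for vSAN.

```shell
#!/bin/bash
# 802.1Q Tag Control Information (16 bits):
#   bits 15-13: priority (PCP, the CoS value)
#   bit  12   : drop eligible indicator (DEI)
#   bits 11-0 : VLAN ID
# vlan_tci <cos> <vlan-id> [dei]  prints the TCI as hex.
vlan_tci() {
    local pcp="$1" vid="$2" dei="${3:-0}"
    printf '0x%04x\n' $(( (pcp << 13) | (dei << 12) | vid ))
}

# cos_from_tci <tci>  prints the CoS value carried in a TCI.
cos_from_tci() {
    echo $(( ($1 >> 13) & 7 ))
}

vlan_tci 3 100        # CoS 3 on example VLAN 100, prints 0x6064
cos_from_tci 0x6064   # prints 3
```

This is also why the marking only shows up on tagged traffic: without a VLAN tag there is no field to carry the priority.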