April 15, 2021

vSAN critical alert regarding a potential data inconsistency and maintenance mode problems after upgrade to 7.0U1

Background

Versions involved: 

VMware ESXi, 7.0.1, 17325551,  DEL-ESXi-701_17325551-A01

vCenter 7.0U1 Build 17491160

vCenter and the ESXi hosts were upgraded from 6.7U3 to 7.0U1c, and the vSAN disk format was upgraded to version 13.

Problem

After upgrading many clusters from 6.7U3 to 7.0U1c and upgrading the vSAN disk format to version 13, we got a health warning on the upgraded clusters.

The error message in Skyline Health was "vSAN critical alert regarding a potential data inconsistency".


For almost all clusters this error cleared by itself within 60 minutes of the upgrade (typically much sooner).

On one of our clusters, however, the error persisted, and we were unable to put any host in that cluster into maintenance mode.

Trying to put a host into maintenance mode would fail after 1 hour. Before failing, the task would stall at a high percentage, anywhere from 80% up to 100%, with the message "Objects Evacuated xxx of yyy. Data Evacuated xxx MB of yyy MB".

It's worth mentioning that this cluster had an active Horizon environment running during the upgrade, and we suspect that its constant creation and removal of VMs contributed to this problem.



Solution

We found a KB article with a similar error message, even though we hadn't changed the storage policy of any VMs for a long time (but Horizon might have done something like that behind the scenes): https://kb.vmware.com/s/article/82383

The article states that this is a rare issue, but we did find a Korean page describing the same problem. The VMware KB article includes a Python script that you need to run on each host involved. After running the Python script we were able to put hosts into maintenance mode and do 7.x single image patching.

We asked VMware support whether it was a good idea that we had changed this setting (VSAN.DeltaComponent), and their response was: "Yes, if you want the DeltaComponent functionality going forward then please change it back to 1. The delta component makes a temporary component when there are maintenance mode issues."

Because of this we decided to change the value back to 1. Instead of running a Python script on each host, we wrote a PowerShell (PowerCLI) script that does it for every host in a cluster:

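# Requires VMware PowerCLI and an active Connect-VIServer session.
param (
    [string]$ClusterName = $( Read-Host "Enter cluster name" )
)

# Set VSAN.DeltaComponent back to 1 on every host in the cluster.
Get-Cluster -Name $ClusterName | Get-VMHost | Get-AdvancedSetting -Name "VSAN.DeltaComponent" | Set-AdvancedSetting -Value 1 -Confirm:$false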
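To confirm that the change actually landed on every host, something like the following read-only check could be used. This is only a minimal sketch: the function name Test-DeltaComponent and the "Host" and "AsExpected" column names are made up for illustration, and it assumes PowerCLI is loaded and you are already connected with Connect-VIServer.

```powershell
# Read-only check of VSAN.DeltaComponent on all hosts in a cluster.
# Test-DeltaComponent is a hypothetical helper name, not a PowerCLI cmdlet.
# Assumes VMware PowerCLI is loaded and Connect-VIServer has been run.
function Test-DeltaComponent {
    param (
        [Parameter(Mandatory = $true)]
        [string]$ClusterName,

        # Value every host is expected to have (1, per VMware support's reply)
        [int]$Expected = 1
    )

    Get-Cluster -Name $ClusterName |
        Get-VMHost |
        Get-AdvancedSetting -Name "VSAN.DeltaComponent" |
        Select-Object @{ Name = "Host"; Expression = { $_.Entity.Name } },
                      Value,
                      @{ Name = "AsExpected"; Expression = { [int]$_.Value -eq $Expected } }
}

# Example (cluster name is a placeholder):
# Test-DeltaComponent -ClusterName "MyCluster" | Format-Table -AutoSize
```

Running it before and after the change shows at a glance which hosts still have a different value, without modifying anything.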

As we've only found a single article on this issue (in Korean) I guess this issue is indeed quite rare, but if it happens again we now know what to do.

