December 28, 2023

Jumpstart plugin late-filesystem activation failed

Problem

When installing ESXi you sometimes get the error message "Jumpstart plugin late-filesystem activation failed" after pressing F11 to accept the EULA.

Solution

The reason you're seeing this problem is that the Alt key has become stuck within your iDRAC/iLO/RSA session, so you're looking at tty11 instead of tty2. Pressing Alt+F2 brings you back to the EULA page. Press Alt once by itself to release it; pressing F11 will then let you continue the installation.




April 1, 2023

CloudBuilder fails to deploy vCenter during initial deployment

Background 
When deploying VCF 4.5 you should be able to do so in an air-gapped environment with no access to the internet. In such cases you will need to bring updates and other content into the environment manually, but it's still a supported solution.


Problem
When running the initial bring-up process, deployment will fail with the message "vCenter installation failed. Check logs for more details.". The vcf-bringup.log file will tell you that the vCenter appliance was deployed and started, but that there was a problem with the time on this appliance.

The NTP parameters you specified in your spreadsheet have been populated correctly in /etc/ntp.conf on the Cloud Builder appliance, but the logs show that it's trying to connect to the Google NTP servers.


Workaround

The only solution we've found so far is to either impersonate Google's NTP entries in DNS or to open the firewall and let Cloud Builder communicate with these external servers. Cloud Builder is only used during bring-up, so these workarounds can be reverted once the environment is up and running.
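
If you go the DNS route and run a Windows DNS server, a minimal sketch could look like the following. The zone name and the internal NTP address are assumptions; match them to the server names that actually show up in vcf-bringup.log.

# Create an authoritative zone that shadows one of Google's NTP names
Add-DnsServerPrimaryZone -Name "time.google.com" -ReplicationScope "Forest"
# Point the zone apex at an internal NTP server (10.0.0.10 is a placeholder)
Add-DnsServerResourceRecordA -ZoneName "time.google.com" -Name "@" -IPv4Address "10.0.0.10"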

March 1, 2023

LPe12000 and other old Emulex cards are unsupported if you patch ESXi 7.0U3

Background

In January 2020 Broadcom announced that a series of Emulex cards would soon go End of Life. They have, however, worked fine in VMware ESXi until recently, including 7.0U3d.

Problem

If you patch your ESXi 7 host with the latest patches, the lpfc driver will be replaced by one that doesn't support these old cards, and you will no longer see your FC LUNs (VMFS datastores & RDM disks). The driver is upgraded from 14.0.169.25 to 14.0.543.0. We've also found that a fresh install of ESXi 7.0U3j comes with a non-working driver.
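
To check which lpfc driver version a host is currently running, something like this PowerCLI snippet should do (the host name is a placeholder):

# Query the installed lpfc VIB through the esxcli v2 interface
$esxcli = Get-EsxCli -VMHost (Get-VMHost "esx01.example.com") -V2
$esxcli.software.vib.list.Invoke() | Where-Object { $_.Name -eq "lpfc" } | Select-Object Name, Version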

Solution

Using supported hardware is always recommended. Swapping these old cards for newer ones would be optimal.

Workaround

Installing an older version of the driver that still supports this hardware is possible, and you will then see your LUNs again.
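
If you go this route, the offline bundle can be installed through the esxcli v2 interface from PowerCLI. A sketch, assuming the bundle has been uploaded to a datastore (the host name and the path are placeholders):

$vmhost = Get-VMHost "esx01.example.com"
# Enter maintenance mode before replacing the driver
$vmhost | Set-VMHost -State Maintenance
$esxcli = Get-EsxCli -VMHost $vmhost -V2
$arguments = $esxcli.software.vib.install.CreateArgs()
$arguments.depot = "/vmfs/volumes/datastore1/lpfc-offline-bundle.zip"
$esxcli.software.vib.install.Invoke($arguments)
# A reboot is required for the driver change to take effect
$vmhost | Restart-VMHost -Confirm:$false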

Detection

In order to identify where this problem will occur before patching I used the following PowerCLI script:

$vmhosts = Get-VMHost | Sort-Object

foreach ($vmhost in $vmhosts) {
    # Match the affected Emulex/OEM-branded models on each host's HBAs
    $devices = Get-VMHostHba -VMHost $vmhost.Name | Where-Object { $_.Model -match "3530C|LPe1605|LPe12004|LPe12000|LPe12002|SN1000E" }
    foreach ($device in $devices) {
        Write-Output "$vmhost - $($device.Model) device with WWN $($device.PortWorldWideName)"
    }
}

This script checks the HBAs on all of your ESXi hosts and produces a listing of any affected devices.

Reflection
It's highly unusual for a device to become unsupported while patching a version of ESXi. As far as I can recall, we have only seen devices being discontinued between major or minor versions of ESXi, not while installing non-critical patches.



May 4, 2022

Horizon Client 8.5 crashing on Linux

Background

After upgrading from version 8.4, the Horizon Client was unable to launch correctly. Launching it from the command line ended with a segmentation fault.

I'm using Ubuntu 20.04 LTS, but other related distros may also be affected.


Solution

It turned out that Reddit user Zixyar had already found that you could solve this problem by editing the file /etc/pam.d/lightdm and uncommenting the line:

#session required pam_loginuid.so


After rebooting I was able to use the Horizon client 8.5 (2203-8.5.0-19586897) without problems.

May 17, 2021

Priority tagging of vSAN traffic

Background

According to Cisco, "Class of Service (CoS) or Quality of Service (QoS) is a way to manage multiple traffic profiles over a network by giving certain types of traffic priority over others."

Note that there's also a similar technology called DSCP that can be used in more or less the same way.

When using a vSphere Distributed Switch it's possible to configure this and create fairly granular rules per Port Group. It's not at all limited to vSAN traffic even though that was our use case.

Task

I was asked by the networking guys if we could enable this functionality for vSAN traffic by setting CoS=3.

Solution

Identify the port group associated with the VMkernel adapter used by vSAN, choose Edit Settings / Advanced, and enable Traffic filtering and marking.


At the Configure level for the port group you will need to create the rule itself: a new traffic rule with the action set to Tag, the CoS value enabled and set to 3, and a System traffic qualifier matching vSAN traffic.

Once the rule was enabled it was instantly visible to the networking guys, who started seeing traffic within UC3 (Priority Group 3).
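
If you want to verify from PowerCLI that the rule is in place, a read-only sketch like this should work (the port group name is a placeholder):

# Dump any traffic rules configured on the port group
$pg = Get-VDPortgroup -Name "vSAN-PG"
$pg.ExtensionData.Config.DefaultPortConfig.FilterPolicy.FilterConfig | ForEach-Object { $_.TrafficRuleset.Rules }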


April 19, 2021

Autoinstall physical NSX Edge with custom passwords

Background

Setting up NSX Edge automatically with a custom password is a good idea because, by default, you get a standard password that must be changed at first login. If you're planning on using an extra strong password, setting it through iDRAC (or similar) can be a bit awkward. And if you're using a non-English keyboard layout (like me), hitting the correct special characters can be even less trivial.

Problem

1. We had a problem getting the physical Dell R640 server with Mellanox 25GbE NICs to boot from PXE. It would say: "Booting from PXE Device 1: Integrated NIC 1 Port 1 Partition 1 Downloading NBP file... NBP File downloaded successfully. Boot: Failed PXE Device 1: Integrated NIC 1 Port 1 Partition 1 No boot device available or Operating system detected. Please ensure a compatible bootable media is available."



2. VMware has provided us with a nice 19-step document that guides us through everything we need to set up. The optional step 16, setting a non-default password, is however a bit misleading (probably referring to an older version of NSX?) and doesn't quite work.

Solution

1. In order to get the physical server to PXE boot we had to change the boot mode from UEFI to BIOS.

2. I had a case open for months without a resolution. In the end I started studying the Debian manuals (the NSX Edge installer is based on Debian) and eventually found a working solution. It turned out that adding the following commands to preseed.cfg, right after the "d-i passwd/root..." line, gave a working config:

d-i preseed/late_command       string \
        in-target usermod --password 'insert non escaped password hash here' root;\
        in-target usermod --password 'non escaped password hash' admin
You will need to create the password hashes using mkpasswd -m sha-512 as described in the original 19-step document; the command outputs a hash starting with $6$ that you paste in unescaped.



April 15, 2021

vSAN critical alert regarding a potential data inconsistency and maintenance mode problems after upgrade to 7.0U1

Background

Versions involved: 

VMware ESXi, 7.0.1, 17325551, DEL-ESXi-701_17325551-A01

vCenter 7.0U1 Build 17491160

vCenter and the ESXi hosts were upgraded from 6.7U3 to 7.0U1c, and the vSAN disk format was upgraded to version 13.

Problem

After upgrading many clusters from 6.7U3 to 7.0U1c and moving the vSAN disk format to version 13, we experienced a health warning.

The error message in Skyline Health was "vSAN critical alert regarding a potential data inconsistency"


For almost all clusters this error would fix itself within 60 minutes after the upgrade (typically in a much shorter time).

For one of our clusters, however, the error stuck, and we were unable to put any hosts in this cluster into maintenance mode.

Trying to put a host into maintenance mode would fail after one hour. Before failing it would stall at a high percentage, between 80 and even 100%, with the message "Objects Evacuated xxx of yyy. Data Evacuated xxx MB of yyy MB".

It's worth mentioning that this cluster had an active Horizon environment running during the upgrade, and we suspect that its constant creation and removal of VMs contributed to this problem.



Solution

We found a KB article with a similar error message, even though we hadn't changed the storage policy of any VMs for a long time (but Horizon might have done something like that behind the scenes): https://kb.vmware.com/s/article/82383

The article states this is a rare issue, but we found a Korean page referring to this same issue. The VMware KB article has a Python script that you will need to run on each host involved. After running the script we were able to put hosts into maintenance mode and do 7.x single image patching.

The script changes the VSAN.DeltaComponent advanced setting, so we asked VMware support if it was a good idea to leave it changed. Their response was: "Yes, if you want the DeltaComponent functionality going forward then please change it back to 1. The delta component makes a temporary component when there are maintenance mode issues."

Because of this we decided to change the value back, and wrote a PowerShell script instead of running another Python script on each host:

param (
    # Prompt for the cluster if no name was passed in
    [string]$clustername = $( Read-Host "Enter cluster name" )
)

# Set VSAN.DeltaComponent back to 1 on every host in the cluster
Get-Cluster $clustername | Get-VMHost |
    Get-AdvancedSetting -Name "VSAN.DeltaComponent" |
    Set-AdvancedSetting -Value 1 -Confirm:$false
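
Afterwards you can verify that every host reports the expected value (the cluster name is a placeholder):

# List the current VSAN.DeltaComponent value per host
Get-Cluster "MyCluster" | Get-VMHost | Get-AdvancedSetting -Name "VSAN.DeltaComponent" | Select-Object Entity, Value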

As we've only found a single article on this issue (in Korean), I guess it is indeed quite rare, but if it happens again we now know what to do.