Replacing disks in vSAN could be a bit less smooth than some of the traditional Storage Arrays. For normal disks used for storage it's quite easy, but disks used for caching it can be a slightly different story. If you get a dead caching disk you should remove it from the config before removing it physically from the server. Otherwise you will get the problems described in this posting.
Once the disk has been replaced you will be unable to delete the disk or the disk group both from the vSphere Web client and RVC. The reason this fails is that it can't find the disk. The disk will show up with a status of "Dead or Error" or "Absent" (depending on where you look)
"esxcli vsan storage list" will show all the other disks belonging to vsan on that server, but not the missing SSD disk.
Listing out the disks in RVC with the command vsan-host_info shows that the disk is in an Absent status:
Trying to use RVC with "vsan.host_wipe_vsan_disks -f" to remove the disk also fails:
A solution that did work in the end was to use partedUtil to remove the partitions of all spinning disks of this disk group. partedUtil is a very dangerous tool so if you have multiple disk groups on your host (like we had) you must make sure you're working with the correct disks. We found it best to locate the naa IDs of the failed disk group from the web client.
After removing both partitions of all the disks belonging to this disk group, the disk group was gone and we could create a new one where we were able to use our new SSD disk and all the spinning ones.
The official way to solve thisproblem is to remove the disk from the pool while it's still present in the server. In our case that was not possible. The SSD disk had for some unknown reason entered "Foreign mode", which is a Dell disk controller feature. We had to enter the Perc controller BIOS settings (from POST), clear the Foreign Config and we also had to configure the disk in the controller config in order to use it again. Because of these things the disk came up with a new naa ID even though we didn't really have a failed disk.