November 21, 2015

Replacing a vSAN caching disk

Background
Replacing disks in vSAN can be less smooth than on some traditional storage arrays. For ordinary disks used for storage it's quite easy, but for disks used for caching it can be a different story. If a caching disk dies, you should remove it from the vSAN configuration before physically removing it from the server. Otherwise you will run into the problems described in this post.

Problem
Once the disk has been physically replaced, you will be unable to delete the old disk or its disk group from either the vSphere Web Client or RVC. The deletion fails because the disk can no longer be found. The disk will show up with a status of "Dead or Error" or "Absent" (depending on where you look).

"esxcli vsan storage list" will show all the other disks belonging to vSAN on that server, but not the missing SSD.
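Because the missing SSD is simply absent from that listing, one way to spot the affected disk group is to look for disks whose disk group name (which is normally the naa ID of the group's caching device) does not appear as a listed device. The following is a minimal sketch of that idea, not a supported tool. It assumes the listing uses indented "Key: Value" lines with the fields "Device" and "VSAN Disk Group Name"; the sample text and all naa IDs in it are made up for illustration, and the function names are hypothetical.

```python
# Illustrative sample only: trimmed, made-up output in the style of
# "esxcli vsan storage list". Real output contains more fields.
SAMPLE = """\
naa.5000c5000000aaa1
   Device: naa.5000c5000000aaa1
   Is SSD: false
   VSAN Disk Group Name: naa.50015170000000c1
naa.5000c5000000aaa2
   Device: naa.5000c5000000aaa2
   Is SSD: false
   VSAN Disk Group Name: naa.50015170000000c1
naa.50015170000000c2
   Device: naa.50015170000000c2
   Is SSD: true
   VSAN Disk Group Name: naa.50015170000000c2
naa.5000c5000000bbb1
   Device: naa.5000c5000000bbb1
   Is SSD: false
   VSAN Disk Group Name: naa.50015170000000c2
"""


def parse_vsan_storage_list(text):
    """Parse the listing into one dict per device (assumed layout)."""
    disks = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new device record.
            current = {"id": line.strip()}
            disks.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            current[key.strip()] = value.strip()
    return disks


def groups_missing_cache_disk(disks):
    """Map each disk group whose caching device is not listed to its member disks."""
    listed = {d.get("Device", d["id"]) for d in disks}
    orphaned = {}
    for d in disks:
        group = d.get("VSAN Disk Group Name")
        if group and group not in listed:
            orphaned.setdefault(group, []).append(d.get("Device", d["id"]))
    return orphaned


if __name__ == "__main__":
    for group, members in groups_missing_cache_disk(parse_vsan_storage_list(SAMPLE)).items():
        print(f"Disk group {group} has no listed caching disk; members: {', '.join(members)}")
```

On a real host you would feed the function the actual command output instead of SAMPLE, and still cross-check the result against what the Web Client shows before touching any disk.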

Listing the disks in RVC with the command vsan.host_info shows that the disk is in an Absent status:


Trying to use RVC with "vsan.host_wipe_vsan_disks -f" to remove the disk also fails:

Solution
The solution that worked in the end was to use partedUtil to remove the partitions from all spinning disks in the affected disk group. partedUtil is a very dangerous tool, so if you have multiple disk groups on your host (as we had), you must make sure you're working with the correct disks. We found it best to look up the naa IDs of the disks in the failed disk group in the Web Client.

After removing both partitions from every disk belonging to this disk group, the disk group was gone and we could create a new one using our new SSD together with all the spinning disks.
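To make the partition removal less error-prone, it can help to generate the commands first and review them before running anything. The sketch below only builds and prints command strings; it does not execute them. It assumes the disks are addressed under /vmfs/devices/disks/ and that the two vSAN partitions are numbered 1 and 2, which you should confirm from the "partedUtil getptbl" output for each disk. The helper name and the naa IDs are hypothetical.

```python
DISK_DIR = "/vmfs/devices/disks"


def build_partedutil_commands(naa_ids, partitions=(1, 2)):
    """Return partedUtil command strings for review; nothing is executed.

    partitions=(1, 2) is an assumption about the vSAN partition numbering.
    """
    commands = []
    for naa in naa_ids:
        if not naa.startswith("naa."):
            raise ValueError(f"not an naa ID: {naa!r}")
        device = f"{DISK_DIR}/{naa}"
        # Show the partition table first so the disk can be double-checked.
        commands.append(f"partedUtil getptbl {device}")
        for number in partitions:
            commands.append(f"partedUtil delete {device} {number}")
    return commands


if __name__ == "__main__":
    # Made-up IDs; use the naa IDs of the failed disk group from the Web Client.
    failed_group_disks = ["naa.5000c5000000aaa1", "naa.5000c5000000aaa2"]
    for command in build_partedutil_commands(failed_group_disks):
        print(command)
```

Printing instead of executing is deliberate: with several disk groups on one host, a manual review of each naa ID against the Web Client is the only real safeguard.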

Addendum
The official way to solve this problem is to remove the disk from the pool while it's still present in the server. In our case that was not possible. The SSD had for some unknown reason entered "Foreign mode", which is a Dell disk controller feature. We had to enter the PERC controller BIOS settings (from POST), clear the Foreign Config, and reconfigure the disk in the controller config before we could use it again. Because of this, the disk came up with a new naa ID even though it had not actually failed.


2 comments:

  1. Hi, this is Christian from the VSAN Engineering team. You don't need to use partedUtil. You can't delete the "Absent" disk, but you don't have to. In this case you know the drive is dead and won't return, and what you want to do is make sure you free up the capacity tier drives so that you can reuse them to build a new disk group with a new cache SSD. To do that, simply click on the capacity tier drives in the UI and hit the "Remove" button (note, if you are in "Automatic" claim mode at the cluster level then this button is hidden, so be sure to switch to Manual). This allows you to free up those drives and repurpose them, e.g. for a new disk group. Once all the capacity drives are freed up, the "Absent" SSD will automatically disappear. It was just shown in the UI so we could still represent the old disk group in a meaningful way.

    1. Thanks for your good input here, Christian. I think we already tried what you suggest, but got an error message(?). I no longer have a system with this problem, so I'm unable to verify it. Will try again next time.
