VMware vSAN cache disk failed and how to recover from it

Recently I rebuilt my VMware homelab from scratch, using the most recent vSphere version available at this point. I had planned to rebuild my lab for a long time, but because of my job and other things I really didn’t find the time. Then I had to rebuild it anyway because I screwed up my vCenter. Yes, I screwed it up. So what did that mean? Reinstalling absolutely everything from scratch: all my physical ESXi hosts, the domain controller, vCenter, the jump host and the backup server. All these services run on a standalone ESXi server with some local disks, which I call my homebase. I’ve got some more servers which are running my VMware vSAN environment. I reinstalled these too and reconfigured everything that was needed, like networking and storage.

This week one of my vSAN cluster nodes went into degraded mode because one of the cache disks failed. I thought it would be easy: just replace the cache disk and that’s it. But no, the struggle became real…

What happened?

I’ve got three DELL servers for my vSAN cluster. All servers are equipped with one SSD for the cache tier and three SSDs for the capacity tier. Now one cache disk failed for reasons unknown (I really don’t know why). That caused vSAN to go into degraded mode, as “failures to tolerate” was set to 1, so this one failure (the failed cache disk) could be compensated. Just for your information, in case you didn’t know: if the cache disk of a disk group fails, the whole disk group becomes unavailable. In my case that meant that one third of the whole vSAN capacity was gone.
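
By the way, if you want to check from the ESXi shell which disks belong to which disk group and whether they are still claimed by vSAN, you can list them directly on the affected host. This is a read-only check, nothing gets changed here:

esxcli vsan storage list

Every disk is listed with its device name, whether it belongs to the cache or the capacity tier, and the UUID of the disk group it is part of.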

What did I do to resolve this?

My first idea was to replace the failed cache disk, as I’ve got some identical disks available as spares. Well, not directly as spare drives, but installed and configured as a RAID 5 in my homebase ESXi host. So I did a Storage vMotion of all the homebase VMs mentioned above to another local RAID 5 datastore, deleted the SSD RAID datastore and removed the disks. The physical replacement of the disk was easy. But telling my degraded vSAN node to accept this disk was a different topic.

Checking the disks

After I installed the “new” disk into the vSAN node, I did a rescan on all storage adapters. And there was nothing. Only the already existing capacity disks, but no cache disk. So I tried the second and the third identical disk, with the same result: only the capacity disks were visible in vCenter on the host, but not the cache disk. What was wrong here? I knew that ESXi only shows empty disks without any volumes, file systems or data on them. But how should I wipe this disk when it wasn’t even visible with esxcli?
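
By the way, the rescan can also be triggered from the ESXi shell, and the device list shows what the host actually sees at that point. Your device names will of course differ from mine:

esxcli storage core adapter rescan --all

esxcli storage core device list | grep -i "Display Name"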

As I’m using an HPE Smart HBA H240 as the storage controller in the DELL servers, I had already installed the HPE Smart Storage Administrator CLI tool on all the vSAN nodes. So I was able to look at the storage controller to see what was happening there (or probably not).

The following command showed me that all disks were present and fine:

./ssacli ctrl slot=2 pd all show status

But I was still struggling. Why was vCenter still showing only the capacity disks?

Clearing the disk(s)

An article by Cormac Hogan showed me how to reclaim disks for other uses. So I deleted all the partitions on the existing capacity disks, hoping that the cache disk would then also come back online. I had read on another blog that wiping all vSAN disks can bring back non-detected disks. But that didn’t help.

First I removed the vSAN node from the vSAN cluster:

esxcli vsan cluster leave
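
Just to make sure the node really left the cluster, you can query the local vSAN cluster state. After leaving, the host should no longer report itself as a cluster member:

esxcli vsan cluster get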

Next I checked with partedUtil how many and what kind of partitions were on the disks:

partedUtil get /vmfs/devices/disks/mpx.vmhba1:C2:T2:L0

Each capacity disk showed two partitions, so I wiped them all:

partedUtil delete /vmfs/devices/disks/mpx.vmhba1:C2:T2:L0 1

partedUtil delete /vmfs/devices/disks/mpx.vmhba1:C2:T2:L0 2
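
If you have several capacity disks to wipe, a small loop saves some typing. This is only a sketch with example device names from my lab, so double-check the device paths with partedUtil getptbl before deleting anything, because a wrong device name will wipe the wrong disk:

# example device names, replace them with your own capacity disks
for disk in mpx.vmhba1:C2:T2:L0 mpx.vmhba1:C2:T3:L0 mpx.vmhba1:C2:T4:L0; do
  # show the partition table first, then delete both partitions
  partedUtil getptbl /vmfs/devices/disks/$disk
  partedUtil delete /vmfs/devices/disks/$disk 1
  partedUtil delete /vmfs/devices/disks/$disk 2
done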

A look into the HPE Smart Storage Administrator CLI tool again showed me that all physical disks were still there. A rescan of all HBAs in vCenter on this particular host didn’t help; only the capacity disks were shown.

I looked a little deeper into the storage controller with the command:

./ssacli ctrl slot=2 pd all show detail

That showed something not completely unexpected:

physicaldrive 2I:0:1

Masked from HBA: The drive contains controller configuration data and has been disabled
in order to protect the configuration data. Please run the "modify clearconfigdata"
command on the drive to re-enable it.

The physical drive above was the cache disk I was missing in vCenter. OK, so let’s clear the “configdata” and see what happens:

./ssacli ctrl slot=2 pd 2I:0:1 modify clearconfigdata

I checked again with “all show detail” and the message asking to run “modify clearconfigdata” was gone.
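
You can also check just that one drive instead of all of them. The drive ID 2I:0:1 and the controller slot are of course specific to my setup:

./ssacli ctrl slot=2 pd 2I:0:1 show detail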

Now I was able to rescan all storage adapters in vCenter on this host, and that brought back my missing cache disk.

But that was too easy…

After getting my cache disk back, I went into the vSAN configuration in vCenter and claimed the disks: the small one for the cache tier, the bigger ones for the capacity tier. And boom! This particular disk group went into another network partition group. Well done, thank you for nothing!

When you search the internet for vSAN network partitions, you will find many forum and blog posts mentioning that this happens when something in the network configuration isn’t as it should be. In my case I checked everything, and I hadn’t changed anything on the network, so this partitioning issue had another cause. But to be honest, I didn’t try to solve it. I wasn’t in the mood for that. I only wanted to bring my vSAN back into a good and healthy state.
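
If you want to see the partitioning from the shell, every host reports which cluster members it can currently see. Running the following command on each node and comparing the sub-cluster member lists shows quickly who is partitioned from whom:

esxcli vsan cluster get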

I removed this vSAN node from the cluster by just dragging and dropping it out of the cluster. Then I tried to remove it from the inventory. And another boom!

The resource 'eagle.lan.driftar.ch' is in use.

That was the error message in vCenter when I tried to remove the host from inventory. But why? The host is in maintenance mode! Dang it! Let me remove it!

After doing some research on the interwebs, I also checked the tasks in vCenter to see if there was a bit more information. And I found something:

Cannot remove the host eagle.lan.driftar.ch because it's part of VDS vMotion-DSwitch vSAN-DSwitch.

Well, that’s true. And that was also the obvious reason why I couldn’t remove the host from the inventory. So I had to reconfigure the host networking, moving the VMkernel ports for vMotion and vSAN back to their original local virtual switches. After that I was finally able to remove the host from the inventory.
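
While moving the VMkernel ports around, it helps to check from the shell which vmk port sits on which port group. Again, this is only a read-only check:

esxcli network ip interface list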

Now rebuilding vSAN…

The next steps were easy. I added the host back to the vSAN cluster and configured the VDS for vMotion and vSAN as they were before. Then I went into the vSAN configuration and checked the disk group. Lucky me, the disk group configured before was still there and healthy, and vSAN claimed it automatically. And no network partitioning this time! All hosts and disk groups were in the same network partition group!
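
To make sure the vSAN traffic really runs over the intended VMkernel port after the reconfiguration, you can ask the host directly which vmk interface is tagged for vSAN:

esxcli vsan network list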

Retesting the health of the vSAN cluster showed that there was one component in need of a resync. One of my templates was partially stored on this disk group before it failed and now had to wait until the resync completed.
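
If you want to follow the resync from the shell, newer vSAN versions (6.6 and later, if I remember correctly) offer a debug namespace for that. I only used it as a read-only check:

esxcli vsan debug resync summary get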

But at least vSAN is working fine again!

Closing words

In most cases, probably even in all cases, replacing a disk in vSAN should be easy. Usually you will replace the failed disk with a brand-new, empty one, instead of reusing an already used disk like I did. But that doesn’t mean it can’t be done, as long as you know what to do. I’m glad if this blog post helped you solve the issue.

If you follow the steps described in the VMware Knowledge Base then you should be fine: