My team and I were tasked with a global vSphere upgrade on all of our ESXi hosts, hyper-converged systems and our vCenter. We took enough time to get the inventory, check all the hosts for compatibility and test the various upgrade paths. The upgrade will be rolled out in multiple steps due to personal resources (we’re a small team and currently, it’s summer holiday season) and also to avoid too much downtime. In this blog post, I’d like to share some personal experiences regarding the upgrade of our vCenter. It didn’t work as we’ve planned. But in the end, all worked fine. I’d like also to shoutout a big thank you to my team. You guys rock!
Before we dive deeply into the vCenter upgrade process, and what happened, I’d like to explain some steps first to better understand our approach and the upgrade process in general.
One of the milestones is (at the writing of this blog post already “was”) the upgrade of our vCenter. We’re using vCenter for our daily tasks like managing virtual workloads, deployment of new ESXi hosts, etc. But before we could upgrade our vCenter from 6.5 to 6.7, we had to do some host upgrades first. Our hyper-converged infrastructure was running 24/7 without getting much care, like care in the form of firmware upgrades. There was just not enough time to do maintenance tasks like this throughout the last few months or maybe years. Maybe some people also were just afraid of touching these systems, I don’t know for sure. The firmware was old but at least the hypervisor was on a 6.0 version and also in pretty good shape as well.
So we’ve scheduled various maintenance windows, planned the hyper-converged upgrades and made sure that we’ve downloaded everything from the manufacturer we need to succeed. The firmware upgrade went well on all hosts. One host had a full SEL log and that caused some error messages. No real issue at all, but some alerts in vCenter on that cluster we had to get rid of.
The firmware upgrade on one of the hyper-converged cluster took about 18 hours. That was expected, somehow, because the firmware was really old, and did not support higher ESXi versions that 6.0. But everything went well and we had no issues at all, expect the full SEL log which then has been cleared.
After that firmware upgrade, we were able to upgrade the ESXi version on all of the hyper-converged clusters to a 6.5 level. This was needed because of some plugins used to manage these hyper-converged systems. Ok, to let the cat out of the bag, we’re using Cisco HyperFlex and the plugin I’m talking about is that HX plugin. The version for ESXi 6.0 wasn’t supported in vCenter 6.7. That’s the reason we had to upgrade the HyperFlex systems first to ESXi 6.5.
As you know for sure, you can’t manage ESXi hosts later than 6.5 in vCenter 6.5. So we had to do a stop here for the moment, but we were now at least able to upgrade our vCenter. All other hosts were already on 6.0 since they were installed, so no issues upgrading to vCenter 6.7.
Oh, did I already mention that our vCenter doesn’t run on-premises but on a cloud provider? No, it’s not VMC on AWS, but some other IaaS provider. That didn’t make it easier.
But let’s dive into the main topic now, enough of explanation, let’s do the hard work now.
Recently, I had to add some hardware to an HPE ProLiant DL380 server. Ok, it wasn’t me because this server runs in a location about 3600 miles away from me. But the engineers on-site completed this task. The engineers added some memory and more disks to the server. My task was to add the newly installed disks to the existing ESXi datastore. This was (still is) a standalone ESXi server, as we say a black box server. It’s a standard HPE ProLiant server with local disks and an SD-Card as ESXi boot disk and it is centrally managed in a vCenter. Local IT persons have limited access to vCenter just to manage their workloads on that specific ESXi black box. There isn’t running much on these black boxes, most of all an SCCM distribution point because lacking enough bandwidth. But anyway, that’s not the topic here.
I want to show you what steps I’ve missed in the first attempt and how I’ve managed to fix it.
As there is not much running on this black box server, it was easy to schedule a maintenance window to shut down the workloads and also the ESXi server, so the engineers onsite were able to install the hardware (memory and disks). Through the iLO interface, I’ve started the server and accessed the Smart Storage Administrator, which is part of the Intelligent Provisioning tool kit on servers of Gen9 and later. It was easy to add the unassigned disk to the already existing RAID array. It took some hours to rebuild all the data because all data had to be redistributed over all disks, with parity and everything needed.
After the server was up and running again, I tried to increase the VMFS datastore capacity. It didn’t work as expected. I didn’t see any device nor LUN which I could extend. That made me curious.
Well then, back to the drawing board…
It wasn’t easy this time to schedule a maintenance window, but I’ve asked the responsible person if he could suggest one. In the meantime, I was digging through the internet to find out what’s wrong or what I’ve missed. I’ve found out that just adding the new disks to the existing RAID doesn’t solve that issue alone. I also had to expand the logical drive. That was the key! So ok, could this be done without another downtime? Thankfully yes!
But before we go deeper here, please, always take a backup of your workloads first. Just in case. Better safe than sorry!
It is indeed possible to expand the logical drive on an HPE ProLiant server without downtime. I’m talking here of a maybe easy, not so complex task. It’s not like I’m going to create new arrays or change the RAID mode. No, just expanding the logical drive.
First, I connected to that ESXi server with SSH to see if the HPE tools were installed. And they were. It’s highly recommended that you use the custom VMware ESXi ISO image to install your server when they come from a vendor like HPE or DELL. These images include all the necessary drivers for your hardware, like network or storage controller, and most of all, they include also some nifty tools as well.
In my case, I’m using the tool “hpssacli”. This tool is just the command line version of the Smart Storage Administrator (HP SSA CLI => Smart Storage Administrator CLI). Nice, isn’t it? 😉
Take me to the CLI, please!
I’ve needed only a few commands to get the things in order. Let’s go into it!
First, I’ve checked the logical drives on the controller:
/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld all show status
Mostly, the controller is installed in slot 0. I’m talking here about the P440ar which is on-board when I’m not wrong, so definitely on slot 0. With “ld all” it will display all logical drives configured on that controller.
Next, as I’ve got now the logical drive ID, I’ve checked the details for that logical drive:
/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 1 show
This gave me an overview of that logical drive and I saw that it was just half of the expected size because disks have been added here.
Just to make sure, I’ve checked the controller to see if all disk drives were assigned:
/opt/hp/hpssacli/bin/hpssacli ctrl slot=0 show config
The output showed me that all installed drives were assigned and that there were no unassigned drives. As this controller only had one logical drive, all disk drives were assigned there.
Ok, so then it should be possible to extend the logical drive. And it is, with the following command:
Some weeks ago Google released the newest version of their Chrome browser, version 69. I thought yeah, get it and update it. I really like Google Chrome because it supports all the websites I’m visiting often (or at least did, but more on that later) and it’s fast. You can also customize it with plugins like mouse gestures or so to customize it for your needs. What I didn’t know is that Google switched off the support for Flash Player as far as I knew it from the older versions. You aren’t able anymore to add websites to the allow or deny list in the Flash settings within the browser. Or at least you can’t add the websites directly on that list in the settings. There are some more clicks to do. Officially announced was the end of Flash Player by Adobe. By the year 2020, Flash Player won’t exist anymore, won’t be supported by Adobe nor by the most used browser software. But there’s a “but”. You can manually add specific websites to the allow or block list of Google Chrome, but not in the way you might know. But why the heck should I use Flash anyway? All my favorite websites are already HTML5 compatible and all stuff works without that crappy Flash plugin! But wait! Do you use the VMware vCenter browser client? Probably the Flex Client because you still have the need for it, like vSAN, Update Manager, or 3rd party plugins of different software and hardware vendors within vCenter? Then you’ll have the same issues as I had. The vCenter Flex Client (aka Flash client) obviously won’t work anymore without Flash. Yes, I know, you don’t need to use the Flex Client for vSAN or the Update Manager because in vCenter 6.7 it is finally available. But what about the 3rd party plugins? There is a lot of stuff out there you probably need, I don’t know your infrastructure. I can only compare with mine. But I can tell you, enabling Flash Player in Google Chrome, even in the most recenter version 69 is easier than you think. It’s just some steps and clicks, no rocket science!
Recently we had a weird issue in our office. We had one virtual machine with a snapshot. That by itself isn’t an issue, a snapshot is nothing special. But someone created that snapshot before a software upgrade and forget to delete it. So this snapshot was growing and growing. We found out that there is a snapshot when the VM or service owner requested some additional disk space. We weren’t able to add disk space because of that snapshot. So we scheduled a maintenance window to delete the snapshot. Faster said than done.
The VM went offline because of disk consolidation. That could happen, depending on snapshot size and storage system. But the VM not only went offline for some time, but unexpectedly for hours. Together with VMware support we were able to stop the snapshot deletion. The VM came back online but with the known “Disk Consolidation Needed status”. We found out that this snapshot was about 400 GB in size. What a bummer! So we scheduled another maintenance window to consolidate that snapshot. Unfortunately that didn’t work well. Consolidation timed at around 96%, not sure why. “Error communicating with the host” isn’t very helpful in that moment.
Some research and again having a chat with VMware support led us towards cloning the disk files. During the cloning of a disk file the snapshot will be consolidated. And as you’re doing a disk clone locally on the ESXi host with “vmkfstools” and not withing vCenter, there shouldn’t be a timeout either. So we had out action plan. And we scheduled another maintenance window with the service owner.
When I’m doing blog posts based on my home lab, like the following article about a failed vSAN cache disk, then I’m really talking about a home lab. Most of the hard- and software configuration in a home lab isn’t supported, neither from VMware nor from any hardware vendor. There might be parts in my lab, for example, the base servers (DELL PowerEdge) or the Smart Array controllers (HPE), which are listed on the VMware HCL. But for example not my SSD (Samsung and Crucial), or probably any combination of controller and SSDs and/or base server. There are so many people out there in the IT, having homelabs and trying out new hard- and software, testing things and learning. If we build such labs then with the only reason to learn and understand how a certain technology works. Not to do fingerpointing to any vendor. If we blow up our labs, then it’s mostly our own fault, like having cheap disks (my bad because I can’t afford shiny Optane nor datacenter disks) or we’ve screwed the configuration of something. I will never, and I repeat, I will never blame any vendor if my lab blows up because of my fault.
Recently I rebuilt my VMware home lab from scratch and with the most recent vSphere version available at this point. I planned to rebuild my lab a long time ago but because of my job and other things I really hadn’t the time to do that. But recently I had to rebuild my lab because I screwed up my vCenter. Yes, I screwed it. So what did that mean? Reinstall all and everything completely from scratch. All my physical ESXi hosts, domain controller, vCenter, Jumphost, and backup server. All these services are running on a standalone ESXi server with some local disks. This server is called my home base. I’ve got some more servers which are running my VMware vSAN environment. I reinstalled these too and reconfigured everything that was needed, like networking and storage.
This week one of my vSAN cluster nodes went into degraded mode because of one of the cache disks failed. I thought, easy, just replacing the cache disk and that’s it. But no, the struggle became real…
I’ve got three DELL servers for my vSAN cluster. All servers are equipped with one SSD as cache tier and three SSDs for the capacity tier. Now one cache disk failed because of reasons (I really don’t know why). That was causing vSAN to go into degraded mode as “failures to tolerate” was set to 1. So one failure (the failed cache disk) was compensated. Just for your information in case you didn’t know. If a cache disk of one disk group fails, the whole disk group will become unavailable. In my case, that meant that one-third of the whole vSAN capacity was gone.
What did I to resolve this?
My first idea was to replace the failed cache disk as I’ve got some identical disks as spare drives available. Well, not directly as spare drives, but installed and configured as RAID 5 in my home base ESXi host. So I did a Storage vMotion on all my home base VMs mentioned above to another local RAID 5 datastore, deleted the SSD RAID datastore and removed the disks. The physical replacement of this disks was easy. But telling my degraded vSAN node to accept this disk was a different topic.
Checking the disks
After I installed the “new” disk into the vSAN node I did a rescan on all storage adapters. And there was nothing. Only the already existing capacity disks but no cache disk. So I tried the second and the third identical disk with the same result. Only the capacity disks were visible in vCenter on the host but not the cache disk. What’s wrong here? I knew that the ESXi server only shows empty disks without any volumes, file systems or data on it. But how should I wipe this disk when not even with esxcli the disk is not visible?
As I’m using HPE Smart HBA H240 as my storage controller in the DELL server, I already installed the HPE smart storage administrator CLI tool on all the vSAN nodes. So I was able to look into the storage controller to see what’s happening there (or probably not).
The following command showed me that all disks are here and are fine:
./ssacli ctrl slot=2 pd all show status
But I was still struggling. Why is vCenter still showing only the capacity disks?
A look into the HPE smart storage administrator CLI tool again showed me that still all physical disks are here. A rescan on all HBA in vCenter on this particular host didn’t help, only the capacity disks were shown.
I looked a little deeper into the storage controller with the command:
./ssacli ctrl slot=2 pd all show detail
That showed something not completely unexpected:
Masked from HBA: The drive contains controller configuration data and has been disabled
in order to protect the configuration data. Please run the "modify clearconfigdata"
command on the drive to re-enable it.
This physical drive above was the cache disk I was missing in vCenter. OK, so let’s clear the “configdata” and let’s see what happens then:
I checked again with “all show detail” and this “modify clearconfigdata” was gone.
Now I was able to rescan all storage adapters in vCenter on this host and that brought back my missed cache disk:
But that was to easy…
After having my cache disk back I went into vSAN configuration in vCenter and claimed the disks. The small one for the cache tier, the bigger ones for the capacity tier. And boom! This particular disk group went into another network partition group. Well done, thank you for nothing!
When you search around the internet for vSAN network partition you will find many forum and blog posts mentioning that this happens if something with the network configuration wasn’t as good as it should be. In my case I checked everything and I changed nothing on the network. So this partitioning issue had another reason. But to be honest I didn’t try to solve that. I wasn’t in the mood for that. I only wanted to bring back my vSAN into a good and healthy state.
I removed this vSAN node from the cluster by just draggin and dropping it out of the cluster. Then I tried to remove it from the inventory. And another boom!
The resource 'eagle.lan.driftar.ch' is in use.
That was the error message in vCenter when I tried to remove the host from inventory. But why? The host is in maintenance mode! Dang it! Let me remove it!
After doing some research on the interwebs I checked also the tasks in vCenter if there is a bit more of information. And I’ve found something:
Cannot remove the host eagle.lan.driftar.ch because it's part of VDS vMotion-DSwitch vSAN-DSwitch.
Well, that’s true. And that was also the obvious reason why I can’t remove the host from inventory. So I had to reconfigure the host networking, putting back the VMKernel ports for vMotion and vSAN to their origin local virtual switches. After that I was finally able to remove the host from the inventory.
Now rebuilding vSAN…
The next steps were easy. I added the host back to the vSAN cluster and configured the VDS for vMotion and vSAN as they were before. Then I went into vSAN configuration and checked the disk group. Lucky me the disk group configured before was still there and healthy, and vSAN claimed it automatically. And no network partitioning this time! All hosts and disk groups in the same network partition group!
After retesting the health onf the vSAN cluster it showed that there is one component in need of a resync. One of my templates was partially on this disk group before failing and is now waiting until the resync completed.
But at least vSAN is working fine again!
In the most cases, or probably in all cases, replacing a disk in vSAN should be easy. Usually you will replace a used disk against a new and empty disk other than me. But that doesn’t mean you can’t unless you know what to do. I’m glad if this blog post helped you solving the issue.
If you follow the steps described in the VMware Knowledge Base then you should be fine: