My team and I were tasked with a global vSphere upgrade on all of our ESXi hosts, hyper-converged systems and our vCenter. We took enough time to build the inventory, check all the hosts for compatibility and test the various upgrade paths. The upgrade will be rolled out in multiple steps because of limited personnel resources (we’re a small team and it’s currently summer holiday season) and also to avoid too much downtime. In this blog post, I’d like to share some personal experiences regarding the upgrade of our vCenter. It didn’t go as we had planned, but in the end, everything worked out fine. I’d also like to give a big shout-out to my team. You guys rock!
Foreword
Before we dive deeply into the vCenter upgrade process and what happened, I’d like to explain some steps first, so that you can better understand our approach and the upgrade process in general.
One of the milestones is (at the time of writing, already “was”) the upgrade of our vCenter. We use vCenter for our daily tasks like managing virtual workloads, deploying new ESXi hosts, etc. But before we could upgrade our vCenter from 6.5 to 6.7, we had to do some host upgrades first. Our hyper-converged infrastructure had been running 24/7 without getting much care, for example in the form of firmware upgrades. There was just not enough time for maintenance tasks like this over the last few months, or maybe years. Maybe some people were also just afraid of touching these systems, I don’t know for sure. The firmware was old, but at least the hypervisor was on a 6.0 version and in pretty good shape.
So we scheduled various maintenance windows, planned the hyper-converged upgrades and made sure that we had downloaded everything we needed from the manufacturer. The firmware upgrade went well on all hosts. One host had a full SEL (system event log), which caused some error messages. No real issue, but it left some alerts in vCenter on that cluster that we had to get rid of.
The firmware upgrade on one of the hyper-converged clusters took about 18 hours. That was somewhat expected, because the firmware was really old and did not support ESXi versions higher than 6.0. But everything went well and we had no issues at all, except for the full SEL, which has since been cleared.
After that firmware upgrade, we were able to upgrade the ESXi version on all of the hyper-converged clusters to 6.5. This was needed because of a plugin used to manage these hyper-converged systems. OK, to let the cat out of the bag: we’re using Cisco HyperFlex, and the plugin I’m talking about is the HX plugin. The version for ESXi 6.0 wasn’t supported in vCenter 6.7. That’s the reason we had to upgrade the HyperFlex systems to ESXi 6.5 first.
As you surely know, you can’t manage ESXi hosts later than 6.5 in vCenter 6.5. So we had to stop here for the moment, but we were now at least able to upgrade our vCenter. All other hosts had been on 6.0 since they were installed, so there were no issues with upgrading to vCenter 6.7.
Oh, did I already mention that our vCenter doesn’t run on-premises but on a cloud provider? No, it’s not VMC on AWS, but some other IaaS provider. That didn’t make it easier.
But enough explanation. Let’s dive into the main topic and do the hard work.
vCenter Upgrade process briefly explained
When you start the vCenter upgrade process, it runs in two stages. Before the first stage starts, the installer gathers all the needed information (source and target, hosts, vCenter, credentials, etc.) and runs some checks against the old vCenter. In the first stage, a new VCSA is deployed with a temporary network configuration.
In the second stage, the data from the old vCenter (configuration, historical, and performance metrics data, depending on your choice) is exported and imported into the new VCSA. The new VCSA is then configured exactly like the old one, and the old VCSA is shut down.
That’s pretty much it. In my personal experience, this usually takes about an hour, but it depends on the amount of data you’d like to transfer and on host and storage performance.
Important notice
Before you upgrade your vCenter, make sure that you have backed it up, preferably two or three times! vCenter has a native backup function with which you can back up at least the configuration. It’s also good to have a bootable backup of vCenter, like the ones you can create with Veeam Backup & Replication. A snapshot is a good idea, too. But please remember, snapshots are not backups! They might help you to quickly go back to the previous state in case something doesn’t go as planned.
vCenter Upgrade with the GUI installer
Usually, you can do a vCenter upgrade easily with the GUI installer. Download and mount the ISO file, start the installer and follow the instructions. Easy as pie. Usually. But not with an IaaS provider where you don’t have full administrative privileges. I started to scratch my head. How could we upgrade our vCenter?
In theory, there is no need for full administrative privileges, because the new appliance is nothing more than an OVA file which is deployed with ovftool during the upgrade process. But the GUI installer does some checks, and the upgrade will fail (or rather, it won’t even start) because of the lacking administrative privileges.
I asked the vCommunity on Twitter if someone had experience in upgrading a vCenter Server Appliance that is hosted on VMC on AWS and how they managed it. Unfortunately, there was no answer…
But I found a recent blog post by William Lam, Staff Solutions Architect for VMware Cloud on AWS. He attended a VMC on AWS summit and was asked whether it is possible to deploy a VCSA on VMC on AWS.
That’s William’s blog post:
Short version: yes, it is possible to deploy a vCenter Server Appliance on VMC on AWS. But not with the GUI installer. William had to use the CLI installer with a JSON configuration file.
That was the “bright light bulb over my head” moment! If you can deploy a new VCSA through the CLI installer, then you should also be able to upgrade an existing appliance with the same toolkit!
vCenter Upgrade with the CLI installer
Thank you for reading this far. I know it’s a lot of text, but I can’t just write up a solution without context. The problem we were facing was a special one, and it needed a special solution.
Unlike the GUI installer, the CLI installer is just an executable that you feed with a JSON configuration file. It took us a while until that file suited our needs and infrastructure.
On the VMware documentation website, you can find more information about the various templates available to use with the CLI installer:
We used the “embedded_vCSA_on_VC.json” template because our VCSA is embedded, which means it includes the PSC (Platform Services Controller), and it is managed by and running in another vCenter. Below is the example from VMware:
{
  "__version": "2.13.0",
  "__comments": "Sample template to upgrade a vCenter Server Appliance 6.5 with an embedded Platform Services Controller to a vCenter Server Appliance 6.7 with an embedded Platform Services Controller on a vCenter Server instance.",
  "new_vcsa": {
    "vc": {
      "__comments": [
        "'datacenter' must end with a datacenter name, and only with a datacenter name. ",
        "'target' must end with an ESXi hostname, a cluster name, or a resource pool name. ",
        "The item 'Resources' must precede the resource pool name. ",
        "All names are case-sensitive. ",
        "For details and examples, refer to template help, i.e. vcsa-deploy {install|upgrade|migrate} --template-help"
      ],
      "hostname": "<FQDN or IP address of the vCenter Server instance>",
      "username": "<The user name of a user with administrative privileges or the Single Sign-On administrator on vCenter.>",
      "password": "<The password of a user with administrative privileges or the Single Sign-On administrator on vCenter. If left blank, or omitted, you will be prompted to enter it at the command console during template verification.>",
      "deployment_network": "VM Network",
      "datacenter": [
        "Folder 1 (parent of Folder 2)",
        "Folder 2 (parent of Your Datacenter)",
        "Your Datacenter"
      ],
      "datastore": "<A specific datastore accessible to the ESXi host or DRS cluster in the 'target' path.>",
      "target": [
        "Folder A (parent of Folder B)",
        "Folder B (parent of Your ESXi Host, or Cluster)",
        "Your ESXi Host, or Cluster"
      ]
    },
    "appliance": {
      "__comments": [
        "You must provide the 'deployment_option' key with a value, which will affect the VCSA's configuration parameters, such as the VCSA's number of vCPUs, the memory size, the storage size, and the maximum numbers of ESXi hosts and VMs which can be managed. For a list of acceptable values, run the supported deployment sizes help, i.e. vcsa-deploy --supported-deployment-sizes"
      ],
      "thin_disk_mode": true,
      "deployment_option": "small",
      "name": "Embedded-vCenter-Server-Appliance"
    },
    "os": {
      "ssh_enable": false
    },
    "temporary_network": {
      "ip_family": "<ipv4 or ipv6>",
      "mode": "static",
      "ip": "<Static IP address. Remove this if using dhcp.>",
      "dns_servers": [
        "<DNS Server IP Address. Remove this if using dhcp.>"
      ],
      "prefix": "<Network prefix length. Use only when the mode is 'static'. Remove if the mode is 'dhcp'. This is the number of bits set in the subnet mask; for instance, if the subnet mask is 255.255.255.0, there are 24 bits in the binary version of the subnet mask, so the prefix length is 24. If used, the values must be in the inclusive range of 0 to 32 for IPv4 and 0 to 128 for IPv6.>",
      "gateway": "<Gateway IP address. Remove this if using dhcp.>"
    },
    "user_options": {
      "vcdb_migrateSet": "<Set the data migration option. Available options are 'core', 'all', and 'core_events_tasks'.>"
    }
  },
  "source_vc": {
    "description": {
      "__comments": [
        "This section describes the source appliance which you want to",
        "upgrade and the ESXi host on which the appliance is running. "
      ]
    },
    "managing_esxi_or_vc": {
      "hostname": "<FQDN or IP address of the ESXi or vCenter on which the source vCenter Server Appliance resides.>",
      "username": "<Username of a user with administrative privilege on the ESXi host or vCenter Server. For example 'root' for ESXi and 'administrator@<SSO domain name>' for vCenter >",
      "password": "<Password of the administrative user on the ESXi host or vCenter Server. If left blank, or omitted, you will be prompted to enter it at the command console during template verification.>"
    },
    "vc_vcsa": {
      "hostname": "<FQDN or IP address of the source vCenter Server Appliance>",
      "username": "administrator@<SSO domain name>",
      "password": "<vCenter Single Sign-On administrator password. If left blank, or omitted, you will be prompted to enter it at the command console during template verification.>",
      "root_password": "<Appliance root password. If left blank, or omitted, you will be prompted to enter it at the command console during template verification.>"
    }
  },
  "ceip": {
    "description": {
      "__comments": [
        "++++VMware Customer Experience Improvement Program (CEIP)++++",
        "VMware's Customer Experience Improvement Program (CEIP) ",
        "provides VMware with information that enables VMware to ",
        "improve its products and services, to fix problems, ",
        "and to advise you on how best to deploy and use our ",
        "products. As part of CEIP, VMware collects technical ",
        "information about your organization's use of VMware ",
        "products and services on a regular basis in association ",
        "with your organization's VMware license key(s). This ",
        "information does not personally identify any individual. ",
        "",
        "Additional information regarding the data collected ",
        "through CEIP and the purposes for which it is used by ",
        "VMware is set forth in the Trust & Assurance Center at ",
        "http://www.vmware.com/trustvmware/ceip.html . If you ",
        "prefer not to participate in VMware's CEIP for this ",
        "product, you should disable CEIP by setting ",
        "'ceip_enabled': false. You may join or leave VMware's ",
        "CEIP for this product at any time. Please confirm your ",
        "acknowledgement by passing in the parameter ",
        "--acknowledge-ceip in the command line.",
        "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
      ]
    },
    "settings": {
      "ceip_enabled": true
    }
  }
}
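The “prefix” comment in the template describes the prefix length as the number of bits set in the subnet mask. If you only know the mask of your temporary network, a minimal Python sketch like the following can convert it. The helper name prefix_length is made up for this post, and the sketch only covers IPv4 masks:

```python
import ipaddress


def prefix_length(subnet_mask: str) -> int:
    """Return the number of bits set in a dotted-decimal IPv4 subnet mask."""
    # ip_network accepts the "address/netmask" form and exposes the prefix length.
    # An invalid mask such as 255.0.255.0 raises a ValueError.
    return ipaddress.ip_network(f"0.0.0.0/{subnet_mask}").prefixlen


print(prefix_length("255.255.255.0"))  # 24
print(prefix_length("255.255.254.0"))  # 23
```

The resulting number is what goes into the “prefix” field when the mode is “static”.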
The JSON configuration file below has been anonymized, but it reflects the exact configuration we used to get the upgrade running through the CLI installer:
{
  "__version": "2.13.0",
  "__comments": "Sample template to upgrade a vCenter Server Appliance 6.5 with an embedded Platform Services Controller to a vCenter Server Appliance 6.7 with an embedded Platform Services Controller on a vCenter Server instance.",
  "new_vcsa": {
    "vc": {
      "__comments": [
        "'datacenter' must end with a datacenter name, and only with a datacenter name. ",
        "'target' must end with an ESXi hostname, a cluster name, or a resource pool name. ",
        "The item 'Resources' must precede the resource pool name. ",
        "All names are case-sensitive. ",
        "For details and examples, refer to template help, i.e. vcsa-deploy {install|upgrade|migrate} --template-help"
      ],
      "hostname": "[vCenter which is hosting your vCenter]",
      "username": "[username for that vCenter]",
      "password": "[password for the user just above]",
      "deployment_network": "[the network to which the NEW vCenter should be connected to]",
      "datacenter": "[the datacenter where your vCenter is located]",
      "datastore": "[datastore where the new vCenter should be deployed]",
      "target": "[the target cluster name]"
    },
    "appliance": {
      "__comments": [
        "You must provide the 'deployment_option' key with a value, which will affect the VCSA's configuration parameters, such as the VCSA's number of vCPUs, the memory size, the storage size, and the maximum numbers of ESXi hosts and VMs which can be managed. For a list of acceptable values, run the supported deployment sizes help, i.e. vcsa-deploy --supported-deployment-sizes"
      ],
      "thin_disk_mode": true,
      "deployment_option": "medium",
      "name": "[the name of the NEW virtual machine in the vCenter inventory]"
    },
    "os": {
      "ssh_enable": true
    },
    "temporary_network": {
      "ip_family": "ipv4",
      "mode": "static",
      "ip": "[temporary_ip]",
      "dns_servers": [
        "[dns_server_ip]"
      ],
      "prefix": "[depends on your network, usually /24]",
      "gateway": "[gateway_ip]"
    },
    "user_options": {
      "vcdb_migrateSet": "core_events_tasks"
    }
  },
  "source_vc": {
    "description": {
      "__comments": [
        "This section describes the source appliance which you want to",
        "upgrade and the ESXi host on which the appliance is running. "
      ]
    },
    "managing_esxi_or_vc": {
      "hostname": "[vCenter which is hosting your vCenter]",
      "username": "[username for that vCenter]",
      "password": "[password for the user just above]"
    },
    "vc_vcsa": {
      "hostname": "[your current vCenter FQDN]",
      "username": "[SSO Admin user, mostly administrator@vsphere.local]",
      "password": "[password of the SSO Admin user]",
      "root_password": "[root password of the source vCenter]"
    }
  },
  "ceip": {
    "description": {
      "__comments": [
        "++++VMware Customer Experience Improvement Program (CEIP)++++",
        "VMware's Customer Experience Improvement Program (CEIP) ",
        "provides VMware with information that enables VMware to ",
        "improve its products and services, to fix problems, ",
        "and to advise you on how best to deploy and use our ",
        "products. As part of CEIP, VMware collects technical ",
        "information about your organization's use of VMware ",
        "products and services on a regular basis in association ",
        "with your organization's VMware license key(s). This ",
        "information does not personally identify any individual. ",
        "",
        "Additional information regarding the data collected ",
        "through CEIP and the purposes for which it is used by ",
        "VMware is set forth in the Trust & Assurance Center at ",
        "http://www.vmware.com/trustvmware/ceip.html . If you ",
        "prefer not to participate in VMware's CEIP for this ",
        "product, you should disable CEIP by setting ",
        "'ceip_enabled': false. You may join or leave VMware's ",
        "CEIP for this product at any time. Please confirm your ",
        "acknowledgement by passing in the parameter ",
        "--acknowledge-ceip in the command line.",
        "++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++"
      ]
    },
    "settings": {
      "ceip_enabled": false
    }
  }
}
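Because it took us a while to get this file right, a small pre-check before handing it to the installer can save a round trip. The following Python sketch is not part of the VMware tooling. It only parses the JSON and lists values that still look like placeholders, meaning they start with “<” (VMware template) or “[” (my anonymized version). The function names are made up, and the check is a heuristic: a real password that starts with one of these characters would be flagged as well.

```python
import json


def find_placeholders(node, path=""):
    """Yield (path, value) for every string value that still looks like a placeholder."""
    if isinstance(node, dict):
        for key, value in node.items():
            # "__comments" and "description" only carry documentation text.
            if key in ("__comments", "description"):
                continue
            yield from find_placeholders(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_placeholders(value, f"{path}/{index}")
    elif isinstance(node, str):
        # Heuristic: placeholders in both templates start with "<" or "[".
        if node.strip().startswith(("<", "[")):
            yield path, node


def check_template(text):
    """Parse the template text and return the list of unresolved placeholders."""
    config = json.loads(text)  # raises json.JSONDecodeError on syntax errors
    return list(find_placeholders(config))


# Tiny stand-in for a real template, just to show the output format.
sample = '{"new_vcsa": {"vc": {"hostname": "[vCenter]", "datastore": "ds01"}}}'
for path, value in check_template(sample):
    print(f"unresolved: {path} = {value}")
```

An empty result only means the file is valid JSON without obvious placeholders. It says nothing about whether the values match your infrastructure.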
As soon as you’ve got your JSON file ready, you can use the following command to verify the configuration first, and then start the upgrade process:
F:\vcsa-cli-installer\win32\vcsa-deploy.exe upgrade --accept-eula --no-ssl-certificate-verification --verify-template-only D:\vcsa_upgrade\your_configuration.json
The command above takes your JSON file and verifies it against your infrastructure. In my personal experience, it is only a basic verification. It tests the credentials, connectivity and some other basic things. But it doesn’t check the required disk space, for example, as we found out later. Once the verification passes, run the same command without “--verify-template-only”:
F:\vcsa-cli-installer\win32\vcsa-deploy.exe upgrade --accept-eula --no-ssl-certificate-verification D:\vcsa_upgrade\your_configuration.json
This command starts the upgrade process and it will run until it either fails or succeeds.
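If you want to chain both steps, the two commands can be wrapped in a few lines of Python. This is only a sketch: the helper names build_upgrade_command and run_upgrade are made up, the flags are exactly the ones from the commands in this post, and I’m assuming that the installer signals a failed verification with a non-zero exit code.

```python
import subprocess


def build_upgrade_command(installer, template, verify_only):
    """Return the vcsa-deploy argument list used in this post."""
    command = [
        installer,
        "upgrade",
        "--accept-eula",
        "--no-ssl-certificate-verification",
    ]
    if verify_only:
        command.append("--verify-template-only")
    command.append(template)
    return command


def run_upgrade(installer, template):
    """Verify the template first and start the upgrade only if that succeeds."""
    # check=True raises CalledProcessError when the installer exits non-zero
    # (assumption: a failed verification is reported that way).
    subprocess.run(build_upgrade_command(installer, template, True), check=True)
    subprocess.run(build_upgrade_command(installer, template, False), check=True)


# Paths as used in the commands above; only the command lists are printed here.
installer = r"F:\vcsa-cli-installer\win32\vcsa-deploy.exe"
template = r"D:\vcsa_upgrade\your_configuration.json"
print(build_upgrade_command(installer, template, verify_only=True))
print(build_upgrade_command(installer, template, verify_only=False))
```

Calling run_upgrade(installer, template) would then run the verification and, if it passes, the actual upgrade.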
On the VMware Knowledgebase you’ll find all the parameters you can use within the JSON configuration file:
Free disk space?
After spending some time figuring out the right parameters and information for the JSON file, we started the upgrade process and ran into the issue that our source vCenter didn’t have enough free space on one disk to do the “all” export. For the vCenter upgrade, you can choose how much data from the old vCenter should be migrated into the new instance, but more on that later.
We then figured out which disk we had to resize, but that wasn’t as easy as it sounds.
We knew that the folder (it was “/var/tmp”, and I apologize for not having it visible in the screenshot above) was located on the disk partition “sda3”. But we couldn’t just add some space to the correct VMDK and resize the partition, because partition “sda4” was preventing this.
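Since the template verification doesn’t look at disk space, it can be worth checking the free space of the file system that holds “/var/tmp” before starting an export. On the appliance shell, “df -h /var/tmp” shows it directly. The same check as a small Python sketch, assuming a Python 3 interpreter is available where you run it (free_gib is a made-up helper name):

```python
import os
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the file system that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 1024 ** 3


# "/var/tmp" is the export location we ran into; fall back to the current
# directory on machines where that folder does not exist.
path = "/var/tmp" if os.path.isdir("/var/tmp") else "."
print(f"{path}: {free_gib(path):.1f} GiB free")
```

How much space the export really needs depends on the amount of data in your vCenter, so this only tells you what is available, not whether it is enough.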
We tried to figure out what was stored on “sda4”, but that wasn’t easy. We checked multiple guides on the internet, blog posts and knowledge base articles, but without success. Some guides mentioned that you can delete “sda3” and recreate it in the right size (they didn’t mention “sda4”, and it wasn’t visible in their screenshots). One way could be to boot from a Linux live CD and use a partitioning tool to move “sda4” to the end of the disk in order to have enough space to resize “sda3”.
But we thought that this was too risky, even with all the backups we had of our vCenter. Screw up the disks and you’re done. So we decided not to transfer all the data. The “all” parameter mentioned in the upgrade process above transfers the configuration, historical, and performance metrics data.
We were fine with setting the parameter to “core_events_tasks”, which means that during the upgrade only the configuration and historical data (events and tasks) are transferred, but no performance metrics. It was a compromise that we had to go with.
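In the JSON file, this choice is the “vcdb_migrateSet” value under “new_vcsa” and “user_options”. If you prefer to switch it programmatically instead of editing the file by hand, a minimal Python sketch could look like this. The function name is made up, and the three accepted values are the ones listed in the VMware template comment:

```python
import copy
import json

# Values named in the template comment for vcdb_migrateSet.
VALID_MIGRATE_SETS = ("core", "all", "core_events_tasks")


def set_migrate_set(config, value):
    """Return a copy of the template dict with vcdb_migrateSet set to value."""
    if value not in VALID_MIGRATE_SETS:
        raise ValueError(f"unsupported vcdb_migrateSet: {value!r}")
    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    user_options = updated.setdefault("new_vcsa", {}).setdefault("user_options", {})
    user_options["vcdb_migrateSet"] = value
    return updated


# Reduced stand-in for the full template.
template = {"new_vcsa": {"user_options": {"vcdb_migrateSet": "all"}}}
print(json.dumps(set_migrate_set(template, "core_events_tasks"), indent=2))
```

With a real file, you would load it with json.load, pass the result through set_migrate_set and write it back with json.dump.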
Let’s start again!
We started the upgrade process again and it looked very promising. All went fine, we saw many tasks running and succeeding. But at a certain point, the upgrade failed. The old vCenter was already shut down, the new instance had been started and configured correctly. But then, bam! The upgrade process did a kind of rollback. It reverted the new instance to the temporary IP address and the process ended. That’s not nice! What happened?
We went through the JSON file again to check if we maybe had a wrong parameter somewhere, but all looked good. We also verified the configuration with the command above multiple times, and all looked good. We hadn’t found any issue so far.
We started the upgrade process again (the third run so far…) and it failed again, with the same error messages. We went on and checked the logs of the most recent run and we even went through the Python scripts mentioned in the logs (all on the source vCenter). But there was nothing which led us to an issue.
But something grabbed our attention anyway. There was one entry in the log pointing to some privileges being added to a role, and this resulted in an error. We then searched the internet for “VMTX_SYNC_PRIVILEGES”, because this term jumped out at us.
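We found that entry by reading the logs manually. If you have collected the upgrade logs in a folder, a search for a term across all files can also be scripted. The following Python sketch is generic: the log directory is passed as an argument because the location depends on your setup, the “*.log” file pattern is an assumption, and find_in_logs is a made-up helper name.

```python
import sys
from pathlib import Path


def find_in_logs(log_dir, term):
    """Return (file name, line number, line) for every line that contains term."""
    hits = []
    for log_file in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a file has odd encoding.
        with log_file.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if term in line:
                    hits.append((log_file.name, number, line.rstrip()))
    return hits


# Usage: python find_in_logs.py <directory with the collected logs>
if __name__ == "__main__" and len(sys.argv) == 2:
    for name, number, line in find_in_logs(sys.argv[1], "VMTX_SYNC_PRIVILEGES"):
        print(f"{name}:{number}: {line}")
```

On a Linux shell, “grep -rn” with the same term does the job just as well.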
Finally. The solution!
We found two results. Two. At first we thought that we were the only ones with that issue. But one link led us to the VMware VMTN Communities. One user had tried to upgrade his vCenter, but with the GUI installer instead of the CLI version, and his upgrade had failed as well. He crawled through the logs just as we did, and he had the same error as we had. Lucky us (somehow)!
That’s the community post:
https://communities.vmware.com/thread/611985
On the source vCenter, there was a role missing. There are more roles than you see in the list, but they are hidden, and the missing role is one of the hidden ones. We had to create a role named “com.vmware.Content.Admin” and enable all privileges under “Content Library”. After creating this role, we had to reboot our vCenter.
And the rest of that story is already in the history books…
After the last vCenter reboot, we were finally able to upgrade our vCenter from 6.5 to 6.7 through the CLI installer, with a proper JSON configuration file, and without any issues this time. The upgrade itself took about an hour or so. The troubleshooting took us about two working days until we found the solution. I was willing to try it again late at night on the first day, but I bet I wouldn’t have spotted the missing role error then. It needed a new day and a fresh start. There are times when you can’t see the forest for the trees.
Conclusion
I know, that’s a lot of text to read. But I can’t just provide you with a step-by-step guide, because special problems need special solutions, and I wanted to give you some context. You might say that we should have updated our hyper-converged systems more often. Or maybe there was an issue with our vCenter at a certain point in time because of that missing role. I hope that you can see the paths we walked and the possibilities we tried. In the end, it worked fine, and the issue has been solved.