Yesterday I published my first white paper. It's nothing special, just a small guide on how to use vSphere Update Manager (VUM) in your DRS-enabled cluster.
In today's world of IT, datacenter and cloud automation, maintenance windows and downtime are a sensitive topic. A few years ago the IT department did updates mostly on weekends, because nobody was working then. On Monday everyone came back to the office, the mail server was patched and driver updates were installed. Everybody uses IT like running water, and nobody except IT knows what effort it takes to keep IT, and thus the business, running.
Today, maintenance windows with service interruption are at least somewhat a thing of the past, but not to be forgotten, because everyone wants access to their data whenever it's needed, wherever it's located. You can't shut down a mail server to install updates, and you can't restart virtualization hosts just to install a driver or a patch. IT has to keep running like water from the tap.
I'm working as a system engineer for an IT company in Switzerland. We provide various services to our customers, ranging from small to medium-sized businesses. I have seen many transformations in business needs, but most customers had the same wish: their employees should have access to their emails and be able to work from home or on the go. So the IT systems had to run twenty-four hours a day, seven days a week.
VMware vSphere Update Manager is a powerful tool to update your ESXi hosts. It can automatically put your hosts into maintenance mode, and if DRS is enabled, your virtual machines are moved to other hosts automatically. At least from the infrastructure perspective you can avoid any maintenance window or even downtime, because the DRS cluster manages the VMs and you can patch your ESXi hosts in the middle of the day.
Today was the last day of our VMware vSAN Deploy and Manage course. Nevertheless, we gave it everything once again. We did a deep dive into designing vSAN solutions, discussed the key topics in design decisions and also played around with some what-if scenarios. But as on every day, we kicked off with a review of what we covered yesterday, again to make clear to everyone what vSAN really is.
Simple, with a GUI, all from the web client (with just a few clicks):
create a Cluster
set a VMkernel for vSAN
create disk groups
claim them as cache / capacity tier
What’s in the default vSAN policy:
FTT = 1
Stripes = 1
No reservation (neither cache nor capacity)
What is a Fault Domain:
an area that may fail as a whole
plan them to limit the operational impact of a failure
In vSAN a failed component can be in one of two states:
Absent => wait for 60 minutes, then rebuild
Degraded => rebuild immediately
Cache disk fails => the whole disk group is lost => latency increases
Capacity disk fails => degraded => rebuild => VM back online
Controller => host issue => HA response
Host outage (complete loss of the host) => HA response => VMs restarted
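The absent/degraded behavior above can be sketched as a small decision helper. This is just a simplified model for illustration, not vSAN's actual logic; the 60-minute value is the default delay before absent components are rebuilt.

```python
# Simplified model of how vSAN reacts to failed components (illustration only).

ABSENT_REBUILD_DELAY_MIN = 60  # default wait before rebuilding absent components

def rebuild_action(state, minutes_since_failure):
    """'degraded' = vSAN knows the device is dead (e.g. failed capacity disk).
    'absent' = vSAN does not know what happened (e.g. host rebooted)."""
    if state == "degraded":
        return "rebuild immediately"
    if state == "absent":
        if minutes_since_failure >= ABSENT_REBUILD_DELAY_MIN:
            return "rebuild"
        return "wait ({} min left)".format(ABSENT_REBUILD_DELAY_MIN - minutes_since_failure)
    raise ValueError("unknown state: " + state)

print(rebuild_action("degraded", 0))   # rebuild immediately
print(rebuild_action("absent", 10))    # wait (50 min left)
print(rebuild_action("absent", 61))    # rebuild
```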
Module 7 Lesson 2 – Troubleshooting
Some topics we had already covered yesterday, so today brought some repetition plus a quick overview of troubleshooting and some of the tools we discovered yesterday. There are so many troubleshooting tools available, either built-in or community-driven, that the list could be much longer. But at least I can provide you with this list of some of the best-known tools.
vCenter (you don’t say…)
Wireshark (yes indeed; capture packets on ESXi and analyze them with Wireshark => pcap)
A really cool tool is vCheck. It's based on PowerShell scripts and runs against your vSphere infrastructure (there are scripts for other products too). You schedule the scripts and receive notifications about changes and issues (before they become a real problem). So when you arrive at the office you already know what's going on (or what's not). Also worth mentioning is vSAN Observer. It's already there; just start it and access the built-in web server to get an overview of what's going on in your vSAN environment.
Module 8 – Stretched Cluster
After doing some work in the labs we talked about design. A stretched cluster is also a question of design: how do you create a solution that covers rack outages or even a complete site outage? You can do that with a stretched cluster. And the failover happens automatically (which may not be the best behavior in every failover situation…).
When you plan a stretched cluster you have to think about resources. You need 50% spare capacity on both sites (talking about two racks or two sites), enforced via HA admission control. Keep in mind that one site / rack has to keep the other one online, in addition to the workloads already running on the surviving site.
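To make the 50% rule concrete, here is a tiny sanity check. It is a toy model assuming two equal sites and CPU as the only resource; real admission control sizing covers memory, reserved percentages and more.

```python
# Toy sizing check for a two-site stretched cluster: after losing one site,
# the surviving site must carry the workloads of BOTH sites.

def fits_after_site_failure(site_capacity_ghz, site_a_load_ghz, site_b_load_ghz):
    total_load = site_a_load_ghz + site_b_load_ghz
    return total_load <= site_capacity_ghz

# Each site has 100 GHz of CPU. Loading both sites to 50% still fits:
print(fits_after_site_failure(100, 50, 50))  # True
# Load one site to 60% and a full site failure no longer fits:
print(fits_after_site_failure(100, 60, 50))  # False
```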
You don't have to use SRM (Site Recovery Manager) for a failover; vSAN does that for you automatically. If you use SRM, you have to maintain a recovery plan for each and every VM. That's a lot of planning, and you constantly have to check for new or changed VMs. Not to mention the costs: you need SRM licenses and a second vCenter license.
A few words about the vSAN witness. A witness is a separate ESXi box. It can be a physical server with ESXi, which needs to be licensed; such a server can't be a member of a cluster, but it can run some VMs. Or you can use the witness appliance, a special ESXi packaged as an appliance that runs on another ESXi server. The appliance cannot run any VMs.
You can have a ROBO vSAN cluster in your remote office / branch office that consists of only two ESXi hosts. If you do so, you need a witness host / appliance at your main office site. You always need a witness somewhere to reach quorum in case of an HA event. And remember the five heartbeats: in the case of an outage, after five missed heartbeats your host is considered gone and a failover happens.
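The five-heartbeat rule can be illustrated in a few lines of Python. This is only the idea; the real FDM failure detection involves timeouts, isolation addresses and datastore heartbeats.

```python
# Toy failure detection: a host is declared failed after 5 missed heartbeats in a row.

MISSED_HEARTBEAT_LIMIT = 5

def host_declared_failed(heartbeats):
    """heartbeats: sequence of True (received) / False (missed), oldest first."""
    missed_in_a_row = 0
    for received in heartbeats:
        missed_in_a_row = 0 if received else missed_in_a_row + 1
        if missed_in_a_row >= MISSED_HEARTBEAT_LIMIT:
            return True
    return False

print(host_declared_failed([True] + [False] * 5))               # True: 5 misses in a row
print(host_declared_failed([True, False, False, True, False]))  # False: never 5 in a row
```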
Module 10 – Designing a vSAN deployment
That's not a random list of IT buzzwords, folks. You have to consider these key points when you're designing a vSAN solution (and probably any other scalable solution too).
Let me give you some more things to consider. While designing a vSAN solution you will have to find answers to these questions. Some answers you will get from your customer when talking about a solution for their specific needs. Other answers you will find while designing the solution. And you will run into some more questions too…
Requirements (must have / be / do)
“RPO of 15 minutes”
“RTO of 5 minutes”
Location of data / data center
Constraints (design decisions)
“Must work with existing network hardware”
“Must work at this site”
“We have enough bandwidth”
“If the bandwidth is not enough => risk of not meeting the SLA”
If you have covered the topics above (and the bullet points are just ideas; there is a lot more to cover), you can proceed with the design.
This week I'm attending the VMware vSAN Deploy and Manage course with Paul McSharry as our instructor. I'm still learning and preparing for my VCP6-DCV, which I want to pass before New Year's Eve. And there is a helluva lot of stuff to stuff into my brain. This course is not specifically aimed at the VCP exam, but it will help to answer at least some questions about vSAN, which is part of vSphere, which in turn is part of the VCP. So it's not bad to get some insights.
Starting off day 2, we had a quick review of what we did and discussed on day 1. We repeated what vSAN is and what you can do with it (and what not; see Paul's review question list further down). Today we worked a lot in the labs to get familiar with some functions, including some stuff you probably wouldn't do just like that in production. We also enjoyed a small outlook on vSAN 6.5 and some of its features in comparison with vSAN 6.2.
After Paul's questions we talked about some basic networking stuff. We discussed load balancing, features of virtual distributed switches and so on. vSAN is set up in just a few clicks, but you have to take care of the networking: vSAN is a storage topology that depends on properly configured and well-performing network connections. So it's a good idea to make the network admins your friends.
Module 5 Lesson 1 – vSAN policies and VMs
A policy is a state config and a specification, and it basically defines the SLA. It can be configured at VM or even at VMDK level. The FTT value describes how many host failures can be tolerated; it determines the number of replicas of your data (how many copies to store). Stripes are about performance: they define the number of physical disks across which each replica of a storage object is striped. Adding more stripes can increase performance, but resource usage increases as well, and you will probably need more disks.
Another component in vSAN is the witness. It is the tiebreaker for objects. The cluster always needs a quorum to decide what to do in case of an outage (absent or degraded state). By default, if a host is absent (the cluster does not know what happened to it), your data will be re-replicated after a wait time of 60 minutes. If a component is degraded (the cluster knows what happened to the downed host), the data will be re-replicated immediately. You can see that the default vSAN policy with FTT=1 is always your safety net. It is recommended not to edit the default vSAN policy but to create new ones and apply those to your vSAN storage / VM / VMDK.
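For a quick feeling of what FTT and stripes mean for sizing, here is a back-of-the-envelope sketch. It assumes plain mirroring (RAID-1) and ignores real-world overheads like swap, metadata and slack space; the single-witness count only holds for the simple FTT=1, stripes=1 case.

```python
# Back-of-the-envelope vSAN policy math (mirroring only, overheads ignored).

def raw_capacity_needed_gb(vm_size_gb, ftt=1):
    """FTT=n means n+1 replicas of the data are stored."""
    return vm_size_gb * (ftt + 1)

def components_per_object(ftt=1, stripes=1):
    """(ftt+1) replicas, each striped across `stripes` disks, plus one witness
    (simplified; larger policies can add more witnesses)."""
    return (ftt + 1) * stripes + 1

print(raw_capacity_needed_gb(100, ftt=1))       # 200: a 100 GB VM eats 200 GB raw
print(components_per_object(ftt=1, stripes=1))  # 3: two replicas plus one witness
```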
Module 5 Lesson 2 – vsanSparse Snapshots
Is a snapshot a backup? Most people freak out at this question. No, it's not a backup. If you want to back up your VMs (and that's a damn good idea…) you should use vSphere Data Protection (or another third-party product). But VMware made some changes specifically for Virtual SAN snapshots: the vsanSparse snapshot. A traditional snapshot is still created, but with this new VMDK type. The delta file is mounted with a virtual SCSI driver, all read requests are served through an in-memory cache (physical memory of the host), and all writes go directly to disk. It should not create any performance impact, and you can keep up to 32 snapshots as long as you want. But don't do that. Really.
Module 6 – Management (HA & Update)
At the beginning of this module we talked about maintenance mode and its specific differences in a vSAN cluster. Maintenance mode enables you to take a host out of rotation. That is the normal vSphere (HA / DRS) maintenance mode; the vSAN maintenance mode is slightly different.
When you put a host of a vSAN cluster into maintenance mode, you can choose between three modes:
Ensure accessibility => move objects to active vSAN resources as needed to ensure access.
Full data migration => move all objects to active vSAN resources, regardless of whether the move is needed.
Do nothing => move no objects. Some objects might become unavailable.
We discovered in a class discussion that, depending on the amount of data residing on the host, it can be painful to put a host into maintenance mode, even if you don't do a full data migration but just ensure accessibility. It can take anywhere from a few minutes up to several hours until the host is in maintenance mode. But you can decrease the time needed by adding more hosts and increasing FTT and stripes.
A few words about HA (High Availability). If your cluster already has HA configured, you cannot enable vSAN: you have to disable HA, enable vSAN, and then enable HA again. When HA is turned on, the FDM agent (HA) traffic uses the Virtual SAN network. The datastore heartbeat is disabled when there are only vSAN datastores in the cluster; HA will never use a vSAN datastore for heartbeating, because the vSAN network is already used for the network heartbeat.
What happens with physical disk failures? In traditional server environments, or with a classic SAN, you create a RAID array, probably with a hot spare, to ensure immediate disk replacement if a disk fails. With vSAN the redundancy is built logically within vSAN itself (FTT, stripes, witness). That's the reason you shouldn't create a RAID array, but instead configure your disk controller in pass-through mode, so vSAN is aware of each physical disk and its state.
The upgrade process for vSAN in a few words…
but it’s I/O intensive
you can’t downgrade a disk group once the upgrade is completed
it needs more than 3 hosts (with only 3 hosts you have to run the upgrade with reduced redundancy allowed => potential risk)
Before you upgrade, check the hardware for vSAN 6 support (HCL…). The rest of the upgrade process is straightforward:
First upgrade your vCenter
then upgrade the vSphere Update Manager (VUM)
After that, upgrade your ESXi hosts to version 6
Confirm that Ruby vSphere Console (RVC) is accessible
Log in to RVC and execute the upgrade script at cluster level
A cool built-in tool we tried out today on day 2 of our course is the Ruby vSphere Console (RVC), with which the vSAN Observer can be enabled and started. The process starts a web server which you can then access via https://vCenterServer_hostname_or_IP_Address:8010. The result looks like this:
The initial configuration is not that easy, but it's not a big deal: enter some commands and you're good to go. The web server stops itself after a runtime of an hour, or when you manually stop it with Ctrl + C in the CLI console.
Module 8 – Stretched Cluster
Everyone knows what a cluster is: a group of servers that act like a single system. A stretched cluster is very similar to a normal cluster, with the difference that the same cluster spans two sites (or perhaps multiple racks in one datacenter), including vMotion, Storage vMotion and all other cluster-enabled features.
A stretched cluster helps you to…
do maintenance of a complete site with no downtime
lower RPO for unplanned failures
Setting up fault domains enables you to set…
Rack Awareness (1st is primary site, 2nd is failover, 3rd is witness)
Site Awareness (across sites)
A stretched cluster has some specific requirements (some are also required to setup vSAN itself):
L2 stretched network for vSAN (Multicast)
L3 routed network between witness and vSAN hosts (Unicast)
Less than 5ms network latency for data
200ms latency for witness
500ms latency for ROBO (the two-host vSAN in your remote office / branch office)
10Gbit links are recommended
If you have fewer than 10 VMs in your ROBO, you're fine with 1Gbit links
Consistent MTU size from end to end
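You can turn the latency numbers above into a quick design check. The thresholds below are the ones from the course notes; treat them as round-trip guidelines and verify the current values in VMware's documentation before relying on them.

```python
# Quick check of measured latencies against the limits listed above (ms).

LATENCY_LIMITS_MS = {
    "data": 5,       # between data sites in a stretched cluster
    "witness": 200,  # between data sites and the witness
    "robo": 500,     # two-host ROBO cluster to witness
}

def meets_latency(link, measured_ms):
    return measured_ms <= LATENCY_LIMITS_MS[link]

print(meets_latency("witness", 150))  # True
print(meets_latency("data", 7))       # False: too slow for the data path
```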
You can imagine the following scenarios when there are outages in your environment:
Failed site => site failover
Failed host, same site => if resources are sufficient to meet the SLA, restart on the same site; otherwise DR on the other site
Failed witness => everyone carries on working, because no tiebreaker is needed
Failed network between sites => restart on the preferred site
Failed site with vCenter => the witness comes into play to restart on the FD2 site
Today we learned a lot about vSAN and its technical details. With an all-flash solution you get lots of IOPS and performance. With a stretched cluster you can even tolerate a complete site failure. Think about that! VMware Virtual SAN is a really cool storage topology which is easy to set up if everything is prepared correctly (networking!).
Here you can find the other blog posts about the vSAN deploy and manage course:
It's been a while since I published my last blog article. I wasn't really in the mood for it because I'm learning and preparing for my VCP6-DCV, and there is a helluva lot of stuff to stuff into my brain. This week I'm publishing some articles, beginning with this one. Not because I've stopped learning for the VCP, but because I'm learning right now: I'm attending the VMware Virtual SAN Deploy and Manage course. Let's call this a recap. My brain is still collecting data and sorting it onto the right shelf, and this recap helps me with that. But let's start now. You probably don't want to know what's going on in my brain…
Module 1 – Introduction to the course
To break the ice, our instructor Paul McSharry started with a short introduction round for all attendees. Paul introduced himself (and his cat too…). I didn't know him personally; I just knew that he is an expert in his area and an instructor. I had heard from other people that Paul does his stuff very well. So I expected a good start, and in the end the whole day was great.
First of all, it's now official that the product is called vSAN (with a small v in front) and not VSAN. VMware recently changed the name of this particular product. It should show that vSAN is integrated directly into the ESXi hypervisor. vSAN is a policy-driven, software-defined storage tier. There are no dependencies on a VM; it runs directly in the hypervisor. We all knew that, and you don't need special software or a plugin to use vSAN, it's just a matter of licensing. But now the name of the product makes that clear too.
When a customer asks for software-defined storage, it has to be flexible, easy to install and use, quick and scalable. The customer doesn't want to compromise on performance. And it should run in the private cloud as well as in the public cloud.
Because of the size of my customers I don't work with scalability very often, and I somehow feel a bit ashamed of that. We always calculate some reserve into the systems, because we know our customers and always clarify their needs. Now I'm 100% sure what scale-out and scale-up are about: you scale up when you add disks to your hosts to increase the capacity (or caching) tier in Virtual SAN, and you scale out when you add one or more hosts to your (vSAN) cluster to increase overall performance and capacity.
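Put into numbers (a simplified model with identical hosts and disks, ignoring deduplication, FTT overhead and so on), the difference between the two looks like this:

```python
# Scale-up vs. scale-out in raw capacity terms (simplified).

def raw_capacity_tb(hosts, capacity_disks_per_host, disk_tb):
    return hosts * capacity_disks_per_host * disk_tb

base = raw_capacity_tb(hosts=4, capacity_disks_per_host=5, disk_tb=2.0)
scale_up = raw_capacity_tb(hosts=4, capacity_disks_per_host=7, disk_tb=2.0)   # more disks
scale_out = raw_capacity_tb(hosts=5, capacity_disks_per_host=5, disk_tb=2.0)  # more hosts,
                                                                              # plus CPU/RAM
print(base, scale_up, scale_out)  # 40.0 56.0 50.0
```

Note that scale-out also adds compute resources, which pure scale-up does not.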
Module 2 – Storage fundamentals
In the second module we talked about some basic storage topics like spinning disks (rotating rust) and SSDs, about IOPS and so on. We covered some good points about latency, and why it's good to have at least a flash cache in a hybrid vSAN, or better, to go all-flash. That doesn't mean spinning disks are old school and that you shouldn't use them. They are great in terms of price per capacity. Think about an archive system or a huge backup storage; for these use cases spinning disks still deliver a fair amount of IOPS.
We also took a short look at RAID levels. It's always good to know them and to hear about them from time to time. When you're using vSAN you don't have to create a RAID array on your built-in storage controller; just make the controller pass the disks through to ESXi / vSAN and you're all good. Also worth discussing were storage protocols like Fibre Channel (FC), iSCSI and NFS. And last but not least, the VMware HCL, the hardware compatibility list. Always check it when you build your own vSAN ready nodes, or even when you upgrade firmware (especially disk / controller firmware) on a certified vSAN ready node.
Module 3 – What is vSAN and use cases
Creating a vSAN is easy. You can tune it, set limits and so on, very much like with VMs. Every single I/O goes through the hypervisor: there is no HBA and no fabric between host and storage. That's a big plus for storage latency, which decreases massively. With vSAN you use local resources (CPU, memory and storage). If you have to expand capacity you can easily add more disks (scale-up) or add a host (scale-out). Let's take a look at the ESXi hypervisor: it already comes with HA, DRS, VDP and vSphere Replication, and VMware Virtual SAN is compatible with these common features. It's just a matter of licensing. You are using different storage tiers? You don't have to with vSAN, it's policy-based: limit IOPS for noisy neighbors or to guarantee an SLA.
You can build your own vSAN ready nodes (brownfield / specific pods), but then you always have to check the HCL. Or you choose preconfigured vSAN ready nodes from your favorite hardware vendor: HPE, Dell etc. provide approved solutions. And last but not least there are the Dell EMC VxRail and VxRack systems, preinstalled and preconfigured.
VMware Virtual SAN has been production-ready since version 6.0. One of the most common use cases is virtual desktop infrastructure (VDI), but customers also run their Exchange servers, transactional databases and so on on vSAN. There is no right or wrong. Even if you just want to free up space in your racks and replace old hardware, you're good to go. With two height units you can replace four servers and a shared storage system which together demand at least 10 units. Converged systems are a space saver. And don't forget about the energy savings.
Module 4 – Virtual SAN concepts, requirements, install checklist
A vSAN datastore is accessible to all hosts in the vSAN cluster, whether or not they have local storage of their own. You can have capacity hosts and compute hosts if you want. Other storage topologies can easily coexist with vSAN; there is no limitation. A vSAN datastore is built from disk groups. Every disk group is a single capacity unit of a host and provides cache and data. You must have exactly one flash disk per disk group and one or more capacity disks. There is a limit of five disk groups per host / node.
You need at least three vSAN hosts / nodes for production environments. Your data (for example a VM) is stored across the hosts in the cluster. Three components are stored in total: two replicas and a witness. If a host failure occurs, the cluster needs a quorum to decide what to do with your data. That's the reason there are three components.
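The quorum idea behind those three components can be shown in a couple of lines. This is a toy model: an object stays accessible only while more than half of its components are reachable (newer vSAN versions actually use per-component votes, which this sketch glosses over).

```python
# Toy quorum check for a vSAN object with two replicas and one witness.

def object_accessible(components_alive, components_total=3):
    """Accessible only with a majority of components (> 50%) reachable."""
    return components_alive > components_total / 2

print(object_accessible(3))  # True: everything healthy
print(object_accessible(2))  # True: one host down, one replica plus witness remain
print(object_accessible(1))  # False: no quorum, object becomes inaccessible
```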
disk controller in pass through?
host cache and capacity disks?
VMkernel marked as vSAN?
Multicast on network level?
Uplink or VLAN considerations?
1Gbit or 10Gbit network connection?
Cluster of three nodes?
Standard switches or distributed virtual switches?
Setting up vSAN is easy as pie. Meet the requirements and turn it on:
setup vSAN networking
enable vSAN on the cluster
select automatic or manual disk claiming
create disk groups yourself if you selected manual disk claiming
And because it's that easy, the official video about setting up VMware Virtual SAN is only about three minutes long. There you go.
The last few days and weeks I was preparing for my VCP6-DCV exam. Well, I'm still preparing for it; there is a helluva lot of stuff to learn and understand. One of those things is vSphere Auto Deploy.
vSphere Auto Deploy is a cool feature for large infrastructures. Imagine: you just mount your ESXi hosts in the racks, power them on, and they get their software, setup and configuration via the network. Without the need for a USB stick, CD or remote mounting of ISO files (like with HPE iLO or Dell iDRAC), and without any local storage if you boot your ESXi hosts from a shared datastore. Your host is online in just a few minutes and ready for use in your cluster, or whatever scenario you need it for.
Today I did some Auto Deploy stuff. And it is not as easy as I thought. You can't do much via the vSphere Web Client (I'm absolutely the GUI type of sysadmin). You have to do some PowerCLI work, but not as much as I was afraid of. Let me show you how I did it. And please drop a comment if there is anything wrong, or anything to improve. I'll gladly update this post if necessary.
Stage 1 – Preparation for Auto Deploy
What do you need for using Auto Deploy? There is not much:
Probably some hosts you want to set up with Auto Deploy
Let me give you some tips about the configuration of vCenter and the Open TFTP Server. With this piece of software I had to try and fail a few times until I got it up and running.
vCenter configuration – Enable Auto Deploy
Login to your vCenter with the Web Client.
Click on “Administration“.
Click on “System Configuration” and then “Services” on the next page.
Click on “Auto Deploy“.
In the toolbar on top, click “Actions” and then “Start“.
Under “Actions” and “Edit Startup Type” you can configure Auto Deploy for a automatic or manual start.
vCenter configuration – download TFTP boot zip file
Login to your vCenter with the Web Client.
Click “vCenter Inventory Lists” and then click “vCenter Servers“.
In this overview click your vCenter Server.
Click on “Manage” then “Settings” and then “Auto Deploy“.
Click on the link “Download TFTP Boot Zip” to download the file. You’ll need it later for the TFTP server.
Open TFTP Server – setup and configuration
Download and install the Open TFTP Server (I'll use this software in my configuration).
Use the standard settings for installation.
Navigate to the setup folder (e.g. C:\OpenTFTPServer) and open the “OpenTFTPServerMT.ini” with a text editor.
You'll need to configure the [HOME] parameter. This points to the folder where you have to save the TFTP Boot Zip from above.
Locate the [HOME] parameter, ignore the existing text there, and simply add C:\TFTP-Root (or any other folder you'd like) after the last line of text in this part of the INI file. Add the path without quotation marks.
Restart the Open TFTP Server service.
Copy the “TFTP Boot Zip“ file from above into the folder you added in the INI file and unpack it directly there. You should now have about 11 files, including the zip file.
Restart the Open TFTP Server service again.
Configure DHCP server with options
You need to configure your DHCP server with two options so that your ESXi hosts can boot via network / PXE and get an IP address and the boot file.
Add option 66, which is frequently called next-server. Add the IP address of your TFTP server as value.
Add option 67, which is frequently called boot-file. Add undionly.kpxe.vmw-hardwired as value.
Stage 2 – Create depot, profiles and rules, and deployment
Download the ESXi Offline Bundle from VMware and save it in a folder on the machine where you’re doing this stuff.
Open PowerCLI and connect to your vCenter (Connect-VIServer).
As I wrote above, it is not that easy, but it was not as hard as I feared. There are some things to consider, like a correct ESXi configuration of networking, storage etc., so that you can later create a suitable host profile which fits all of your hosts. In this first try I didn't create a big configuration, just some basic stuff to understand Auto Deploy and to write this blog post.
I have to investigate the password policy, or rather, how I can set one. My test ESXi host did not have a root password after the host profile was applied. I know I configured the password in step 12 above along with the rest of the configuration, but it didn't come with the host profile. Anyway, the configuration of Auto Deploy worked. Now I've got some more tasks, for example finding out about this password issue.
Special thanks to Duncan Epping for his cheat sheet (no, I did not read his article, just his cheat sheet; but yes, I saw the link to his article). So I had the commands I needed and a common thread for orientation.