Lab Overview - HOL-1801-02-CMP - vRealize Suite Standard: Automated, Proactive Management
Note: It will take more than 60 minutes to complete this lab. You should expect to only finish 2-3 of the modules during your time. The modules are independent of each other so you can start at the beginning of any module and proceed from there. You can use the Table of Contents to access any module of your choosing.
The Table of Contents can be accessed in the upper right-hand corner of the Lab Manual.
In this Lab you will see how to automate the workload balance across your vSphere Infrastructure and automatically remediate any issues that may come up in the environment.
Lab Module List:
This lab manual can be downloaded from the Hands-on Labs Document site found here:
This lab may be available in other languages. To set your language preference and have a localized manual deployed with your lab, you may utilize this document to help guide you through the process:
During this module, you will input text into the Main Console. Besides directly typing it in, there are two very helpful methods of entering data which make it easier to enter complex data.
You can also click and drag text and Command Line Interface (CLI) commands directly from the Lab Manual into the active window in the Main Console.
You can also use the Online International Keyboard found in the Main Console.
When you first start your lab, you may notice a watermark on the desktop indicating that Windows is not activated.
One of the major benefits of virtualization is that virtual machines can be moved and run on any platform. The Hands-on Labs utilizes this benefit and we are able to run the labs out of multiple datacenters. However, these datacenters may not have identical processors, which triggers a Microsoft activation check through the Internet.
Rest assured, VMware and the Hands-on Labs are in full compliance with Microsoft licensing requirements. The lab that you are using is a self-contained pod and does not have full access to the Internet, which is required for Windows to verify the activation. Without full access to the Internet, this automated process fails and you see this watermark.
This cosmetic issue has no effect on your lab.
Please check that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.
Module 1 - Automated workload placement and predictive DRS (30 minutes)
Within a virtualized environment, even with the best planning, the distribution of the workloads between hosts, clusters and data centers can get out of balance. Being out of balance itself is not a problem as long as each workload can obtain the resources it needs without causing contention. Contention exists when the workload on a specific host requests more resources than are available. Resource contention is one of the most critical issues in any virtualized environment. When contention occurs, applications slow down and your users are affected.
Distributed Resource Scheduler (DRS) is a proven vSphere feature that moves virtual machines (VMs) within a cluster (between hosts) to ensure virtual machines are always running on a host with adequate resources to support them.
vRealize Operations Manager can move virtual machines between clusters to ensure the clusters are balanced in the environment, which in the end helps DRS. vRealize Operations Manager's Rebalance Container action allows you to balance workloads between the clusters in your data center or custom data centers by providing you with move recommendations. These move recommendations come in the form of a rebalance action plan. The plan lists the recommended moves and gives the reason for each one (CPU or memory imbalance).
Up until now two different methodologies have been employed to mitigate the risk of contention, with varied results. New to vRealize Operations and vSphere is Predictive DRS, a capability that can be used to minimize resource contention proactively. Predictive DRS uses a combination of DRS and vRealize Operations Manager to predict future demand and determine when and where hot spots will occur. When future hot spots are found, Predictive DRS moves the workloads before contention occurs.
In the following steps and video, you will learn how to remediate a cluster based resource constraint by rebalancing virtual machines between clusters using the Rebalance Container Action.
Note: It is important to balance between clusters configured for workloads of similar priority or importance to your organization.
For example, you would not want to balance workloads in a test/dev environment with your production, mission critical applications; this could cause unexpected behavior within the production environment.
Due to the significant amount of resources required to simulate an out-of-balance cluster, which would negatively affect the lab as a whole, we have chosen to walk you through how to access Workload Balancing within vRealize Operations Manager.
Predictive DRS is the new “game changing” method available from VMware for mitigating the risk of resource contention.
How does Predictive DRS work? It starts by leveraging one of the core functions of vRealize Operations, Dynamic Thresholds, which understand the behaviors of all workloads throughout the day. vRealize Operations Manager collects hundreds of metrics across numerous types of objects (hosts, datastores, virtual machines, and other objects) every day. Each night vRealize Operations Manager runs Dynamic Threshold calculations using sophisticated analytics to create a band of what is “normal” for each metric/object combination. The band has an upper and lower bound of normal for each metric associated with an object. For example, for a simple application server virtual machine, vRealize Operations Manager will show that the virtual machine does not use a lot of CPU early in the morning. However, at 8 AM, when people start logging into the system, the CPU load will spike very high. It will then taper off around noon as people go to lunch, and then climb back up for the rest of the day until people go home. And don’t forget about the nightly reports, which run at 2 AM and spike CPU.
The great thing about Dynamic Thresholds is that they are tailored to each individual virtual machine and application. There is nothing you need to do; the analytical engine in vRealize Operations takes care of everything.
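To make the idea of a "normal" band concrete, here is a minimal sketch in Python. It is not the analytics that vRealize Operations actually runs (this manual does not describe them); it simply assumes a per-hour band of mean plus or minus two standard deviations. The function names hourly_band and is_anomalous and the sample CPU history are hypothetical.

```python
from statistics import mean, pstdev


def hourly_band(samples, k=2.0):
    """Build a per-hour (lower, upper) band from (hour, value) samples.

    Assumption for illustration only: the band is mean +/- k standard
    deviations of the values seen at that hour of the day.
    """
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)

    band = {}
    for hour, values in by_hour.items():
        m = mean(values)
        s = pstdev(values)
        # A usage percentage cannot go below zero, so clamp the lower bound.
        band[hour] = (max(0.0, m - k * s), m + k * s)
    return band


def is_anomalous(band, hour, value):
    """Return True when value falls outside the learned band for that hour."""
    lower, upper = band[hour]
    return value < lower or value > upper


# Hypothetical CPU usage (%) history: quiet at 3 AM, busy at 8 AM.
history = [(3, 5.0), (3, 7.0), (3, 6.0), (8, 80.0), (8, 90.0), (8, 85.0)]
band = hourly_band(history)
```

With this band, 85% CPU at 8 AM is treated as normal for this virtual machine, while the same 85% at 3 AM falls outside the band. That is the point of a threshold tailored to each workload and time of day.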
Once vRealize Operations Manager has calculated its Dynamic Thresholds we have 3 fundamental data points:
Once we have those we can ask the most important contention mitigation question of all: “Will any of my hosts struggle to serve my workloads today?” If the answer is “Yes”, then let’s move a few virtual machines around to avoid that future contention. In a nutshell, this is how Predictive DRS works.
The video is a Predictive DRS walkthrough showing how simple it is to configure the Predictive DRS feature in both vRealize Operations 6.6 and vSphere 6.5. The walkthrough also serves as a demonstration of the solution and gives you a view into how it all comes together. After watching the video you should be able to configure it in your environment and start seeing the benefits of Predictive DRS.
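The question "will any of my hosts struggle today?" can be sketched as a simple comparison of forecast demand against host capacity. The Python below is an illustration under stated assumptions, not the Predictive DRS implementation: the function predict_hot_spots, the host and VM names, and the MHz figures are all hypothetical, and the forecast is assumed to be already available per VM and per hour.

```python
def predict_hot_spots(host_capacity, vm_placement, vm_forecast):
    """Return {host: [hours]} where summed forecast demand exceeds capacity.

    host_capacity: {host: capacity in MHz}
    vm_placement:  {vm: host the VM currently runs on}
    vm_forecast:   {vm: {hour: predicted demand in MHz}}
    """
    demand = {}  # (host, hour) -> total forecast demand of the VMs on that host
    for vm, host in vm_placement.items():
        for hour, mhz in vm_forecast[vm].items():
            demand[(host, hour)] = demand.get((host, hour), 0) + mhz

    hot = {}
    for (host, hour), total in sorted(demand.items()):
        if total > host_capacity[host]:
            hot.setdefault(host, []).append(hour)
    return hot


# Hypothetical inventory and forecast.
capacity = {"esx-01": 10000, "esx-02": 10000}
placement = {"app-01": "esx-01", "db-01": "esx-01", "web-01": "esx-02"}
forecast = {
    "app-01": {8: 6000, 14: 3000},
    "db-01": {8: 5000, 14: 4000},
    "web-01": {8: 2000, 14: 2000},
}
hot_spots = predict_hot_spots(capacity, placement, forecast)
```

In this made-up data, esx-01 is forecast to need 11000 MHz at 8 AM against 10000 MHz of capacity, so it is flagged ahead of time. Moving one of its VMs to esx-02 before 8 AM would avoid the contention.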
You can see more details on this through this video: https://youtu.be/cwaALGTyTMU
You have completed Module 1 - Automated Workload Placement and Predictive DRS (pDRS)
You should now have an understanding of:
Feel free to proceed to the next module below:
Module 2 - Automated Remediation of Issues
If you wish to conclude your lab at this time click on the END button. This will terminate your lab and all progress. Do this only if you wish to NOT proceed with the other modules.
Module 2 - Automated remediation of issues (30 minutes)
In this module we will look at the Alerting Framework in vRealize Operations. We will cover the following topics:
This module should take about 30-45 minutes for you to complete.
The Alerting Framework is a very powerful feature of the vRealize Operations platform. It's a relatively simple construct to understand, but once you master it, you can use it for all sorts of useful purposes in your organisation.
The main construct we use in the Alerting Framework is the Alert Definition. It is made up of three parts:
Let's start by looking at Symptoms...
User name: Admin
The lab environment has a default resolution of 1024x768. To minimize the need for extensive scrolling within the vRealize Operations user interface, please adjust the zoom level in Firefox.
Most of the Symptom Definitions you will work with will be Metric/Property Symptom Definitions, which are selected by default in the left-hand pane.
Let's look at what makes up the definition:
Evaluate on instanced metric - this means we will look at all the CPUs on a Virtual Machine
5. Click on Cancel to return to the symptom list.
The scenario we are going to build is one where we have set a performance SLA for our Virtual Machines. We want to trigger an alert when any of the performance metrics have breached their SLA.
The SLA we have is:
(This SLA would be appropriate for a production environment)
Let's create the 4 symptom definitions we will need for this
We probably need to filter for the metric we are looking for:
Click on the Ready (%) metric and, holding the mouse button down, drag the metric into the symptoms panel, then release the mouse button
We probably need to filter for the metric we are looking for:
Click on the Contention metric and, holding the mouse button down, drag the metric into the symptoms panel, and release the mouse button
Now we'll create the Disk latency symptom
In the Symptom definitions list click on the green plus sign to add this third definition
This time, drag the Total Latency metric into the Symptom Panel
Finally, we'll create the Network Packets Dropped symptom
In the Symptom definitions list click on the green plus sign to add this fourth definition
This time, drag the Packets Dropped (%) metric into the Symptom Panel
You can now see all four symptoms
Now let's take a quick look at Recommendations and Actions
You'll see a list of recommendations. A recommendation is some plain text on what to do should a particular Alert trigger. It can be short or verbose and in some cases may include links to such things as KB articles. If you create your own, they could include links to your operational manuals.
You can see where Recommendations have been linked to Alert Definitions
Some Recommendations have Actions associated with them
2. Click on the green plus icon to add a new Recommendation
Add some text, for example - Hands on Lab - this VM has breached the performance SLA in place. Consider moving it to a different cluster/host or removing workload from the cluster/host it is running on.
Note: you can cut and paste this text from the readme.txt file on your lab desktop
Click on Save once complete
(We'll look at adding actions later...)
You can see a list of 'out of the box' Actions that are available. You will notice that there aren't any options to add new custom Actions.
With the current version of vRealize Operations, the following action types are available:
In this lab we are going to use the out of the box Python actions.
Now let's build an Alert Definition from the four Symptoms and the Recommendation we just built.
The Alert Impact should be set as follows:
We will need to filter to find the Symptoms that we created earlier - type hands on lab in the filter box and hit return
Click on the Hands on Lab - CPU SLA symptom and, holding the mouse button down, drag it to the Alert Definition panel. Release the mouse button
Drag the Hands on Lab - Disk SLA symptom to the same symptom set. As you hover over the symptom set, it will get a green outline as in the screenshot. Release the mouse button when you get this green outline.
Don't drag the symptom into the 'Drag another symptom here to add more symptoms' box below (we'll do this later when we show how the symptom sets work)
Drag the Hands on Lab - Memory SLA symptom to the same symptom set. As you hover over the symptom set, it will get a green outline as in the screenshot. Release the mouse button when you get this green outline.
Don't drag the symptom into the 'Drag another symptom here to add more symptoms' box
Before we drag the final symptom let's change the boolean term.
Click on the 'Base object exhibits' drop down to change it from the default of All to the value Any. This means that if any of our individual SLA symptoms are triggered then the alert will trigger. We don't want to wait for them to all trigger at the same time - that is very unlikely to happen!
We could just add our final Symptom into this symptom set - however, let's create a second symptom set to see how they work.
Drag the Hands on Lab - Network SLA symptom to the 'Drag another symptom here to add more symptoms' box.
You may need to scroll down to see both symptom sets as in the screenshot
You would usually have just created the single symptom set - for the purposes of this lab we wanted to demonstrate you can have multiple symptom sets with boolean options.
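The Any/All choice and the two symptom sets can be modelled with a few lines of Python. This is only a sketch of the boolean logic, not how vRealize Operations evaluates alerts internally. The symptom names come from this lab; the functions set_matches and alert_triggers are hypothetical, and how the two sets are combined with each other is treated as a parameter because it is an assumption here.

```python
def set_matches(symptom_names, active, mode):
    """Evaluate one symptom set.

    mode "any": at least one symptom in the set is active.
    mode "all": every symptom in the set is active.
    """
    hits = [name in active for name in symptom_names]
    return any(hits) if mode == "any" else all(hits)


def alert_triggers(symptom_sets, active, combine="any"):
    """Combine the per-set results; the combine mode is an assumption."""
    results = [set_matches(names, active, mode) for names, mode in symptom_sets]
    return any(results) if combine == "any" else all(results)


# The two symptom sets built in this lab, with the first set switched to Any.
symptom_sets = [
    (
        [
            "Hands on Lab - CPU SLA",
            "Hands on Lab - Disk SLA",
            "Hands on Lab - Memory SLA",
        ],
        "any",
    ),
    (["Hands on Lab - Network SLA"], "any"),
]

# Same first set left at the default All, for comparison.
default_sets = [(symptom_sets[0][0], "all"), symptom_sets[1]]

only_cpu = {"Hands on Lab - CPU SLA"}
```

With only the CPU symptom active, the Any version triggers the alert while the default All version does not, which is why the drop down is changed before the final symptom is added.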
Hands on Labs are not designed for proper production workloads! We massively overcommit our resources and use vSphere technology to provide the best possible experience - can you imagine the number of servers we would need if we wanted to run 1000 instances of this lab concurrently with absolutely zero memory, CPU or disk contention!!
To that end, it's likely one of our symptoms will have triggered the alert we just created. Given the crazy over-commit we use in the lab we should see some CPU ready time.
Finally, we thought we'd show an example of automated remediation. We are fairly limited in Hands on Labs on the workloads we can have running so we've constructed a slightly different scenario. Hopefully you will find it fun.
In this scenario we are going to monitor our application cluster and turn off any machine that contains "win" in its name! We only want to keep running VMs that actually have a purpose, like DB, APP, WEB or any other useful application for our lab. Any machine that references an OS in its name is probably a template or a base reference machine, and we do not want it running on our cluster.
We are going to need to add three informational Symptom Definitions that define the conditions under which this alert will be triggered:
a. The cluster is called RegionA01-COMP01;
b. The VM in the cluster contains the prefix win in its name;
c. The VM is powered on.
So, if we see a VM in the "RegionA01-COMP01" cluster that has the prefix "win" in its name and is powered on, we're going to power it off!
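The three conditions amount to a simple predicate over each VM. The Python sketch below shows that logic only; it does not call vRealize Operations or vSphere. The VM class, the should_power_off function and the example VM names are hypothetical, and matching the prefix case-insensitively is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cluster: str
    powered_on: bool


def should_power_off(vm, cluster="RegionA01-COMP01", prefix="win"):
    """Return True when all three symptoms hold for this VM."""
    in_cluster = vm.cluster == cluster
    # Assumption: the prefix match ignores case.
    has_prefix = vm.name.lower().startswith(prefix)
    return in_cluster and has_prefix and vm.powered_on


# Hypothetical inventory for illustration.
vms = [
    VM("win-base-01", "RegionA01-COMP01", True),   # matches all three
    VM("db-01", "RegionA01-COMP01", True),         # wrong name
    VM("win-base-02", "RegionA01-COMP01", False),  # already powered off
    VM("win-tmpl", "Other-Cluster", True),         # wrong cluster
]
to_power_off = [vm.name for vm in vms if should_power_off(vm)]
```

Only the VM that satisfies all three conditions at once ends up in to_power_off, which mirrors an alert whose symptoms must all be present before the power-off action runs.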
We have our three Symptoms and the Recommendation, let's now create the Alert Definition
The Alert should have triggered. To find it:
If the Alert has not triggered, wait 30 seconds and click on the refresh icon to try again
If you did power on the machine again you will see that action in progress - if you wait a bit longer it will turn off again! No way that VM is staying powered on unless you rename it!
In this module you learned:
Congratulations on completing Module 2.
If you are looking for additional information on monitoring objects in your managed environment for "Automated remediation of issues", try one of these:
Proceed to any module below which interests you most.
To end your lab click on the END button.
Thank you for participating in the VMware Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.
Lab SKU: HOL-1801-02-CMP