VMware Hands-on Labs - HOL-1804-01-SDC


Lab Overview - HOL-1804-01-SDC - vSphere 6.5 Performance Diagnostics and Benchmarking

Lab Guidance


Note: It will take more than 90 minutes to complete this lab. You should expect to finish only 2-3 of the modules during your time.  The modules are independent of each other, so you can start at the beginning of any module and proceed from there. You can use the Table of Contents to access any module of your choosing.

The Table of Contents can be accessed in the upper right-hand corner of the Lab Manual.

Lab Module List:

 Lab Captains:

 

This lab manual can be downloaded from the Hands-on Labs Document site found here:

http://docs.hol.vmware.com

This lab may be available in other languages.  To set your language preference and have a localized manual deployed with your lab, you may utilize this document to help guide you through the process:

http://docs.hol.vmware.com/announcements/nee-default-language.pdf


 

Location of the Main Console

 

  1. The area in the RED box contains the Main Console.  The Lab Manual is on the tab to the right of the Main Console.
  2. A particular lab may have additional consoles found on separate tabs in the upper left. You will be directed to open another specific console if needed.
  3. Your lab starts with 90 minutes on the timer.  The lab cannot be saved.  All your work must be done during the lab session.  However, you can click EXTEND to increase your time.  If you are at a VMware event, you can extend your lab time twice, for up to 30 minutes; each click gives you an additional 15 minutes.  Outside of VMware events, you can extend your lab time up to 9 hours and 30 minutes; each click gives you an additional hour.

 

 

Alternate Methods of Keyboard Data Entry

During this module, you will input text into the Main Console. Besides typing it in directly, there are two very helpful methods that make it easier to enter complex data.

 

 

Click and Drag Lab Manual Content Into Console Active Window

You can also click and drag text and Command Line Interface (CLI) commands directly from the Lab Manual into the active window in the Main Console.  

 

 

Accessing the Online International Keyboard

 

You can also use the Online International Keyboard found in the Main Console.

  1. Click on the Keyboard Icon found on the Windows Quick Launch Task Bar.

 

 

Activation Prompt or Watermark

 

When you first start your lab, you may notice a watermark on the desktop indicating that Windows is not activated.  

One of the major benefits of virtualization is that virtual machines can be moved and run on any platform.  The Hands-on Labs utilizes this benefit and we are able to run the labs out of multiple datacenters.  However, these datacenters may not have identical processors, which triggers a Microsoft activation check through the Internet.

Rest assured, VMware and the Hands-on Labs are in full compliance with Microsoft licensing requirements.  The lab that you are using is a self-contained pod and does not have full access to the Internet, which is required for Windows to verify the activation.  Without full access to the Internet, this automated process fails and you see this watermark.

This cosmetic issue has no effect on your lab.  

 

 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

vSphere 6.5 Performance Introduction


This lab, HOL-1804-01-SDC, covers vSphere performance best practices and various performance-related features available in vSphere 6.5. You will work with a broad array of benchmarks, like Weathervane and DVD Store, and performance monitoring tools such as esxtop and the advanced performance charts, to both measure performance and diagnose bottlenecks in a vSphere environment. Performance-related vSphere features such as right-sizing virtual machines, virtual NUMA, Latency Sensitivity, and Host Power Management are also explored.

While the time available in this lab constrains the number of performance problems we can review as examples, we have selected relevant problems that are commonly seen in vSphere environments. By walking through these examples, you should be better equipped to understand and troubleshoot typical performance problems.

For the complete Performance Troubleshooting Methodology and a list of VMware Best Practices, please visit the vmware.com website:

http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/whats-new-vsphere65-perf.pdf

https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/drs-vsphere65-perf.pdf

Furthermore, if you have interest in performance related articles, make sure that you monitor the VMware VROOM! Blog:

http://blogs.vmware.com/performance/


Module 1 - Application Performance Testing with Weathervane (45 minutes)

Introduction


 

This Module introduces Weathervane, a new application-level performance benchmark designed to allow the investigation of performance trade-offs in modern virtualized and cloud infrastructures.

Here is a brief overview for the content in this Module:


What is Weathervane?


This lesson will describe what the Weathervane benchmark is, and how it is different from traditional benchmark workloads.


 

Weathervane Description

 

 

 

Weathervane Components

 

Weathervane consists of three main components (if the picture above seems daunting, do not fear: this lab has all three components running inside one Linux VM!).  It is possible to run every Weathervane service in one VM or container, but it is also possible to run only specific service tiers, or even only specific service instances.

  1. The Workload driver that can drive a realistic and repeatable load against the application
  2. The Run Harness that automates the process of executing runs and collecting results and relevant performance data
  3. The Auction Application itself is a web-application for hosting real-time auctions.

We will take a look at each of these components in more detail, then run Weathervane in our lab environment.

 

 

Workload Driver

 

The Weathervane workload driver has several key features:

 

 

Run Harness

 

The Weathervane run harness is controlled by a configuration file that describes the deployment, including:

The harness also does several other extremely useful tasks:

Later in this module, we will start an actual run in the lab environment using the harness to see how easy it is -- it is literally just one command!

 

 

Auction Application

 

The Auction Application, as we can tell from the picture above, is the most complex portion of Weathervane.

It is a web app that simulates hosting real-time auctions.  It uses an architecture that allows deployments to be easily scaled to, and sized for, a large range of user loads. A deployment of the application involves a wide variety of support services, such as caching, messaging, data store, and relational database tiers. Many of the services are optional, and some support multiple provider implementations.

A default Weathervane deployment like the VM in this lab uses the following applications (click the links for more information about the applications).  All are set up "out of the box" (ready to run) via the automatic setup script that comes with the benchmark:

In addition, the number of instances of some of these services can be scaled elastically at run time in response to a preset schedule or to monitored performance metrics. The flexibility of the application deployment allows us to investigate a wide variety of complex infrastructure-related performance questions.

 

Downloading/Installing Weathervane


This lesson will describe how to install the Weathervane benchmark.  It is very easy to set up, as most of the process is automated.

NOTE: Weathervane has already been installed in our hands-on lab environment, so this lesson is purely informational (for example, if you want to learn how easy it is to install Weathervane in your own environment).  In the next lesson, we will configure and run Weathervane in the lab environment.


 

Create a Weathervane VM

 

Setting up Weathervane starts with creating a Weathervane host: a CentOS 7 VM that we configure to run the workload driver, run harness, and application components.  When creating the VM, select Linux as the Guest OS Family, and Red Hat Enterprise Linux 7 (64-bit) as the Guest OS Version. This is necessary for the proper operation of the customization scripts when cloning the VM.

As shown in the screenshot, the virtual hardware must have at least 2 CPUs, 8 GB of memory, and at least 20 GB of disk space (we used 30 GB in this example).  For larger deployments, the hardware can be scaled up appropriately (see the Weathervane documentation for more details).

 

 

Install CentOS 7

 

The CentOS 7 installation may be a Minimal Install (the default, as shown) or a full desktop install.

In fact, you may want to create one Weathervane host with a full desktop install for running the harness, and a second with a Minimal Install for cloning to VMs for running the various Weathervane services.

 

 

Post-OS Installation Tasks

After completing the OS installation, a few tasks should be performed prior to installing Weathervane:

  1. Update all software packages by running the command yum update as the root user.
  2. Install VMware Tools (for CentOS 7, open-vm-tools) by running the command yum install -y open-vm-tools as the root user.
  3. Install Java by running the command yum install -y java-1.8.0-openjdk* as the root user.
  4. Install Perl by running the command yum install -y perl as the root user.

NOTE: These commands will not work in the lab environment, but these tasks have already been performed in our VM.

 

 

Download and Extract Weathervane

 

Weathervane is an open source project developed by VMware.  As such, the latest release tarball (.tar.gz) can be downloaded from github, as shown here, from https://github.com/vmware/weathervane/releases

A release tarball is a snapshot of the repository at a known good point in time.  Releases are typically more heavily tested than the latest check-in on the master branch.

To install Weathervane, log in as root to your CentOS host, and unpack the tarball with the command tar zxf weathervane-1.0.14.tar.gz
NOTE: This has already been done in our hands-on lab environment, so do not run this command in the lab VM.

Once the tarball is extracted, the Weathervane executables must be built.

 

 

Building the Weathervane Executables

To build the Weathervane executables, go into the /root/weathervane directory created by unpacking the tarball in the previous step, then issue the command
./gradlew clean release
NOTE: This has already been done in our hands-on lab environment, so do not run this command in the lab VM.

The first time you build Weathervane, this will download a large number of dependencies. Wait until the build completes before proceeding to the next step.

 

 

Running the Weathervane auto-setup script

The auto-setup script configures the VM to run all of the Weathervane components.
NOTE: the VM must be connected to the Internet in order for this process to succeed.

From the Weathervane directory, run the script using the command:
./autoSetup.pl
NOTE: This has already been done in our hands-on lab environment, so do not run this command in the lab VM.

The auto-setup script may take an hour or longer to run depending on the speed of your internet connection and the capabilities of the host hardware.
Once it has completed, the VM must be rebooted.  Weathervane is now ready to run!

 

Configuring Weathervane


This lesson will describe how to start the lab and configure the Weathervane benchmark on our lab environment deployment.


 

Launch Performance Lab Module Switcher

 

Double click the Performance Lab MS shortcut on the Main Console desktop

 

 

Start Module 1

 

Click on the first Start button (highlighted) and a PowerShell script will run to start the Weathervane VM and open two PuTTY sessions to it.

 

Once Module 1 has started, you will see two PuTTY windows side-by-side and a popup box indicating that Module 1 has started (as shown here). Click OK.

We are now ready to configure and run Weathervane!

 

 

Configuring Weathervane

 

We should look at the Weathervane configuration file to see how configurable this benchmark is.

In the PuTTY window on the left, type this command and press Enter:

less weathervane.config

We can now use the standard navigation keys (Up/Down arrows, Page Up/Down) to see the various parameters to customize.
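Tip: within less, you can also search for a parameter directly: type / followed by a search term and press Enter to jump to the first match, then press n for the next match.  For example, to find the users parameter:

/users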

 

We are now looking at the beginning of the Weathervane configuration file.  As is standard with most configuration files, lines that start with "#" are commented out and thus ignored by Weathervane.

Highlighted here is one of the most useful parameters (which is why it is at the top!): users.  As the comments state, this determines how many simulated users are active during a Weathervane benchmark run.  This has already been reduced to the minimum value of 60 due to the constraints of our lab environment, but the default is 300 as we will see next.

 

In the right-hand PuTTY window, type the following command and press Enter:
(Note: the character before less is the pipe symbol, typically typed by holding down Shift and pressing the backslash \ key.  You can also select this text and drag-and-drop it directly into the PuTTY window -- try it!)

./weathervane.pl --help |less

 

The --help command we just ran lists all the Weathervane command-line parameters.  If any of these parameters is set on the command line, it overrides both the Parameter Default and even the value set in the weathervane.config file we just looked at.

As shown in this screenshot, the users parameter defaults to a value of 300, but we have set it to the minimum value of 60 in the weathervane.config.  If we wanted to try a Weathervane run of 100 users, we could override it on the command line, i.e.  ./weathervane.pl --users=100.

 

In both PuTTY windows, press the Page Down key to scroll down to the next page, and you should see a screen similar to this.  As the help text explains, Weathervane has three run length parameters: rampUp, steadyState, and rampDown.  To make it easier, you can set all three parameters by changing runLength to short, medium, or long.

In the interest of time (and to avoid taxing our lab environment any longer than necessary!), we have set the values to 30, 60, and 0 in our configuration file.  In an actual benchmark environment, we would want to set runLength to medium or long to gauge performance over a longer period of time.
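Since command-line parameters override the configuration file, you could also try one of the preset run lengths without editing the file at all.  For example (using the runLength parameter described in the help text):

./weathervane.pl --runLength=medium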

At this point, feel free to use the arrow keys and the Page Up/Page Down keys to look at all of the parameters Weathervane supports.  As you can see, it is very configurable!

 

Now that we have looked at the Weathervane configuration file and the help text, left-click in each PuTTY window and press q to "quit" less and return to the bash shell.  You should see a screen similar to this.

In the next lesson, we will start an actual Weathervane benchmark run!

 

Running/Tuning Weathervane


This lesson will describe how to run and tune the Weathervane benchmark using the VM deployed in the lab environment.


 

Running Weathervane

 

Now that we have learned how to configure Weathervane, we can start a test run!  This is actually the easiest part, since the run harness automates starting the necessary services, gathering performance statistics, and stopping the benchmark once the run lengths we specified have elapsed.

Click in the left-hand PuTTY window, and start the Weathervane benchmark harness by running one simple command (and press Enter):

./weathervane.pl

Note that since we are already in the /root/weathervane directory, we invoke weathervane.pl from the current directory.

In the right-hand PuTTY window, you can monitor the processes consuming CPU, memory, etc. in real time while Weathervane is running by using the Linux top command (press Enter afterwards):

top

 

Weathervane should now be running!

The top output can be broken down into three sections:

  1. This shows the CPU utilization of the two virtual CPUs (vCPUs); these values will fluctuate throughout the run.  In this screenshot, they are both heavily utilized (95-96%), which is expected for this benchmark.
  2. This shows the memory utilization of the VM.
    The top line (KiB Mem) shows us that most of the 8 GB we have allocated to the VM is used, with very little free; again, this is expected, as there are many services/processes running and consuming RAM.
    Conversely, the next line (KiB Swap) shows that while we have ~3 GB of swap space, most of it is free, and very little is used; this is a Good Thing, as Linux is not having to swap memory to disk (which is likely what would happen if we did not give the VM enough memory, e.g. only 4 GB)
  3. The bottom part of the top output shows the running processes, sorted by highest CPU utilization (%CPU) first.  At a quick glance, we can see that java (Tomcat), mongod (MongoDB), and postgres (PostgreSQL) are the heavy hitters.
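While top is running, a few of its standard interactive keys are handy for digging deeper into these three sections:

1    # toggle a per-vCPU breakdown of the %Cpu line(s)
M    # sort the process list by memory usage
P    # sort the process list by CPU usage (the default)
q    # quit top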

This benchmark run will take some time to complete (~15 minutes from start to finish).  While we wait, we can browse through the Weathervane documentation to see how we can improve performance.

 

 

Tuning Parameters (User's Guide)

 

Click the Google Chrome icon on the taskbar to open a browser.  The Weathervane documentation is available as a PDF (as well as in Word format).

 

On the bookmark toolbar, click the link that reads Weathervane User's Guide.  This is a PDF that comes with the benchmark that shows how to install, configure, and tune Weathervane.

 

We will not make you read this 99-page document from beginning to end :-)  In any case, we have already touched on a lot of what this guide covers in terms of installation and configuration.

Therefore, scroll down to page 56 (shown here), which has a section on Component Tuning.  Skim through the next few pages to get a feel for the parameters you can experiment with to tune the various tiers inside Weathervane:

Another way to improve performance of a Weathervane environment is to clone the Weathervane host VM, and assign different services to each.  For example, you can have separate (and multiple) VMs that act as application servers, web servers, NoSQL data stores, etc.  For more information, see section 7.5 of the User's Guide, "Cloning the Weathervane VM".

 

 

Check on the Weathervane run

 

Periodically switch back to the PuTTY windows to check on the progress of the run.  When the Weathervane benchmark run has finished, you will see screens similar to this one.  Specifically:

  1. On the left, you will see messages about Cleaning and compacting storage, and whether the run Passed or Failed.
    NOTE: It is OK if it says failed and/or a message such as Failed Response-Time metric.  In our shared lab environment, the response times will likely not meet the benchmark requirements.  This would not be an issue in a dedicated test/dev environment.
  2. Take specific note of the run number at the end (in this example, it is Run 8).  We will use that number in the next step when we look at the output files.
  3. On the right, the top screen will indicate the Linux VM is now essentially idle (%Cpu less than 1%, and most of the memory is free).
  4. Once you have confirmed the run is over, close the PuTTY window on the right by clicking the "x" in the upper-right (click OK when PuTTY asks you to confirm).
  5. Maximize the remaining PuTTY window on the left by clicking the maximize button in the upper-right, as shown.

 

 

 

Analyzing Weathervane benchmark output

After running the benchmark, you can look at the various log files the Weathervane run harness collects:

  1. cd output (all Weathervane output is stored in /root/weathervane/output)
  2. ls (to show all the runs on this VM; determine the most recent one)
  3. cd 8/ (replace with the most recent run number)
  4. cat version.txt (this records the version of Weathervane used to run this result)
  5. cat run.log (this shows any errors and details of response-times from each of the application instances)
  6. cat console.log (not shown; this is just a record of what you already saw output to the PuTTY console, i.e. the starting/stopping of services, whether the run passed or failed, and cleanup)
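If you just want the headline results without paging through the whole log, grep works well here.  For example (the exact log wording may vary between Weathervane versions, so treat the pattern as illustrative):

grep -iE 'passed|failed' run.log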

Once you are done looking at these files, you can close this PuTTY console.

If a run passes, this means that the application deployment and the underlying infrastructure can support the load driven by the given number of users with acceptable response-times for the users' operations.

A typical way of using Weathervane is to compare the maximum number of users that can be supported when some component of the infrastructure is varied. For example, if the same application configuration is run on two different servers, the maximum user load supported by each server can be compared to determine which has better performance for this type of web application.

Congratulations! You now know how to run Weathervane!

 

Clean-up and Conclusion


Congratulations! You now know how to install, configure, and run the Weathervane benchmark!


 

How to End Module 1

 

To end this module, click the Stop button for Module 1.

 

 

Resources/Helpful Links

 

Congratulations on completing Module 1.

For more information about Weathervane, here are some helpful links:

Proceed to any module below which interests you most.

 

 

 

How to End Lab

 

To end your lab click on the END button.  

 

Module 2 - Database Performance testing with DVD Store (30 minutes)

Introduction


 

This Module introduces DVD Store 3, also known as DS3 for short.  It simulates an online store that allows customers to log on, search for DVDs, read customer reviews, rate the helpfulness of reviews, and purchase DVDs.

Here is a brief overview for the content in this Module:


What is DVD Store 3?


This lesson will describe what DVD Store is, including all of its various features.


 

DVD Store 3 Description

 

Here is an overview of the DVD Store 3 (DS3) benchmark:

 

 

DVD Store 3 Database Sizes

DVD Store 3 supports three standard sizes: small, medium, and large. In addition to these standard sizes, any custom size can be specified during DVD Store setup. The size is determined by varying the number of rows in the various tables that make up the DVD Store 3 database.

The table below shows the number of rows for the standard sizes for the Customers, Orders, and Products tables as examples:

Database   Size      Customers      Orders              Products
Small      10 MB     20,000         1,000/month         10,000
Medium     1 GB      2,000,000      100,000/month       100,000
Large      100 GB    200,000,000    10,000,000/month    1,000,000

 

Downloading/Installing DVD Store 3


This lesson will describe how to install the DVD Store 3 benchmark.  Specifically, we will look at how we set it up for this lab environment using the LAMP (Linux, Apache, MySQL, and PHP) stack.

NOTE: The LAMP stack is only one of the supported environments for DVD Store 3.  The benchmark supports a variety of databases: Microsoft SQL Server, Oracle, MySQL, and PostgreSQL.

NOTE #2: This VM and database have already been created; this is informational, if you'd like to set it up for testing in your own environment.
Creating the database is resource intensive, in terms of both time and storage, so it is not available for the hands-on lab environment.


 

Create a Linux VM

 

This screenshot shows that, in our lab environment, DVD Store 3 is installed in a CentOS Linux VM with 1 vCPU, 1 GB of memory, and a 10 GB hard disk.

You may notice that these are lower minimum system requirements than the Weathervane module.  There are a couple of reasons for this:

  1. We are only exercising a couple of applications in this VM (namely MySQL for the database tier, and Apache HTTP Server for the web server tier).
  2. This VM has been built with a small database size.  From the previous lesson, we learned that DVD Store 3 comes in 3 sizes: small (10 MB), medium (1 GB), and large (100 GB).
    For building a medium or large database, you should scale up the CPU, memory, and disk size appropriately.

 

 

OS Installation/Post-Install Tasks

DVD Store should work on any modern Linux distribution.  This VM was installed with CentOS 6.8.

After the OS installation, a few tasks should be run as the root user, prior to installing DS3:
NOTE: these have already been done in our lab environment; do not run these commands now!

  1. Update all software packages by running the command yum update
  2. Install VMware Tools (or open-vm-tools)
  3. Stop the firewall by running service iptables stop and disable it on boot by running chkconfig iptables off
    (NOTE: this is for ease of use in a test/dev environment; never do this in production!)
  4. Install MySQL by running yum install mysql-server and start it by running service mysqld start
  5. Install Apache HTTPD Server by running yum install httpd httpd-devel
  6. Install PHP with MySQL support by running yum install php php-mysql
  7. Create a user named web and set its password to web:
    useradd web
    passwd web
    chmod 755 /home/web
  8. Set permissions for this new user within MySQL:
    mysql
    > create user 'web'@'localhost' identified by 'web';
    > grant ALL PRIVILEGES on *.* to 'web'@'localhost' IDENTIFIED BY 'web';
    > exit;
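As an optional sanity check (assuming the steps above succeeded), you can verify the new account by logging in to MySQL as the web user and printing the current user:

mysql -u web -pweb -e "select current_user();"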

 

 

Download and Extract DS3

 

DVD Store 3 is an open source project that is actively developed and maintained.  The latest version can be downloaded from github, as shown here, from https://github.com/dvdstore/ds3/

To extract DS3, log in as root to your CentOS host, and unzip it with the command unzip ds3-master.zip

NOTE: This has already been done in our hands-on lab environment, so do not run this command in the lab VM.

Finally, we need to copy the PHP Web pages to the correct place on the host (again, this has already been done in our lab, no need to run):

mkdir /var/www/html/ds3
cp /root/home/ds3/mysqlds3/web/php5/* /var/www/html/ds3
service httpd restart
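As a quick, optional check that Apache is now serving the DS3 pages (run on the host itself), you could fetch the directory with curl:

curl -s http://localhost/ds3/ | head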

 

Building a DVD Store 3 Database


This lesson will show how to build a DVD Store 3 database.

We will run the configuration script to generate the necessary SQL commands, but due to time and resource constraints, we will not run the actual build.  Our lab environment already has a pre-built database ready to run.


 

Launch Performance Lab Module Switcher

 

Double click the Performance Lab MS shortcut on the Main Console desktop

 

 

Start Module 2

 

Click on the Module 2 Start button (highlighted) and a PowerShell script will run to start the DVD Store 3 VM and open a PuTTY session to it.

 

Once Module 2 has started, you will see a PuTTY window and a popup box indicating that Module 2 has started (as shown here). Click OK.

We are now ready to learn how to build a DS3 database!

 

 

Run the Install_DVDStore.pl script

 

Remember earlier when we learned DS3 has three "canned" database sizes (small, medium, and large)?  Well, we can also specify a custom database size to build.  Here's how:
(Press Enter after each command/value)

  1. Change to the DS3 directory.  In this VM, it's been installed to /root/ds3 and you're already in the /root folder so type:
    cd ds3
  2. Run the Install_DVDStore Perl script:
    perl Install_DVDStore.pl
  3. We are now asked how big we want our DS3 database to be.  Let's build a 100 MB MySQL database:
    100
  4. When asked if the database size is in MB or GB, specify MB:
    MB
  5. Since DS3 supports multiple databases, we need to specify MYSQL:
    MYSQL
  6. Finally, DS3 needs to know if the database server will be on a Windows or Linux machine; this determines whether the input files will have CR/LF (DOS format).  Choose LINUX:
    LINUX

The Install_DVDStore.pl script will now do the following:

Please wait for the Perl script to finish.

 

This is how the script looks upon completion.  Look for the message highlighted: Completed creating and writing build scripts for MySQL database...

Now that all the MySQL scripts have been generated, the database would normally be built at this point.  The scripts are generated, rather than the database being created directly, so that the database can easily be recreated later, or even modified if needed, to address the specific testing requirements of individual environments.

The database build is accomplished by the following commands.  NOTE: Do not run these commands in the lab environment, for a couple of reasons: the database build takes a long time, and we have already saved you the trouble (a database has been built and is ready to run).

cd mysqlds3
sh mysqlds3_create_all.sh

Now that we've seen how to build a DS3 database, we will start an actual run!

 

Configuring/Running DVD Store 3


This lesson will describe how to configure the DVD Store 3 load driver and run it against the MySQL database VM deployed in the lab environment.


 

Start top on the DVD Store VM

 

To view the performance of the DVD Store VM, type the command top and press Enter.  This will show us how much CPU and memory are consumed, along with which processes are taxing the VM the most.

Next, we will kick off the DS3 driver from our Windows machine.

 

 

Start the DVD Store driver on the Main Console

 

On the Main Console (Windows desktop), double-click the DVD Store 3 Driver icon shown here (note: you may need to minimize some windows in order to see it).

 

 

Monitor the driver and the PuTTY windows during the run

 

While the run is progressing, you should watch both the PuTTY console running top (shown here on the top) and the DS3 driver window (shown here on the bottom).

Let's make some observations about this screenshot (note: due to the variability of the cloud, your performance may vary):

  1. The CPU utilization line in top shows us that 34.2% is consumed in user space (application) and 9.3% in system (kernel), for a total of 43.5% CPU.  There is zero idle time, however; the rest of the CPU time (55.5%) is spent waiting for I/O -- meaning we likely have a disk or network bottleneck in our environment.
  2. The process that is consuming the 43.5% CPU utilization we saw is mysqld (the MySQL database) -- which makes sense, since we're hammering it with a database benchmark!
    Let's look at the output of the driver:
  3. These are normal DS3 driver startup messages, indicating the various threads that are connecting to the database server before the actual run begins
  4. Approximately every 10 seconds, you will see a performance summary output to the screen (notice et, elapsed time, goes up by 10 each line).
  5. There are many statistics on each line (many of them dealing with rt which is short for response time), but we're most interested in the primary DVD Store throughput performance metric, known as opm or orders per minute.  Here we can see we're only achieving about 40 opm on average, which is very low.  You would achieve much higher opm numbers in an optimized testbed.

Congratulations!  You're now running DVD Store!

Here's the command we used on the Windows machine to start the driver, in case you're curious:

c:\hol\ds3mysqldriver --target=dvdstore-01a.corp.local --n_threads=5 --warmup_time=0 --detailed_view=Y

Let's see what each of these driver parameters means.

 

 

Show the driver parameters

 

Click the Command Prompt icon to open a Command Prompt window.

 

Type this command as shown and press Enter:
ds3mysqldriver

You will see a list showing each Parameter Name, Description, and Default Value.  You can also create a configuration file and pass that on the command line instead of manually setting each parameter.
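For example, a longer and heavier run against the same target might look like the command below.  The --target, --n_threads, --warmup_time, and --detailed_view parameters are the same ones we used earlier; --run_time (in minutes) is taken from the driver's parameter list, so verify it appears in the output above before relying on it:

c:\hol\ds3mysqldriver --target=dvdstore-01a.corp.local --n_threads=10 --warmup_time=1 --run_time=10 --detailed_view=Y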

 

Analyzing Results/Improving DVD Store 3 Performance


This lesson will describe how to analyze DVD Store 3 results (specifically, comparing and contrasting a low-performing run versus a higher-performing run) and then look at ways to improve performance.


 

DVD Store 3 Performance Metrics: opm (throughput) and rt (response time)

Performance metric (abbreviation)   Definition                        Value
opm                                 Orders Per Minute (throughput)    Higher = better
rt                                  Response Time (latency)           Lower = better

We will look at a couple of results that we'll call "bad" and "good".

 

 

Example output: "Bad" performance (low opm, high rt)

 

Here is example output of a poorly-performing configuration.  We will look at several key areas:

  1. Real-time output: every 10 seconds (et is short for elapsed time, in seconds), the DS3 driver will output a line showing how long it's been running, how many orders per minute (opm) were achieved, and response times (rt).
  2. After the driver finishes, it will print a line that starts with Final that shows the overall performance statistics.
    et=  60.0 tells us that this was a short run (only 1 minute).
  3. opm=41 tells us that the database server was only able to process 41 orders per minute.  This is low, but expected, as it was run in our nested hands-on lab environment, which shares resources with many other labs.
  4. rt_tot_avg=13218 tells us that the average response time was 13218 milliseconds (13.218 seconds).  This is high, but again, expected.

Let's compare this to a high-performing run that was done in an isolated dedicated lab environment.

 

 

Example output: "Good" performance (high opm, low rt)

 

Here is example output of a high-performing configuration:

  1. The summary line that starts with Final shows the overall performance statistics.
    et=  609.4 tells us that this was a 10-minute run (~600 seconds).
  2. opm=74932 indicates this database server was able to process 74,932 orders per minute.  This is much higher than the previous example, as it is a highly-tuned performance configuration.
  3. rt_tot_avg=87 tells us that the average response time was only 87 milliseconds.  Again, this low value is in stark contrast to the previous example.

So what factors determine whether a database server will be able to sustain high load, and thus achieve the maximum opm?

 

 

Database Performance Factors

Obviously, we want to achieve the maximum opm (database performance) possible in our environment.
There are many factors that affect performance, and there isn't enough time in this lab to cover any one of them in detail, but here is a short list:
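As one concrete illustration (a sketch, not a sizing recommendation; the right value depends on your database size and available RAM): MySQL's InnoDB buffer pool determines how much of the working set is cached in memory, and sizing it appropriately in /etc/my.cnf can reduce the I/O wait we observed in top earlier:

[mysqld]
innodb_buffer_pool_size = 512M

After changing this value, restart MySQL (service mysqld restart) for it to take effect.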

By following these guides, and testing the performance of your particular environment prior to production deployment, you can ensure your virtualized databases will achieve maximum throughput.

 

Clean-up and Conclusion


Congratulations! You now know how to install, configure, and run the DVD Store 3 benchmark!

You've also learned how to tune your database server to achieve the maximum orders per minute (opm), so your database throughput will be as high as possible with the lowest response times.


 

How to End Module 2

 

To end this module, click the Stop button for Module 2.

 

 

Resources/Helpful Links

 

Congratulations on completing Module 2.

For more information about DVD Store 3, and database performance in general, here are some helpful links:

Best Practices:

DVD Store blogs/whitepapers:

DVD Store 3 is also one of the key workloads in VMmark 3.0:

Proceed to any module below which interests you most.

 

 

 

How to End Lab

 

To end your lab click on the END button.  

 

Module 3 - Right-Sizing vSphere 6.5 VMs for Optimal Performance (45 minutes)

Introduction


 

Meet Melvin the Monster VM!  vSphere 6.5 can handle Melvin and any other large, business-critical workloads (known affectionately as "wide" or "monster" VMs) without breaking a sweat! :-)

In all seriousness, this module will discuss rules of thumb for right-sizing VMs -- particularly those that are so large that they span multiple physical processor or memory node boundaries.  We will throw around terms like vCPUs, pCPUs, Cores Per Socket, NUMA (pNUMA and vNUMA), and learn how to right-size these VMs to perform optimally.


NUMA and vNUMA



 

UMA

 

This is a bit of a history lesson, as UMA, or Uniform Memory Access, is no longer how modern servers are designed.  The reason why?
The Memory Controller (highlighted) quickly became a bottleneck; it is easy to see why, as every CPU requesting memory or I/O had to pass through this layer.  (Credit: frankdenneman.nl)

 

 

NUMA

NUMA moves away from a centralized pool of memory and introduces the concept of a topology. By classifying memory locations based on signal path length from the processor to the memory, latency and bandwidth bottlenecks can be avoided. This requires redesigning the whole system of processor and chipset. NUMA architectures gained popularity at the end of the 1990s, when they were used on SGI supercomputers such as the Cray Origin 2000. NUMA helped to identify the location of memory; in the case of these systems, it was necessary to know which memory region in which chassis was holding the memory bits.

In the early 2000s, AMD brought NUMA to the enterprise landscape, where UMA systems had reigned supreme. In 2003, the AMD Opteron family was introduced, featuring integrated memory controllers, with each CPU owning designated memory banks. Each CPU now had its own memory address space. A NUMA-optimized operating system such as ESXi allows workloads to consume memory from both memory address spaces while optimizing for local memory access. Let's use an example of a two-CPU system to clarify the distinction between local and remote memory access within a single system:

(Credit: frankdenneman.nl)

 

The memory connected to the memory controller of CPU1 is considered local memory. Memory connected to another CPU socket (CPU2) is considered foreign or remote for CPU1. Remote memory access has additional latency overhead compared to local memory access, since it has to traverse an interconnect (point-to-point link) and connect to the remote memory controller. As a result of the different memory locations, this system experiences “non-uniform” memory access time.

 

 

Without vNUMA

 

In this example, a VM with 12 vCPUs is running on a host with four NUMA nodes of 6 cores each. This VM is not being presented with the physical NUMA configuration, and hence the guest OS and application see only a single NUMA node. This means that the guest has no chance of placing processes and memory within a physical NUMA node.

We have poor memory locality.

 

 

With vNUMA

Since vSphere 5, ESXi has had the vNUMA (virtual NUMA) feature that can present multiple NUMA nodes to the guest operating system. Traditionally, virtual machines have only been presented with a single NUMA node, regardless of the size of the VM, and regardless of the underlying hardware. Larger and larger workloads are being virtualized, so it has become increasingly important that the guest OS and applications can make decisions on where to execute applications and where to place memory.

VMware ESXi is NUMA aware, and will always try to fit a VM within a single physical NUMA node when possible. However, with very large "monster VMs", this isn't always possible.

The purpose of this module is to gain understanding of how vNUMA works by itself and in combination with the cores per socket feature.

 

In this example, a VM with 12 vCPUs is running on a host that has four NUMA nodes with 6 cores each. This VM is being presented with the physical NUMA configuration, and hence the guest OS and application see two NUMA nodes. This means that the guest can place processes and accompanying memory within a physical NUMA node when possible.

We have good memory locality.

 

vCPU and vNUMA Rightsizing


Using virtualization, we have all enjoyed the flexibility to quickly create virtual machines with various virtual CPU (vCPU) configurations for a diverse set of workloads.

However, as we virtualize larger and more demanding workloads, like databases, on top of the latest generations of processors with up to 28 cores, special care must be taken in vCPU and vNUMA configuration to ensure performance is optimized.


 

vCPUs, Cores per Socket, vSockets, CPU Hot Plug/Hot Add

 

The most important values are shown in this screenshot, taken directly from the vSphere Web Client:
NOTE: You must expand the CPU dropdown to view/change some of these fields!

  1. CPU: This is the total number of vCPUs presented to the guest OS (20 in this example)
  2. Cores per Socket: If this value is 1 (the default), all CPUs are presented to the guest as single-core processors.
    For most VMs, the default value is OK, but there are definitely instances when you should consider increasing this value, which we'll discuss in a bit.
    In this example, we've increased it to 10, which means the guest will see multi-core (10-core) processors.
  3. Sockets: This is not a configurable value; it is simply the number of CPUs divided by Cores per Socket: in this example, 20 / 10 = 2.
    Also called "virtual sockets" or "vSockets".
  4. CPU Hot Plug: Also known as CPU Hot Add, this is a checkbox to allow adding more CPUs "on the fly" (while the guest is powered on).
    If you have right-sized your VM from the beginning, you should not enable this feature, because it has the major downside of disabling vNUMA.
    For more information, see vNUMA is disabled if VCPU hotplug is enabled (KB 2040375)

Let's refer to this 20 vCPU VM, as configured, as 2 Sockets x 10 Cores per Socket.

 

 

Cores per Socket: Licensing Considerations

 

Let's talk about the Cores per Socket value.  As mentioned earlier, this defaults to 1, which means that every virtual CPU is presented as a socket to the guest VM.  In most cases, there's no issue there.

However, this may not be ideal from a Microsoft licensing perspective, where the operating system and/or application is sometimes licensed per processor.  Here are a couple of examples:

 

 

vNUMA Behavior Changes in vSphere 6.5

In an effort to automate and simplify configurations for optimal performance, vSphere 6.5 introduced a few changes in vNUMA behavior.  Thanks to Frank Denneman for thoroughly documenting them here:

http://frankdenneman.nl/2016/12/12/decoupling-cores-per-socket-virtual-numa-topology-vsphere-6-5/

Essentially, the vNUMA presentation under vSphere 6.5 is no longer affected by Cores per Socket. vSphere will now always present the optimal vNUMA topology (unless you use advanced settings).

However, you should still choose the CPU and Cores per Socket values wisely.  Read on for some best practices.

 

 

Best Practices for Cores per Socket and vNUMA

In general, the following best practices should be followed regarding vNUMA and Cores per Socket:

There are many Advanced Virtual NUMA Attributes (click for a full list); here are a few guidelines, but in general, the defaults are best:
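As an example of what one of these attributes looks like (shown purely for illustration; leave it at the default unless you have measured a reason to change it), numa.vcpu.maxPerVirtualNode controls how many vCPUs are grouped into each virtual NUMA node, and can be set in the VM's advanced configuration parameters:

numa.vcpu.maxPerVirtualNode = "10"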

 

Of course, a picture (or in this case, a table) is worth a thousand words.  This table outlines how a VM could (should) be configured on a dual-socket, 10-core physical host to ensure an optimal vNUMA topology and performance, regardless of vSphere version.

 

Guest OS Tools to View vCPUs/vNUMA


We have seen how to use the vSphere Client to right-size a virtual machine's vCPUs and Cores per Socket.

What do these topologies look like from the guest OS perspective?  We will look at some examples of tools for Windows and Linux that let us verify that the guest is seeing the expected processor and NUMA configurations.


 

vSphere Client CPU/Cores per Socket Example

 

Although shown before, it is worth repeating:

  1. CPU: This is the total number of vCPUs presented to the guest OS (20 in this example)
  2. Cores per Socket: If this value is 1 (the default), all CPUs are presented to the guest as single-core processors.
    For most VMs, the default value is OK, but there are definitely instances when you should consider increasing this value, which we'll discuss in a bit.
    In this example, we've increased it to 10, which means the guest will see multi-core (10-core) processors.
  3. Sockets: This is not a configurable value; it is simply the number of CPUs divided by Cores per Socket: in this example, 20 / 10 = 2.
    Also called "virtual sockets" or "vSockets".
  4. CPU Hot Plug: Also known as CPU Hot Add, this is a checkbox to allow adding more CPUs "on the fly" (while the guest is powered on).
    If you have right-sized your VM from the beginning, you should not enable this feature, because it has the major downside of disabling vNUMA.

Let's refer to this 20 vCPU VM, as configured, as 2 Sockets x 10 Cores per Socket.

 

 

Windows: Coreinfo

From the Microsoft Sysinternals web site:  Coreinfo is a command-line utility that shows you the mapping between logical processors and the physical processor, NUMA node, and socket on which they reside, as well as the caches assigned to each logical processor. It uses the Windows GetLogicalProcessorInformation function to obtain this information and prints it to the screen, representing a mapping to a logical processor with an asterisk, e.g. '*'.

Coreinfo is useful for gaining insight into the processor and cache topology of your system.

Parameter   Description
-c          Dump information on cores.
-f          Dump core feature information.
-g          Dump information on groups.
-l          Dump information on caches.
-n          Dump information on NUMA nodes.
-s          Dump information on sockets.
-m          Dump NUMA access cost.
-v          Dump only virtualization-related features.

 

Here we see the output of coreinfo (with no command line options) on the aforementioned 20 vCPU VM.  Here is a breakdown of the highlights:

  1. Logical to Physical Processor Map: This section confirms Windows sees 20 vCPUs (note that it presents them as Logical and Physical Processors, with a 1:1 mapping)
  2. Logical Processor to Socket Map: This section confirms Windows sees 2 Sockets, with 10 Logical Processors on each Socket.  We can also refer to these as vSockets.
  3. Logical Processor to NUMA Node Map: This section confirms that Windows sees 2 NUMA Nodes, with 10 Logical Processors on each Node.  Since this is a VM, we call these vNUMA nodes.

 

 

Linux: numactl

For Linux, the most useful tool to gain information about virtual NUMA is numactl.  Note that you may need to install the package that provides the numactl tool for your OS (for RHEL/CentOS 7, an appropriate command is yum install numactl).


 

Here we see the output of numactl -H (the -H is an abbreviation for hardware; use the man numactl command to see all of the available parameters).  Here is a quick explanation:

  1. numactl -H: This is the command we typed to get the output
  2. available: 2 nodes (0-1): This section confirms Linux sees 2 NUMA nodes, also known as vNUMA nodes.
  3. node 0 cpus, node 1 cpus: This section confirms Linux sees 10 logical processors on each NUMA node (20 vCPUs total).
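Beyond reporting the topology, numactl can also launch a process bound to a specific node, which is useful when experimenting with memory locality inside the guest (here ./myapp is just a placeholder for your own binary):

numactl --cpunodebind=0 --membind=0 ./myapp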

 

Conclusion


Congratulations! You now know how to right size VMs optimally for vSphere 6.5!


 

Resources/Helpful Links

 

Congratulations on completing Module 3.

For more information about right-sizing VMs, NUMA/vNUMA, and vSphere performance in general, here are some helpful links:

Proceed to any module below which interests you most.

 

 

 

How to End Lab

 

To end your lab click on the END button.  

 

Module 4 - vSphere HTML5 Client vs. the vSphere Web Client (15 minutes)

Introduction to the HTML5 Web Client


The goal of this module is to expose you to the HTML5 Web Client.

The Flash platform is not a long-term solution, and VMware is making strides to improve performance and stability with the HTML5-based Web Client.  This was originally a fling (https://labs.vmware.com/flings) back in March 2016, and now it's native with vSphere 6.5.

Let's get started.  


 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

HTML5 versus Flash - Cluster Actions



 

Launch Performance Lab Module Switcher

 

Double click the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 4

 

Click on the Start button and a script will launch and then disappear

 

 

 

 

Open Google Chrome

 

Log into vSphere.  The vSphere HTML Web Client should be the default home page.

 

If, for some reason, that does not work, uncheck the box and use these credentials:

User name: CORP\Administrator

Password: VMware1!

You should see the vSphere Client in HTML5.  

 

 

 

 

Open New Tab

 

Click on the new tab to open another browser window

 

 

Open Flash Web Client

 

1.  Click on vSphere Flash Client

You will automatically log into the Flash version of vSphere Client.

 

 

 

 

Add New Resource Pool in Flash vSphere Client

 

1.  Right Click on RegionA01-COMP01

2.  Click on New Resource Pool...

You will see the following loading screens in order to display the resource pool settings.  

 

Initializing the window

 

Now, another screen will load with Loading... displayed

 

The New Resource Pool finally populates.  Cancel the action.

 

 

Add New Resource Pool in HTML5 vSphere Client

 

Click on the HTML version of the vSphere Web Client

 

1.  Right Click on RegionA01-COMP01.

2.  Click on New Resource Pool...

 

As you can see, the population of the New Resource Pool dialog was almost instantaneous, and there were no loading or initializing screens beforehand!

 

HTML5 versus Flash - VM Actions



 

Open Google Chrome

 

Log into vSphere.  The vSphere HTML Web Client should be the default home page.

 

If, for some reason, that does not work, uncheck the box and use these credentials:

User name: CORP\Administrator

Password: VMware1!

You should see the vSphere Client in HTML5.  

 

 

 

 

Open New Tab

 

Click on the new tab to open another browser window

 

 

Open Flash Web Client

 

1.  Click on vSphere Flash Client

You will automatically log into the Flash version of vSphere Client.

 

 

 

 

Pull Advanced Metrics in the Performance tab (Flash)

 

1.  Click on perf-worker-01a

2.  Click on Monitor

3.  Click the Performance tab

4.  Click Advanced

You will see the following loading screens before the advanced performance charts are displayed.

 

 

The advanced performance metrics finally populate.  

 

 

Pull Advanced Metrics in the Performance tab (HTML5)

 

Click on the HTML5 version of the vSphere Web Client

 

1.  Click on perf-worker-01a

2.  Click on Monitor

3.  Click on Advanced

 

As you can see, the advanced metrics screen populated almost instantaneously, and there were no loading or initializing screens beforehand.  

 

Conclusion



 

You've finished Module 4. Congratulations!

This module has shown you the performance gains, stability, and overall improved user experience of the HTML5 vSphere Client compared to the Flash-based one.  VMware continues to build more features into the HTML5 vSphere Client to reach full feature parity; if you wish to test newer versions of the HTML5 vSphere Web Client, be sure to check out the following link from our Flings site:

https://labs.vmware.com/flings/vsphere-html5-web-client

Finally, here is our FAQ Knowledge Base article (or scan the QR code) for more information:
vSphere Client (HTML5) and vSphere Web Client 6.5 FAQ (2147929):

http://kb.vmware.com/kb/2147929

 

 

 

Stop Module 4

 

 

Proceed to any module below which interests you most.

 

 

Module 5 - CPU Performance, Basic Concepts and Troubleshooting (15 minutes)

Introduction to CPU Performance Troubleshooting


The goal of this module is to expose you to a CPU contention issue in a virtualized environment. It will also guide you on how to quickly identify performance problems by checking various performance metrics and settings.

Performance problems may occur when there are insufficient CPU resources to satisfy demand. Excessive demand for CPU resources on a vSphere host may occur for many reasons. In some cases, the cause is straightforward. Populating a vSphere host with too many virtual machines running compute-intensive applications can make it impossible to supply sufficient CPU resources to all the individual virtual machines. However, sometimes the cause may be more subtle, related to the inefficient use of available resources or non-optimal virtual machine configurations.

Let's get started!


 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

For users with non-US Keyboards

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

The CLI commands, user names, and passwords that need to be entered can be copied and pasted from the file README.txt on the desktop.

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

 

Start this Module

 

Let's start this module.

Launch Chrome from the shortcut in the Taskbar.

 

 

Open Flash vSphere Web Client

 

1.  Open Google Chrome and then click on the vSphere Flash Client bookmark

 

Click on Use Windows session authentication and login

 

 

Select Hosts and Clusters

 

 

CPU Contention


Below is a list of the most common CPU performance issues:

High Ready Time: A CPU is in the Ready state when the virtual machine is ready to run but unable to, because the vSphere scheduler cannot find physical host CPU resources to run the virtual machine on. Ready Time above 10% could indicate CPU contention and might impact the performance of CPU-intensive applications. However, some less CPU-sensitive applications and virtual machines can have much higher values of ready time and still perform satisfactorily.

High Co-stop time: Co-stop time indicates that a VM has more vCPUs than necessary, and that the excess vCPUs create overhead that drags down the performance of the VM. The VM will likely run better with fewer vCPUs. A vCPU with high co-stop is being kept from running while the other, more-idle vCPUs catch up to the busy one.

CPU Limits: CPU Limits directly prevent a virtual machine from using more than a set amount of CPU resources. Any CPU limit might cause a CPU performance problem if the virtual machine needs resources beyond the limit.

Host CPU Saturation: When the physical CPUs of a vSphere host are consistently utilized at 85% or more, the vSphere host may be saturated. When a vSphere host is saturated, it is more difficult for the scheduler to find free physical CPU resources on which to run virtual machines.

Guest CPU Saturation: Guest CPU (vCPU) Saturation is when the application inside the virtual machine is using 90% or more of the CPU resources assigned to the virtual machine. This may be an indicator that the application is being bottlenecked on vCPU resource. In these situations, adding additional vCPU resources to the virtual machine might improve performance.

Oversizing VM vCPUs: Using large SMP (Symmetric Multi-Processing) virtual machines can cause unnecessary overhead. Virtual machines should be correctly sized for the application that is intended to run in them. Some applications may only support multithreading up to a certain number of threads, so assigning additional vCPUs to the virtual machine may cause additional overhead. If vCPU usage shows that a machine configured with multiple vCPUs is only using one of them, it might be an indicator that the application inside the virtual machine is unable to take advantage of the additional vCPU capacity, or that the guest OS is incorrectly configured.

Low Guest Usage: Low in-guest CPU utilization might be an indicator that the application is not configured correctly, or that it is starved of some other resource, such as I/O or memory, and therefore cannot fully utilize the assigned vCPU resources.
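Many of these counters can also be watched live with esxtop on the host.  As a quick sketch: open an ESXi Shell or SSH session to the host, run esxtop, and press c for the CPU view; the %RDY (ready) and %CSTP (co-stop) columns per VM map directly to the issues above, with the same roughly-10% guidance:

esxtop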


 

Launch Performance Lab Module Switcher

 

Double click the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 5

 

Click on the Start button and a script will launch.  

 

The script will take a few minutes to run.  

 

 

Wait until you see "Press Enter to continue", then press Enter to proceed.

 

 

CPU Test Started

 

When the script completes, you will see two Remote Desktop windows open (note: you may have to move one of the windows to display them side by side, as shown above).

The script has started a CPU intensive benchmark (SPECjbb2005) on both perf-worker-01a and perf-worker-01b virtual machines, and a GUI is displaying the real-time performance value as this workload runs.

If you do not see the SPECjbb2005 window open, launch the shortcut in the upper left-hand corner.

Above, we see an example screenshot where the performance of the benchmarks is around 15,000.

IMPORTANT NOTE: Due to changing loads in the lab environment, your values may vary from the values shown in the screenshots.

 

 

Navigate to perf-worker-01a (VM-level) Performance Chart

 

  1. Select the perf-worker-01a virtual machine from the list of VMs on the left
  2. Click the Monitor tab
  3. Click Performance
  4. Click Advanced
  5. Click on Chart Options

 

 

Select CPU ready for Performance Monitoring

 

When investigating a potential CPU issue, there are several counters that are important to analyze:

  1. Select CPU from the Chart metrics
  2. Check only the perf-worker-01a object
  3. Click None on the bottom right of the list of counters
  4. Now check only Demand, Ready, and Usage in MHz
  5. Click OK

 

 

CPU State Time Explanation

 

Virtual machines can be in any one of four high-level CPU states: Run (executing on a physical CPU), Ready (ready to run, but waiting for the scheduler to find a physical CPU), Co-Stop (stopped so that the sibling vCPUs of an SMP virtual machine can catch up), and Wait (idle, or blocked waiting on I/O).

 

 

Monitor Demand vs. Usage lines

 

Notice the amount of CPU this virtual machine is demanding and compare that to the amount of CPU usage the virtual machine is actually allocated (Usage in MHz). The virtual machine is demanding more than it is currently being allowed to use.

Notice that the virtual machine is also seeing a large amount of ready time.

Guidance: Ready time > 10% could be a performance concern.

 

 

Explanation of value conversion

 

NOTE:  vCenter reports some metrics, such as Ready time, in milliseconds (ms). To convert the milliseconds (ms) value to a percentage, divide it by the length of the sample period in milliseconds and multiply by 100; for vCenter's realtime charts, the sample period is 20 seconds (20,000 ms).

For multi-vCPU virtual machines, multiply the Sample Period by the number of vCPUs of the VM to determine the total time of the sample period. It is also beneficial to monitor Co-Stop time on multi-vCPU virtual machines.  Like Ready time, Co-Stop time greater than 10% could indicate a performance problem.  You can examine Ready time and Co-Stop metrics per vCPU as well as per VM.  Per vCPU is the most accurate way to examine statistics like these.
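
For a concrete version of this conversion, here is a minimal Python sketch, assuming the 20-second (20,000 ms) sample period that vCenter's realtime charts use; the function name and example values are illustrative only.

  def ready_percent(summation_ms, num_vcpus=1, sample_period_ms=20000):
      """Convert a Ready or Co-Stop summation value (ms) to a percentage."""
      # Total schedulable time in the sample = period length x number of vCPUs.
      return summation_ms / (sample_period_ms * num_vcpus) * 100

  # Example: 4,400 ms of Ready time for a 2-vCPU VM over a 20-second sample
  # is 11%, which is above the 10% guidance and worth investigating.
  print(ready_percent(4400, num_vcpus=2))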

 

 

Navigate to Host-level CPU chart view

 

  1. Select esx-01a.corp.local
  2. Select the Monitor tab
  3. Select Performance
  4. Select the Advanced view
  5. Select the CPU view

 

 

Examine ESX Host Level CPU Metrics

 

Notice in the chart that only one of the CPUs in the host has any significant workload running on it.

One CPU is at 100%, but the other CPU in the host is nearly idle.

 

 

Edit Settings of perf-worker-01a

 

Let's see how perf-worker-01a is configured:

  1. Click on the perf-worker-01a virtual machine
  2. Click Actions
  3. Click Edit Settings…

 

 

Check Affinity Settings on perf-worker-01a

 

  1. Expand the CPU item in the list and you will see that affinity is set to cpu1.
  2. Clear the "1" to correctly balance the virtual machines across the physical CPUs in the system.  
  3. Press OK to make the changes.

Note:  VMware does not recommend setting affinity in most cases. vSphere will balance VMs across CPUs optimally without manually specified affinity. Setting affinity prevents the use of features such as vMotion, can become a management headache, and can lead to performance issues like the one we just diagnosed.

 

 

Check Affinity Settings on perf-worker-01b

 

  1. Expand the CPU item in the list and you will see that affinity is set. Unfortunately, both virtual machines are bound to the same processor (CPU1). This can happen if an administrator sets affinity for a virtual machine and then creates a second virtual machine by cloning the original.
  2. Clear the "1" to correctly balance the virtual machines across the physical CPUs in the system.  
  3. Press OK to make the changes.

Note:  VMware does not recommend setting affinity in most cases. vSphere will balance VMs across CPUs optimally without manually specified affinity. Setting affinity prevents the use of features such as vMotion, can become a management headache, and can lead to performance issues like the one we just diagnosed.

 

 

Monitor Ready time

 

Return to perf-worker-01a and see how the Ready time immediately drops, and the Usage in MHz increases.

 

 

See Better Performance

 

It may take a moment, but the CPU Benchmark scores should increase.  Click back to the Remote Desktop windows to confirm this.

In this example, we have seen how to compare the Demand and Usage in MHz CPU metrics to identify CPU contention. We showed you the Ready time metric and how it can be used to detect physical CPU contention. We also showed you the danger of setting affinity.

 

 

Edit Settings of perf-worker-01b

 

Let's add a virtual CPU to perf-worker-01b to improve performance.

  1. Click on the perf-worker-01b virtual machine
  2. Click Actions
  3. Click Edit Settings…

 

 

Add a CPU to perf-worker-01b

 

  1. Change the number of CPUs to 2
  2. Click OK

 

 

Monitor CPU performance of perf-worker-01b

 

  1. Select perf-worker-01b
  2. Select Monitor
  3. Select Performance
  4. Select the CPU view

Notice that the virtual machine is now using both vCPUs. This is because the OS in the virtual machine supports CPU hot-add, and because that feature has been enabled on the virtual machine.

 

 

Investigate performance

 

Notice that the performance of perf-worker-01b has increased, since we added the additional virtual CPU.

However, this is not always the case. If the host these VMs are running on (esx-01a) only had two physical CPUs, the addition of an additional vCPU would have caused an overcommitment, leading to high %READY and poor performance.

Remember, most workloads are not necessarily CPU bound. The OS and the application must be multi-threaded to get performance improvements from additional CPUs. Most of the work that an OS does is typically not CPU-bound; that is, most of its time is spent waiting for external events such as user interaction, device input, or data retrieval, rather than executing instructions. Because otherwise-unused CPU cycles are available to absorb the virtualization overhead, these workloads will typically have throughput similar to native, but potentially with a slight increase in latency.

Configuring a virtual machine with more virtual CPUs (vCPUs) than its workload can use might cause slightly increased resource usage, potentially impacting performance on very heavily loaded systems. Common examples of this include a single-threaded workload running in a multiple-vCPU virtual machine, or a multi-threaded workload in a virtual machine with more vCPUs than the workload can effectively use.

Even if the guest operating system doesn't use all of the vCPUs allocated to it, over-configuring virtual machines with too many vCPUs still imposes non-zero resource requirements on ESXi, and these resource requirements translate to real CPU consumption on the host.

 

 

Close Remote Desktop Connections

 

Close the two remote desktop connections.

 

Conclusion and Clean-Up


In order to free up resources for the remaining parts of this lab, we need to shut down the virtual machines we used and reset their configuration.


 

Stop Module 5

 

On your desktop, find the Module Switcher window and click Stop.

 

 

Key takeaways

CPU contention problems are generally easy to detect. In fact, vCenter has several alarms that will trigger if host CPU utilization or virtual machine CPU utilization stays too high for extended periods of time.

vSphere 6.0 and later allow you to create very large virtual machines with up to 128 vCPUs. It is highly recommended to size your virtual machines for the application workload that will be running in them. Sizing virtual machines with resources unnecessarily larger than the workload can actually use may result in hypervisor overhead and can also lead to performance issues.

In general, here are some common CPU performance tips:

Avoid running a large VM on too small a platform.

Don't expect consolidation ratios with busy workloads to be as high as those you achieved with the low-hanging fruit.

 

 

Conclusion

This concludes Module 5: CPU Performance, Basic Concepts and Troubleshooting. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents' to quickly jump to a module in the manual.

 

 

Module 6 - CPU Performance Feature: Power Policies (15 minutes)

Introduction to, and Performance Impact of, Power Policies


VMware vSphere serves as a common virtualization platform for a diverse ecosystem of applications. Every application has different performance demands which must be met, but recent increases in density and computing needs in datacenters are straining power and cooling capacities and costs of running these applications.

vSphere Host Power Management (HPM) is a technique that saves energy by placing certain parts of a computer system or device into a reduced power state when the system or device is inactive or does not need to run at maximum speed.  vSphere handles power management by utilizing Advanced Configuration and Power Interface (ACPI) performance and power states. In VMware vSphere® 5.0, the default power management policy was based on dynamic voltage and frequency scaling (DVFS). This technology utilizes the processor’s performance states and allows some power to be saved by running the processor at a lower frequency and voltage. However, beginning in VMware vSphere 5.5, the default HPM policy uses deep halt states (C-states) in addition to DVFS to significantly increase power savings over previous releases while still maintaining good performance.

However, in order for ESXi to be able to control these features, you must ensure that the server BIOS power management profile is set to "OS Control mode" or the equivalent.

In this lab, we will show how to:

  1. Customize your server's BIOS settings (using example screen shots)
  2. Explain the four power policies that ESXi offers, and demonstrate how to change this setting
  3. Optimize your environment either for a balance of power and performance (recommended for most environments) or for maximum performance (which can sacrifice some power savings).

 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

For users with non-US Keyboards

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

The CLI commands, user names, and passwords that need to be entered can be copied and pasted from the file README.txt on the desktop.

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

 

Start this Module

 

Let's start this module.

Launch Chrome from the shortcut in the Taskbar.

 

 

Login to vSphere

 

Log into vSphere. The vSphere Web Client should be the default home page.

If, for some reason, that does not work, uncheck the box and use these credentials:

User name: CORP\Administrator
Password: VMware1!

In order to reduce the amount of manual input in this lab, a lot of tasks are automated using scripts. Therefore, it's possible that the vSphere Web Client does not reflect the actual state of the inventory immediately after a script has run.

 

 

Select Hosts and Clusters

 

 

Configuring the Server BIOS Power Management Settings


VMware ESXi includes a full range of host power management capabilities.  These can save power when an ESXi host is not fully utilized.  As a best practice, you should configure your server BIOS settings to allow ESXi the most flexibility in using the power management features offered by your hardware, and make your power management choices within ESXi (next section).

On most systems, the default setting is BIOS-controlled power management. With that setting, ESXi won’t be able to manage power; instead it will be managed by the BIOS firmware.  The sections that follow describe how to change this setting to OS Control (recommended for most environments).

In certain cases, poor performance may be related to processor power management, implemented either by ESXi or by the server hardware.  Certain applications that are very sensitive to processing speed latencies may show less than expected performance when processor power management features are enabled. It may be necessary to turn off ESXi and server hardware power management features to achieve the best performance for such applications.  This setting is typically called Maximum Performance mode in the BIOS.

NOTE: Disabling power management usually results in more power being consumed by the system, especially when it is lightly loaded. The majority of applications benefit from the power savings offered by power management, with little or no performance impact.

Bottom line: some form of power management is recommended; it should be disabled only if testing shows that it is hurting your application's performance.

For more details on how and what to configure, see this white paper: http://www.vmware.com/files/pdf/techpaper/hpm-perf-vsphere55.pdf


 

Configuring BIOS to OS Control mode (Dell example)

 

The screenshot above illustrates how an 11th Generation Dell PowerEdge server BIOS can be configured to allow the OS (ESXi) to control the CPU power-saving features directly:

For a Dell PowerEdge 12th Generation or newer server with UEFI (Unified Extensible Firmware Interface), review the System Profile modes in the System Setup > System BIOS settings. You see these options:

Choose Performance Per Watt (OS).

Next, you should verify the Power Management policy used by ESXi (see the next section).

 

 

Configuring BIOS to OS Control mode (HP example)

 

The screenshot above illustrates how an HP ProLiant server BIOS can be configured through the ROM-Based Setup Utility (RBSU).  The settings highlighted in red allow the OS (ESXi) to control some of the CPU power-saving features directly:

Next, you should verify the Power Management policy used by ESXi (see the next section).

 

 

Configuring BIOS to Maximum Performance mode (Dell example)

 

The screenshot above illustrates how an 11th Generation Dell PowerEdge server BIOS can be configured to disable power management:

For a Dell PowerEdge 12th Generation or newer server with UEFI, review the System Profile modes in the System Setup > System BIOS settings. You see these options:

Choose Performance to disable power management.

NOTE: Disabling power management usually results in more power being consumed by the system, especially when it is lightly loaded. The majority of applications benefit from the power savings offered by power management, with little or no performance impact. Therefore, if disabling power management does not realize any increased performance, VMware recommends that power management be re-enabled to reduce power consumption.

 

 

Configuring BIOS to Maximum Performance mode (HP example)

 

The screenshot above illustrates how to set the HP Power Profile mode in the server's RBSU to the Maximum Performance setting to disable power management:

NOTE: Disabling power management usually results in more power being consumed by the system, especially when it is lightly loaded. The majority of applications benefit from the power savings offered by power management, with little or no performance impact. Therefore, if disabling power management does not realize any increased performance, VMware recommends that power management be re-enabled to reduce power consumption.

 

 

Configuring BIOS Custom Settings (Advanced)

 

The screenshot above illustrates that if a Custom System Profile is selected, individual parameters are allowed to be modified.  Here are some examples of some of these settings; for more information, please consult your server's BIOS setup manual.

 

Configuring Host Power Management in ESXi


VMware ESXi includes a full range of host power management capabilities.  These can save power when an ESXi host is not fully utilized.  As a best practice, you should configure your server BIOS settings to allow ESXi the most flexibility in using the power management features offered by your hardware, and make your power management choices within ESXi.  These choices are described below.


 

Select Host Power Management Settings for esx-01a

 

  1. Select "esx-01a.corp.local"
  2. Select "Configure"
  3. Select "Hardware"
  4. Select "Power Management"

 

 

Power Management Policies

 

On a physical host, the Power Management options could look like this (they may vary depending on the processors of the physical host).

Here you can see which ACPI states are presented to the host and which Power Management policy is currently active.  There are four Power Management policies available in ESXi/ESX 4.1 and ESXi 5.0, 5.1, 5.5, and 6.0:

  1. Click "Edit" to see the different options

NOTE: Due to the nature of this lab environment, we are not interacting directly with physical servers, so changing the Power Management policy will not have any noticeable effect.  Therefore, while the sections that follow will describe each Power Management policy, we won't actually change this setting.
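
Purely for reference (since the change has no visible effect in this nested lab), a hedged pyVmomi sketch along these lines could query, and optionally apply, a policy programmatically. It assumes pyVmomi is installed and that host has already been resolved to a vim.HostSystem from an authenticated connection.

  # Sketch only: 'host' is assumed to be a vim.HostSystem obtained from an
  # authenticated pyVmomi ServiceInstance.
  current = host.config.powerSystemInfo.currentPolicy
  print("Active policy:", current.shortName)  # e.g. 'static', 'dynamic', 'low', 'custom'

  # List the available policies (High performance, Balanced, Low power, Custom).
  for policy in host.config.powerSystemCapability.availablePolicy:
      print(policy.key, policy.name)

  # Apply a policy by key; verify the key values on your own host first.
  # host.configManager.powerSystem.ConfigurePowerPolicy(key=current.key)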

 

Conclusion


This concludes Module 6: CPU Performance Feature: Power Policies.  We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished. Let's review some key takeaways and where to go from here.


 

Key takeaways

We hope that you now know how to change power policies, both at the server BIOS level and also within ESXi itself.

To summarize, here are some best practices around power management policies:

Depending on your applications and the level of utilization of your ESXi hosts, the correct power policy setting can have a great impact on both performance and energy consumption. On modern hardware, it is possible to have ESXi control the power management features of the hardware platform. You can use one of the predefined policies, or you can create your own custom policy.

Recent studies have shown that it is best to let ESXi control the power policy.  For more details, see the following references:

 

 

Next Steps

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents'  to quickly jump to a module in the manual.

Lab Module List:

 

 

Module 7 - Memory Performance with X-Mem (30 minutes)

Introduction


 

The goal of this module is to learn how to characterize memory performance in a virtualized environment.  VMware vSphere incorporates sophisticated mechanisms that maximize the use of available memory through page sharing, resource-allocation controls, and other memory management techniques.

Host memory is a limited resource, but it is critical that you assign sufficient resources (especially memory, but also CPU) to each VM so they perform optimally.

We will learn about an open-source memory benchmark named X-Mem, which can be used to characterize both memory bandwidth (throughput) and memory latency (access time).


What is X-Mem / Why X-Mem?


This lesson will describe what X-Mem is (no, it's not a superhero movie), and why we've decided to use it to characterize memory performance in this lab.


 

What X-Mem is: A Cross-Platform, Extensible Memory Characterization Tool for the Cloud

From the X-Mem page on github (https://github.com/Microsoft/X-Mem):

X-Mem is a flexible open-source research tool for characterizing memory hierarchy throughput, latency, power, and more. The tool was developed jointly by Microsoft and the UCLA NanoCAD Lab. This project was started by Mark Gottscho (Email: mgottscho@ucla.edu) as a Summer 2014 PhD intern at Microsoft Research. X-Mem is released freely and open-source under the MIT License. The project is under active development.

 

 

Why X-Mem / Alternatives

 

Of course, X-Mem is not the only memory benchmark available.  Here is a feature comparison of X-Mem versus some other popular memory benchmarks like STREAM, lmbench and Intel's mlc (source). Here is a quick summary of some key advantages that set it apart:

 

 

Research Paper and Attribution

There is a research tool paper describing the motivation, design, and implementation of X-Mem, as well as three experimental case studies that use the tool to deliver insights useful to both cloud providers and subscribers. For more information, see the following links:

Citation:

Mark Gottscho, Sriram Govindan, Bikash Sharma, Mohammed Shoaib, and Puneet Gupta. X-Mem: A Cross-Platform and Extensible Memory Characterization Tool for the Cloud. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 263-273. Uppsala, Sweden. April 17-19, 2016. DOI: http://dx.doi.org/10.1109/ISPASS.2016.7482101

 

Downloading/Installing X-Mem


This lesson will describe how to download the X-Mem benchmark.  There are prebuilt binaries for Windows and Linux; this lab will demonstrate X-Mem inside of Windows VMs.


 

Download and Extract X-Mem

 

There are multiple ways to obtain X-Mem, but the easiest is to go to http://nanocad-lab.github.io/X-Mem/ and click the Binaries (zip) button, which has precompiled binaries for Windows.  If you're using Linux, or wish to make modifications to the source code, click the appropriate link.

 

 

Runtime Prerequisites

There are a few runtime prerequisites for the software to run correctly.  Note that these requirements apply to the pre-compiled binaries available on the project homepage at https://nanocad-lab.github.io/X-Mem. Also note that these requirements are already met in our lab environment:

HARDWARE:

WINDOWS:

GNU/LINUX:

 

 

Installation

Fortunately, the only file needed to run X-Mem is the respective executable: xmem-win-.exe on Windows, or xmem-linux- on GNU/Linux. It has no other dependencies aside from the pre-installed system prerequisites which were just outlined.

 

Running X-Mem



 

Launch Performance Lab Module Switcher

 

Double click on the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 7

 

Click on the Start button for Module 7.

NOTE: Please wait a couple of minutes, and do not proceed with the lab until you see Remote Desktop windows appear.

 

 

Reposition Remote Desktops

 

The script will open Remote Desktop Connections to two Windows VMs.  However, we need to make both of them visible.  Drag the title bars of the Remote Desktop windows:

  1. Position perf-worker-01a on the left (as shown)
  2. Position perf-worker-01b on the right (as shown)
  3. Note that perf-worker-01a has 4 vCPUs
  4. Note that perf-worker-01b has only 1 vCPU
  5. Note that both VMs have 2GB (2047 MB) RAM

Given #5, you might think the memory performance of these 2 VMs should be identical.  As we will see, X-Mem can run multiple worker threads to exercise multiple CPUs simultaneously, allowing better scalability with more vCPUs.

 

 

X-Mem Command Line Options

Here is a summary of some of the command-line options we'll be using in this lab; X-Mem has many more options to customize how it is run.

  Option   Purpose
  -j       Number of worker threads to use in benchmarks. NOTE: cannot be larger than the number of vCPUs.
  -n       Number of iterations to run; helps ensure consistency (the results shouldn't fluctuate much).
  -t       Throughput benchmark mode (as opposed to -l for latency benchmark mode).
  -R       Use memory read-based patterns.
  -W       Use memory write-based patterns.

 

 

X-Mem Command-Line Help

 

Within the perf-worker-01b Remote Desktop window:

  1. Click the Command Prompt taskbar icon
  2. Type this command:  xmem -h  (h for help) and press Enter
  3. The window is not very big, and there's a lot of help text, so use the Up arrow or the scrollbar to scroll back up and see all the different options X-Mem has.

As you can see, X-Mem has a ton of options!  Let's look at some we will be utilizing for this lab.

 

 

Run X-Mem throughput (2 jobs) on perf-worker-01b (FAIL)

 

You should already have a Command Prompt window open on perf-worker-01b from the previous step; if not, click the Command Prompt icon on the taskbar.

Let's try to run X-Mem with a couple of command-line parameters we just saw: -t to test memory throughput, and -j2 to run two worker threads:

  1. Type xmem -t -j2 and press Enter
  2. You should see output like what is shown here, namely:
ERROR: Number of worker threads may not exceed the number of logical CPUs (1)

This is expected, because if you recall, this VM only has 1 virtual CPU.  

 

 

Run X-Mem throughput (2 jobs) on perf-worker-01a (PASS)

 

Now run the exact same X-Mem command that failed on perf-worker-01b on perf-worker-01a:

  1. Select the perf-worker-01a Remote Desktop window.
  2. Click the Command Prompt icon on the taskbar.
  3. Type this command and press Enter: xmem -t -j2
  4. Notice how this command successfully runs the benchmark on this VM.

This command ran successfully on this VM, because it has 4 virtual CPUs (so -j3 or -j4 would also work).  Next, we will take a closer look at the results.

 

 

Review X-Mem throughput results (2 jobs)

 

Once you're back at the command prompt, use the scrollbar to scroll back up and look at the results:

  1. The first benchmark throughput test, Test #1T, will show Read/Write Mode: read.  Since we specified -j2, the output shows that it ran 2 worker threads.
    The result in this example was 90664.66 MB/s (or 90.664 GB/s).  Note that your performance may vary, given the shared resources of the hands-on lab environment (where many other workloads are running).
  2. The second benchmark throughput test, Test #2T, will show Read/Write Mode: write.  Since we specified -j2, the output shows that it ran 2 worker threads.
    The result in this example was 44113.39 MB/s (or 44.11 GB/s).  Note that your performance may vary, given the shared resources of the hands-on lab environment (where many other workloads are running).

Why did the second test have lower (in this case, about half) the throughput of the first?  Well, writes are almost always more expensive than reads; this is true for memory/RAM, and other subsystems, such as disk storage I/O.

 

 

Run X-Mem read throughput (4 jobs) on perf-worker-01a

 

Let's further customize the X-Mem command line options, again on perf-worker-01a:

  1. Make sure the focus is on the Command Prompt of the perf-worker-01a Remote Desktop window (if it isn't already)
  2. Type this command and press Enter:   xmem -t -R -j4 -n5
  3. The results will be listed under the *** RESULTS*** heading, as shown here.

You will notice that the benchmark ran differently, due to the different command line we used.  Here is an explanation of each option:

 

 

Review X-Mem throughput results (4 jobs)

 

Once you're back at the command prompt, use the scrollbar to scroll back up and look at the results.  In this example, the results are consistently around 170,000 MB/sec (170 GB/sec).  Since we specified -j4, it ran 4 worker threads, so the memory performance is significantly higher than when we ran with 2 worker threads.  NOTE: Given the nature of our hands-on lab environment, your results may (and probably will) vary from this example.

As shown here, if your application is multi-threaded, additional vCPUs can potentially increase the VM's memory performance.

 

 

Close the Remote Desktop windows

 

  1. Close the perf-worker-01a Remote Desktop window
  2. Close the perf-worker-01b Remote Desktop window

 

Conclusion and Clean-Up



 

Stop Module 7

 

On the main console, find the Module Switcher window and click Stop.

 

 

Key takeaways

During this lab, we learned that X-Mem is a flexible memory benchmark tool.  It can:

You can download this tool to run in your environment, to ensure you are getting optimal memory performance out of your hosts and virtual machines.

 

 

Conclusion

This concludes Module 7, Memory Performance with X-Mem. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents'  to quickly jump to a module in the manual.

 

 

Module 8 - Storage Performance and Troubleshooting (30 minutes)

Introduction to Storage Performance Troubleshooting


Roughly 90% of performance problems in a vSphere deployment are related to storage in some way.  There have been significant advances in storage technologies over the past couple of years to help improve storage performance. There are a few things that you should be aware of:

In a well-architected environment, there is no difference in performance between storage fabric technologies. A well-designed NFS, iSCSI or FC implementation will work just about the same as the others.

Despite advances in the interconnects, the performance limit is still hit at the media itself. In fact, 90% of storage performance cases seen by GSS (Global Support Services - VMware support) that are not configuration related are media related. Some things to remember:

A good rule of thumb on the total number of IOPS any given disk will provide:

So, if you want to know how many IOPs you can achieve with a given number of disks:
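
For a back-of-the-envelope estimate, here is a minimal Python sketch; the per-disk IOPS figures and RAID write penalties are common rules of thumb that we are assuming here, not values from this manual, so substitute your vendor's numbers.

  # Illustrative rule-of-thumb values; check your vendor's specifications.
  DISK_IOPS = {"7.2K": 80, "10K": 130, "15K": 180, "SSD": 5000}
  RAID_WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}

  def usable_iops(num_disks, disk_type, raid, read_ratio):
      """Rough usable IOPS for a disk group, accounting for the RAID write penalty."""
      raw = num_disks * DISK_IOPS[disk_type]
      # Each front-end write costs RAID_WRITE_PENALTY backend I/Os.
      return raw / (read_ratio + (1 - read_ratio) * RAID_WRITE_PENALTY[raid])

  # Example: 8 x 10K disks in RAID5 with a 70/30 read/write mix -> ~547 IOPS.
  print(int(usable_iops(8, "10K", "RAID5", read_ratio=0.7)))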

This test demonstrates some methods to identify poor storage performance, and how to resolve it using VMware Storage DRS for workload balancing. The first step is to prepare the environment for the demonstration.


 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

For users with non-US Keyboards

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

The CLI commands, user names, and passwords that need to be entered can be copied and pasted from the file README.txt on the desktop.

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

 

Start this Module

 

Let's start this module.

Launch Chrome from the shortcut in the Taskbar.

 

 

Open Flash vSphere Web Client

 

Open Google Chrome and then click on the vSphere Flash Client bookmark

 

Click on Use Windows session authentication and login

 

 

Select Hosts and Clusters

 

 

Storage I/O Contention



 

Launch Performance Lab Module Switcher

 

Double click on the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 8

 

Click on the Start button under Module 8

 

The script configures and starts up the virtual machines, and launches a storage workload using IOmeter.

The script may take up to 5 minutes to complete. While the script runs, spend a few minutes reading through the next step to gain an understanding of storage latencies.

 

 

Disk I/O Latency

 

When we think about storage performance problems, the top issue is generally latency, so we need to look at the storage stack, understand what layers there are in the stack, and see where latency can build up.

At the topmost layer is the application running in the guest operating system. That is ultimately the place where we care most about latency: it is the total amount of latency the application sees, and it includes the latencies of the entire storage stack, including the guest OS, the VMkernel virtualization layers, and the physical hardware.

ESXi can’t see application latency because that is a layer above the ESXi virtualization layer.

From ESXi, we see 3 main latencies that are reported in esxtop and vCenter.

The topmost is GAVG, or Guest Average latency, which is the total amount of latency that ESXi can detect.

That is not to say this is the total amount of latency the application will see. In fact, if you compare GAVG (the total amount of latency ESXi is seeing) with the actual latency the application is seeing, you can tell how much latency the guest OS is adding to the storage stack, which could tell you whether the guest OS is configured incorrectly or is causing a performance problem. For example, if ESXi is reporting a GAVG of 10ms, but the application or perfmon in the guest OS is reporting a storage latency of 30ms, that means 20ms of latency is somehow building up in the guest OS layer, and you should focus your debugging on the guest OS’s storage configuration.

GAVG is made up of 2 major components: KAVG and DAVG. DAVG is basically how much time is spent in the device (the driver, HBA, and storage array), and KAVG is how much time is spent in the ESXi kernel (that is, how much overhead the kernel is adding).

KAVG is actually a derived metric - ESXi does not specifically measure KAVG. ESXi calculates it with the following formula:

KAVG = Total Latency (GAVG) - DAVG

The VMkernel is very efficient at processing I/O, so there really should not be any significant time that an I/O waits in the kernel, and KAVG should be near 0 in well-configured, well-running environments. When KAVG is not 0, it most likely means that the I/O is stuck in a kernel queue inside the VMkernel. So, the vast majority of the time, KAVG will equal QAVG, or Queue Average latency (the amount of time an I/O is stuck in a queue waiting for a slot in a lower queue to free up so it can move down the stack).
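
As a minimal illustration of the derivation just described (the function and the sample numbers are assumptions, not esxtop output):

  def kavg(gavg_ms, davg_ms):
      """KAVG is derived, not measured: total latency ESXi sees minus device latency."""
      return gavg_ms - davg_ms

  # Healthy case: almost all of the latency is in the device.
  print(round(kavg(12.0, 11.8), 1))  # ~0.2 ms -> fine
  # Suspicious case: 15 ms unaccounted for -> likely queuing (check QAVG).
  print(kavg(25.0, 10.0))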

 

 

View the Storage Performance as reported by IOmeter

 

When the storage script has completed, you should see two IOmeter windows, and two storage workloads should be running.

The storage workload is started on both perf-worker-02a and perf-worker-03a. It will take a few minutes for the workloads to settle and for the performance numbers to become almost identical for the two VMs. The virtual disks these machines are testing share the same datastore, and that datastore is saturated.

The performance can be seen in the IOmeter GUI as:

Latencies (Average I/O Response Time): around 6 ms

Low IOPs (Total I/Os per Second): around 160 IOPs

Low throughput (Total MBs per Second): around 2.7 MB/s

Disclaimer: Because we run this lab in a fully virtualized environment, where the ESXi host servers also run in virtual machines, we cannot assign physical disk spindles to individual datastores. Therefore the performance numbers on these screenshots will vary depending on the actual load in the cloud environment the lab is running in.

 

 

Select perf-worker-03a

 

  1. Select "perf-worker-03a"

 

 

View Storage Performance Metrics in vCenter

 

  1. Select "Monitor"
  2. Select "Performance"
  3. Select "Advanced"
  4. Click "Chart Options"

 

 

Select Performance Metrics

 

  1. Select "Virtual disk"
  2. Select only "scsi0:1"
  3. Click "None" under "Select counters for this chart"
  4. Select "Write latency" and "Write rate"
  5. Click "OK"

The disk that IOmeter uses for generating workload is scsi0:1, or sdb inside the guest.

 

 

View Storage Performance Metrics in vCenter

 

Repeat the configuration of the performance chart for perf-worker-02a and verify that performance is almost identical to perf-worker-03a.

Guidance:  Device latencies greater than 20ms may impact the performance of your applications.

Due to the way we create a private datastore for this test, we actually have pretty low latency numbers. scsi0:1 is located on an iSCSI datastore backed by a RAM disk on perf-worker-04a (DatastoreA), running on the same ESXi host as perf-worker-03a. Hence, latencies are pretty low for a fully virtualized environment.

vSphere provides several storage features to help manage and control storage performance:

Let’s configure Storage DRS to solve this contention problem.

 

Storage Cluster and Storage DRS


A datastore cluster is a collection of datastores with shared resources and a shared management interface. Datastore clusters are to datastores what clusters are to hosts. When you create a datastore cluster, you can use vSphere Storage DRS to manage storage resources.

When you add a datastore to a datastore cluster, the datastore's resources become part of the datastore cluster's resources. As with clusters of hosts, you use datastore clusters to aggregate storage resources, which enables you to support resource allocation policies at the datastore cluster level. The following resource management capabilities are also available per datastore cluster.

Space utilization load balancing: You can set a threshold for space use. When space use on a datastore exceeds the threshold, Storage DRS generates recommendations or performs Storage vMotion migrations to balance space use across the datastore cluster.

I/O latency load balancing: You can set an I/O latency threshold for bottleneck avoidance. When I/O latency on a datastore exceeds the threshold, Storage DRS generates recommendations or performs Storage vMotion migrations to help alleviate high I/O load. Remember to consult your storage vendor to get their recommendation on using I/O latency load balancing.

Anti-affinity rules: You can create anti-affinity rules for virtual machine disks. For example, the virtual disks of a certain virtual machine must be kept on different datastores. By default, all virtual disks for a virtual machine are placed on the same datastore.


 

Change to the Datastore view

 

  1. Change to the datastore view
  2. Expand "vcsa-01a.corp.local" and "RegionA01"

 

 

Create a Datastore Cluster

 

  1. Right Click on "RegionA01"
  2. Select "Storage"
  3. Click "New Datastore Cluster..."

 

 

Create a Datastore Cluster ( part 1 of 6 )

 

For this lab, we will accept most of the default settings.

  1. Type "DatastoreCluster" as the name of the new datastore cluster.
  2. Click Next

 

 

Create a Datastore Cluster (part 2 of 6 )

 

  1. Click "Next"

 

 

Create a Datastore Cluster ( part 3 of 6 )

 

  1. Change the "Utilized Space" threshold to "50"
  2. Click "Next"

Since the HOL is a nested virtual environment, it is difficult to demonstrate high latency in a reliable manner. Therefore, we do not use I/O latency to demonstrate load balancing. The default is to check for storage cluster imbalances every 8 hours, but this can be lowered to a minimum of 60 minutes.

 

 

Create a Datastore Cluster ( part 4 of 6 )

 

  1. Select "Clusters"
  2. Select "RegionA01-COMP01"
  3. Click "Next"

 

 

Create a Datastore Cluster ( part 5 of 6 )

 

  1. Select "DatastoreA" and "DatastoreB"
  2. Click "Next"

 

 

Create a Datastore Cluster ( part 6 of 6 )

 

  1. Click "Finish"

 

 

Run Storage DRS

 

Take a note of the name of the virtual machine that Storage DRS wants to migrate.

  1. Select "DatastoreCluster"
  2. Select the "Monitor" tab
  3. Select "Storage DRS"
  4. Click "Run Storage DRS Now"
  5. Click "Apply Recommendations"

Notice that SDRS recommends moving one of the workloads from DatastoreA to DatastoreB. It is making the recommendation based on space. SDRS makes storage moves based on performance only after it has collected performance data for more than 8 hours. Since the workloads started only recently, SDRS will not make a recommendation to balance the workloads based on performance until it has collected more data.

 

 

Storage DRS in vSphere 6.5

 

  1. Select "Configure"
  2. Select "Storage DRS"
  3. Investigate the different settings you can configure for Storage DRS

A number of enhancements have been made to Storage DRS in vSphere 6.5, in order to remove some of its previous limitations.

Storage DRS has improved interoperability with deduplicated datastores: it can identify whether datastores are backed by the same deduplication pool or not, and hence avoid moving a VM to a datastore using a different deduplication pool.

Storage DRS has improved interoperability with thin-provisioned datastores: it can identify whether thin-provisioned datastores are backed by the same storage pool or not, and hence avoid moving a VM between datastores using the same storage pool.

Storage DRS has improved interoperability with array-based auto-tiering: it can identify datastores with auto-tiering and treat them differently, according to the type and frequency of auto-tiering.

Common to all these improvements is that they require VASA 2.0, which in turn requires that the storage vendor has an updated storage provider.

 

 

Select the VM that was migrated

 

  1. Return to the "Hosts and Clusters" view
  2. Select the virtual machine that was migrated using Storage DRS, in this case perf-worker-03a

 

 

Increased throughput and lower latency

 

  1. Select the "Monitor" tab
  2. Select "Performance"
  3. Select "Advanced"

Now you should see the performance chart you created earlier in this module.

Notice how the throughput has increased and how the latency is lower (green arrows) than when both VMs shared the same datastore.

 

 

Return to the Iometer GUIs to review the performance

 

Return to the Iometer workers and see how they also report increased performance and lower latencies.

It may take a while, perhaps 10 minutes, for Iometer to show these higher numbers. This is due to the way storage performance is throttled in this lab. If you want to try a shortcut:

  1. Click the "Stop sign", and wait for about 30 seconds
  2. Click the "Green flag" (start tests) to restart the two workers (see arrows on the picture)

The workload should spike, but then settle at the higher performance level in a couple of minutes.

 

 

Stop the Iometer workloads

 

Stop the Workloads

  1. Press the "Stop Sign" button on the Iometer GUI
  2. Close the GUI by pressing the “X
  3. Press the "Stop Sign" button on the Iometer GUI
  4. Close the GUI by pressing the “X

 

Conclusion and Clean-Up



 

Stop Module 8

 

On the main console, find the Module Switcher window and click Stop.

 

 

Key takeaways

During this lab, we saw the importance of sizing your storage correctly with respect to both space and performance. It also showed that when two storage-intensive sequential workloads share the same spindles, performance can be greatly impacted. If possible, keep workloads separated: keep sequential workloads apart (backed by different spindles/LUNs) from random workloads.

In general, aim to keep storage latencies under 20ms (lower if possible), and monitor for frequent latency spikes of 60ms or more, which would be a performance concern and something to investigate further.

Guidance: From a vSphere perspective, for most applications, the use of one large datastore vs. several small datastores tends not to have a performance impact. However, the use of one large LUN vs. several LUNs is storage-array dependent, and most storage arrays perform better in a multi-LUN configuration than in a single large LUN configuration.

Guidance: Follow your storage vendor’s best practices and sizing guidelines to properly size and tune your storage for your virtualized environment.

 

 

Conclusion

This concludes Module 8, Storage Performance and Troubleshooting. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents'  to quickly jump to a module in the manual.

 

 

Module 9 - Network Performance, Basic Concepts and Troubleshooting (15 minutes)

Introduction to Network Performance


As defined by Wikipedia, network performance refers to measures of service quality of a telecommunications product as seen by the customer.

These metrics are considered important:

In the following module, we will show you how to monitor and troubleshoot some network-related issues, so that you can troubleshoot similar issues that may exist in your own environment.


 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

For users with non-US Keyboards

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

The CLI commands, user names, and passwords that need to be entered can be copied and pasted from the file README.txt on the desktop.

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

 

Start this Module

 

Let's start this module.

Launch Chrome from the shortcut in the Taskbar.

 

 

Open Flash vSphere Web Client

 

Click on the vSphere Flash Client bookmark

 

Log into vSphere.

If, for some reason, that does not work, uncheck the box and use these credentials:

User name: CORP\Administrator
Password: VMware1!

 

 

Refresh the UI

 

In order to reduce the amount of manual input in this lab, a lot of tasks are automated using scripts. Therefore, it's possible that the vSphere Web Client does not reflect the actual state of the inventory immediately after a script has run.

If you need to manually refresh the inventory, click the Refresh icon in the top of the vSphere Web Client.

 

 

Select Hosts and Clusters

 

 

Show network contention


Network contention is when multiple VMs are fighting for the same resources.

In the VMware Hands-on Labs, it's not possible to use all network resources in a way that simulates the real world.

Therefore, this module will focus on creating network load and showing you where to look when you suspect network problems in your own environment.

You might see different results on your screen, due to the load on the environment when you are running the lab.


 

Launch Performance Lab Module Switcher

 

Double click on the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 9

 

Click on the Start button under Module 9.

 

 

Select VM

 

  1. Select "perf-worker-02a" on ESXi host "esx-01a.corp.local"
  2. Select "Monitor" tab
  3. Select "Performance" tab
  4. Select "Advanced"
  5. Click "Chart Options"

 

 

Select Chart options

 

  1. Select "Network"
  2. Click "None"
  3. Select "perf-worker-02a"
  4. Select Packets received, Packets transmitted, and Usage
  5. Click "OK"

Note: If you are unable to select all the metrics shown here, wait until the script has started the VMs and then open the Chart Options again.

 

 

Monitor chart

 

Depending on how long it has taken you to get here, the network load might already be done. You should still be able to see the load that was running in the charts. Note that in the picture above, we ran the network load twice for illustration purposes.

  1. Here you can see the graphical network load on perf-worker-02a
  2. Here you can monitor the load of the VM and see the actual numbers of the packets received.

Some good advice on what to look for:

Usage:

If this number is higher than you expect, it might be because of problems in the network or in the VM.

Packets received and Packets transmitted:

A high packet rate is a good indication of contention. When more packets are received than can be processed, packets may be dropped and need to be re-transmitted, which could be caused by contention or by problems in the network.
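
To put dropped packets into perspective, here is a minimal Python sketch; the function and sample numbers are illustrative assumptions, not values from this lab.

  def drop_rate(dropped, delivered):
      """Percentage of packets dropped out of all packets handled."""
      total = dropped + delivered
      return dropped / total * 100 if total else 0.0

  # Example: 150 drops out of ~50,000 packets is a 0.3% drop rate;
  # a sustained non-zero drop rate is worth investigating further.
  print(round(drop_rate(150, 49850), 2))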

Let's go to the host, and see if this is a VM, or a host problem.

 

 

Select Host

 

  1. Select "esx-01a.corp.local"
  2. Select "Monitor" tab
  3. Select "Performance" tab
  4. Select "Advanced"
  5. Select "Network" from the drop down menu
  6. Click "Chart Options"

 

 

Select Chart options

 

  1. Click "None"
  2. Select "esx-01a.corp.local"
  3. Select "Packets received, transmitted, Received packets dropped and Transmit packets dropped"
  4. Click "OK"

 

 

Monitor Chart

 

  1. See if there are any dropped packets on the host

In this example, there are no packets dropped on the host, which indicates that this is a VM problem.

Note that you might see different results in the lab, due to the nature of the Hands-on Labs.

 

Conclusion and Cleanup



 

Stop Module 9

 

On the main console, find the Module Switcher window and click Stop.

 

 

Key takeaways

During this lab, we saw how to diagnose networking problems in VMs and hosts using the built-in monitoring tools in vCenter.

There are many other ways of performance troubleshooting.

If you want to know more about performance troubleshooting, continue with the next modules, or see this article:

Troubleshooting network performance issues in a vSphere environment

http://kb.vmware.com/kb/1004087  

 

 

Conclusion

This concludes Module 9, Network Performance, Basic Concepts and Troubleshooting. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents'  to quickly jump to a module in the manual.

 

 

Module 10 - Advanced Performance Feature: Latency Sensitivity Setting (45 minutes)

Introduction to Latency Sensitivity


The vSphere 6.5 latency sensitivity feature was developed to address major sources of latency that can be introduced by virtualization. It was designed to programmatically reduce response time and jitter on a per-VM basis, allowing sensitive workloads exclusive access to physical resources and avoiding resource contention at a granular level. This is achieved by bypassing virtualization layers, reducing overhead. Even greater performance can be realized when latency sensitivity is used in conjunction with a pass-through mechanism such as single-root I/O virtualization (SR-IOV). Because the setting is per-VM, a mixture of both normal VMs and latency-sensitive workload VMs can run on a single vSphere host.


 

Who should use this feature?

The latency sensitivity feature is intended only for specialized use cases: workloads that require extremely low latency. It is extremely important to determine whether your workload could benefit from this feature before enabling it. Latency sensitivity provides extremely low network latency, with a tradeoff of increased CPU and memory cost (because of less resource sharing) and increased power consumption.

The definition of a “high latency sensitive application” is one that requires network latencies in the tens of microseconds and very small jitter. An example would be stock market trading applications, which are highly sensitive to latency; any introduced latency could mean the difference between making millions and losing millions.

Before making the decision to leverage VMware’s latency sensitivity feature, perform the necessary cost-benefit analysis to determine whether this feature is needed. Choosing to enable this feature just because it exists can lead to higher host CPU utilization, higher power consumption, and can needlessly impact the performance of the other VMs running on the host.

 

 

Who should not use this feature?

Choosing whether to enable latency sensitivity is one of those “just because you can doesn’t mean you should” choices. The latency sensitivity feature reduces network latency. It will not decrease application latency, especially if that latency is influenced by storage or other sources of latency besides the network.

The latency sensitivity feature should only be enabled in environments in which the CPU is undercommitted. VMs which have latency sensitivity set to High are given exclusive access to physical CPUs on the host. This means the latency-sensitive VM can no longer share those CPUs with neighboring VMs.

Generally, VMs that use the latency sensitivity feature should have fewer vCPUs than the number of cores per socket in your host, to ensure that the latency-sensitive VM occupies only one NUMA node.

If the latency sensitivity feature is not relevant to your environment, consider choosing a different module.

 

 

Changes to CPU access

When a VM has 'High' latency sensitivity set in vCenter, the VM is given exclusive access to the physical cores it needs to run. This is termed exclusive affinity. These cores will be reserved for the latency sensitive VM only, which results in greater CPU accessibility to the VM and less L1 and L2 cache pollution from multiplexing other VMs onto the same cores. When the VM is powered on, each vCPU is assigned to a particular physical CPU and remains on that CPU.

When the latency sensitive VM's vCPU is idle, ESXi also alters its halting behavior so that the physical CPU remains active. This reduces wakeup latency when the VM becomes active again.

 

 

Changes to virtual NIC interrupt coalescing

A virtual NIC (vNIC) is a virtual device that exchanges network packets between the VMkernel and the guest operating system. Exchanges are typically triggered by interrupts to the guest OS or by the guest OS calling into the VMkernel, both of which are expensive operations. Virtual NIC interrupt coalescing, which is enabled by default in vSphere, attempts to reduce CPU overhead by holding back packets for some time (combining or "coalescing" these packets) before triggering interrupts, so that the hypervisor wakes up the VM less frequently.

Enabling 'High' latency sensitivity disables virtual NIC coalescing, so that there is less latency between when a packet is sent or received and when the CPU is interrupted to process the packet.   Typically, coalescing is desirable for higher throughput (so the CPU isn't interrupted as often), but it can introduce network latency and jitter.

While disabling coalescing can reduce latency, it can also increase CPU utilization, and thus power usage.  Therefore this option should only be used in environments with small packet rates and plenty of CPU headroom.
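
In this lab the setting is changed through the vSphere Web Client, but purely for reference, a hedged pyVmomi sketch of the same reconfiguration might look like this; it assumes vm has already been resolved to a vim.VirtualMachine and that the reservations discussed later in this module are in place.

  from pyVmomi import vim

  # Sketch only: 'vm' is assumed to be an existing vim.VirtualMachine object.
  spec = vim.vm.ConfigSpec()
  spec.latencySensitivity = vim.LatencySensitivity(level='high')

  # Reconfigure the VM; in practice you would wait on the returned task.
  task = vm.ReconfigVM_Task(spec=spec)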

Are you ready to get your hands dirty? Let's start the hands-on portion of this lab.

 

 

Look at the lower right portion of the screen

 

Please check to see that your lab has finished all the startup routines and is ready for you to start. If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

For users with non-US Keyboards

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

The CLI commands, user names, and passwords that need to be entered can be copied and pasted from the file README.txt on the desktop.

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

Let's start this module.

Launch Chrome from the shortcut in the Taskbar.

 

 

Open Flash vSphere Web Client

 

Click on the vSphere Flash Client bookmark

 

Click on Use Windows session authentication and login

 

 

Select Hosts and Clusters

 

 

Performance impact of the Latency Sensitivity setting


In this section, we will observe the impact of the Latency Sensitivity setting on network latency.  To do so, let's start up some workloads to stress the VMs.


 

Launch Performance Lab Module Switcher

 

Double click on the Performance Lab MS shortcut on the Main Console desktop

 

 

Launch Module 10

 

Click on the Start button under Module 10.

 

 

VM Stats Collectors: CPU intensive workload started

 

In a few minutes, when the script completes, you will see two "VM Stats Collector" applications start up. Within a minute after that, each utility will start a CPU intensive application on the perf-worker-05a and perf-worker-06a virtual machines and collect the benchmark results from those workloads. perf-worker-05a and perf-worker-06a will create high demand for CPU on the host, which will help us demonstrate the Latency Sensitivity feature.

IMPORTANT NOTE: Due to changing loads in the lab environment, your values may vary from the values shown in the screenshots.

 

 

Select ESX Host

 

The environment where the lab is running is not constant, so it is important to note the speed of the CPUs on the nested ESXi hosts.

  1. Select esx-02a.corp.local
  2. Make a note of the CPU speed of the processor (in this case, 2.80 GHz)

You will use this value in a later step.

 

 

Edit perf-worker-04a

 

We will use the perf-worker-04a virtual machine to demonstrate the Latency Sensitivity feature. To show how the 'High' Latency Sensitivity setting affects network latency, we will compare network performance between perf-worker-04a with Latency Sensitivity set to 'Normal' and that same VM with Latency Sensitivity set to 'High'.

The Latency Sensitivity feature, when set to 'High', has two VM resource requirements. For best performance, it needs 100% memory reservation and 100% CPU reservation.

To make a fair comparison, both the 'Normal' latency sensitivity VM and the 'High' latency sensitivity VM should have the same resource reservations, so that the only difference between the two is the 'High' latency sensitivity setting.

First, we will create resource allocations for the perf-worker-04a virtual machine while Latency Sensitivity is set to "Normal".

  1. Select perf-worker-04a.  This VM resides on esx-02a.corp.local.
  2. Select edit settings

 

 

Set CPU Reservation to Maximum

 

  1. Expand CPU
  2. Set the Reservation value to the highest value possible, according to the CPU speed you noted in the earlier step. If the CPU speed was 2.80 GHz, set the reservation to 2753 MHz.

Note that the reservation must be a few MHz less than the full core speed for the VM to be able to power on.

This sets a near-100% CPU reservation for the VM. When the VM has the 'High' latency sensitivity setting, this CPU reservation enables exclusive affinity, so that one physical CPU is reserved solely for use by the 'High' Latency Sensitivity VM's vCPU.

Note that normally you would select the "Maximum" reservation, but because this is a fully virtualized (nested) environment, the CPU speed is detected with a wrong value. Therefore, we set it manually according to the underlying hardware.
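
The same reservation can also be scripted. A minimal PowerCLI sketch, assuming the 2753 MHz value noted above:

# Reserve (nearly) the full core frequency for the worker VM
Get-VM -Name perf-worker-04a | Get-VMResourceConfiguration | Set-VMResourceConfiguration -CpuReservationMhz 2753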

 

 

Set Memory Reservation

 

Still on the Edit Settings page,

  1. Click CPU to collapse the CPU view
  2. Click Memory to expand the Memory view
  3. Check the box Reserve all guest memory (All locked)

This sets a 100% memory reservation for the VM.

Right now we are going to test network performance on a 'Normal' Latency Sensitivity VM, but when we change the VM's latency sensitivity to 'High' later, the 100% memory reservation ensures that all the memory the VM needs is located close to the processor running the VM. A VM with the 'High' Latency Sensitivity setting will not power on unless it has a 100% memory reservation.
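
The "Reserve all guest memory (All locked)" checkbox corresponds to the MemoryReservationLockedToMax flag in the vSphere API. A hedged PowerCLI sketch of the same change:

$vm = Get-VM -Name perf-worker-04a
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
# Keeps the memory reservation pinned to the VM's configured memory size
$spec.MemoryReservationLockedToMax = $true
$vm.ExtensionData.ReconfigVM($spec)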

 

 

Ensure Latency Sensitivity is 'Normal'

 

Still on the Edit Settings page:

  1. Click the VM Options tab
  2. Click Advanced to expand this section
  3. Confirm the Latency Sensitivity is Normal
  4. Click OK

 

 

Power on perf-worker-04a

 

  1. Right click on "perf-worker-04a".
  2. Select "Power"
  3. Click "Power On"

 

 

Monitor esx-02a host CPU usage

 

  1. Select esx-02a.corp.local
  2. Select Monitor
  3. Select Performance
  4. Select Advanced
  5. You can see that the Latest value for esx-02a.corp.local Usage should be close to 100%.  This indicates that all VMs are consuming as much CPU on the host as they can.

Although an environment which contains latency-sensitive VMs should typically remain CPU undercommitted, creating demand for CPU makes it more likely that we can see a difference between the 'Normal' and 'High' Latency Sensitivity network performance.

The VM perf-worker-03a will serve as the network performance test target.

 

 

Monitor Resource Allocation

 

  1. Select perf-worker-04a
  2. Select Monitor
  3. Select Utilization

The Resource Allocation view for the 'Normal' Latency Sensitivity VM shows that only a small portion of the total CPU and memory reservation is active. Your screen may show different values if the VM is still booting up.

 

 

Open a PuTTY window

 

Click the PuTTY icon on the taskbar

 

 

SSH to perf-worker-04a

 

  1. Select perf-worker-04a
  2. Click Open

 

 

Test network latency on 'Normal' latency sensitivity VM

 

At the command line, type:

ping -f -w 1 192.168.100.153 

Press enter.

Wait for the command to complete, and run this command a total of 3 times. On the second and third times, you can press the up arrow to retrieve the last command entered.

Ping is a very simple network workload that measures Round Trip Time (RTT): a network packet is sent to a target VM and then returned to the sender. The VM perf-worker-04a, located on esx-02a.corp.local, is pinging perf-worker-03a, located on esx-01a.corp.local, at the IP address 192.168.100.153. For a period of one second, perf-worker-04a sends back-to-back ping requests. Ping is an ideal low-level network test because the request is processed in the kernel and never needs to reach the application layer of the operating system.
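
If you prefer not to re-type the command for the three runs, a small shell loop in the same SSH session does the same thing (the 2-second pause between runs is an arbitrary choice):

for i in 1 2 3; do ping -f -w 1 192.168.100.153; sleep 2; done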

We have finished testing network latency on the 'Normal' Latency Sensitivity VM. Do not close this PuTTY window, as we will use it for reference later. We will now change the VM to 'High' Latency Sensitivity.

 

 

Shut down the perf-worker-04a VM

 

To enable the latency sensitivity feature for a VM, the VM must first be powered off. You can change the setting while the VM is powered on, but it does not fully take effect until the VM has been powered off and then back on again.

  1. Right-click perf-worker-04a
  2. Select Power
  3. Click Shut Down Guest OS

 

 

Confirm Guest Shut Down

 

Click Yes

Wait for perf-worker-04a to shut down.

 

 

Edit Settings for perf-worker-04a

 

We have already configured the CPU and memory reservations for perf-worker-04a, so the only remaining difference between the two test runs will be the Latency Sensitivity setting itself. We will now change that setting from 'Normal' to 'High'.

  1. Click perf-worker-04a
  2. Click Actions
  3. Click Edit Settings...

 

 

Set 'High' Latency Sensitivity

 

  1. Select VM Options
  2. Expand Advanced
  3. Select High
  4. Click OK
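
For reference, the same change can be made programmatically through the vSphere API. A minimal PowerCLI sketch (remember that the VM must be power-cycled for the setting to fully apply):

$vm = Get-VM -Name perf-worker-04a
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.LatencySensitivity = New-Object VMware.Vim.LatencySensitivity
$spec.LatencySensitivity.Level = [VMware.Vim.LatencySensitivitySensitivityLevel]::high
$vm.ExtensionData.ReconfigVM($spec)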

 

 

CPU reservation warning

 

You may have noticed a warning in the previous screenshot: "Check CPU Reservation" appears next to the Latency Sensitivity setting. For best performance, High Latency Sensitivity requires a 100% CPU reservation for the VM, which we set earlier. This warning always appears on the Advanced Settings screen, even when the CPU reservation has already been set high enough.

If no CPU reservation is set, the VM is still allowed to power on, and no further warnings are given.

 

 

Power on perf-worker-04a

 

  1. Right-click perf-worker-04a
  2. Select Power
  3. Click Power On

 

 

Monitor Resource Allocation

 

  1. Select the "Monitor" tab
  2. Select "Utilization"

On the top half of this image, we see that the 'High' Latency Sensitivity VM shows 100% active CPU and private memory even though the VM itself is idle. Compare this to the Resource Allocation view for the 'Normal' Latency Sensitivity VM, which we examined earlier: it showed only a small portion of the total CPU and memory reservation active. This increase in active CPU and memory is the result of the 'High' Latency Sensitivity setting.

Although we cannot see it in this environment, when 'High' Latency Sensitivity is set together with a 100% CPU reservation, the host will report 100% utilization of the physical core hosting the VM's vCPU. This is a normal result of exclusive affinity and occurs even when the VM itself is idle. On many Intel processors, the physical CPU hosting an idle vCPU will itself be idle, but it remains unavailable to other vCPUs.

 

 

Monitor the VM Stats Collectors

 

Before we set 'High' Latency Sensitivity for perf-worker-04a, the CPU workers had equivalent benchmark scores. Now, one of the CPU workers will have a lower score. In the example above it is perf-worker-06a; your lab may show either perf-worker-05a or perf-worker-06a with the lower score. This confirms that perf-worker-04a's exclusive affinity has reduced that worker's access to CPU cycles, which decreases its CPU benchmark score.

Next, we will test network latency on the 'High' Latency Sensitivity VM.

 

 

Open a PuTTY window

 

Click the PuTTY icon on the taskbar

 

 

SSH to perf-worker-04a

 

  1. Select perf-worker-04a
  2. Click Open

 

 

Test network latency on 'High' Latency Sensitive VM

 

At the command line, run the command:

ping -f -w 1 192.168.100.153

Like last time, wait for the command to complete, and run this command a total of three times.

We'll take a look at the results in a second, but first we will set the Latency Sensitivity setting back to default.

 

 

Compare network latency tests

 

From the taskbar, click the PuTTY icons to bring both PuTTY windows to the foreground, and arrange them with the Normal Latency Sensitivity window on top and the High Latency Sensitivity window on the bottom.

Hint: At the bottom of both windows there is a timestamp line of the form "Broadcast message from root (timestamp)". The window with the oldest timestamp is the Normal Latency Sensitivity session; place it on top and the other on the bottom.

Now let's delve into the performance results.

Important Note: Due to variable loads in the lab environment, your numbers may differ from those above.

The ping test we completed sends as many ping requests to the remote VM as possible ("back-to-back pings") within a one-second period: as soon as one ping is returned, another request is sent. The ping command outputs four statistics per test: the minimum, average, and maximum round-trip time, plus the deviation (min/avg/max/mdev).

Of these, we are most interested in the minimum latency and the deviation.
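
For reference, the summary line of a Linux ping run looks like the following (the numbers here are illustrative only); mdev, the deviation, is the "jitter" we care about:

rtt min/avg/max/mdev = 0.210/0.254/0.513/0.048 ms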

By 'eyeballing' the differences between the 'Normal' and 'High' Latency Sensitivity VMs, you should be able to see the effect. Note the numbers within the green brackets; the smaller deviation on the 'High' Latency Sensitivity VM represents less "jitter". Because this is a shared, virtualized test environment, these performance results are not representative of the effects of the Latency Sensitivity setting in a real-life environment; they are for demonstration purposes only.

Remember, these numbers were taken from the same VM with the same resource allocations, under the same conditions. The only difference between the two is setting 'Normal' versus 'High' Latency Sensitivity.

 

 

Close the VM Stats Collector windows

 

From the taskbar, click the .NET icon to bring the VM Stats Collectors to the foreground.

We have finished the network tests. Close the windows using the X on each window.

 

 

Close open PuTTY windows

 

Close the open PuTTY windows.

 

Conclusion and Cleanup



 

Stop Module 10

 

On the Main Console, find the Module Switcher window and click Stop.

 

 

Key takeaways

The Latency Sensitivity setting is very easy to configure. Once you have determined that your application fits the definition of 'High' latency sensitivity (latency requirements in the tens of microseconds), configure Latency Sensitivity as follows.

To review:

1. On a powered off VM, set 100% memory reservation for the latency sensitive VM.

2. If your environment allows, set 100% CPU reservation for the latency sensitive VM such that the MHz reserved is equal to 100% of the sum of the frequency of the VM's vCPUs.

3. In Advanced Settings, set Latency Sensitivity to High.

If you want to learn more about running latency sensitive applications on vSphere, consult these white papers:

http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf  

http://www.vmware.com/files/pdf/techpaper/latency-sensitive-perf-vsphere55.pdf

 

 

Conclusion

This concludes Module 10, Advanced Performance Feature: Latency Sensitivity Setting. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents'  to quickly jump to a module in the manual.

 

 

Module 11 - Advanced Performance Tool: esxtop CLI introduction (30 minutes)

Introduction to esxtop


There are several tools to monitor and diagnose performance in vSphere environments. esxtop is best used to diagnose and further investigate performance issues that have already been identified through another tool or method. It is not designed for monitoring performance over the long term, but it is great for deep investigation or for monitoring a specific issue or VM over a defined period of time.
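
For exactly that kind of defined-period capture, esxtop also offers a batch mode that writes every counter to a CSV file for offline analysis. A sketch run from the ESXi shell, capturing 30 samples at a 2-second interval (-b batch mode, -d delay between samples, -n number of iterations; the output path is arbitrary):

esxtop -b -d 2 -n 30 > /tmp/esxtop-capture.csv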

In this module, which should take about 30 minutes, we will use esxtop to dive into performance troubleshooting across CPU, memory, storage, and network. The goal of this module is to expose you to the different views in esxtop and to present you with different loads in each view.  This is not meant to be a deep dive into esxtop, but to get you comfortable with the tool so that you can use it in your own environment.

To learn more about each metric in esxtop, and what they mean, we recommend that you look at the links at the end of this module.

For day-to-day performance monitoring of an entire vSphere environment, vRealize Operations (vROps) is a powerful tool that can monitor your entire virtual infrastructure. It incorporates high-level dashboard views and built-in intelligence to analyze the data and identify possible problems.  We also recommend that you look at the other vRealize hands-on labs when you have finished with this one, for a better understanding of day-to-day monitoring.


 

Check the Lab Status

 

Please check the lower-right corner of the desktop to make sure your lab has finished all the startup routines and is ready for you to start.

If you see anything other than "Ready", please wait a few minutes. If after 5 minutes your lab has not changed to "Ready", please ask for assistance.

 

 

README.txt to assist with entering keyboard input

 

If you are using a device with non-US keyboard layout, you might find it difficult to enter CLI commands, user names and passwords throughout the modules in this lab.

These can be copied and pasted from the file README.txt on the desktop, which you can double click to open in Notepad (as shown).

 

 

On-Screen Keyboard

 

Another option, if you are having issues with the keyboard, is to use the On-Screen Keyboard.  

To do so, click Start and On-Screen Keyboard, or the shortcut on the Taskbar.

 

 

Launch Chrome browser

 

Click the Chrome icon in the taskbar to start this module.

 

 

Login to vSphere Client

 

  1. Select the Use Windows session authentication check box
  2. Click the Login button

If, for some reason, that does not work, uncheck the box and use these credentials:

User name: CORP\Administrator
Password: VMware1!

 

 

Refresh vSphere Client (as necessary)

 

In order to reduce the amount of manual input in this lab, a lot of tasks are automated using scripts. Therefore, it's possible that the vSphere Web Client does not reflect the actual state of the inventory immediately after a script has run.

If you need to manually refresh the inventory, click the Refresh icon in the top of the vSphere Web Client.

 

 

Select Hosts and Clusters

 

 

Show esxtop CPU Features


Esxtop can be used to diagnose performance issues involving almost any aspect of performance at both the host and virtual machine level. This section will step through how to view CPU performance, using esxtop in interactive mode.


 

Open a PowerShell window

 

Click on the "Windows PowerShell" icon in the taskbar to open a command prompt.

 

 

Reset Lab

 

Type

.\StopLabVMs.ps1

and press Enter.  This resets the lab into a base configuration.

 

 

Start CPU load on VMs

 

Type

.\StartCPUTest2.ps1

and press Enter.  Wait until you see the RDP sessions before continuing.

 

 

Open PuTTY

 

 

 

SSH to esx-01a

 

  1. Select host esx-01a.corp.local
  2. Click Open

 

 

Start esxtop

 

  1. From the ESXi shell, type
esxtop

and press Enter.

  2. Click the Maximize icon so we can see the maximum amount of information.

 

 

Select the CPU view

 

If you just started esxtop, you are on the CPU view by default.

To be sure, press "c" to switch to the CPU view.

 

 

Filter the fields displayed

 

Type

f

To see the list of available fields (counters).

Since we don't have a lot of screen space, let's remove the ID and Group Id counters.

Do this by typing the following letters (NOTE: make sure these are capitalized, as these are case sensitive!)

A
B

Press Enter

 

 

Filter only VMs

 

This screen shows performance counters for both virtual machines and ESXi host processes.

To see only values for virtual machines

Press (capital)

V

 

 

Monitor VM load

 

Monitor the load on the two worker VMs: perf-worker-01a and perf-worker-01b.

They should both be running at (or near) 100% guest CPU utilization. If not, wait a moment and let the CPU workload start up.

One important metric to monitor is %RDY (CPU Ready).  This metric is the percentage of time a "world" is ready to run but is waiting for the CPU scheduler to give it time on a physical CPU.  The metric can go up to 100% per vCPU, which means that a VM with 2 vCPUs has a maximum value of 200%.  A good guideline is to keep this value below 5% per vCPU, but the acceptable level will always depend on the application.

Look at the worker VMs to see if they go above the 5% per vCPU threshold.  To force esxtop to refresh immediately, press the space bar.
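
The same per-vCPU arithmetic can be done outside esxtop. vCenter exposes ready time as the counter cpu.ready.summation, reported in milliseconds per 20-second real-time sample; a hedged PowerCLI sketch of the conversion:

$vm = Get-VM -Name perf-worker-01a
$sample = Get-Stat -Entity $vm -Stat cpu.ready.summation -Realtime -MaxSamples 1 |
          Where-Object { $_.Instance -eq "" }   # "" = aggregate across all vCPUs
# percent ready = ready ms / (20 s * 1000 ms) * 100, i.e. value / 200
$readyPct = $sample.Value / 200
"{0:N1}% ready total, {1:N1}% per vCPU" -f $readyPct, ($readyPct / $vm.NumCpu)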

 

 

Edit Settings of perf-worker-01a

 

Let's see how perf-worker-01a is configured:

  1. Expand the VMs hosted on esx-01a.corp.local so the perf-worker-01a VM is visible.
  2. Right-click on the perf-worker-01a virtual machine
  3. Click Edit Settings…

 

 

Add a vCPU to perf-worker-01a

 

  1. Expand the CPU dropdown
  2. Change CPU to 2
  3. Click OK to save.

 

 

Edit Settings of perf-worker-01b

 

Let's add a virtual CPU to perf-worker-01b to improve performance.

  1. Right click on the perf-worker-01b virtual machine
  2. Click Edit Settings…

 

 

Add a vCPU to perf-worker-01b

 

  1. Select 2 vCPUs
  2. Click "OK"

 

 

Monitor %USED and %RDY

 

Return to the PuTTY (esxtop) window.

Now that we have added an additional vCPU to each virtual machine, you should see results like the screenshot above.

 

 

Monitor %USED and %RDY (continued)

 

After a few minutes, the CPU benchmark will start to use the additional vCPUs, and %RDY will increase even more, along with %CSTP (co-stop), due to CPU contention and SMP scheduling. The two active virtual machines now have 4 vCPUs between them, each attempting to run 2 vCPUs at 100%, and they are fighting for resources. Remember that the ESXi host also requires some CPU resources to run, which adds to the contention.

 

Show esxtop memory features


Esxtop can be used to diagnose performance issues involving almost any aspect of performance, at both the host and virtual machine level. This section will step through how to view memory performance using esxtop in interactive mode.


 

Open a PowerShell Window (if necessary)

 

Click on the Windows PowerShell icon in the taskbar to open a command prompt

NOTE: If you already have one open, just switch back to that window.

 

 

Reset Lab

 

Type

.\StopLabVMs.ps1

and press Enter.  This resets the lab into a base configuration.

 

 

Start Memory Test

 

In the PowerShell window type

.\StartMemoryTest.ps1

and press Enter to start the memory load.

You can continue to the next step while the script is running, but please don't close any windows that appear, since that will stop the memory load.

 

 

Select the esxtop Memory view

 

In the PuTTY window type

m

To see the memory view

 

 

Select correct fields

 

Type

f

To see the list of available counters.

Since we don't have much screen space, we will remove the counters we don't need.

Do this by pressing (capital letters)

B
H
J

Press Enter to return to the esxtop screen.

 

 

See only VMs

 

This screen shows performance counters for both virtual machines and ESXi host processes.

To see only values for virtual machines

Press (capital)

V

 

 

Monitor memory load with no contention

 

When the load on the worker VMs begins, you should be able to see them at the top of the esxtop window.

Some good metrics to look at are:

MCTL :

Indicates whether the balloon driver is installed.  If not, it's a good idea to fix that first.

MCTLSZ :

Shows how inflated the balloon is, i.e. how much memory has been reclaimed from the guest operating system.  This should be 0.

SWCUR :

Shows how much memory the VM currently has swapped out.  This should be 0, but a nonzero value can be acceptable if the swap rates below are near zero.

SWR/S :

Shows the rate of reads from the swap file.

SWW/S :

Shows the rate of writes to the swap file.

In this lab, all counters should look fine, but due to the nature of the nested environment, it is hard to predict exactly what you will see. Look around and check whether everything looks healthy.
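
The same signals are visible from vCenter's performance counters. A hedged PowerCLI sketch checking balloon size and swapped memory (both reported in KB) for one of the workers:

Get-Stat -Entity (Get-VM -Name perf-worker-02a) -Stat "mem.vmmemctl.average", "mem.swapped.average" -Realtime -MaxSamples 1 | Select-Object MetricId, Value, Unit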

 

 

Power on perf-worker-04a

 

  1. Right Click "perf-worker-04a"
  2. Select "Power"
  3. Click "Power On"

 

 

Monitor memory load under contention

 

Now that we have created memory contention on the ESXi host, we can see:

  1. perf-worker-02a and 03a are ballooning around 400 MB each
  2. perf-worker-02a, 03a, and 04a are swapping to disk, indicating too much memory pressure in this environment

 

 

Stop load on workers

 

  1. Stop the load on the workers by closing the two VM Stats Collector windows that appeared after you started the load script.

 

Show esxtop storage features


Esxtop can be used to diagnose performance issues involving almost any aspect of performance, at both the host and virtual machine level. This section will step through how to view storage performance using esxtop in interactive mode.


 

Open a PowerShell Window (if necessary)

 

Click on the Windows PowerShell icon in the taskbar to open a command prompt

NOTE: If you already have one open, just switch back to that window.

 

 

Reset Lab

 

Type

.\StopLabVMs.ps1

and press Enter.  This resets the lab to a base configuration.  

 

 

Start Storage Test

 

In the PowerShell window type

.\StartStorageTest.ps1

and press enter to start the lab

The lab will take about 5 minutes to prepare. Feel free to continue with the other steps while the script finishes.

After you start the script, be sure that you don't close any windows that appear.

 

 

Different views

 

When looking at storage in esxtop, you have multiple options to choose from.

Esxtop shows the storage statistics in three different screens: the adapter view (d), the device view (u), and the virtual machine view (v).

We will focus on the VM screen in this module.

In the PuTTY window, type (lower case)

v

to see the storage VM view.

 

 

Select correct fields

 

Type

f

To see the list of available counters.

In this case, all counters are already selected except the vSCSI ID.

Since we have enough room for all counters, we will add it too by pressing (capital letter)

A

Press enter when finished

 

 

Start IOmeter load on VMs

 

The StartStorageTest.ps1 script that we executed at the beginning of this section should be finished now, and you should have 2 IOmeter windows on your desktop, looking like this.

If not, run

.\StartStorageTest.ps1 

again, and wait for it to finish.

 

 

Monitor VM load

 

You have 4 VMs running in the lab.

Two of them are running IOmeter workloads, and the other two are iSCSI storage targets backed by RAM disks. Because they use RAM disks as storage targets, they do not generate any physical disk I/O.

The metrics to look for here are:

CMDS/s :

This is the total number of commands per second, which includes IOPS (Input/Output Operations Per Second) as well as other SCSI commands such as SCSI reservations, locks, vendor string requests, and unit attention commands being sent to or coming from the device or virtual machine being monitored.

In most cases CMDS/s = IOPS, unless there are a lot of metadata operations (such as SCSI reservations).

LAT/rd and LAT/wr :

Indicate the average response time of read and write I/O, as seen by the VM.

In this case, you should see high CMDS/s values on the worker VMs currently running the IOmeter load (perf-worker-02a and 03a), indicating that we are generating a lot of I/O, and a high value in LAT/wr, since we are only doing writes.

The numbers on your screen can differ, due to the nature of the Hands-on Labs.
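
A hedged PowerCLI counterpart to LAT/rd and LAT/wr uses the virtual disk latency counters (values in milliseconds); the counter names here are assumed from the standard vCenter performance counter set:

Get-Stat -Entity (Get-VM -Name perf-worker-02a) -Stat "virtualDisk.totalReadLatency.average", "virtualDisk.totalWriteLatency.average" -Realtime -MaxSamples 3 | Select-Object MetricId, Instance, Value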

 

 

Device or Kernel latency

 

Press

d

To go to the Device view.

Here you can see that the storage workload is on device vmhba33, which is the software iSCSI adapter. Look for DAVG (device latency) and KAVG (kernel latency). DAVG should be below 25 ms, and KAVG, the latency added by the kernel, should be very low, always below 2 ms. Esxtop also reports GAVG, the total latency as seen by the guest, which is approximately DAVG + KAVG.

In this example the latencies are within acceptable values.

 

 

Stop load on workers

 

Close BOTH IOmeter workers

  1. When finished, stop IOmeter workloads by clicking "STOP"
  2. Close the window, by selecting the X in the top right corner.

 

Show esxtop Network features


Esxtop can be used to diagnose performance issues involving almost any aspect of performance, at both the host and virtual machine level. This section will step through how to view network performance using esxtop in interactive mode.


 

Open a PowerShell Window (if necessary)

 

Click on the Windows PowerShell icon in the taskbar to open a command prompt

NOTE: If you already have one open, just switch back to that window.

 

 

Reset Lab

 

Type

.\StopLabVMs.ps1

and press Enter.  This resets the lab into a base configuration.

 

 

Start Network Test

 

In the PowerShell window type

.\StartNetTest.ps1

and press Enter.

Continue with the next steps while the script runs; it will take a few minutes.

 

 

Select the network view

 

In the PuTTY window type

n

to see the networking view

 

 

Select correct fields

 

Type

f

To see the list of available counters.

Since we don't have much screen space, we will remove the two counters PORT-ID and DNAME.

Do this by pressing (capital letters)

A
F

Press enter when finished.

 

 

Monitor load

 

Monitor the metrics.

Note that the results on your screen might differ, due to the load on the environment where the Hands-on Labs are running.

The screen updates automatically, but you can force a refresh by pressing

space

The metrics to watch here are:

%DRPTX and %DRPRX :

These are the percentages of transmitted and received packets that were dropped.

If these numbers go up, it might be an indication of high network utilization.

Note that the StartNetTest.ps1 script you ran in the first step starts the VMs, waits 2 minutes, and then runs a network load for 5 minutes.

Depending on how quickly you reached this step, you might not see any load if it took you more than seven minutes.
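
The equivalent drop counters are also exposed by vCenter. A hedged PowerCLI sketch (substitute whichever worker VM the script powered on):

Get-Stat -Entity (Get-VM -Name perf-worker-02a) -Stat "net.droppedRx.summation", "net.droppedTx.summation" -Realtime -MaxSamples 1 | Select-Object MetricId, Value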

 

 

Restart network load

 

If you want to run the network load for another 5 minutes, return to the PowerShell window.

In the PowerShell window, type

.\StartupNetLoad.bat

and press Enter.

The network load will now run for another five minutes.  While you wait, you can explore esxtop further.

 

 

Network workload complete

 

As described previously, the load will stop by itself.  When the PowerShell window says "Network load complete" no more load will be generated.

 

Conclusion and Cleanup



 

Key takeaways

During this module we learned how to use esxtop to monitor load in the CPU, memory, storage, and network views.

We have only scratched the surface of what esxtop can do.

If you want to know more about esxtop, see these articles:

 

 

 

Clean up procedure

In order to free up resources for the remaining parts of this lab, we need to shut down the virtual machines we used and reset their configuration.

 

 

Open a PowerShell Window (if necessary)

 

Click on the Windows PowerShell icon in the taskbar to open a command prompt

NOTE: If you already have one open, just switch back to that window.

 

 

Reset Lab

 

Type

.\StopLabVMs.ps1

and press Enter.  This resets the lab into a base configuration.  You can now move on to another module.

 

 

Conclusion

This concludes Module 11, Performance Tool: esxtop CLI introduction. We hope you have enjoyed taking it. Please do not forget to fill out the survey when you are finished.

If you have time remaining, here are the other modules that are part of this lab along with an estimated time to complete each one.  Click on 'More Options - Table of Contents' to quickly jump to a module in the manual.

 

 

Conclusion

Thank you for participating in the VMware Hands-on Labs. Be sure to visit http://hol.vmware.com/ to continue your lab experience online.

Lab SKU: HOL-1804-01-SDC

Version: 20180312-190912