October 24, 2017

Accept button greyed out when trying to update VCSA

Background

Updating to new minor versions of VCSA is really simple since all the update functionality is built into the web interface living on port 5480 of the vCenter Server, also known as VAMI (vCenter Appliance Management Interface).

Problem

When trying to upgrade from 6.5.0.10000 (Build Number 5973321) to 6.5.0.10100 (Build Number 6671409), I wasn't able to click the Accept button because it was greyed out. Clicking the EULA link several times was of no help.

Solution

Switching browsers helped. I originally tried Chrome version 61.0.3163.100; switching to Firefox version 56 made the Accept button clickable and I could proceed with the update.
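
As a side note: if the Accept/EULA page misbehaves in every browser, minor updates can also be applied from the appliance shell using the software-packages utility. This is only a rough sketch (the available options vary between VCSA builds, so check software-packages --help first):

Command> software-packages stage --url
Command> software-packages list --staged
Command> software-packages install --staged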

September 10, 2017

Visibility of private VMware services on the public internet

Background

VMware services like ESXi hosts and vCenter are services you would normally place in your private networks; preferably not in your average internal network, but in a dedicated management network along with the other services you manage. VMs, on the other hand, are placed in other networks like internal networks, DMZ networks and similar.

Results

By using Shodan I was able to find 4644 (probable) vCenter Servers, i.e. servers with the vSphere Web Client on port 9443:

With the same search engine it's also easy to find computers that are hosting VMs (ESXi, Workstation, Player) by looking for hosts with the VMware Authentication Daemon (providing VNC) on port 902, and the number is quite astonishing:

Most of these systems are not identified by OS (only ~3k of ~200k), but I suspect that the big majority here are hosted products and not ESXi hosts. We can also tell by the version of the VMware Authentication Daemon that some of the systems are dated, running pre-2009 versions.
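
If you want to reproduce these numbers yourself, the Shodan CLI can return the totals directly. A small sketch, assuming the shodan CLI is installed and initialized with an API key (the query strings are my own approximations of the searches above):

shodan init YOUR_API_KEY
shodan count 'port:9443 "vsphere web client"'
shodan count 'port:902 "VMware Authentication Daemon"'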





We can even search for the self-signed VMware certificate that is installed by default by most VMware services:
By looking at the certificate information you're also able to get either the internal IP address or the local hostname of the service.
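
You don't even need Shodan to read that information; openssl can pull the certificate subject straight from an exposed host (the address below is just a placeholder):

echo | openssl s_client -connect 203.0.113.10:443 2>/dev/null | openssl x509 -noout -subject -issuer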

By monitoring these queries over some time I've observed that the number of systems reported changes from week to week by up to 20%, sometimes up and sometimes down.

By using Richard Garsthagen's tool https://github.com/AnykeyNL/vmware_scanner you can also reveal that many of these systems are very old.

Conclusion

That these systems are reachable from the internet may not seem like a big issue at the moment, as things may appear to be working as expected.

The main reason it is not recommended to expose these services is that they are the doorway to managing and controlling your entire virtual environment; all you need is a valid username and password. Those who have monitored the logs of internet-exposed systems know that automated systems will try to log in on a regular basis.

We also know that even though a service may be regarded as safe and have had no known security holes for many years, a vulnerability may still turn up at some point and potentially give people access without a valid username and password.

Many of the exposed systems seem to be very old, and we all know it is bad karma to leave an old, unpatched system open to the internet.

August 18, 2016

Configuring the HPE 6125XLG Ethernet Blade Switch for use in a VMware environment - part 2

Background
Many people are using FlexFabric for Ethernet (+FC) connectivity in their HP blade environments. For better functionality and control we've chosen to use HPE 6125XLG blade switches instead, and this post documents how we achieved this. It's interesting to note that the 6125XLG uses the exact same hardware as the FlexFabric-20/40 F8.

Problem
I've found that the documentation for the H3C line of switches is a bit confusing and sometimes wrong. Our switches use a command set known as Comware 7, while many examples out there are written for Comware 5.

Solution
We have configured our system with the following features:

  1. The switches are stacked and work as one big switch. See part 1 for a closer description.
  2. There are two 10GbE uplinks from each of these switches to two Cisco 6500 series switches.
  3. The trunk between the 6125XLGs and Cisco 6500 is setup with LACP.
  4. Spanning tree between the switches is configured to use RSTP.
  5. CDP has been set up between the switches and the servers.
  6. VMware ESXi is set up with a distributed switch using LBT + NetIOC.
  7. Logs are forwarded to Logstash.
  8. SNMP has been configured (for future use)
  9. NTP

There are two 6125XLG switches in the C7000 and each blade has one NIC connected to each of these switches. The two switches have four 10GbE ports connected to each other; these are normally used for stacking (IRF) and FCoE (you dedicate a pair for each). Each switch also has 8x 10GbE SFP+ ports and 4x 40GbE QSFP+ ports. It's recommended to use original HPE GBICs, but third-party GBICs have also proven to work nicely.
Logical view


1. Stacking

When you configure IRF you have four ports to choose from. You can use either two or four of these (you can dedicate two to FCoE if you need to). In this example we're using all four ports to aggregate the two switches into one large one. In H3C language this is called Intelligent Resilient Framework (IRF).
 irf mac-address persistent timer
 irf auto-update enable
 undo irf link-delay
 irf member 1 priority 10
 irf member 2 priority 1

irf-port 1/1
 port group interface Ten-GigabitEthernet1/0/17
 port group interface Ten-GigabitEthernet1/0/18
 port group interface Ten-GigabitEthernet1/0/19
 port group interface Ten-GigabitEthernet1/0/20
#
irf-port 2/2
 port group interface Ten-GigabitEthernet2/0/17
 port group interface Ten-GigabitEthernet2/0/18
 port group interface Ten-GigabitEthernet2/0/19
 port group interface Ten-GigabitEthernet2/0/20
#
interface Ten-GigabitEthernet1/0/17
 description IRF
#
interface Ten-GigabitEthernet1/0/18
 description IRF
#
interface Ten-GigabitEthernet1/0/19
 description IRF
#
interface Ten-GigabitEthernet1/0/20
#
interface Ten-GigabitEthernet2/0/17
 description IRF
#
interface Ten-GigabitEthernet2/0/18
 description IRF
#
interface Ten-GigabitEthernet2/0/19
 description IRF
#
interface Ten-GigabitEthernet2/0/20
 description IRF
#
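
Once both members are rebooted into the IRF fabric, a few display commands are handy to confirm that the stack formed as intended (Comware 7 syntax, from memory, so verify against your release):

display irf
display irf configuration
display irf link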

2. Trunk (STP, LACP, 4x 10GbE, CDP)

On each of the two 6125 switches we establish a trunk facing the core Cisco switches. In our example we decided to use RSTP for spanning tree. We use CDP instead of LLDP on our external-facing interfaces.
 stp mode rstp
 stp global enable
#
interface Bridge-Aggregation1
 port link-type trunk
 port trunk permit vlan all
 link-aggregation mode dynamic


Interfaces on switch 1:
interface Ten-GigabitEthernet1/1/5
 port link-mode bridge
 description Trunk 6500
 port link-type trunk
 port trunk permit vlan all
 lldp compliance admin-status cdp txrx
 port link-aggregation group 1
#
interface Ten-GigabitEthernet1/1/6
 port link-mode bridge
 description Trunk 6500
 port link-type trunk
 port trunk permit vlan all
 lldp compliance admin-status cdp txrx
 port link-aggregation group 1
Interfaces on switch 2:
interface Ten-GigabitEthernet2/1/5
 port link-mode bridge
 description Trunk 6500
 port link-type trunk
 port trunk permit vlan all
 lldp compliance admin-status cdp txrx
 port link-aggregation group 1
#
interface Ten-GigabitEthernet2/1/6
 port link-mode bridge
 description Trunk 6500
 port link-type trunk
 port trunk permit vlan all
 lldp compliance admin-status cdp txrx
 port link-aggregation group 1
#
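
To check that the aggregation and spanning tree come up as expected towards the Cisco side, the following display commands are useful (again Comware 7 syntax, from memory):

display link-aggregation verbose Bridge-Aggregation 1
display stp brief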

3. Interfaces facing ESXi hosts

Each of the ESXi hosts has a configuration for each of its NICs, one on each switch. Flow control is enabled by default on all ESXi NICs, so we also enable it on the switch ports. Since we are using LBT + NetIOC we are not using EtherChannel/LACP on the ESXi-facing ports (which most examples provided by HPE do).
interface Ten-GigabitEthernet1/0/1
 port link-mode bridge
 description xyz-esx-01
 port link-type trunk
 port trunk permit vlan all
 flow-control
 stp edged-port
 lldp compliance admin-status cdp txrx


interface Ten-GigabitEthernet2/0/1
 port link-mode bridge
 description xyz-esx-01
 port link-type trunk
 port trunk permit vlan all
 flow-control 
 stp edged-port
 lldp compliance admin-status cdp txrx
#
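
On the ESXi side you can verify that the CDP information from the switch is actually received. A rough sketch, assuming vmnic0 is one of the uplinks connected to the 6125XLG:

vim-cmd hostsvc/net/query_networkhint --pnic-name=vmnic0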

4. Management (clock, syslog, SNMP, SSH, NTP)


#
 clock timezone CET add 01:00:00
 clock summer-time CETDT 02:00:00 March last Sunday 03:00:00 October last Sunday 03:00:00
#
 info-center synchronous
 info-center logbuffer size 1024
 info-center loghost 10.20.30.40 port 20514
#
 snmp-agent
 snmp-agent local-engineid 800063A280BCEAFA031F8600000001
 snmp-agent community write privatecleartextpassword
 snmp-agent community read publiccleartextpassword
 snmp-agent sys-info version all
#
 ssh server enable
#
 ntp-service enable
 ntp-service unicast-server 1.2.3.4
#
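
A few display commands to sanity-check the management settings afterwards (Comware 7, from memory):

display clock
display ntp-service status
display logbuffer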

Conclusion

Finding the right syntax we needed to configure this switch was a bit challenging, as many of the examples we found didn't work right out of the box since the command set differs slightly between versions. After overcoming the initial obstacles we were able to configure the switch exactly as we needed.


March 18, 2016

Configuring the HPE 6125XLG Ethernet Blade Switch for use in a VMware environment - part 1

Background
In an HPE C7000 blade system a common method of accessing the network is through FlexFabric/Flex-10 modules. These modules are not fully featured switches, but they still have some switch features built in. An alternative is to use a real switch such as the HPE 6125XLG or the Cisco Nexus B22HP FEX.

A real switch has many technical benefits over a FlexFabric system, but it takes a different approach to configuration than FlexFabric (which targets server admins as its main audience and is often hated by people who know networking). The 6125XLG has a CLI with a feel similar to IOS, though not as similar as NX-OS or ProCurve. The 6125XLG is the heritage of a cooperation between 3Com and Huawei (often referred to as H3C) that HPE bought a few years back, and its CLI is referred to as Comware. It's a blade-integrated switch with 10GbE facing the blade servers and both 10GbE (SFP+) and 40GbE (QSFP+) uplinks that can be used to connect to the network.


Problem
One problem I found while trying to configure this switch was the lack of good documentation. There is a lot of documentation available, but much of it is for Comware v5 while the 6125XLG uses Comware v7. The 6125XLG Fundamentals Configuration Guide stated that it was important to use the command "line class aux" as part of the stacking (IRF) process, but this command was not available on my switches.
[HP]line class aux
^
% Unrecognized command found at '^' position.
It turned out that the firmware that came preinstalled had a bug that prevented you from stacking the two switches without the use of an RS232 cable. The HPE forums had many helpful posts, but posting there didn't get me any answers from active users. I did however find a couple of blog posts that kept me going even though they didn't really provide a solution.

Solution
Upgrading the firmware of both switches from Release 2306 to Release 2422P01 before trying to do anything else solved this problem. The firmware upgrade is described at length in the firmware download package. I chose to upload the firmware image to the switches using FTP. I could now stack my switches according to the Fundamentals Guide (and this HPE Support article: HP 6125g Switch Series - How to Configure Intelligent Resilient Framework (IRF)).
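
For reference, the upgrade itself follows the usual Comware 7 pattern once the image has been uploaded to the flash of each switch. This is only a rough sketch with a made-up .ipe filename; follow the steps in the release notes that ship with the firmware package:

boot-loader file flash:/6125xlg-r2422p01.ipe slot 1 main
reboot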



February 16, 2016

Accessing the Global Knowledge labs from Ubuntu Linux

Background
While attending training I tried accessing the labs from my BYOD computer (Bring Your Own Device). I was warned before the training that the Global Knowledge labs worked best with an OS that supported Internet Explorer: "please use an Operating System that supports the Internet Explorer Browser. We have found that Mac Books do not work well when connecting to this environment".

Problem
A while back I was able to make the labs work from my personal Linux desktop, but it seems that the labs have changed and my old method no longer works.
getaddrinfo: Name or service not known
[17:22:43:165] [20836:1234650880] [INFO][com.freerdp.core.gateway.tsg] - TS Gateway Connection Success
[17:22:44:030] [20836:1234650880] [ERROR][com.freerdp.core.capabilities] - expected PDU_TYPE_DEMAND_ACTIVE 0001, got 0007
[17:22:44:030] [20836:1234650880] [ERROR][com.freerdp.core] - ERRINFO_SERVER_INSUFFICIENT_PRIVILEGES (0x00000009):The user cannot connect to the server due to insufficient access privileges.
[17:22:44:031] [20836:1234650880] [ERROR][com.freerdp.core.capabilities] - expected PDU_TYPE_DEMAND_ACTIVE 0001, got 0007
[17:22:44:047] [20836:1234650880] [ERROR][com.freerdp.core.rdp] - DisconnectProviderUltimatum: reason: 1

Solution
The solution was, however, quite simple. The Remote Labs portal has information about access from a variety of devices, and I also got a document describing some NTLMv2 requirements. I used Firefox and logged in to the portal. When trying to connect I was prompted to download an .rdp config file, which I chose to save in the default location.

Logging in to the portal


Launch the Remote Labs!

Save file

Now I could use this file as an input to freerdp (version 1.20) and connect without problems by using the command:
xfreerdp cpub-vcloud-launcher-RemoteApps-CmsRdsh.rdp /d:gklabs /u:username /p:password  -nego




December 14, 2015

SSO is not initialized

Background
After upgrading vCenter from 6.0 to 6.0 U1 we got the appliance management GUI (VAMI) back. This HTML5-based GUI lets you manipulate certificates and several other things that you could only configure from appliancesh before U1.

Problem
After the upgrade we experienced an error message within this GUI: "SSO is not initialized". This system was running an external PSC and authentication was working nicely, as it should. We didn't quite understand why this error message was there.

Solution
We had a support case going on this problem for a few weeks. We were repeatedly told to repoint our SSO until they finally told us that this error message was in fact a bug: "...this is something that we are looking to rectify as this information should not be shown when using an external PSC.
Our Engineering department are aware of this are looking to make a graphical change to this.
With regards to your environment however, I can confirm that SSO is functioning correctly and you are not experiencing an issue with SSO at this time."


November 21, 2015

Replacing a vSAN caching disk

Background
Replacing disks in vSAN can be a bit less smooth than in some of the traditional storage arrays. For the normal disks used for storage it's quite easy, but for disks used for caching it can be a slightly different story. If you get a dead caching disk you should remove it from the configuration before removing it physically from the server; otherwise you will run into the problems described in this post.

Problem
Once the disk has been replaced you will be unable to delete the disk or the disk group, both from the vSphere Web Client and from RVC. The reason this fails is that the system can't find the disk. The disk will show up with a status of "Dead or Error" or "Absent" (depending on where you look).

"esxcli vsan storage list" will show all the other disks belonging to vsan on that server, but not the missing SSD disk.

Listing the disks in RVC with the command vsan.host_info shows that the disk is in an Absent state:


Trying to use RVC with "vsan.host_wipe_vsan_disks -f" to remove the disk also fails:

Solution
A solution that did work in the end was to use partedUtil to remove the partitions from all the spinning disks of this disk group. partedUtil is a very dangerous tool, so if you have multiple disk groups on your host (like we had) you must make sure you're working with the correct disks. We found it best to locate the naa IDs of the disks in the failed disk group from the web client.
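
Roughly, for each spinning disk in the broken disk group we listed the partition table and then deleted both partitions. The naa ID below is a made-up example; double-check every ID against the web client before deleting anything:

partedUtil getptbl /vmfs/devices/disks/naa.50000f000b000a0d
partedUtil delete /vmfs/devices/disks/naa.50000f000b000a0d 2
partedUtil delete /vmfs/devices/disks/naa.50000f000b000a0d 1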

After removing both partitions from all the disks belonging to this disk group, the disk group was gone and we could create a new one using our new SSD and all the spinning disks.

Addendum
The official way to solve this problem is to remove the disk from the disk group while it's still present in the server. In our case that was not possible. The SSD had for some unknown reason entered "Foreign mode", which is a Dell disk controller feature. We had to enter the PERC controller BIOS settings (from POST), clear the foreign config and also configure the disk in the controller before we could use it again. Because of this the disk came up with a new naa ID even though we didn't really have a failed disk.
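
For completeness: when the failing cache device is still visible to the host, it (and thereby the whole disk group) can be removed cleanly with esxcli before the hardware is pulled. A sketch with a made-up naa ID:

esxcli vsan storage remove -s naa.50000f000b000a0d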