Last week I wrapped up some configuration with the Cisco Nexus 1000V and was testing vMotion & Storage vMotion with 4 test VMs when I ran into a peculiar issue.
Preface: Here is a high-level overview of the hardware involved:
> HP c7000 Blade Enclosure
> 3x HP BL460c G6 ESXi Hosts [4.1 U1]
> 1x NetApp FAS2040 [NFS]
> Cisco Nexus 5k & 1000V switches provide networking
What seemed to be happening was that when a vMotion was initiated to migrate Guest1 to HOST1, it would hang at 78% and then fail with the ‘destination failed to resume’ error. I didn’t think much of it and initiated another vMotion, this time migrating Guest1 to HOST2, which was successful. “Interesting,” I thought, so I initiated a vMotion on Guest2 to migrate to HOST1; it went through the first time, no problem. I then initiated vMotions on two other guests (Guest3 & Guest4) and noticed that they both failed when migrating to HOST1.
I figured something had to have been mapped wrong during setup, so I popped over to ‘Configuration–>Storage’ on HOST1, 2 and 3 and compared exactly how each datastore was mapped [including case, and whether it was mapped by FQDN vs. direct IP]:
~ Guest1, 3 & 4 were sitting in the xxxNFS1 vol, mapped as XXXNETAPPA.XXX.XXX.COM
~ Guest2 was sitting in the xxxNFS2 vol, mapped with a direct IP
Everything appeared to be mapped exactly the same way on each host. I refreshed the storage configuration on HOST1 to make sure connectivity to my storage still existed (it did) and then compared the other hosts’ configurations. After about 30 minutes of troubleshooting I was staring at the vSphere datastore configuration on HOST1 and noticed that xxxNFS1 was now showing ‘xxxnetappa.xxx.xxx.com’ in lowercase; which was odd, because 30 minutes prior it had been capitalized, and this is a case-sensitive setting.
I fired up 3 PuTTY sessions to the hosts, ran ‘esxcfg-nas -l’ on each, moved the windows around to compare, and found my “peculiar” issue. My initial hypothesis was correct: the storage was mapped inconsistently across the hosts. [Guest2 migrated between all hosts flawlessly because it was sitting in the xxxNFS2 vol, which was mapped with the same direct IP on all hosts.]
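For illustration, the comparison looked roughly like this. The output below is a mock-up using the redacted names from this post (the exact ‘esxcfg-nas -l’ output can vary slightly between builds); the point is simply to line the hosts up side by side and look for any mismatch in how the NFS server is referenced:

```
# On HOST1 (illustrative output, names redacted)
~ # esxcfg-nas -l
xxxNFS1 is /vol/xxxNFS1 from XXXNETAPPA.XXX.XXX.COM mounted
xxxNFS2 is /vol/xxxNFS2 from 10.x.x.x mounted

# On HOST3 (same datastore, but the NFS server was entered in lowercase)
~ # esxcfg-nas -l
xxxNFS1 is /vol/xxxNFS1 from xxxnetappa.xxx.xxx.com mounted
xxxNFS2 is /vol/xxxNFS2 from 10.x.x.x mounted
```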
I don’t remember ever reading or hearing about this exact issue [my experience with virtualization is still quite young], and I found it interesting and worth sharing after my searches came up short. I was already aware of how case-sensitive these mappings are, but bug or not, in ESXi 4.1 U1 [it may affect additional versions], when you refresh the storage on one host, the vSphere Client will display the exact way volumes are mapped on that host in every other host’s storage configuration as well.

While looking at HOST1 and seeing the storage mapped in all lowercase [xxxnetappa.xxx.xxx.com], I remembered that when I started looking into this it had been mapped in all capital letters on HOST1. The last thing I had done before figuring this out was refresh the storage on HOST3, where xxxNFS1 was mapped in lowercase to ‘xxxnetappa.xxx.xxx.com’. That refresh caused HOST1 to report that it, too, was mapped in all lowercase, despite actually being mapped in all uppercase [verified via SSH]. I tested this on all the hosts, and the display changed every time I refreshed HOST1 or HOSTs 2 and 3.
The moral of the story: it may be a best practice to double-check via SSH or KVM/iLO on each host to verify how your storage is actually mapped before continuing with testing. It may save you some time.
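If you’d rather skip the window shuffling, a quick loop from a management box does the same comparison. This is just a sketch, assuming SSH is enabled on the hosts; the hostnames, datastore name and NFS server below are placeholders, not my actual environment:

```
# Compare NFS mounts across hosts before testing vMotion.
# Hostnames are placeholders; requires SSH enabled on each ESXi host.
for host in host1.xxx.xxx.com host2.xxx.xxx.com host3.xxx.xxx.com; do
  echo "== $host =="
  ssh root@$host esxcfg-nas -l
done

# If a host turns out to be mapped differently, remount that datastore with
# consistent case (evacuate or power off the VMs on it first), e.g.:
#   esxcfg-nas -d xxxNFS1
#   esxcfg-nas -a -o XXXNETAPPA.XXX.XXX.COM -s /vol/xxxNFS1 xxxNFS1
```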