Strange Hyper-V Issue (random connection issues)

Soldato (OP), joined 30 Sep 2005, 16,526 posts
Hi Everyone,

We've been having a really strange problem lately and I have no idea what's causing it.

We have the following equipment:



Dell M1000e blade chassis with 8 PE M640 blades

Blades all running Windows Server 2019, latest updates, BIOS and firmware

Intel X520 mezzanine network cards (latest v20 firmware and drivers)



The 8 servers are all joined in a Hyper-V cluster using a Dell SC4020 SAN for storage.



Looking in the failover cluster logs, we are getting errors every second due to I/O timeouts on the clustered storage. Doing some troubleshooting, I am seeing strange connectivity issues:

Every server can ping every other server by hostname and by IP (rules out DNS)

I can use PowerShell to test connectivity on various ports (RPC, etc.) and everything's OK (rules out the firewall)

RDP works from every server to every other server

nslookup reports everything correctly
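The per-port checks above can be sketched as a small script. This is only an illustration using Python's socket module (the actual checks were done with PowerShell); the demo connects to a local listener standing in for a reachable host, since the HV hostnames only exist on our network.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo against a local listener standing in for a reachable host.
# On site you would loop over HV1..HV8 and ports such as 135 (RPC),
# 445 (SMB) and 3389 (RDP).
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
print(port_open("127.0.0.1", srv.getsockname()[1]))  # True: port reachable
srv.close()
```

Running something like this from each host to each other host makes the "ping works but service X doesn't" pattern easy to tabulate.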



Opening Server Manager on each server lists random issues where some servers can't see others ("unable to connect to target"):

HV1<>HV2 FAILS

HV3<>HV7 FAILS

HV6<>HV8 FAILS

On these servers you can't browse to \\servername\c$, but you can "sometimes" browse via the IP address
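When the admin share works by IP but not by name, one quick sanity check is comparing what the name actually resolves to against the address that works. A minimal sketch (using "localhost" as a stand-in, since the HV names are internal to our network):

```python
import socket

def resolved_ipv4(name: str) -> set[str]:
    """All IPv4 addresses a name resolves to from this host's point of view."""
    return {info[4][0] for info in socket.getaddrinfo(name, None, socket.AF_INET)}

# Stand-in example; on site you'd compare resolved_ipv4("HV2") against the
# IP address that \\<ip>\c$ works with, from each host in turn.
print(resolved_ipv4("localhost"))
```

If the set differs between hosts, or contains a stale address, that would explain name-based access failing while IP-based access sometimes works.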



What is strange is that while HV1 and HV2 can't talk to each other, all the other servers can talk to both of them, and vice versa. For example, HV1 can talk to HV3 and HV3 can talk to HV1.



My gut tells me we have an issue with something on the backend networking side (switch, switch config, cabling, etc.).



Any ideas?
 
Soldato, joined 18 Oct 2002, 8,116 posts, The Land of Roundabouts
It's never DNS until it is DNS :D

Joking aside, I'm inclined to agree with your gut, though in-depth networking is not my strong point. It does sound like some switch/NIC issue; perhaps something is filling the ARP tables, or a TCP/UDP timeout issue (less likely if there is no firewall sitting between them all).
 
Soldato (OP)

Right mate, hope you've got some good ideas for me.

Spent another day looking into this, and have two network engineers working with me. Gone through all the switches and configs, done a firmware upgrade, reboots, etc.

Nada!

Anyway, five minutes ago I decided to give HV6 a different IP address. BANG! It's working. WTF! Now that HV6 is on its new IP, it can talk to HV8.

Could this be anything virtual switch / virtual adapter related, or have we still got network gremlins?

You'd think it was an IP conflict or something, but all the other hosts can talk to it fine when it's on its old IP.

We have two external companies looking into this, Dell and Microsoft, and they've never seen anything like it.
 
Soldato
Are the NICs/switches set up with failover/load balancing or bridges? Could be worth dropping it all down to a single pipe, so to speak, if there is capacity to do such a thing without impacting too much.

Am I reading it right that HV1 & HV2 can ping each other OK, no problem?
 
Soldato (OP)


Yeah, we are working on HV6 and HV8 at the minute as they are out of production, but the same applies to HV1 and HV2, then HV3 and HV7.

So HV6 and HV8 can ping and RDP to each other fine. What they can't do is browse \\HV6\c$. What we've since found is that changing the IP address on HV6 makes it work. I assume the same will fix the other two pairs.

We had this same problem two years ago, and just gave HV3 a new IP. Never found out what the cause was, but now it seems to have gotten worse.

The physical NICs aren't in a team through the OS. There is a virtual switch with teaming enabled, with the two physical NICs attached to it.

What I might do today is delete the v-switch on HV6, put the old IP address straight onto the physical adapter and see what happens.
 
Soldato (OP)
Done it... and it works!

I've deleted the v-switch and put the original IP address straight onto the physical adapter.

Now it can ping HV8.

:edit: A bit more troubleshooting between the physical and virtual setups. I think this is something to do with MTU sizes.
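A minimal sketch of why an MTU mismatch produces exactly this symptom pattern: small ICMP pings fit comfortably inside the smaller path MTU, while SMB and cluster traffic ride full-sized segments that get silently dropped. The 9000/1500 figures below are assumptions for illustration, not confirmed values from our setup.

```python
# Assumed numbers: hosts configured for 9000-byte jumbo frames while part
# of the switch path only carries standard 1500-byte frames.
HOST_MTU = 9000
PATH_MTU = 1500
IP_TCP_HEADERS = 40  # IPv4 header (20) + TCP header (20), no options

def segment_fits(payload_bytes: int, path_mtu: int = PATH_MTU) -> bool:
    """Would a single TCP segment with this payload survive the path?"""
    return payload_bytes + IP_TCP_HEADERS <= path_mtu

print(segment_fits(32))                          # ping-sized payload: True
print(segment_fits(HOST_MTU - IP_TCP_HEADERS))   # full jumbo segment: False
```

That asymmetry is why ping, RDP (mostly small packets) and nslookup all looked fine while bulk transfers like \\server\c$ browsing and cluster storage I/O timed out.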
 
Soldato (OP)
That's awesome, such a simple little tweak as well. Wonder how changing the IP helped in this instance; it's not a correlation I'd have associated with MTU settings.

That's the confusing part, lol. I'm going through all the servers making the change, and it already 'feels' much better.
 
Associate, joined 6 Aug 2015, 44 posts, Cornwall
MTU is a common problem, particularly when iSCSI is used (not sure if you are using it). I'd advise ideally not using iSCSI and not changing the MTU. Not because there's anything wrong with it as a technology as such; it's just more to remember. Typically you'll get multiple engineers over the life of a server, and some will not understand/remember and will break things. Whereas SAS (direct attach) and Fibre Channel just work solidly, are typically faster, and have more support-engineer exposure.
 