AMD Epyc has problems when you max out PCIe lanes

Soldato
Joined
1 Apr 2014
Posts
18,610
Location
Aberdeen
Linus Tech Tips has an interesting video about the problems they had when they maxed out the PCIe lanes on their EPYC server with umpteen NVMe drives.


TL;DR: there are major performance issues; it's all too fast and the bandwidth is overloaded. I'm wondering if Intel solutions have the same problems?
 
Caporegime
Joined
17 Mar 2012
Posts
47,559
Location
ARC-L1, Stanton System
Linus Tech Tips has an interesting video about the problems they had when they maxed out the PCIe lanes on their EPYC server with umpteen NVMe drives.


TL;DR: there are major performance issues; it's all too fast and the bandwidth is overloaded. I'm wondering if Intel solutions have the same problems?

Put simply, they built a system with mass storage that is faster than RAM.

You need to watch the first part of this build video to understand what is going on. It's simply a case of EPYC CPUs having so many PCIe lanes that it's possible to RAID enough NVMe drives together for the array to be faster than DDR4 can keep up with, causing transfers between memory and drives to stall out, because the memory cannot match the speed of the drives.

They RAIDed 24 NVMe drives and were pushing storage transfer rates of nearly 30GB/s. That's faster than the memory can keep up with, or at least it's roughly equivalent to DDR4 at 3800MT/s.

Intel's CPUs don't have anywhere near as many PCIe lanes, so it's not possible to RAID 0 anywhere near as many drives, and Intel CPUs can't possibly get anywhere near the speed needed to make the memory stall.

If you actually watch the video, they explain what the problem is; the cure is to slow the transfer rates down, to a speed that's probably still faster than Intel can manage.

This is the video where they built this monster....

Edit: if they used the page file they would get higher memory performance than using RAM :D
Edit 2: the CPU has 128 PCIe 4.0 lanes; 24 NVMe drives running at 4 lanes each uses 96 PCIe lanes. Memory is 4TB, 8-channel, at 3200MT/s.
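
For anyone who wants to sanity-check those figures, here's a rough back-of-envelope sketch (theoretical peaks only, using the usual ~2GB/s per PCIe 4.0 lane and 8 bytes per DDR4 transfer; the ~30GB/s they actually measured is well below the array's theoretical ceiling and works out to roughly one channel of DDR4-3800):

```python
# Back-of-envelope theoretical peaks; real-world throughput is lower.
PCIE4_GBPS_PER_LANE = 2.0        # ~2 GB/s per PCIe 4.0 lane, before overhead
BYTES_PER_TRANSFER = 8           # 64-bit DDR4 channel = 8 bytes per transfer

drives, lanes_per_drive = 24, 4
array_peak = drives * lanes_per_drive * PCIE4_GBPS_PER_LANE        # GB/s
mem_8ch_3200 = 8 * 3200e6 * BYTES_PER_TRANSFER / 1e9               # GB/s
one_ch_3800 = 3800e6 * BYTES_PER_TRANSFER / 1e9                    # GB/s

print(f"24x NVMe @ PCIe 4.0 x4, theoretical peak: {array_peak:.0f} GB/s")
print(f"8-channel DDR4-3200, theoretical peak:    {mem_8ch_3200:.0f} GB/s")
print(f"One channel of DDR4-3800:                 {one_ch_3800:.1f} GB/s")
```

These are theoretical ceilings rather than benchmark numbers; the ~30GB/s in the video is measured throughput with filesystem and copy overhead on top.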

 
Last edited:
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
More to the point: 24 NVMe drives in RAID 0, who takes that sort of risk in the real world? At the point where you're even considering it you should be considering SAN storage, at which point the bottleneck becomes the FC network.

It's a good technical exercise, but really very little more.
 
Caporegime
Joined
17 Mar 2012
Posts
47,559
Location
ARC-L1, Stanton System
I thought they put ZFS and RAID 5 on it or something, not RAID 0.

Initially yes, but the software couldn't deal with it, so they switched to RAID 0.

More to the point.. 24 nvme in raid 0, who takes that sort of risk in the real world?

Apparently AWS (Amazon Web Services) have been trying something similar and running into the same transfer rate bottlenecks.

It's useful if you're remote video editing, provided you have a 40Gb LAN to keep up. lol
 
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
Initially yes, but the software couldn't deal with it, so they switched to RAID 0.



Apparently AWS (Amazon Web Services) have been trying something similar and running into the same transfer rate bottlenecks.

It's useful if you're remote video editing, provided you have a 40Gb LAN to keep up.

I seriously doubt any hyperscaler would be using directly attached storage; it just makes scaling up very difficult. Even a business like mine doesn't bother: we use a single lane on Rome, and that houses the SD card the machine boots from :)
 
Caporegime
Joined
17 Mar 2012
Posts
47,559
Location
ARC-L1, Stanton System
I seriously doubt any hyperscaler would be using directly attached storage; it just makes scaling up very difficult. Even a business like mine doesn't bother: we use a single lane on Rome, and that houses the SD card the machine boots from :)

Right, but if you have the tools to push the boundaries of what's possible, why not play with it, you know, for science... one day it might become something you as a company can deploy, so why not have a look?
 
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
Right, but if you have the tools to push the boundaries of what's possible, why not play with it, you know, for science... one day it might become something you as a company can deploy, so why not have a look?

I'm not knocking it as a technical exercise, more that it's just not how hyperscale or anybody else works (that I know of) right now. Imagine deploying thousands of EPYC Rome servers with disks directly attached and not common to all machines. Imagine the carnage!! What happens if you lose a host? That's a single point of failure for all attached disks. Pretty risky stuff.
 
Last edited:
Caporegime
Joined
17 Mar 2012
Posts
47,559
Location
ARC-L1, Stanton System
I'm not knocking it as a technical exercise, more that it's just not how hyperscale or anybody else works (that I know of) right now. Imagine deploying thousands of EPYC Rome servers with disks directly attached and not common to all machines. Imagine the carnage!!

Point taken; I'm guessing they're just testing the bandwidth.
 
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
Point taken; I'm guessing they're just testing the bandwidth.

Still, though, those speeds are insane!! It's impressive that we're at the point where the memory subsystem can't keep up with I/O. It's pretty clear what they can do to make it even better, and banging against the limits of, well, everything is pretty awesome!!
 
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
The CPU was struggling to keep up with the parity calculations too, and that's on a 24c/48t part.

What's funny is you would saturate even 128Gb FC, which has a max throughput of something like 26gb/s. You would effectively need to load-balance some 4 connections per controller, per server, to get close to the available throughput :D

That is insane and expensive!! Expensive as in something like 10k for a switch that even gets close (for just one server) :D
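
To put rough numbers on that, here's a quick sketch of how many FC ports you would need just to carry the ~30GB/s the array was pushing. The per-generation figures are the approximate per-direction throughputs commonly quoted, so treat the whole thing as ballpark (it ignores protocol overhead and multipathing details):

```python
import math

storage_gbps = 30.0   # roughly what the array was pushing, in GB/s, per the video

# Approximate usable per-direction throughput per FC port (ballpark figures).
fc_ports = {"32GFC": 3.2, "64GFC": 6.4, "128GFC": 12.8}   # GB/s per port

for gen, per_port in fc_ports.items():
    needed = math.ceil(storage_gbps / per_port)
    print(f"{gen}: ~{per_port} GB/s per port -> {needed} port(s) to carry {storage_gbps} GB/s")
```

Even on the fastest generation you are bonding multiple links per server before the fabric stops being the bottleneck, which is the point about cost above.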
 
Soldato
Joined
5 Oct 2009
Posts
13,835
Location
Spalding, Lincs
More to the point: 24 NVMe drives in RAID 0, who takes that sort of risk in the real world? At the point where you're even considering it you should be considering SAN storage, at which point the bottleneck becomes the FC network.

It's a good technical exercise, but really very little more.

They use it for editing their 8K RED footage with multiple editors, so it's being put to good use. There's only a small risk involved: in the worst case they wouldn't lose anything hugely important, as it's not being used for storage, just active projects.

The CPU was struggling to keep up with the parity calculations too, and that's on a 24c/48t part.

They even upgraded it to a 64-core CPU, then changed to a 32-core CPU in the end.

It's seriously impressive how far storage has come lately. From always being the massive bottleneck in a system to being bottlenecked by the system, insane really.
 
Soldato
Joined
28 May 2007
Posts
18,241
For this kind of workload a dual-socket setup would probably be cheaper and faster. I'd look at a pair of EPYC 7262s or 7302s.
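
If the appeal of dual socket here is memory bandwidth, a rough sketch of the theoretical numbers (assuming all 8 DDR4-3200 channels per socket are populated; whether the array traffic actually spreads nicely across both NUMA nodes is a separate question):

```python
# Rough theoretical peaks only; real-world numbers depend heavily on NUMA
# placement, so treat this as a sketch of why a second socket could help.
DDR4_3200_PER_CHANNEL = 3200e6 * 8 / 1e9     # ~25.6 GB/s per channel

single_socket = 8 * DDR4_3200_PER_CHANNEL    # EPYC Rome: 8 channels per socket
dual_socket = 16 * DDR4_3200_PER_CHANNEL     # 2P: 16 channels, if both populated

print(f"1P, 8 x DDR4-3200:  ~{single_socket:.0f} GB/s theoretical")
print(f"2P, 16 x DDR4-3200: ~{dual_socket:.0f} GB/s theoretical")
```

In practice cross-socket traffic over the inter-socket links would eat into that, so drive and process placement would matter a lot.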
 
Man of Honour
Joined
30 Oct 2003
Posts
13,251
Location
Essex
For this kind of workload a dual-socket setup would probably be cheaper and faster. I'd look at a pair of EPYC 7262s or 7302s.

That's interesting. I wonder if the same number of drives in a dual-socket system would alleviate the issue. I'm really going to have to watch the video, aren't I? Commenting on here when I haven't even looked. Noob.

Do we know what controller they were using?
 