Home Server Failure That Ended Up Working: Custom-Built EPYC, SSD & 100GbE

Stragalet
14 min read · Sep 18, 2022


Is this a guide?

No, it is not. This is the story of how we upgraded our Freelancer Unit's home server to a custom build: TrueNAS, an EPYC CPU, ECC memory, SSD storage and some 100GbE Mellanox CX5 NICs…

At least the objective is clear… is it?

We wanted speedy performance (SSD) and MORE than 10GbE network speed if possible, while keeping a relatively big storage pool and without emptying our bank account :D

What do we do with all that?

We are a unit of artists doing commercial work for advertising, videogames, anime and movies, and lately we have been pushing hard on our personal project.
Look development requires a fast pace, so waiting for files to copy is not cool.

Our starting point

We came from a QNAP TS-453Be (around $500 at the time), specs as follows:

  • 4x 2TB Samsung 860 EVO SSD (SATA3)
  • QXG-10G1T 10GbE PCIe NIC (Aquantia chip)
  • 4GB RAM (2x 2GB)
The AT-AT and the hidden tanuki protect the server… the floppy is just for the retro looks, ok?

We called our little server QuuChan.

This is the card that was inside: the QXG-10G1T 10GbE PCIe NIC (Aquantia chip)

The main problem with this setup was limited storage. I believe we had it running in RAID5 to balance speed and capacity, with 2 external backups on spinning rust, so the total usable space was less than 6TB. On top of that, the PCIe slot for the NIC was limited to a “miserable” Gen2 x2, so the maximum performance we could get on the network was around 700MB/s. Not slow, but we wanted to improve.
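For the curious, a quick back-of-the-envelope sketch of where that ~700MB/s ceiling comes from (rough numbers; real throughput also depends on protocol overhead and the NIC itself):

```python
# Rough ceiling of the PCIe Gen2 x2 slot the QNAP gives its NIC.
# PCIe Gen2 runs at 5 GT/s per lane with 8b/10b encoding,
# which leaves ~500 MB/s of usable bandwidth per lane, per direction.
gen2_per_lane_mb_s = 5e9 * (8 / 10) / 8 / 1e6   # 500.0 MB/s
lanes = 2
link_ceiling = gen2_per_lane_mb_s * lanes        # 1000.0 MB/s theoretical

# Knock off packet/protocol overhead (roughly 20-30%) and you land right
# around the 700-800 MB/s we were actually seeing over the network.
print(f"theoretical: {link_ceiling:.0f} MB/s, realistic: ~{link_ceiling * 0.75:.0f} MB/s")
```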

First plan.

We wanted to upgrade to another QNAP to play it safe: custom builds can be a hassle to maintain (bugs, errors, problems…), and we are, in the end, artists who don't want to worry about all that stuff.

So we searched and found that we wanted something like this.

We didn't need the full HDD-type bays, since we planned to fill it with SSDs, but well, the plan changed fast anyway, soo...

Change of plans.

We were already more or less settled on the pricey QNAP, because the risk of a “custom build” felt too high.

But a “wild friend appeared”, with lots of experience in the server field working for really big companies… He started recommending stuff, and from one day to the next we were convinced to at least give the custom-built server a try.

New plan!

The new plan was that, for almost the same price as the QNAP, we would get more storage, more speed, more CPU power! All good news, right? Wait and see…

This new plan kicked off an endeavour that lasted for months. I don't even remember precisely how many… I don't even want to remember…

First things first.

Let's start by making a list of what we want, or rather what we need, because… sometimes that is not the same thing, right? I might want all PCIe Gen4 NVMe 4TB drives, but… do I need them? Would I even be able to squeeze the juice out of them?
After lots of research we settled on an EPYC CPU (8 cores), 64GB of ECC memory, some 4TB Samsung 860 EVO SSDs… and of course the rest of the stuff we would need: motherboard, power supply and so on. We gave up on the fancy stuff like the NVMe PCIe Gen4 cards that HighPoint sells… way too expensive anyway…

isn't it beautiful?…

Look at this transfer speed!!!!

Look at these speeds!!

So I forced myself to forget about this beast and went back to the real world.

Shopping Time!

Oh boy… it was C*vid times and the market was crazy… We had to buy parts from different countries, with lots of delays, and we were even forced to upgrade the CPU because we couldn't find the one we wanted… at double the RRP. Still, the total came in below a fully equipped QNAP server, with far more expansion potential and raw power.

The shopping went like this; you will see that some parts were upgraded from the earlier list:

  • AMD EPYC 7352
    24 cores, 128 PCIe lanes, 155W, 128MB cache, around ¥222,000 ($2,000 approx.) at the time (a totally overkill CPU, but we couldn't find any other…)
    We really wanted the EPYC 7252, 8c/16t, for around ¥88,000 ($700 approx.)
  • ASRock Rack ROMED8-2T
    A beast of a motherboard with 7 PCIe slots, each a full Gen4 x16. We didn't need that much, but yeah, let's go full crazy. The price difference with the smaller boards wasn't that big, and maybe in the future we can plug in some HighPoint full-speed NVMe cards… aaaahhh~ (dreaming is free!)
    This MoBo was around ¥90,000 ($800 approx.)
That is a lot of expansion slots…
  • Samsung 860 EVO, 6x 4TB (SATA3, yeah… the slow SATA3… sniff sniff)
    People will ask why we went EVO and not PRO grade, or some Kioxia enterprise drive… well well well… I just trust my friend, ok? Remember, he is a pro :)
    So we would have around 22TB of raw space in the main volume without RAID (although we did use RAID in the end), plus the 4x 2TB 860 EVO we already had for libraries, assets, system stuff and so on.
    That is around 22TB + 7.2TB of raw storage. (I'm discounting a bit because a “2TB” drive usually lets you write around 1.8TB; see the quick numbers right after this list.)
    Hopefully pooling lots of slow SATA3 SSDs will give a boost in read/write performance.
    We were veeeery lucky with the SSDs and found an offer at ¥40,000 each ($350 for 4TB!)
    2x 970 EVO Plus for OS redundancy.
An expensive pile of SSDs
  • 128GB of ECC RAM. TrueNAS (ZFS) uses RAM as a read cache for data, so it acts like a booster: the more, the better. ECC was also important for system stability (and it was actually cheaper than non-ECC at the time).
Well, that is quite some RAM, but we are used to it… we are designers, so we devour RAM.
  • Noctua coolers “for the win”, silent stuff.
  • Seasonic Focus 800W power supply, nice and efficient.
  • We didn't need a graphics card, as the motherboard has one built in! Woah, that is convenient.
  • And my favourite, the IcyDock; we swapped its fans for Noctuas though.
Look at this beauty! It can fit 8 SATA3 SSDs.
Look how compact this is!! It is a single 5.25-inch bay!
  • Later on we had to purchase a second-hand PCIe expansion card because the motherboard didn't have enough SATA connectors: an LSI 9217-8i (SAS2308) 6Gb/s HBA/RAID card.
  • For the case we ended up using an old Antec we had… yeah… nothing fancy.
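About the “discounting a bit” note in the SSD item above: drive makers count decimal terabytes while the OS reports binary tebibytes, which is why a “4TB” drive shows up as roughly 3.6TB. The quick numbers:

```python
# Drive vendors sell decimal terabytes (10^12 bytes); operating systems
# usually report binary tebibytes (2^40 bytes). That gap is the "missing" space.
def usable_tib(advertised_tb: float) -> float:
    return advertised_tb * 1e12 / 2**40

print(f"4TB drive -> {usable_tib(4):.2f} TiB")            # ~3.64
print(f"2TB drive -> {usable_tib(2):.2f} TiB")            # ~1.82
print(f"6x 4TB    -> {6 * usable_tib(4):.1f} TiB raw")    # ~21.8, the '22TB'
print(f"4x 2TB    -> {4 * usable_tib(2):.1f} TiB raw")    # ~7.3, the '7.2TB'
```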

By the way, you can't use those two extra ports on the motherboard for SATA drives, even if you buy an adaptor. I tried… it didn't work; it seems they are only for some fancy NVMe PCIe drives. I learned that the hard way so you don't have to. :D

NICs… oh boy, what a trip.

You can skip this part, but it is the fun one… “NIC working, FINALLY!” is the heading to jump to if you want to skip ahead :)

In the previous section I forgot to mention one important thing… what about the NICs!? I said 100GbE!! Where are they?

We had to research what all those QSFP28 connectors meant and which cables we needed… Do we go 40GbE? 100GbE?
Can we connect 2 computers directly, without a switch (100GbE switches are very expensive)? Can we take one 100GbE NIC and split it into 4x 25GbE?
Too many questions. We solved some; others are still blurry…

Well, this was more of an experiment, so we went ahead anyway.
I really wanted to try 100GbE, so the plan was to connect at least 1 PC directly to the server at 100GbE while the rest kept using the 10GbE switch and RJ45 gear we already had set up.
We searched eBay and found 3 Mellanox ConnectX-5 100GbE NICs at a good price, about $750 for all 3.
We also got a 100GbE copper cable (DAC), super thick and short, from FS.com.

The three musketeers: 100GbE Mellanox CX5
DAC cable from the people at FS.com

After some time we got them and… surprise, surprise, only one NIC worked… What a fail! We can't test any speed if only one of them works!
I had never used Linux in a serious way before; I didn't even know how to list a directory, really…

MFE_NO_FLASH_DETECTED

I kept getting this error every time I tried to flash the NIC firmware. The card was somehow recognized by the system, but it just wouldn't work.
I knew I wasn't doing something wrong, because one of them, the exact same model, worked; well, at least it was detected properly by the system. We couldn't really test it with only one working card anyway…
I ended up requesting a refund for the two non-working cards.
I guess… back to the 10GbE plan then.

Surrender!? never!!

I had made up my mind to go back to 10GbE when, some days later, that message suddenly popped into my head again… MFE_NO_FLASH_DETECTED… I was inspecting the cards on my table when it hit me!

This is the working NIC on the left, and the non-working NIC on the right.

Left: working NIC. Right: non-working NIC… of course Linux can't find the flash memory… it is NOT there!!!

My theory is that these cards came from some very large company with its own custom firmware, and how do you stop other companies from using your proprietary firmware? Yeah… destroy the chip. That is just my guess… I really don't know.

Both of the broken NICs had the same problem: a missing flash chip… that had been pulled off in a less-than-delicate way.

So the next question was… can something be done about it?

Back to research…

At this point my partner already pretty much hated me; every time something went off-plan we had to research it on the internet for hours…

The chip!

But we had the refund money for the NICs… What if I bought a soldering kit and the chip itself (a Winbond 25Q128FVSG, $5 for 5 of them) and fixed it?

We then asked in some specialist forums and people were kind but realistic: “it is a very complicated procedure, you need to scrape the board, find the pads, bridge them… and even if you succeed without burning anything else in the process, you possibly won't be able to flash the chip anyway, as those chips are usually programmed with a special tool before being soldered.”

Well that didn’t give me much room for hope…

We didn't come all this way for nothing… now we play!

So, refund money in hand, we contacted a Japanese shop specialized in this kind of soldering… The price was around ¥20,000 for both cards, if I recall correctly.
So… we went for it and sent them in for repair.

Will it work after all?

Well, we got the cards back with a very fine, professional-looking fix.

It had some protective glue or something around…

I was personally super excited! This time it will work!! But, long story short: it didn't…

No valid FS4 image found.

A new error appeared… Burning FS4 image failed… No valid FS4 image found…
Again, I'm a complete noob in Linux, so… I guess we are doomed.

The power of Linux.

But after some more research… again… I found that the problem was the flashing tool trying to verify the previous firmware image before burning the new one. I just needed to stop it from checking and… there is actually an option that does exactly that!
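For reference, here is roughly what that kind of flash looks like with the mstflint tool, where the non-failsafe switch (-nofs) is the option that skips relying on the existing image. The device address and firmware file are placeholders, and I'm not claiming these are the exact commands from that night:

```python
# Rough sketch of flashing a ConnectX-5 with mstflint (placeholder device
# address and firmware file; not necessarily the exact flags we used).
import subprocess

DEVICE = "0000:41:00.0"        # PCI address of the NIC (placeholder)
FIRMWARE = "fw-ConnectX5.bin"  # official firmware image (placeholder)

# Query the card first; a blank or missing flash is where errors like
# MFE_NO_FLASH_DETECTED and "No valid FS4 image found" come from.
subprocess.run(["mstflint", "-d", DEVICE, "query"], check=False)

# Burn in non-failsafe mode: -nofs tells mstflint not to verify the
# previous image on the flash before writing the new one.
subprocess.run(["mstflint", "-d", DEVICE, "-i", FIRMWARE, "-nofs", "burn"], check=True)
```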

Oh boy…Oh boy…

NIC working, FINALLY!

That is Windows, and those are the 2 Mellanox cards! :D
It worked… I could flash the firmware, and Windows recognized the card without any trouble.

I still had to do some extra tweaking in Linux because the MAC address was set to all zeroes on both of the repaired cards, which could cause a conflict, but it felt easy compared to everything else we had been through. I'm starting to get comfy in Linux after all.
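I won't swear by the exact commands after all this time, but writing a real MAC back onto a card is typically done with mstflint's “set guids” command; a hypothetical sketch with placeholder values:

```python
# Hypothetical sketch of giving a card a non-zero MAC via mstflint's
# "set guids" command; device address and MAC are placeholders, and the
# exact syntax can differ between firmware generations.
import subprocess

DEVICE = "0000:41:00.0"     # PCI address of the NIC (placeholder)
NEW_MAC = "0002c9123456"    # chosen MAC address (placeholder)

subprocess.run(["mstflint", "-d", DEVICE, "-mac", NEW_MAC, "sg"], check=True)
```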

A first test, from a Gen3 NVMe drive to the striped pool (no redundancy) in the server.

A preliminary test just for fun. For comparison, 10GbE tops out at around 1,200MB/s, so this is well beyond that: almost local-drive speed through a DAC cable connecting one Mellanox CX5 to another.

We need supernatural protection now.

We needed extra protection, and I'm not talking about a firewall or anything like that… After everything that happened I felt like I should give thanks to the gods, so we went to a temple in Arashiyama (Kyoto) that is famous for protecting anything electricity-related… they even have omamori (protective talismans) in SD card format.

Arashiyama Omamori

This is going to stay close to our server once it is completed.

Now, the software part.

Now that the hardware was complete, my friend tested the Mellanox link with some fancy Linux tools to check its performance.
We got a peak of 98Gb/s, not bad! No idea exactly what he used… lots of instances of something running in parallel to saturate the connection, first sending zeroes, then other data, way too technical for me.

That 98Gb/s is the raw connection speed, not the RAID access speed, just to clarify.
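I never asked exactly what he ran, but the usual way to saturate a link like this is many parallel streams with something like iperf3. A rough sketch, with placeholder address and settings:

```python
# Hypothetical saturation test with iperf3: several parallel TCP streams so a
# single CPU core doesn't become the bottleneck. Not necessarily the exact
# tool or flags my friend used, just the common approach.
import subprocess

SERVER = "192.168.100.1"  # the TrueNAS box on the 100GbE link (placeholder)

# On the server side you would first run: iperf3 -s
subprocess.run(
    ["iperf3", "-c", SERVER, "-P", "8", "-t", "30"],  # 8 parallel streams, 30 seconds
    check=True,
)
```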

I won't go into specifics because it can get quite dense here: file system, block size, snapshot setup, FTP access, permissions, network optimization, and so on…
So I'll just comment briefly on each of the steps.

So next up were the OS and the RAID.

TrueNAS CORE.

I don't need fancy stuff, just a file server and a few things more, so TrueNAS CORE seemed fine at first: it is a proven OS with some years behind it, and SCALE was still in beta at the time.
Installing it was easy and straightforward, no problems at all.

TrueNAS Core UI

RAIDZ.

We went for RAIDZ1 (similar to RAID5). Some people would say that is too risky, with only one-drive failure tolerance, but in our tests RAIDZ1 performed better than RAIDZ2 for our use case… and we also run extra external backups to spinning rust every few hours and every day.
And SSDs may fail or not… but overall they are quite stable.
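To put rough numbers on that tradeoff, here is the usable space of our 6x 4TB pool under each layout (ignoring ZFS metadata and padding overhead):

```python
# Rough usable-capacity comparison for a 6-drive RAIDZ pool.
# RAIDZ1 gives up 1 drive to parity (survives 1 failure),
# RAIDZ2 gives up 2 drives (survives 2 failures).
DRIVES = 6
DRIVE_TIB = 4e12 / 2**40   # a "4TB" drive is ~3.64 TiB

for level, parity in (("RAIDZ1", 1), ("RAIDZ2", 2)):
    usable = (DRIVES - parity) * DRIVE_TIB
    print(f"{level}: ~{usable:.1f} TiB usable, tolerates {parity} drive failure(s)")
```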

Windows Sucks.

We soon realized that the standard Windows file copy kind of sucks for this… the speeds we could easily get in Linux with multi-threaded transfers are capped to 1 thread in Windows Explorer. You can use RDMA and command-line copy tools, but it is a hassle, and we didn't want to purchase Windows for Workstations.

So our top speed will really be a bit limited by Windows transfers… a big shame (and now I'm happy we didn't buy the HighPoint NVMe card… phew!… it would have been a waste of money).
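If you do want multi-threaded copies from stock Windows, the usual command-line route is robocopy's /MT switch. A rough sketch with placeholder paths (we mostly stayed with Explorer anyway):

```python
# Hypothetical multi-threaded copy on Windows using robocopy's /MT switch
# (paths and thread count are placeholders).
import subprocess

SRC = r"Z:\projects\shot010"      # mapped network share (placeholder)
DST = r"D:\local_cache\shot010"   # local destination (placeholder)

# /E includes subdirectories; /MT:16 copies with 16 threads.
# robocopy uses exit codes below 8 for success, hence check=False.
subprocess.run(["robocopy", SRC, DST, "/E", "/MT:16"], check=False)
```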

TrueNAS CORE to TrueNAS SCALE.

We recently switched to SCALE while it was still in beta because we wanted to try some of its Docker/apps features, for instance to install a local GitLab server… and so on.

Performance didn't change meaningfully. The first beta version we installed had some minor UI issues, but recent updates have fixed most of them…

TrueNAS SCALE UI (beta RC)

Final Conclusion.

It was a good experience and I learnt a lot. I would even do it again, now that I have learnt all that stuff, but only because I had a colleague supporting me and giving me confidence throughout the process.

Extra fans for cooling the Mellanox… it runs hot.

The server itself is awesome: fast, very very quiet, power consumption around 100W, much more responsive than the old QNAP, and I have barely had to think about maintenance for months now (touch wood, and the omamori talisman).

Dynatron A26 2U: the first CPU cooler we tried. It worked fine, but when we switched to the huge Noctua, temperatures dropped by about 17°C.

So I would say that even though I went the hard way and pushed it a bit too far with the 100GbE, 10GbE would have been good enough… although it gets completely saturated with just 2 computers accessing the server at the same time, so 25GbE was possibly the sweet spot.
But now, just for fun, I can access my server at 100GbE and get that extra boost.

The omamori protecting an ugly and unimpressive server box.

The little server is called RomeChan, after the EPYC CPU's codename (Rome).

For us, look-development artists, waiting 30 minutes to copy a 1,000-frame sequence can be very upsetting; it kills the creative mood and, after all, makes us lose money. So I'm very happy that that 30-minute transfer is now done in 1 minute.

As of today, after several months of professional use, the only things that have failed were a couple of defective SSDs, which Samsung replaced without trouble, and we could swap them very smoothly using the TrueNAS UI. (I was sweating that day though… a lot.)

The overall cost of the project was around ¥850,000 ($7,800 approx.) and endless hours…
But that got us impressive specs and future-proof expandability.
I still think that if you don't want the hassle, QNAP is a very good option, but you will certainly spend a similar amount or more and be more limited in ports and hardware specs. In exchange you get more things to play with in their QTS OS, which is a bit more user-friendly overall and has nice online support.

So, did it work? Yes. Did we push the 100GbE to its limit? Not even close… but maybe in the future.

That is all! I know I'm a bit verbose when explaining things, but hopefully it was a fun read. Good luck with your servers!

By the way, if you like, you can check out our work at:

Project Instagram: https://www.instagram.com/s.t.a.b.a.t/
Personal Instagram: https://www.instagram.com/stragalet/
Twitter: @stragalet


Stragalet

I'm Stragalet! I like making 3DCG stuff and OC. Indie game dev