What happened when my team inherited an unloved but critical system
At the end of 2016, the Financial Times Product & Tech department experienced a bit of a cabinet reshuffle. In real terms this meant that a number of key applications in use were going to be supported by other teams. One of these happened to be Citrix. An outgoing member of staff smiled soberly at me on his way out and said “Good luck with Citrix, you’re going to need it”.
At the time I reacted, like I often do, with a flippant laugh born primarily out of nervousness and a time-honoured refusal to accept what I’m being told. I mean, what did I know about Citrix? Absolutely nothing, apart from right-clicking on system tray icons and checking server names before escalating tickets.
What is Citrix, I hear you ask? Don’t they sponsor Red Bull in Formula One? Well, yes they do, but as far as the FT is concerned, we use Citrix to deliver applications globally via a nice looking web portal. The portal gives you a menu of applications, and when you click on one it launches on a Server Farm.
You don’t need the application software installed on your device, save the small Citrix Receiver app. Their software is really good at normalising access to software from global locations where latency can hurt the user experience. It was implemented before my time at the FT, when backend Application Servers and Databases were moved from regions to a central location. From a support perspective, you can deliver applications through a consistent, standard setup.
What you’re about to read is a description of a three-and-a-half-year labour of love between my team and Citrix, detailing all of the highs and the numerous lows.
Citrix Glossary of terms
- Storefront: a web server that allows you to access your applications in Citrix
- Delivery Controller: where the apps are stored and servers are configured
- App Server: where end users finally access their applications
An early horror show
So when 2017 passed 2016 like two ships in the night, the baton stealthily passed over to my team. By March 2017, things were not looking very good at all. Users were experiencing periods of slowness and available sessions were few and far between.
Common sense prevailed and an external consultant was called in to look at it for us.
The review of the estate at the time revealed the following:
- We had 9 servers running Windows 2003 and a very old version of Citrix that had gone End of Life in 2013 and out of Support in 2016
- We had 21 servers running Windows 2008 and a version of Citrix that wasn’t going to be in support for very long
- Something really pivotal called Machine Creation Services (MCS) was completely broken
- The VMware Hypervisor Certificate was broken
- New servers weren’t getting registered properly and weren’t available for use
The MCS issue above was the biggest one. In the Citrix world, you have a master server image in each data centre. You then “create” additional Application Servers from that image and deliver them into a Server Farm in VMware via a hosting setup using a Hypervisor.
Unfortunately this wasn’t working whatsoever. Servers were not being imaged with the correct configuration and they weren’t being launched and registered properly. In addition, and perhaps even more alarmingly at the time, MCS is also what lets you reboot Application Servers to clear issues. On a reboot, as long as there’s a valid link back to VMware, it restores the Application Server to its original state. Without that link, a rebooted server didn’t come back into service.
So imagine a place where you can’t build new servers and you slowly, and unwittingly, remove the existing ones by rebooting servers to clear common issues such as Windows Profile corruptions and stuck Remote Sessions. In doing this our original server farm of 12 App Servers dwindled down to a nadir of 1 working server. When you consider that over 150 users used Citrix on a daily basis at the time, you can imagine why things started to go a bit wrong and why the user experience was so bad.
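To picture how quickly a farm can drain that way, here is a toy sketch in Python. It is not how MCS is implemented: the `AppServer` class, the `reboot` function and the server names are all made up for illustration, and the only numbers taken from our story are the 12 App Servers we started with and the 1 we ended up with.

```python
from dataclasses import dataclass


@dataclass
class AppServer:
    """Hypothetical stand-in for one Citrix App Server in the farm."""
    name: str
    registered: bool = True  # True means users can be given sessions on it


def reboot(server: AppServer, hypervisor_link_ok: bool) -> AppServer:
    """Toy model of a reboot to clear an issue.

    With a healthy link back to the hypervisor the server is reset to its
    original state and comes back into service. With a broken link (our
    situation) it does not come back.
    """
    server.registered = hypervisor_link_ok
    return server


# Start with the 12 App Servers mentioned in the text (names are invented).
farm = [AppServer(f"app-{i:02d}") for i in range(1, 13)]

# Over time, 11 of them get rebooted to clear profile corruptions or stuck
# sessions while the link to the hypervisor is broken.
for server in farm[:11]:
    reboot(server, hypervisor_link_ok=False)

working = [s for s in farm if s.registered]
print(f"{len(working)} of {len(farm)} servers still in service")
```

Every reboot looked like a fix at the time, which is why the loss was so easy to miss until there was almost nothing left.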
In essence, we’d inherited something in a less than ideal state and we had very little up-to-date documentation to help us fix things. The team, with help from the External Consultant, got to work soon after the review.
As a result of that consultation period and subsequent work, the following things were corrected:
- MCS service was reconfigured and corrected after a hosting problem was found in the link between Citrix and the VMware storage
- A new certificate was issued and configured on Citrix Delivery Controllers via a PowerShell script
- Master image servers in both sites were updated with the correct settings
- Monitoring of Citrix was improved using Zabbix and Dashing
These four things enabled us to bring more servers back into Production and, just as importantly, we could then reboot them without fear of losing them. With improved monitoring, both we and the operations team could respond better and more quickly to any issues that arose.
Whilst still being on a steep learning curve about all things Citrix related, our Citrix Estate was in a more stable place by May 2017. The next challenge was to get rid of the 9 Application Servers running Windows 2003 by the beginning of 2018. We identified a list of 6 applications that needed to be re-hosted from Windows 2003 servers.
The major operational issues we had previously experienced spooked us into not trusting Citrix as a future delivery method. To be very fair to Citrix this was primarily down to our implementation of their products rather than a problem with their software itself.
We looked at the AWS offerings of WorkSpaces and AppStream. At the time, and due to the legacy nature of the applications, we found that there were major performance issues in AsiaPac regions. We’re talking here about apps built in the 1990s and out of support since the mid-noughties, all talking to incredibly old Oracle Databases.
We also briefly looked into Citrix Cloud, but the legacy state of our entire Citrix farm, or maybe our lack of experience with the product, meant it was nigh on impossible to migrate to it.
Improvements and getting rid of Tech debt
With the age-old gift of hindsight, we wasted way too much time trying to reinvent the wheel and shoe-horn stuff from Citrix into something in AWS. We probably did about 6 months of testing in this area before accepting that it wasn’t going to happen.
Maybe by focusing on the most difficult area (the oldest apps and the oldest servers), we didn’t really give the AWS products the chance to deliver the Apps that were served on the newer Citrix App Servers.
In the end, in order to get off the Windows 2003 Servers, we replatformed the Apps hosted on those servers to the newer Citrix App Servers running Windows 2008. This proved to be a major piece of work given that we were fully reliant on other teams to rebuild and reconfigure their applications. Again we’re talking about some 16-bit apps (or components of them at the very least) which had been quite happy to run on 32-bit servers for many years, only for us to come along and suddenly demand that they run ok on 64-bit servers.
When that work was completed in early 2019, the next step was to upgrade the Citrix version on the Windows 2008 Servers. All 21 of them. That required us to build 4 new Delivery Controllers. We first attempted this without external help and it didn’t go well. If I remember correctly, we were a Citrix Database restore away from complete oblivion, an abyss of pain and sorrow. Once again we accepted defeat and paid for an external consultant to come in for 3 days.
Re-hosting the entire farm this time
2019’s challenge meant that we had to re-host everything running Windows 2008 (as it was going end of support in Q1 2020) to Windows 2012. Ideally we wanted to go to Windows 2016 or 2019, but we’d already upgraded the Windows 2008 Remote Desktop Services licences to Windows 2012. These licences allow an unlimited number of connections to Windows servers via Remote Sessions. Microsoft allow 2 connections by default, so if you didn’t pay for the licences and you had 100 End Users, you’d need 50 Citrix App Servers (100 users divided by 2 connections per server) instead of 6!
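For anyone who wants to play with the numbers, the sum behind that claim fits in a few lines of Python. This is just a back-of-the-envelope sketch: the function name `servers_needed` is made up, and it assumes every user needs a session at the same time, which is the worst case rather than a measured figure.

```python
import math


def servers_needed(users: int, sessions_per_server: int) -> int:
    """Smallest number of servers that can hold one session per user."""
    return math.ceil(users / sessions_per_server)


END_USERS = 100          # example figure from the text
DEFAULT_CONNECTIONS = 2  # connections allowed per server without the licences
LICENSED_SERVERS = 6     # size of our licensed farm

# Without the licences: every server tops out at 2 connections.
unlicensed = servers_needed(END_USERS, DEFAULT_CONNECTIONS)
print(f"Unlicensed farm: {unlicensed} servers")

# With the licences: how many sessions each of the 6 servers has to carry
# if all 100 users are on at once.
sessions_each = math.ceil(END_USERS / LICENSED_SERVERS)
print(f"Licensed farm: {LICENSED_SERVERS} servers, up to {sessions_each} sessions each")
```

Change `END_USERS` to your own headcount and it becomes obvious fairly quickly why nobody runs a farm on the default connection limit.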
This time we took the decision to both upgrade and move from our two data centres into AWS. We had our Cloud 2020 strategy looming on the horizon, which meant that by the end of 2020 everything hosted in our Data Centres had to go. To us it didn’t make sense to just upgrade the servers in the data centres and then move them to AWS afterwards, although it didn’t stop me trying to find an easy way at first!
The plan was to go from this….
Start of Q4 2019: Citrix hosted in two Data Centres with a DB in AWS
End of Q4 2019: A Hybrid of Data Centre and AWS
End of Q1 2020: Fully AWS
In the end we pretty much managed it on time. The exception was the Delivery Controllers, which we tackled in Q2: they were running Windows Server 2012 and weren’t going end of support, so they weren’t a huge priority.
The challenges along the way included:
- Again needing other teams to remember how they helped install these applications many years previously. Lots of undocumented fixes were implemented over the years and we had to collectively unpick them all, then review, re-apply or bin them.
- Encountering many different network issues when trying to access all backend components of applications when they were hosted in a variety of places
- Encountering even more Network issues when 3rd Parties in the US and the Netherlands tried to access them
- Using up more goodwill in getting End Users to Test Citrix things
- Hoping that Citrix Software would make up for the latency we experienced in AsiaPac when testing hosting in AWS back in 2017 and 2018. It does, thankfully.
- Trying to put multi factor authentication in front of Citrix via Okta without the standard recommended integration infrastructure being in place.
Tech Debt removed
- Got rid of the Appsense servers that were in the past needed to govern Microsoft Office Usage
- Removed the need for the Jet Nexus load balancer boxes by adding in an AWS load balancer attached to the Storefronts
- Added extra security into the Storefronts by changing from HTTP to HTTPS
- Removed Citrix Profile Management from the setup which enabled us to delete around 1500 legacy profiles and around 300GB of files and folders
We’ve reduced the number of App Servers down from 10 to 6 in Q2 2020 and moved the Delivery Controllers fully into AWS.
We’re currently working on adding an additional AWS Availability Zone (AZ) to the Citrix Estate so that each Citrix Component has a separate AZ for resilience purposes.
Once that’s completed we aren’t expecting to do any further major work to Citrix until 2023, when Windows 2012 Server goes end of support. That said, the external data centres will be going at the end of 2020, so there will be some minor network config changes to allow connectivity between Citrix and the new host locations.
So what’s the long term future for Citrix at the FT? At the moment it feels like it’s a medium-term solution at best. Citrix does a great job of delivering applications hosted centrally to global sites and areas that sometimes do not enjoy the same quick or reliable internet connection. Their software does some stuff in the background that normalises user inputs from a mouse or keyboard, which creates a reliable user experience that we didn’t get when testing AWS alternatives. It doesn’t sound like much, but when it’s your main tool the little things make all the difference.
In 2020 some of the Apps that used to be delivered by Citrix have been transitioned onto AWS AppStream, but there are some latency issues when it comes to certain areas of the world. Depending on what happens with Covid-19, improvements to our AWS cross-region networking might have a positive impact on the latency, or they may not if working from home continues to be the new norm.
Some other Apps are also due to be transitioned to new solutions which may well be web based and not reliant on accessing resources inside the FT Network.
My personal feeling is that we should plan to make Citrix future proof regardless of what happens above. This means that we should move the hosting of it outside of the FT, and where better than in Citrix’s very own Cloud? The current Citrix Server Estate numbers 19 Servers, which includes Prod, Dev and Test Environments. Over the last three and a half years we’ve reduced the number of Citrix Servers by 41%.
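As a quick sanity check on that 41%, the two figures quoted above can be worked backwards in Python to see what starting point they imply. The starting count this produces is inferred from the arithmetic, not a number I’ve quoted anywhere, so treat it as an estimate.

```python
CURRENT_SERVERS = 19     # from the text
QUOTED_REDUCTION = 0.41  # from the text

# If 19 servers is what remains after a 41% cut, the estate must have
# started at roughly 19 / (1 - 0.41) servers. This is an inferred figure.
starting = round(CURRENT_SERVERS / (1 - QUOTED_REDUCTION))

# Re-derive the percentage from the rounded starting count as a check.
reduction_pct = round((starting - CURRENT_SERVERS) / starting * 100)

print(f"Implied starting estate: about {starting} servers")
print(f"Reduction from {starting} to {CURRENT_SERVERS}: {reduction_pct}%")
```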
If we move to Citrix Cloud now we’d benefit from:
- Not having to support 19 Servers (££££ saving in person hours)
- Not having to host 19 Servers (around £18K)
- Not having to patch 19 Servers
- Not having to update Citrix Versions (Tricky to do and needs outside consultants ££££)
- Flexibility in scaling down or up
- Being able to better integrate it with Okta with multi factor authentication
- Not being reliant on our internal Network
However, as with most things in life, a move fully into the Cloud comes with trade-offs.
- We might lose the ability to monitor so closely
- Dev and Test Environments may or may not be so easy to access and rebuild
You’d also lose control of your environment which might affect things like…
- Choosing when to deploy a change that might break an Application
- Troubleshooting issues, which can be harder than it looks when you don’t have full visibility of the route from A to B
- Implementing new Applications or changing existing ones would be a slower process
So there we are. In a little over 3 years we’ve gone from not having the slightest clue to understanding, stabilising, removing legacy, upgrading, re-hosting, moving more legacy off, improving and then re-platforming in AWS. The next step should now be to re-platform it again outside of the FT and future proof it, yet retain the flexibility to either let it gently slip away or build on its proven capabilities if the current situation with Covid-19 continues and the business need changes further.