I will start this off by saying that, as a UC engineer, I don't have much experience with Microsoft's DirectAccess, or DA, as most call it. That said, I recently had the [dis]pleasure of troubleshooting what seemed to be a complex issue that turned out to be quite simple in the end, all related to DA and its Network Location Server (NLS).
It was a fine Sunday afternoon. I had just finished my hockey game, and after showering up I took a look at my phone, only to notice several missed calls and text messages from my boss. This is almost never a good sign. I called him back and received a frantic greeting when he answered. He explained that the networks team had been making changes over the weekend and that there was now no DNS name resolution, company-wide!
An all-hands conference call was already spun up, and when I joined, everyone had their theories. These ranged from "We need a WINS server!" to "I think there's been a master browser election on the network" to "Something is wrong with the domain controllers." As I sat listening to all of these [ridiculous] hypotheses, I was at a loss. Fortunately and unfortunately, the boss man decided on the call that everyone must drive into the office and figure this thing out. This was around 7 p.m. on a Sunday evening, and I was not enthused. I'm a UC guy, not a networks guy; why do I have to go in?
As I arrive at the office, I can see the panic in everyone's eyes. No one wants to be here, including me. The crazy ideas keep coming, and I hear them flowing from people's mouths like the Niagara into the still, musty office air [no air conditioning; it's a weekend].
I sit in my little cubicle, affectionately known as "the hole" since it's very small and tucked into an odd corner. I'm the new guy at this place, having only been here about a month at this point, so I guess I get what I have earned.
My troubleshooting begins like any logical person's would. Can I ping things by name? No. Can I ping things by IP? Yes. I start looking at the typical items, like nslookup. Nslookup works, and since it queries the DNS servers directly, bypassing the Windows resolver stack, I know the DNS servers themselves are fine even though normal name resolution is broken. I go into the networks office and ask them to articulate what they changed. They claim they did some work on the NetScaler (a hardware load balancer) but that it failed and they "rolled everything back." At this point I figured something along their journey had not been rolled back, and that the answer likely lay in that NetScaler config.
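That first split, names fail but raw IPs work, is the whole clue, and it's easy to script. A minimal sketch in Python (the hostname and address in the comment are placeholders, not from our environment):

```python
import socket

def can_resolve(name: str) -> bool:
    """True if the local resolver stack can turn the name into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except socket.gaierror:
        return False

def can_reach(ip: str, port: int = 445, timeout: float = 2.0) -> bool:
    """True if a TCP connection to the raw IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# In an outage like ours: can_resolve("fileserver01") -> False while
# can_reach("10.20.30.40") -> True, so name resolution is the broken layer.
```

If names fail but IPs work, you can stop blaming the network path and start looking at the resolution chain.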
A bit later, and after some Papa John's pizza (purchased PRIOR to its founder's recent missteps), a few of the good guys here decided to hack all of the GPOs out of their machines via Regedit. After doing this and rebooting, DNS was magically fine once again. We were getting closer. Another guy piped up and said his machine had been fine all along; so what was the difference? Our good old pal, DA. Since he was typically a VDI user, his account was not in the security group used to apply the DA GPOs. Ever closer.
My co-worker then opened the DA management console and had me take a look. I see a glaring red X on the "Network Location Server" item in the operational status pane. Click on that and you get a very nice error message telling you that the NLS is "down" and that internal clients may have issues. Bingo! Now to find out why it was showing as problematic…
I checked the GPO and found the network location server set to "https://danls.domain.com". So I ping that address and think, "It must be a server, right?" Wrong. I cannot RDP to it. When I try the URL in a browser, I get a certificate warning. That's not unusual here; pretty much EVERYTHING has a certificate with one problem or another, but that's another topic. I go to the networks guy and ask what an IP in that particular subnet would be hitting, since it was a subnet I wasn't familiar with. His response was not helpful, but I managed to get him to look at his beloved NetScaler, and we found the address as a virtual IP (VIP) there. To what does it point, I ask. It points back to the NetScaler's own management IP. What? Why? I figured something had to have gone wrong in their "roll back."
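That browser certificate warning was the real smoking gun, though I didn't know it yet. At its core, the failure is just a name comparison. Here's a deliberately simplified sketch of that check (real TLS stacks validate SAN entries per RFC 6125, chains, expiry, and more; the second hostname below is a made-up stand-in for the NetScaler's certificate name):

```python
def cert_name_matches(requested_host: str, cert_name: str) -> bool:
    """Simplified hostname check: does the name on the certificate cover
    the host the client asked for? A mismatch makes the HTTPS probe fail."""
    requested = requested_host.lower().rstrip(".")
    presented = cert_name.lower().rstrip(".")
    if presented.startswith("*."):
        # A wildcard covers exactly one leftmost label.
        parts = requested.split(".", 1)
        return len(parts) == 2 and parts[1] == presented[2:]
    return requested == presented

# The client asks for danls.domain.com; a certificate issued to the
# NetScaler's management name will never match, so the probe fails.
```

One string mismatch in one certificate, and every client that checks it walks away unhappy.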
At some point during all of this, one of the other guys had raised an incident with Microsoft. I could hear the chatter of a very nice support engineer booming from the speakerphone (we still have Cisco desk phones. UGH!). When I told her about the error we were seeing in the DA management console, she started rattling off words that, frankly, I did not understand. I asked her to hold up and tell me how to go about fixing that error. Her response: change the URL of the NLS, update the GPO, and all would be well. Ummm, what? No one can get the new GPO if no one can do any name resolution. Also, we have people spread across 214 locations in rural America; this was not going to happen.
I asked her if this NLS website needed anything special. Was the client looking for some special code in the presented page? No, she says. Fine; I'm just going to change the DNS record for danls.domain.com to point at any old web server I can find and bind a proper internally-generated certificate to it. "Will this work?", I asked. After what seemed like five minutes of silence on her end, I hear, and only hear, "Yes."
The DA servers themselves run IIS, so I did the needful: I changed DNS to point danls.domain.com there and bound the new cert in IIS. I then went to the NLS URL configuration in the DA console and clicked the verify button. Test successful! But the error was still present in the console? A reboot of the DA server solved that. Once it (and the other DA server) had been bounced and was back up, we tried pinging things from a known affected workstation. Houston, we have name resolution!
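That verify button is, at heart, an HTTPS probe: fetch the NLS URL and require a trusted certificate and a successful response. A rough approximation in Python, under my reading of the check (the URL in the comment is the one from this story; any failure at all, certificate-related or otherwise, counts as "down"):

```python
import ssl
import urllib.request

def nls_is_up(url: str, timeout: float = 5.0) -> bool:
    """Approximate the NLS health check: an HTTPS GET must complete with a
    certificate the client trusts and a 2xx status; anything else is 'down'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (ssl.SSLError, OSError, ValueError):
        return False

# nls_is_up("https://danls.domain.com") only returns True once the record
# points at IIS with a certificate actually issued for that name.
```

Note that a name mismatch surfaces as an SSL error before any HTTP response is ever read, which is exactly what the broken NetScaler VIP was producing.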
Now that everything was back to working (as well as it had before all this network work, anyway), it was time to figure out what had gone wrong. It turns out the NetScaler was to blame. You see, the guys who implemented DA here a few years ago no longer work here, and they were nice enough to leave zero documentation for us to look over. As it goes, they had pointed the NLS DNS record at a VIP on the NetScaler that deliberately pointed back to the appliance's own management page, because the record simply needed to land somewhere that could host an HTTPS page. (It's technically bad practice to point it at your DA servers instead, because if your DA servers are down you're back in this situation again; we've since fixed our temporary setup accordingly.) During the "roll back" of the NetScaler work, someone assigned the wrong certificate to that specific NetScaler virtual server. When a Windows client hit that page to determine whether it was inside or outside the network, the probe failed because the certificate was wrongly named, so the client never saw a successful response. It then assumed it was outside the network and tried to spin up a DA tunnel, and in "outside" mode its Name Resolution Policy Table (NRPT) sent DNS queries for the corporate namespace into a tunnel that could never come up from inside the LAN. The client did exactly what it was designed to do, but that simply does not work internally.
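The client's decision can be reduced to a toy model. This is my reading of the failure, not Microsoft's actual implementation, but it shows why one probe result flips every client at once:

```python
from typing import Callable

def network_location(nls_probe: Callable[[], bool]) -> str:
    """If the HTTPS probe of the NLS succeeds, the client decides it is
    inside the corporate network; any failure at all means 'outside'."""
    return "inside" if nls_probe() else "outside"

def dns_path(location: str) -> str:
    """Where corporate-suffix DNS queries go once location is decided."""
    if location == "inside":
        return "normal corporate DNS servers"
    # 'Outside' engages the NRPT: corporate queries are redirected through
    # the DA tunnel, which can never establish from inside the LAN.
    return "DA tunnel DNS proxy (unreachable internally)"

# Bad cert -> probe fails -> "outside" -> corporate DNS queries routed
# into a tunnel that will not come up -> company-wide resolution failure.
```

There is no partial failure mode here: the probe is a single yes/no gate, so one bad certificate on the NLS flips the entire fleet into the wrong state.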
In conclusion: if you are faced with company-wide DNS issues and you've checked all the normal stuff, check your DA config. It's frightening how a single certificate on a page that does essentially nothing can take down name resolution across your entire company's LAN.