IIS – or – Don't Take Candy from Strangers
Two weeks ago I started noticing a strange problem with IIS. I use IIS 5.1 for development. Not that I’m a huge fan of IIS 5.1, but I do deploy sites to IIS 6, so 5.1 is as close as I can get to it for development work.
Anyhow, on almost every page request at least one asset would fail to download. For example, and most obviously, an image (or a few) would fail to download. This was really apparent when looking at a page with lots of images. Additionally, I’m quite sure, but I can’t prove it, that JavaScript and CSS files would occasionally fail to load. The problem occurred when I was using either IE or Firefox. Furthermore, when the problem began to occur it wouldn’t take long before pages stopped loading at all, which made it appear as if IIS itself were hanging.
The problem rapidly escalated over a few days. On day one I noticed this happen a few times. By day seven it was happening so frequently I was unable to get work done.
Of course, I tried all the normal culprits. I restarted IIS, ColdFusion, SQL Server, the whole server and so on. I also tried shutting off virus scanning and more. Nothing made a difference. I spent hours Googling for solutions and got, quite literally, nowhere.
Seeing as this was seriously impacting my ability to get work done I decided that I had two options: Reinstall Windows on my laptop (hey, it used to work just fine!) or call Microsoft support and make them fix the problem.
I figured a reinstall would take two days. Seriously! Making backups of important files takes several hours. Installing Windows take a few hours. Then, you’ve got to make sure all the drivers are installed correctly, restore your old files and download and reinstall all your software. It always takes me a few days to really get rolling with a new install.
Alternatively, I figured if I called Microsoft Support, even if I had to pay a couple hundred bucks for it, that in a matter of minutes, maybe a couple hours at most, they’d be able to find the problem and provide a fix. The time savings (and therefore the money savings) seemed obvious!
So, the first thing I did was call Windows support. Obviously this makes sense, I’m on a Windows operating system using a component that comes with Windows. Well, after about 10 minutes of trying to explain my problem to the support tech I finally had to break down and ask him if he’d ever even heard of IIS. His answer: No.
It turns out that despite the fact that Windows is shipped with IIS, IIS is a server product and you need to contact their Server support. When I found out the cost for this was $250 I nearly just reinstalled my laptop. However, I figured that I’d be guaranteed to run into the problem again within the next 6 months. I might as well pay the cash and learn the real solution to the problem.
It’s easy to bash Microsoft for products which have problems or support that may (or may not, as we will see) be miserable. However, after shelling out the cash things began to get interesting.
Over the span of a week and a half I spent many, many, hours on the phone with Rohan Makhija a Microsoft IIS/ASP.NET Support Engineer. One day I spent nearly ten hours on the phone with him. It turned out that Microsoft had not seen anything quite like this before. (Why do I always find the obscure bugs?! Am I secretly in QA or something?)
I use IIS Admin to allow multiple sites to run in IIS 5.1. Predictably, the first thing Rohan did was to remove all the extra sites from IIS. After that, we set the default site to point to the project I was working on and reproduced the problem and looked at the IIS logs. The logs showed that we were running into HTTP 403 errors when files were not loading.
The theory on these 403s is that IIS was reaching the maximum number of connections allowed and was rejecting all other connections. By default IIS 5.1 only accepts 10 connections. You can use an obscure command line tool to increase this to 40, but even that didn’t improve things.
Rohan used perfom to watch the number of open connections when we ran into problems. When the problem didn’t occur there was one connection for each asset plus one for the html document. When the problem did occur there were fewer connections.
To resolve this problem we turned on HTTP Keep-Alives. Traditionally, an HTTP request is a single, atomic, request for one document. Once the file is downloaded the HTTP connection is closed. HTTP Keep-Alives, however, leave the connection open and can send multiple requests over one connection (though still only one document can be delivered at a time through one connection). This did the trick for the 403s errors.
Unfortunately, we were still able to reproduce the problem, but this time IIS was logging 500 errors on static content.
Pop quiz, what about that last sentence is weird? That’s right: static content. One would expect a web server to be able to serve static content without server errors. However, gifs, jpegs and other static files were producing server errors when they’re not doing anything dynamic at all!
Next, Rohan made me remove the JRun ISAPI filter. This was simple enough using the Web Server Configuration Tool. But, it turns out that removing all connectors doesn’t actually remove the JRun ISAPI filter. So Rohan forced me, quite against my will, to uninstall ColdFusion.
At this point, I’ve got to say, I would have been absolutely pissed if the problem went away. Seriously. Like most techs, I have plenty of righteous indignation towards Microsoft and (my perception of) their crap software. If uninstalling ColdFusion fixed the problem I would have been forced to eat my words. Much to my relief, uninstalling ColdFusion didn’t fix the problem. Amusingly, in Rohan’s troubleshooting summary it simply says “Problem persists”.
Rohan did a bit more work to make sure nothing was mis-configured. He made sure Jrun was completely removed and that somehow there weren’t ISAPI extensions for GIF or JPEG images and other related settings. As expected, this did nothing.
Microsoft apparently has an internal ISAPI filter called IISMon. IISMon logs all the data going through IIS and is useful for debugging the sort of problem I was running into. This is actually not a public tool and has a built in “time bomb” to automatically disable the program. Why? Well, apparently it can even view the contents of pages delivered over SSL.
When IISMon was installed the problem never once occurred. The reigning theory was that IISMon somehow slowed the request down enough that the problem didn’t occur. It would have been a good enough solution, in my opinion, to just leave it installed. But, unfortunately, the built in time bomb made that unrealistic.
Next, Rohon removed all the standard ISAPI filters from IIS. Same problem.
After that, we uninstalled IIS completely and reinstalled it (which I did previous to calling Microsoft). We stopped all third party services and processes. Same problem.
And, to avoid boredom, I’m going to avoid describing each of the steps he took to debug the problem (you can see his report below). Let it suffice to say that Rohan left no leaf unturned.
Where it started to get interesting was when we enabled extended logging and noticed that whenever we received a 500 error it corresponded to a Win32 status of 6, “ERROR_INVALID_HANDLE”.
I’m not quite sure how, but at this point Rohan noticed something really weird, which I had completely failed to bring up before this point. Namely, I had several thousand small files with strange names in the root of my C drive. Here’s a (slightly forged) screenshot of these junk files:
Now, I’m sure that all my technical readers out there are screaming “It’s a virus you moron!” And that’s what I thought too at one point too. But I’d scanned my machine, checked everything I could think of and searched the web for “thousands of files starting with s’ in my c drive” (go ahead, you try to find something about this symptom) without success. More likely than not, there was some program I installed that was configured to write temp files to the root, for whatever reason.
This is where Rohan earned his money from Microsoft’s and why I’m just this dude who programs. He (you’re not going to believe this) opened one of these files in a text editor! Really! And what to our wondering eyes did appear? Well, something like this:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
Date: Tue, 19 Sep 2006 23:35:50 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Tue, 19 Sep 2006 23:35:41 GMT
ETag: “308875244dcc61:953″
Content-Length: 237
<style>
img {
border: 1px solid black;
}
</style>
<IMG alt=home! src=”example_files/home8_100.jpg”>
<IMG src=”example_files/thumbnail1.gif”>
<IMG alt=”Antarctic Glacier ” src=”example_files/antaricwaves-thumb3_100.jpg”>
Bonus points if you can tell me what this is. That’s right, it’s an HTTP response!
We make a backup of these files and then deleted them. Subsequent HTTP requests rewarded us by writing one file to the root for each file delivered back to us. Rohan asserted that this was not normal behavior for IIS.
At this point Rohan started getting a bit Medieval on IIS. He used some tools from sysinternals.com to gather “hang dumps”, whatever those are, from the inetinfo process. This showed a third party component from ByteMobile installed, bmnet.dll. In other words, this dll was somehow attached to the inetinfo process.
A little googling showed that this was associated with the Sprint PCS Connection Manager application, which I had installed.
This is kind of amusing actually. About 9 months ago Sprint asked me to participate in their Ambassador program. The Ambassador program was really a marketing campaign where Sprint sent thousands of phones to thousands of bloggers along with free, unlimited service for 6 months. They did this under the guise of a beta test. But, it was rather apparent that they hoped for positive reviews.
30 second review: The phone was OK, but too bulky. They wanted me to review their media services. I thought they were extremely mediocre. Their streaming video and audio services were lame, honestly, and severely overpriced. However, I could plug the phone in to my laptop via USB and get wireless high speed internet access from just about anywhere. That was extremely cool but, ultimately, the genesis of my problem.
My time on the trial had just recently run out anyhow so Rohan had me uninstall the Sprint PCS Connection Manager. And, I’ll be damned, but, that solved the problem.
So, many days and $250 later my problem was solved. It wasn’t as efficient as simply reinstalling my OS, but I’ll bet that this blog entry will help at least one other person out there find and fix this same problem.
The next day, after the problem was solved, I received a call from someone who was not Rohan at Microsoft. They wanted to know how I felt about the service and I answered honestly. I was extremely happy. Rohan pulled out all the stops to fix the problem. He did exactly what I wanted, namely using tools I didn’t know how to use to identify and fix the cause of the problem without requiring me to do a full install.
And, you know what; Microsoft gave me my money back. That’s right, my money back. Because it took so long to actually find the cause of the problem, which was not even a Microsoft product.
So, all in all, a very positive experience. One I’m not sure I’ll ever repeat, but at least I know they’re competent and helpful. (I wonder how much a related call to RedHat would have cost?)
Anyhow, as one last piece of supporting information I’ve attached the case summary below.
So, the moral of the story, dont take candy from strangers.