The amazing adventures of Doug Hughes

Archive for October, 2006

IIS – or – Don't Take Candy from Strangers

Two weeks ago I started noticing a strange problem with IIS. I use IIS 5.1 for development. Not that I’m a huge fan of IIS 5.1, but I do deploy sites to IIS 6, so 5.1 is as close as I can get to it for development work.

Anyhow, on almost every page request at least one asset would fail to download. For example, and most obviously, an image (or a few) would fail to download. This was really apparent when looking at a page with lots of images. Additionally, I’m quite sure, but I can’t prove it, that JavaScript and CSS files would occasionally fail to load. The problem occurred when I was using either IE or Firefox. Furthermore, when the problem began to occur it wouldn’t take long before pages stopped loading at all, which made it appear as if IIS itself were hanging.

The problem rapidly escalated over a few days. On day one I noticed this happen a few times. By day seven it was happening so frequently I was unable to get work done.

Of course, I tried all the normal culprits. I restarted IIS, ColdFusion, SQL Server, the whole server and so on. I also tried shutting off virus scanning and more. Nothing made a difference. I spent hours Googling for solutions and got, quite literally, nowhere.

Seeing as this was seriously impacting my ability to get work done I decided that I had two options: Reinstall Windows on my laptop (hey, it used to work just fine!) or call Microsoft support and make them fix the problem.

I figured a reinstall would take two days. Seriously! Making backups of important files takes several hours. Installing Windows take a few hours. Then, you’ve got to make sure all the drivers are installed correctly, restore your old files and download and reinstall all your software. It always takes me a few days to really get rolling with a new install.

Alternatively, I figured if I called Microsoft Support, even if I had to pay a couple hundred bucks for it, that in a matter of minutes, maybe a couple hours at most, they’d be able to find the problem and provide a fix. The time savings (and therefore the money savings) seemed obvious!

So, the first thing I did was call Windows support. Obviously this makes sense, I’m on a Windows operating system using a component that comes with Windows. Well, after about 10 minutes of trying to explain my problem to the support tech I finally had to break down and ask him if he’d ever even heard of IIS. His answer: No.

It turns out that despite the fact that Windows is shipped with IIS, IIS is a server product and you need to contact their Server support. When I found out the cost for this was $250 I nearly just reinstalled my laptop. However, I figured that I’d be guaranteed to run into the problem again within the next 6 months. I might as well pay the cash and learn the real solution to the problem.

It’s easy to bash Microsoft for products which have problems or support that may (or may not, as we will see) be miserable. However, after shelling out the cash things began to get interesting.

Over the span of a week and a half I spent many, many, hours on the phone with Rohan Makhija a Microsoft IIS/ASP.NET Support Engineer. One day I spent nearly ten hours on the phone with him. It turned out that Microsoft had not seen anything quite like this before. (Why do I always find the obscure bugs?! Am I secretly in QA or something?)

I use IIS Admin to allow multiple sites to run in IIS 5.1. Predictably, the first thing Rohan did was to remove all the extra sites from IIS. After that, we set the default site to point to the project I was working on and reproduced the problem and looked at the IIS logs. The logs showed that we were running into HTTP 403 errors when files were not loading.

The theory on these 403s is that IIS was reaching the maximum number of connections allowed and was rejecting all other connections. By default IIS 5.1 only accepts 10 connections. You can use an obscure command line tool to increase this to 40, but even that didn’t improve things.

Rohan used perfom to watch the number of open connections when we ran into problems. When the problem didn’t occur there was one connection for each asset plus one for the html document. When the problem did occur there were fewer connections.

To resolve this problem we turned on HTTP Keep-Alives. Traditionally, an HTTP request is a single, atomic, request for one document. Once the file is downloaded the HTTP connection is closed. HTTP Keep-Alives, however, leave the connection open and can send multiple requests over one connection (though still only one document can be delivered at a time through one connection). This did the trick for the 403s errors.

Unfortunately, we were still able to reproduce the problem, but this time IIS was logging 500 errors on static content.

Pop quiz, what about that last sentence is weird? That’s right: static content. One would expect a web server to be able to serve static content without server errors. However, gifs, jpegs and other static files were producing server errors when they’re not doing anything dynamic at all!

Next, Rohan made me remove the JRun ISAPI filter. This was simple enough using the Web Server Configuration Tool. But, it turns out that removing all connectors doesn’t actually remove the JRun ISAPI filter. So Rohan forced me, quite against my will, to uninstall ColdFusion.

At this point, I’ve got to say, I would have been absolutely pissed if the problem went away. Seriously. Like most techs, I have plenty of righteous indignation towards Microsoft and (my perception of) their crap software. If uninstalling ColdFusion fixed the problem I would have been forced to eat my words. Much to my relief, uninstalling ColdFusion didn’t fix the problem. Amusingly, in Rohan’s troubleshooting summary it simply says “Problem persists”.

Rohan did a bit more work to make sure nothing was mis-configured. He made sure Jrun was completely removed and that somehow there weren’t ISAPI extensions for GIF or JPEG images and other related settings. As expected, this did nothing.

Microsoft apparently has an internal ISAPI filter called IISMon. IISMon logs all the data going through IIS and is useful for debugging the sort of problem I was running into. This is actually not a public tool and has a built in “time bomb” to automatically disable the program. Why? Well, apparently it can even view the contents of pages delivered over SSL.

When IISMon was installed the problem never once occurred. The reigning theory was that IISMon somehow slowed the request down enough that the problem didn’t occur. It would have been a good enough solution, in my opinion, to just leave it installed. But, unfortunately, the built in time bomb made that unrealistic.

Next, Rohon removed all the standard ISAPI filters from IIS. Same problem.

After that, we uninstalled IIS completely and reinstalled it (which I did previous to calling Microsoft). We stopped all third party services and processes. Same problem.

And, to avoid boredom, I’m going to avoid describing each of the steps he took to debug the problem (you can see his report below). Let it suffice to say that Rohan left no leaf unturned.

Where it started to get interesting was when we enabled extended logging and noticed that whenever we received a 500 error it corresponded to a Win32 status of 6, “ERROR_INVALID_HANDLE”.

I’m not quite sure how, but at this point Rohan noticed something really weird, which I had completely failed to bring up before this point. Namely, I had several thousand small files with strange names in the root of my C drive. Here’s a (slightly forged) screenshot of these junk files:

Long list of files

Now, I’m sure that all my technical readers out there are screaming “It’s a virus you moron!” And that’s what I thought too at one point too. But I’d scanned my machine, checked everything I could think of and searched the web for “thousands of files starting with s’ in my c drive” (go ahead, you try to find something about this symptom) without success. More likely than not, there was some program I installed that was configured to write temp files to the root, for whatever reason.

This is where Rohan earned his money from Microsoft’s and why I’m just this dude who programs. He (you’re not going to believe this) opened one of these files in a text editor! Really! And what to our wondering eyes did appear? Well, something like this:

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.1
Date: Tue, 19 Sep 2006 23:35:50 GMT
Content-Type: text/html
Accept-Ranges: bytes
Last-Modified: Tue, 19 Sep 2006 23:35:41 GMT
ETag: “308875244dcc61:953″
Content-Length: 237
<style>
img {
border: 1px solid black;
}
</style>
<IMG alt=home! src=”example_files/home8_100.jpg”>
<IMG src=”example_files/thumbnail1.gif”>
<IMG alt=”Antarctic Glacier ” src=”example_files/antaricwaves-thumb3_100.jpg”>

Bonus points if you can tell me what this is. That’s right, it’s an HTTP response!

We make a backup of these files and then deleted them. Subsequent HTTP requests rewarded us by writing one file to the root for each file delivered back to us. Rohan asserted that this was not normal behavior for IIS.

At this point Rohan started getting a bit Medieval on IIS. He used some tools from sysinternals.com to gather “hang dumps”, whatever those are, from the inetinfo process. This showed a third party component from ByteMobile installed, bmnet.dll. In other words, this dll was somehow attached to the inetinfo process.

A little googling showed that this was associated with the Sprint PCS Connection Manager application, which I had installed.

This is kind of amusing actually. About 9 months ago Sprint asked me to participate in their Ambassador program. The Ambassador program was really a marketing campaign where Sprint sent thousands of phones to thousands of bloggers along with free, unlimited service for 6 months. They did this under the guise of a beta test. But, it was rather apparent that they hoped for positive reviews.

30 second review: The phone was OK, but too bulky. They wanted me to review their media services. I thought they were extremely mediocre. Their streaming video and audio services were lame, honestly, and severely overpriced. However, I could plug the phone in to my laptop via USB and get wireless high speed internet access from just about anywhere. That was extremely cool but, ultimately, the genesis of my problem.

My time on the trial had just recently run out anyhow so Rohan had me uninstall the Sprint PCS Connection Manager. And, I’ll be damned, but, that solved the problem.

So, many days and $250 later my problem was solved. It wasn’t as efficient as simply reinstalling my OS, but I’ll bet that this blog entry will help at least one other person out there find and fix this same problem.

The next day, after the problem was solved, I received a call from someone who was not Rohan at Microsoft. They wanted to know how I felt about the service and I answered honestly. I was extremely happy. Rohan pulled out all the stops to fix the problem. He did exactly what I wanted, namely using tools I didn’t know how to use to identify and fix the cause of the problem without requiring me to do a full install.

And, you know what; Microsoft gave me my money back. That’s right, my money back. Because it took so long to actually find the cause of the problem, which was not even a Microsoft product.

So, all in all, a very positive experience. One I’m not sure I’ll ever repeat, but at least I know they’re competent and helpful. (I wonder how much a related call to RedHat would have cost?)

Anyhow, as one last piece of supporting information I’ve attached the case summary below.

So, the moral of the story, dont take candy from strangers.

Rohan’s Case Notes

Long Overdue Reactor News

Every week I read the ColdFusion Open-Source Update put out by Brian Rinaldi. Now, I used to pride myself on seeing Reactor mentioned frequently on that list. Unfortunately, several weeks have gone by without any real news from the Reactor camp.

This is really my fault. I honestly just haven’t felt much like blogging lately. I’ve been recovering from a stressful period in my live involving a new baby (who I am simply smitten by, by the way), selling my townhouse, moving into a new house, lots of work, too much email and a few new projects. All in all, I just caved in for a while. I’m feeling better now and I’m trying to try to blog more.

Furthermore, as much as I respect Brian, and as much as I know he tries not to be too critical or too imbalanced, I feel as if Reactor is getting bad press on his update. I understand he had some negative experiences with Reactor, though I’m not sure what they were.

Anyhow, in the attempt to try to shed some more positive light on Reactor, here’s some Reactor news I haven’t covered on my blog in the last two months:

Beth Bowen Takes the Bull By the Horns

Beth Bowden, who initially contributed Oracle support has been fixing my bugs. Since mid August she’s made nearly 40 commits to the framework. The fixes have includes a wide range of issues both large and small.

You know what? This is nothing small. I’ve done next to nothing compared to her in terms of bug fixes in that time period. She’s simply the unsung hero of Reactor and deserves wide praise and appreciation from the Reactor community in general.

Beth, take a bow!

Huge Performance Gains

As much as it pains me to say this, historically Reactor’s been a bit of a dog (in terms of speed) and a bit of a hog (in terms of memory usage).

There! I said it out loud! I feel liberated! I’ve secretly been beating myself senseless about this.

Actually, it’s not so secret. I asked for help from people in tracking down these issues months ago. The guys from Webappr stepped up the plate and determined that, yes, Reactor used a lot of ram. Furthermore, for some reason Reactor was “causing” Java to do a lot of full garbage collections (Full GCs).

These full GCs didn’t seem to be able to completely collect old memory, for some reason. As more and more memory was used the full GCs (which, by the way are a stop-the-world event) would take more and more time. Eventually servers would hang and need to be restarted.

The odd thing? It didn’t seem to happen to everyone. It although the problem occurred production mode, it was a real problem (at least for me) in development mode I personally would be guaranteed at least one manual CF restart a day. What a mess.

However, after a while Reactor’s memory usage did seem to level out. And, thankfully the Webapper guys informed me that full GCs could be disabled in JRun. That worked absolute miracles for stability. Unfortunately, Reactor continued to be slower and more memory intensive than I wanted it to be.

So, I spent quite a while thinking about the problems. There was nothing obvious that was specifically slowing down the works or eating up ram. I tried a dozens of techniques to identify where the problems were and tweaked lots code.

Ultimately I speculated that the problem lies, at least in part, with the use of so many bloody CFCs in Reactor. OO queries used to use a LOT of CFCs to do their job. So, I essentially rewrote them and removed two entire classes (choosing to store data in structs instead). This helped a bit, but not as much as I wanted.

I suspect that the deep inheritance tree on Reactor objects doesn’t help. So, I removed the base abstract object (which really served no essential function). I might in the future make the generation and use of DBMS agnostic files optional. I think that not using these might help a bit too.

The thing that made a huge difference was something I only did within the past week or so. Logically, OO queries are very dynamic when they don’t need to be. That is to say, if you create a query, join to a table, add a few where statements and sort the results, each time that code runs it’s going to have to go through all of the validation (which is expensive) and the translation of the OO query to a real SQL statement.

In development mode this process could take a quarter second. In production it was more like 60 ms (of course this is dependant on hardware, etc). This isn’t bad, but it’s not good either.

So, I decided to make a facade of OO queries which implemented the same interface as the actual OO queries. However, when you call methods the facade simply stores the commands in a structure. Then, when you run the query, the gateway, which use to be at least partially in charge of transforming the query to SQL, simply asks the facade for a file to include. The facade does a quick hash of the data you provided it and then looks for a file on disk that matches that hash. If the file is found it’s returned otherwise the facade validates the query and generates a static SQL statement (complete with cfqueryparams) to disk and returns the path to that file.

The gateway then simply includes the SQL statement into its cfquery tag. The first run of a query is on par with previous speeds. However, subsequent requests are more like 14 ms. That’s a 300% improvement in speed!

Duckies -or- Anything Goes

Speaking of speed, as is well known, a few popular frameworks have gone “typeless”. That is to say that all (or most) of the core framework’s methods have had their cffunction and cfattribute tags updated so that their returntype and type attributes are set to any, instead of specifying a specific data type. This essentially disables ColdFusion’s runtime type checking.

Because ColdFusion no longer checks the types when methods are called Reactor should theoretically get a lot faster. In practice it seems that this helped some cases but not others. Jared Rypka-Hauer praised it as being much faster while I’m not sure I saw much of a difference.

Anyhow, to make sure that the actual types were recorded for documentation I added two new made up attributes to the cffunction and cfargument tags, _returntype and _type. These hold the “real” data type and show up in the CFC’s metadata, but not in the ColdFusion-generated documentation, unfortunately.

BeanFactory Support

Joe Rinehart contributed a feature whereby ColdSpring can inject itself into the ReactorFactory when it’s instantiated. Furthermore, Reactor lets you access the BeanFactory to get any ColdSpring configured bean. Simply use the _getBean() method. (Note, this won’t show up in metadata due to the fact it’s mixed in.) It’s debatable as to if this is a good thing, but it’s there!

Reactor Continues to Evolve

Simply put, there have been a ton of little things quietly changing over the past few months. Far too many to blog about.

I know the docs aren’t done yet. That’s my fault. I’ll get there soon! And before long we’ll be at a 1.0! (And no, contrary to certain reports, Reactor is still not a 1.0.)

Tag Cloud