I ran into a new and exciting issue with ColdFusion today. I was called into figure out a bizarre problem a coworker of mine was having. The problem was that the navigation bar on a website was occasionally (meaning about 1 out of 25 times or so) not showing all the data or was showing too much data, or just plain wrong information. If you were to go to the site and reload the page, every once in a while youd see data that was just wrong. For Example:
Then, randomly, youd see this:
In general, the code that outputs the navigation menu was looping over a query in the application scope. When the application was initially loaded application.nav_menu would be set to the results of a query. Then, on all page loads the nav menu would be generated by looping over the query in application.nav_menu.In addition to that, the problem could be duplicated by creating some code as simple as this and then reloading the page over and over:
<cfoutput query="application.nav_menu"> #CurrentRow#.) #Name#" </cfoutput>
When you ran this code most of the time you would see rows 1 through 30. However, occasionally, it would skip rows! You might see 1, 2, 4, 5, 7. Notice that 3 and 6 are missing. The next time youd reload the page youd see 1, 2, 3, 4, 5 with no missing data.
Incorrect (note the missing data):
(and so on)After beating my head against this problem for a few hours I began to have an idea what was going on. Here are a few more hints:
- The problem only occurred on the live site.
- The problem began occurring after upgrading from ColdFusion 5 to ColdFusion MX 6.1 with the 6.1 updated released in August.
- The ColdFusion server had been restarted at least once.
- The site gets about 125,000 hits per day. (I assume thats on more than just .cfm files)
After some research I was able to reproduce the problem outside of production. It seems that when you have a query in the application scope that all of the querys metadata is in the application scope too, including the currentrow value. That means that if you have two people looping over the same query in the application scope at the same time that as user one reaches the end of the loop and the currentrow is incremented that its also incremented for the other user looping over the same application variable. The second user when they reach the end of the query will increment the currentrow and begin the loop again. However, at this point they will appear to have jumped two rows, not one. To test this theory, I created a new folder in my application and wrote a few test files. One was an application.cfm designed to isolate my tests from the application.
<CFAPPLICATION name="test6" clientmanagement="No" sessionmanagement="Yes" applicationtimeout="#CreateTimespan(2,0,0,0)#" sessiontimeout="#CreateTimespan(0,2,0,0)#" setdomaincookies="true">
I then created a simple file, test1.cfm, which would create and cache the query into the application scope if it didnt exist and loop over the cached query:
<cfif not isdefined("application.NavMenuTEST")> <cfquery name="application.NavMenuTEST" datasource="beta"> EXEC amsp_NavMenuSetup 0, 'http://www.website.org/', 'http://www. website.org/', 4.1 </cfquery> <h2>Application Query Var Set</h2> </cfif> <cfoutput query="application.NavMenuTEST" maxrows="30"> #CurrentRow#.) #Name#" </cfoutput>
At this point I could load test1.cfm all day and never see the problem. This is good. These tests were not under any load and it was the expected behavior.So, I created another file, tDoug.cfm. This file simulated thousands of users looping over the cached query at the same time.
<cfloop from="1" to="50000" index="x"> <cfoutput query="application.NavMenuTEST"> <cfset t = "#CurrentRow#.) #Name#" /> </cfoutput> </cfloop>
This file looped over the entire query 50000 times. Each time it would perform some unimportant mumbo-jumbo.Running tDoug.cfm took 15 or so seconds. If, while this file was running, I hopped to another tab and ran test1.cfm the data would be all out of order with lots of missing rows. Just the same problem I was having but to a larger extreme.Interestingly enough, I was not able to reproduce this on ColdFusion 5, or ColdFusion MX 6.1 (on Linux) but I was able to reproduce it on a different ColdFusion MX 6.1 with the 6.1 updater server while pointing to a different database and server altogether. It sure sounds to me like a problem with the 6.1 updater.For those of you who are having this problem my solution was to duplicate the query into the local variables scope before looping over it. The major drawback is that Ive got to have a minimum of two copies of the query in memory at any given point in time for this to work (because Im duplicating the query on each request.) In other words, if I updated my test1.cfm as follows I no longer ran into the problem.
<cfset navmenu = duplicate(application.NavMenuTEST) /> <cfoutput query="navmenu" maxrows="30"> #CurrentRow#.) #Name#" </cfoutput>
This is all I can say at the moment. I let some people at Macromedia know about the problem and the ball is in their court now.
Update 9/23/2004: So, my coworker did send the results of our findings off to Macromedia yesterday. Apparently Macromedia has no confirmed that this is a bug going all the way back to MX 6.0! And no one ever caught this before?! Any how, they hadn’t heard of this problem at all before yesterday and then, all of the sudden, five companies call in with the same problem. Very strange.
The Macromedia support engineer is apparently going to suggest a hot fix for this issue. In the meantime, the official Macromedia line appears to be that you should use an exclusive lock around any loops over application queries. The other suggestion we’ve come up with is to use the duplicate method to copy the query into the variables scope before looping over it. Both will hurt application performance in their own ways. If you have this problem , I would suggest trying both solutions under load before choosing one or the other.I’ll keep this post updated as I learn more.
Comments on: "ColdFusion MX 6.1 Updater Introduces New Bug When Looping Over Query In the Application Scope" (11)
I wouldn’t consider this a bug – having multiple threads interacting with application scope is just a coding error! You need to lock application scope (or use a named lock) since you have a very simple, obvious race condition. If it worked on CF5, you were extremely lucky since you pretty much only had to breathe on application scope without locks and your server fell over…
Since this is caused by a shared loop index (application.nav_menu.currentrow), another workaround would be to recode it to use a local variable for the loop index:
<cfloop index="i" from="1" to="#application.nav_menu.recordcount#">
Assuming that application.nav_menu is created once in Application.cfm and is not updated, you wouldn’t need a lock. (Sean, please correct me if I’m wrong).
Sean, I ran the same set of tests on a ColdFusion 5 server and was not able to reproduce the problem. I do agree that everything in CF should be locked… that doesn’t mean that this is always done. In the case of the code Im supporting I dont think it was ever done.
Travis, That’s a great suggestion!
The issue at hand seems to be a combination of not-so-hot code with an apparent change in behavior between CF 5 and MX which caught us off guard. I havent heard anything more on this since 9/23.
The killer on this is that the app that Im supporting relies heavily on that particular query (and dozens of others in the application scope). Updating all of these may be next to impossible; much less the fact that I would expect that it would hurt what remains of the applications performance.
If I had all the time and energy in the world Id probably choose Travis solution.
Travis, your solution is still not thread-safe because it’s the whole query that is updated in application scope and therefore application.nav_menu.recordCount could change during the execution of the loop, as could application.nav_menu.name[i].
Either the code needs to be single-threaded with a named lock or the menu should be created once and cached (and only lock the code if the query / menu needs to be recreated).
OK: this is about queries but would this affect looping of any complex variable in shared scope – eg an array of structs? my gut feeling would be YES, since the problem is the index variable being changed
>> or the menu should be created once and cached
for queries only – how about an array of Structs?
also, would copying the query/array/etc over to request scope and looping that avoid the issue?
I think this just illustrates some of the considerations you need to take when placing any data in the application scope.
In this particular case I was taken by surprise that the current row/index was stored in the application scope. I expected it to be in something more like the request scope.
It’s my opinion that the way it functions is not the way it seems like it should.
In the case of arrays, structs or any other type of data in the application scope I think you need to be careful when changing and reading it.
When you place *anything* in a shared scope, you are bound to consider race conditions.
>> you are bound to consider race conditions.
which not only means writing but also iterating (at least with queries).
so what’s wrong with copying the query to request scope and looping over that?
Copying the query around will soon lose you the advantage of caching (since you’d have to duplicate() it to copy it to request scope) – since this example is just used to create a global nav, it would probably make sense to create the nav once and cache that instead of just caching the query.
Just for the record, I agree with Sean. However, also for the record, it looks like Macromedia will be issuing a hot fix for this. I’ll blog about that at some point today.
Coming late to the thread here, but what the hell… better late than never.
Seems to me that the application scope should encapsulate the data of the query as well as it’s metadata. Anything else would be atypical behavior. I mean, instance data is instance data… a query object’s current row is part of its instance data and should be in the application scope.
I’m not exactly sure how cursors work in classic Java resultsets, but it would seem to me that to have a variables scoped counter used with array notation to get a particular row from the query object is going to force the currentRow value in the application scope to go bouncing everywhere… which may not be an issue, but it could also be a small tax on performance.
Why not just either create the nav once and use that to draw the menu or, instead of doing a Duplicate() on every single request, just do a Duplicate() into the session scope of session.nav_query doesn’t exist? It would get you the benefits of caching without the hassle and overhead of duplicating on every request.