The amazing adventures of Doug Hughes

Archive for February, 2008

Applying Load To Troubleshoot Application Issues

We are asked to help clients with various issues at various stages. Sometimes we are brought in when an application is simply being planned, sometimes just before launch or enhancement. This blog piece covers the times when we are brought in to pinpoint application slow-downs or failures.

As I have mentioned in previous blog postings, the first thing I always want to do when trying to help a client with application slow-downs or failures is to see the recent Production ColdFusion and JRun logs. The logs I focus on can depend on what kind of install, what operating system and also if there are any custom logging settings. If there is a {server-instance}-out.log I look there first. I never go back further than 7 days unless the client has a specific reason for me to do so.

So, what does this have to with Load-Testing? Please read on, if you wish. The logs I mention above will definitely give us pointers to what is going wrong, if we add Metrics Logging and Verbose Garbage Collection logging we will see more detail on that. All of this will give some overall indications of the system state. For instance, if there are many different templates, events or fuseactions in the logs I would suspect overall environment problems. If it is a just a few templates, events or fuseactions repeating over and over again then that is usually indicative of specific code issues.

As a note point, in frameworks such as FuseBox, Mach-ii or ModelGlue knowing what actually ran is more difficult and we need to capture the URL Query String. Both FusionReactor and SeeFusion are of great assistance in this respect. Hopefully the client can provide a testing environment, and in that case we would want to use a load test tool.

That test environment should be as identical to Production as possible, and it is critical that the code be the same. From the pointers gained from the Production logs and hopefully with some guidance from the client, we then create load-test scripts that exercise the code that we saw causing issues in Production.

We almost always start by tuning the JVM, particularly if there are indications that the site overall is slow. One last critical point; we look very closely at is the errors in the logs on the test environment. We want to see the same or very similar errors to those seen in Production otherwise we have some fundamental differences which will make it more difficult to achieve success.

Model-Glue Training In DC Recap

Today is my first day back in the office after four intense days of teaching my Enterprise Model-Glue with ColdFusion class in Arlington, Virginia. This was the first time I gave the class in front of a general audience, but not the first time Ive given the class. Overall, I was pleased with how it went and I think the participants were too.

The group of attendees had a wide range of experience, expectations, and intentions. For example, there were a few people who already used Model-Glue but wanted some formal instruction. There were others who used other frameworks already, or none at all. A couple of people were there more to learn how to support the frameworks in various capacities rather than learning how to implement them. Most people had at least some level of understanding of what we went over coming in.

The title of the course was Enterprise ColdFusion with Model-Glue. I obviously provided training on Model-Glue. But, most of the attendees were surprised by how much focus was given to other topics such as the usage of CFCs, Object Oriented Programming, Design Patterns, unit testing, ColdSpring, Reactor and much, much, more. Why all this focus on things other than Model-Glue? Well, thats really what makes up the Enterprise in Enterprise ColdFusion.

The hardest lesson taught was that Model-Glue is not only fairly simple; its also a very small part of the picture. Focusing on Model-Glue does not, by itself, do much for you. Where a developer needs to focus is on their applications model. This is, after all, where the magic really happens. By building a cohesive and well-encapsulated model you will ease the maintenance of your application over time. In turn, you save money and lower total cost of ownership.

The feedback I received was overwhelmingly positive and I can’t wait to do it again.

By the way, if youre in a corporation, Im happy to come on-site to provide your team with training. Just send us a message and well get the ball rolling.

In your expert opinion, where should I hold the next general audience class?

The Sales Job at Alagad is Still Unfilled

A couple weeks ago I posted a long-winded blog entry announcing that Alagad is hiring a Senior Sales Representative. It turns out that this is quite a challenging prospect and I was wondering if the blog community could help.

Alagad is looking for a really, really, good sales person. Someone who can sell millions of dollars of custom software a year.

The problem is twofold:

  • The sales people who can produce are already employed and not trolling around career builder looking for my advertisement.
  • The sales people who can produce probably don’t read my blog.

That’s where you come in. If you know of someone who works for a consulting company selling custom software, could you please mention this job to them? I would sincerely appreciate it.

I plan to compensate the right sales person generously in terms of salary and commission. They’ll be coming in at the top and will hopefully be able to work with Alagad to define our entire sales strategy. Over time I think this will become a director-level position where growth is virtually unlimited.

So, if you have any friends or acquaintances you think might be interested, please put them in contact with me:

Doug Hughes, President
Alagad Inc.
dhughes@alagad.com
888-Alagad4

Details of Tuning A ColdSpring, Model Glue, Reactor Application On ColdFusion 8

This is a real world and recent example of tuning a ColdFusion application built in ColdSpring, ModelGlue and Reactor. Here are the details of the system and software this application is running on.

  • Windows Server 2003 – IIS6
  • ColdFusion 8 Enterprise with Sun JVM 1.6
  • ColdSpring
  • ModelGlue 2.0.304(2)
  • Reactor revision 4.29

In the interests of really making this posting interesting I am going to start with two graphs. Both show the performance of the application under load using a load test tool with scripts modeled from user behavior. This first graphic shows response times in our first load tests and before any tuning.

Before

This graphic shows a spectrum of request response times during a 15 Minute, 50 vUser load test with 8 second think times between “clicks. The bottom left access shows request response times with the fastest on the left progressing to the slowest at the bottom right axis point. The bottom right to farthest right shows the duration of the test, from the start at the axis point to the end at the far right.

I realize the numbers are hard to read but by looking at the 3D colored bars we can see that most of the requests response times are in the mid to high range, many of them being in excess of 5 seconds in duration. The next graphic shows the performance results of the application running the same 15 Minute, 50 vUser load test with 8 second think times load test, after JVM, code and SQL tuning.

After

The bottom left access shows request response times with the fastest on the left progressing to the slowest at the bottom right axis point. The bottom right to farthest right shows the duration of the test, from the start at the axis point to the end at the far right. I realize the numbers are hard to read but by looking at the 3D colored bars we can see that most of the requests response times are now much faster, many of them under 1 second in duration. So that is the graphical representation of what we achieved, here is what we achieved and how we achieved it.

The JVM

I firmly believe in starting by optimally tuning the JVM before the application code, SQL etc. My reason is this; in the web application world, most applications consist of objects that are created and hopefully destroyed fairly quickly. That is because the request lifecycle is fairly short lived, as opposed to something like a client-server paradigm where objects can and often will remain referenced for considerable amounts of time. For this reason there are typically some overall good practices that I have found to be beneficial, here they are.

Before making any changes to the JVM I enabled verbose garbage collection logging and referred to that log throughout the testing. Full details of how to do that can be found in This previous blog posting.

  • Set a Start Size for the JVM Memory Space (The Heap). This is done by passing the Xms{number}m (where {number} is a numeric value) argument to the JVM at start up. I would typically set this to 50% of the size set in the Xmx setting. If there is no Xmx setting I have found that safe “rule of thumb” settings of -Xms512m and -Xmx1024m, where a server has a minimum of 2GB of Ram. I like to do this by adding it directly to the jvm.config file in the Arguments to VM. The reason that I favor directly editing the jvm.config file is because there have been occasions where doing this via the CF Admin GUI has caused problems with the jvm.config file not being written correctly. In fairness, this was in earlier versions of ColdFusion. For the server I am reporting on in this test the setting was set to -Xms512m -Xmx1024m.
  • Another good practice is the set a start size for the Permanent Generation of the JVM Heap. On looking at the verbose garbage collection logs I often see this generation, also known as the permgen; fills up. The permanent generation is where the Servlet Container (JRun) has it’s working objects and I like to set both a start and maximum size by passing these settings via the Arguments to VM – -XX:PermSize=128m -XX:MaxPermSize=192m. These are the settings for the server in this test and again are usually fine on a system with a minimum of 2GB of RAM. The permanent generation is in addition to the other maximum and minimum that we set in the paragraph above, not a part of it.
  • Full Garbage Full GC’s) collections are another issue which comes very much down to tuning to find the right balance of enough Full GC’s and not too many. From what I have seen and this is a general observation. the Sun 1.4x and prior JVM’s seem to run Full GC’s at 60 second intervals. The 1.5 and 1.6 JVM’s seem to run too few Full GC’s, I have seen almost out of memory issues and the Xmx (maximum heap size setting) does not always kick in properly on 1.5 and 1.6 JVM’s. Typically I set explicit Full GC’s to occur on a regular basis, I have found that setting them to occur at 10 minute intervals is a good interval. However, in the case of this particular system I was load-testing, I was still seeing the heap running low on memory. So I set Full GC’s to run at 5 minute intervals with the following via the Arguments to VM -Dsun.rmi.dgc.client.gcInterval=300000 -Dsun.rmi.dgc.server.gcInterval=300000. I am still working through what effects the 1.5 and 1.6 Sun JVM’s have on ColdFusion and Flex/Air based applications.

All the settings here were arrived at by testing different settings against both the default CF8 JVM (Sun 1.6) and also the 1.5 as suggested in other postings to be a better JVM for cfc heavy apps; as this one certainly is. In actual fact and after doing lots of load testing the default 1.6 JVM proved to be much more responsive; I will enumerate this shortly.

ColdFusion

The first thing I did in ColdFusion was to establish Metrics Logging and split out the JRun logs by using the log.level modifier. This is really useful in my opinion, as the state of the memory, number of threads in use and if enabled number of current sessions are shown at regular intervals in the metrics log and the out log. This latter log is particularly useful in troubleshooting as all errors end up there. Being able to see problems and the state of threads and memory at or around the same time can be really helpful in pinning down cause and effect.

There were two settings in CF Admin that I also set, the first was to turn on Trusted Cache which is recommended for Production servers. With a formal deployment process in place, Production code should only change as and when needed, so Trusted Cache is a very beneficial setting. In addition I turned off the saving of Class Files to disk. Once again, these changes were made in a controlled fashion and load tested after changing to determine their effects. At this point I had tuned the JVM and ColdFusion and as a result we saw average request response times drop from over 10 seconds to just over 4 seconds; so this was a good start achieved by server side tweaks only.

SeeFusion

This is a ColdFusion 8 installation so I am sure there are those who could rightly question why I would choose to use a third-party tool in addition. I have used SeeFusion since before it was officially launched; I was a beta tester. What I found is that SeeFusion would often continue to respond after ColdFusion had stopped responding altogether. This is because SeeFusion runs at the Java layer and the Servlet Container and the JVM will often continue responding and reporting after the Application server has failed. In addition, it is very easy to set up SeeFusion to log information to a database and I had set this up before starting the load testing. This information enabled us to now start homing in on code level-SQL issues. Here is a screen shot taken during the initial load test of the JVM 1.5 tuning, this may be hard to read but we are almost out of heap memory, indicated by the yellow bar graphic at top left and the bottom left graph; I have found that both 1.5 1nd 1.6 Sun JVM’s have different memory handling characteristics to 1.4 and prior in the sense that Full GC’s occur far less frequently; hence my comments above in the JVM tuning section.

Here are some blow-ups of the “almost-out-of-heap-memory” parts of the SeeFusion monitored load test. Firstly the bar chart showing a yellow warning status…

And here is the bottom right graph showing the rapid rise in memory usage as the tests progressed…

Code and SQL

At this point I consulted the considerable programming-architectural talents of Alagad. By using the SeeFusion data logged into a SQL Server database we began to focus on the slower responding requests. Here are some examples with some pointers to what we interpreted them to be.

In this sort of situation we can see that the “TotalTimeMS” and “TimeToFirstByteMs” are very close but that “TotalQueryTime” is low; these are all milliseconds by the way. What this suggests is that there is either a network-communication issue back to the browser or a lot of ColdFusion churn to produce the GUI.

Here we see a fairly long “TotalTimeMs” with “NumQueries” of 16. We identified this as possibly a lot of queries for a single request and were able to reduce this to 3 queries after investigation.

Here we see “TotalQueryTime” here shows as 19095 which of course is very long so we looked at the SQL involved in this query and were able to reduce the run time down to just over 2 seconds. We carried on identifying long running code and queries, doing this via the SeeFusion data, in reality there are usually not that many different slow requests and by using this methodology we improved performance dramatically by methodically dealing with each prioritized issue.

This was a typical, successful application tuning engagement. At the commencement of the assignment we had slow application response time. In our initial load testing a 50 concurrent vUser (virtual user) test produced average request times in excess of 10 seconds. By the end of the engagement we had average page response times at just over 500MS.

After tuning the JVM and the application, I wanted to go back and run some more tests on both the 1.5 and 1.6 JVM’s just to make sure that the results I had been getting throughout this assignment, were still borne out; that 1.5 was not more efficient than 1.6 in ongoing usage; I had NOT set out to prove that by the way, in fact it was the first thing I tried at the onset of load-testing was to switch to 1.5.

Firstly, here is the summary of the 50 vUser test on the default CF 8 JVM (Sun Java 6 or 1.6):

** Results per URL for complete test **
URL#1 (): Average Click Time 302 ms, 300 Clicks, 0 Errors
URL#2 (): Average Click Time 237 ms, 300 Clicks, 0 Errors
URL#3 (): Average Click Time 244 ms, 300 Clicks, 0 Errors
URL#4 (): Average Click Time 241 ms, 300 Clicks, 0 Errors
URL#5 (): Average Click Time 248 ms, 300 Clicks, 0 Errors
URL#6 (): Average Click Time 191 ms, 300 Clicks, 0 Errors
URL#7 (): Average Click Time 194 ms, 300 Clicks, 0 Errors
URL#8 (): Average Click Time 210 ms, 283 Clicks, 0 Errors
URL#9 (): Average Click Time 254 ms, 270 Clicks, 0 Errors
URL#10 (): Average Click Time 299 ms, 257 Clicks, 0 Errors
URL#11 (): Average Click Time 488 ms, 251 Clicks, 0 Errors
URL#12 (): Average Click Time 4,573 ms, 250 Clicks, 0 Errors
URL#13 (): Average Click Time 972 ms, 250 Clicks, 0 Errors
URL#14 (): Average Click Time 501 ms, 250 Clicks, 0 Errors
URL#15 (): Average Click Time 279 ms, 250 Clicks, 0 Errors
URL#16 (): Average Click Time 249 ms, 250 Clicks, 0 Errors
URL#17 (): Average Click Time 333 ms, 250 Clicks, 0 Errors
URL#18 (): Average Click Time 1,370 ms, 250 Clicks, 0 Errors
URL#19 (): Average Click Time 615 ms, 250 Clicks, 0 Errors
Total Number of Clicks: 5,161 (0 Errors)
Average Click Time of all URLs: 590 ms

Here is the summary of the same 50 vUser test on the Sun Java 5 or 1.5.

** Results per URL for complete test **
URL#1 (): Average Click Time 1,078 ms, 254 Clicks, 0 Errors
URL#2 (): Average Click Time 877 ms, 250 Clicks, 0 Errors
URL#3 (): Average Click Time 1,111 ms, 250 Clicks, 0 Errors
URL#4 (): Average Click Time 1,450 ms, 250 Clicks, 0 Errors
URL#5 (): Average Click Time 1,445 ms, 250 Clicks, 0 Errors
URL#6 (): Average Click Time 888 ms, 250 Clicks, 0 Errors
URL#7 (): Average Click Time 683 ms, 250 Clicks, 0 Errors
URL#8 (): Average Click Time 496 ms, 250 Clicks, 0 Errors
URL#9 (): Average Click Time 485 ms, 250 Clicks, 0 Errors
URL#10 (): Average Click Time 492 ms, 250 Clicks, 0 Errors
URL#11 (): Average Click Time 1,006 ms, 250 Clicks, 0 Errors
URL#12 (): Average Click Time 8,400 ms, 250 Clicks, 0 Errors
URL#13 (): Average Click Time 2,254 ms, 250 Clicks, 0 Errors
URL#14 (): Average Click Time 1,380 ms, 250 Clicks, 0 Errors
URL#15 (): Average Click Time 810 ms, 250 Clicks, 0 Errors
URL#16 (): Average Click Time 709 ms, 242 Clicks, 0 Errors
URL#17 (): Average Click Time 749 ms, 221 Clicks, 0 Errors
URL#18 (): Average Click Time 3,849 ms, 211 Clicks, 0 Errors
URL#19 (): Average Click Time 1,743 ms, 206 Clicks, 0 Errors
Total Number of Clicks: 4,634 (0 Errors)
Average Click Time of all URLs: 1,559 ms

I have found that the only way to really determine the correct garbage collector to use for your applications and what the correct arguments are to pass to the JVM, via the jvm.config file, to effectively manage garbage collections is to first determine how many Full GC’s are occurring and control that if there are too many. Then load-test the application whilst tuning the arguments passed to the JVM, typically via the jvm.config file.

The Tiers Of A Clone – High Availability Considerations

I am really trying hard not to make this blog post boring andas a died in the wool server engineer, that is fairly difficult for me. In reality though, this is a fairly important topic, in fact very important to anyone thinking about High Availability (HA). In reality all deployed websites have to have some aspect of HA. By the essence of what they are and do, they need to be available when needed. So need to be “highly-available”. What I amconsidering in this postare two relatedyet different overall concepts. The first is called “N-Tier” and here is a simple diagram showing a possible HA infrastructure…

n-tier

The first point of conjecture or thought here is how many tiers actually are there? My view is that where data passes from one device-segment to another we can class that as a tier. This is marginally inaccurate though as in the public Internet, there can be many tiers by that token. In this case and because we do not know exactly how many hops data will take in passing to and from End Users we can really only consider the Internet as a single tier. Here is some explanation of each tier as I have encountered them in the work I have done over the years.

  1. This is the End User tier this is where the request originates, typically this could be a person although not always, it could also be machine generated. Usually a request is issued to get return data of some kind but is some cases a request may be issue the start, pause or kill a process.
  2. The Public Internet in this case but it could also be an intranet networkor extranet. The way I would characterize that tier is that it contains no servers or data manipulation entities that return data to the End Users, except for possible network failure messages.
  3. This tier is the gateway-entrance to the application location. This tier contains routing, data protection (Firewall), traffic control and clustering devices. Again there are no servers in this tier in the sense that they would feed requested data back to the End Users, although, they could, once again, send back status-error messages.
  4. This tier is where the web-application servers sit and these are the servers which process the requests that come in fromEnd Users. Here is where my reference to the “clone” really becomes an essential element. In a clustered and perhaps load-balanced environment it is essential that all code is replicated in real time, otherwise if an End User is moved around different servers they could obviously get different results. In addition, maintaining state and any variables that relate to that in server memory is also an immense challenge. Imagine, for instance, a shoppingcart in the End Users session scope. If the End User isonce again moved around different servers they could lose the cart altogether.If the infrastructure is a J2EE/JavaEE onethere is something called Buddy servers. In this case session information can be clustered, although this can be tough to set up initially. The main point I am emphasizing here is that this typical High Availability (HA) environment is challenging to maintain yet nevertheless typical.
  5. These routers sit between the web-application servers and database servers and their job is to add a layer of added security to the data stored in the databases in tier 6.
  6. This is where the database servers sit and as with the web-application servers they are clustered. In my experience, clustering databases is considerably trickier than clustering web-application servers. Typically the clustering is via algorithms which come with the database software itself. If the requirement is an optimal active-active load balanced cluster where each DB server is in use, then mirroring of the data between DB serversis necessary. In the case of the system here; that is mirroring between 4 servers.

So in this illustration we have looked at a typical High Availability (HA) environment in an “N-Tier” paradigm. This is the most typical kind of HA infrastructure and mostly varies only in having more or less hardware. There is no doubt that lots of equipment like this offers good levels of availability althoughthat comes with challenges, some of which I have listed above. One thing that is certainly impacted by this sort of infrastructure, in my experience, is performance. The more pieces of hardware that data has to pass through the less efficient the performance. I realize that this sounds somewhat counter-intuitiveto others in this series of articles, this is not my intent if so. There is another school of thought emergingwhich is another way of looking at application-data architecture and flow. This is called “N-Layer” and in essence concentrates on the kind of data or state of data as it passes through an application. That will the subject of my next post in this series and it introduces some controversial notions and concepts.

IM Gateways and ColdSpring: Odd Behavior and a Workaround

I’ll blog about this later, but I’m writing a time tracker that uses a Model-Glue application for user and project management and reporting and a Google Talk IM gateway for input. The idea is to create a simple IM-based command line to allow us to input and log our tasks in a database as we go through our days and then be able to pull our timesheets directly from the database. My architecture is simple… it’s a basic implementation of the Command pattern and the commands are registered with ColdSpring so that the first word in a message sent to the gateway has to match a bean registered in ColdSpring.

Simple example: if you send the word “time” to the IM gateway, you get back the date and time. Internally, the gateway fetches the CS beanfactory from the application scope and calls beanfactory.getBean(“time”).execute() and returns the result (a string) to the IM user. I’ll get into the actual architecture in another blog post, but I needed to lay out that much background in order to get on with the rest of this post.

I was perpetually getting ColdSpring errors saying the beans weren’t available. It made no sense, because my HTML pages could get to the beans just fine… the only place I would get these errors was through the IM gateway. To make it even stranger, if I’d run the homepage for the app and then a test HTML page I created in the same directory as the gateway CFC, suddenly my beans would show up again. And, as things often get weirder before they get saner, I have all the calls to the getBean() wrapped in <cfif BeanFactory.containsBean(foo)> and that would pass!!

So I could get to the BeanFactory in the application scope, and containsBean() said it was in the config file, but getBean() would blow up claiming the bean didn’t exist… very odd. And I haven’t yet quite figured out what to do about the actual problem… but I DID find a workaround.

The fix I found was this: by adding lazy-init=”false” to my command beans within the ColdSpring config file, I was able to get these errors to go away.

The moral of the story? Know your Spring/ColdSpring config file options for the bean tags, because one day that bit of knowledge may save you a great deal of pain, frustration, and lost time.

Tag Cloud