The amazing adventures of Doug Hughes

In this post series, I would like to put forward a hypothetical situation involving poor ColdFusion application performance, the investigative steps to take to isolate the issues, and the remedial steps to perform in order to solve them. I would also really like feedback from readers: hypothesize about possible issues and resolutions, and suggest other tools or methods that may identify or solve the issues we discuss. I hope that this series will not only help you identify and deal with ColdFusion issues, but also help you identify database, network, or hardware issues as they arise. Note: This hypothetical situation, while pulled from my experiences, is not a direct parallel to any of my previous customers, and is instead a combination of factors from several different projects. Let’s call it Project X.

The Environment

Project X is set up to run across (4) ColdFusion 8 Enterprise Edition servers, in a load-balanced cluster behind a hardware load balancer. Sticky sessions are configured, so once a user makes a request to a given server, their subsequent requests should continue on the same server. Project X has a single MS SQL 2005 database on a 32-bit Windows 2003 server, which has 4 GB of RAM. This server has (3) 15K 75 GB SCSI hard drives in RAID 5, upon which the operating system and the MS SQL binaries are installed. There is an iSCSI-connected device which has (8) 10K 147 GB SCSI hard drives. The iSCSI device contains the MS SQL data and log files. Each Project X web server is a 32-bit Windows 2003 server with 4 GB of RAM and (3) 15K 75 GB SCSI hard drives in RAID 5. Each web server is running 2 instances of ColdFusion in a local cluster (using round robin to split requests between instances), and each ColdFusion instance is using the default JVM configuration that ships with ColdFusion 8. There is a shared folder on the MS SQL server which contains all shared page assets (files uploaded by the users, PDF documentation, and images). Each web server is running Apache, and has an alias pointing to the shared folder on the SQL box (using a UNC path).

All servers are connected to a 24-port gigabit switch, and are hosted on an OC3 line. Project X is a web-based file sharing application which allows users to upload and share files of many types (images, PDFs, office documents, and more). It makes use of Application.cfm to load site variables, and uses several CFC objects to encapsulate database queries and user information.

The Problem

For several years this configuration has worked fine for the customer, with stable servers and acceptable response times. Project X has recently run an ad campaign in the national media, which has increased their site traffic by a factor of 2. Since the campaign, users have been complaining of slow web response times, as well as error messages. Investigating the server logs also shows that the ColdFusion instances have been crashing with out of memory errors.

You are tasked with uncovering the issues that are causing the slow page rendering and the out of memory crashes. In my next post I will share both the techniques I use to identify these issues and a selection of ideas from the comment responses to this post. Thoughts?

Bonus

Now, to encourage participation here, anyone who contributes a relevant comment to this chain of blog posts will be entered into a drawing to win an Alagad backpack (trust me, these are the best backpacks ever, I use mine for travel, school, everything). I will do a drawing for the backpack in a Connect room after this blog series is completed, so let’s bring on the ideas!

Comments on: "Troubleshooting Coldfusion Performance: The Problem" (26)

  1. ejholmgren said:

    Check for improperly scoped variables that may be causing memory leaks? The app’s initial usage level may have masked the issue by allowing enough time for garbage collection to occur before it filled up the heap.

  2. Alan McCollough said:

    When in doubt, blame the JVM.

  3. Charlie Griefer said:

    Can we just move ’em to the cloud and increase the number of instances after national ad campaigns? 🙂

  4. Dennis Clark said:

    I’d check for session replication as I’ve seen it as a source of scalability issues. Session replication causes every user session to consume memory on every ColdFusion instance in a cluster.

    In this scenario you have 8 instances of ColdFusion (4 servers x 2 instances per server), and I believe the default memory limit for 32-bit JVMs is about 1.5GB, so from a performance perspective session replication would be similar to a single application server with an 8-core processor and only 2GB of RAM. Out-of-memory issues are a likely symptom of such a setup.

  5. Michael Kelly said:

    I agree with Dennis, but would like to add to his solution.

    In addition to removing session replication, I would upgrade the servers to 64-bit with CF8. You can then take advantage of the ability to increase the memory limit of the JVM to as much as the server can handle. This will both allow the servers to handle increased usage and give garbage collection a chance to do its job. Once memory use is at or above 95%, garbage collection doesn’t stand a chance, especially if usage is high.

  6. Mark Mandel said:

    I would set up the servers to take a heap snapshot when OutOfMemory exceptions occur.

    From there I would use a tool like the Memory Analyzer Tool (MAT) in Eclipse to introspect the heap and find out what is taking up the memory.

    From that data, I could probably recommend some changes to the code, the JVM, or the infrastructure (more RAM, more machines etc), depending on what was going on.

  7. Eric Pell said:

    Since I’m no CF/Java guru I’ll address some Windows and hardware concerns that can be remedied w/o upgrading the equipment. First off, since the servers are running out of memory, the combined jrun processes from each of the two instances of CF per web server are probably hitting up against Windows 2k3 32-bit’s 2GB user-mode process memory limit. Although the servers have 4GB of RAM they can only use up to 2GB for non-OS-related business. Ideally, rebuild each of the 4 balanced servers using Windows 2k3 64-bit. If you need a more immediate patch, then flip the /3GB switch in boot.ini to allow the JVM instances 50% more memory per webserver and reboot.

    Your slow page rendering can also be caused by over-utilization of the single fileserver/SQL server. Windows gets upset if too many SMB requests hit the fileserver at the same time and will quickly become unresponsive. At this point I would look into using Apache’s local caching abilities to locally store frequently accessed files from the primary file share on the web servers. One hardware caveat here that should be addressed is that each webserver is running a 3-disk RAID 5. Since each of the 4 webservers should be a quickly redeployable copy of the others, I would rebuild each server with a RAID 0 array to further improve read/write performance of this system, as disk redundancy isn’t a critical issue.

  8. Doug Hughes said:

    Just as a reminder to those reading this, this is a hypothetical situation which we’re exploring on the blog as a fun educational opportunity for the community at large. We’re not crowd-sourcing a solution for an actual client.

    We plan to have a few more fun contests like this to see if we can generate any really interesting conversations.

  9. Mark Ireland said:

    I would start by assuming that this is an issue with garbage collection failing, so (like Mark M) I would set up an alert to email a snapshot when OutOfMemory exceptions occur. If double the traffic needs more than double the RAM then some tuning is needed. (I would do a quick check to see if the number and size of files uploaded by users had doubled.) In the absence of a test server I would trial some threshold values that trigger the alert and watch the Performance Monitor.

  10. Aaron Longnion said:

    Oh, where to begin?

    1) The default heap for the JVM on CF8 out of the box is only 512 MB; raise it to about 1.4 GB.

    2) Update from the default JVM to 1.6 update 11+ to improve CFC performance: http://corfield.org/blog/index.cfm/do/blog.entry/entry/Java_6_and_ColdFusion_8

    3) Use SeeFusion or the CF8 Server Monitor to look for slow running pages: often these are Gateways, Scheduled Tasks, or heavy processes that use cfthread, etc.

    4) Noting that a) there are a lot of file uploads/updates and b) the traffic load *used* to be fine, but the servers are slow now that traffic has doubled – often times this is an I/O issue, where under load a lot of writes to a single disk can cause slowness or file locking issues. If you have multiple instances on multiple servers, this compounds the problem because proper cflock’ing cannot be done on cffile operations in a cluster.

    5) upgrading to faster disks may help, but it would be better to upgrade to a NAS solution or other dedicated file system (not sharing with the database), so that you’re not sharing resources.

    6) some of the queries, that were performing fine before, may start performing worse over time due to a) more load, b) db index fragmentation [indexes need regular tuning and reindexing], and c) more data over time

    7) look more closely at the CF and JRun logs to see if you can determine *which* requests the JVM out of memory errors come from

    8) in the CF Admin, since there’s a lot of file uploads, check that the “Request Throttle Memory” setting is appropriate for your system/load

    9) do load testing on an *exact* replica of your production system

  11. Gert Franz said:

    A tool I would use instantly is Fusion Reactor. It gives tons of information about the server. So my guesses:
    – having so many new sessions in the cluster might cause problems with memory: if the session timeout is too high and a lot of data is stored in the session scope, that would explain the issues.
    – deadlocks might cause a problem as well, but this is of course a problem with the application’s architecture
    – sticky sessions might be a problem as well, since they do not take the load of a given server into account. If a server is already under heavy load, it still gets requests from its users due to the sticky session setting.
    – Besides that I would look at the debugging output and check what the reasons for the slow requests are: database or file accesses.
    – In addition, of course, tuning the JVM and looking for slowdowns due to garbage collection is something I’d do as well

    Gert Franz
    Railo Technologies

  12. Geoff said:

    Maybe SQL queries hitting TempDB on disk? (leading to slow page loads / queued requests). I’ve noticed CF is pretty sensitive to database performance…

    Run SQL Server Profiler for a while and poke through all your queries to tune them.

    @Eric – I’ve found using the /3GB switch does not allow me to increase the JVM allocation (using 32-bit, W2k3, 4GB RAM, CF8 here) – the max I could get was about 1.8GB with or without the switch…

  13. Lola LB said:

    I don’t know anything about administering servers and such, so here’s my stab in the dark. How about checking the SQL queries being run and seeing if there are any that are being processed longer than necessary? Maybe check the MS SQL configurations to see if there are holes being left open?

  14. Kevin Miller said:

    You haven’t detailed if or how sessions or client variables are being utilized in the application. Please do.

  15. Chris Peterson said:

    OK, awesome responses so far, let me see if I can briefly reply to everyone here =)

    ejholmgren – that is a great idea, unscoped variables can certainly cause grief, especially when you are observing out of memory issues.

    Alan McCollough – you said, ‘when in doubt, blame the JVM’. Can you clarify that? In a few instances I have found that tuning the JVM can help application performance, but most often the underlying issue has been a bad algorithm, or poor database performance, or under-provisioned hardware. Can you share any anecdotes on why you suspect the JVM first?

    Charlie Griefer – Yup, throwing hardware of any kind at a problem is always a solution. For this hypothetical situation, let’s assume we have no budget and must work within the hardware we have.

    Dennis Clark – I agree, session replication can be a pig on several system resources. Project X is making use of sticky sessions, and let’s say that they are not using session replication. Maybe this is causing some of the user errors when an instance runs out of memory and their session is gone?

    Michael Kelly – I agree that upgrading to a 64-bit OS and ColdFusion would remove some of the memory constraints, and could be beneficial. Let’s consider, though, that they are on a local 2-instance cluster, which necessitates an ‘admin’ or management instance (to route the requests between the 2 CF instances). So, in reality, we have (3) instances of ColdFusion running, all sharing a jvm.config (which is the default), so they could easily overrun the physical memory available on the system, even if we switched to 64-bit. For this hypothetical situation, let’s assume that we have to stick with a 32-bit OS and JVM.
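To make the memory arithmetic in that reply concrete, here is a minimal sketch in Python. The 4 GB of RAM and the three instances come from the scenario, and the 512 MB and 1.4 GB heap sizes are the figures quoted elsewhere in this thread. The 256 MB of non-heap overhead per JVM and the 512 MB reserved for the OS and Apache are assumed round numbers for illustration, not measurements.

```python
def jvm_footprint_mb(instances, max_heap_mb, overhead_mb):
    """Worst-case memory needed if every JVM fills its heap."""
    return instances * (max_heap_mb + overhead_mb)


def fits_in_ram(instances, max_heap_mb, overhead_mb, physical_ram_mb, os_reserve_mb):
    """True if all JVMs at full heap still leave the OS its reserve."""
    needed = jvm_footprint_mb(instances, max_heap_mb, overhead_mb)
    return needed <= physical_ram_mb - os_reserve_mb


if __name__ == "__main__":
    RAM_MB = 4096        # from the post: 4 GB per web server
    INSTANCES = 3        # two clustered instances plus the admin instance
    OVERHEAD_MB = 256    # assumption: non-heap memory per JVM
    OS_RESERVE_MB = 512  # assumption: memory kept for Windows and Apache

    for heap_mb in (512, 1400):
        total = jvm_footprint_mb(INSTANCES, heap_mb, OVERHEAD_MB)
        ok = fits_in_ram(INSTANCES, heap_mb, OVERHEAD_MB, RAM_MB, OS_RESERVE_MB)
        print(f"heap {heap_mb} MB x {INSTANCES}: {total} MB needed, fits={ok}")
```

Under these assumptions the default heap fits comfortably, while three instances at 1.4 GB each would need more than the box physically has, which is the point about a shared jvm.config.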

    Mark Mandel – That’s a bingo for sure, the ability to generate a heap dump when the JVM runs out of space is awesome, and has been very useful to put onto production systems, as it lets you get a snapshot from real data. I will share some of the tools you can use to analyze the heap dump in the next post, as well as some applicable profiling JVM switches that I have found to be a huge help.

    Eric Pell – I am not certain that even flipping that switch would enable ColdFusion to allocate additional memory, but simply observing Task Manager (or watching ‘top’ on a Linux system) would show you the memory usage of this system, and indicate whether that would help or not. In this case, there are 3 different JVM instances, and while observing them in production, they all seem to be cycling from startup, to rapid memory growth, to crashing and restarting again. Let’s say the OS is showing between 80% and 120% of physical memory in use (so it’s paging out). As an aside, how would you determine if the SQL server filesystem is a point of contention?

    Mark Ireland – It’s cooler than an alert, even: the JVM will auto-dump a heap snapshot if the crash was caused by an OOM condition, provided it was started with the -XX:+HeapDumpOnOutOfMemoryError switch.
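As a rough illustration, here is a small Python sketch that checks whether a jvm.config file already asks the JVM for a heap dump on OOM. The -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath options are standard HotSpot switches. The sample file contents and the D:/dumps path are made up for the example, and the parser assumes a single java.args line with no spaces inside argument values.

```python
HEAP_DUMP_FLAG = "-XX:+HeapDumpOnOutOfMemoryError"
DUMP_PATH_PREFIX = "-XX:HeapDumpPath="
ARGS_KEY = "java.args="


def read_java_args(config_text):
    """Return the JVM arguments from the first java.args line, or an empty list."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        if stripped.startswith(ARGS_KEY):
            # Assumption: arguments are separated by whitespace only.
            return stripped[len(ARGS_KEY):].split()
    return []


def heap_dump_settings(config_text):
    """Report whether heap dumps on OOM are enabled and where they would go."""
    args = read_java_args(config_text)
    dump_path = None
    for arg in args:
        if arg.startswith(DUMP_PATH_PREFIX):
            dump_path = arg[len(DUMP_PATH_PREFIX):]
    return {"enabled": HEAP_DUMP_FLAG in args, "path": dump_path}


if __name__ == "__main__":
    # Hypothetical jvm.config contents, for illustration only.
    sample = (
        "# Arguments to VM\n"
        "java.args=-server -Xmx512m -XX:+HeapDumpOnOutOfMemoryError "
        "-XX:HeapDumpPath=D:/dumps\n"
    )
    print(heap_dump_settings(sample))
```

If the flag is missing, adding it to the java.args line and restarting the instance is usually all it takes, though where the dump lands (and whether the disk has room for it) is worth checking first.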

    Aaron Longnion – let me try and respond to your ideas here.

    1) While I agree that the heap size could be increased, in this scenario there are actually 3 ColdFusion instances trying to grab system memory, and increasing their allocation would not solve the problem of running out of physical system RAM.

    2) I would agree that in 95% of cases, updating to the latest JVM will give you a speedup. I have seen a few instances where the latest is slower for some reason, so always benchmark before and after!

    3) Great plan, though I usually just enable slow page logging in the CF admin, then run the resulting server.log through a template that parses it to show me avg / min / max / count for each template that shows up in the log file.

    4) I would totally agree about the possibility of an I/O issue. I will share some tricks to determine this with greater certainty, but care to share any methods you have to find out if this is the case?

    5) Agree, it’s most likely more of a load on the database than it should be.

    6) The database is always a contender for the ‘bad guy’, and should be investigated even if your DBA insists everything is perfect =)

    7) Good idea, logs have great info when troubleshooting. Let’s assume that you saw nothing consistent that stood out when you investigated them.

    8) Makes sense, is there a way you use to determine what is an appropriate setting here?

    9) That is always good (and difficult) to do, both in simulating the production server (especially when you have difficulty simulating the real database cluster), and in simulating the production traffic. I prefer non-invasive testing methods when I can, or a true mirror of production under simulated load, any day.
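The slow page summary mentioned in point 3 can be sketched roughly as follows. This is Python rather than a CFML template, and it assumes the slow page warnings in server.log contain text of the form "processing template: <path>, completed in <N> seconds"; check a few lines of your own log before relying on the pattern. The sample lines are invented.

```python
import re
from collections import defaultdict

# Assumed wording of the slow page warning; adjust to match your server.log.
SLOW_PAGE = re.compile(
    r"processing template: (?P<template>.+?), completed in (?P<seconds>\d+) seconds"
)


def summarize_slow_pages(lines):
    """Return count / min / max / avg seconds per template found in the log lines."""
    timings = defaultdict(list)
    for line in lines:
        match = SLOW_PAGE.search(line)
        if match:
            timings[match.group("template")].append(int(match.group("seconds")))

    summary = {}
    for template, seconds in timings.items():
        summary[template] = {
            "count": len(seconds),
            "min": min(seconds),
            "max": max(seconds),
            "avg": sum(seconds) / len(seconds),
        }
    return summary


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "Thread: jrpp-1, processing template: /site/upload.cfm, completed in 40 seconds, exceeding the 30 second warning limit",
        "Thread: jrpp-7, processing template: /site/upload.cfm, completed in 60 seconds, exceeding the 30 second warning limit",
        "Thread: jrpp-3, processing template: /site/list.cfm, completed in 35 seconds, exceeding the 30 second warning limit",
    ]
    report = summarize_slow_pages(sample)
    # Most frequently slow templates first.
    for template, stats in sorted(report.items(), key=lambda item: -item[1]["count"]):
        print(template, stats)
```

Sorting by count surfaces the templates that are slow most often; sorting by max instead highlights the occasional very long request.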

    Gert Franz – Good ideas here, and I agree that deadlocks or application / algorithm locks can easily be culprits. Many times I have seen a clever solution to a concurrency problem totally melt when subjected to heavy or unexpected load.

    Geoff – I 100% agree that ColdFusion is quite sensitive to database performance, and I have yet to find an application that cannot be sped up after some database tuning or query cleanup, parameterization, or even query elimination. SQL Profiler is a great tool, and I go over how to use it in my presentation at CFUnited.

    Lola LB – good idea for sure, I will share some tools to determine if the database is holding you up, as well as how to resolve the issues in upcoming posts.

    Kevin Miller – Let’s assume that sessions are not being replicated, session scope has a few simple numeric values to hold user login state, and client variables are not used.

    Thanks everyone for the contribution to the discussion so far, keep em coming, and look for the follow up post today!

  16. Tom Forrest said:

    First off, I’d want to find out more about the out of memory errors. “java.lang.OutOfMemoryError” is a very generic message and the details that follow that error will usually point you in the right direction. Ways to look into this include turning on verbose GC, Fusion Reactor, or even CF’s built-in monitors, any of which can shine a light on what’s going on under the hood.

    Depending on what’s going on, it might be worth the effort to branch some people off to review the code base to see if anything can be optimized from that side while server efforts are on-going.

    Sometimes simply opening up the JVM (setting Xmx to a higher value) to use more memory will fix several issues. But not always. Again, need more info about that out of memory error. Settings to the JVM may need to be adjusted for new space, old gen, or any of the other various memory pools that live inside the JVM. Using a different GC method that is optimized for the hardware can be helpful as well as making sure that you’re on the latest JDK that your system (CF) will support.

    CIFS sucks. Plain and simple. Mapping to your code base via UNC path for production systems is unacceptable. This item alone could be causing all of your issues. The longer CF has to wait for files the more the queue backs up. The more the queue backs up, the quicker CF will fail. I’d look to a file sync utility or DFS or if you’ve got coin to blow, something like GPFS from IBM to provide a consistent code base across all of your servers using a SAN.

    4 decently sized app servers with 2 instances each and a dedicated SQL server (if sized correctly to handle the load from 4 app servers), even in a 32 bit configuration ought to be plenty of horsepower for this project.

    However, in this dream world where money is no object, I’d be looking to upgrade everything to quad processor, quad core systems. 16GB for the app nodes and 32+GB for the db server. No local disks. Everything comes from the SAN. 3 instances of CF per box to start (each sporting -Xmx4096m and -Xms4096m), clustered, with no session replication (as discussed above). A round of 64-bit Windows 2008 for everyone at the bar.

    Depending on what’s going on on the SQL box and your application, maybe either scale that box way up or add a 2nd machine to handle half the databases.

  17. 0) Update to the latest 1.6.0_11 Sun JVM
    1) Run varScoper and fix issues
    2) increase max heap to something like 1128m and max perm size to 192m and make sure UseParallelGC is on.
    3) turn on trusted cache and save class files
    4) Ensure the “Request Size Limits” in CF admin are tuned to not overflow memory or queue indefinitely.
    5) Use software to view your JVM memory in more detail. You can
    6) Call Alagad if above does not work.

    Good post to get people thinking.

  18. Alan McCollough said:

    Chris, you ask why I say blame the JVM, so I’ll tell you. I’m not particularly adept at JVM tuning, so it’s an easy target for blame. Blame the unknown. Thinking of the ol’ 7-layer OSI model, your physical stuff looks good. Networking, good (gig across the board? What’s not to like?). Moving further up the model, I’m thinking this portion of the description “Each web server is running Apache, and has an alias pointing to the shared folder on the SQL box (using a UNC path).” is possibly an issue, as I’m not a big fan of webserving across shared paths (more an IIS thing than an Apache thing), but it’s probably not it (why not? Because I think it so).

    So, the JVM being the one item in the formula that I’m not comfortable with, I’d blame it. I mean c’mon. How many people honestly, as in “answer the question without using Google” honest, know what those JVM settings are and what they do? I’ll admit it, I don’t. And seeing how they have a dramatic impact on the server’s performance, that’s exactly where I’d focus my efforts. Shamelessly google up all the info I could on the current settings and find out how to go about testing alternate settings to see if they make a difference. Blame the unknown. Blame the JVM.

  19. Tom Forrest said:

    Chris – in your reply to Michael Kelly:

    “Michael Kelly – I agree that upgrading to a 64-bit OS and ColdFusion would remove some of the memory constraints, and could be beneficial. Let’s consider, though, that they are on a local 2-instance cluster, which necessitates an ‘admin’ or management instance (to route the requests between the 2 CF instances). So, in reality, we have (3) instances of ColdFusion running, all sharing a jvm.config (which is the default), so they could easily overrun the physical memory available on the system, even if we switched to 64-bit. For this hypothetical situation, let’s assume that we have to stick with a 32-bit OS and JVM.”

    You don’t need an admin instance to route the traffic to other instances on the box. That’s the connector’s job. And I don’t recommend them all sharing the same jvm.config file. Doing so would prevent you from being able to do rolling upgrades in some cases. The default cfusion instance can be deleted. But in my experience, it’s best to leave the default cfusion folder on the system.

  20. I am to submit a report on this niche, and your post has been very, very helpful.

    Regards

  21. You seem to have got the niche from the root, Awesome work

    Regards

  22. Hi, can someone please help me?

    I’m setting up the JVM alert in the CF8 Server Monitor. The JVM threshold allows a maximum of 2000 MB, but I want to set it to 3 GB.
    Can anyone tell me how to increase the JVM threshold value in server monitoring?

    I appreciate your help.

    Thanks
    sekhar
