Where Oh Where Did My Bottleneck Come From?
At one point in my career I thought I knew what performance tuning was… I would have described it as “writing tight code.” (Remember that phrase? “Tight code”?) And if I were to be tasked with fixing performance issues, I would have jumped on my computer, gotten a local copy of the app going, started hitting pages to find the slow ones, then looked at the code for ways to fix the issue. That, however, was before I knew what performance engineering really is.
The problem with that is the fact that no application runs in a vacuum… that’s the whole point of the web. Between your users and your application are myriad other systems: browsers and the hardware that runs them, copper, fibre, switches and routers, oh my! Even between the local gateway and your application there is probably a web server, maybe a load balancer, a switch or two, and some sort of box to hook your 1Gb Ethernet backbone up to the real interwebs. Essentially, your application server is an island in the midst of a massive sea of technology, any part of which can involve itself in the performance and reliability of your application. If that weren’t the case, my old methods would work fine… but it ain’t that way.
Meanwhile, one of the burning questions we have to answer is “what is performance?” and its evil twin “what is poor performance?” What do we use to benchmark whether or not an application performs acceptably? That, at least initially, is a pretty easy question to answer: user perception of response times. There are other, more technical issues to be resolved down the line, but the first sign of trouble is when a user calls the support desk and complains that the login is taking 30 seconds to process.
So, where to start? As I said before, originally I would have started with the code, run the processes being complained about, “tightened up the code”, and pushed it to prod without a great deal more thought. Only later would I start to consider the DB connections, the NIC, switches, etc., but if one of those items is the problem then I just wasted 3 days rewriting a block of code for, what? Nothing. That code performed well enough as it was… and I never solved the problem.
I have heard it said that, by the time you’re ready to fix the problem, you should have 99% confidence that you know exactly what it is, and it’s true. 1% is a reasonable margin of error… I mean, there are always going to be times like when you installed the new lightbulb in the basement and flipped the switch but nothing happened. Who thinks to check the fuse panel when a lightbulb goes out? The thing is, that’s a cheap mistake. Spending days implementing a performance fix that misses the mark is not.
So we’re back to process, process, process. And it all starts with user perception. So we hook up JMeter to run a test plan that models real-world usage patterns (and we base those on web server logs, right?) and we do see a performance issue… what does it tell us? Heh, well that depends, doesn’t it?
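As a sketch of that “base it on the logs” step, here’s one way you might boil an access log down to a request mix to model in a test plan. The sample log lines and the Common Log Format parsing are my own illustration, not output from any particular server:

```python
import re
from collections import Counter

# Hypothetical access log lines in Common Log Format; in practice you'd
# read your real web server's log file.
LOG_LINES = [
    '10.0.0.1 - - [01/Mar/2024:10:00:01 +0000] "GET /login HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Mar/2024:10:00:02 +0000] "POST /login HTTP/1.1" 200 128',
    '10.0.0.1 - - [01/Mar/2024:10:00:03 +0000] "GET /dashboard HTTP/1.1" 200 2048',
    '10.0.0.3 - - [01/Mar/2024:10:00:04 +0000] "GET /login HTTP/1.1" 200 512',
]

# Pull the method and path out of the quoted request line.
REQUEST_RE = re.compile(r'"(?P<method>\w+) (?P<path>\S+) HTTP')

def request_mix(lines):
    """Count method+path pairs so a test plan can mirror real traffic ratios."""
    counts = Counter()
    for line in lines:
        m = REQUEST_RE.search(line)
        if m:
            counts[(m.group("method"), m.group("path"))] += 1
    total = sum(counts.values())
    return {req: n / total for req, n in counts.items()}

mix = request_mix(LOG_LINES)
# In this sample, GET /login is half the traffic, so it should get half
# the weight in the test plan.
```

The resulting ratios are what you feed into the thread groups and samplers, so the synthetic load looks like your real one.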
Yes, I’m about to talk about JMeter again… because it gives you great info: min/max/mean processing times, error rates, etc., all aggregated across as many requests as your test plan made. So if you run a JMeter test plan against the application from the same machine that’s hosting it, you should be seeing the application at its peak performance capability… your baseline benchmark.
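To make that aggregation concrete, here’s a rough sketch of the kind of summary JMeter produces, run against a few rows in its CSV result format (a subset of the columns, with invented sample data):

```python
import csv
import io
import statistics

# A few rows shaped like JMeter's CSV result output (subset of columns,
# invented data); a real run writes this to a .jtl file.
JTL = """timeStamp,elapsed,label,responseCode,success
1709290800000,42,Login,200,true
1709290801000,38,Login,200,true
1709290802000,950,Login,500,false
1709290803000,41,Login,200,true
"""

def summarize(jtl_text):
    """Aggregate min/max/mean elapsed time and error rate across all samples."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    elapsed = [int(r["elapsed"]) for r in rows]
    errors = sum(1 for r in rows if r["success"] != "true")
    return {
        "samples": len(rows),
        "min_ms": min(elapsed),
        "max_ms": max(elapsed),
        "mean_ms": statistics.mean(elapsed),
        "error_rate": errors / len(rows),
    }

stats = summarize(JTL)
# Note how one slow, failed request drags the mean well above the typical
# ~40ms response: that's exactly the kind of signal you're looking for.
```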
It’s often enough to confirm that the application itself is or is not the culprit, but if you don’t see the problem at the server, just move to the next network connection closer to the internet until you’re hooked up to the Wi-Fi at Starbucks. Have some cups of coffee, eat a scone or three or… well, bummer. You found a dying NIC and never made it to Starbucks. At some point you should definitely see a dramatic, unreasonable slowdown and, at the least, narrow the field of potential causes. Personally, though, I think the most critical part is to record your findings so that you have real, hard numbers to work with. I’ve actually taken to using a spreadsheet to track the numbers, with the added benefit that I can use charts and graphs to demonstrate the issue to stakeholders who often can’t tell a “long-running query” from a crescent wrench. Plus I can run many different sets of benchmarks, recording their results over time, and compare them all based on anything from time of day to season of the year.
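The spreadsheet habit can be sketched like this: each benchmark run gets timestamped and appended as CSV rows you can chart later. The field names, locations, and numbers here are hypothetical, just to show the shape of the record:

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical benchmark summaries gathered at different hops between the
# user and the app server; the field names are my own, not a JMeter convention.
RUNS = [
    {"location": "app-server-localhost", "mean_ms": 45, "error_rate": 0.0},
    {"location": "office-lan", "mean_ms": 70, "error_rate": 0.0},
    {"location": "home-broadband", "mean_ms": 310, "error_rate": 0.02},
]

def record_runs(runs, when=None):
    """Render each run, stamped with a timestamp, as spreadsheet-friendly CSV."""
    when = when or datetime.now(timezone.utc).isoformat()
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["timestamp", "location", "mean_ms", "error_rate"]
    )
    writer.writeheader()
    for run in runs:
        writer.writerow({"timestamp": when, **run})
    return buf.getvalue()

print(record_runs(RUNS, when="2024-03-01T10:00:00Z"))
```

Append the output to one growing file per benchmark set and you get the raw material for those stakeholder charts for free.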
Ultimately, though, this is about “reasonable performance”. It takes work to build a performant application delivery system… there’s no sense in building it to serve 10,000,000 hits a day with <20ms response times when it only needs to serve 10,000 requests and response times are allowed to be up to 100ms. Until you can’t do that, you don’t have a problem… what we’re trying to prevent is a user going to your domain, suffering through long and painful response times, and deciding that your website sucks. We need to concern ourselves with the perception that your application is horrendous and should be replaced with a teletype in a small room full of well-trained monkeys, however unfair that may be. Which is why, if you know what you’re doing, a tool like JMeter is really the place to start.
The point is that JMeter has the capacity to place a realistic load on your application, simulating as many users as you need so you can see how your application performs not just under load but from any of several different locations. You can run a tool like JMeter from a remote machine (you can even take your laptop home and use it from a consumer internet connection), from the local machine (to get a baseline of your application’s actual performance for comparison with other locations), and from pretty well anywhere in between.
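One way to use those per-location runs: subtract the local baseline from each remote mean to estimate what everything between the user and the server is adding. The numbers below are illustrative, not real measurements:

```python
# Mean response time measured on the app server itself (the baseline),
# versus means measured from remote locations; all values illustrative.
baseline_mean_ms = 45
remote_means_ms = {"office-lan": 70, "home-broadband": 310}

def network_overhead(baseline, remotes):
    """Subtract the local baseline to estimate what each vantage point adds
    on top of the application's own processing time."""
    return {loc: mean - baseline for loc, mean in remotes.items()}

overhead = network_overhead(baseline_mean_ms, remote_means_ms)
# home-broadband adds ~265ms that the application itself can't explain,
# which points the investigation at the network path, not the code.
```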
Because the best application in the world running on a broken network is nothing more than a broken application (from the user’s perspective!)… know what I mean?