I am happy (bordering on giddy) that our server engineer and administrator Matt Critcher is now blogging. Dude is probably one of the sharpest guys I have ever met, and he is an all-around cool guy to hang with too. But beware of the fancy cheese he brings to dinner parties, because you could find yourself in the emergency room on New Year's Eve thanks to a long-standing penicillin allergy.
As some of you might know, we made the transition to virtualization a while back and have been extremely happy with the versatility it has brought to the managed hosting and VPS products we offer our clients. But with growth there can also be growing pains, and that is why I am so glad we have Matt in our corner. Dude knows his stuff, and he can get to the bottom of an issue better than anyone I have ever worked with.
Lately we have been transitioning to VMware and have had some issues with websites that are slow to respond in a browser yet still ping out okay. It's been a weird week or so. Here's a post that Matt put together the other night about the issues; I thought maybe someone else could benefit from his findings down the road:
I posted a few weeks back that Pleth had transitioned some of their equipment over to VMware Server, and for the most part it's been a very smooth process. But as of late we've run into some slowdowns, especially on the VPS with Plesk (which happens to host several of our websites). After doing a bunch of research and spending many a late hour digging through tons of mpstat and other system-utility data, I think I found the culprit(s).
VMware Server, unlike the ESX/ESXi products, is not a Type 1 (bare-metal) hypervisor. This means that the underlying OS (in our case Red Hat Enterprise Linux) was tuned out of the box for a general all-purpose server. That configuration isn't always optimal for hosting a Type 2 hypervisor. It works just fine as long as things are "normal," but as the new VMware server took on a larger load (in terms of I/O and CPU), performance went downhill.
One of the major problems has to do with how VMware Server uses disk-backed memory files (*.vmem). There is great debate on the web about whether or not you should disable them, but one thing is clear: when a site is busy, the file gets updated to reflect the changing memory of the VPS in question. This is where the problem lies. Servers with unga-bunga hardware RAID solutions, 15K RPM disks, and tons of spindles have less of a problem with it, but on a moderate quad-core Xeon with SAS disks in a RAID 1 configuration, which is what we and most other web hosts our size run, it is a bigger issue. All those writes cause wait states on the CPU and therefore a backlog of transactions to be processed, causing said server slowdown.
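If you want to see this for yourself, the tell-tale sign is a consistently high %iowait column in mpstat output. A quick sketch (mpstat ships with the sysstat package on Red Hat-style systems):

# report CPU utilization every 5 seconds
# a consistently high %iowait means the CPUs are stalled waiting on disk
mpstat 5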
One way to deal with this is to modify /etc/sysctl.conf to add (or modify) the following parameters:

vm.dirty_background_ratio
vm.dirty_ratio

I set my vm.dirty_background_ratio = 2 and vm.dirty_ratio = 85.
Basically, what these two parameters do is dictate the percentage of memory that can be "dirty" before the kernel begins flushing it in the background (dirty_background_ratio) and the percentage of memory that can be "dirty" before a forced flush begins (dirty_ratio). When the .vmem files are updated, the writes can either happen in the background (hence the low number for the background ratio) via pdflush, which allows other processes to continue to run, or they can queue up and wait for a synchronous (forced) write, causing the iowait states (hence the large number for dirty_ratio). The big gap between the background and synchronous thresholds is there to keep the background writes coming consistently and to avoid the synchronous writes as much as possible. You'll have to play around with these figures to see what works best for you. See this page about half-way down for a little more in-depth explanation of these two parameters.
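For reference, a minimal sketch of what the finished change looks like; the values below are just the ones I landed on, not magic numbers:

# /etc/sysctl.conf
# start background flushing early and often...
vm.dirty_background_ratio = 2
# ...and push the forced (synchronous) flush threshold way out
vm.dirty_ratio = 85

Running sysctl -p afterwards re-reads /etc/sysctl.conf and applies the values without a reboot.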
I also made some configuration changes to PHP and Apache to try to squeeze a tad bit more performance out of each of them. I had written out a whole list of the stuff I'd modified to post here, but as I was looking for websites to help explain the modifications, I stumbled upon this website from IBM that lists pretty much every change that I made to Apache and PHP.
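I won't duplicate that whole list here, but to give a flavor of it, here is a hypothetical sketch of the kind of directives this sort of pass usually touches (illustrative values only, not necessarily my exact settings; the IBM link covers the real list):

# httpd.conf (Apache prefork MPM)
KeepAlive On
KeepAliveTimeout 3        # short timeout frees up workers faster
MaxClients 150            # cap workers so Apache cannot exhaust RAM

# php.ini
memory_limit = 64M        # keep a runaway script from eating the box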
If you want to tune your MySQL database, this website is invaluable. It explains almost every parameter you can possibly adjust and how to adjust it. One that it doesn't really get into, though, is
innodb_flush_log_at_trx_commit
Setting this to "2" will force the system to write out any changes to the transaction log when the commit occurs but will only cause a flush of this data from memory to disk once every second (which gets stuck in the scheduler and is handled in the background by pdflush). The default setting of "1" will write out to file and flush this data from memory every time a commit happens. On really busy servers with InnoDB tables, this can cause slowdowns if your server really isn’t designed to handle a heavy DB load (most webservers aren’t). The drawback to this is that if the system crashes, you could lose 1 second of writes. Depending on what you are doing, this might be acceptable. Setting this to 0 will cause the write every second, but if the server crashes you might lose a ton of data because nothing is done at transaction commit. Scary, but fast (to me, scary outweighs speed in this case).
None of these changes should be made without first thinking about what might happen. We have a test box in our office that basically mirrors our production server, so I could test everything beforehand. The Apache and PHP config changes are easy: no server reboot required, and you'll know almost immediately if you mess them up. If you modify sysctl.conf incorrectly, the server might not boot. Better to test a few things out (a VMware VM is a perfect testbed for these settings) BEFORE you have downtime.
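One low-risk way to experiment with the kernel side, for what it's worth: sysctl -w changes a value immediately but never touches sysctl.conf, so a bad value disappears at the next reboot instead of preventing it.

# try the values at runtime only; nothing is persisted
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=85

# confirm what the kernel is actually using
sysctl vm.dirty_background_ratio vm.dirty_ratio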
VMware, Apache, MySQL, and PHP Performance Tuning | www.mcritch.com