The website I work on recently received a large increase in visitors due to a marketing drive. This caused IIS to start recycling the worker process very often as memory consumption was increasing out of control. The user experience wasn’t affected by this problem as we run a load-balanced system, however we had to fix it of course.
2 weeks ago, I knew nothing about debugging a live server. Now I know a fraction more. Here’s a quick guide to what we did to identify the source of the problem, and then fix it.
Using perfmon we monitored various counters on the production servers. The useful ones were:
.NET CLR Memory
Gen Heap 0 Size
Gen Heap 1 Size
Gen Heap 2 Size
This showed us that over a period of time, lots of things were getting put into the 2nd generation heap and were never being garbage collected. But how do we see what objects are actually in that heap? Enter WinDbg – the Windows Debugger (obvious huh?) This app feels very intimidating with it’s hex output and cryptic commands, but once you get the hang of a few commands you’re away.
So, to debug our app I needed a memory dump. I copied (note: not installed, xcopy is fine) the Windows Debugging Tools onto the production server. I could then run “cscript.exe adplus.vbs -hang -iis” which creates a dump of the memory being used by IIS. This file was copied to my local machine for analysis.
I loaded up the WinDbg tool and opened the memory dump. Type “.load clr10sos” in order to load the .NET debugger extension – son of strike. Then, enter “!dumpheap -stat -gen 2” to get a list, by statistical popularity of the objects in the 2nd generation heap. I was able to see a large number of SqlString objects belonging to NHibernate, an open-source ORM, still lingering about. Using the ants profiler and OpenSTA – a load test tool – I was able to view the object hierarchy, i.e. which objects are holding a reference to the SqlStrings.
A little bit of investigation further and I found that NHibernate QueryTranslators which translate HQL to SQL are cached based, in part, on their unique HQL string. This meant that every different HQL string passed to NHibernate would create another cached QueryTranslator. With no scavenging going on, this was surely the problem cause. I replaced the internal hashtable with a cache implementation from the NHibernate environment and all was well.
Having tested and deployed this fix, memory consumption was drastically reduced, by still rising. After some more perfmon monitoring I noted that the ASP.NET cache was now increasing at a similar rate to the gen 2 heap. A bit more ants profiling and it turned out that items are being put into the cache with an expiration policy but are never expiring.
Then I spotted it. In the SysCache code when items are added they are added with an expiration policy, but if they are updated, they are added using the Cache shorthand notation “Cache[key] = value” which does not set an expiration policy. This meant popular items would cached correctly initially, but then subsequently their expiration policy would be overridden and they would remain in memory until the worker process was recycled.
After a quick code change to the SysCache provider this functionality was corrected, and today the updated version has been released and is behaving properly.
– http://blogs.msdn.com/tess/ – A blog written by one of the Microsoft team.