It's good to know I can still do stuff.
I used to work on cars, vans and motorcycles quite a bit. Anytime a friend's vehicle broke down I would get a call, and generally I'd be very happy to fix it. I enjoyed wrenching on things. As I have gotten older I have been wrenching less and less. I have even started paying a mechanic, which is something my younger and more brazen self never would have done. A recent bout of car trouble has brought me out of retirement. Today, I replaced the alternator in my '01 Outback. Although the alternator is one of the easiest components to replace in a car, it still felt good to fix it on my own. I had the bad one out and the new one in within 20 minutes, which, I have to say, made me proud.
A couple of weeks back the wife and I were in a small town about 50 miles north of Truckee, pretty much in the middle of nowhere, visiting family. The ignition cylinder went out in our car. To replace it I had to order a part, which took a week, and then disassemble the entire dashboard and paneling of the car outside while it was snowing. I got the thing back together, the car started, and we got home.
Sometimes I forget that I am good at more things than just writing code. It is nice to know that I can still do stuff.
Infinite loop detection on the JVM using thread dumps. A real-world example.
The following is an account of the process I used to diagnose a recent load issue; it is also intended to serve as a tutorial for diagnosing issues on the JVM using thread dumps.
On a few occasions we have experienced sustained high load on an app server without a correspondingly high volume of requests. When server load reaches a set limit, a Nagios alert is triggered. The ops team springs into action and generally restarts the app, which solves the problem before I get a chance to dive in.
The difficulty with an intermittent problem is reproducing it. Recently, I was able to reproduce the issue through load testing in a staging environment: the load test finished, yet one of the servers sustained very high load while receiving no further requests.
First step: the server has high load, so fire up trusty ol' top and check which process is generating all the load.
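If you prefer a one-shot view over top's interactive display, a plain ps invocation works too ( this assumes GNU ps ):

    $ ps -eo pid,pcpu,comm --sort=-pcpu | head

Either way, the java process hosting our app sits at the top of the CPU column.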
As we can see from top, a single process (16533, our app) is eating all the CPU. This is a staging environment that I know is not receiving the thousands of requests it would take to cause this kind of load, so our app must be doing something strange.
Let's fire up jvisualvm and see what the memory usage looks like.
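VisualVM ships with the JDK as jvisualvm. If your build supports opening a local process from the command line, something like this gets you straight to our app ( on a remote staging box you would connect through jstatd or JMX instead ):

    $ jvisualvm --openpid 16533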
Nothing odd here. Very little memory is in use and garbage collection isn't constantly running. If this high CPU were related to general application load, I would expect to see high memory use. I am starting to think this could be runaway threads.
Let's go back to top and turn on the per-thread view with the 'H' command.
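The same view can be requested up front instead of toggled interactively; this assumes a reasonably modern procps top:

    $ top -H -p 16533

-H gives one row per thread instead of one per process, and -p limits the display to our app.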
Very enlightening. We can see that there are about 9 threads consuming all the CPU ( %CPU ) and that they are long running ( TIME+ ). The next step is to figure out what these threads are doing, and for that we need to trigger a few thread dumps. Before we take our thread dumps we need to make note of the thread ids we will look for in the dump. In top, when you turn on thread monitoring with the 'H' command, the PID column becomes the thread id. Write down a few of the big offenders ( 27700, 19263, etc. ).
When taking thread dumps it is important to capture a few. A thread dump records a stack trace for each thread at a single point in time, which means that when you capture a dump the offending threads may or may not actually be running. If you capture several, you are more likely to catch a stack trace for the thread you are after; you can also compare two stack traces to see whether the thread is making progress or is stuck because of a lock or a loop.
There are a few ways to capture a thread dump: jvisualvm, jstack, and kill -3. I use kill -3 because jstack is buggy, and jvisualvm is already a memory hog and doesn't always include the OS thread id in the dump.
kill -3 <processId> will not kill the process. It briefly pauses all threads and writes a stack trace for each one to the STDOUT of the given process. To capture the dump, I need to send STDOUT for process 16533 to a file I create for each dump.
Now I can trigger a dump.
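A minimal capture looks something like this. The log path is hypothetical; use wherever your app server sends its stdout ( catalina.out for Tomcat, for example ):

    $ tail -f /var/log/appserver/stdout.log > dump-1.txt &  # hypothetical path: start capturing the app's stdout
    $ kill -3 16533   # SIGQUIT: the JVM writes a thread dump to its stdout and keeps running
    $ kill %1         # stop the capture once the dump has flushed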
Now do this a few more times, changing the file name between each dump, then download the dump files for analysis. Before we start, we need to convert the thread ids we wrote down earlier from decimal ( base 10 ) to hex ( base 16 ), because for whatever reason thread ids appear in hex in the dumps. An easy way to do the conversion is WolframAlpha.
27700 = 0x6c34
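If you would rather stay in the terminal, printf does the same conversion:

    $ printf '%x\n' 27700
    6c34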
Start by opening the dump files and searching for the hex version of a thread id.
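HotSpot prints the OS thread id in the header line of each stack trace as nid=0x..., so a grep across the dumps finds our thread quickly ( the thread name and addresses below are illustrative, not from the real dump ):

    $ grep -n 'nid=0x6c34' dump-*.txt
    dump-1.txt:212:"http-8080-Processor25" daemon prio=10 tid=0x08a45c00 nid=0x6c34 runnable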
Great. We found thread 27700 in a few of the dump files, and the stack traces are virtually identical. Looking at the top of the stack, we see that the thread is in the process of retrieving an entry from a map. At this point bells should start going off in your head: maps are used in loops all the time, and if a loop becomes infinite, bingo, the result is a long-running thread that does nothing but eat CPU. It looks like the map is being used from the 'org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.addValidationMethods' method ( line 124 ). Let's check it out and look for a poorly written loop or anything else that might catch the eye.
We can see that the method doesn't contain a loop that could become infinite. Let's descend into the Apache Commons map class at the top of the stack trace to see if there is anything squirrelly there. Apache Commons code tends to be super stable, but it is important to follow the clues wherever they lead.
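Here is the lookup in question, paraphrased from the commons-collections hash map internals ( a sketch, not the verbatim source ):

    // Paraphrased sketch of commons-collections' bucket lookup ( AbstractHashedMap.getEntry ).
    protected HashEntry getEntry(Object key) {
        int hashCode = hash(key);
        HashEntry entry = data[hashIndex(hashCode, data.length)]; // head of this bucket's linked chain
        while (entry != null) {
            if (entry.hashCode == hashCode && isEqualKey(key, entry.key)) {
                return entry;
            }
            entry = entry.next; // never terminates if the chain ever becomes circular
        }
        return null;
    }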
Boom. There's our loop. The getEntry method at the top of the stack trace contains a while loop. If the entries get messed up and an entry's next pointer leads back to a previous entry in the chain, that would absolutely introduce an infinite loop. But how could the entries get messed up? I highly doubt an Apache Commons Collections map would be that buggy. In fact, the docs for the map explicitly state that it is not thread safe. Let's go back to the DomainClassGrailsPlugin class and check whether the map is being used in a thread-safe manner.
We start by looking at where the map is created, and we see that it isn't actually the Apache Commons map but a class called SoftThreadLocalMap. Based on the class name and the stack trace, I assume this class is a wrapper that stores the Apache map in a thread local so each thread has its own copy, removing the need to synchronize on the map. Let's check it out.
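The class is tiny; it amounts to roughly the following ( paraphrased, not the verbatim Grails source, and the reference types are a guess from the class name ):

    import java.util.Map;
    import org.apache.commons.collections.ReferenceMap;

    public class SoftThreadLocalMap extends InheritableThreadLocal<Map> {
        protected Map initialValue() {
            // ReferenceMap is the commons-collections map from our stack trace;
            // its docs explicitly say it is not synchronized.
            return new ReferenceMap(ReferenceMap.SOFT, ReferenceMap.SOFT);
        }
    }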
Eureka! SoftThreadLocalMap extends InheritableThreadLocal, which passes the reference to the map down to any child threads spawned from the thread holding it. By extending InheritableThreadLocal and not overriding childValue, the code shares the map with child threads, even though the internal map implementation is explicitly not thread safe.
We discovered the problem. A few solutions could fix this:
Change InheritableThreadLocal to ThreadLocal. This could change the semantics of the map for existing users of the library, so it could cause more issues than it solves.
Implement childValue and have it return a cloned map.
Wrap 'new ReferenceMap' in Collections.synchronizedMap( new ReferenceMap.. ), as sketched after this list.
Use Groovy's metaprogramming to implement one of the above without having to fork Grails.
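As a sketch of the synchronizedMap option, assuming the map is constructed the way the earlier sketch shows ( the real construction may differ ):

    import java.util.Collections;
    import java.util.Map;
    import org.apache.commons.collections.ReferenceMap;

    public class SoftThreadLocalMap extends InheritableThreadLocal<Map> {
        protected Map initialValue() {
            // Every access now synchronizes on the wrapper, so threads that inherited
            // the same underlying map can no longer corrupt its bucket chains.
            return Collections.synchronizedMap(new ReferenceMap(ReferenceMap.SOFT, ReferenceMap.SOFT));
        }
    }

This doesn't make compound operations atomic, but it does protect the internal bucket structure, which is exactly what the infinite loop depends on.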