Thinking about multi-threaded code on a slow single core processor (from 2000) is subtly different than thinking about mutli-threaded code on a multi-core processor (from 2011), especially when using the following line of code:
int salt = Math.abs(new Random().nextInt());
A) Create a private static field:
private static final Random RANDOM = new Random(); //created once when the class is loaded
B) Change all the code to use the static instance, RANDOM, ensuring two threads are MUCH less likely to receive the same value:
int salt = Math.abs(RANDOM.nextInt());
So, there I was looking through the server logs of the most recent release of my team’s Java server application. And every other day or so, a weird SQL insert error would show up indicating the values of the primary key had already been inserted. Given the file based code in this area had been working for the past 11 years (in writes to the file system), the code in the new release had added pushing the records into to the SQL DB. And additional new IOException error (“is a directory”) started showing up occasionally claiming a file operation was attempted on a folder…except that I was not working with a folder, I was working with a file. Huh?!
For deeper context, let’s step back in time. The primary application was started in 1999/Nov. It’s a simple HTTP servlet application which accepts a small set of parameters (URI or post) and responds with XML (mime type text/xml). In order to debug system interactions, a copy of the XML is stored in local file system. The folder path consists of “servlet_name/year/month/day” (ex: AddressResponse/2011/12/04). And the file is named “hh_nn_ss-mmm(salt).xml” (ex: 04_58_37-987(2043482311).xml). To avoid possible collisions in case the four separate servers generating these files were to end up with the same time stamp to the millisecond for the file, a small salt value was added to the end of the file, the value in parenthesis, to reduce the possible collisions as much as possible. Implicit in this was the thinking that if two threads in the same server (process) happened to generate the same time stamp, the salt would eliminate to very small probability the two files would collide. The value is generated using java.util.Random. And the line of code which generates it is:
int salt = Math.abs(new Random().nextInt());
Ah, the folly of insufficiently thought out problems based on implicitly defaulted parameters. Sigh. I am sure some, and perhaps many, of you can already see where this is headed. Please remember, this was code written over 11 years ago which had long been forgotten about and “just worked”.
So back to the present and the errors/exceptions – a failed insertion into the SQL DB and a claim that attempting to open a new file to write results in an exception with the message text simply saying “is a directory”. I ran through the new DB code over and over again assuming I had made some “global context via Java statics” error even though I had been diligent to avoid statics and make EVERYTHING immutable (tysvm Joshua Bloch). Finally, I happened to notice that non-primary key data was slightly different between the data already existing in the DB and the data dumped into the exception log. This implied that two separate requests were ending up with an identical file name INCLUDING THE SALT. Statistically, that seemed impossible. The servers don’t receive enough requests to actually come near the 1 in +2.1 billion chance that the same salt value would be generated.
After carefully reading the code and checking for any sort of bleed of the text value between threads (hundreds of lines of code between a number of files), I pulled all my hair out. And barely slept that night. When I got to work the next day, I almost didn’t open Eclipse as I didn’t want to get stuck chasing this. I had way too many other things I had to work. However, since I didn’t understand what the problem was, I couldn’t be sure it was just contained to this one area. There might be some sort of data loss occurring which was silent and remained hidden. And it was occurring in production. So, ignoring it was not the right or responsible thing to do.
I opened Eclipse anyway and my focus shifted to where the salt value was being generated. It seemed it was the only point where the conflict could occur. Double checked to verify there was no static bleed. I went and looked at multi-threading issues around java.util.Random. There had been some prior to 1.5. However, the servers were both now on 1.6. And I didn’t have the problem back when they were using 1.3. Then, I looked at the line again:
int salt = Math.abs(new Random().nextInt());
Then, my subconscious finally indicated that maybe I ought to make explicit what is implicitly default about this line. Instantiating java.util.Random with an empty constructor (no seed) causes Random to use System.currentTimeMillis() as the default for the seed. So, the line really reads:
int salt = Math.abs(new Random(System.currentTimeMillis()).nextInt());
And walla, there’s the problem. How? First you must understand how java.util.Random works. It’s a pseudo random number generator. What it means is that it’s not really random. Given the same starting seed value, the nextInt() method will return exactly the same sequence of values. So, new Random(0).nextInt() will always return exactly the same value every time. Almost all of the time for almost all uses of Random, using the clock (i.e. System.currentTimeMillis()) is a great way to generate a random sequence. The problem here is that two separate servlet threads happen to make the call at exactly the same time which results in exactly the same value being generated as the default for the empty Random constructor. This results in the salt value in both threads being precisely the same. And this results in both threads having identical file names. So, the thread that is successful in getting the file stored to the file system “loses” in that the second thread automatically deletes it and then saves it’s own copy. However, the DB allows the first thread’s record to be inserted and then the DB rejects the second thread’s record. And as a result, the two errors above occur.
The first error, the one claiming that opening the file to write it to the file system (message of “is a directory”) was when one thread was still in the middle of writing the file (owned the system file lock) causing the second thread to receive the exception. And the second exception, the SQL DB insert error, was when either thread finally wrote it’s record to the DB successfully causing the lagging thread to “lose” when the DBs primary key (unique index) restricted the attempted duplicate insertion.
Now that I understood the problem, some XML responses were not being logged in the file system and the SQL DB, I could conclude the problem was non-critical and didn’t require an immediate fix. The fix could go into the next release which ought to occur within the next 2-4 months. It also meant that the problem has likely existed ever since this code was generated back in 2000 and has been silently occurring, albeit very infrequently, because the file saving mechanism was always deleting the first thread’s file write with the second thread’s file write (silent file delete needs to be elevated to a log output so this could have been caught sooner).
This issue actually showed there were several problems with the code design beyond just the java.util.Random default value issue. However, the fix for this particular part of the issue (with a minimum of redesign) is to change the class where the salt is generated by moving the creation of the Random to a private static value. This will ensure the Random is generated once when the class is loaded. And then the salt related code will call nextInt() on that instance of Random (and nextInt() is thread safe, I checked).
There is still the very remote possibility that two threads could both have an identical date/time stamp and nextInt() return the same (Math.abs()) value, twice in a row. However, we are now back at the 1 in +2.1 billion chance, a risk I’d easily take…for now. I will eventually revisit the code generating the file name to eliminate file name duplication in that rare event.
It is possible to design the logic more effectively to handle file name conflicts and just regenerate the full name for the file. The probability that the time stamp and the salt will be the same on the regenerate is so very small, it would rarely occur (like maybe once every 10 years given the current server loads). And even if it happened to occur, the expense for the recalculation would be close to negligible. And given the exception, just causing the thread with the collision to sleep for a couple of milliseconds before attempting the regenerate would noise up the results, especially of the chosen sleep time happened to be a random value between 1 and 20 and retrieved from the same instance of RANDOM.
After working through all of this, I then realized I might have used this particular Random pattern in other places in the code base. A quick use of Eclipse to search the code base turned up a use in a different area, and it was using the same basic implementation. The impact there was not large enough to even worry about it as there was already a mechanism for retrying around conflicts. Still, I will update it with the same solution (moving to a private static instance initialized at class load time) and virtually eliminate conflicts.
As servers continue to increase in speed, number of processors and number of cores, conflicts which were virtually impossible 10 years ago, are much more possible now. It does make me wonder what other kinds of assumptions I have made in my code from back then which will show up as odd server log exceptions. Whatever the case, I’m sure Murphy is working hard to enlighten me as soon as possible. Whoohoo!