Understanding the MongoDB code that people are laughing at

May 31, 2013

Recently my corner of the Twitter-verse has been moderately aflame with a number of people pointing and laughing at a chunk of MongoDB source code; this example is typical. At first I automatically accepted that the code was laughable (and you have to admit that seeing a call to random() in database client code is odd) but when the link kept cropping up I got curious enough to start reading the surrounding code. Unfortunately the more I read the more sane the code looked. Since the details don't exactly fit in a tweet, it's time to write a more substantial explanation of the code and why it's sane.

Let's start with the code that people are directly laughing about (or so I assume since this is the line that people have been linking to):

if (!((_ok) ? true : (Math.random() > 0.1))) {
   return res;
}

In the interests of cleverness, someone has made this if condition much too hard to read. Rewritten to be clear, it is '!_ok && Math.random() <= 0.1'. If _ok is false, this if and its immediate return will fire about 10% of the time. If _ok is true, it will never fire.

Now let's look at the surrounding code, or at least bits of it. The context goes like this:

abstract class ConnectionStatus {
  ....
  static abstract class UpdatableNode {
    ....
    public CommandResult update() {
      CommandResult res = null;
      try {
        ... try to talk to a server ...

        if (!_ok) {
          getLogger().log(Level.INFO, "Server seen up: " + _addr);
        }
        _ok = true
        ...
      } catch(Exception e) {
        if (!((_ok) ? true : (Math.random() > 0.1))) {
          return res;
        }

        ... log a message about the
            the server being down ...
        _ok = false;
      }

      return res;
    }
...

Let me translate this: the effect of the if with the Math.random() is that if there's an error and the server has already been marked as down, about 10% of the time the code doesn't bother to log a message that it's still down (and doesn't record the actual exception caught as part of that log message). If the server is marked up and there's an error, a message is always logged. In all cases res is returned, possibly as null, and I believe no exception is propagated; I assume that this is an intended part of the API.

(There may actually be a bug here; it may be intended that the message is skipped 90% of the time when the server is already marked as down, not 10%.)

This code does not strike me as laughable, crazy, or worthy of mockery. It may have some defects and you may disagree with what it's doing, but it sure seems to be sane code in general. You probably have worse things in some of your projects (I know that I do in mine). My personal view is that what the code is doing is perfectly sensible to avoid flooding logs, but since I don't know MongoDB I don't know how much log volume this would produce (or avoid) and how important that log information would be during problem diagnosis.

(Note that the code will always log 'server seen down' and 'server seen up' when it first notices the server down and when it first notices the server back up. The only question is how many 'server still down' messages it logs or should log between those two, in a range somewhere between 'none' and 'one for every attempt to talk to the server'.)

If there is a fault in this code it is not the use of Math.random() in database client code, it is that the if condition has been written in such an overly clever way. And if people want to point and laugh at overly clever code, well, there's a lot of fish in a lot of barrels (our own generally included).

Update: C J Silverio pointed me to the initial commit for this code. The initial commit is much simpler and makes it clear that the current 'log 90%' state is a bug.


Comments on this page:

From 108.255.61.252 at 2013-05-31 09:56:39:

I'm kind of surprised to see you defending this behavior, given your frequent descriptions of times you had to code dive Solaris et al to figure out WTF is going on. With your user's hat on, how could this behavior make sense without code diving? "Sometimes I get an error every time I try to connect to a dead server, other times I do not." Literally, at random.

My modest suggestion for the way I would solve the "log spam" problem, aka The Right Way:

  • Log at WARN every N seconds, rather than randomly
  • When you don't log a warning, log at DEBUG so that a user's first impulse to understand unexpected behavior -- turn up logging verbosity -- gives him complete information
From 108.255.61.252 at 2013-05-31 09:57:28:

Oh, I didn't sign my comment; this is Jonathan Ellis.

By cks at 2013-05-31 11:18:44:

My short answer is that I feel that it clearly requires MongoDB domain expertise to know if this behavior is desirable or a bad idea and I don't have that domain expertise. What to log depends very much on what's important or useful in troubleshooting systems and you can't tell without enough knowledge to do that troubleshooting. As a result this behavior is not clearly wrong and crazy in the way that say, silently swallowing errors and claiming that the operation had succeeded would be. If there is a problem here it's an operational issue not a basic sanity issue.

(Just as I've grumped about Solaris being uninformative, I've also grumped about things flooding my logs with pointless messages.)

With that said, I think there's a number of reasons why this could be okay. First, this is library code not application code (it appears to be the low-level DB connector that lets your Java application talk to MongoDB servers). I assume that callers see an error indication when things fail so in a sense it is the caller's job to log about problems if this actually is a problem. Second, since this code always logs the start and end of server problems it probably logs enough information to reconstruct the state of affairs from your full logs. I don't know if MongoDB applications really need low-level 'server is still down, just like I told you before' entries for every attempt in the mean time. Certainly in some environments you would very much not want your DB connector library to emit log messages for you, much less log a message every time things failed.

(I don't think the PostgreSQL or MySQL libraries log anything, for example, and no one is complaining about them; it's considered the application's responsibility.)

I suspect that some logging is sensible in the MongoDB case because this code doesn't appear to propagate detailed error information to the callers. The caller will know that this operation failed (again, I assume) but the log message is the only record of why and the why might be important.

(I also don't know if the why can vary in important ways once a server stops being okay here. If it does then you want to capture each different why in a log message; if the why is basically constant then the log messages are potentially strongly redundant. Again, I don't know if this is important in practice for troubleshooting systems that use MongoDB.)

By zvyn at 2015-07-21 22:59:01:

The code is still insane, if you want to log an error 10% of the time, log it 10% of the time and not maybe 10% of the time, maybe not at all, maybe every time. There are ways to do this without random. And skipping errors is not the only solution, you could for example re-phrase the massage to "received N connection errors in the last X seconds" and log it only once in X seconds.

When debugging you want to know whats going on and having to assume that when there is no error message 90% of the time, that the error might still happen 100% of the time is insane.

Written on 31 May 2013.
« I find Systemtap oddly frustrating
The mystery of POSTs with a zero Content-Length »

Page tools: View Source, View Normal, Add Comment.
Search:
Login: Password:
Atom Syndication: Recent Comments.

Last modified: Fri May 31 00:02:32 2013
This dinky wiki is brought to you by the Insane Hackers Guild, Python sub-branch.