Some thoughts on event logging

This essay is about event logging, sometimes called error logging. It’s not about a specific framework, just some random thoughts related to logging and API design.

My background? Headless back-end systems and daemons (services), that is, programs that run 24/7 with no user interface and no user supervision, on other people’s servers.

Okay? Let’s begin.

What does the usual error logging code look like? Say, like this (the example code is in C#, and the ideas are more or less focused on the Microsoft .NET framework):

enum Severity // or Priority, whatever
{
    Error, Warning, Debug // etc.
}
interface ILogger
{
    void Log (Severity severity,
              string format,
              params object[] args);
}

(Most frameworks prefer one method per severity — LogError, LogWarning, etc. — but I prefer more concise and clean interfaces. Enough ranting, let’s continue.)

The backing implementation calls String.Format (format, args) and saves the resulting text string in a text file, the Windows EventLog, a database, a syslog server, basically somewhere (or maybe in multiple places at once). Log will be called all the time, possibly simultaneously by multiple threads, and a single logical operation will be split into a multitude of event records that may or may not be easily relatable to one another.
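
A minimal sketch of such a backing implementation, writing to a plain text file, might look like this (TextFileLogger, the line layout, and the locking choice are this sketch’s assumptions; System and System.IO are assumed to be in scope):

class TextFileLogger : ILogger
{
    readonly object _sync = new object ();
    readonly string _path;

    public TextFileLogger (string path)
    {
        this._path = path;
    }

    public void Log (Severity severity,
                     string format,
                     params object[] args)
    {
        // timestamp, severity, then the formatted message
        string line = String.Format ("{0:o} {1,-7} {2}",
                                     DateTime.UtcNow,
                                     severity,
                                     String.Format (format, args));
        lock (this._sync)   // Log may be called by multiple threads at once
            File.AppendAllText (this._path, line + Environment.NewLine);
    }
}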

Problem diagnosis is anything but simple, and event recording is a key tool when dealing with a production system, where stepping through the code in a debugger is not an option. Besides, code always works correctly on the developer’s workstation.

Can we improve logging? I believe there is at least some room for improvement.

Event record meta-data

An event record is a log entry, possibly a text line, about something that just happened or is about to happen. For example:

logger.Log (Severity.Debug,
            "about to process file {0} for task {1}...",
            fileName, taskId);

then later

logger.Log (Severity.Debug,
            taskId + ": finished processing of " + fileName);

One problem with the above fragment of code is the lack of consistency and the resulting difficulty of searching (or grepping, if you wish) during problem diagnostics. Even though the code above is intentionally written badly, consistent and easily parsable event messages are not easy to write, especially each and every time.

One possible solution would be to extract the meta-data and attach it to the event record separately from the message text. In the above example, the meta-data would be the values of the taskId and fileName variables. Meta-data is possibly not the best word to use, but the point is that the structured data (or data model) should be separated from the text message. Let’s consider a different version of the Log method:

void Log (Severity severity,
          string message,
          ??? metadata);

What data type should we use for the meta-data parameter? There are multiple options (for .NET, that is):

  • anonymous types and reflection;
  • Dictionary<string, object> and iteration;
  • concrete types and a common serialization interface;
  • type-less sequence of keys and values, and iteration.

Let’s examine each.

With anonymous types, the logging API and its usage look like this:

void Log (Severity severity,
          string message,
          object metadata);

logger.Log (Severity.Debug,
            "about to process file",
            new { tag = tagId, file = fileName });

Beautiful, isn’t it? The information is extracted at a later time using reflection, and anonymous types serve the same purpose as dictionaries: property names are keys, property values are values:

string SerializeMetadata (object metadata)
{
    if (metadata == null)
        return String.Empty;
    StringBuilder s = new StringBuilder ();
    Type t = metadata.GetType ();
    foreach (PropertyInfo prop in t.GetProperties ())
    {
        if (s.Length > 0)
            s.Append (", ");
        s.Append (prop.Name)
            .Append ("=")
            .Append (prop.GetValue (metadata, null));
    }
    return s.ToString ();
}

I believe there is no need for further explanation of the Dictionary<string, object> solution:

void Log (Severity severity,
          string message,
          Dictionary<string, object> metadata);

logger.Log (Severity.Debug,
            "about to process file",
            new Dictionary<string, object> ()
            {
                { "tag", tagId },
                { "file", fileName }
            });

Concrete types and a common serialization interface can also be used, if event records’ meta-data comes in a few common shapes; such an approach also offers great performance, but it is cumbersome:

struct FooOperation : ISerializable
{
    FooOperation (string tagId, string fileName)
    { /* keep both in properties */ }
    // serialization implementation follows...
}
struct BarOperation : ISerializable ...
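
For reference, a fuller sketch of what one such concrete type might look like with ISerializable (the property names and the "tag"/"file" keys are illustrative; System.Runtime.Serialization is assumed to be in scope):

[Serializable]
struct FooOperation : ISerializable
{
    public string TagId { get; private set; }
    public string FileName { get; private set; }

    public FooOperation (string tagId, string fileName)
        : this ()
    {
        this.TagId = tagId;
        this.FileName = fileName;
    }

    // deserialization constructor required by ISerializable
    FooOperation (SerializationInfo info, StreamingContext context)
        : this ()
    {
        this.TagId = info.GetString ("tag");
        this.FileName = info.GetString ("file");
    }

    // write the meta-data as named values
    public void GetObjectData (SerializationInfo info,
                               StreamingContext context)
    {
        info.AddValue ("tag", this.TagId);
        info.AddValue ("file", this.FileName);
    }
}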

Finally, there is the type-less interface with no compile-time validation, where meta-data is passed along as a sequence of keys and values: keys at even indexes, values at odd ones (with respect to the metadata parameter):

void Log (Severity severity,
          string message,
          params object[] metadata);

logger.Log (Severity.Debug,
            "about to process file",
            "tag", tagId,
            "file", fileName);

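Extraction in this case is a simple iteration over the array, mirroring the reflection-based SerializeMetadata above; a sketch (a trailing key with no value is silently ignored here):

string SerializeMetadata (object[] metadata)
{
    if (metadata == null || metadata.Length == 0)
        return String.Empty;
    StringBuilder s = new StringBuilder ();
    // keys at even indexes, values at odd ones
    for (int i = 0; i + 1 < metadata.Length; i += 2)
    {
        if (s.Length > 0)
            s.Append (", ");
        s.Append (metadata[i])
            .Append ("=")
            .Append (metadata[i + 1]);
    }
    return s.ToString ();
}
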
Enough about API design. Then what?

The meta-data must be serialized and attached to the event itself: JSON and XML would work great, and some sort of plain-text representation will be fine, too. For example, the event above could be recorded in a text file as the following line:

2013-03-30T19:32:21.345 dbg  program:{tag:"MA4XT",file:"/tmp/foo.bar"} about to process file

or maybe in an XML format (which works great with XPath):

<event ts="2013-03-30T19:32:21.345"
       severity="Debug"
       message="about to process file">
    <tag>MA4XT</tag>
    <file>/tmp/foo.bar</file>
</event>
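
For completeness, a sketch of how the plain-text line might be composed, reusing the SerializeMetadata helper from above; the exact layout and the "dbg" abbreviation are just this example’s convention, and it emits key=value pairs rather than real JSON:

string FormatEventLine (DateTime timestamp,
                        Severity severity,
                        string message,
                        object metadata)
{
    // e.g. "2013-03-30T19:32:21.345 dbg {tag=MA4XT, file=/tmp/foo.bar} about to process file"
    return String.Format ("{0:yyyy-MM-ddTHH:mm:ss.fff} {1} {{{2}}} {3}",
                          timestamp,
                          severity == Severity.Debug ? "dbg" : severity.ToString ().ToLowerInvariant (),
                          SerializeMetadata (metadata),
                          message);
}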

What is a good candidate for event meta-data? Internal error codes (you do use error codes, right?), Exceptions, object IDs, URLs and file-system paths; in short, everything someone would use to locate a set of events related to a particular problem. It should be noted that ease of data extraction depends on the uniformity of the data format used, as well as on the consistency of the meta-data keywords, or tags; it works best when the same tags are used for the same things.
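
One cheap way to keep tags consistent is a shared set of constants instead of string literals scattered across the code base; the names below are purely illustrative:

static class Tags
{
    public const string ErrorCode = "evid";
    public const string Task      = "tag";
    public const string File      = "file";
    public const string Url       = "url";
}

logger.Log (Severity.Debug,
            "about to process file",
            new Dictionary<string, object> ()
            {
                { Tags.Task, tagId },
                { Tags.File, fileName }
            });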

How does this stack up?

Pros:

  • consistency, if a controlled vocabulary is used (a common set of tags/keys);
  • structured and machine-readable data (depending on serialization format);
  • facilitates searching and filtering;
  • with anonymous types: neat, readable code, easy to use;

Cons (with arguments):

  • some run-time overhead;
  • with anonymous types: larger binaries, reflection necessary to extract the data (but then again see scoping below);
  • with the type-less interface: no compile-time validation; for example, a tag name might be omitted by mistake.

Scoping, or logical operations, or context

Wouldn’t it be great if only problems were logged, and at the same time in great detail, and at the same time no “noise” were recorded while everything is working normally (while there are no errors)? Well, that is not impossible. Let’s introduce the concept of a scope, sometimes called a logical operation or simply an operation context:

interface ILoggingScope
{
    void Log (Severity severity,
              string message,
              object metadata);
}

In fact this interface is exactly the same as the (meta-data-aware) ILogger used above, and that is on purpose. It’s not the interface that is different, it is the implementation.

Let’s assume the implementation works like this: each time Log is called, its arguments are recorded in a journal, and in case of an error that journal can be replayed, recovering the whole history up to the point of the error itself. Then what would the difference be compared to regular logging? The difference is that if no error occurs, the journal is torn up and burnt, and no informational, debug, or trace events are recorded.

class LoggingScope : ILoggingScope
{
    public void Log (Severity severity,
                     string message,
                     object metadata)
    {
        this._journal.Add (new PastEvent (severity,
                                          message,
                                          metadata));
    }

    // add a way to access or replay the `_journal`

    class PastEvent
    {
        public PastEvent (Severity severity,
                          string message,
                          object metadata)
        {
            this.Timestamp = DateTime.Now;
            this.Severity = severity;
            this.Message = message;
            this.Metadata = metadata;
        }
        public DateTime Timestamp;
        public Severity Severity;
        public string Message;
        public object Metadata;
    }
    readonly List<PastEvent> _journal = new List<PastEvent> ();
}

Then, at a later time:

LoggingScope log = new LoggingScope ();
try
{
    // ... do some work, pile-up some event records
    // everything completed successfully? drop log
}
catch (Exception x)
{   // oops!
    log.Log (Severity.Error,
             "error doing some work",
             new { evid = ErrorCode.ProcessFailed, x = x });
    Save (log);
}
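
Save is not shown above. One possible implementation, in the spirit of the "add a way to access or replay the _journal" comment, is to give LoggingScope a replay method and have Save forward to it; ILogger here is assumed to be the meta-data-aware variant discussed earlier, and persistentLogger is an assumption of this sketch:

// inside LoggingScope: replay the journal into a persistent sink
public void ReplayTo (ILogger sink)
{
    foreach (PastEvent e in this._journal)
        sink.Log (e.Severity, e.Message, e.Metadata);
}

// then Save boils down to
void Save (LoggingScope log)
{
    log.ReplayTo (persistentLogger);
}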

How does this stack up?

Pros:

  • no “noise” (no informational messages) during normal operation; works nicely with production systems (and they are the worst case when problem diagnostics is involved);
  • complete history of what happened in case of an error (as long as Log is called in the right places);

Cons (with arguments):

  • some overhead: generally one List<PastEvent> and multiple PastEvent instances, but meta-data is not serialized unless necessary;
  • a logger instance must be passed along the call stack; however, a ThreadStatic singleton may be sufficient, depending on the case (see the sketch below).
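
A sketch of what such a per-thread ambient scope might look like (AmbientScope is this sketch’s name, not an established pattern of any framework):

static class AmbientScope
{
    [ThreadStatic]
    static LoggingScope _current;

    // the scope for the current thread, created on first use
    public static LoggingScope Current
    {
        get { return _current ?? (_current = new LoggingScope ()); }
    }

    // start a fresh scope, e.g. at the beginning of a logical operation
    public static LoggingScope Begin ()
    {
        _current = new LoggingScope ();
        return _current;
    }
}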

Logical operation meta-data and auditing

Current logging frameworks usually control auditing (which event records get persistently stored) using event severities/priorities (or thresholds), possibly in combination with module identifiers; for example:

  • log all events of Error severity or greater importance, regardless of module name,
  • log everything related to billing.

Hmm, but what if we want to track only billing events related to the customer Adam? How can we do that? Let’s add meta-data support to the scope itself and enable logging of logical operations based on tests against that meta-data:

interface ILoggingScope : ILogger
{
    void SetMetadata (string key, object value);
}

backed by an implementation like this:

void SetMetadata (string key, object value)
{
    this._metadata[key] = value;
}
bool Test (string key, object test)
{
    object value;
    return (this._metadata.TryGetValue (key, out value) &&
            Object.Equals (test, value));
}
readonly Dictionary<string, object> _metadata = new Dictionary<string, object> ();

Then, later:

bool Test (LoggingScope log,
           Dictionary<string, object> rules)
{
    int n = 0;
    foreach (KeyValuePair<string, object> kvp in rules)
        n += log.Test (kvp.Key, kvp.Value) ? 1 : 0;
    return n == rules.Count;
}

LoggingScope log = new LoggingScope ();

// ... do some work, pile-up some event records

if (Test (log, auditRules))
    Save (log);

How would auditRules be initialized? Through some sort of configuration. For a larger system, or anything that runs 24/7, it will pay off to allow run-time modification of the audit rules, via dynamic configuration reload or maybe some sort of RPC (or HTTP) interface. Several sets of audit rules are also easy to support using a simple loop, as sketched below.
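
That simple loop over several rule sets could look like this (TestAny and allAuditRules are this sketch’s inventions):

bool TestAny (LoggingScope log,
              IEnumerable<Dictionary<string, object>> ruleSets)
{
    // persist the journal if any configured rule set matches
    foreach (Dictionary<string, object> rules in ruleSets)
        if (Test (log, rules))
            return true;
    return false;
}

if (TestAny (log, allAuditRules))   // allAuditRules comes from configuration
    Save (log);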

How does this stack up?

Pros:

  • allows run-time control of what is recorded and what is not (regardless of whether an error happened or not);
  • all events related to a given operation can be recorded;

Cons (with arguments):

  • some overhead, but not that much, considering the benefits; as usual, the larger the system, the cheaper the overhead (relatively speaking).

Conclusion

Three ways to improve error logging have been proposed:

  • Inclusion of meta-data in event records;
  • The concept of logical operation;
  • Inclusion of meta-data in logical operations.

Hopefully the use of the above ideas will cut down on sleepless nights and long debugging sessions.