Complexity of Logging
Posted on Wed 25 September 2019 in logs • 4 min read
I started here at LogDNA a few weeks ago, and it's funny how something as "simple" as logging can carry so much complexity the farther down the rabbit hole you go. Now, I already had an appreciation for good, actionable logs coming in. I'm a big fan of the 12Factor App methodology, and good logs left behind by previous developers have been lifelines for me when trying to understand and debug legacy systems. Sure, devs know that if they need to create an application log, they can pull in a logging library or use a built-in logging solution from their favorite language or the language du jour, pass a couple of strings or an array, and then move on with their work. And sysadmins pore over selected parts of aggregated logs on a daily basis to monitor systems, troubleshoot servers, and otherwise understand the health of an environment. But have you ever thought, seriously thought, about logs and logging systems and what makes the whole thing tick?
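To make that "pass a couple of strings and move on" step concrete, here's a minimal sketch using Python's built-in logging module (the logger name and order IDs are made up for illustration):

```python
import logging

# Pull in the built-in logging solution and grab a named logger.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")
logger = logging.getLogger("payments")

# Pass a couple of strings and move on with your work.
logger.info("payment accepted for order %s", "ord-42")
logger.warning("retrying charge for order %s", "ord-42")
```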
Logs may seem like a simple thing, and in one sense the files really are simple: a log file is just a sequence of data, with a new log line added at the end every time some event triggers the write mechanism. The process of writing to this append-only file generates an inherent timeline in the log file's structure, so technically, a log line doesn't have to include a timestamp at all. We include timestamps for each line, though, because we as humans think in terms of our calendars and clocks. We need time, whether relative or absolute, so we can correlate to external disruptions or changes, and because the advent of asynchronous streams and concurrent, distributed systems means our computer systems are no longer bound by a linear path of function following function.
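Here's a minimal sketch of that append-only mechanism (the app.log path and the message are hypothetical):

```python
from datetime import datetime, timezone

def write_log_line(path: str, message: str) -> None:
    """Append one event to the log file; the append itself builds the timeline."""
    # Mode "a" means every write lands at the end of the file, never the middle.
    with open(path, "a") as log_file:
        # The timestamp is for us humans; ordering already comes from appending.
        timestamp = datetime.now(timezone.utc).isoformat()
        log_file.write(f"{timestamp} {message}\n")

write_log_line("app.log", "cache warmed in 312ms")
```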
As a result of having distributed systems, though, we often have multiple log files from many different systems, and we have to manage all of them. At the top of the stack are our application logs, the ones most devs are familiar with, generated by the application itself. At the bottom are the system logs, which come from the operating system or the BIOS beneath it. Everywhere in between, from your networking calls to changes in permissions, are different logs like event logs, audit logs, and transactional logs. All of those logs follow the same basic principle of data added to an append-only file, and all of them are built around that small bit of data called the log line.
There's an art to writing a good, solid, actionable log line, and the standards and ideals have changed over the years, too. Logs have evolved from a straight I/O buffer to strings with parsable data to tokenized arrays with a human-readable message as just one of many datapoints. Those arrays for a single event can include a metric ton of data stored as keys with associated values that a machine can then parse. We've collectively learned to generate logs in such a way that a separate system can aggregate the data, identify patterns and anomalies, determine the difference between a small blip and a fatal programmatic error, and raise human-readable alerts, all in the time it might take for you to take a sip of your morning coffee (or tea, if you're a tea drinker like me).
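As a rough sketch of that evolution, compare a flat string to a structured event; the field names below are illustrative, not any particular standard:

```python
import json
from datetime import datetime, timezone

# Then: a flat string that a human can read but a machine has to pick apart.
print("2019-09-25 14:02:11 ERROR payment failed for user 8675309")

# Now: a tokenized event where the human-readable message is one datapoint
# among many machine-parsable key/value pairs.
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "error",
    "message": "payment failed",  # still human-readable
    "user_id": 8675309,           # but every field is machine-queryable
    "service": "billing",
    "duration_ms": 312,
}
print(json.dumps(event))
```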
We've added the concept of logging levels to the basic log line to make those massive generated logs parsable. We can choose to see only the most critical of logs: the fatal system error. We can dig into the details by turning on debug levels. And, of course, we've created multiple standards, with the next great "unifying" standard always on the horizon, like a real-life xkcd comic. It's human nature, after all, to want to create a unifying pattern and then disagree on exactly how to bring that pattern to life. In general, though, we collectively decided that having log levels for each line is a Good Idea™.
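A quick illustration with Python's built-in levels (the logger name and messages are invented for the example), showing how the same logger can be dialed from "only the critical stuff" down to full debug chatter:

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s")
logger = logging.getLogger("checkout")

# See only the most critical of logs.
logger.setLevel(logging.CRITICAL)
logger.debug("cart contents: %s", ["widget"])  # suppressed at this level
logger.critical("datastore unreachable")       # emitted

# Dig into the details by turning on debug.
logger.setLevel(logging.DEBUG)
logger.debug("cart contents: %s", ["widget"])  # now emitted
```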
At their heart, though, log lines are announcements of changes in the state of your system. A system's state turned from off to on. So-and-so's state became logged in. This server turned from healthy to unhealthy. The choice of what to log is a constant source of debate among engineering teams. Some people think everything should be logged, just tagged carefully with the right log level. Others think logs should be carefully curated to generate only the most useful, most critical information. I find the divide often follows the same lines as the arguments around commenting code. Really, though, the choice of what to log revolves around which state changes you need to monitor, and despite a lot of advice out there, only the creator, maintainer, and user know which state changes are the most important and needed for decisive action.
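To picture a log line as a state-change announcement, here's a small sketch with hypothetical server states (the server name and health checker are made up):

```python
import logging

logging.basicConfig(format="%(levelname)s %(message)s", level=logging.INFO)
logger = logging.getLogger("health")

def record_transition(server: str, old_state: str, new_state: str) -> None:
    """A log line is just an announcement that a state changed."""
    level = logging.ERROR if new_state == "unhealthy" else logging.INFO
    logger.log(level, "%s turned from %s to %s", server, old_state, new_state)

record_transition("web-01", "healthy", "unhealthy")  # worth an alert
record_transition("web-01", "unhealthy", "healthy")  # worth knowing
```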
All of this discussion about the complexity of logs and log lines doesn't even take into account the complexity of managing those logs. As a dev advocate at a log aggregation company, I'm getting a real look at the underbelly of the beast. The sheer amount of data we take in on a regular basis is staggering. When the phrase "petabytes of data daily" makes your engineers shrug in a world where most of us are still thinking in gigabytes and terabytes, you really are dealing with a mind-bendingly large amount of data. Once I've conquered my initial wow factor a bit, we can talk more about the complexities of log aggregation and management itself, especially with distributed systems and microservice architectures.
What are your thoughts about the complexity of logging? Join the conversation on Twitter.