Error logs are the first port of call for any outage. Great error logs provide context and cause for a mysterious 3am outage. Engineers often treat error logs as an afterthought, but with some up-front planning they can become incredibly powerful. Let’s get right into it.
Logging frameworks provide the concept of severity levels, where the developer assigns a priority to each log statement. Logs can then be filtered so that only messages at or above a chosen severity level are written. Essentially, you set the severity level of the log file to filter out the noise of low-level logs.
Your production system could be set to log only messages at FATAL, while your development environment can be set to log anything at DEBUG or higher. Ideally, do not allow DEBUG in production, and set up alerting on FATAL logs only. It is important that developers use the correct logging level when writing log statements.
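The threshold idea described above can be sketched in a few lines of Java. This is only an illustration: the `Level` enum and the `isLoggable` helper are hypothetical names, not the API of any particular logging framework, and the production and development thresholds are the ones assumed in the text.

```java
class SeverityFilterSketch {
    // Hypothetical levels, declared in ascending order of severity,
    // so ordinal() can be used to rank them.
    enum Level { DEBUG, INFO, WARN, ERROR, FATAL }

    // A message is kept when its level is at or above the configured threshold.
    static boolean isLoggable(Level threshold, Level messageLevel) {
        return messageLevel.ordinal() >= threshold.ordinal();
    }

    public static void main(String[] args) {
        Level production = Level.FATAL;   // assumed production threshold
        Level development = Level.DEBUG;  // assumed development threshold
        for (Level level : Level.values()) {
            System.out.println(level
                    + " production=" + isLoggable(production, level)
                    + " development=" + isLoggable(development, level));
        }
    }
}
```

Real frameworks implement the same comparison internally; the threshold is normally set in configuration per environment rather than in code.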
It is tempting for some developers to raise the level of their log statements, e.g. to WARN, so that they are sure to be displayed in other environments such as UAT, SIT and Pre Prod and so assist with their testing. Unfortunately, this leaves log statements at the wrong level, which can cause monitoring issues.
Therefore, as part of code reviews, the correct logging level should be checked.
One of the great benefits of a logging framework is the amount of context it attaches to a log message, such as the timestamp, thread name, class, method and source line number.
Two pieces of context definitely worth displaying are the current timestamp in milliseconds and the current thread name. Both are invaluable when analyzing a large log file.
The thread name is particularly useful in an app-server, where multiple threads may be executing the same code at the same time. This allows you to run the GREP/search command on the logfile to see only the messages that apply to that thread name.
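As a rough sketch of what such a line might look like, the helper below prefixes a message with a millisecond-precision timestamp and the thread name. The class name, method name and layout are assumptions made for illustration; in a real framework the same result usually comes from its layout or pattern configuration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

class LogLineFormatSketch {
    // SSS gives millisecond precision; UTC is an assumption to keep output stable.
    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS").withZone(ZoneOffset.UTC);

    // Builds "<timestamp> [<thread>] <level> <message>".
    static String format(Instant when, String threadName, String level, String message) {
        return TIMESTAMP.format(when) + " [" + threadName + "] " + level + " " + message;
    }

    public static void main(String[] args) {
        System.out.println(format(Instant.now(), Thread.currentThread().getName(),
                "INFO", "Payments: Banking WS returned sort code ok."));
    }
}
```

With the thread name in a fixed position, a search for `[worker-1]` (a hypothetical thread name) returns only the lines written by that thread.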
Tech leads should set out, in the coding standards, the format of all types of log statement for developers to follow. This includes the format of the actual message. Ideally, all the logs for a section of functionality, class or application page should use the same message convention.
For example, take the ‘Payments’ pages of an online banking website. Without a convention, each developer words their log statements differently:
logger.info(">>>>>>>> Sort code returned ok from web service! <<<<<<<<<<");
...
logger.warn("SrtCde web service is not responding!!!!!!", e);
...
logger.error("The webservice to return the user's SC is not working???????", e);
...
logger.fatal("exception! WTF! Call out on-call!", e);
Consider using the same naming convention for each payment log instead, as demonstrated below:
logger.info("Payments: Banking WS returned sort code ok.");
...
logger.warn("Payments: Banking WS is not responding.", e);
...
logger.error("Payments: Banking WS failed to return sort code.", e);
...
logger.fatal("Payments: Banking WS, uncaught exception when returning sort code", e);
Using the prefix ‘Payments:’ and standardising the text used for the Banking Web Service allows monitoring tools and GREP/search commands to select common functionality by searching for keywords. In this case, search for ‘Payments:’ (with the colon) to return all payment functionality logs, and for ‘Banking WS’ to return all of the Banking web service logs.
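One way to keep such a convention from drifting is to build the messages from shared constants rather than typing the prefix each time. The sketch below is a minimal illustration under that assumption; the class, constants and `grep` helper are hypothetical and only mimic a keyword search over log lines.

```java
import java.util.List;
import java.util.stream.Collectors;

class PaymentsLogConventionSketch {
    // Shared keywords, taken from the convention described in the text.
    static final String AREA = "Payments:";
    static final String SERVICE = "Banking WS";

    // Builds a message such as "Payments: Banking WS is not responding."
    static String message(String detail) {
        return AREA + " " + SERVICE + " " + detail;
    }

    // Keeps only the lines containing the keyword, like a simple grep.
    static List<String> grep(List<String> lines, String keyword) {
        return lines.stream()
                .filter(line -> line.contains(keyword))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                message("returned sort code ok."),
                message("is not responding."),
                "Login: user authenticated.");  // hypothetical log from another area
        grep(lines, AREA).forEach(System.out::println);
    }
}
```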
Depending on your application, it is good logging practice to trace all the actions of individual users as they visit the different screens within a website. Each log statement could therefore contain a unique identifier for the user, such as the username. This allows support teams to easily trace all logs for an individual user.
This is very useful if a customer contacts your company and reports an error that has not been widely reported. For example, a user finds an error on your page, but no website metrics are failing. Your tracing system tells you which downstream system caused the issue.
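A minimal sketch of per-user tagging, assuming one request is handled per thread: the user identifier is stored in a `ThreadLocal` and prepended to every message. The class and method names here are hypothetical. Logging frameworks offer comparable facilities (for example SLF4J's `MDC` or Log4j 2's `ThreadContext`), which would normally be preferred over a hand-rolled helper.

```java
class UserLogContextSketch {
    // Holds the current user's identifier for the thread handling the request.
    private static final ThreadLocal<String> CURRENT_USER =
            ThreadLocal.withInitial(() -> "anonymous");

    static void setUser(String userId) {
        CURRENT_USER.set(userId);
    }

    // Should be called when the request finishes so pooled threads do not leak identities.
    static void clear() {
        CURRENT_USER.remove();
    }

    // Prefixes the message with the identifier so all of a user's logs can be searched.
    static String tag(String message) {
        return "[user=" + CURRENT_USER.get() + "] " + message;
    }

    public static void main(String[] args) {
        setUser("jsmith");  // hypothetical username
        try {
            System.out.println(tag("Payments: Banking WS returned sort code ok."));
        } finally {
            clear();
        }
    }
}
```

Searching the log file for `[user=jsmith]` would then return that user's journey across the site.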
What are the Top Ten most frequent logs that you currently write to your log files? After every release into production, whether after a two-week sprint or a three-month waterfall release, find out what the Top Ten logs are and review why they are being written.
These may be genuine errors, or bad coding that sets the logging level incorrectly. Do you need to know every INFO log being written for every user hitting that page? Create a Confluence wiki page to store the GREP/search commands and display the Top Ten errors.
Look to fix or refine as many of the Top Ten error logs as possible before the next release, as part of reducing code debt. Also consider using a monitoring tool like Coralogix, which is able to automatically detect anomalies when you deploy a new version.
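If no tooling is available, a Top Ten list can be produced with a short program. The sketch below counts identical message texts and returns the most frequent ones. It assumes the timestamp and other variable context have already been stripped from each line, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class TopLogMessagesSketch {
    // Returns up to `limit` messages ordered by descending count,
    // with ties broken alphabetically so the result is deterministic.
    static List<Map.Entry<String, Long>> topMessages(List<String> messages, int limit) {
        Map<String, Long> counts = messages.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> messages = List.of(  // hypothetical sample data
                "Payments: Banking WS is not responding.",
                "Payments: Banking WS returned sort code ok.",
                "Payments: Banking WS is not responding.");
        topMessages(messages, 10)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Running this after each release and comparing the lists makes it easier to spot a log statement that has suddenly become noisy.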
Remove any unnecessary characters, comments and lines that serve no purpose from the log files. If your code base is several years old, there are probably some unnecessary characters lurking in your log files.
In order to see their latest line of code within the depths of the error logs, some developers may decide to add a few extra characters so they can quickly find their code in UAT when searching.
'>>>>>>>>>>>>>>>>>>> User successfully logged in!!!! <<<<<<<<<<<<<<<<<<'
Unfortunately, these extra characters can be checked into the code base and quickly forgotten, cluttering up the log files. This additional noise makes it more difficult to see the logs that truly matter. One trick for speeding up comprehension of your logs is to use short, consistent abbreviations.
Another good practice is to review the logging and associated error handling code together, and to keep INFO log lines short.
Never write passwords to the log files. The same applies to sensitive data such as full credit card numbers, expiry dates, CVV numbers and other card details. Credit card numbers can be masked to show just the last 4 digits.
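Masking to the last 4 digits might be done with a small helper like the one below, applied before the value ever reaches a log statement. This is a sketch only: the class and method names are hypothetical, and fully masking inputs of 4 digits or fewer is an assumption made here to avoid exposing a short value in full.

```java
class CardMaskingSketch {
    // Strips separators, then replaces all but the last 4 digits with '*'.
    static String maskCardNumber(String cardNumber) {
        String digits = cardNumber.replaceAll("\\D", "");
        if (digits.length() <= 4) {
            // Too short to reveal anything safely, so mask everything.
            return "*".repeat(digits.length());
        }
        int hidden = digits.length() - 4;
        return "*".repeat(hidden) + digits.substring(hidden);
    }

    public static void main(String[] args) {
        // 4111 1111 1111 1111 is a commonly used test card number, not a real account.
        System.out.println("Payments: card " + maskCardNumber("4111 1111 1111 1111"));
    }
}
```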
A project plan rarely accounts for logging. Therefore, development teams need to make sure that logging is not forgotten when writing low level code designs or user stories.
These logging requirements need to be reviewed and communicated to the support and monitoring teams so they know what changes will appear in the production logs after go live and can adjust their monitoring tools and support documentation.
If your development team performs code reviews before any code is released into production, don’t forget to include the log statements. Often, the logging code is skipped in favour of discussing the ‘more important’ project code changes.
Error logs are crucial to successful troubleshooting. With time and attention, you will grow a set of reliable logs. These logs will become the cornerstone of your monitoring success.