An Introduction to Windows Event Logs

The value of log files goes far beyond their traditional remit of diagnosing and troubleshooting issues reported in production. 

They provide a wealth of information about your systems’ health and behavior, helping you spot issues as they emerge. By aggregating and monitoring log file data in real-time, you can proactively monitor your network, servers, user workstations, and applications for signs of trouble.

In this article, we’re looking specifically at Windows event logs – the records Windows keeps of system, application, and security events – from how they are generated to the insights they can offer, particularly in the all-important security realm.

Logging in Windows

If you’re reading this, your organization will likely run Windows on at least some of your machines. Windows event logs come from the length and breadth of your IT estate, whether that’s employee workstations, web servers running IIS, cluster managers enabling highly available services, Active Directory or Exchange servers, or databases running on SQL Server.

Windows has a built-in tool for viewing log files from the operating system and the applications and services running on it. 

Windows Event Viewer is available from the Control Panel’s Administrative Tools section or by running “eventvwr” from the command prompt. From Event Viewer, you can view the log files generated on the current machine and any log files forwarded from other machines on the network.

When you open Event Viewer, choose the log file you want to view (such as Application, Security, or System). A list of the log entries is displayed together with each entry’s level (Critical, Error, Warning, Information, Verbose). 

As you might expect, you can sort and filter the list by parameters such as date, source, and severity level. Selecting a particular log entry displays the details of that entry.
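
The same data can also be pulled from the command line, which is handy for scripting. As a minimal sketch (assuming Python is available on the machine), the snippet below shells out to the built-in wevtutil tool to fetch the newest entries from the Security log. The log name and entry count are just example values, and reading the Security log requires administrative privileges.

  import subprocess

  # Ask wevtutil for the 20 newest Security log entries in human-readable form:
  # /c sets the number of events, /rd:true returns newest first, /f:text formats as text.
  result = subprocess.run(
      ["wevtutil", "qe", "Security", "/c:20", "/rd:true", "/f:text"],
      capture_output=True, text=True, check=True,
  )
  print(result.stdout)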

Audit policies let you control which types of events are logged. If you aim to identify and stop cyber-attacks at the earliest opportunity, it’s essential to apply your chosen policy settings to all machines in your organization, including individual workstations, as attackers often target these. 

Furthermore, if you’re responsible for archiving event logs for audit or regulatory purposes, ensure you check the properties for each log file and configure the log file location, retention period, and overwrite settings.
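
If you prefer to script those log properties rather than click through Event Viewer, wevtutil can also set them. The sketch below is illustrative only: the 64 MB size cap and the choice of the Application log are arbitrary example values.

  import subprocess

  # Illustrative settings: cap the Application log at 64 MB and archive full logs
  # (retain and auto-backup when full) instead of overwriting the oldest entries.
  subprocess.run(
      ["wevtutil", "sl", "Application", "/ms:67108864", "/rt:true", "/ab:true"],
      check=True,
  )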

Working with Windows event logs

Although Event Viewer gives you access to your log data, you can only view entries for individual machines, and you need to be logged into the machine in question to do so. 

However, for all but the smallest environments, logging into or connecting to individual machines to view log files regularly is impractical. 

At best, you can use this method to investigate past security incidents or diagnose known issues, but you miss out on the opportunity to use log data to spot early signs of trouble.

If you want to leverage log data for proactive monitoring of your systems and to perform early threat detection, you first need to forward your event logs to a central location and then analyze them in real-time. 

There are various ways to do this, including using Windows Event Forwarding to set up a subscription or installing an agent on devices to ship logs to your chosen destination.

Forwarding your logs to a central server also simplifies the retention and backup of log files. This is particularly helpful if regulatory schemes, like HIPAA, require storing log entries for several years after they were generated.
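
There are many off-the-shelf agents for this, so the snippet below is only a toy illustration of the idea: it reads recent Security events as XML and posts them to a central collector. The collector URL is hypothetical, and a production setup would use a proper shipping agent or a Windows Event Forwarding subscription instead.

  import subprocess
  import urllib.request

  COLLECTOR_URL = "https://logs.example.com/ingest"  # hypothetical central endpoint

  # Grab the newest Security events as XML...
  events_xml = subprocess.run(
      ["wevtutil", "qe", "Security", "/c:100", "/rd:true", "/f:xml"],
      capture_output=True, text=True, check=True,
  ).stdout

  # ...and forward them to the central collector over HTTPS.
  request = urllib.request.Request(
      COLLECTOR_URL,
      data=events_xml.encode("utf-8"),
      headers={"Content-Type": "application/xml"},
  )
  urllib.request.urlopen(request)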

Monitoring events

When using log data proactively, it pays to cast a wide net. The types of events that can signal something unexpected or indicate a situation that deserves close monitoring include:

  • Any change in the Windows Firewall configuration. As all planned changes to your setup should be documented, anything unexpected should trigger alarm bells.
  • Any change to user groups or accounts, including creating new accounts. Once an attacker has compromised an account, they may try to increase their privileges.
  • Successful or failed login attempts and remote desktop connections, particularly if these are outside business hours or from unexpected IP addresses or locations.
  • Account lockouts. These may indicate a brute-force attempt, so it’s worth identifying the machine in question and investigating whether it was just an honest mistake.
  • Application allowlisting violations. Keep an eye out for scripts and processes that don’t usually run on your systems, as these may have been added to facilitate an attack.
  • Changes to file system permissions. Look for changes to root directories or system files that should not routinely be modified.
  • Changes to the registry. While you can expect some registry keys, such as recently used files, to change regularly, others, such as those controlling the programs that run on startup, could indicate something more sinister. Similarly, any changes to permissions on the password hash store should be investigated.
  • Changes to audit policies. Hackers can ensure future activity stays under the radar by changing what events are logged.
  • Clearing event logs. It’s not uncommon for attackers to cover their tracks. If logs are being deleted locally, it’s worth finding out why (while breathing a sigh of relief that you had the entries forwarded to a central location automatically and haven’t lost any data).

While the above is far from an exhaustive list, it demonstrates that you should set your Windows audit policies to log more than just failures.
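
Many of the items above correspond to well-known Security log event IDs, which makes them straightforward to filter for. The sketch below uses an XPath query with wevtutil to pull a handful of commonly cited IDs; the list is indicative rather than complete, and you should adjust it to match your own audit policy.

  import subprocess

  # Commonly cited Security log event IDs for some of the items above:
  # 4624/4625 successful/failed logons, 4720 account created, 4740 account lockout,
  # 4719 audit policy changed, 1102 audit log cleared.
  watch_ids = [4624, 4625, 4720, 4740, 4719, 1102]

  # Build an XPath filter and ask wevtutil for the newest matching Security events.
  xpath = "*[System[(" + " or ".join(f"EventID={i}" for i in watch_ids) + ")]]"
  result = subprocess.run(
      ["wevtutil", "qe", "Security", f"/q:{xpath}", "/c:50", "/rd:true", "/f:text"],
      capture_output=True, text=True, check=True,
  )
  print(result.stdout)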

Wrapping up

Windows event log analysis can capture a wide range of activities and provide valuable insights into the health of your system. 

We’ve discussed some events to look out for and which you may want to be alerted to automatically, but cybersecurity is a constantly moving target, with new attack vectors emerging all the time. You must continuously be on the lookout for anything unusual to protect your organization.

By collating your Windows event logs in a central location and applying machine learning, you can offload much of the effort of detecting anomalies. Coralogix uses machine learning to detect unusual behavior while filtering out false positives. Learn more about log analytics with Coralogix, or start shipping your Windows event logs now.

Proactive Monitoring vs. Reactive Monitoring

Log monitoring is a fundamental pillar of modern software development. With the advent of modern software architectures like microservices, the demand for high-performance monitoring and alerting shifted from useful to mandatory. Combine this with an average outage cost of $5,600 per minute, and you’ve got a compelling case for investing in your monitoring capability. However, many organizations are still simply reacting to incidents as they see them, and they never achieve the next stage of operational excellence: proactive monitoring. Let’s explore the difference between reactive and proactive monitoring and how you can move to the next level of modern software resilience.

Proactive vs. Reactive: What’s the difference?

Reactive monitoring is the classic model of software troubleshooting. If the system is working, leave it alone and focus on new features. This represents a long history of monitoring that simply focused on responding quickly to outages and is still the default position for most organizations that are maintaining in-house software.

Proactive monitoring builds on top of your reactive monitoring practices and uses many of the same technical components, but it has one key difference: rather than waiting for your system to reach out with an alarm, it allows you to interrogate your observability data to develop new, ad hoc insights about your platform’s operational and commercial success.

Understanding Reactive Monitoring

Reactive monitoring is the easiest to explain. If your database runs out of disk space, a reactive monitoring platform would inform you that your database is no longer working. If a customer calls to report that the website is down, you might use your reactive monitoring toolset to check for obvious errors.

In practice, reactive monitoring looks like alerts on key metrics: if a metric exceeds a threshold, an alarm fires and informs your engineers that something is broken. You may also capture logs, metrics, and traces, but never look at them until one of your alarms tells you to. These behaviors are the basic tenets of reactive monitoring. 
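
As a bare-bones illustration of that threshold model, the sketch below checks free disk space and raises an alert when it drops below an arbitrary limit. The path and threshold are example values only, and in a real system the notification would go to a pager or incident channel rather than standard output.

  import shutil

  THRESHOLD_PCT = 10  # arbitrary example threshold

  # Classic reactive check: measure the metric, compare it to a fixed threshold.
  usage = shutil.disk_usage("/")
  free_pct = usage.free / usage.total * 100

  if free_pct < THRESHOLD_PCT:
      # In production this would page an engineer or open an incident.
      print(f"ALERT: only {free_pct:.1f}% disk space remaining")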

So what are the limitations of reactive monitoring?

The limitations of reactive monitoring are clear, but a reactive-only strategy has some more subtle consequences. The obvious implication is that you’re reacting to incidents rather than preventing them, which leads to service disruptions and customer impact. However, it also means more time spent troubleshooting: interruptions can consume up to six hours of a typical working day. These interruptions, plus potentially expensive outages, can add up to a lot of lost revenue and reputational damage that may affect your ability to attract talented employees and new customers.

What is Proactive Application Monitoring?

So what is proactive monitoring? We’ll go into more detail below, but at a high level it looks like this:

  • Multiple alert levels for different severity events. For example, Google advises three levels – notify, ticket, and page.
    • Notify is simply a piece of information, such as the database using more CPU than usual. 
    • Ticket is an issue that isn’t pressing but should be dealt with soon. 
    • A page alert is an actual alarm when something is broken. 
  • Predictive alarms tell you when something is going to happen rather than when it has already happened. Prometheus supports this with its predict_linear function, for example (see the sketch after this list). This is a straightforward implementation, but it illustrates the idea perfectly.
  • Interrogating your observability data regularly to understand how your system is behaving. For example, using Lucene to query your Elasticsearch cluster or PromQL to generate insights from your Prometheus data. 
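
As a minimal sketch of the predict_linear idea mentioned above (assuming a Prometheus server at the URL shown and the node_filesystem_avail_bytes metric exposed by node_exporter), the snippet below asks Prometheus which filesystems are on course to fill up within the next four hours.

  import json
  import urllib.parse
  import urllib.request

  PROMETHEUS_URL = "http://prometheus.example.com:9090/api/v1/query"  # hypothetical address

  # predict_linear extrapolates the last 6h of samples 4h into the future;
  # a predicted value below zero means the filesystem is on course to fill up.
  query = "predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0"

  response = urllib.request.urlopen(
      PROMETHEUS_URL + "?" + urllib.parse.urlencode({"query": query})
  )
  payload = json.loads(response.read())
  for series in payload["data"]["result"]:
      print("Filesystem predicted to fill up:", series["metric"])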

Machine learning is a powerful component of proactive monitoring

AIOps has found its place in the observability tech stack, and machine learning-driven alerts are a cornerstone of proactive monitoring. With a traditional alert, you’re setting alarms around your “known knowns” and “known unknowns”. These are the issues you’re either aware of, for example, a spike in database CPU usage following increased user traffic, or issues you know about but haven’t diagnosed yet, like a sudden slowdown in HTTP latency. 

With a machine learning alert, you’re looking more broadly for anomalous behavior. You can raise an alert if your system begins to exhibit behavior it hasn’t shown before, especially around a specific metric or type of log. This is incredibly powerful because it adds a safety net to your existing traditional alerts and can detect new issues that fall into the “unknown unknown” category. An example might be an error that only manifests when a series of other criteria are true, like the time of day, the number of on-site users, and the load on the system. 

These issues are challenging to detect, and they drive reactive monitoring behaviors – “we’ll just wait until it happens next time.” With a machine learning alert, you can catch these incidents in their early stages and analyze the anomalous behavior to gain new and powerful insights into the behavior of your system.
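
Coralogix’s anomaly detection is considerably more sophisticated than this, but as a toy illustration of the underlying idea, the sketch below flags a new metric sample as anomalous when it sits several standard deviations away from its recent history (a simple z-score test).

  from statistics import mean, stdev

  def is_anomalous(history, latest, threshold=3.0):
      # Flag the latest sample if it is more than `threshold` standard deviations
      # from the mean of the recent history.
      if len(history) < 2:
          return False
      mu, sigma = mean(history), stdev(history)
      if sigma == 0:
          return latest != mu
      return abs(latest - mu) / sigma > threshold

  # Example: request latency (ms) over the last few minutes, then a sudden spike.
  recent_latencies = [120, 118, 125, 122, 119, 121, 123, 120]
  print(is_anomalous(recent_latencies, 480))  # True: behavior the system hasn't shown before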

Summary

Proactive monitoring is the difference between firefighting and forward-fixing. A proactive approach gives your system a voice rather than relying on incidents or outages before your system surfaces data. When this approach is coupled with a machine learning strategy, you’ve got a system that informs you of undesirable (but not critical) behavior, plus alarms that tell you about new, potentially unwanted behavior that you never considered in the first place. This allows you to leverage your observability data to help you achieve your operational and commercial goals.