Writing Effective Suricata Rules with Examples [Best Practices]

This post will help you write effective Suricata Rules to materially improve your security posture. We'll begin with a breakdown of how a Rule is constructed and then explore best practices with examples, so you can catch as much malicious activity as possible with as few rules as possible.

Suricata is an open-source network intrusion detection system (NIDS) that provides real-time packet analysis and is part of the Coralogix STA solution. If you're a Coralogix STA customer, be sure to also check out my earlier post on How to Modify an STA Suricata Rule.

Anatomy of Suricata Rules

Before diving into the different strategies for writing your best Suricata rules, let’s start off by dissecting an example Suricata Rule:

alert tcp $EXTERNAL_NET any -> 10.200.0.0/24 80 (msg:"WEB-IIS CodeRed v2 root.exe access"; flow:to_server,established; uricontent:"/root.exe"; nocase; classtype:web-application-attack; reference:url,www.cert.org/advisories/CA-2001-19.html; sid:1255; rev:7;)

alert: tells Suricata to report this behavior as an alert (it’s mandatory in rules created for the STA).

tcp: means that this rule will only apply to TCP traffic.

$EXTERNAL_NET: this is a variable defined in Suricata. By default, the variable HOME_NET is defined as any IP within these ranges: 192.168.0.0/16,10.0.0.0/8,172.16.0.0/12, and EXTERNAL_NET is defined as any IP outside of these ranges. You can specify IP addresses either as a single IP like 10.200.0.0, a CIDR range like 192.168.0.0/16, or a list of IPs like [192.168.0.0/16,10.0.0.0/8]. Just note that spaces within the list are not allowed.
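These variables are typically defined in the Suricata configuration file (suricata.yaml). As a rough sketch of what such definitions look like (the exact values depend on your deployment):

    vars:
      address-groups:
        HOME_NET: "[192.168.0.0/16,10.0.0.0/8,172.16.0.0/12]"
        EXTERNAL_NET: "!$HOME_NET"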

any: in this context, it means "from any source port". Then there's an arrow '->' which means "a connection to" (there isn't a '<-' operator, but you can simply flip the arguments around the operator; you can also use the '<>' operator to indicate that the connection direction is irrelevant for this rule), then an IP range which indicates the destination IP address, and then the destination port. You can indicate a port range by using a colon, e.g. 0:1024, which means ports 0 through 1024. In the round parentheses, there are directives for setting the alert message, metadata about the rule, as well as additional checks.

msg: is a directive that simply sets the message that will be sent (to Coralogix, in the STA case) when matching traffic is detected.

flow: is a directive that indicates whether the content we’re about to define as our signature needs to appear in the communication to the server (“to_server”) or to the client (“to_client”). This can be very useful if, for example, we’d like to detect the server response that indicates that it has been breached.

established: is a directive that will cause Suricata to limit its search for packets matching this signature to packets that are part of established connections only. This is useful to minimize the load on Suricata.

uricontent: is a directive that instructs Suricata to look for a certain text in the normalized HTTP URI. In this example, we're looking for a URI containing the text "/root.exe".

nocase: is a directive that indicates that we’d like Suricata to conduct a case insensitive search.

classtype: is a directive that is a metadata attribute indicating which type of activity this rule detects.

reference: is a directive that is a metadata attribute linking to another system for more information. In our example, the value url,www.cert.org/advisories/CA-2001-19.html links to the relevant advisory on the Internet.

sid: is a directive that is a metadata attribute that indicates the signature ID. If you are creating your own signature (even if you’re just replacing a built-in rule), use a value above 9,000,000 to prevent a collision with another pre-existing rule.

rev: is a directive that indicates the version of the rule.

There's a lot more to learn about Suricata rules, which support RegEx matching (via the pcre keyword), protocol-specific parsing (just like uricontent for HTTP), matching binary (non-textual) data by using hex byte values, and much, much more. If you'd like to know more, you can start here.
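For illustration only, here are two hypothetical rule sketches combining some of these capabilities (the SIDs, port numbers, and patterns are invented for this example and are not real signatures):

    # A regular expression applied to the normalized HTTP URI (the /U pcre flag)
    alert http $EXTERNAL_NET any -> $HOME_NET any (msg:"Example - suspicious token format in URI"; flow:to_server,established; pcre:"/session=[a-f0-9]{32}/Ui"; classtype:bad-unknown; sid:9000001; rev:1;)

    # A raw byte (hex) match at the very start of a TCP payload
    alert tcp $EXTERNAL_NET any -> $HOME_NET 4444 (msg:"Example - known binary protocol header"; flow:to_server,established; content:"|DE AD BE EF|"; offset:0; depth:4; classtype:bad-unknown; sid:9000002; rev:1;)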

Best Practices to Writing Effective Suricata Rules

  1. Target the vulnerability, not the exploit – Avoid writing rules for detecting a specific exploit kit because there are countless exploits for the same vulnerability and we can be sure that new ones are being written as you’re reading this. For example, many of the early signatures for detecting buffer overrun attacks looked like this:
    alert tcp $EXTERNAL_NET any -> $HOME_NET 80 (msg:"Buffer overrun detected"; content:"AAAAAAAAAAAAAA"; sid:1000001; rev:1;)

    The reason for that is, of course, that to launch a successful buffer overrun attack, the attacker needs to fill the buffer of a certain variable and append his malicious payload at the end so that it becomes executable. The characters he chooses to fill the buffer with are completely insignificant, and indeed, after such signatures appeared, many attack toolkits simply used a different letter or letters to fill the buffer and completely evaded this type of signature detection. A much better approach is to detect these kinds of attacks by detecting incorrect input to fields based on their expected type and length (see the example rules after this list).

  2. Your peculiarity is your best asset, so use it – Every organization has things that make it unique. Many of these can be quite useful when you try to catch malicious activity in your organization – both external and internal. By using this deep internal knowledge to your advantage, you’ll essentially convert the problem from a technological one to a pure old-school intelligence problem, forcing attackers to have a much more intimate understanding of your organization in order to be able to hide their tracks effectively. These things can be technological in nature or based on your organization’s particular working habits and internal guidelines. Here are some examples:
    • Typical Working Hours: Some organizations I worked at did not allow employees to work from home at all, and the majority of employees would have already left the office by 19:00. For similar organizations, it would make sense to set an alert to notify you of connections from the office after a certain hour. An attacker that would install malicious software in your organization would have to know that behavior and tune his malware to communicate with its Command & Control servers only during working hours, precisely so that such communications would go unnoticed.
    • Typical Browser: Suppose your organization has decided to use the Brave browser as its main browser, it gets installed automatically on every new corporate laptop, and you have removed the desktop shortcuts to the IE/Edge browser from all of your corporate laptops. In that case, any connection from the organization, whether to internal or external addresses, that uses a different browser such as IE/Edge should be configured to raise an alert (see the example rules after this list).
    • IP Ranges based on Roles: If it's possible for you to assign different IP ranges to different servers based on their role, for example, to have all DB servers on 192.168.0.0/24, the web servers on 192.168.1.0/24, etc., then it becomes possible and even easy to set up clever rules based on the expected behavior of your servers according to their role. For example, database servers usually don't initiate connections to other servers on their own, printers don't try to connect to your domain controllers, etc.
    • Unusual Connection Attempts: Many organizations use a public share on a file server to help their users share files between them, and use network-enabled printers to allow their users to print directly from their computers. That means that client computers should not connect to each other at all. Even if you have (wisely) chosen to block such access at the firewall or the switch, the very attempt to connect from one client computer to another should raise an alert and be thoroughly investigated.
    • Uncommon Ports: Some organizations use a special library for communication optimization between services, so that all HTTP communication between servers uses a different port than the common ones (such as 80, 443, 8080, etc.). In this case, it's a good idea to create a rule that triggers on any internal communication over those normally common ports, since legitimate traffic shouldn't be using them.
  3. Honeytokens – On a battlefield like the Internet, where everyone can be just about anyone, deception works for defenders just as well as it does for attackers, if not better. For example, rename the built-in administrator account to a different, less attractive name, create a new account named Administrator which you'll never use, and create a Suricata rule for detecting whether this user name, email, or password is ever used on the network. It would be next to impossible for attackers to notice that Suricata has detected their attempts to use the fake administrator user. Another example is to create fake products, customers, users, and credit card records in the database, and then matching Suricata rules for detecting them in the network traffic (see the example rules below).
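To make a few of the ideas above concrete, here are some illustrative rule sketches. The parameter name, the user-agent token, the honeytoken value, and the SIDs are all invented for this example and would need tuning before use in a real environment:

    # 1. Target the vulnerability: flag an abnormally long value in a specific URI parameter
    #    instead of matching a particular filler pattern
    alert http $EXTERNAL_NET any -> $HOME_NET any (msg:"Suspiciously long 'user' parameter"; flow:to_server,established; pcre:"/[?&]user=[^&]{128,}/Ui"; classtype:web-application-attack; sid:9000010; rev:1;)

    # 2. Typical browser: alert on an IE/Edge (Trident) user agent in an organization standardized on Brave
    alert http $HOME_NET any -> any any (msg:"Non-approved browser detected (IE/Edge user agent)"; flow:to_server,established; http.user_agent; content:"Trident"; nocase; classtype:policy-violation; sid:9000011; rev:1;)

    # 3. Honeytoken: alert whenever a decoy value that should never legitimately be used appears on the wire
    alert tcp $HOME_NET any -> any any (msg:"Honeytoken value observed on the network"; content:"jane.doe.decoy@example.com"; nocase; classtype:policy-violation; sid:9000012; rev:1;)

Note how the first rule follows the advice in point 1: it keys on the property that makes the attack work (an oversized field) rather than on a specific exploit's padding bytes.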

We hope you found this information helpful. For more information on the Coralogix STA, check out the latest features we recently released.

The Cloud Network Security Gap

Before we dive into the gap in cloud network security, let’s take a step back. If you’ve been in Operations for a while, you might remember how it used to be. “Network” was a team. When you needed to open a port on the network, you had to provide an exhaustive definition of the change, explaining what port you needed, what external addresses should be able to reach it, and where it should be routed to internally. This was sometimes a physical document you had to fill out with a pen. This document was dutifully routed to the Network Manager for authorization, and then the head of operations. The requested change was scheduled for the next change management window, an unlucky nighttime hour when the company calculated the probability of damage to be minimal, in case something went wrong. During that window, the on-call network engineer would connect to the relevant, physical network devices, and update their configuration.

In rare cases, he might fail due to some limitation of the hardware or configuration, and your requested change would be sent back to the drawing board. In other rare cases, he might not implement your change exactly as requested, and your requested connectivity might be limited or passing your traffic to the wrong network segment. If everything went well, your requested change was implemented the way you wanted, and you were free to test it the following workday. If you forgot that you needed an additional segment connected for your test environment, you had to start the process all over again. 

Most of us no longer live this Byzantine nightmare, as we’ve survived the tumultuous years of cloud adoption, and now live in the bright future where cloud adoption is mainstream. Even the most painfully bureaucratic government agencies are slowly adopting the cloud. We have DevOps and SRE teams and our infrastructure is as agile as our development. A new network segment is a matter of a few clicks in the cloud dashboard. Start-ups no longer need to purchase expensive hardware co-located in data centers in the United States. We can try new things quickly and delete them if they didn’t work out. Sometimes, it’s hard to believe where we came from and how quickly we got here.

However, this agility in infrastructure has some obvious costs. I've never worked at a cloud-hosted start-up without some cloud network security rules that nobody is sure where they came from, or whether we still need them. There might be some instances running in the cloud that were spun up manually for a POC 8 months ago, weren't properly registered with the internal inventory system, and nobody knows that an additional 200 dollars a month is being paid for instances that aren't in use. There are probably container images in the cloud repositories that were uploaded back when the company was first dipping its toes into containerization, which are hilariously not optimized, but still available.

The resulting problem is that most cloud-hosted companies don’t have a complete and accurate picture of what’s going on in their cloud deployment. They also probably don’t have someone who specializes in networks on staff, as this is handled ably by most DevOps teams. This issue only compounds itself as companies branch out to hybrid multi-cloud deployments or even a mix of on-prem and multi-cloud. This field is served by hundreds of security and optimization start-ups, and all of them claim to surprise customers during POCs, showing them lists of dusty, unused inventory, and forgotten network segments.

The implications are, of course, financial, as this inventory costs money to run, whether it's in use or not; but there are also significant cloud network security implications. When opening a port in the firewall to production is as easy as a single click in the cloud dashboard, it can almost happen by accident. When we throw all of our cloud resources into the same VPC with the same network, lacking segmentation and internal authentication, we run significant risks. Why do we do this? Obviously, because it's easy, but another reason might be the overarching belief that the cloud is more secure, and by and large, this is absolutely correct.

When we contract with cloud providers, some of the security considerations are always the responsibility of the provider. The provider is responsible for the physical security of the hardware, ensuring that the correct cables are connected in the right ports, and making sure everything is patched and up to date. This is the very lifeblood of their business, and they are excellent at it.

Network is one of the core components of their business, not a business enabler, and gets the relevant resources. So, we’re not worried about our firewall being breached because it wasn’t patched, we have DDoS protection built in, we trust our cloud network, and we should! The issue is with our own visibility and management of our network. What ports are open to the world? What resources can be directly accessed from outside the network? Do we even have network segments? What internal authentication are we using to ensure that communication within our network is expected and secure? The cloud provider can’t help us here. 

So how do we solve this problem? How can we ensure our cloud deployment, home of our valuable IP and even more valuable customer data, is secure? There are a lot of ways to answer this question, but it boils down to:

  1. A culture of building things securely, often called DevSec and DevSecOps, 
  2. An investment in visibility and detection, and
  3. Empowered enforcement

Each one of these topics is worth several books on its own, but we’ll summarize. 

DevSec and DevSecOps

This first point is a difficult one. It’s easy to say, on paper, that our DevOps team isn’t going to make manual and undocumented changes in the cloud dashboard anymore. Everything will be templated in Terraform, changes will be submitted as pull requests, reviewed by a principal engineer and the manager, and only after merging the changes will we deploy the updated infrastructure. Problem solved, right? Then one night, at 3:42AM, the on-call DevOps engineer is woken up by a customer success engineer, who needs network access for a new integration for a strategic client, who is mad as hell that the integration isn’t working.

The on-call engineer, in a sleepy daze, opens traffic to the internet on the relevant port, and goes back to sleep. From painful experience, I can tell you that if this hypothetical engineer has been on-call for a few days already, they may not even remember the call in the morning. The same goes for the dev team. It’s a lot easier to spin up new services in the Kubernetes cluster without ensuring some sort of encryption and authentication between the services. The deadline for the new service is in two weeks, and the solution is going to be a REST API over HTTP. Implementing gRPC or setting up TLS for AMQP is a time-consuming process, and if the dev team wasn’t asked to do that from day one, they aren’t going to do it at all. Even if they do implement something, where are they managing their certificates? Who is responsible for certificate rotation? Every security “solution” leaves us with a new security challenge.

The situation isn't hopeless; these examples are here to show that the change needs to be cultural. Hiring a security expert to join the DevOps team two years after founding the company is better than nothing, but the real solution is to provide our DevOps team with cloud network security requirements from day one, and to make sure our DevOps team leader is enforcing security practices and implementing a security-focused culture. Sure, it's going to take a little bit more time in the early stages, but it will save an enormous amount of trouble downstream, when the company officers realize that the early "POC" environment has become the de-facto production environment, and decide that it's time to formalize production processes. Once again, the same goes for our dev team. Obviously, there's a lot of pressure to produce working services for company POCs, and the faster we get an MVP running, the faster we can start bringing in money.

We must learn to balance these business requirements with a culture of developing securely. Let's ensure that there's a "security considerations" section in every design document, which the designer has thoughtfully filled out. It's ok if we don't have a security lifecycle defined on day one, but we can at least make sure our services are communicating over HTTPS, or that our AMQP messages use TLS, and that someone on the team knows that managing certificates is their job. Someone has set a reminder in the company calendar two weeks before the certificate expires, so we aren't blacked out and unable to communicate in a year, when everyone has forgotten all about it, and encrypted service communication is what we're used to. These early investments in security culture translate to hundreds of hours saved later, when the company has grown, is about to land a strategic government client, and suddenly has to meet a compliance standard that would otherwise mean refactoring everything to communicate securely.

Visibility and Detection

There’s a saying in security: “You can’t protect what you can’t see”. Visibility is absolutely crucial to ensuring security. We must know what the inventory on the network is, what our network topology is, and we must know what it should be. If we don’t know that one of our services, which is dutifully providing value to a customer, shouldn’t be accessible from the internet, it doesn’t matter if we see that it is. This potentially malicious traffic will fly under the radar. If we aren’t aware that none of our services are supposed to be communicating on a specific port, then we may be looking at traffic due to a malware infection, but unable to diagnose it. So, how do we get useful visibility into our cloud network? 

First, we have to create a useful and accurate map of our network. What ports should be open to the world? How does traffic flow from our load balancers to our backend services? Via what protocols? The network map must be an objective source of truth. It cannot be based on what IT believes to be true about the network. Generally, a good way to start is by looking at the outputs of our network traffic and charting their flow.

Once we have our map, we need to see what traffic is actually flowing in our network. We can set network taps and traps at strategic locations, like our load balancers and backend services, or use tools provided by the cloud provider, like VPC Traffic Mirroring in AWS. This is probably a large amount of data, and we need to usefully aggregate it in a way that provides us with insight. Nobody wants to read traffic captures for hours looking for unusual protocols or other anomalies.
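For illustration, a single VPC Traffic Mirroring session can be created from the AWS CLI roughly like this (all IDs below are placeholders):

    aws ec2 create-traffic-mirror-session \
      --network-interface-id eni-0123456789abcdef0 \
      --traffic-mirror-target-id tmt-0123456789abcdef0 \
      --traffic-mirror-filter-id tmf-0123456789abcdef0 \
      --session-number 1

Each session mirrors a single network interface, which is why doing this at scale is usually automated rather than clicked or typed by hand.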

Sending our traffic to quality tools is as important as capturing the traffic in the first place. Once we know what our traffic should look like, and then capture our traffic, we can finally start to get useful security insights. We can create beautiful cloud network security dashboards, which will show us anomalous behavior at a glance, and define useful alerts for critical issues. What will you discover? From experience, there’s always something. Maybe a metrics system is exposed to the world on an unexpected port? Is there a spike in database traffic at 2AM every day? Why?

Empowered Enforcement

This one is always surprising whenever it comes up for discussion. Let’s imagine we’ve found a security issue. There’s some misconfiguration in our firewall, which is allowing traffic from the entire internet into our system. However, when installing the product for a client, we made use of this availability to provide the customer with service. Fixing the breach means a service action on the customer’s side to correctly route their traffic, and updating our firewall rules. This means a service disruption for the customer, and they might understand that there was some security issue with their integration.

Faced with this dilemma, some companies might choose to do nothing, deciding that as long as the system is working, they can accept a weakened security posture. Enforcing security can mean disrupting service and uncomfortable explanations to our clients, and sometimes, it might just be a pain in the neck. There’s always pushback to security actions from some sector. The cloud has gotten us used to immediate gratification, infrastructure that “just works”, and does so right away. Some of us still recoil from overmanaged operations, the bad old days of physical change management forms looming in the back of our minds. 

As always, this is another question of culture and balance. Professionalism allows us to prioritize cloud network security incidents, and having clear processes for handling them in place helps set expectations. Discovering a security incident should trigger a short investigation to ascertain the dimensions of the issue, a short meeting of relevant stakeholders to decide what the priority is, and from there, the next steps should be a matter of protocol. Maybe the risk is low, and we don’t need to immediately disrupt service during peak hours, but we do have a plan to fix the problem during the next low tide, or over the weekend. Companies who invest in every other step, but do not enforce security when issues arise, may as well save their money. 

There’s no need to spend thousands on detection if you don’t plan on remediation. The cost of a breach is often far more than the temporary difficulty of resolving security incidents. It’s much harder to explain to a customer why their data was leaked than it is to explain that they’ll need to suffer a short service interruption. 

The point of this article is to spur thought and discussion. I’m sure that everyone reading this from the operations and dev realms can relate to some parts of the article. Maybe you’re in the process of implementing some of these ideas, or you’re shaking your head while trying to mentally map your production environment. Most teams at most companies will find themselves somewhere along the scale, in the process of grappling with these issues, and anyone who tells you they have perfectly managed environments with perfect security controls is 100% lying, or completely clueless. Implementing proper infrastructure management and security is a constant process, and making the best effort is far better than doing nothing at all. Define where your first steps should be, using your existing resources, and watch as your company starts to reap the financial and operational rewards.

Author: Tomer Hoter

Strategically Managing Cloud Resources for Security, Fun, and Profit

The first time I created a cloud compute instance, then called a "Cloud VM", was an almost transcendent moment. It was like magic. I was at my first organization which had adopted the cloud, in my first DevOps position, and I immediately knew that the world had changed. If you've ever seen a "tech evangelist" gushing about the cloud, and you weren't there and didn't experience this transformation, you may not really understand what everyone is still so excited about. Managing data center infrastructure used to be so hard. If you ran a single-tenant infrastructure, getting a new client up and running could take weeks. You had to have an entire team managing the purchasing of new servers, their delivery, and installation. The constant replacement of parts as disks failed and servers went dark was draining. Most companies had to "overprovision", with servers online and ready to go, but not providing services, ready for the case of a traffic peak or a new client. There were so many downsides, but it was all anyone could do.

The Challenge with Cloud Resources

Cloud resources quickly changed it all, and the rewards for those who migrated were immense. Spinning up a new datacenter could be a matter of minutes if you had a good DevOps team. New customer in the EU who needs fast response times? No problem! A few clicks later and a new EU-based environment is now available. Setting up a new datacenter in the EU was a multi-year process before. If you're paying attention, you may already have a good idea of what the problem is, the black pit yawning open next to this highway of progress. We went, as a profession, from managing static, physical infrastructure, easily itemized and jealously husbanded, to "managing" virtual infrastructure around the world which we could create almost as an afterthought. Companies just starting out with a dream, no DevOps, and two developers, could have a hundred servers in the air within a day. The first bill doesn't arrive for another month! No naming convention, no oversight, nothing required.

Sometimes called “cloud sprawl”, this situation has drastically worsened over time. At first, the cloud was an easy way to create managed VMs and their networks. Over time, the cloud offering has grown exponentially. Nearly every service necessary for the modern enterprise is offered as a managed service and billed differently. Some services bill by network usage, some by CPU usage, some by data egress (but data ingress is free!). Itemized cloud bills easily run into the thousands of lines. According to Flexera’s (formerly Rightscale) 2020 State of the Cloud Report, executives estimate that 30% of their cloud spending is wasted. On average, organizations spend 23% more than they budget for cloud spending, while forecasting that their cloud spending will only grow year over year. 93% of enterprise respondents were using a multi-cloud architecture, further complicating matters. Setting aside the operational cost of cloud sprawl, there’s another reason to properly manage cloud infrastructure. The security implications of unmanaged infrastructure are severe. Infrastructure we’re not aware of is, by definition, not monitored. If it suddenly starts to behave strangely, will we even notice? If a cloud instance the organization no longer needs is running a malicious cryptocurrency miner, with sixteen cores screaming at 100% utilization, will we believe it to be a business workload? 

So, the case is made. The cloud is a wonderful thing, an incredible business enabler, but, as a profession, we’re not the best at managing it, and this has security and operational implications. Luckily, all is not lost, and we don’t need to rush to spend more money resolving the problem. Every major cloud provider provides free tools to help us manage our cloud infrastructure, and making strategic efforts to implement management strategies can have a massive payoff, both in our operational spend and in our security posture. Most articles on this topic try to provide a list of “feel-good” tasks, like terminating unused compute instances (how novel!). While this is certainly a necessary task, it takes a “tactical” view of the problem. If we focus on specific culprits and clean them up in a one-time effort, we’ve missed the forest for the trees. You might have heard DevOps referred to as a continuous feedback loop. DevOps engineers plan, build, integrate and deploy, monitor, and based on feedback, go back to planning.

Solving The Challenge

We need to apply this same approach when we consider cloud management, which helpfully often falls within the purview of DevOps teams. Strategic cloud management is a feedback loop of planning, tracking, optimizing, and improvement. It's never too late. None of the organizations I've worked at started out with a plan for cloud infrastructure. We have a product that needs some amount of computing resources, network resources, and data resources. We'll probably pick the cloud services that, in the early days of the company, were the easiest for us to understand, not necessarily the ones best suited to our workloads. So, we build what we need, and years later, when spending is out of control and our cloud is broadly overprovisioned, we have to organize. So, let's apply the "DevOps" model to cloud infrastructure:

  • Now that we have an existing deployment, we have to try to Plan our cloud infrastructure.
  • Next, we need to Implement our plan!
  • We need to effectively Monitor our cloud deployment. Visibility is the key to both security and effective operations. 
  • Finally, we need to Improve, based on the data we’ve collected and new cloud offerings. Once we’ve iterated, the cycle starts anew.

Let’s discuss how to implement each one of these steps. Keep in mind that “strategic” usually means “will take a while”. This process will take several months at least for the first iteration. 

Step 0 – Homework!

Like any good general, before we can begin applying strategy, we need to have the best picture possible of reality, and we need to do everything we can to maintain this picture over time. The wonderful thing about reality is that it's objective, and can be conclusively determined, especially when it comes to our existing cloud infrastructure. We've decided that we need to get our cloud sprawl under control, and we've tasked the DevOps team, or IT team, or whichever other stakeholder (maybe the CTO!) with this effort. The first step is assigning this responsibility. Once we've done that, we need to determine reality. This is often a truly difficult and complex step, but it's of critical importance. Before we can do anything else, we need to know what our existing cloud infrastructure is, and ensure it's all tagged, tracked, and monitored. That includes everything. Create a model for tagging based on tagging best practices, and apply it. If you have your own monitoring solution, make sure it's installed on everything. If you have unmonitored resources, enable your cloud provider's monitoring solutions. We have to know how much of our resources we're utilizing, to determine where our opportunities are. It's very difficult to do this perfectly, but if we've gone from 0% managed to 70% managed, we're doing an amazing job. While we do this, it's a great time to audit our cloud network and make sure that what we've provisioned in the cloud is correctly serving our needs, and not letting prying eyes into our production deployments. Auditing network traffic can also help us discover services deployed in the cloud we might otherwise miss!
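As a small illustration of what "determining reality" can look like in practice on AWS (other clouds have equivalents; the tag key below is just an example), the Resource Groups Tagging API can enumerate resources and their tags, which helps spot anything that escaped your tagging model:

    # List every resource the tagging API can see in the current region, with its tags
    aws resourcegroupstaggingapi get-resources

    # List only resources that carry a given tag key (adjust the key to your own tagging model)
    aws resourcegroupstaggingapi get-resources --tag-filters Key=environment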

During this process, it’s likely that we will discover unused resources, security loopholes, and unmonitored legacy resources. We want to handle all of these issues, but it’s important not to get bogged down here. If we go into a multi-month cleanup process before we’ve implemented proper tagging, tracking, and monitoring, our picture of our cloud infrastructure will quickly decay, and we haven’t even gotten to step 1 yet! So, our team is exhausted and bloody, but we’re already in a stance thousands of times better than we were before we started, and we’re ready to improve our security and operations. 

Step 1 – Plan

It’s time to let our expertise shine! We’ve tagged and monitored our cloud, and now we have a lot of data. We can apply our knowledge of cloud services and capabilities to this data, to extract improved operations. Is our current deployment serving our business need exactly? Our security and confidentiality requirements? How about business continuity? Compliance? Utilization? We need to move, workload by workload, through our cloud deployment, and tailor our solution to our needs. This is the time to sit down with other stakeholders in the company. Is the database our developers selected in the early days of the company still the best fit for our product? Can we migrate to an offering that will give us better performance for fewer resources? Does our company roadmap have an upcoming compliance requirement that our current services don’t meet? Did a security audit recently turn up troubling network configuration? The cloud serves the company, and many places in the company can have helpful input when we plan our cloud resources.

Cloud providers usually offer two charging models for cloud services: allocation-based and consumption-based. Allocation-based models are what most of us are familiar with. We provision a certain amount of cloud resources, and those resources are statically always available to us, whether they're in use or not. Their cost is also static, whether they're in use or not. This model is best suited to very stable applications, not coupled to consumption or prone to spikes in usage or traffic. In most SaaS companies, this is often not the best solution for our cloud services! Consumption-based services are not "pre-provisioned", but generally provide some baseline of service, which can expand and contract based on user-defined metrics, and are charged based on usage. This is often the best solution for many workloads, especially in SaaS companies, where utilization can drop close to 0% during the user's off-hours. If your organization lacks the expertise to choose the best services for you, this may be a great place to consult with a professional cloud architect. If we've done our jobs correctly, we can even try to create a cost forecast! It will probably be wildly incorrect, but that's why this is an iterative process.

Step 2 – Implement

It's time to implement our plan. This part is where the excitement starts to show and we reap the rewards of weeks of hard work. Reporting to the entire company that we downsized several hundred instances and saved hundreds of thousands of dollars a year looks and feels good. This can also happen a lot faster than you might expect. The same nature of the cloud that allowed us to create wasteful and unmanaged resources so quickly allows us to streamline with the same speed. Resizing instances to match utilization and scaling clusters up and down are usually processes built into the cloud, not requiring downtime, just a few clicks in the cloud dashboard, or an updated Terraform template, and we've improved our operational stance immensely. It's also impossible to overstate the improvement to the organization's security posture after undertaking a process like this.

First, the security risk of unmanaged and unmonitored infrastructure is greatly reduced. Those same resources which provided a sneaky foothold into our production networks are gone, and the resources we want are now visible to us. It's not for nothing that good operations and good security go hand in hand; visibility and management are the key to both. Next, we've implemented processes that make it harder for unmanaged infrastructure to crop up again. We've audited our network, and tagged and grouped our security rules. New security rules which don't meet our new conventions will be immediately suspect. We can now identify suspicious network traffic that doesn't meet what we expect, because we know what to expect. We can identify a workload behaving suspiciously (remember the cryptominer?) because we know what it's provisioned for and what it's supposed to be running. Though a less tangible benefit for the company than the operational savings, improved security is still an obvious plus. In addition, as opposed to implementing restrictive security tools that hamper productivity and cause user pain, properly managed infrastructure provides us with a strong security benefit for less cost and less pain. Unless, of course, we're considering the headache for the DevOps team, but, from experience, that's a cost most businesses are willing to pay.

Step 3 – Monitor

Now that we've deployed our new cloud infrastructure, and everyone's gotten a bonus and a pat on the back for their hard work, it's time to make sure what we've done actually works. All that monitoring should be aggregated into useful dashboards which can hopefully tell us at a glance if we're meeting our resource provisioning goals, if we're still overprovisioned, or if some of our workloads are starved for more resources. Are we handling the growth of the company well? As the company grows, more users mean more resources need to come online. If this process is happening automatically, we're doing a great job! If not, maybe there's an opportunity to shift more workloads into consumption-based resources.

Monitoring is often treated as a process for finding faults exclusively, but if we have good control of our cloud resources, monitoring is a tool for identifying opportunity. Monitoring is also a process that requires maintenance and constant work. As the company develops new features and services, new dashboards need to come online. There will always be some drift between reality and what’s actually visible, and one of our constant struggles must be the continuous effort to improve our monitoring and close this gap. These aren’t new ideas. One of the most important and formative articles I read as a larval, 22-year-old engineer, way back in 2011, is “Measure Anything, Measure Everything”, from “Code as Craft”, Etsy’s engineering blog. One of the article’s conclusions is: “tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy”. One of the opportunities we need to identify is how to make monitoring easy, so this part of our process improves with us, and we don’t lose sight of reality.

Step 4 – Improve

So, we’ve come to the end of our first cycle. If we were super effective and there were no surprises and major upheavals along the way, it’s been a few months since we started with Step 0. Most likely, it’s been more than six months. We’ve saved the company immense operational costs, improved our security posture, and brought our sprawl under control. If we stop here, go take a well-deserved nap, and neglect our cloud infrastructure again, we’ll be back at Step 0 in a matter of months. It’s time to go back to Step 1, armed with experience and data, and plan once again. How can we optimize our new deployment even further? What tools have the cloud provider released since we started the process, and are they relevant to us? Did we correctly assess the needs of all our workloads? Have we scaled effectively with the company? How off was our cost prediction and why?    

 The main conclusion of this article is that, when dealing with something as core to our profession and business as our production infrastructure, our thinking needs to be as agile, iterative, and strategic as any other part of our organization. The products we create are based on long-term strategic roadmaps, fed by customer feedback, and maintained by teams of professionals. Why should our living, breathing, cloud infrastructure be any different? This is one of the leaps in thinking we need to make. We’ve left behind static, managed infrastructure for the cloud and DevOps methodologies, but we didn’t apply the same agility to our infrastructure we did to our code. It’s time to close that gap.

Author: Tomer Hoter

How to automate VPC Mirroring for Coralogix STA

After installing the Coralogix Security Traffic Analyzer (STA) and choosing a mirroring strategy suitable for your organization's needs (if you haven't yet, you can start by reading this), the next step is to set up the mirroring configuration in AWS. However, the configuration of VPC Traffic Mirroring in AWS is tedious and cumbersome – it requires you to create a mirror session per network interface of every mirrored instance, and just to add insult to injury, if that instance terminates for some reason and a new one replaces it, you'll have to re-create the mirroring configuration from scratch. If you, like many others, use auto-scaling groups to automatically scale your services up and down based on actual need, the situation becomes completely unmanageable.

Luckily for you, we at Coralogix have already prepared a solution for that problem. In the next few lines, I'll present a tool we've written to address that specific issue, as well as a few use cases for it.

The tool we've developed can run as a pod in Kubernetes or inside a Docker container. It is written in Go to be as efficient as possible and requires only a minimal set of resources to run properly. While running, it reads its configuration from a simple JSON file, selects AWS EC2 instances by tags, selects network interfaces on those instances, and creates a VPC Traffic Mirroring session for each selected network interface to the configured VPC Mirroring Target, using the configured VPC Mirroring Filter.

The configuration used in this document will instruct the sta-vpc-mirroring-manager to look for AWS instances that have the tags "sta.coralogixstg.wpengine.com/mirror-filter-id" and "sta.coralogixstg.wpengine.com/mirror-target-id" (regardless of the values of those tags) and collect the IDs of their first network interfaces (those attached as eth0). It will then attempt to create a mirror session for each collected network interface, mirroring to the target specified by the "sta.coralogixstg.wpengine.com/mirror-target-id" tag and using the filter specified by the "sta.coralogixstg.wpengine.com/mirror-filter-id" tag on the instance that network interface is attached to.

To function properly, the instance hosting this pod should have an IAM role attached to it (or the AWS credentials provided to this pod/container should contain a default profile) with the following permissions (a minimal example policy is sketched after this list):

  1. ec2:Describe* on *
  2. elasticloadbalancing:Describe* on *
  3. autoscaling:Describe* on *
  4. ec2:ModifyTrafficMirrorSession on *
  5. ec2:DeleteTrafficMirrorSession on *
  6. ec2:CreateTrafficMirrorSession on *
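For reference, the permissions above can be expressed as an IAM policy along the following lines. This is a minimal sketch derived directly from the list; review and tighten it (for example, the resource scope) according to your own security standards:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "StaVpcMirroringManager",
          "Effect": "Allow",
          "Action": [
            "ec2:Describe*",
            "elasticloadbalancing:Describe*",
            "autoscaling:Describe*",
            "ec2:CreateTrafficMirrorSession",
            "ec2:ModifyTrafficMirrorSession",
            "ec2:DeleteTrafficMirrorSession"
          ],
          "Resource": "*"
        }
      ]
    }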

Installation

This tool can be installed either as a Kubernetes pod or a Docker container. Here are the detailed instructions for installing it:

Installation as a Docker container:

  1. To download the docker image use the following command:
    docker pull coralogixrepo/sta-vpc-mirroring-config-manager:latest
  2. On the Docker host, create a config file for the tool with the following content (if you would like the tool to report to the log what it is about to do, without actually modifying anything, set "dry_run" to true):
    {
      "service_config": {
        "rules_evaluation_interval": 10000,
        "metrics_exporter_port": ":8080",
        "dry_run": false
      },
      "rules": [
        {
          "conditions": [
            {
              "type": "tag-exists",
              "tag_name": "sta.coralogixstg.wpengine.com/mirror-target-id"
            },
            {
              "type": "tag-exists",
              "tag_name": "sta.coralogixstg.wpengine.com/mirror-filter-id"
            }
          ],
          "source_nics_matching": [
            {
              "type": "by-nic-index",
              "nic_index": 0
            }
          ],
          "traffic_filters": [
            {
              "type": "by-instance-tag-value",
              "tag_name": "sta.coralogixstg.wpengine.com/mirror-filter-id"
            }
          ],
          "mirror_target": {
            "type": "by-instance-tag-value",
            "tag_name": "sta.coralogixstg.wpengine.com/mirror-target-id"
          }
        }
      ]
    }
    
  3. Use the following command to start the container:
    docker run -d \
       -p <prometheus_exporter_port>:8080 \
       -v <local_path_to_config_file>:/etc/sta-pmm/sta-pmm.conf \
       -v <local_path_to_aws_profile>/.aws:/root/.aws \
       -e "STA_PM_CONFIG_FILE=/etc/sta-pmm/sta-pmm.conf" \
       coralogixrepo/sta-vpc-mirroring-config-manager:latest

Installation as a Kubernetes deployment:

    1. Use the following config map and deployment configurations:
      apiVersion: v1
      kind: ConfigMap
      data:
        sta-pmm.conf: |
          {
            "service_config": {
              "rules_evaluation_interval": 10000,
              "metrics_exporter_port": 8080,
              "dry_run": true
            },
            "rules": [
              {
                "conditions": [
                  {
                    "type": "tag-exists",
                    "tag_name": "sta.coralogixstg.wpengine.com/mirror-target-id"
                  },
                  {
                    "type": "tag-exists",
                    "tag_name": "sta.coralogixstg.wpengine.com/mirror-filter-id"
                  }
                ],
                "source_nics_matching": [
                  {
                    "type": "by-nic-index",
                    "nic_index": 0
                  }
                ],
                "traffic_filters": [
                  {
                    "type": "by-instance-tag-value",
                    "tag_name": "sta.coralogixstg.wpengine.com/mirror-filter-id"
                  }
                ],
                "mirror_target": {
                  "type": "by-instance-tag-value",
                  "tag_name": "sta.coralogixstg.wpengine.com/mirror-target-id"
                }
              }
            ]
          }
      
      metadata:
        labels:
          app.kubernetes.io/component: sta-pmm
          app.kubernetes.io/name: sta-pmm
          app.kubernetes.io/part-of: coralogix
          app.kubernetes.io/version: '1.0.0-2'
        name: sta-pmm
        namespace: coralogix
      
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        labels:
          app.kubernetes.io/component: sta-pmm
          app.kubernetes.io/name: sta-pmm
          app.kubernetes.io/part-of: coralogix
          app.kubernetes.io/version: '1.0.0-2'
        name: sta-pmm
        namespace: coralogix
      spec:
        selector:
          matchLabels:
            app.kubernetes.io/component: sta-pmm
            app.kubernetes.io/name: sta-pmm
            app.kubernetes.io/part-of: coralogix
        template:
          metadata:
            labels:
              app.kubernetes.io/component: sta-pmm
              app.kubernetes.io/name: sta-pmm
              app.kubernetes.io/part-of: coralogix
              app.kubernetes.io/version: '1.0.0-2'
            name: sta-pmm
          spec:
            containers:
              - env:
                  - name: STA_PM_CONFIG_FILE
                    value: /etc/sta-pmm/sta-pmm.conf
                  - name: AWS_ACCESS_KEY_ID
                    value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                  - name: AWS_SECRET_ACCESS_KEY
                    value: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
                image: coralogixrepo/sta-vpc-mirroring-config-manager:latest
                imagePullPolicy: IfNotPresent
                livenessProbe:
                  httpGet:
                    path: "/"
                    port: 8080
                  initialDelaySeconds: 5
                  timeoutSeconds: 1
                name: sta-pmm
                ports:
                  - containerPort: 8080
                    name: sta-pmm-prometheus-exporter
                    protocol: TCP
                volumeMounts:
                  - mountPath: /etc/sta-pmm/sta-pmm.conf
                    name: sta-pmm-config
                    subPath: sta-pmm.conf
            volumes:
              - configMap:
                  name: sta-pmm
                name: sta-pmm-config
      

Configuration

To configure instances for mirroring, all you have to do is make sure that the instances whose traffic you would like mirrored to your STA have the tags "sta.coralogixstg.wpengine.com/mirror-filter-id" and "sta.coralogixstg.wpengine.com/mirror-target-id" pointing at the correct IDs of the mirror filter and target, respectively. To find out the IDs of the mirror target and mirror filter that were created as part of the installation of the STA, open the CloudFormation Stacks page in the AWS Console and search for "TrafficMirrorTarget" and "TrafficMirrorFilter" in the Resources tab.

To assign a different mirroring policy to different instances, for example to mirror traffic on port 80 from some instances and traffic on port 53 from others, simply create a VPC Traffic Mirror Filter manually with the correct filtering rules (just like in a firewall) and assign its ID to the "sta.coralogixstg.wpengine.com/mirror-filter-id" tag of the relevant instances.
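For example, assuming placeholder instance, filter, and target IDs, the tags can be set from the AWS CLI like this:

    aws ec2 create-tags \
      --resources i-0123456789abcdef0 \
      --tags "Key=sta.coralogixstg.wpengine.com/mirror-filter-id,Value=tmf-0a1b2c3d4e5f67890" \
             "Key=sta.coralogixstg.wpengine.com/mirror-target-id,Value=tmt-0a1b2c3d4e5f67890"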

Pro Tip: You can use the AWS "Resource Groups & Tag Editor" to quickly assign tags to multiple instances based on arbitrary criteria.

Good luck!

How SIEM is evolving in 2020

The evolution of Security Information and Event Management (SIEM) is deeply intertwined with cloud computing, both in terms of technological breakthroughs the cloud provided and from its inherent security challenges. 

What are the Security Challenges in the Cloud?

With the rise of cloud computing, we no longer rely on long-lived resources. Ephemeral infrastructure obscures the identity of the components, and even if you do have visibility, it doesn't necessarily mean you can comprehend the meaning behind the components. In fact, with containers and functions being an integral part of modern cloud-native applications, visibility is now even harder to achieve.

The cost, in particular of network and bandwidth, also creates an unexpected security challenge. With on-premise infrastructure, there is no need to pay to mirror network traffic (which is required to monitor it), but, in contrast, public cloud providers charge a hefty sum for this functionality. As an example, in AWS you pay for each VPC mirroring session and also for the bandwidth going from the mirror to the monitoring system.

Rising logging and metric data costs put pressure on teams to collect less data, which has a negative impact on the effectiveness of the security monitoring solution, because it's almost impossible to predict what the future needs of an investigation will be. On the other hand, the technical ability to properly scale using the cloud and ingest more data has had a tremendously positive impact.

However, even that ended up creating analyst fatigue, with most SIEMs having low satisfaction rates due to false positives.

The scalability of the public cloud, combined with gross misconfigurations and over-exposure of resources to the internet, also widens the scope of attacks and security threats. Today, advanced persistent threats (APT), ransomware, and geo-distributed Denial of Service (DDoS) attacks, among many other security issues, pose real peril to organizations of any size. The need for proper enterprise detection and response capabilities and services is vital to tackle the cloud security challenges, and a good SIEM solution is one of the best weapons you can get.

SIEM solutions and their evolution

The term Security Information and Event Management (SIEM) is not new, but during the last few years it has become incredibly popular.

The concept behind it is rather simple yet ambitious: a SIEM solution collects and analyzes events – both in real time and by looking back at historical data – from multiple data sources within an organization to provide capabilities such as threat detection, security incident management, analytics, and overall visibility across the systems in an organization.

SIEM-like systems have existed for several years, but it was only in the early 2000s that the term really came to life. At the time, cloud computing was practically non-existent, with only a few providers taking their first steps in the domain.

The first generation of SIEM was fully on-premise, incredibly expensive, and suffered from huge problems related to data source integrations and, above all, scalability (vertical scaling was the only way to go), leaving it unable to properly cope with ever-growing amounts of data.

With the rise in cloud computing popularity and maturity, SIEM solutions also evolved and were able to benefit from it. The second generation of SIEM solutions was born able to leverage those technology advancements, such as near-unlimited storage capacity and elastic computing resources; the scalability limitations were solved by enabling growth to happen horizontally, and therefore coping with constant data growth.

However, a new problem emerged: too much data. Security professionals and engineers using SIEM solutions were drowning in so much information that it became hard to find and understand what was actually relevant. A next generation of SIEM solutions started to appear, focused on operational capabilities and on properly leveraging analytics and machine learning algorithms to find those needles in the haystack.

What should you be looking for in a SIEM solution?

The new generation of SIEM effectively enables teams to gain meaningful visibility over the infrastructure and take relevant actions. 

Even people without a security background are now able to use a modern SIEM solution. The dashboards, alerts and recommended actions are now made to be comprehensive, straightforward and user friendly, while still enabling security professionals to leverage forensic capabilities within the solution if needed.

When looking for a SIEM solution, you should obviously look for one that can truly cope with data growth, and cloud-based managed solutions are naturally a good fit for this, but scalability by itself is not enough. One aspect to keep in mind is data optimization capability, such as the ability to process log insights without actually retaining the log data, which is a prime example of key functionality that enables massive cost savings in the long run.

Integrations and Data Correlation

Without data, a SIEM is rendered useless, and with the current wide technology landscape it is becoming incredibly hard to connect to all the existing data sources within an organization. Having good, readily available integration capabilities with 3rd-party software and data sources is important functionality to take into account. The value of SIEM comes from being able to correlate events (e.g. network traffic) with applications, cloud infrastructure, assets, transactions, and security feeds.

In the end, you want to get as much visibility and insight as possible into your cloud infrastructure by relying on raw traffic data (e.g. using traffic mirroring), because that enables you to detect anomalies such as advanced persistent threats (APTs) without the attacker even noticing it. Raw network traffic data capture and deep packet inspection is something that one might often overlook.

Normally, web server logs won't contain the actual data (payload) sent to the server and passed on to, say, the database. For example, if someone sends a malicious request containing a SQL injection, it will go unnoticed unless your SIEM solution is capable of capturing and inspecting that raw data.
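To make this concrete, a network IDS that inspects raw traffic (such as the Suricata engine discussed earlier in this post) can flag such payloads directly on the wire. Here is a minimal, illustrative rule sketch; the pattern and SID are invented for this example, and real SQL injection detection requires far more robust rules:

    alert http $EXTERNAL_NET any -> $HOME_NET any (msg:"Possible SQL injection in HTTP request body"; flow:to_server,established; http.request_body; content:"UNION"; nocase; content:"SELECT"; nocase; distance:0; classtype:web-application-attack; sid:9000201; rev:1;)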

Enriching the Value of Existing Data

While the collection and correlation of the information you already have within the organization – especially network traffic events – is the most important data point for a SIEM solution, quite often that data alone is not enough to produce actionable, intelligible alerts and recommendations.

A SIEM solution should be capable of enriching the organization's data with additional relevant information. The techniques vary, but some of the obvious ones are extracting the geolocation of IP addresses, detecting suspicious strings that may indicate the use of Domain Generation Algorithms (DGA), pulling public information regarding domain names (e.g. WHOIS data), and enriching log data with infrastructure information from the cloud provider.

A big part of that enrichment is only made possible if the SIEM solution subscribes and imports additional data sources, such as IP blacklists and 3rd party security feeds. These feeds are extremely important and often dictate the effectiveness of the SIEM solution, therefore having quality built-in feeds from cybersecurity vendors and other organizations is crucial.

Intelligence, Alerting and Insights

From the next generation of SIEM solutions we need to expect nothing less than meaningful alerts that leave little to interpretation. A solution capable of using AI and Machine Learning algorithms to power analytics combined with strong security knowledge will have a clear advantage. 

Traditional, old-school SIEM solutions relied on static thresholds to produce alerts, which causes fatigue and, quite often, useless alerts and false positives. Although there were improvements, like somewhat dynamic thresholds, these only reduced false positives while leaving the company open to missed attacks.

Instead, by leveraging machine learning algorithms coupled with expert security knowledge, a modern SIEM solution will learn from the behavior and natural patterns of a system and its users and only alert you to true causes for concern.

In fact, User and Entity Behaviour Analytics, or simply UEBA, is nowadays one of the must-have capabilities for a next-generation SIEM. 

UEBA, a term coined by Gartner a few years back, is the concept of detecting anomalous behavior of users and machines against a defined normal baseline. Profiling the user of a given system is incredibly helpful. As an example, one could expect that a machine used by a developer executes several PowerShell commands in a given period of time, while the exact same behavior from a machine used by someone working in Marketing would be very suspicious.

Similarly, entity (i.e. machine) profiling is equally important. If a machine that is usually only used during working hours suddenly initiates network connections (e.g. VPN) to a distant remote location at 4 AM, that is a strong indicator that something is wrong and should raise suspicion.

Conclusion 

The evolution of SIEM solutions has been nothing short of remarkable. A major driver, as mentioned above, has been cloud computing, both through the technology innovations and developments that it gave us and through its inherent security challenges.

As you move forward in developing your existing SIEM capabilities or adopting a whole new solution, it is very important to choose a solution that brings true value to the organization and provides you with actionable, intelligible alerts and recommendations. As we covered in this article, scalability and good integration capabilities with the 3rd-party systems and data sources your organization has are key, as are information enrichment, intelligence, and adopting the UEBA concept combined with security expertise.

Want to read more about security? There is a lot more on our blog, including this article on AWS Security. Check it out!