Before we dive into the gap in cloud network security, let’s take a step back. If you’ve been in Operations for a while, you might remember how it used to be. “Network” was a team. When you needed to open a port on the network, you had to provide an exhaustive definition of the change, explaining what port you needed, what external addresses should be able to reach it, and where it should be routed to internally. This was sometimes a physical document you had to fill out with a pen. This document was dutifully routed to the Network Manager for authorization, and then the head of operations. The requested change was scheduled for the next change management window, an unlucky nighttime hour when the company calculated the probability of damage to be minimal, in case something went wrong. During that window, the on-call network engineer would connect to the relevant, physical network devices, and update their configuration.
In rare cases, they might fail due to some limitation of the hardware or configuration, and your requested change would be sent back to the drawing board. In other rare cases, they might not implement your change exactly as requested, leaving your connectivity limited or your traffic routed to the wrong network segment. If everything went well, your requested change was implemented the way you wanted, and you were free to test it the following workday. If you forgot that you needed an additional segment connected for your test environment, you had to start the process all over again.
Most of us no longer live this Byzantine nightmare, as we’ve survived the tumultuous years of cloud adoption, and now live in the bright future where cloud adoption is mainstream. Even the most painfully bureaucratic government agencies are slowly adopting the cloud. We have DevOps and SRE teams and our infrastructure is as agile as our development. A new network segment is a matter of a few clicks in the cloud dashboard. Start-ups no longer need to purchase expensive hardware co-located in data centers in the United States. We can try new things quickly and delete them if they didn’t work out. Sometimes, it’s hard to believe where we came from and how quickly we got here.
However, this agility in infrastructure has some obvious costs. I’ve never worked at a cloud-hosted start-up without some cloud network security rules that have no known origin, and that nobody is sure we still need. There might be some instances running in the cloud that were spun up manually for a POC 8 months ago and weren’t properly registered with the internal inventory system, so nobody knows that an additional 200 dollars a month is being paid for instances that aren’t in use. There are probably container images in the cloud repositories that were uploaded back when the company was first dipping its toes into containerization, which are hilariously not optimized, but still available.
The resulting problem is that most cloud-hosted companies don’t have a complete and accurate picture of what’s going on in their cloud deployment. They also probably don’t have someone who specializes in networks on staff, as this is handled ably by most DevOps teams. This issue only compounds itself as companies branch out to hybrid multi-cloud deployments or even a mix of on-prem and multi-cloud. This field is served by hundreds of security and optimization start-ups, and all of them claim to surprise customers during POCs, showing them lists of dusty, unused inventory, and forgotten network segments.
The implications are, of course, financial, as this inventory costs money to run, whether it’s in use or not; but there are also significant cloud network security implications. When opening a port in the firewall to production is as easy as a single click in the cloud dashboard, it can almost happen by accident. When we throw all of our cloud resources in the same VPC with the same network, lacking segmentation and internal authentication, we run significant risks. Why do we do this? Obviously, because it’s easy, but another reason might be the overarching belief that the cloud is more secure, and by and large, this is absolutely correct.
When we contract with cloud providers, some of the security considerations are always the responsibility of the provider. The provider is responsible for the physical security of the hardware, ensuring that the correct cables are connected in the right ports, and making sure everything is patched and up to date. This is the very lifeblood of their business, and they are excellent at it.
Network is one of the core components of their business, not a business enabler, and gets the relevant resources. So, we’re not worried about our firewall being breached because it wasn’t patched, we have DDoS protection built in, we trust our cloud network, and we should! The issue is with our own visibility and management of our network. What ports are open to the world? What resources can be directly accessed from outside the network? Do we even have network segments? What internal authentication are we using to ensure that communication within our network is expected and secure? The cloud provider can’t help us here.
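So how do we solve this problem? How can we ensure our cloud deployment, home of our valuable IP and even more valuable customer data, is secure? There are a lot of ways to answer this question, but it boils down to three things: building a security-focused culture from day one, getting real visibility into our network, and enforcing security when issues are found.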
Each one of these topics is worth several books on its own, but we’ll summarize.
This first point is a difficult one. It’s easy to say, on paper, that our DevOps team isn’t going to make manual and undocumented changes in the cloud dashboard anymore. Everything will be templated in Terraform, changes will be submitted as pull requests, reviewed by a principal engineer and the manager, and only after merging the changes will we deploy the updated infrastructure. Problem solved, right? Then one night, at 3:42AM, the on-call DevOps engineer is woken up by a customer success engineer, who needs network access for a new integration for a strategic client, who is mad as hell that the integration isn’t working.
The on-call engineer, in a sleepy daze, opens traffic to the internet on the relevant port, and goes back to sleep. From painful experience, I can tell you that if this hypothetical engineer has been on-call for a few days already, they may not even remember the call in the morning. The same goes for the dev team. It’s a lot easier to spin up new services in the Kubernetes cluster without ensuring some sort of encryption and authentication between the services. The deadline for the new service is in two weeks, and the solution is going to be a REST API over HTTP. Implementing gRPC or setting up TLS for AMQP is a time-consuming process, and if the dev team wasn’t asked to do that from day one, they aren’t going to do it at all. Even if they do implement something, where are they managing their certificates? Who is responsible for certificate rotation? Every security “solution” leaves us with a new security challenge.
The situation isn’t hopeless; these examples are here to show that the change needs to be cultural. Hiring a security expert to join the DevOps team two years after founding the company is better than nothing, but the real solution is to provide our DevOps team with cloud network security requirements from day one, and to make sure our DevOps team leader is enforcing security practices and implementing a security-focused culture. Sure, it’s going to take a little bit more time in the early stages, but it will save an enormous amount of trouble downstream, when the company officers realize that the early “POC” environment has become the de-facto production environment, and decide that it’s time to formalize production processes. Once again, the same goes for our dev team. Obviously, there’s a lot of pressure to produce working services for company POCs and the faster we get an MVP running, the faster we can start bringing in money.
We must learn to balance these business requirements with a culture of developing securely. Let’s ensure that there’s a “security considerations” section in every design document, which the designer has thoughtfully filled out. It’s OK if we don’t have a security lifecycle defined on day one, but we can at least make sure our services are communicating over HTTPS, or that our AMQP messages use TLS, and someone on the team knows that managing certificates is their job. Someone has set a reminder in the company calendar two weeks before the certificate expires, so we aren’t blacked out and unable to communicate in a year, when everyone has forgotten all about it, and encrypted service communication is what we’re used to. These early investments in security culture translate to hundreds of hours of time saved, after the company has grown and is about to land a strategic government client, but now it’s time to meet some compliance standard and we have to refactor everything to communicate securely.
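The calendar reminder can also be backed by a small automated check. What follows is only a minimal sketch in Python, assuming we already have the certificate's `notAfter` string in hand; the function names and the 14-day window constant are illustrative, not part of any particular tool.

```python
import ssl
from datetime import datetime, timedelta, timezone

# "Two weeks before the certificate expires" (the window is an assumption we can tune).
REMINDER_WINDOW = timedelta(days=14)


def cert_expiry(not_after: str) -> datetime:
    """Parse a certificate 'notAfter' string such as 'Jun 15 12:00:00 2030 GMT'."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def needs_rotation(not_after: str, now: datetime,
                   window: timedelta = REMINDER_WINDOW) -> bool:
    """True once we are inside the reminder window, or already past expiry."""
    return cert_expiry(not_after) - now <= window
```

In Python, the `notAfter` value is one of the fields in the dictionary returned by `ssl.SSLSocket.getpeercert()`, so a scheduled job could run a check like this against each internal endpoint and open a ticket for whoever owns certificate rotation.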
There’s a saying in security: “You can’t protect what you can’t see”. Visibility is absolutely crucial to ensuring security. We must know what the inventory on the network is, what our network topology is, and we must know what it should be. If we don’t know that one of our services, which is dutifully providing value to a customer, shouldn’t be accessible from the internet, it doesn’t matter if we see that it is. This potentially malicious traffic will fly under the radar. If we aren’t aware that none of our services are supposed to be communicating on a specific port, then we may be looking at traffic due to a malware infection, but unable to diagnose it. So, how do we get useful visibility into our cloud network?
First, we have to create a useful and accurate map of our network. What ports should be open to the world? How does traffic flow from our load balancers to our backend services? Via what protocols? The network map must be an objective source of truth. It cannot be based on what IT believes to be true about the network. Generally, a good way to start is by looking at the outputs of our network traffic and charting their flow.
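To make the idea of a map as a source of truth concrete, here is a rough sketch in Python that compares what the map says should be public with the ingress rules we actually find. The `IngressRule` shape, the resource names and the `EXPECTED_PUBLIC_PORTS` set are all hypothetical; in practice the rules would be exported from the cloud provider's API or from our infrastructure templates.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical source of truth: the only ports we intend to expose to the internet.
EXPECTED_PUBLIC_PORTS = frozenset({443})


@dataclass(frozen=True)
class IngressRule:
    resource: str     # illustrative name, e.g. a security group or load balancer
    port: int
    source_cidr: str  # who is allowed to connect


def is_world_open(cidr: str) -> bool:
    """A rule is open to the world when its source range covers every address."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def unexpected_exposure(rules, expected_ports=EXPECTED_PUBLIC_PORTS):
    """Return the rules that expose a port the map says should stay private."""
    return [
        rule for rule in rules
        if is_world_open(rule.source_cidr) and rule.port not in expected_ports
    ]
```

Anything this returns is either a gap in the map or a gap in the network, and both are worth knowing about.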
Once we have our map, we need to see what traffic is actually flowing in our network. We can set network taps and traps at strategic locations, like our load balancers and backend services, or use tools provided by the cloud provider, like VPC Traffic Mirroring in AWS. This is probably a large amount of data, and we need to usefully aggregate it in a way that provides us with insight. Nobody wants to read traffic captures for hours looking for unusual protocols or other anomalies.
Sending our traffic to quality tools is as important as capturing the traffic in the first place. Once we know what our traffic should look like, and then capture our traffic, we can finally start to get useful security insights. We can create beautiful cloud network security dashboards, which will show us anomalous behavior at a glance, and define useful alerts for critical issues. What will you discover? From experience, there’s always something. Maybe a metrics system is exposed to the world on an unexpected port? Is there a spike in database traffic at 2AM every day? Why?
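As a toy illustration of the kind of aggregation a dashboard or alert would sit on top of, the Python sketch below takes simplified flow records and answers the two questions above: which ports are seeing traffic that the map doesn't expect, and which hours carry unusually high volume. The `(hour, port, bytes)` record shape and the threshold of three times the median are assumptions made for the example; real flow logs carry far more fields, and a real baseline would be more careful.

```python
from collections import Counter
from statistics import median


def bytes_per_hour(flows):
    """Sum traffic volume per hour of day.

    flows: list of (hour_of_day, dst_port, byte_count) tuples (simplified records).
    """
    totals = Counter()
    for hour, _port, size in flows:
        totals[hour] += size
    return totals


def unexpected_ports(flows, expected_ports):
    """Destination ports that carry traffic but are missing from our map."""
    seen = {port for _hour, port, _size in flows}
    return sorted(seen - set(expected_ports))


def spike_hours(totals, factor=3.0):
    """Hours whose volume exceeds `factor` times the median hourly volume."""
    if not totals:
        return []
    baseline = median(totals.values())
    return sorted(hour for hour, size in totals.items() if size > factor * baseline)
```

A result such as port 9090 showing up unexpectedly, or hour 2 standing out from the baseline, doesn't prove anything is wrong, but it tells us exactly where to start asking why.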
This one is always surprising whenever it comes up for discussion. Let’s imagine we’ve found a security issue. There’s some misconfiguration in our firewall, which is allowing traffic from the entire internet into our system. However, when installing the product for a client, we made use of this availability to provide the customer with service. Fixing the breach means a service action on the customer’s side to correctly route their traffic, and updating our firewall rules. This means a service disruption for the customer, and they might understand that there was some security issue with their integration.
Faced with this dilemma, some companies might choose to do nothing, deciding that as long as the system is working, they can accept a weakened security posture. Enforcing security can mean disrupting service and uncomfortable explanations to our clients, and sometimes, it might just be a pain in the neck. There’s always pushback to security actions from some sector. The cloud has gotten us used to immediate gratification, infrastructure that “just works”, and does so right away. Some of us still recoil from overmanaged operations, the bad old days of physical change management forms looming in the back of our minds.
As always, this is another question of culture and balance. Professionalism allows us to prioritize cloud network security incidents, and having clear processes for handling them in place helps set expectations. Discovering a security incident should trigger a short investigation to ascertain the dimensions of the issue, a short meeting of relevant stakeholders to decide what the priority is, and from there, the next steps should be a matter of protocol. Maybe the risk is low, and we don’t need to immediately disrupt service during peak hours, but we do have a plan to fix the problem during the next low tide, or over the weekend. Companies who invest in every other step, but do not enforce security when issues arise, may as well save their money.
There’s no need to spend thousands on detection if you don’t plan on remediation. The cost of a breach is often far more than the temporary difficulty of resolving security incidents. It’s much harder to explain to a customer why their data was leaked than it is to explain that they’ll need to suffer a short service interruption.
The point of this article is to spur thought and discussion. I’m sure that everyone reading this from the operations and dev realms can relate to some parts of the article. Maybe you’re in the process of implementing some of these ideas, or you’re shaking your head while trying to mentally map your production environment. Most teams at most companies will find themselves somewhere along the scale, in the process of grappling with these issues, and anyone who tells you they have perfectly managed environments with perfect security controls is 100% lying, or completely clueless. Implementing proper infrastructure management and security is a constant process, and making the best effort is far better than doing nothing at all. Define where your first steps should be, using your existing resources, and watch as your company starts to reap the financial and operational rewards.
Author: Tomer Hoter