The first time I created a cloud compute instance, then called a “Cloud VM”, was an almost transcendent moment. It was like magic. I was at the first organization I’d worked for that had adopted the cloud, in my first DevOps position, and I immediately knew that the world had changed. If you’ve ever seen a “tech evangelist” gushing about the cloud, and you weren’t there and didn’t experience this transformation, you may not really understand what everyone is still so excited about. Managing data center infrastructure used to be so hard. If you ran a single-tenant infrastructure, getting a new client up and running could take weeks. You had to have an entire team managing the purchasing of new servers, their delivery, and installation. The constant replacement of parts as disks failed and servers went dark was draining. Most companies had to “overprovision”, keeping servers online but idle, ready for a traffic peak or a new client. There were so many downsides, but it was all anyone could do.
The Challenge with Cloud Resources
Cloud resources quickly changed it all, and the rewards for those who migrated were immense. Spinning up a new datacenter could be a matter of minutes if you had a good DevOps team. New customer in the EU who needs fast response times? No problem! A few clicks later and a new EU-based environment is available. Before the cloud, setting up a new datacenter in the EU was a multi-year process. If you’re paying attention, you may already have a good idea of what the problem is, the black pit yawning open next to this highway of progress. We went, as a profession, from managing static, physical infrastructure, easily itemized and jealously husbanded, to “managing” virtual infrastructure around the world which we could create almost as an afterthought. Companies just starting out with a dream, no DevOps, and two developers, could have a hundred servers in the air within a day. The first bill doesn’t arrive for another month! No naming convention, no oversight, nothing required.
Sometimes called “cloud sprawl”, this situation has drastically worsened over time. At first, the cloud was an easy way to create managed VMs and their networks. Over time, the cloud offering has grown exponentially. Nearly every service necessary for the modern enterprise is offered as a managed service and billed differently. Some services bill by network usage, some by CPU usage, some by data egress (but data ingress is free!). Itemized cloud bills easily run into the thousands of lines. According to Flexera’s (formerly Rightscale) 2020 State of the Cloud Report, executives estimate that 30% of their cloud spending is wasted. On average, organizations spend 23% more than they budget for cloud spending, while forecasting that their cloud spending will only grow year over year. 93% of enterprise respondents were using a multi-cloud architecture, further complicating matters. Setting aside the operational cost of cloud sprawl, there’s another reason to properly manage cloud infrastructure. The security implications of unmanaged infrastructure are severe. Infrastructure we’re not aware of is, by definition, not monitored. If it suddenly starts to behave strangely, will we even notice? If a cloud instance the organization no longer needs is running a malicious cryptocurrency miner, with sixteen cores screaming at 100% utilization, will we believe it to be a business workload?
So, the case is made. The cloud is a wonderful thing, an incredible business enabler, but, as a profession, we’re not the best at managing it, and this has security and operational implications. Luckily, all is not lost, and we don’t need to rush to spend more money resolving the problem. Every major cloud provider provides free tools to help us manage our cloud infrastructure, and making strategic efforts to implement management strategies can have a massive payoff, both in our operational spend and in our security posture. Most articles on this topic try to provide a list of “feel-good” tasks, like terminating unused compute instances (how novel!). While this is certainly a necessary task, it takes a “tactical” view of the problem. If we focus on specific culprits and clean them up in a one-time effort, we’ve missed the forest for the trees. You might have heard DevOps referred to as a continuous feedback loop. DevOps engineers plan, build, integrate and deploy, monitor, and based on feedback, go back to planning.
Solving the Challenge
We need to apply this same approach when we consider cloud management, which helpfully often falls within the purview of DevOps teams. Strategic cloud management is a feedback loop of planning, tracking, optimizing, and improving. It’s never too late. None of the organizations I’ve worked at started out with a plan for cloud infrastructure. We have a product that needs some amount of computing resources, network resources, and data resources. We’ll probably pick the cloud services that, in the early days of the company, were the easiest for us to understand, not necessarily the ones best suited to our workloads. So, we build what we need, and years later, when spending is out of control and our cloud is broadly overprovisioned, we have to organize. So, let’s apply the “DevOps” model to cloud infrastructure:
- Now that we have an existing deployment, we have to try to Plan our cloud infrastructure.
- Next, we need to Implement our plan!
- We need to effectively Monitor our cloud deployment. Visibility is the key to both security and effective operations.
- Finally, we need to Improve, based on the data we’ve collected and new cloud offerings. Once we’ve iterated, the cycle starts anew.
Let’s discuss how to implement each one of these steps. Keep in mind that “strategic” usually means “will take a while”. This process will take several months at least for the first iteration.
Step 0 – Homework!
Like any good general, before we can begin applying strategy, we need to have the best picture possible of reality, and we need to do everything we can to maintain this picture over time. The wonderful thing about reality is that it’s objective, and can be conclusively determined, especially when it comes to our existing cloud infrastructure. We’ve decided that we need to get our cloud sprawl under control, and we’ve tasked the DevOps team, the IT team, or some other stakeholder (maybe the CTO!) with this effort. The first step is assigning this responsibility. Once we’ve done that, we need to determine reality. This is often a truly difficult and complex step, but it’s of critical importance. Before we can do anything else, we need to know what our existing cloud infrastructure is, and ensure that all of it, without exception, is tagged, tracked, and monitored. Create a model for tagging based on tagging best practices, and apply it. If you have your own monitoring solution, make sure it’s installed on everything. If you have unmonitored resources, enable your cloud provider’s monitoring solutions. We have to know how much of our resources we’re utilizing, to determine where our opportunities are. It’s very difficult to do this perfectly, but if we’ve gone from 0% managed to 70% managed, we’re doing an amazing job. While we do this, it’s a great time to audit our cloud network and make sure that what we’ve provisioned in the cloud is correctly serving our needs, and not letting prying eyes into our production deployments. Auditing network traffic can also help us discover services deployed in the cloud we might otherwise miss!
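To make the “0% managed to 70% managed” idea concrete, here is a minimal sketch of a tagging audit in Python. Everything in it is an assumption made for illustration: the `Resource` class, the three required tag keys, and the sample inventory are hypothetical, and in practice the inventory would come from your cloud provider’s own inventory or tagging tools. It treats a resource as “managed” only when it is monitored and carries every required tag.

```python
from dataclasses import dataclass, field

# Hypothetical tagging policy; substitute your organization's own convention.
REQUIRED_TAGS = ("owner", "environment", "cost-center")


@dataclass
class Resource:
    """A simplified stand-in for one cloud resource in an exported inventory."""
    resource_id: str
    tags: dict = field(default_factory=dict)
    monitored: bool = False


def missing_tags(resource, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or blank on a resource."""
    return [key for key in required if not resource.tags.get(key, "").strip()]


def is_managed(resource, required=REQUIRED_TAGS):
    """A resource counts as managed when it is monitored and fully tagged."""
    return resource.monitored and not missing_tags(resource, required)


def managed_fraction(resources, required=REQUIRED_TAGS):
    """Fraction of the inventory that is managed (0.0 for an empty inventory)."""
    if not resources:
        return 0.0
    return sum(is_managed(r, required) for r in resources) / len(resources)


# Made-up sample inventory, for illustration only.
inventory = [
    Resource("vm-1", {"owner": "platform", "environment": "prod",
                      "cost-center": "cc-101"}, monitored=True),
    Resource("vm-2", {"owner": "data"}, monitored=True),
    Resource("db-1", {}, monitored=False),
    Resource("vm-3", {"owner": "web", "environment": "staging",
                      "cost-center": "cc-102"}, monitored=True),
]

for resource in inventory:
    gaps = missing_tags(resource)
    if gaps or not resource.monitored:
        print(resource.resource_id, "missing tags:", gaps,
              "monitored:", resource.monitored)

print(f"managed: {managed_fraction(inventory):.0%}")
```

Running a report like this on a schedule, rather than once, is what keeps the picture of reality from decaying.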
During this process, it’s likely that we will discover unused resources, security loopholes, and unmonitored legacy resources. We want to handle all of these issues, but it’s important not to get bogged down here. If we go into a multi-month cleanup process before we’ve implemented proper tagging, tracking, and monitoring, our picture of our cloud infrastructure will quickly decay, and we haven’t even gotten to step 1 yet! So, our team is exhausted and bloody, but we’re already in a stance thousands of times better than we were before we started, and we’re ready to improve our security and operations.
Step 1 – Plan
It’s time to let our expertise shine! We’ve tagged and monitored our cloud, and now we have a lot of data. We can apply our knowledge of cloud services and capabilities to this data, to extract improved operations. Is our current deployment serving our business need exactly? Our security and confidentiality requirements? How about business continuity? Compliance? Utilization? We need to move, workload by workload, through our cloud deployment, and tailor our solution to our needs. This is the time to sit down with other stakeholders in the company. Is the database our developers selected in the early days of the company still the best fit for our product? Can we migrate to an offering that will give us better performance for fewer resources? Does our company roadmap have an upcoming compliance requirement that our current services don’t meet? Did a security audit recently turn up troubling network configuration? The cloud serves the company, and many places in the company can have helpful input when we plan our cloud resources.
Cloud providers usually provide us with two charging models for cloud services: allocation-based and consumption-based. Allocation-based models are what most of us are familiar with. We provision a certain amount of cloud resources, and those resources are statically always available to us, whether they’re in use or not. Their cost is also static, whether they’re in use or not. This model is best suited to very stable applications, not coupled to consumption or prone to spikes in usage or traffic. In most SaaS companies, this is often not the best solution for our cloud services! Consumption-based services are not “pre-provisioned”, but generally provide some baseline of service, which can expand and contract based on user-defined metrics, and are charged based on usage. This is often the best solution for many workloads, especially in SaaS companies, where utilization can drop close to 0% during the user’s off-hours. If your organization lacks the expertise to choose the best services for you, this may be a great place to consult with a professional cloud architect. If we’ve done our jobs correctly, we can even try to create a cost forecast! It will probably be wildly incorrect, but that’s why this is an iterative process.
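The trade-off between the two charging models can be sketched with some simple arithmetic. The prices, the “capacity unit”, and the daily usage profile below are all made-up assumptions, not any provider’s real pricing; the point is only to show how a workload that idles during off-hours compares under each model, and at what average utilization the allocated option starts to win.

```python
HOURS_PER_MONTH = 720  # 30 days of 24 hours; a simplifying assumption


def allocation_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Allocation-based: a flat hourly rate, paid whether or not it is used."""
    return hourly_rate * hours


def consumption_cost(rate_per_unit_hour, hourly_units):
    """Consumption-based: pay per capacity unit actually used in each hour."""
    return rate_per_unit_hour * sum(hourly_units)


def break_even_utilization(hourly_rate, allocated_units, rate_per_unit_hour):
    """Average utilization of the allocated capacity above which the
    allocation-based option becomes the cheaper of the two."""
    return hourly_rate / (allocated_units * rate_per_unit_hour)


# Hypothetical prices: 8 allocated units for $0.40/hour,
# versus $0.08 per unit-hour on a consumption-based service.
ALLOCATED_RATE = 0.40
ALLOCATED_UNITS = 8
CONSUMPTION_RATE = 0.08

# Hypothetical SaaS usage: 8 units for 10 business hours, 1 unit otherwise.
daily_profile = [8] * 10 + [1] * 14
monthly_units = daily_profile * 30

allocated = allocation_cost(ALLOCATED_RATE)
consumed = consumption_cost(CONSUMPTION_RATE, monthly_units)
threshold = break_even_utilization(ALLOCATED_RATE, ALLOCATED_UNITS,
                                   CONSUMPTION_RATE)

print(f"allocation-based:  ${allocated:.2f}/month")
print(f"consumption-based: ${consumed:.2f}/month")
print(f"allocation wins above {threshold:.1%} average utilization")
```

Feeding real utilization data from Step 0 into a comparison like this, workload by workload, is one way to arrive at a first cost forecast.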
Step 2 – Implement
It’s time to implement our plan. This part is where the excitement starts to show and we reap the rewards of weeks of hard work. Reporting to the entire company that we downsized several hundred instances and saved hundreds of thousands of dollars a year looks and feels good. This can also happen a lot faster than you might expect. The same nature of the cloud that allowed us to create wasteful and unmanaged resources so quickly allows us to streamline with the same speed. Resizing instances to match utilization and scaling clusters up and down are usually processes built into the cloud that don’t require downtime. A few clicks in the cloud dashboard, or an updated Terraform template, and we’ve improved our operational stance immensely. It’s also impossible to overstate the improvement to the organization’s security posture after undertaking a process like this.
First, the security risk of unmanaged and unmonitored infrastructure is greatly reduced. Those same resources which provided a sneaky foothold into our production networks are gone, and the resources we want are now visible to us. It’s not for nothing that good operations and good security go hand in hand: visibility and management are the keys to both. Next, we’ve implemented processes that make it harder for unmanaged infrastructure to crop up again. We’ve audited our network, and tagged and grouped our security rules. New security rules which don’t meet our new conventions will be immediately suspect. We can now identify suspicious network traffic that doesn’t match what we expect, because we know what to expect. We can identify a workload behaving suspiciously (remember the cryptominer?) because we know what it’s provisioned for and what it’s supposed to be running. Though a less tangible benefit for the company than the operational savings, improved security is still an obvious plus. In addition, as opposed to implementing restrictive security tools that hamper productivity and cause user pain, properly managed infrastructure provides us with a strong security benefit for less cost and less pain. Unless, of course, we’re considering the headache for the DevOps team, but, from experience, that’s a cost most businesses are willing to pay.
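As an illustration of how “rules that don’t meet our conventions are immediately suspect” might be automated, here is a small Python sketch. The `Rule` class, the required tag keys, and the set of ports allowed to face the internet are hypothetical conventions invented for this example; a real check would read rules exported from your provider and apply your own policy.

```python
import ipaddress
from dataclasses import dataclass, field

# Hypothetical conventions, for illustration only.
REQUIRED_RULE_TAGS = ("owner", "purpose")
PUBLIC_PORTS = {80, 443}  # ports we have decided may face the internet


@dataclass
class Rule:
    """A simplified stand-in for one inbound firewall or security-group rule."""
    rule_id: str
    cidr: str
    port: int
    tags: dict = field(default_factory=dict)


def findings(rule):
    """Return a list of reasons this rule breaks our conventions (empty if none)."""
    problems = []
    missing = [key for key in REQUIRED_RULE_TAGS if not rule.tags.get(key)]
    if missing:
        problems.append("missing tags: " + ", ".join(missing))
    # A prefix length of 0 (0.0.0.0/0 or ::/0) matches every address.
    network = ipaddress.ip_network(rule.cidr, strict=False)
    if network.prefixlen == 0 and rule.port not in PUBLIC_PORTS:
        problems.append(f"port {rule.port} open to the whole internet")
    return problems


# Made-up sample rules.
rules = [
    Rule("rule-1", "0.0.0.0/0", 443, {"owner": "web", "purpose": "https"}),
    Rule("rule-2", "0.0.0.0/0", 22),
    Rule("rule-3", "10.0.0.0/16", 5432, {"owner": "data"}),
]

for rule in rules:
    for problem in findings(rule):
        print(f"{rule.rule_id}: {problem}")
```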
Step 3 – Monitor
Now that we’ve deployed our new cloud infrastructure, and everyone’s gotten a bonus and a pat on the back for their hard work, it’s time to make sure what we’ve done actually works. All that monitoring should be aggregated into useful dashboards which can hopefully tell us at a glance if we’re meeting our resource provisioning goals, if we’re still overprovisioned, or if some of our workloads are starved for more resources. Are we handling the growth of the company well? As the company grows, more users mean more resources need to come online. If this process is happening automatically, we’re doing a great job! If not, maybe there’s an opportunity to shift more workloads into consumption-based resources.
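The “at a glance” question of whether a workload is overprovisioned, starved, or about right can be sketched as a simple classification over utilization samples. The thresholds and workload names below are arbitrary assumptions chosen for the example, and a real dashboard would consider memory, disk, and network alongside CPU.

```python
from statistics import mean

# Hypothetical thresholds; tune them to your own provisioning goals.
LOW_PEAK = 0.30      # even the busiest sample stays under 30%
HIGH_AVERAGE = 0.85  # the workload averages above 85%


def classify(samples, low=LOW_PEAK, high=HIGH_AVERAGE):
    """Classify a workload from CPU utilization samples (fractions in [0, 1])."""
    if not samples:
        raise ValueError("no utilization samples: is this workload monitored?")
    if max(samples) < low:
        return "overprovisioned"
    if mean(samples) > high:
        return "starved"
    return "ok"


# Made-up utilization samples per workload.
workloads = {
    "batch-reports": [0.05, 0.10, 0.20],
    "api-gateway": [0.20, 0.60, 0.90],
    "search-index": [0.90, 0.95, 0.88],
}

for name, samples in workloads.items():
    print(f"{name}: {classify(samples)}")
```

Raising an error for a workload with no samples is a deliberate choice here: a missing metric is a monitoring gap to fix, not a workload to treat as idle.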
Monitoring is often treated as a process for finding faults exclusively, but if we have good control of our cloud resources, monitoring is a tool for identifying opportunity. Monitoring is also a process that requires maintenance and constant work. As the company develops new features and services, new dashboards need to come online. There will always be some drift between reality and what’s actually visible, and one of our constant struggles must be the continuous effort to improve our monitoring and close this gap. These aren’t new ideas. One of the most important and formative articles I read as a larval, 22-year-old engineer, way back in 2011, is “Measure Anything, Measure Everything”, from “Code as Craft”, Etsy’s engineering blog. One of the article’s conclusions is: “tracking everything is key to moving fast, but the only way to do it is to make tracking anything easy”. One of the opportunities we need to identify is how to make monitoring easy, so this part of our process improves with us, and we don’t lose sight of reality.
Step 4 – Improve
So, we’ve come to the end of our first cycle. If we were super effective and there were no surprises and major upheavals along the way, it’s been a few months since we started with Step 0. Most likely, it’s been more than six months. We’ve saved the company immense operational costs, improved our security posture, and brought our sprawl under control. If we stop here, go take a well-deserved nap, and neglect our cloud infrastructure again, we’ll be back at Step 0 in a matter of months. It’s time to go back to Step 1, armed with experience and data, and plan once again. How can we optimize our new deployment even further? What tools have the cloud provider released since we started the process, and are they relevant to us? Did we correctly assess the needs of all our workloads? Have we scaled effectively with the company? How off was our cost prediction and why?
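Answering “how off was our cost prediction and why?” starts with comparing the forecast to the bill, service by service. The sketch below assumes both are available as simple mappings from service name to monthly cost; the service names and figures are invented for illustration.

```python
def forecast_errors(forecast, actual):
    """Signed relative error per forecast service: (actual - forecast) / forecast.

    Positive values mean we spent more than we predicted.
    """
    errors = {}
    for service, predicted in forecast.items():
        if predicted <= 0:
            raise ValueError(f"forecast for {service!r} must be positive")
        errors[service] = (actual.get(service, 0.0) - predicted) / predicted
    return errors


def unforecast_services(forecast, actual):
    """Services that appear on the bill but were never in the forecast."""
    return sorted(set(actual) - set(forecast))


# Made-up monthly figures, in dollars.
forecast = {"compute": 1000.0, "storage": 200.0}
actual = {"compute": 1230.0, "storage": 150.0, "egress": 80.0}

for service, error in forecast_errors(forecast, actual).items():
    print(f"{service}: {error:+.0%} versus forecast")

print("not in forecast:", unforecast_services(forecast, actual))
```

The services that show up on the bill without ever appearing in the forecast are often the most useful input to the next planning round.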
The main conclusion of this article is that, when dealing with something as core to our profession and business as our production infrastructure, our thinking needs to be as agile, iterative, and strategic as any other part of our organization. The products we create are based on long-term strategic roadmaps, fed by customer feedback, and maintained by teams of professionals. Why should our living, breathing, cloud infrastructure be any different? This is one of the leaps in thinking we need to make. We’ve left behind static, managed infrastructure for the cloud and DevOps methodologies, but we didn’t apply the same agility to our infrastructure we did to our code. It’s time to close that gap.