Tracing vs. Logging: What You Need To Know

Log tracking, trace log, or logging traces…

Although these three terms are easy to interchange (the wordplay certainly doesn’t help!), compare tracing with logging and you’ll find they are quite distinct. Logs, traces, and metrics are the three pillars of observability, and they all work together to measure application performance effectively.

Let’s first understand what logging is.

What is logging?

Logging is the most basic form of application monitoring and is the first line of defense to identify incidents or bugs. It involves recording timestamped events emitted by different applications or services as they occur. Since logs can get pretty complex (and massive) in distributed systems with many services, we typically use log levels to filter out important information from these logs. The most common levels, in decreasing order of severity, are FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. The amount of data logged at each level also varies based on how critical it is to store that information for troubleshooting and auditing applications.
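To see how levels act as a filter in practice, here’s a minimal sketch using Python’s standard logging module (the logger name and messages are purely illustrative):

```python
import logging

# Record only WARNING and above; DEBUG and INFO entries are filtered out
# before they ever reach a handler.
logging.basicConfig(
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
    level=logging.WARNING,
)
logger = logging.getLogger("checkout-service")  # illustrative service name

logger.debug("cart recalculated")                    # dropped at this level
logger.info("user 42 started checkout")              # dropped at this level
logger.warning("payment provider latency is high")   # recorded
logger.error("payment provider returned 503")        # recorded
```

Raising or lowering the level trades detail for log volume, which is exactly the knob you tune between troubleshooting and routine auditing.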

Most logs are highly detailed, with information relevant to a particular microservice, function, or application. You’ll need to collate and analyze multiple log entries to understand how the application functions normally. And since logs are often unstructured, reading them from a text file on your server is not the best idea.

But we’ve come far with how we handle log data. You can easily link your logs from any source and in any language to Coralogix’s log monitoring platform. With our advanced data visualization tools and clustering capabilities, we can help you proactively identify unusual system behavior and trigger real-time alerts for effective investigation.

Now that you understand what logging is, let’s look at what tracing is and why it’s essential for distributed systems.

What is tracing?

In modern distributed software architectures, you have dozens, if not hundreds, of applications calling each other. Although analyzing logs can help you understand how individual applications perform, it does not track how they interact with each other. And often, especially in microservices, that’s where the problem lies.

For instance, in the case of an authentication service, the trigger is typically a user interaction — such as trying to access data with restricted access levels. The problem can be in the authentication protocol, the backend server that hosts the data, or how the server sends data to the front end.

Thus, seeing how the services connect and how your request flows through the entire architecture is essential. That provides context to the problem. Once the problematic application is identified, the appropriate team can be alerted for a faster resolution.

This is where tracing comes in — an essential subset of observability. A trace follows a request from start to finish, showing how your data moves through the entire system. It can record which services the request interacted with and each service’s latency. With this data, you can chain events together to analyze any deviations from normal application behavior. Once the anomaly is pinpointed, you can link log data from the events you’ve identified, the duration of the event, and the specific function calls that caused the event — thereby identifying the root cause of the error within a few attempts.
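To make that concrete, here’s a minimal sketch using the OpenTelemetry SDK for Python (assuming it is installed; the service and span names are illustrative, and a real deployment would export spans to a collector or backend rather than the console):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to the console so we can see what a trace records.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("auth-service")  # illustrative name

# Nested spans record which operations ran and how long each one took;
# they all share one trace ID, so the request can be followed end to end.
with tracer.start_as_current_span("handle_login_request"):
    with tracer.start_as_current_span("check_credentials"):
        pass  # call the authentication backend here
    with tracer.start_as_current_span("fetch_profile"):
        pass  # call the data service here
```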

Okay, so now that we understand the basics of what tracing is, let’s look at when you should use tracing vs. logging.

When should you use tracing vs. logging?

Let’s understand this with an example. Imagine you’ve joined the end-to-end testing team of an e-commerce company. Customers complain about intermittent slowness while purchasing shoes. To resolve this, you must identify which application is triggering the issue — is it the payment module? Is it the billing service? Or is it how the billing service interacts with the fulfillment service?

You require both logging and tracing to understand the root cause of the issue. Logs help you identify the issue, while a trace helps you attribute it to specific applications. 

An end-to-end monitoring workflow would look like this: use a log management platform like Coralogix to get alerts if any of your performance metrics degrade. You can then send a trace that emulates your customer behavior from start to finish.

In our e-commerce example, the trace would add a product to the cart, click checkout, add a shipping address, and so on. While doing each step, it would record the time it took for each service to respond to the request. And then, with the trace, you can pinpoint which service is failing and then go back to the logs to find any errors.
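A bare-bones synthetic check along those lines might look like the sketch below, with hypothetical endpoints standing in for the real services; a production version would propagate trace context headers rather than just timing HTTP calls:

```python
import time
import requests  # third-party; pip install requests

BASE = "https://shop.example.com"  # hypothetical storefront
STEPS = [
    ("add_to_cart", f"{BASE}/cart/add?sku=SHOE-42"),
    ("checkout",    f"{BASE}/checkout"),
    ("shipping",    f"{BASE}/checkout/shipping"),
]

for name, url in STEPS:
    start = time.perf_counter()
    response = requests.get(url, timeout=10)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # A slow or failing step points at the service to investigate in the logs.
    print(f"{name}: HTTP {response.status_code} in {elapsed_ms:.0f} ms")
```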

Logging is essential for application monitoring and should always be enabled. In contrast, tracing continuously means that you’d be bogging down the system with unnecessary requests, which can cause performance issues. It’s better to send sample requests if the logs show behavior anomalies.

So, to sum up, if you have to choose between tracing and logging for daily monitoring, logging should be your go-to! And conversely, if you need to debug a defect, you can rely on tracing to get to the root cause faster.

Tracing vs. Logging: Which one to choose?

Although distributed architectures are great for scale, they introduce additional complexity and require heavy monitoring to provide a seamless user experience. Therefore, we wouldn’t recommend choosing between tracing and logging — instead, your microservice observability strategy should have room for both. While logging is like a toolbox you need daily, tracing is the handy drill that helps you dig into the issues you need to fix.

How do Observability and Security Work Together?

There’s no question that the last 18 months have seen a pronounced increase in the sophistication of cyber threats. Global events are accelerating the development of ransomware and wiperware, leaving many enterprise security systems struggling to keep up. This is where enterprise network monitoring comes in.

Here at Coralogix, we’re passionate about data observability and security and what the former can do for the latter. We’ve previously outlined key cyber threat trends such as trojans/supply chain threats, ransomware, the hybrid cloud attack vector, insider threats, and more. 

This article will revisit some of those threats and highlight new ones while showing why observability and security should be considered codependent. 

Firewall Observability

Firewalls are a critical part of any network’s security. They can give some of the most helpful information regarding your system’s security. A firewall is different from an intrusion detection system (which we discuss below) – you can think of a firewall as your front door and the intrusion detection system as the internal motion sensors. 

Firewalls are typically configured based on a series of user-defined or pre-configured rules to block unauthorized network traffic.

Layer 3 vs. Layer 7 Firewalls

Two types of firewalls are common in the market today: Layer 3 and Layer 7. Layer 3 firewalls typically block specific IP addresses, either from a vendor-supplied list that is automatically updated for the user or a custom-made allow/deny list. A mixture of the two is also typical, allowing customers to benefit from global intelligence on malicious IP addresses while being able to block specific addresses that have previously attempted DDoS attacks, for example. 

Layer 7 firewalls are more advanced. They can analyze data entering and leaving your network at a packet level and filter the contents of those packets. Historically, this capability was used to filter malware signatures, preventing malicious actors from disrupting or encrypting a system. Today, more organizations are using layer 7 firewalls to control data egress as well as ingress. This is particularly useful in protecting against data breaches, insider threats, and ransomware, when data may be leaving your network.

Given that it’s best practice to have a layer 3 and a layer 7 firewall, and the amount of data generated by the latter, having an observability platform like Coralogix to collate and contextualize this data is critical.

Just a piece of the puzzle

Given that a firewall is just one tool in a security team’s arsenal, it’s essential to be able to correlate events at a firewall level with other system events, such as database failures, malware detection, or data egress. Fortunately, Coralogix ingests firewall logs and metrics using either Logstash or its own syslog agent, which means that it can work with a wide variety of firewalls. Additionally, Coralogix’s advanced log parsing and visualization technologies allow security teams to overlay firewall events with other security metrics simply. Coralogix also provides some bespoke integrations to a number of the most popular firewalls. 

Firewall data in isolation isn’t that helpful. It can tell you what malicious traffic you’ve successfully blocked, but not what you’ve missed. That’s why adding context from other security tools is vital.

Intrusion Detection Systems and Observability

As mentioned above, if firewalls are the first defense, then intrusion detection systems are next in line. Intrusion detection is key because it can tell you the nature of the threat that’s breached your system and highlight what your firewall might have missed. Remember, a firewall will only be able to tell you what didn’t get in or what was let in. 

Adding an intrusion detection system allows you to assess and neutralize threats that bypass other network security controls. Some intrusion detection systems pull data from OWASP to hunt for the most common malware and vulnerabilities, while others use crowdsourced data. 

By layering intrusion detection data, like that from Suricata, your SRE or security team will be able to detect attacks and identify the point of entry. Such context is vital in reengineering cyber defenses after an attack.

Kubernetes Observability and Security 

55% of Kubernetes deployments are slowed down due to security concerns, says a recent Red Hat survey. The same study says that 93% of respondents experienced some sort of security incident in a Kubernetes environment over the last year. 

Those two statistics tell you everything you need to know. Kubernetes security is important. Monitoring Kubernetes is vital to maintaining cluster security, as we will explore below.

Pod Configuration Security

By default, Kubernetes applies no network security rules, which permits pods to communicate freely with each other. Pod security is therefore heavily defined by role-based access control (RBAC). It’s possible to monitor the security permissions assigned to a given user to ensure there isn’t over-provisioning of access.
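A common way to tighten that default is a deny-all network policy, so that every permitted flow has to be whitelisted explicitly. Here’s a rough sketch using the Kubernetes Python client (the namespace and policy name are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in a pod

# An empty pod selector matches every pod in the namespace; declaring both
# policy types with no allow rules blocks all ingress and egress traffic.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="default-deny-all"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),
        policy_types=["Ingress", "Egress"],
    ),
)
client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="production", body=policy
)
```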

Malicious Code

A common attack vector to a Kubernetes cluster is via the containerized application itself. By monitoring host-level and IP requests, you can limit your vulnerability to DDoS attacks, which would otherwise take the cluster offline. Using Prometheus as an operational, enterprise network monitoring tool is a good way of picking up vital metrics from containerized environments.

Runtime Monitoring

A container’s runtime metrics will give you a good idea of whether it’s also running a secondary, malicious process. Runtime metrics to look out for include network connections, endpoints, and audit logs. By monitoring these metrics and using an ML-powered log analyzer, such as Loggregation, you can spot any anomalies which may indicate malicious activity.
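As a simple illustration of the kind of signal to watch, the sketch below flags unexpected listening sockets, assuming the third-party psutil package is available (the expected port list is illustrative):

```python
import psutil  # third-party; pip install psutil

EXPECTED_PORTS = {8080}  # ports this workload is expected to listen on

# Any listener outside the expected set may indicate a secondary,
# malicious process running alongside the application.
for conn in psutil.net_connections(kind="inet"):
    if conn.status == psutil.CONN_LISTEN and conn.laddr.port not in EXPECTED_PORTS:
        print(f"unexpected listener on port {conn.laddr.port} (pid {conn.pid})")
```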

Monitoring for protection

With Kubernetes, several off-the-shelf security products may aid a more secure deployment. However, as you can see above, there is no substitute for effective monitoring for Kubernetes security.

Network Traffic Observability and Security

It should be abundantly clear why an effective observability strategy for your network traffic is critical. On top of the fundamentals discussed so far, Coralogix has many bespoke integrations designed to assist your network security and observability. 

Zeek

Zeek is an open-source network monitoring tool designed to enhance security through open-source and community participation. You can ship Zeek logs to Coralogix via Filebeat so that every time Zeek performs a scan, results are pushed to a single dashboard overlaid with other network metrics.

Cloudflare

Organizations around the world use Cloudflare for DDoS mitigation and other network security protection. However, your network is only as secure as the tools you use to secure it. Using the Coralogix audit log integration for Cloudflare, you can ensure that access to Cloudflare is monitored and any changes are flagged in a network security dashboard.

Security Traffic Analyzer

Coralogix has built a traffic analyzer specifically for monitoring the security of your AWS infrastructure. The Security Traffic Analyzer connects directly to your AWS environment and collates information from numerous AWS services, including network load balancers and VPC traffic.

Application-level Observability

Often overlooked, application-level security is more important than ever. With zero-day exploits like Log4Shell (in Log4j) becoming more and more common, having a robust approach to security from the code level up is vital. You guessed it, though – observability can help.

To the edge, and beyond

Edge computing and serverless infrastructure are just two examples of the growing complexities you must consider with application-level security. Running applications on the edge can generate vast amounts of data, requiring advanced observability solutions to identify anomalies. Equally, serverless applications can lead to security and IAM issues, which have been the causes of some of the world’s biggest data breaches. 

Observability for Hybrid Cloud Security

In the world of hybrid cloud, observability and security are closely intertwined. The complexities of running systems in a mixture of on-premise and cloud environments give malicious actors, and your own security teams, a lot to work with. 

Centralized Logging

It’s unlikely that the security tooling for your cloud environments will be the same as that used on-premise. Across different systems, vendors will likely have different security tools, all with varied log outputs. A single repository for these outputs, which also parses them in a standardized fashion, is a key part of effective defense. Without this, your security teams may spend unnecessary time deciphering the nuances of two different products’ logs, trying to find a connection.

Dashboarding

A single pane of glass is the only way to implement observability in a complex environment. Dashboards help spot trends and identify outliers, making sure that two teams with different perspectives are “singing from the same hymn sheet.” Combine effective dashboarding with holistic data collection, and you’re onto a winner.

Observability is Security 

At Coralogix, we firmly believe that the most important tool in your security arsenal is effective monitoring and observability. But it’s not just effectiveness that’s key, but also pragmatism. We firmly believe in the value of collecting holistic data, such as from Slack and PagerDuty, to tackle security incidents as well as to detect them. 

The bulk of this piece has been about how observability can help detect malicious actors and security incidents. Our breadth of out-of-the-box integrations and the openness of our platform give organizations free rein to build security-centered SIEM tools. However, by analyzing otherwise overlooked data, such as from internal communications, website information, and marketing data, in conjunction with traditionally monitored metrics, you can really supercharge your defense and response.

Summary

Hopefully, you can see that security and observability are no longer separate concepts. As companies exploit increasingly complex technologies, generate more data, and deploy more applications, observability and security become bywords for one another. However, for your observability strategy to become part of your security strategy, you need the right platform. That platform will collate logs for you automatically while highlighting anomalies, integrate with every security tool in your arsenal and contextualize their data into one dashboard, and bring your engineers together to combat the technical threats facing your organization. 

Kubernetes Security Best Practices

As the container orchestration platform of choice for many enterprises, Kubernetes (or K8s, as it’s often written) is an obvious target for cybercriminals. In its early days, the sheer complexity of managing your own Kubernetes deployment meant it was easy to miss security flaws and introduce loopholes.

Now that the platform has matured, managed Kubernetes services are available from all major cloud vendors, and Kubernetes security best practices have been developed and defined. While no security measure will provide absolute protection from attack, applying these techniques consistently and correctly will certainly decrease the likelihood of your containerized deployment being hacked.

Applying defense in depth

The recommended approach to securing a Kubernetes deployment uses a layered strategy, modeled on the defense in depth paradigm (DiD). In the context of information technology, defense in depth is a security pattern that uses multiple layers of redundancy to protect a system from attack.

Rather than relying on a single security perimeter to protect against all attacks, a defense in depth approach acknowledges the risk that defenses may be breached and deploys additional protections at intermediate and lower levels of the architecture. That way, if one line of defense is breached, there are additional obstacles in place to impede an attacker’s progress.

So how does this apply to Kubernetes? Kubernetes is deployed to a computing cluster that is made up of multiple worker nodes together with nodes hosting the control plane components (including the API server and database). 

Each worker node is simply a machine hosting one or more pods, together with the K8s agent (kubelet), network proxy (kube-proxy), and container runtime. Each pod hosts a container that runs some of your application code. Finally, as a cloud-native platform, K8s is typically deployed to cloud-hosted infrastructure, which means you can easily increase the number of nodes in the cluster to meet demands.

You can think of a Kubernetes deployment in terms of these four layers – your code, the containers the code runs in, the cluster used to deploy the containers, and the cloud (or on-premise) infrastructure hosting the cluster – the four Cs of cloud-native security. Applying Kubernetes security best practices at each of these levels helps to create defense in depth.

K8s security best practices

Securing your code

Kubernetes makes it easier to deploy application code using containers and enables you to leverage the benefits of cloud infrastructure for hosting those containers. The code you run in your containers is both an obvious attack vector and the layer over which you have the most control.

When securing your code, building security considerations into your software development process early on – also known as “shifting security to the left” – is more efficient than waiting until the functionality has been developed before checking for security flaws. 

One example of doing this is to scan your code changes regularly (either as you write or as an early step in the CI/CD pipeline) with static code analyzers and software composition analysis tools. These help to catch known exploits in your chosen framework and third-party dependencies, which could otherwise leave your application vulnerable to attack.

When developing new features for a containerized application, you also need to consider how your containers will communicate with each other. This includes ensuring communications between containers are encrypted and limiting exposed ports. Taking a zero-trust approach here helps protect your application and your data; if an attacker finds a way in, at least they won’t immediately gain unfettered access to your entire system.

Container protections

When Kubernetes deploys an instance of a new container, it first has to fetch the container image from a container registry. This can be the Docker public registry, another specified public registry, or a private container registry. Unfortunately, public container registries have become a popular attack vector. 

This is because open-source container images provide a convenient way to evade an organization’s security perimeter and deploy malicious code directly onto a cluster, such as crypto-mining operations and bot farms. Scanning container images for known vulnerabilities and establishing a secure chain of trust for the images you deploy to your cluster is essential.

When building containers, applying the principle of least privilege will help to prevent malicious actors that have managed to gain access to your cluster from accessing sensitive data or modifying the configuration to suit their own ends. 

As a minimum, configure the container to use a user with minimal privileges (rather than root access) and disable privilege escalation. If some root permissions are required, grant those specific capabilities rather than all. With Kubernetes, these settings can be configured for containers or pods using the security context. This makes it easier to apply security settings consistently across all pods and containers in your cluster.

You may also want to consider setting resource limits to restrict the number of pods or services that can be created, and the amount of CPU, memory, and disk space that can be consumed, according to your application’s needs. This reduces the scope for misuse of your infrastructure and mitigates the impact of denial-of-service attacks.
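As a rough sketch of the two recommendations above, here’s a container definition built with the Kubernetes Python client; the image, user ID, and limits are illustrative, and the same fields map directly to securityContext and resources in a YAML manifest (namespace-wide caps on pod or service counts would be set separately with a ResourceQuota):

```python
from kubernetes import client

container = client.V1Container(
    name="app",
    image="registry.example.com/app:1.0",  # illustrative image
    # Least privilege: run as a non-root user, forbid privilege escalation,
    # and drop all Linux capabilities, adding back only what is required.
    security_context=client.V1SecurityContext(
        run_as_non_root=True,
        run_as_user=10001,
        allow_privilege_escalation=False,
        read_only_root_filesystem=True,
        capabilities=client.V1Capabilities(drop=["ALL"]),
    ),
    # Resource limits bound how much CPU and memory the container can consume.
    resources=client.V1ResourceRequirements(
        requests={"cpu": "100m", "memory": "128Mi"},
        limits={"cpu": "500m", "memory": "256Mi"},
    ),
)

pod_spec = client.V1PodSpec(containers=[container])
```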

Cluster-level security

A Kubernetes cluster is made up of the control plane and data plane elements. The control plane is responsible for coordinating the cluster, whereas the data plane consists of the worker nodes hosting the pods, K8s agent (kubelet), and other elements required for the containers to run.

On the control plane side, both the Kubernetes API and the key-value store (etcd) require specific attention. All communications – from end-users, cluster elements, and external resources – are routed through the K8s API. Ideally, all calls to the API, from inside and outside the cluster, should be encrypted with TLS, authenticated, and authorized before being allowed through.

When you set up the cluster, you should specify the authentication mechanisms to be used for human users and service accounts. Once authenticated, requests should be authorized using the built-in role-based access control (RBAC) component.
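For illustration, a minimal read-only Role created with the Kubernetes Python client might look like this (the namespace and role name are placeholders); you would then attach it to a user or service account with a RoleBinding:

```python
from kubernetes import client, config

config.load_kube_config()
rbac = client.RbacAuthorizationV1Api()

# A Role scoped to one namespace that can inspect pods and nothing else.
role = client.V1Role(
    metadata=client.V1ObjectMeta(name="pod-reader", namespace="production"),
    rules=[client.V1PolicyRule(
        api_groups=[""], resources=["pods"], verbs=["get", "list", "watch"],
    )],
)
rbac.create_namespaced_role(namespace="production", body=role)
```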

Kubernetes requires a key-value store for all cluster data. Access to the data store effectively grants access to the whole cluster, as you can view and (if you have write access) modify the configuration details, pod settings, and running workloads. 

It’s therefore essential to restrict access to the database and secure your database backups. Support for encrypting secret data at rest was promoted from beta in late 2020 and should also be enabled where possible.

Within the data plane, it’s good practice to restrict access to the Kubelet API, which is used to control each worker node and the containers it hosts. By default, anonymous access is permitted, so this must be disabled for production deployments at the very least.

For particularly sensitive workloads, you may also want to consider a sandboxed or virtualized container runtime for increased security. These reduce the attack surface, but at the cost of reduced performance compared to mainstream runtimes such as Docker or CRI-O.

You can learn more about securing your cluster from the Kubernetes documentation.

Cloud or on-premise infrastructure

K8s is cloud-native, but it’s possible to run it on-premises too. When using a managed K8s service, such as Amazon EKS or Microsoft AKS, your cloud provider will handle the physical security of your infrastructure and many aspects of the cybersecurity too.

If you’re running your own Kubernetes deployment, either in the cloud or hosted on-premise, you need to ensure you’re applying infrastructure security best practices. For cloud-hosted deployments, follow your cloud provider’s guidance on security and implement user account protocols to avoid unused accounts remaining active, restrict permissions, and require multi-factor authentication. 

For on-premise infrastructure, you’ll also need to keep servers patched and up-to-date, maintain a firewall and other network security controls, potentially use IP allow lists or block lists to limit access, and ensure physical security.

Wrapping up

As a container orchestration platform, Kubernetes is both powerful and flexible. While this allows organizations to customize it to their needs, it also places the burden of security on IT admins and SecOps staff. A good understanding of Kubernetes security best practices, including how security can be built in at every level of a K8s deployment, and of the specific needs of your organization and application, is essential.

Cybersecurity is not a fire-and-forget exercise. Once you have architected and deployed your cluster with security in mind, the next phase is to ensure your defenses are working as expected. Building observability into your Kubernetes deployment will help you to develop and maintain a good understanding of how your system is operating and monitor running workloads.

Introducing Cloud Native Observability

The term ‘cloud native’ has become a much-used buzz phrase in the software industry over the last decade. But what does cloud-native mean? The Cloud Native Computing Foundation’s official definition is:


“Cloud-native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds…These techniques enable loosely coupled systems that are resilient, manageable, and observable.”

From this definition, we can differentiate cloud-native systems from monoliths, which run as a single service on a continuously available server. Large cloud providers like Amazon’s AWS or Microsoft Azure can run serverless and cloud-native systems. Serverless systems are a subset of cloud-native systems where the hardware settings are completely abstracted from developers. Private servers maintained by private companies can also run cloud-native services.

The critical point to consider is that cloud-native solutions have unique observability problems. Developing, troubleshooting, and maintaining monolithic services is quite different from troubleshooting cloud-native services. This article will introduce some of the unique issues presented in cloud-native computing, and tools that allow users to gain cloud-native observability. 

Challenges and opportunities of the cloud

Cloud-native infrastructure provides many benefits over traditional monolithic architecture. These distributed computing solutions provide scalable, high-performing systems that can be upgraded rapidly without system downtime. Monoliths were sufficient for demand in the earlier days of computing but could not scale well. The tradeoff for cloud native’s agility is a more opaque troubleshooting process.

With new techniques in software infrastructure also came new techniques in cloud-native observability. Tools to centralize and track data as it flows through a system have become paramount to troubleshooting and understanding where issues might arise. 

User expectations

Cloud-native systems by design make scaling easier by simply adding more infrastructure for existing software to run on. Scaling may occur by adding more physical servers or increasing cloud computing capacity with your chosen provider. Businesses need to be able to detect when scaling is necessary to accommodate all potential customers.

Along with scaling systems, adding additional features to your offering is crucial to growing your business. Cloud-native solutions allow development teams to produce features that can plug onto existing services quickly. Even with significant testing, features can have unseen flaws until they are deployed and running in the system for some time.

It is crucial to have cloud observability tools to monitor the system and alert the appropriate team when issues arise. Without such a service, users will be the first to discover issues that will hurt the business. 

Distributed Systems

Cloud-native services run as distributed systems with many different software pieces interacting with each other. Systems may have many containers, compute functions, databases, and queues interacting in different ways to make up a unified feature. As well, different features may use the same infrastructure.

When an issue arises in a system or a connection, isolating the issue can be difficult. Logging remains as crucial as it was with monolithic solutions. However, logging alone is not sufficient to have complete cloud-native observability. Systems also need to understand how microservices are working together. Traces can be used to track requests through a system. Metrics can also be used to understand how many requests or events are occurring so teams can quickly detect and isolate issues.

New tools are being introduced into the software industry to help teams rapidly fix production issues. Since distributed systems are ideal for rapid changes, fixes may be rapid as well. The more significant problem is that detecting issues becomes much more complex, particularly when developers have not implemented observability tools. Using a combination of logs, metrics, and traces, and having all records stored in a centralized tool like Coralogix’s log analytics platform, can help teams quickly jump over the troubleshooting hurdle and isolate the issues.

Ephemeral infrastructure

Cloud-native observability tools are available to deal with the ephemeral nature of these systems. Cloud-native deployments run on temporary containers, which spin up and shut down automatically as system requirements change. If an issue occurs, the container will likely be gone by the time troubleshooting needs to occur.

If systems use a serverless framework, teams are even further abstracted away from the hardware issues that may cause failures. Services like AWS and Azure can take complete control over the handling of servers. This abstraction allows companies to focus on their core software competency rather than on managing servers, both physically and through capacity and security settings. Without knowing how services run, systems have a limited ability to know what failed. Metrics and traces become critical tools for cloud-native observability in these cases.

Elastic scalability

Cloud-native systems typically use a setup that will scale services as user requirements ebb and flow. With higher usage, storage and computing require more capacity. This higher usage may not be consistent over time. When usage decreases, capacity should decrease in turn. Scaling capacity in this way allows businesses to be very cost-efficient, paying only for what they use. It can also allow them to allocate private servers to what is needed at that time, scaling back capacity for computing that is not time-critical. 

Cloud-native observability must include tracking the system’s elastic scaling. Metrics can be helpful to understand how many users or events are accessing a given service at any given time. Developers can use this information to fine-tune capacity settings to increase efficiency further. Metrics can also help to understand if part of the system has failed due to a capacity or scaling issue.

Monitoring usage

Cloud-native systems follow a new manner of designing and implementing systems. Since the construction of these systems is new, professionals also need to consider new methods of implementing security practices. Monitoring is key to securing your cloud-native solution.

With monolithic deployments, security practices partially focused on securing endpoints and the perimeter of the service. With cloud-native, services are more susceptible to attacks than before. Shifting to securing data centers and services rather than only endpoints is critical. Detecting an attack also requires tools as dynamic as your service to observe every part of the infrastructure.

Security and monitoring tools should scale without compromising performance. These tools should also be able to contain nefarious behavior before it can spread to an entire system. Cloud-native observability tools are designed to help companies track where they may be vulnerable and, in some cases, even to detect an attack themselves.

Conclusion

Cloud-native solutions allow companies to create new features and fix issues quickly. Observability tools are key in cloud-native solutions since issues can be more obfuscated than in traditional software designs. Developers should implement observability tools into their systems from the very beginning. Early integration ensures tools are compatible with cloud providers and still leaves the ability to augment cloud-native observability tools in the future.

Microservices built on a cloud-native architecture can have multiple teams working on different features. Teams should implement observability tools to notify the appropriate person or team when an issue is detected. Tools that allow all team members to understand the health of the system are ideal. 

What to Consider When Monitoring Hybrid Cloud Architecture

Hybrid cloud architectures provide the flexibility to utilize both public and private cloud environments in the same infrastructure. This enables scalability and power that is easy and cost-effective to leverage. However, an ecosystem containing components with dependencies layered across multiple clouds has its own unique challenges.

Adopting a hybrid monitoring strategy doesn’t mean you need to start from scratch, but it does require a shift in focus and some additional considerations. You don’t need to reinvent the wheel as much as realign it.

In this article, we’ll take a look at what to consider when building a monitoring stack or solution for a hybrid cloud environment.

Hybrid Problems Need Hybrid Solutions

Modern architectures are complex and fluid with rapid deployments and continuous integration of new components. This makes system management an arduous task, especially if your engineers and admins can’t rely on an efficient monitoring stack. Moving to a hybrid cloud architecture without overhauling your monitoring tools will only complicate this further, making the process disjointed and stressful.

Fortunately, there are many tools available for creating a monitoring stack that provides visibility in a hybrid cloud environment. With the right solutions implemented, you can unlock the astounding potential of infrastructures based in multiple cloud environments.

Mirroring Cloud Traffic On-Premise

When implementing your hybrid monitoring stack, covering blind spots is a top priority. This is true for any visibility-focused engineering, but blind spots are especially problematic in distributed systems. It’s difficult to trace and isolate root causes of performance issues with data flowing across multiple environments. Doubly so if some of those environments are dark to your central monitoring stack.

One way to overcome this is to mirror all traffic to and between external clouds back to your on-premise environment. Using a vTAP (short for virtual tap), you can capture and copy data flowing between cloud components and feed the ‘mirrored’ data into your on-premise monitoring stack.

Traffic mirroring with a vTAP solution ensures that all system and network traffic is visible, regardless of origin or destination. The ‘big 3’ public cloud providers (AWS, Azure, Google Cloud) offer features that enable mirroring at a packet level, and there are many third-party and open source vTAP solutions readily available on the market.
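As an example of how those provider features are exposed, here’s a rough sketch of creating an AWS VPC Traffic Mirroring session with boto3, assuming the mirror target, filter, and source network interface already exist (all IDs are placeholders):

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Mirror packets from a source network interface to a previously created
# mirror target (for example, a load balancer in front of your capture
# stack), applying a previously created filter.
ec2.create_traffic_mirror_session(
    NetworkInterfaceId="eni-0123456789abcdef0",
    TrafficMirrorTargetId="tmt-0123456789abcdef0",
    TrafficMirrorFilterId="tmf-0123456789abcdef0",
    SessionNumber=1,
    Description="mirror app traffic to on-premise monitoring stack",
)
```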

Packet-Level Monitoring for Visibility at the Point of Data Transmission

As mentioned, the features and tools offered by the top cloud providers allow traffic mirroring down to the packet level. This is very deliberate on their part. Monitoring traffic and data at a packet level is vital for any effective visibility solution in a hybrid environment.

In a hybrid environment, data travels back and forth between public and on-premise regions of your architecture regularly. This can make tracing, logging, and (most importantly) finding the origin points of errors a challenge. Monitoring your architecture at a packet level makes tracing the journey of your data a lot easier.

For example, monitoring at the packet level picks up on failed cyclic redundancy checks and checksums on data traveling between public and on-premise components. Compromised data is filtered upon arrival. What’s more, automated alerts when packet loss spikes allow your engineers to isolate and fix the offending component before the problem potentially spirals into a system-wide outage.

Implementing data visibility at the point of transmission lets you verify data integrity and authenticity in real time, quickly identifying faulty components or vulnerabilities. There are much higher levels of data transmission in hybrid environments. As such, any effective monitoring solution must ensure that data isn’t invisible while in transit.

Overcoming the Topology Gulf

Monitoring data in motion is key, and full visibility of where that data is traveling from and to is just as important. A topology of your hybrid architecture is far more critical than it is in a wholly on-premise infrastructure (where it is already indispensable). Without an established map of components in the ecosystem, your monitoring stack will struggle to add value.

Creating and maintaining an up-to-date topology of a hybrid-architecture is a unique challenge. Many legacy monitoring tools lack scope beyond on-premise infrastructure, and most cloud-native tools offer visibility only within their hosted service. Full end-to-end discovery must overcome the gap between on-premise and public monitoring capabilities.

On the surface, it requires a lot of code change and manual reconfigurations to integrate the two. Fortunately, there are ways to mitigate this, and they can be implemented from the early conceptual stages of your hybrid-cloud transformation.

Hybrid Monitoring by Design

Implementing a hybrid monitoring solution post-design phase is an arduous process. It’s difficult to achieve end-to-end visibility if the components of your architecture are already deployed and in use.

One of the advantages of having components in the public cloud is the flexibility afforded by access to an ever-growing library of components and services. However, utilizing this flexibility means your infrastructure is almost constantly changing, making both discovery and mapping troublesome. Tackling this in the design stage ensures that leveraging the flexibility of your hybrid architecture doesn’t disrupt the efficacy of your monitoring stack.

By addressing real-time topology and discovery in the design stage, your hybrid architecture and all associated operational tooling will be built to scale in a complementary manner. Designing your hybrid architecture with automated end-to-end component/environment discovery as part of a centralized monitoring solution, for example, keeps all components in your infrastructure visible regardless of how large and complex your hybrid environment grows.

Avoiding Strategy Silos

Addressing monitoring at the design stage ensures that your stack can scale with your infrastructure with minimal manual reconfiguration. It also helps avoid another common obstacle when monitoring hybrid-cloud environments, that of siloed monitoring strategies.

Having a clearly established, centralized monitoring strategy keeps you from approaching monitoring on an environment-by-environment basis. Why should you avoid an environment-by-environment approach? Because it quickly leads to siloed monitoring stacks, and separate strategies for your on-premise and publicly hosted systems.

While the processes differ and tools vary, the underpinning methodology behind how you approach monitoring both your on-premise and public components should be the same. You and your team should have a clearly defined monitoring strategy that anything you implement adheres to and contributes towards. Using different strategies for different environments quickly leads to fragmented processes, poor component integration, and ineffective architecture-wide monitoring.

Native Tools in a Hybrid Architecture

AWS, Azure, and Google all offer native monitoring solutions — AWS CloudWatch, Azure Monitor, and Google Cloud’s operations suite (formerly Stackdriver). Each of these tools enables access to operational data, observability, and monitoring in their respective environments. Full end-to-end visibility would be impossible without them. While they are a necessary part of any hybrid-monitoring stack, they can also lead to vendor reliance and the aforementioned siloed strategies we are trying to avoid.

In a hybrid cloud environment, these tools should be part of your centralized monitoring stack, but they should not define it. Native tools are great at metrics collection in their hosted environments. What they lack in a hybrid context, however, is the ability to provide insight across the entire infrastructure.

Relying solely on native tools won’t provide comprehensive, end-to-end visibility. They can only provide an insight into the public portions of your hybrid-architecture. What you should aim for is interoperability with these components. Effective hybrid monitoring taps into these components to use them as valuable data sources in conjunction with the centralized stack.

Defining ‘Normal’

What ‘normal’ looks like in your hybrid environment will be unique to your architecture, and it can only be established by analyzing the infrastructure as a whole. Visibility of your public cloud components is vital, but it is the architecture-wide view that defines what shape ‘normal’ takes.

Without understanding and defining ‘normal’ operational parameters it is incredibly difficult to detect anomalies or trace problems to their root cause. Creating a centralized monitoring stack that sits across both your on-premise and public cloud environments enables you to embed this definition into your systems.

Once your system is aware of what ‘normal’ looks like as operational data, processes can be put in place to solidify the efficiency of your stack across the architecture. This can be achieved in many ways, from automating anomaly detection to setting up automated alerts.
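As a toy illustration of embedding that definition, a rolling baseline with a deviation threshold is one of the simplest possible forms of automated anomaly detection (the numbers below are made up for the example):

```python
from statistics import mean, stdev

def is_anomalous(history, value, threshold=3.0):
    """Flag a measurement more than `threshold` standard deviations
    away from the recent baseline."""
    if len(history) < 10:      # not enough data to define 'normal' yet
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Response times (ms) observed across the hybrid environment.
baseline = [120, 118, 125, 122, 119, 121, 124, 120, 123, 118]
print(is_anomalous(baseline, 122))  # False - within the normal range
print(is_anomalous(baseline, 480))  # True  - likely worth an alert
```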

You Can’t Monitor What You Can’t See

These are just a few of the considerations you should take when it comes to monitoring a hybrid-cloud architecture. The exact challenges you face will be unique to your architecture.

If there’s one principle to remember at all times, it’s this: you can’t monitor what you can’t see.

When things start to become overcomplicated, return to this principle. No matter how complex your system is, this will always be true. Visibility is the goal when creating a monitoring stack for hybrid cloud architecture, the same as it is with any other.

11 Tips for Avoiding Cloud Vendor Lock-In 

Cloud vendor lock-in. In cloud computing, software or computing infrastructure is commonly outsourced to cloud vendors. When the cost and effort of switching to a new vendor is too high, you can become “locked in” to a single cloud vendor. 

Once a vendor’s software is incorporated into your business, it’s easy to become dependent upon that software and the knowledge needed to operate it. It is also very difficult to move databases once they are live; migrating data to a different vendor during a cloud migration may involve reformatting that data. Ending contracts early can also incur heavy financial penalties.

This article will explore 11 ways you can avoid cloud vendor lock-in and optimize your cloud costs.

Tips For Avoiding Cloud Vendor Lock-In

There are ways to reduce the risk of vendor lock-in by following these best practices:

1. Engage All Stakeholders

Stakeholder engagement is crucial to understand the unique risks of cloud vendor lock-in. Initially, the architects should drive the discussion around the benefits and drawbacks that cloud computing will bring to the organization. These discussions should engage all stakeholders.

The architects and technical teams should be aware of the business implications of their technical decision choices. A solution in the form of applications, workloads and architecture should align with business requirements and risk. Before migrating to a cloud service, carefully evaluate lock-in concerns specific to the vendor and the contractual obligations.

2. Review the Existing Technology Stack

The architects and technical teams should also review the existing technology stack. If the workloads are designed to operate on legacy technologies, the choice of cloud platforms and infrastructure will likely be limited.

3. Identify Common Characteristics

Identify what is compatible across cloud vendors, your existing technology stack, and your technical requirements. These common characteristics are key to determining the best solution for your business needs. A lack of compatibility may highlight the need to rethink your strategy, the decision to migrate to the cloud, and the technical requirements of your workloads.

4. Investigate Upgrading Before Migrating

If your applications are compatible with only a limited set of cloud technologies, you may want to first consider upgrading your applications before migrating to a cloud environment. If your applications and workloads will only work with legacy technologies that are supported by a small number of vendors, the chances of future cloud vendor lock-in are higher.

Once signed up to one of these vendors, any future requirements to move to another vendor may incur a high financial cost or technical challenges.

5. Implement DevOps Tools and Processes

DevOps tools are increasingly being implemented to maximize code portability.

Container technology provided by companies like Docker helps isolate software from its environment and abstract dependencies away from the cloud provider. Since most vendors support standard container formats, it should be easy to transfer your application to a new cloud vendor if necessary.

Also, configuration management tools help automate the configuration of the infrastructure on which your applications run. This allows the deployment of your applications to various environments, which can reduce the difficulty of moving to a new vendor.

These technologies reduce the cloud vendor lock-in risks that stem from proprietary configurations and can ease the transition from one vendor to another.

6. Design Your Architecture to be Loosely Coupled 

To minimize the risk of vendor lock-in, your applications should be built or migrated to be as flexible and loosely coupled as possible. Cloud application components should be loosely linked with the application components that interact with them. Abstract the applications from the underlying proprietary cloud infrastructure by incorporating REST APIs with popular industry standards like HTTP, JSON, and OAuth. 
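To make the idea concrete, here’s a small, hypothetical Python sketch of keeping business logic behind a thin storage interface so the vendor SDK (boto3, in this case) stays inside one replaceable adapter:

```python
from typing import Protocol

class ObjectStore(Protocol):
    """The minimal storage interface the business logic depends on."""
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class S3Store:
    """AWS-specific adapter; a GCS or Azure Blob adapter would sit beside it."""
    def __init__(self, bucket: str):
        import boto3  # the vendor SDK stays behind the adapter
        self._s3, self._bucket = boto3.client("s3"), bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

def archive_invoice(store: ObjectStore, invoice_id: str, pdf: bytes) -> None:
    # Business logic only knows about ObjectStore; switching providers means
    # writing a new adapter, not rewriting this function.
    store.put(f"invoices/{invoice_id}.pdf", pdf)
```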

Also, any business logic should be separated from the application logic, clearly defined and documented. This will avoid the need to decipher business rules in case a migration to a new vendor occurs.

Using loosely coupled cloud designs not only reduces the risk of lock-in to a single vendor, but also gives your application the interoperability required for fast migration of workloads and multi-cloud environments.

7. Make Applications Portable Using Open Standards 

The portability of an application describes its flexibility to be implemented on an array of different platforms and operating systems without making major changes to the underlying code.

Building portable applications can also help organizations avoid cloud vendor lock-in. If you develop a business-critical application whose core functionality depends on platform-specific functionality, you could be locked into that vendor.

The solution is to: 

  • Build portable applications whose cloud application components are loosely coupled using open standards. 
  • Avoid hard coding external dependencies in third-party proprietary applications.
  • Maximize the portability of your data by choosing a standardized format and avoiding proprietary formatting. 
  • Describe data models as clearly as possible, using applicable schema standards to create detailed, readable documentation.

8. Develop a Multi-Cloud Strategy  

A multi-cloud strategy is where an organization uses two or more cloud services from different vendors and maintains the ability to allocate workloads between them as required. This model is becoming increasingly popular.

Not only does this strategy help organizations avoid cloud vendor lock-in, it also means that they can take advantage of the best available pricing, features, and infrastructure components across providers. An organization can cherry-pick offerings from each vendor to implement the best services into their applications.

The key to an effective multi-cloud strategy is ensuring that both data and applications are sufficiently portable across cloud platforms and operating environments. By going multi-cloud, an organization becomes less dependent on one vendor for all of its needs. 

There are some disadvantages to a multi-cloud design, such as an increased workload on development teams and more security risk, but these are often outweighed by the greater risk of cloud vendor lock-in.

9. Retain Ownership of Your Data 

As your data increases in size while stored with a single vendor, the cost and duration of migrating that data could increase, eventually becoming prohibitive and resulting in cloud vendor lock-in. 

It is worth considering a cloud-attached data storage solution to retain ownership of your data, protect sensitive data and ensure portability should you wish to change vendors.

10. Develop a Clear Exit Strategy 

To help your organization avoid cloud vendor lock-in, the best time to create an exit strategy from a contract with a vendor is before signing the initial service agreement.

While you plan your implementation strategy, agree an exit plan in writing, including:

  • What happens if the organization needs to switch vendors?
  • How can the vendor assist with deconversion if the organization decides to move somewhere else?
  • What are the termination clauses for the agreement? 
  • How much notice is required?
  • Will the service agreement renew automatically?
  • What are the exit costs?

The exit strategy should also clearly define roles and responsibilities. Your organization should clearly understand what’s required to terminate the agreement.

11. Complete Due Diligence

Before you select your cloud vendor, gather a deep understanding of your potential vendor to mitigate the risk of cloud vendor lock-in.

The following items should be a part of your due diligence strategy:

  • Determine your goals of migrating to the cloud.
  • Establish a thorough and accurate understanding of your technology and business requirements. What is expected to change and how?
  • Determine the specific cloud components necessary. 
  • Assess the cloud vendor market. Understand trends in the cloud market, the business models and the future of cloud services.
  • Audit their service history, market reputation, as well as the experiences of their business customers.
  • Select the correct type of cloud environment needed: public, private, hybrid, or multi-cloud.
  • Assess your current IT situation, including a thorough audit of your current infrastructure and cost and resource levels.
  • Consider all of the vendor pitches to see if they match your needs. 
  • Look at the different pricing models to determine the cost savings. 
  • Choose the right cloud provider for your organization.
  • Read the fine print and understand their service level agreements.
  • Consider their data transfer processes and costs.
  • Agree to Service Level Agreement (SLA) terms and contractual obligations that limit the possibility of lock-in.

Summary

This article discussed several factors that an organization should consider in order to mitigate cloud vendor lock-in. These included implementing DevOps tools, designing loosely coupled and portable applications, considering a multi-cloud strategy, planning early for an exit and performing due diligence.

While cloud vendor lock-in is a real concern, the benefits of cloud computing outweigh the risks. 

What to consider when choosing a cloud provider

These days, it seems like platform and infrastructure services are more available than ever. With four large cloud providers (AWS, Azure, GCE, and SoftLayer) and countless others with specialties (DigitalOcean, Packet.net, and more), you’d be swamped with comparison tables, charts, and experiences, enough to make your head spin. We’d like to offer an opinionated list, suggesting what makes every offering worth picking over others. Caveat emptor: this is at best a starting point or an educated bet.

Without further ado, let’s start!

AWS:

The market leader carries with it several benefits: there is a larger talent pool familiar with it than with other clouds, the breadth of services is bewildering, and there are even GPU and FPGA instances.
Beyond an obvious reliance on GPUs and FPGAs, AWS is a great choice if you don’t mind locking into value-added services in return for much quicker execution. These services range from convenience and workflow to line of business, especially around AI and Amazon Alexa.

Azure:

The runner-up relies on the technology advantages of Microsoft. Microsoft’s been putting their might behind their cloud offering and it shows.
Obviously, if you want to run Windows workloads or use Windows technologies, you will get many integration benefits. .NET together with .NET Core offers a great alternative to Java, with a growing array of open-source libraries to support it.

But even if you don’t, Microsoft’s alliance with Docker and Windows container support helps sweeten the deal when it comes to developing in a Microsoft environment. This, coupled with vast enterprise integration experience, leads to a formidable choice.
Azure would be a great choice if you develop for Windows, integrate with enterprises, Windows services, and Windows environments, or would like to tap into the large .NET talent pool.

GCP:

Google is known for its infrastructure for a very good reason, and GCP exposes some (we don’t really know how much) of that richness to potential customers. If you’re willing to go the Google way, you will find some high-performance value-added services at your disposal, such as BigQuery, Pub/Sub, Cloud SQL, and GAE. Expertise in the Google platform is harder to find, but those who are vested in it swear by it.

GCP would be a great choice if you believe in Google’s technologies and their ability to deliver scalability and ease of development and are willing to invest some time in learning the ropes.

Softlayer:

IBM made its cloud play when it purchased SoftLayer in 2013. IBM has upped its game with offerings such as Bluemix for workflow but, more interestingly, Watson-based technologies for AI and value-added line-of-business services.
If you’re considering a hybrid cloud model or would like to set your sights on Watson’s AI technologies as a driver, IBM SoftLayer would make a fine choice for you.

Digital Ocean:

Digital Ocean is well known for being a VPS host. So it was only natural that they would extend their services into a full-fledged cloud. Their ability to keep pricing down and keep a loyal user base has been quite remarkable, to say the least.
If you know you’re going to start with a VPS and are comfortable working with it and extending as you go along, Digital Ocean makes for a very prudent choice.

Packet.net

To be honest, if you’re looking into packet.net you will probably do deeper research than this article has to offer. Packet.net is for the hardcore professionals who are willing to trade convenience for the power of bare metal machines and customized networking options.

Packet.net would be a great choice if you have the capability or need to push the pedal to the metal when it comes to server and networking performance, perhaps if you’re doing things like live video streaming.

Summary:

Looking at this market, it seems like there are two dominant players: Azure for enterprise and AWS for all the rest. We chose to run our entire system on fully controlled Docker instances, which allowed us the elasticity to serve different clients on different clouds. But many companies rely on PaaS, and for them, AWS provides the most extensive support. Just keep in mind that as IaaS prices are in a race to the bottom, PaaS is the way for cloud providers to increase their revenue, so being bound to a provider due to usage of its services or platforms can be pricey.

5 Intermediate-Level Tips to Reduce Cloud Costs

Cloud cost optimization. Everybody wants to keep cloud costs down as much as possible while maintaining reliability and service quality.
In this post, we’d like to go beyond the more obvious recommendations, yet stick to mostly simple tips that you may have overlooked.

Use a reseller with added value:

After you design your deployment and decide on a cloud provider, one of your first instincts is to go and sign up for the service and start executing. In truth, unless you expect very high volumes, by going with an authorized reseller you may get to enjoy their volume discount. Some resellers may offer added value services to watch your bill, such as Datapipe’s Newvem offering. Don’t be afraid to shop around and see who’s offering what at which price.

Reservations are not just for EC2 and RDS

Sure, everybody knows that you can reserve EC2 and RDS instances, and Redshift too. Upfront, no upfront, 1 year, 3 years… but did you also know that you can reserve DynamoDB capacity?
DynamoDB reservations work by allocating read and write capacity, so once you get a handle on DynamoDB pricing and your needs, making a reservation can reduce DynamoDB costs by up to 40%.
And AWS keeps on updating its offerings, so make sure you look for additional reservation options beyond the straightforward ones.
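Reserved capacity is purchased up front (through the console or your account team) and is then applied against the provisioned throughput you configure on your tables, so sizing that throughput deliberately is what makes a reservation pay off. A minimal boto3 sketch, with an illustrative table name and capacity numbers, assuming the table uses provisioned (not on-demand) billing:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Reserved-capacity discounts apply to provisioned read/write capacity units.
dynamodb.update_table(
    TableName="orders",  # illustrative table name
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,
        "WriteCapacityUnits": 50,
    },
)
```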

Not just self-service reservations:

EC2, RDS, Redshift, and DynamoDB offer self-service reservations via the AWS console and APIs. Does that mean that you can’t reserve capacity for other services? Well, yes and no. If you check out the pricing page for some services, such as S3, you will see that starting from a certain volume, the price changes to ‘contact us’. This is a good sign – it means that if you can plan for the capacity, AWS wants to hear from you. And even if you fall somewhat short of qualifying, it’s worth checking with your account manager whether AWS is willing to offer a discount for your volume of usage.

Careful data transfer:

One of the more opaque cost items in the AWS bill is the data transfer fee. It’s tricky to diagnose and keep down. But by sticking to proper guidelines, you should be able to keep that factor down. Some worthwhile tricks to consider:

  • CloudFront traffic is cheaper than plain data out. Want something downloaded off your servers or services? Use CloudFront, even if there is no caching involved.
  • Need to exchange files between AWS availability zones? Use AWS EFS. Aside from being highly convenient, it waives the data transfer fee.
  • Take care never to use external IP addresses unless needed. This will incur data transfer costs for traffic that may otherwise be totally free.

Prudent object lifecycle

One major source of possible excess costs is the presence of data beyond what is needed: old archives and files, old releases, old logs… The problem is, usually the time it takes to discern whether they can safely be erased outweighs the cost of just keeping them around. A good compromise is to have AWS automatically move them to cheaper, infrequent-access storage tiers, like S3 IA and Glacier.

You may be tempted to perform some of these manually, but keep in mind that changing a file’s storage class manually carries the cost of a POST operation, so it’s better to let policy handle that.
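A minimal boto3 sketch of such a lifecycle policy, with a placeholder bucket name and prefix (adjust the day thresholds to your own retention needs):

```python
import boto3

s3 = boto3.client("s3")

# Transition aging objects to cheaper tiers automatically instead of
# changing storage classes by hand.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-down-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)
```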