Monitoring Office365 and Azure Health Status with Coralogix

Life is all about perspective, and the way we look at things often defines us as individuals, professionals, business entities, and products. How you understand the world is influenced by many details, or in the case of your application – many data sources.

At Coralogix, we not only preach comprehensive data analysis but strive to enable it by continuously adding new ways to collect data. With more comprehensive data collection, you can reach a more accurate perception of your entire system and its state.

During one of our POCs, a customer wanted to take advantage of Coralogix’s advanced alerting mechanisms for monitoring Azure/Office 365 health status. We allow users to easily ingest contextual data from a wide range of sources such as status pages, feeds, and even Slack channels. Unfortunately, Office 365 does not have a public status page or API.

In this post, we will demonstrate the simple steps required to monitor the status of your Office 365 application with Coralogix.

Expose Health Status with Microsoft Graph

To get around the lack of a public status page or API for Office 365 and Azure, we can leverage Microsoft Graph, which exposes the health status of the services in the current subscription. We will use Microsoft Graph to query the API and get the information we need.

Your Office 365 health status can be checked by signing into your Office 365 admin account and navigating to the Service Health submenu:

You will notice that you can view the issues of any services that you have enabled in your subscription.

Sending the Data to Coralogix

Now let’s get to the fun part.

The first thing we need to do is navigate to App Registrations in Azure Active Directory and register a new application. This application will handle authentication with Office 365 so that we can sample its status programmatically.

Once we register the new application, we will see the following screen with its details. We will also need to create a new client secret for our application.

Note: Take note of the Application (client) ID and the newly created secret; we’ll need them in the following steps.

We also need to grant the application API permissions: add ServiceHealth.Read.All and grant admin consent.

Once we have this information, we will need to find our Tenant ID, which can be retrieved from the Azure Active Directory overview page.

Lambda Function Deployment

Since we are building a service that periodically samples the status and only reports when something changes, we need some compute. And since we want this to be as low-maintenance and economical as possible, we chose to use Lambda functions.

We need to deploy our Lambda function to start pulling this data and pushing the logs to our Coralogix account.

Let’s start by cloning the function repository. Building and deploying is quite simple; we just need the AWS CLI and AWS SAM installed to perform the deployment.

# sam build
# sam deploy --guided 

Once we deploy the Lambda function, we need to go into the AWS console and update the environment variables to match the details we collected in the previous section, along with the Coralogix private key, which can be found on the “Send Your Logs” screen in the Coralogix UI.

  • Coralogix:
    • PRIVATE_KEY
    • APPLICATION_NAME
    • SUBSYSTEM

Note: The application and subsystem are naming tags you can choose for your application (e.g. Application: Office365_status, Subsystem: office365_lambda_tester).

  • Office 365:
    • CLIENT_ID
    • SECRET_ID
    • TENANT 
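For reference, the heart of such a function — authenticating with the client credentials above and querying Microsoft Graph’s service health endpoint — might look roughly like the sketch below. This is not the repository’s actual code; it only illustrates the documented Graph v1.0 `serviceAnnouncement/healthOverviews` call, which requires the ServiceHealth.Read.All permission granted earlier:

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
HEALTH_URL = "https://graph.microsoft.com/v1.0/admin/serviceAnnouncement/healthOverviews"

def build_token_request(tenant, client_id, secret):
    """Build the OAuth2 client-credentials request used to obtain a Graph token."""
    url = TOKEN_URL.format(tenant=tenant)
    body = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": secret,
        "scope": "https://graph.microsoft.com/.default",
    }).encode()
    return url, body

def get_token(tenant, client_id, secret):
    """Exchange the app registration credentials (TENANT, CLIENT_ID, SECRET_ID)
    for an access token."""
    url, body = build_token_request(tenant, client_id, secret)
    with urllib.request.urlopen(urllib.request.Request(url, data=body)) as resp:
        return json.load(resp)["access_token"]

def fetch_health_overviews(token):
    """Query the health overviews of the services in the subscription."""
    req = urllib.request.Request(
        HEALTH_URL, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["value"]
```

Each returned overview carries a service name and its current status, which the function can then forward to Coralogix as a log entry.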

Once those values are updated with the details relevant to your application, you should be good to go. Test the Lambda function and you will see the status updates in your Coralogix account, as in this example:

Monitoring Office365 Health Status in Coralogix

This data, when ingested to Coralogix, can be correlated with additional application logs and metrics for deeper context into the long-term stability of services. Advanced alerting with dynamic thresholds can be used to update relevant parties about some or all of the organization’s Office365 cloud services.

Learn more about Coralogix’s contextual data analysis solution.

Limit Coralogix Usage per Account Using Azure Functions

At Payoneer, we use Coralogix to collect logs from all our environments from QA to PROD.


Each environment has its own account in Coralogix and thus its own limit. Coralogix pricing is calculated per account.


As a company, we have a budget per account and we know how much we pay for each one.

In case you exceed the number of logs assigned per account, you will pay for the “extra” logs. You can see the exact calculation at this link.

Our Flow:

You can see our flow below, showing each ENV disconnected from the others but all of them under our account in Coralogix.

The problem

In each environment (except PROD) we allow our developers to decide which log level they want to write, and that can cause somewhat of an issue if you’re constantly writing in DEBUG or VERBOSE. You can reach your Coralogix quota quite fast if you are not careful.

We needed a way (without chasing the developer teams each day with a slap on the wrist) to limit the number of logs with no human interaction.

The solution

We had a few requirements to consider for the solution:

  1. Availability ( 24/7/365)
  2. Not environment dependent
  3. Ability to access Coralogix API

We wanted a solution that would not be part of our stack and would always run against the Coralogix API.

We chose to use Azure Functions.

Azure Functions

Accelerate and simplify serverless application development with serverless compute

Azure Functions, along with AWS Lambda and Google Cloud Functions, was our main focus. We chose Azure Functions since we are already working with Azure’s cloud services and they provide 1 million executions on the free tier, so the choice was easy.

The functions were written in Python, and you can see the flow below:

Block Coralogix application
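The quota check at the heart of this flow can be sketched in a few lines of Python. The thresholds and the way usage is measured here are illustrative assumptions, not Payoneer’s actual values:

```python
# Thresholds are illustrative; the real values depend on your per-account budget.
BLOCK_THRESHOLD = 0.9  # block an application once 90% of the daily quota is consumed
WARN_THRESHOLD = 0.7   # notify the owning team at 70%

def decide_action(daily_usage_gb: float, daily_quota_gb: float) -> str:
    """Decide what the function should do for a given application."""
    ratio = daily_usage_gb / daily_quota_gb
    if ratio >= BLOCK_THRESHOLD:
        return "block"   # create a Coralogix Block rule for this application
    if ratio >= WARN_THRESHOLD:
        return "warn"    # alert the team before they hit the limit
    return "allow"

print(decide_action(95, 100))  # block
```

The function runs on a schedule, applies this decision per application, and creates or removes the corresponding Coralogix rules described below.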

Coralogix Rules

As you may have seen in the diagram above, we use Coralogix rules to stop logs from being parsed, saving money on ingested logs every single day. So what are rules?

Rules help you to process, parse, and restructure log data to prepare for monitoring and analysis

Coralogix offers many different types of log parsing rules like:

  • Parse
  • Extract
  • Extract JSON
  • Replace
  • Block
  • Timestamp Extract
  • Remove Fields

You can see the full list at the Coralogix site.


In our case, we used the Block option. Block rules allow you to filter out your incoming logs using RegEx.

The rules are part of log groups that can contain multiple rules.
See the example here:

Blocking RabbitMQ using Coralogix rules
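To make the idea concrete, here is a hypothetical block pattern like the one in the RabbitMQ example above, tested with Python’s `re` module. The pattern itself is an illustrative assumption; a real rule would match your own log format:

```python
import re

# Hypothetical block-rule pattern: drop noisy RabbitMQ DEBUG/heartbeat lines.
BLOCK_PATTERN = re.compile(r"rabbitmq.*(DEBUG|heartbeat)", re.IGNORECASE)

def is_blocked(log_line: str) -> bool:
    """Return True if the block rule would filter this line out before ingestion."""
    return bool(BLOCK_PATTERN.search(log_line))

print(is_blocked("rabbitmq-server DEBUG: connection opened"))  # True
print(is_blocked("payments-api ERROR: request timeout"))       # False
```

Because block rules are evaluated before parsing, every line the RegEx matches never counts against the account’s quota.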

Exclusion list

Per request from our developers, we added a way to unblock an application for a predefined period of time OR give them an additional XXX lines of logs to be parsed and displayed in the UI before they are blocked again:
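The exclusion check could be modeled along these lines. The data structure and field names are illustrative, covering both mechanisms: a time window and a leftover line budget:

```python
from datetime import datetime, timedelta

def is_excluded(app, exclusions, now=None):
    """Return True if an application is temporarily exempt from blocking.

    `exclusions` maps app name -> (granted_at, duration_minutes, extra_lines_left).
    An app stays unblocked while its time window is open OR while it still
    has extra log lines left in its budget. Field names are illustrative.
    """
    now = now or datetime.utcnow()
    entry = exclusions.get(app)
    if entry is None:
        return False
    granted_at, duration_minutes, extra_lines_left = entry
    within_window = now < granted_at + timedelta(minutes=duration_minutes)
    return within_window or extra_lines_left > 0
```

The scheduled function would consult this list before re-creating a block rule for an over-quota application.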

Conclusion

We needed to lower our log collector SaaS costs, and with Azure Functions we were able to moderate usage into a manageable flow.


Most importantly, we now have visibility into which application is costing us the most, and we can work closely with the dev teams to reduce the number of logs they write.

AWS Lambda vs Azure Functions vs Google Cloud Functions

Serverless computing is on the rise, having already earned the mantle of “The Next Big Thing”. For developers, serverless computing means less concern regarding infrastructure when deploying code, as all computing resources are dynamically provisioned by the cloud provider.

Azure pricing is generally on a pay-as-you-use model and is based on resources consumed – which is in line with modern business principles of “on-demand”, flexibility and rapid scaling.

We’ll look at some of the big players in this space, including what to look out for when considering the right partner when it comes to serverless computing for your organization.

The Serverless Landscape

As technology moved from mainframes to PCs, to the appearance of “the Cloud” in the mid-2000s, there has been a move towards increased efficiency, better use of resources, and lower costs.

A decade later, “serverless” entered the mainstream conversation and is now recognized almost universally. The term has been linked to Backend as a Service (BaaS), such as the authentication services offered by providers like Facebook; or Function as a Service (FaaS), where applications with server-side logic are run over stateless containers and completely managed by 3rd party providers.

This popularity has been further served by leading technology companies offering their own implementations: AWS with AWS Lambda since 2014, Microsoft with its Functions architecture for Azure, and of course Google Cloud Functions.

AWS Lambda

AWS Lambda is a serverless computing platform, implemented on top of AWS platforms such as EC2 and S3. AWS Lambda stores and encrypts your code in S3. When a function is requested to run, a “container” is created using your runtime specifications, deployed to one of the EC2 instances in its compute farm, and that function is executed.

When a Lambda function is created, you need to specify things like the runtime environment, memory allocation, roles, and the method to execute. You can build Lambda functions in Node, Java, Python, and C#, and AWS Lambda seamlessly deploys your code, does all the administration, maintenance, and security patches, and provides built-in logging and monitoring through Amazon CloudWatch.

How Lambda works

General positive feedback about Lambda is that it’s simple to set up, pricing is excellent, and it integrates with other internal AWS products such as RDS and Elastic Beanstalk.

When it comes to drawbacks of the solution, criticism has centered on two main areas:

  1. “Cold Start”: Creating a temporary container (that is subsequently destroyed) can take anywhere between 100 milliseconds and 2 minutes, and this delay is referred to as a “cold start”. There are various workarounds to negate this, but it is something important to be aware of.
  2. Computational Restrictions: Being based on temporary containers means that usable memory is limited, so functions requiring a lot of processing cannot be handled by AWS Lambda. Again, workarounds are available, such as using a Step Function.
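One widely used cold-start workaround is a scheduled “keep-warm” ping that invokes the function every few minutes so its container is never destroyed. The event shape below is our own convention (sent by a CloudWatch/EventBridge schedule), not an AWS-defined payload:

```python
def handler(event, context=None):
    """AWS Lambda entry point that short-circuits scheduled keep-warm pings."""
    # {"keep_warm": true} is an illustrative convention for the scheduled ping.
    if isinstance(event, dict) and event.get("keep_warm"):
        return {"warmed": True}  # container stays hot, no real work done
    return do_real_work(event)

def do_real_work(event):
    # Placeholder for the function's actual logic.
    return {"processed": event}

print(handler({"keep_warm": True}))  # {'warmed': True}
```

The ping keeps one container warm; under concurrent load, additional containers can still cold-start.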

Additionally, there is an element of “lock-in”, as choosing to go with AWS invariably means you’ll be integrating (and become reliant on) other AWS tools and products in the Amazon ecosystem.

Security for AWS Lambda is impressive, starting with securing your code’s access to other AWS services through the built-in AWS SDK, and integration with AWS Identity and Access Management (IAM). Code is run within a VPC by default, or you can choose to configure AWS Lambda to access resources behind your own VPC. AWS Lambda is SOC, HIPAA, PCI, ISO compliant.

Pricing is per 100ms your code executes, and the number of times your code is triggered – meaning that you don’t pay anything when your code is not running.

The Lambda free tier includes 1 million free requests per month and 400,000 GB-seconds of compute time per month. After this, it’s $0.20 per 1 million requests, and $0.00001667 for every GB-second used.

Lambda pricing
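To see what these rates mean in practice, here is a small cost estimator using the figures above. The workload numbers in the example are made up:

```python
def lambda_monthly_cost(requests, avg_ms, memory_mb,
                        free_requests=1_000_000, free_gb_s=400_000):
    """Estimate monthly cost at the quoted rates: $0.20 per 1M requests and
    $0.00001667 per GB-second, with duration billed in 100ms increments."""
    billed_seconds = -(-avg_ms // 100) * 0.1           # round up to nearest 100ms
    gb_seconds = requests * billed_seconds * (memory_mb / 1024)
    request_cost = max(requests - free_requests, 0) / 1_000_000 * 0.20
    compute_cost = max(gb_seconds - free_gb_s, 0) * 0.00001667
    return request_cost + compute_cost

# Example (made-up workload): 3M requests/month at 120ms each with 512MB memory.
# Compute stays inside the free tier, so only 2M paid requests are billed.
print(round(lambda_monthly_cost(3_000_000, 120, 512), 2))  # 0.4
```

Note how the free tier absorbs a surprising amount of compute: 300,000 GB-seconds in this example costs nothing.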

Azure Functions

Azure Functions lets you develop serverless applications on Microsoft Azure. Like the other “serverless” solutions, with Microsoft’s Azure, you just need to write the code, without worrying about a whole application or the infrastructure to run it.

Azure Functions

Languages supported include C#, F#, Node.js, Java, and PHP, and like AWS Lambda and Google Cloud Functions, you only pay for the time your code runs.

Advantages of Azure Functions include:

  • Flexible development: code your functions right in the portal or deploy through GitHub, Visual Studio Team Services, and other supported development tools
  • An open-source Functions runtime, available on GitHub
  • Support for NuGet and NPM, so you can use your favorite libraries
  • Integrations with other products in the Microsoft ecosystem

Integrations are impressive, with the following supported: Azure’s Cosmos DB, Event Hubs, Event Grid, Mobile Apps (tables), Notification Hubs, Service Bus (queues and topics), Storage (blob, queues, and tables), GitHub (webhooks) and Twilio (SMS messages).

Like the other solutions, one of the main disadvantages is vendor lock-in; by going the route of Microsoft Azure, you will in many ways be pinning your colors to the Microsoft mast, which is not for everyone.

Security-wise, you can protect HTTP-triggered functions using OAuth providers such as Azure Active Directory, Facebook, Google, Twitter, and Microsoft Account.

There are 2 types of pricing plans:

  1. Consumption plan: You only pay for the time that your code runs
  2. App Service plan: Run your functions just like your web, mobile, and API apps. When you are already using App Service for your other applications, you can run your functions on the same plan at no additional cost

The Consumption plan is billed on a per-second resource consumption and executions basis.

Execution time is at $0.000016 per GB-second, with 400,000 GB-seconds free per month, and Total Executions is billed at $0.20 per million executions, with 1 million executions free per month.

There are also various support plans available (with an additional cost element).

Google Cloud Functions

Google Cloud Functions is Google’s serverless solution for creating event-driven applications.

With Google Cloud Functions, you can create, manage, and deploy Cloud Functions via the Cloud SDK (Gcloud), Cloud Console web interface, and both REST and gRPC APIs, and build and test your functions using a standard Node.js runtime along with your favorite development tools.

Cloud Functions can be deployed from your local machine or from a source repository like GitHub or Bitbucket.


Pricing

Google Cloud Functions pricing is based on the number of requests to your functions and on compute resource consumption, rounded up to the nearest 100 milliseconds, and of course, only while your code is running.

The free tier includes 400,000 GB-seconds and 200,000 GHz-seconds of compute time.

Google Cloud Functions pricing

Advantages of Google Cloud Functions include an excellent free offering to get started ($300 free credit during the first year, and 5GB of storage free to use forever after that), easy integration with other Google Cloud Services like Kubernetes Engine, App Engine or Compute Engine; and detailed and well-managed documentation.

Criticisms of Google’s offering have included high support fees, a confusing interface, and higher (and more complex) pricing.

Serverless Simplicity

Going serverless has a number of advantages, including reduced complexity, lowering administrative overhead, cutting server costs, reduced time to market, quicker software releases, and developers not having to focus on server maintenance, among others. For some, it’s a no-brainer.

When it comes to which solution to go with, particularly when it comes to AWS Lambda, Azure Functions, and Google Cloud Functions, the answer is less obvious.

Each has its own advantages and quirks, and each one will try to tie you into its ecosystem. Overall, it seems that Google is lagging behind from a features perspective, and that while Azure offers a solid solution, AWS Lambda, the oldest on the block, offers a more complete product.

The choice is yours, as we look forward to many more exciting developments in this space.

What We Learned About Enterprise Cloud Services From the 2021 Azure Outage

AWS, GCP and Azure cloud services are invaluable to their enterprise customers. When providers like Microsoft are hit with DNS issues or other errors that lead to downtime, it has huge ramifications for their users. The recent Azure cloud services outage was a good example of that.

In this post, we’ll look at that outage and examine what it can teach us about enterprise cloud services and how we can reduce risk for our own applications. 

The risks of single-supplier reliance and vendor lock-in

Cloud services have gone from cutting-edge to a workplace essential in less than two decades, and the providers of those cloud services have become vital to business continuity.

Microsoft, Amazon, and Google are known as the Big 3 when it comes to cloud services. They’re no longer seen as optional; rather, they’re the tools that make modern enterprise possible. Whether it’s simply external storage or an entire IaaS, if these services were removed, the damage to business-grade cloud service users would be catastrophic.

Reliance on a single cloud provider has left many businesses vulnerable. Any disruption or downtime to a Big 3 cloud services provider can be a major event from which an organization doesn’t recover. Vendor lock-in is compromising data security for many companies.

It’s not difficult to see why many enterprises, both SMEs and blue-chip, are turning to 3rd party platforms to free themselves from the risks of Big 3 reliance and vendor lock-in.

What is the most reliable cloud service vendor?

While the capabilities enabled by cloud computing have revolutionized what is possible for businesses in the 21st century, it’s not a stretch to say that we’ve now reached a point of reliance on them. Nothing is too big to fail. No matter which of the Big 3 hosts your business-critical functions, a contingency plan for their failure should always be based on when rather than if.

The ‘Big 3’ cloud providers (Microsoft with Azure, Amazon with AWS, and Google’s GCP) each support so many businesses that any service disruption causes economic ripples that are felt at a global level. None of them is immune to disruption or outages.

Many business leaders see this risk. The issue they face isn’t deciding whether or not to mitigate it, but finding an alternative to the functions and hosted services their business cannot operate without.

Once they find a trusted 3rd party platform that can fulfill these capabilities (or, in many cases, exceed them) the decision to reinvest becomes an easy one to make. If reliability is your key concern, a 3rd party platform built across the entire public cloud ecosystem (bypassing reliance on any single service) is the only logical choice.

Creating resilience-focused infrastructure with a hybrid cloud solution

Hybrid cloud infrastructures are one solution to vendor lock-in that vastly increases the resilience of your infrastructure. 

By segmenting your infrastructure and keeping core business-critical functions in a private cloud environment you reduce vulnerability when one of the Big 3 public cloud providers experiences an outage. 

Azure, AWS, and GCP each offer highly valuable services to give your organization a competitive edge. With a 3rd party hybrid solution, these public cloud functions can be employed without leaving your entire infrastructure at risk during provider-wide downtime. 

When the cloud fails – the 2021 Azure outages

This has been demonstrated in 2021 by a string of service-wide Azure outages. The largest of these was on April 1st, 2021. A surge in DNS requests triggered a previously unknown code defect in Microsoft’s internal DNS service. Services like Azure Portal, Azure Services, Dynamics 365, and even Xbox Live were inaccessible for nearly an hour.

Whilst even the technically illiterate know the name Microsoft, Azure is a name that many unfamiliar with IT and the cloud may not even be aware of. The only reason the Azure outage reached the attention of non-IT-focused media was its impact on common consumer services like Microsoft Office, Xbox Live, Outlook, and OneDrive. An hour without these Microsoft home-user mainstays was frustrating for users and damaging for the Microsoft brand, but hardly a cause for alarm.

For Microsoft’s business customers, however, an hour without Azure-backed functionality had a massive impact. It may not seem like a long time, but for many high data volume Azure business and enterprise customers, an hour of no-service is a huge disruption to business continuity.

Businesses affected were suddenly all too aware of just how vulnerable relying on Azure services and functions alone had made them. An error in DNS code at Microsoft HQ had left their sites and services inaccessible to both frustrated customers and the staff trying to control an uncontrollable situation.

Understanding the impact of the Azure outage

Understanding the impact of the Azure outages requires a perspective on how many businesses rely on Azure enterprise and business cloud services. According to Microsoft’s website, 95% of Fortune 500 companies ‘trust their business on Azure’.

There are currently over 280,000 companies registered as using Microsoft Azure directly. That’s before taking into account the companies that indirectly rely on Azure through other Microsoft services such as Dynamics 365 and OneDrive. Azure represents over 18% of the cloud infrastructure and services market, bringing Microsoft $13.0 billion in revenue during 2021 Q1.

Suffice it to say, Microsoft’s Azure services have significant market penetration across the board. Azure business and enterprise customers rely on the platform for an incredibly wide range of products, services, and solutions. Every one of these serves a business-critical function.

During the Azure outage, over a quarter of a million businesses were cut off from these functions. When the most common Azure services include the security of business-critical data, storage of vital workflow and process documentation, and IT systems observability, it’s easy to see why the Azure outage has hundreds of businesses considering 3rd party cloud platforms.

It’s not only Azure

Whilst Azure is the most recent of the Big 3 to experience a highly impactful service outage, the solution isn’t as simple as migrating to AWS or GCP. Amazon’s and Google’s cloud offerings have historically been just as prone to failure as Microsoft’s.

In November 2020 a large AWS outage rendered hundreds of websites and services offline. What caused the problem? A single Amazon service (Kinesis) responded badly to a capacity upgrade. The situation then avalanched out of control, leading many to reconsider their dependency on cloud providers. 

Almost exactly a year before, in November 2019, Google’s GCP services also experienced a major global outage. Whilst GCP’s market reach isn’t as large as its competitors’ (GCP held 7% market share in 2020, compared to AWS’s 32% and Azure’s 19%), many business-critical tools such as Kubernetes were taken offline. More recently, in April 2021, many GCP-hosted Google services such as Google Docs and Drive were taken offline by a string of errors during a back-end database migration.

The key takeaway here is that, regardless of vendor choice, any cloud-based services used by your business will experience vendor-induced downtime. As the common cyber-security idiom goes, it’s not if but when. 

Beating vendor lock-in with 3rd party platforms

Whilst there is no way to completely avoid the impact of an industry giant like Microsoft or Amazon experiencing an outage, you can protect your most vital business-critical functions by utilizing a cross-vendor 3rd party platform. 

One area where many Azure customers felt the impact of the outage was the loss of system visibility. Many Azure business and enterprise-grade customers rely on some form of Azure-based monitoring or observability service.

During the April 2021 outage, vital system visibility products such as Azure Monitor and Azure API Management were rendered effectively useless. For many organizations using these services, their entire infrastructure went dark. During this time their valuable and business-critical data could have been breached and they’d have lacked the visibility to respond and act.

How Coralogix protects your systems from cloud provider outages

The same was true for AWS customers in November 2020, and GCP ones the year prior. This is why many businesses are opting for a third-party platform like Coralogix to remove the risk of single provider reliance compromising their system visibility and security.

Coralogix is a cross-vendor cloud observability platform. By using our robust platform that draws on functionality from all 3 major cloud providers, our platform users protect their systems and infrastructure from the vulnerabilities of vendor lock-in and service provider outage. 

As a third-party platform, Coralogix covers (and improves upon) many key areas of cloud functionality, including observability, monitoring, security, alerting, developer tools, log analytics, and many more. Coralogix customers have the security of knowing all of these business-critical functions are protected from the impact of the next Big 3 service outage.

What to consider when choosing a cloud provider

These days, it seems like platform and infrastructure services are more available than ever. With 4 large cloud providers (AWS, Azure, GCE, and SoftLayer) and countless others with specialties (DigitalOcean, Packet.net, and more), you’d be swamped with comparison tables, charts, and experiences, enough to make your head spin. We’d like to offer an opinionated list, suggesting what makes every offering worth picking over others. Caveat emptor: this is at best a starting point or an educated bet.

Without further ado, let’s start!

AWS:

The market leader carries with it several benefits: there is a larger talent pool familiar with it than with other clouds, the breadth of services is bewildering, and there are even GPU and FPGA instances.
Even beyond an obvious reliance on GPUs and FPGAs, AWS is a great choice if you don’t mind locking into value-added services in return for much quicker execution. These services range from convenience and workflow to line of business, especially around AI and Amazon Alexa.

Azure:

The runner-up relies on the technology advantages of Microsoft. Microsoft has been putting its might behind its cloud offering, and it shows.
Obviously, if you want to run Windows workloads or use Windows technologies, you will get many integration benefits. .NET together with .NET Core offers a great alternative to Java, with a growing array of open-source libraries to support it.

But even if you don’t, Microsoft’s alliance with Docker and its Windows container support help sweeten the deal when it comes to developing in a Microsoft environment. This, coupled with vast enterprise integration experience, makes for a formidable choice.
Azure would be a great choice if you develop for Windows, integrate with enterprises, Windows services, and environments, or would like to tap into the large .NET talent pool.

GCP:

Google is known for its infrastructure for very good reason, and GCP exposes some (we don’t really know how much) of that richness to potential customers. If you’re willing to go the Google way, you will find some high-performance value-added services at your disposal, such as BigQuery, Pub/Sub, Cloud SQL, and GAE. Expertise in the Google platform is harder to find, but those who are vested in it swear by it.

GCP would be a great choice if you believe in Google’s technologies and their ability to deliver scalability and ease of development and are willing to invest some time in learning the ropes.

SoftLayer:

IBM made its cloud play when it purchased SoftLayer in 2013. IBM has upped its game with offerings such as Bluemix for workflow and, more interestingly, Watson-based technologies for AI and value-added line-of-business services.
If you’re considering a hybrid cloud model or would like to set your sights on Watson’s AI technologies as a driver, IBM SoftLayer would make a fine choice for you.

Digital Ocean:

Digital Ocean is well known for being a VPS host, so it was only natural that they would extend their services into a full-fledged cloud. Their ability to keep pricing down while keeping a loyal user base has been quite remarkable, to say the least.
If you know you’re going to start with a VPS and are comfortable working with it and extending as you go along, Digital Ocean makes for a very prudent choice.

Packet.net

To be honest, if you’re looking into packet.net you will probably do deeper research than this article has to offer. Packet.net is for the hardcore professionals who are willing to trade convenience for the power of bare metal machines and customized networking options.

Packet.net would be a great choice if you have the capability or need to push the pedal to the metal when it comes to server and networking performance, perhaps if you’re doing things like live video streaming.

Summary:

Looking at this market, it seems there are two dominant players: Azure for the enterprise and AWS for all the rest. We chose to run our entire system on fully controlled Docker instances, which gave us the elasticity to serve different clients on different clouds. But many companies rely on PaaS, and for them, AWS provides the most extensive support. Just keep in mind that while IaaS prices are in a race to the bottom, PaaS is how cloud providers increase their revenue, so being bound to a provider through its services or platforms can be pricey.