Using NoSQL Databases as Backend Storage for Grafana

Grafana is a popular way of monitoring and analyzing data. You can use it to build dashboards for visualizing, analyzing, and querying data, and for alerting when that data meets certain conditions.

In this post, we’ll give an overview of integrating data sources with Grafana for visualization and analysis, cover connecting NoSQL systems to Grafana as data sources, and walk through an in-depth example of connecting MongoDB as a Grafana data source.

MongoDB is a document-oriented database and one of the most popular databases for modern apps. It’s classified as a NoSQL system and uses JSON-like documents with flexible schemas. As one of the most popular NoSQL databases around and the go-to tool for millions of developers, it is the example we will focus on first.

General NoSQL via Data Sources

What is a data source?

For Grafana to work with data, that data must first be stored somewhere Grafana can reach. Grafana can work with several different types of databases, and even some systems not primarily designed for data storage can be used.

A Grafana data source is any location from which Grafana can access a repository of data. In other words, data does not need to be logged directly into Grafana in order to be analyzed. Instead, you connect a data source to the Grafana system; Grafana then extracts that data for analysis, deriving insights and performing essential monitoring.

How do you add a data source?

To add a data source in Grafana, hover your mouse over the gear icon (the Configuration menu) and select the Data Sources button:

grafana data sources

Once in that section, you can view all of your connected data sources. Click the Add data source button and you will see a list of officially supported data source types available to be connected:

data source types

Once you’ve selected the data source you want, you will need to set the appropriate parameters such as authorization details, names, URL, etc.:

data source details

Here you can see the Elasticsearch data source, which we will talk about a bit later. Once you have filled in the necessary parameters, hit the Save and Test button:

save and test

Grafana will now establish a connection between that data source and its own system. You’ll get a message letting you know when the connection is complete. Then head to the Dashboards section in Grafana to begin exploring the connected data source’s data.
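
As an aside, Grafana can also pick up data sources from provisioning files instead of the UI, which is handy for automated setups. Here is a minimal sketch of such a file placed under conf/provisioning/datasources/, assuming an Elasticsearch source; the URL, index pattern, and time field are illustrative:

apiVersion: 1
datasources:
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://localhost:9200
    database: "my-logs-*"        # index name or pattern (illustrative)
    jsonData:
      timeField: "@timestamp"    # field Grafana uses for the time axis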

Elasticsearch

Elasticsearch can function as both a logging backend and a document-oriented database. Use it for powerful search engine capabilities or as a NoSQL database that can be connected directly to Grafana.


How to Install Third Party Data Sources

Let’s head back to the screen that appears after you click the Add data source button. When the list of available, officially supported data sources pops up, scroll down to the link that says “Find more data source plugins on Grafana.com”:

more data sources

This link will lead to a page of available plugins (make sure that the plugin type selected in the left-hand menu is Data source):

Plugins that are officially supported will be labeled “by Grafana Labs”, while open-source community plugins will carry the names of their individual developers:

official plugins

Selecting any of the options will take you to a page with details about the plugin and how to install it. After installation, you should see that data source in your list of available data sources in the Grafana UI. If anything is still unclear, there is a more detailed instruction page.

Make a Custom Grafana Data Source

You have the option to build your own data source plugin if there isn’t an appropriate one in the official or community-supported lists. You can write a custom plugin for any database you prefer, as long as it uses the HTTP protocol for client communications. The plugin needs to transform data from the database into time-series data so that Grafana can accurately represent it in its dashboard visualizations.

You need these three components in order to develop a plugin for the data source you wish to use (a rough sketch follows the list):

  • QueryCtrl JavaScript class (lets you edit metrics in dashboard panels)
  • ConfigCtrl JavaScript class (lets users configure the new data source)
  • Data source JavaScript object (handles communication between the data source and the data transformation)
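
As a rough sketch of how those three pieces fit together in the legacy JavaScript plugin format (class names, URLs, and the response shape handled by query() are illustrative; real plugins also ship a plugin.json and HTML partials):

// module.js
import { QueryCtrl } from 'app/plugins/sdk';

class MyDatasource {
  constructor(instanceSettings, $q, backendSrv) {
    this.url = instanceSettings.url;
    this.backendSrv = backendSrv;
  }

  // Panels call this; it must resolve to time-series data shaped like
  // { data: [{ target: 'series name', datapoints: [[value, epochMillis], ...] }] }
  query(options) {
    return this.backendSrv
      .datasourceRequest({ url: this.url + '/query', method: 'POST', data: options })
      .then(response => ({ data: response.data }));
  }

  // Called when the user clicks Save and Test
  testDatasource() {
    return this.backendSrv
      .datasourceRequest({ url: this.url + '/', method: 'GET' })
      .then(() => ({ status: 'success', message: 'Data source is working' }));
  }
}

// Metric editing in dashboard panels
class MyQueryCtrl extends QueryCtrl {}
MyQueryCtrl.templateUrl = 'partials/query.editor.html';

// The data source configuration page
class MyConfigCtrl {}
MyConfigCtrl.templateUrl = 'partials/config.html';

export {
  MyDatasource as Datasource,
  MyQueryCtrl as QueryCtrl,
  MyConfigCtrl as ConfigCtrl,
};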

MongoDB as a Grafana Data Source — The Enterprise Plugin

NoSQL databases handle enormous amounts of information that is vital to application developers, SREs, and executives, who get to see it rendered as real-time dashboards and infographics.

This can make them a natural fit for growing and running businesses optimally. See the plugin description here, entitled MongoDB Datasource by Grafana Labs.

MongoDB was added as a data source for Grafana around the end of 2019 as a regularly maintained plugin.

Setup Overview

Setting Up a New Data Source in Grafana

Make sure to name your data source Prometheus (it serves the scraped Prometheus metrics) so that it is identified by the dashboards’ graphs by default.

set new data source

Configuring Prometheus

By default, Grafana’s dashboards use the native instance tag to sort through each host, so it is best to use a good naming scheme for each of your instances. Here are a few examples:

configure prometheus

The names that you give to each job are not the essential part, but the ‘Prometheus’ dashboard expects the job name Prometheus.
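
A minimal prometheus.yml sketch along those lines (job and instance names, hostnames, and ports are illustrative; 9216 is the usual mongodb_exporter port):

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: mongodb
    static_configs:
      - targets: ['db1.example.com:9216']
        labels:
          instance: 'db1'      # friendly instance name the dashboards can sort by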

Configuring the Exporters

The following are the baseline option sets for the three exporters (an example invocation follows the list):

  • mongodb_exporter: sticking with the default options is good enough.
  • mysqld_exporter:

    -collect.binlog_size=true -collect.info_schema.processlist=true
  • node_exporter: 
    -collectors.enabled="diskstats,filefd,filesystem,loadavg,meminfo,netdev,stat,time,uname,vmstat"
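
For example, launching the three exporters with exactly those options might look like this (binary paths and backgrounding with & are illustrative; in practice you would run them under systemd or a similar service manager):

./mongodb_exporter &
./mysqld_exporter -collect.binlog_size=true -collect.info_schema.processlist=true &
./node_exporter -collectors.enabled="diskstats,filefd,filesystem,loadavg,meminfo,netdev,stat,time,uname,vmstat" &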

Grafana Configuration (only relates to Grafana 4.x or below)

The first edit to the Grafana config is to enable JSON dashboards. Do this by uncommenting the following lines in grafana.ini:

[dashboards.json]

enabled=true

path = /var/lib/grafana/dashboards

If you prefer to import dashboards separately through UI, skip this step and the next two altogether.

Dashboard Installation

Here is a link with the necessary code.

For users of Grafana 4.x or under, run the following:

cp -r grafana-dashboards/dashboards /var/lib/grafana/

For Grafana 5.x or later, create mysqld_export.yml in:

/var/lib/grafana/conf/provisioning/dashboards

with the following content:

dashboard installation
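
For reference, a typical Grafana dashboard provisioning file looks roughly like this (the provider name is illustrative; the path matches the dashboards directory used above):

apiVersion: 1
providers:
  - name: 'percona-dashboards'   # illustrative provider name
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards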

Restarting Grafana:

Finally:

service grafana-server restart

Patch for Grafana 3.x

For users of this version, a patch is needed for your installation in order to make the zoomable graphs accessible.

Updating Instructions

You just need to copy your new dashboards to /var/lib/grafana/dashboards and then restart Grafana. Alternatively, you can re-import them through the UI.
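
For example, assuming the same paths used during installation:

cp -r grafana-dashboards/dashboards /var/lib/grafana/
service grafana-server restart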

What Do the Graphs Look Like?

Here are a few sample graphs.

sample grafana graphs

Benefits of Using a MongoDB Backend Database as a Grafana Data Source

Using the Grafana MongoDB plugin, you can quickly visualize and check on MongoDB data as well as diagnostic metrics.

Diagnose issues and create alerts that let you know ahead of time when adjustments are needed to prevent failures and maintain optimal operations.

For MongoDB diagnostics, monitor:

  • Network: data going in and out, request stats
  • Server connections: total connections, current, available
  • Memory use
  • Authenticated users
  • Database figures: data size, indexes, collections, and so on
  • Connection pool figures: created, available, status, in use

For visualizing and observing MongoDB data:

  • One-line queries: e.g., combine the sample database with find(), as in sample_mflix.movies.find()
  • Quickly detect anomalies in time-series data
  • Neatly gather comprehensive data: see below for an example of visualizing everything about a particular movie such as the plot, reviewers, writers, ratings, poster, and so on:

comprehensive data visualization
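
As a rough illustration of that query style (collection and field names follow MongoDB’s sample_mflix sample dataset; the exact syntax accepted by the Enterprise plugin may differ slightly), an aggregation that counts movies per year would look something like:

sample_mflix.movies.aggregate([
  { "$match": { "year": { "$gt": 2000 } } },
  { "$group": { "_id": "$year", "count": { "$sum": 1 } } },
  { "$sort": { "_id": 1 } }
])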

Grafana has a more detailed article on this here. We’ve only scratched the surface of how you could use this integration.

Advanced Guide to Kibana Timelion

Kibana Timelion is a time-series-based visualization language that enables you to analyze time-series data in a more flexible way compared to the other visualization types that Kibana offers.

Instead of using a visual editor to create visualizations, Timelion uses a combination of chained functions, with a unique syntax, to depict any visualization, as complex as it may be.

The biggest value of using Timelion comes from the fact that you can concatenate any function on any log data. This means that you can plot a combination of functions made of different sets of logs within your index. It’s like creating a Join between uncorrelated logs except you don’t just fetch the entire set of data, but rather visualize their relationship. This is something no other Kibana visualization tool provides.

In this post, we’ll explore the variety of functions that Kibana Timelion supports, their syntax and options, and see some examples.
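
As a quick taste of the syntax before diving into the function list (the index pattern and query are illustrative), the following expression plots the percentage of error logs out of all logs on a single chart:

.es(index=my-logs*, q='status:error').divide(.es(index=my-logs*)).multiply(100).label('% error logs')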

Kibana Timelion Functions

View the list of functions with their relevant arguments and syntaxes here.

Function – What the function does
.abs() – Return the absolute value of each value in the series list. (Chainable)
.add() or .plus() or .sum() – Adds the values of one or more series in a seriesList to each position, in each series, of the input seriesList. (Chainable)
.aggregate() – Creates a static line based on the result of processing all points in the series. (Chainable)
.bars() – Show the seriesList as bars. (Chainable)
.color() – Change the color of the series. (Chainable)
.cusum() – Return the cumulative sum of the series, starting at a base. (Chainable)
.derivative() – Plot the change in values over time. (Chainable)
.divide() – Divides the values of one or more series in a seriesList to each position, in each series, of the input seriesList. (Chainable)
.es() – Pull data from an Elasticsearch instance. (Data Source)
.es(q) – Query in Lucene query string syntax.
.es(metric) – An Elasticsearch metric aggregation: avg, sum, min, max, percentiles or cardinality, followed by a field, e.g., “sum:bytes_sent.numeric”, “percentiles:bytes_sent.numeric:1,25,50,75,95,99” or just “count”. If no metric is specified, the default is count, so it is redundant to use metric=count.
.es(split) – An Elasticsearch field to split the series on, and a limit, e.g., “hostname:10” to get the top 10 hostnames.
.es(index) – Index to query; wildcards accepted. Provide an index pattern name to get scripted fields and field-name type-ahead suggestions for the metric, split, and timefield arguments.
.es(timefield) – Field of type “date” to use for the x-axis.
.es(kibana) – Respect filters on Kibana dashboards. Only has an effect when used on Kibana dashboards.
.es(offset) – Offset the series retrieval by a date expression, e.g., -1M to make events from one month ago appear as if they are happening now. Offset the series relative to the chart’s overall time range by using the value “timerange”, e.g., “timerange:-2” will specify an offset that is twice the overall chart time range to the past.
.es(fit) – Algorithm to use for fitting the series to the target time span and interval.
.fit() – Fill null values using a defined fit function. (Chainable)
.hide() – Hide the series by default. (Chainable)
.holt() – Sample the beginning of a series and use it to forecast what should happen via several optional parameters. In general, this doesn’t really predict the future, but predicts what should be happening right now according to past data, which can be useful for anomaly detection. Note that nulls will be filled with forecasted values. (Chainable)
.holt(alpha) – Smoothing weight from 0 to 1. Increasing alpha will make the new series more closely follow the original. Lowering it will make the series smoother.
.holt(beta) – Trending weight from 0 to 1. Increasing beta will make rising/falling lines continue to rise/fall longer. Lowering it will make the function learn the new trend faster.
.holt(gamma) – Seasonal weight from 0 to 1. Does your data look like a wave? Increasing this will give recent seasons more importance, thus changing the wave form faster. Lowering it will reduce the importance of new seasons, making history more important.
.holt(season) – How long the season is, e.g., 1w if your pattern repeats weekly. (Only useful with gamma.)
.holt(sample) – The number of seasons to sample before starting to “predict” in a seasonal series. (Only useful with gamma. Default: all)
.if() – Compares each point to a number, or to the same point in another series, using an operator, then sets its value to the result if the condition proves true, with an optional else. (Chainable)
.if(operator) – Comparison operator to use; valid operators are eq (equal), ne (not equal), lt (less than), lte (less than or equal), gt (greater than), gte (greater than or equal).
.if(if) – The value to which the point will be compared. If you pass a seriesList here, the first series will be used.
.if(then) – The value the point will be set to if the comparison is true. If you pass a seriesList here, the first series will be used.
.if(else) – The value the point will be set to if the comparison is false. If you pass a seriesList here, the first series will be used.
.label() – Change the label of the series. Use %s to reference the existing label. (Chainable)
.legend() – Set the position and style of the legend on the plot. (Chainable)
.legend(position) – Corner to place the legend in: nw, ne, se, sw. You can also pass false to disable the legend.
.legend(columns) – Number of columns to divide the legend into.
.legend(showTime) – Show the time value in the legend when hovering over the graph. Default: true.
.legend(timeFormat) – moment.js format pattern. Default: MMMM Do YYYY, HH:mm:ss.SSS
.lines() – Show the seriesList as lines. (Chainable)
.lines(fill) – Number between 0 and 10. Use for making area charts.
.lines(width) – Line thickness.
.lines(show) – Show or hide lines.
.lines(stack) – Stack lines, often misleading. At least use some fill if you use this.
.lines(steps) – Show the line as steps, i.e., do not interpolate between points.
.log() – Return the logarithm of each value in the seriesList (default base: 10). (Chainable)
.max() – Maximum values of one or more series in a seriesList at each position, in each series, of the input seriesList. (Chainable)
.min() – Minimum values of one or more series in a seriesList at each position, in each series, of the input seriesList. (Chainable)
.multiply() – Multiplies the values of one or more series in a seriesList into each position, in each series, of the input seriesList. (Chainable)
.mvavg() – Calculate the moving average over a given window. Nice for smoothing noisy series. (Chainable)
.mvstd() – Calculate the moving standard deviation over a given window. Uses a naive two-pass algorithm; rounding errors may become more noticeable with very long series. (Chainable)
.points() – Show the series as points. (Chainable)
.precision() – The number of digits to round the decimal portion of the value to. (Chainable)
.range() – Changes the max and min of the series while keeping the same shape. (Chainable)
.scale_interval() – Changes the scale of a value (usually a sum or a count) to a new interval, for example, a per-second rate. (Chainable)
.static() or .value() – Draws a single value across the chart. (Data Source)
.subtract() – Subtracts the values of one or more series in a seriesList from each position, in each series, of the input seriesList. (Chainable)
.title() – Adds a title to the top of the plot. If called on more than one seriesList, the last call will be used. (Chainable)
.trend() – Draws a trend line using a specified regression algorithm. (Chainable)
.trim() – Set N buckets at the start or end of a series to null to work around the “partial bucket issue”. (Chainable)
.yaxis() – Configures a variety of y-axis options, the most important likely being the ability to add an Nth (e.g., 2nd) y-axis. (Chainable)
.yaxis(color) – Color of the axis label.
.yaxis(label) – Label for the axis.
.yaxis(max) – Max value.
.yaxis(min) – Min value.
.yaxis(position) – left or right.
.yaxis(tickDecimals) – The number of decimal places for the y-axis tick labels.
.yaxis(units) – The function to use for formatting y-axis labels. One of: bits, bits/s, bytes, bytes/s, currency(:ISO 4217 currency code), percent, custom(:prefix:suffix).
.yaxis(yaxis) – The numbered y-axis to plot this series on, e.g., .yaxis(yaxis=2) for the 2nd y-axis. If you are not plotting more than one .es() expression, there is no point in using yaxis=2,3,…


Tips

  • You can enter the Timelion wizard either from the main page when entering Kibana or from the visualizations section. If you enter it from the main screen make sure you choose “save current expression as Kibana dashboard panel” if your goal is to add the Timelion visualization to a Dashboard.
  • Use the index argument in your .es() functions when building your Timelion expressions. With it, any element you add, such as a metric or a field, will have auto-suggestions for names as you start typing. For example, .es(index=*:9466_newlogs*).
  • If you enable the Coralogix Logs2metrics feature and start to collect aggregations of your logs, using .es(index=*:9466_log_metrics*) lets you visualize those metrics. Using both index patterns in the same expression lets you visualize separate data sources like your logs and metrics. No other Kibana visualization gives you such an option. It will look like this: .es(index=*:9466_newlogs*),.es(index=*:9466_log_metrics*).
  • Use the kibana argument, with the value true, in your .es() functions when building Timelion expressions that you plan to integrate with a dashboard, so that they apply the filters in your dashboard. For example, .es(kibana=true).

 

Examples

Cache Status

.es(kibana=true,q='_exists_:cache_status.keyword', split=cache_status.keyword:5).divide(.es(kibana=true,q='_exists_:cache_status.keyword')).multiply(100).label(regex='.*cache_status.keyword:(.*) > .*', label='$1%').lines(show=true,width=1).yaxis(1,min=0,max=100,null,'Upstream Cache Status (%)').legend(columns=5, position=nw).title(title="Cache Status"),
.es(kibana=true,q='_exists_:cache_status.keyword', split=cache_status.keyword:5).divide(.es(kibana=true,q='_exists_:cache_status.keyword')).multiply(100).label(regex='.*cache_status.keyword:(.*) > .*', label='$1 avg').lines(show=true,width=1).yaxis(1,min=0,max=100,null,'Upstream Cache Status (%)').legend(columns=5, position=nw).title(title="Cache Status").aggregate(function=avg)
  1. .es(kibana=true,q='_exists_:cache_status.keyword', split=cache_status.keyword:5) – query under q=; aggregate, per top 5 unique values of the cache_status field, under split=.
  2. .divide() – divide the series by whatever is in the parentheses. In this case, .es(kibana=true,q='_exists_:cache_status.keyword') is the same query as above, but without splitting into different values (i.e., all values together). Together, (1) and (2) provide the percentage of each value.
  3. .multiply(100) – convert 0..1 to 0..100 for an easier view of the percentage.
  4. .label(regex='.*cache_status.keyword:(.*) > .*', label='$1%') – change the label of the legend (match the original value with the regex and create a label from the result); the second series uses label='$1 avg'.
  5. .lines(show=true,width=1) – line styling.
  6. .yaxis(1,min=0,max=100,null,'Upstream Cache Status (%)') – y-axis styling; sets the min and max as constants for the percentage.
  7. .legend(columns=5, position=nw) – sets 5 columns for the legend, as we have 5 splits in the above series, and places it in the northwest corner.
  8. .title(title="Cache Status") – sets a title for the graph.
  9. .aggregate(function=avg) – averages each of the different series (over the whole query timeframe).

Using two series, one for 1 through 8 (without averaging) and another series for 1 through 9 (with averaging), we can get each of the series along with its average in the same graph.

Response Size

.es(kibana=true,q='_exists_:response.header_size', metric=sum:response.header_size.numeric,split=request.protocol.keyword:5).add(.es(kibana=true,q='_exists_:response.body_size', metric=sum:response.body_size.numeric,split=request.protocol.keyword:5)).divide(1024).label(regex='.*request.protocol.keyword:(.*) > .*', label='$1').lines(width=1.4,fill=0.5).legend(columns=5, position=nw).title(title="Total response size (KB) by protocol")
  1. .es(kibana=true,q='_exists_:response.header_size', metric=sum:response.header_size.numeric,split=request.protocol.keyword:5) – query under q=; the metric to use (in this case sum of response header size) under metric=; aggregate, per top 5 unique values of request.protocol, under split=
  2. .add() – adding to the series whatever is in parenthesis. In this case – .es(kibana=true,q='_exists_:response.body_size', metric=sum:response.body_size.numeric,split=request.protocol.keyword:5) is the same query above except we are summing the response body size rather than the header size. Together, (1) and (2) provide the total number of bytes for the response.
  3. .divide(1024) – convert bytes to Kilobytes.
  4. .label(regex='.*request.protocol.keyword:(.*) > .*', label='$1') – change the label of the legend (match the original value with the regex, create a label with the result).
  5. .lines(width=1.4,fill=0.5) – line styling.
  6. .legend(columns=5, position=nw) – sets 5 columns for the legend, as we have 5 splits in the above series, and place it at the northwest corner.
  7. .title(title="Total response size (KB) by protocol") – set a title for the graph.

Using the .add() function with the same .es() function for the response body size, we can get the full response size even though our log data includes the size of the header and size of the body separately.

Bytes Sent

.es(metric="percentiles:bytes_sent.numeric:5,25,50,75,95,99").log().lines(width=0.9,steps=true).label(regex="q:* > percentiles([^.]+.numeric):([0-9]+).*",label="bytes sent $1th percentile").legend(columns=3,position=nw,timeFormat="dddd, MMMM Do YYYY, h:mm:ss a").title(title="Bytes sent percentiles, log scale")
  1. .es(metric="percentiles:bytes_sent.numeric:5,25,50,75,95,99") – metric to use (in this case percentiles of bytes sent) under metric=
  2. .log() – calculating log (with base 10 if not specified otherwise) for y-axis values of our expression.
  3. .label(regex="q:* > percentiles([^.]+.numeric):([0-9]+).*",label="bytes sent $1th percentile") – change the label of the legend (match the original value with the regex, create a label with the result).
  4. .lines(width=0.9,steps=true) – line styling.
  5. .legend(columns=3,position=nw,timeFormat="dddd, MMMM Do YYYY, h:mm:ss a") – sets 3 columns for the legend and place it at the northwest corner.
  6. .title(title="Bytes sent percentiles, log scale") – set a title for the graph.

High severity logs

.es(q="coralogix.metadata.severity:(5 OR 6)",split=coralogix.metadata.subsystemName:5).lines(width=1.3,fill=2).label(regex=".*subsystemName:(.*) >.*",label="high severity logs count from subsystem $1").title(title="High severity logs count VS moving average per top 5 subsystems").legend(columns=2,position=nw),
.es(q="coralogix.metadata.severity:(5 OR 6)",split=coralogix.metadata.subsystemName:5).lines(width=1.3,fill=2).label(regex=".*subsystemName:(.*) >.*",label="high severity logs moving average from subsystem $1").mvavg(window=10,position=right)
  1. .es(q="coralogix.metadata.severity:(5 OR 6)",split=coralogix.metadata.subsystemName:5) – query under q=; aggregate, per top 5 subsystems, under split=
  2. .label(regex=".*subsystemName:(.*) >.*",label="high severity logs count from subsystem $1") – change the label of the legend for the 1st series (match the original value with the regex, create a label with the result).
  3. .label(regex=".*subsystemName:(.*) >.*",label="high severity logs moving average from subsystem $1") – change the label of the legend for the 2nd series (match the original value with the regex, create a label with the result).
  4. .lines(width=1.3,fill=2) – line styling.
  5. .legend(columns=2, position=nw) – sets 2 columns for the legend, and place it at the northwest corner.
  6. .title(title="High severity logs count VS moving average") – set a title for the graph.
  7. .mvavg(window=10,position=right) – computes the moving average, sliding window of 10 points to the right of each computation point, for each of the different subsystems.

Using two series, where the second series is the moving average of the first, we can get each series along with its moving average in the same graph.

5xx responses benchmark

.es(q="status_code.numeric:[500 TO 599]").if(operator=lt,if=3000,then=.es(q="status_code.numeric:[500 TO 599]")).color(color=red).points(symbol=circle,radius=2).label(label="5xx status log count above 3000"),
.es(q="status_code.numeric:[500 TO 599]",offset=-1w).if(operator=lt,if=3000,then=.es(q="status_code.numeric:[500 TO 599]")).color(color=purple).points(symbol=circle,radius=2).label(label="5xx status log count above 3000 a week ago"),
.es(q="status_code.numeric:[500 TO 599]").if(operator=gte,if=3000,then=null).color(color=blue).points(symbol=circle,radius=2).label(label="5xx status log count under 3000"),
.es(q="status_code.numeric:[500 TO 599]",offset=-1w).if(operator=gte,if=3000,then=null).color(color=green).points(symbol=circle,radius=2).label(label="5xx status log count under 3000 a week ago").title(title="5xx log count benchmark")
  1. .es(q="status_code.numeric:[500 TO 599]") – query under q=;
  2. .es(q="status_code.numeric:[500 TO 599]",offset=-1w) – query with an offset of 1w.
  3. .if(operator=lt,if=3000,then=.es(q="status_code.numeric:[500 TO 599]")) – when chained to a .es() series, each point from the origin series is compared with the value under the ‘if’ Argument. According to the operator (in this case, lt=less than) if the result is true, the value is set to the value under the ‘then’ Argument for each point of the series
  4. .if(operator=gte,if=3000,then=null) – opposite to [3].
  5. .color(color=red) – color styling.
  6. .points(symbol=circle,radius=2) – setting the result to be presented with dots (circle signs) instead of a line.
  7. .label(label="5xx status log count above 3000") – change the label of the legend.
  8. .title(title="5xx log count benchmark") – set a title for the graph.

Using four series (two pairs of the same expressions, with the second pair offset by a week, and different colors above and below a certain threshold) gives us a nice benchmark comparing our 5xx status with a week earlier.

Need help? Check our website and in-app chat for quick advice from our product specialists.

How to get the most out of your ELB logs

What is ELB

Amazon ELB (Elastic Load Balancing) allows you to make your applications highly available by using health checks and intelligently distributing traffic across a number of instances. It distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, IP addresses, and Lambda functions. You might have heard the terms CLB, ALB, and NLB; all of them are types of load balancers under the ELB umbrella.

Types of Load Balancers

  • CLB: Classic Load Balancer is the previous generation of EC2 load balancer
  • ALB: Application Load Balancer is designed for web applications
  • NLB: Network Load Balancer operates at the network layer

This article will focus on ELB logs; you can get more in-depth information about ELB itself in this post.

ELB Logs

Elastic Load Balancing provides access logs that capture detailed information about requests sent to your load balancer. Each ELB log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses.

ELB logs structure

Because of the evolution of ELB, the documentation can be a bit confusing. Not surprisingly, there are three variations of the AWS ELB access logs: ALB, NLB, and CLB. We need to rely on the documentation header to understand which variant of the logs it describes (the URL and body will usually reference ELB generically).

How to collect ELB logs

The ELB access logging monitoring capability is integrated with Coralogix. The logs can be easily collected and sent straight to the Coralogix log management solution. 

ALB Log Example

This is an example of a parsed ALB HTTP entry log:

{
 "type":"http",
 "timestamp":"2018-07-02T22:23:00.186641Z",
 "elb":"app/my-loadbalancer/50dc6c495c0c9188",
 "client_addr":"192.168.131.39",
 "client_port":"2817",
 "target_addr":"110.8.13.9",
 "target_port":"80",
 "request_processing_time":"0.000",
 "target_processing_time":"0.001",
 "response_processing_time":"0.000",
 "elb_status_code":"200",
 "target_status_code":"200",
 "received_bytes":"34",
 "sent_bytes":"366",
 "request":"GET https://www.example.com:80/ HTTP/1.1",
 "user_agent":"curl/7.46.0",
 "Ssl_cipher":"-",
 "ssl_protocol":"-",
 "target_group_arn":"arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067",
 "trace_id":"Root=1-58337262-36d228ad5d99923122bbe354",
 "domain_name":"-",
 "chosen_cert_arn":"-",
 "matched_rule_priority":"0",
 "request_creation_time":"2018-07-02T22:22:48.364000Z",
 "Actions_executed":"forward",
 "redirect_url":"-",
 "error_reason":"-",
 "target_port_list":"80",
 "target_status_code_list":"200"
}

Note that if you compare this log to the AWS log syntax table, we split the client address and port and target address and port into four different fields to make it easier.

CLB Log Example

This is an example of a parsed HTTPS CLB log:

{
 "timestamp":"2018-07-02T22:23:00.186641Z",
 "elb":"app/my-loadbalancer/50dc6c495c0c9188",
 "client_addr":"192.168.131.39",
 "client_port":"2817",
 "target_addr":"10.0.0.1",
 "target_port":"80",
 "request_processing_time":"0.001",
 "backend_processing_time":"0.021",
 "response_processing_time":"0.003",
 "elb_status_code":"200",
 "backend_status_code":"200",
 "received_bytes":"0",
 "sent_bytes":"366",
 "request":"GET https://www.example.com:80/ HTTP/1.1",
 "user_agent":"curl/7.46.0",
 "Ssl_cipher":"DHE-RSA_AES128-SHA",
 "ssl_protocol":"TLSv1.2"
}

CLB logs have a subset of the ALB fields. The target is replaced by the listener, and the relevant target field names are renamed to backend.

NLB Log Example

This is an example of an NLB log:

{
 "type":"tls",
 "version":"1.0",
 "timestamp":"2018-07-02T22:23:00.186641Z",
 "elb":"net/my-network-loadbalancer/c6e77e28c25b2234",
 "listener":"g3d4b5e8bb8464cd",
 "client_addr":"192.168.131.39",
 "client_port":"51341",
 "target_addr":"10.0.0.1",
 "target_port":"443",
 "connection_time":"5",
 "tls_handshake_time":"2",
 "received_bytes":"29",
 "sent_bytes":"366",
 "Incoming_tls_alert":"-",
 "chosen_cert_arn":"arn:aws:elasticloadbalancing:us-east-2:123456789012:certificate/2a108f19-aded-46b0-8493-c63eb1ef4a99",
 "chosen_cert_serial":"-",
 "tls_cipher":"ECDHE-RSA_AES128-SHA",
 "tls_protocol_version":"TLSv12",
 "tls_named_group":"-",
 "domain_name":"my-network-loadbalancer-c6e77e28c25b2234.elb.us-east-2.amazonaws.com"
}

ELB Log Parsing

ELB logs contain unstructured data. Using Coralogix parsing rules, you can easily transform the unstructured ELB logs into JSON format to get the full power of Coralogix and the Elastic stack working for you. Parsing rules use regex, and I created the expressions for NLB, ALB-1, ALB-2, and CLB logs. The two ALB regexes cover “normal” ALB logs and the special cases of WAF, Lambda, or failed or partially fulfilled requests; in these cases, AWS assigns the value ‘-’ to the target_addr field with no port. You will also see some time measurements assigned the value -1. Make sure you take this into account in your visualization filters; otherwise, averages and other aggregations could be skewed. Amazon may add fields and change the log structure from time to time, so always check these against your own logs and make changes if needed. The examples should provide a solid foundation.
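
For example, when averaging processing times from ALB logs, a Lucene filter along these lines (the field name follows the parsed examples above, with Coralogix’s .numeric suffix) keeps the -1 placeholders out of the aggregation:

target_processing_time.numeric:[0 TO *]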

The following section requires some familiarity with regular expressions but just skip directly to the examples if you prefer. 

A bit about why the regexes were created the way they were. Naturally, we always want to use a regex that is simple and efficient. At the same time, we should make sure that each rule captures the correct logs in the correct way (correct values matched with the correct fields). Think about a regex that starts with the following expression:

^(?P<timestamp>[^\s]+)\s*

It will work as long as the first field in the unstructured log is a timestamp, as in the case of CLB logs. However, in the case of NLB and ALB logs, the expression will capture the “type” field instead. Since the regex and rule have no context, they will simply place the wrong value in the wrong JSON key. There are other differences that can cause problems, such as different numbers of fields or a different field order. To avoid this problem, we use the fact that NLB logs always start with ‘tls 1.0’, standing for the fields ‘type’ and ‘version’, and that ALB logs start with a ‘type’ field restricted to a few known values (http, https, h2, ws, wss).
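
As an illustration of that anchoring trick, the opening fragments of the NLB and ALB rules could look roughly like this (these are only sketches of the prefixes, not the full parsing expressions):

NLB: ^(?P<type>tls) (?P<version>1\.0) (?P<timestamp>[^\s]+) ...
ALB: ^(?P<type>https?|h2|wss?) (?P<timestamp>[^\s]+) ...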

Note: As explained in the Coralogix rule tutorial, rules are organized by groups and executed by the order they appear within the group. When a rule matches a log, the log will move to the next group without processing the remaining rules within the same group.

Taking all this into account, we should:

  • Create a rule group called ‘Parse ELB’
  • Put the ALB and NLB rules first (these are the rules that look for the specific beginning of the respective logs) in the group

This approach will guarantee that each rule matches with the correct log. Now we are ready for the main part of this post.

In the following examples, we’ll describe how different ELB log fields can be used to indicate operational status. We assume that the logs were parsed into JSON format. The examples rely on the Coralogix alerts engine and on Kibana visualizations, and they also provide additional insights into the different keys and values within the logs. As always, we give you ideas and guidance on how to get more value out of your logs; every business environment is different, and you are encouraged to take these ideas and build on top of them based on the best implementation for your infrastructure and goals. Last but not least, Elastic Load Balancing logs requests on a best-effort basis. The logs should be used to understand the nature of the requests, not as a complete accounting of all requests. In some cases, we will use ‘notify immediately’ alerts, but you should treat ELB logs as a backup and not as the main vehicle for these types of alerts.

Tip: To learn the details of how to create Coralogix alerts you can read this guide.

Alerts

Increase in ELB WAF errors

This alert identifies whether a specific ELB generates more 403 errors than usual. A 403 error results from a request that is blocked by AWS WAF (Web Application Firewall). The alert uses the ‘more than usual’ option. With this option, Coralogix’s ML algorithms will identify normal behavior for every time period. It will trigger an alert if the number of errors is more than normal and above the optional threshold supplied by the user.

Alert Filter:

elb:"app/my-loadbalancer/50dc6c495c0c9188" AND elb_status_code:"403"

Alert Condition: ‘More than usual’. The field elb_status_code can be found across ALB and CLB logs.

 

Outbound Traffic from a Restricted Address

In this example, we use the client_addr field, which contains the IP of the requesting client. The alert will trigger if a request comes from a restricted address. For the purpose of this example, we assume that permitted addresses all fall under the subnet 172.xxx.xxx.xxx, so anything outside it is restricted.

Alert Filter:

NOT client_addr:/172\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/

Note: Client_addr is found across NLB, ALB, and CLB.

Alert Condition: ‘Notify immediately’.

ELB Down

This alert identifies an inactive ELB. It uses the ‘less than’ alert condition. The threshold is set to no logs in 5 minutes. This should be adapted to your specific environment. 

Alert Filter:

elb:"app/my-loadbalancer/50dc6c495c0c9188"

Alert Condition: ‘less than 1 in 5 minutes’

This alert works across NLB, ALB, and CLB.

 

Long Connection Time

Knowing the type of transactions running on a specific ELB, ops would like to be alerted if connection times are unusually long. Here again, the Coralogix ‘more than usual’ alert option will be very handy.

Alert Filter:

connection_time:[2 TO *]

Note: connection_time is specific to NLB logs. You can create similar alerts on any of the time-related fields in any of the logs.

Alert Condition:  ‘more than usual’

 

A Surge in ‘no rule’ Requests

The field ‘matched_rule_priority’ indicates the priority value of the rule that matched the request. The value 0 indicates that no rule was applied and the load balancer resorted to the default. Applying rules to requests is specifically important in highly regulated or secured environments. For such environments, it will be important to identify rule patterns and abnormal behavior. Coralogix has powerful ML algorithms focused on identifying deviation from a normal flow of logs. This alert will notify users if the number of requests not matched with a rule is more than the usual number.

Alert Filter:

matched_rule_priority:0

Note: This is an ALB field.

Alert Condition:  ‘more than usual’

 

No Authentication

In this example, we assume a regulated environment. One of the requirements is that for every ELB request, the load balancer should validate the session, authenticate the user, and add the user information to the request header, as specified by the rule configuration. This sequence of actions is indicated by the value ‘authenticate’ in the ‘actions_executed’ field. The field can include several actions separated by ‘,’. Though ELB doesn’t guarantee that every request will be recorded, it is important to be notified of the existence of such a problem, so we will use the ‘notify immediately’ condition.

Alert Filter:

NOT actions_executed:authenticate

Note: This is an ALB field.

Alert Condition:  ‘notify immediately’

Visualizations

Traffic Type Distribution

Using the ‘type’ field, this visualization shows the distribution of the different request and connection types.

 

Bytes Sent

Summation of the number of bytes sent.

 

Average Request/Response Processing Time

Request processing time is the total time elapsed from the time the load balancer received the request until the time it sent it to a target. Response processing time is the total time elapsed from the time the load balancer received the response header from the target until it started to send the response to the client. In this visualization, we are using Timelion to track the average over time and generate a trend line.

Timelion expression:

.es(q=*,metric=avg:destination.request_processing_time.numeric).label("Avg request processing time").lines(width=3).color(green), .es(q=*,metric=avg:destination.request_processing_time.numeric).trend().lines(width=3).color(green).label("Avg request processing time trend"),.es(q=*,metric=avg:destination.response_processing_time.numeric).label("Avg response processing Time").lines(width=3).color(red), .es(q=*,metric=avg:destination.response_processing_time.numeric).trend().lines(width=3).color(red).label("Avg response processing time trend")

 

Average Response Processing Time by ELB

In this visualization, we show the average response processing time by ELB. We used the horizontal option. See the definition screens.

 

Top Client Requesters for a specific ELB

This table lists the top IP addresses generating requests to specific ELBs. The ELBs are separated by the applicationName metadata field, which is assigned to the load balancer when you configure the integration. We created a Kibana filter that looks only at these two load balancers (a sketch is shown below). You can read about filtering and querying in our tutorial.
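
The filter could look roughly like this (the application names are placeholders for whatever you assigned when configuring the integration):

coralogix.metadata.applicationName:("elb-app-1" OR "elb-app-2")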

 

ELB Status Codes

This is an example showing the status code distribution for the last 24 hours.

You can also create a more dynamic representation showing how the distribution behaves over time.
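
For example, a Timelion expression along these lines (the field name assumes the parsed ALB logs above, with Coralogix’s .keyword suffix) plots the count of each status code over time:

.es(q='_exists_:elb_status_code', split=elb_status_code.keyword:5).label(regex='.*elb_status_code.keyword:(.*) > .*', label='status $1').lines(width=1)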

This blog post covered the different types of services that AWS provides under the ELB umbrella: NLB, ALB, and CLB. We focused on the logs these services generate and their structure, and showed some examples of alerts and visualizations that can help you unlock the value of these logs. Remember that every user is unique, with their own use case and data. Your logs might be customized and configured differently, and you will most likely have your own requirements, so you are encouraged to take the methods and concepts shown here and adapt them to your own needs. If you need help or have any questions, don’t hesitate to reach out to support@coralogixstg.wpengine.com. You can learn more about unlocking the value embedded in AWS ALB logs and other logs in some of our other blog posts.