A Practical Guide to Logstash: Input Plugins

In a previous post, we went through a few input plugins such as the file and TCP/UDP input plugins for collecting data with Logstash. In this post, we will look at a few more useful input plugins: the heartbeat, generator, dead letter queue, HTTP poller, and Twitter input plugins, and see how each of them works.

Setup

First, let’s clone the repository 

sudo git clone https://github.com/2arunpmohan/logstash-input-plugins.git /etc/logstash/conf.d/logstash-input-plugins

This will clone the repository into the /etc/logstash/conf.d/logstash-input-plugins folder.

Heartbeat Input plugin

The heartbeat input plugin is one of the simplest yet most useful input plugins available for Logstash. It gives us an easy way to check whether Logstash is up and running without any issues. Put simply, we can check Logstash’s pulse using this plugin. 

The plugin sends periodic messages to the target Elasticsearch cluster (or any other destination), and we can define the interval at which these messages are sent. 

Let’s look at the configuration for sending a status message every 5 seconds, by going to the following link:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/heartbeat/heartbeat.conf

The configuration looks like this:

Original image link here

You can see two important settings in the “heartbeat” plugin, namely “message” and “interval”. 

The “interval” value is in seconds; with a value of 5, the plugin sends its periodic message every 5 seconds.

The “message” setting defines the string emitted as the health indicator. By default it emits “ok”, but we can give it any value. 
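
A minimal sketch of roughly what heartbeat.conf contains (treat this as an outline; the Elasticsearch hosts and index name are assumptions based on the rest of this post, and the repository file is the authoritative version):

input {
  heartbeat {
    # emit the string "ok" every 5 seconds
    message  => "ok"
    interval => 5
  }
}

output {
  elasticsearch {
    # assumed local, unsecured Elasticsearch; the index name matches the query used below
    hosts => ["localhost:9200"]
    index => "heartbeat"
  }
}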

Let’s run logstash using the above configuration, by typing in the command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/heartbeat/heartbeat.conf

Wait for some time until Logstash has started and has sent a few events, then press CTRL+C to stop it. 

To view a sample document generated by the heartbeat plugin, you can type in the following request: 

curl -XGET "https://127.0.0.1:9200/heartbeat/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "size": 1}'

This request will give the response as:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 44,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "heartbeat",
        "_type" : "_doc",
        "_id" : "_ZOa-3IBngqXVfHU3YTf",
        "_score" : 1.0,
        "_source" : {
          "message" : "ok",
          "@timestamp" : "2020-06-28T15:45:29.966Z",
          "type" : "heartbeat",
          "host" : "es7",
          "@version" : "1"
        }
      }
    ]
  }
}

You can see the field named “message” with the specified string “ok” in the document. 

Now let’s explore two other values that can be given to the “message” setting.

If we set “message” to “epoch”, the plugin will emit the time of the event as an epoch value in a field called “clock”.
Let’s replace “ok” with “epoch” in the “message” setting of our configuration. 

The updated configuration can be found in:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/heartbeat/heartbeat-epoch.conf

The configuration would look like this

Original image link here
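
The input section is the same as before except for the “message” value; a sketch of the relevant part (the output section is assumed to write to the heartbeat-epoch index queried below):

input {
  heartbeat {
    # "epoch" makes the plugin emit the event time as an epoch value in a field named "clock"
    message  => "epoch"
    interval => 5
  }
}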

Let’s run Logstash with this configuration file using the command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/heartbeat/heartbeat-epoch.conf


Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s inspect a single document from the index we just created by typing in the request:

curl -XGET "https://127.0.0.1:9200/heartbeat-epoch/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "size": 1}'

You can see that the document in the result looks like this:

{
        "_index" : "heartbeat-epoch",
        "_type" : "_doc",
        "_id" : "_ZOM_HIBngqXVfHUd6d_",
        "_score" : 1.0,
        "_source" : {
          "host" : "es7",
          "@timestamp" : "2020-06-28T20:09:23.480Z",
          "clock" : 1593374963,
          "type" : "heartbeat",
          "@version" : "1"
        }
      }

In this document, you can see a field called “clock” which holds the time in epoch representation. This is really helpful when we need to measure delays in Logstash ingestion pipelines: subtracting the time an event was generated from the time it was ingested gives us the ingestion delay.  
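
To illustrate that idea, here is a hedged sketch of a ruby filter (not part of the repository configuration) that computes the lag between the heartbeat’s epoch value and the moment Logstash processes the event; the field name ingestion_lag_seconds is made up for this example:

filter {
  ruby {
    # "clock" holds the epoch time at which the heartbeat was generated;
    # subtracting it from the current time gives the ingestion delay in seconds
    code => "event.set('ingestion_lag_seconds', Time.now.to_f - event.get('clock').to_f)"
  }
}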

The “message” setting can also be set to “sequence”, which generates incrementing numbers under the same “clock” field. The Logstash configuration for this can be found at this link: 

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/heartbeat/heartbeat-sequence.conf

The configuration would look like this:

Original image link here 

Let’s run Logstash with this configuration file using the command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/heartbeat/heartbeat-sequence.conf

Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s inspect a few documents from the index which we created now by typing in the request:

curl -XGET "https://127.0.0.1:9200/heartbeat-sequence/_search?pretty=true" -H 'Content-Type: application/json' -d'{ 
  "sort": [
    {
      "@timestamp": {
        "order": "asc"
      }
    }
  ]
}'

In the query, I have asked Elasticsearch to return the documents based on the ascending order of timestamps. 

You can see that the returned documents would look like below:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 45,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "5Db5FnMBfj_gbv8MHEss",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:09.159Z",
          "clock" : 1
        },
        "sort" : [
          1593818289159
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "5zb5FnMBfj_gbv8MKEuq",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:14.023Z",
          "clock" : 2
        },
        "sort" : [
          1593818294023
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "8Db5FnMBfj_gbv8MO0sn",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:19.054Z",
          "clock" : 3
        },
        "sort" : [
          1593818299054
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "Lzb5FnMBfj_gbv8MTkyz",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:24.055Z",
          "clock" : 4
        },
        "sort" : [
          1593818304055
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "NDb5FnMBfj_gbv8MYkwu",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:29.056Z",
          "clock" : 5
        },
        "sort" : [
          1593818309056
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "OTb5FnMBfj_gbv8MdUyz",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:34.056Z",
          "clock" : 6
        },
        "sort" : [
          1593818314056
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "PDb5FnMBfj_gbv8MiUxA",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:39.056Z",
          "clock" : 7
        },
        "sort" : [
          1593818319056
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "Qjb5FnMBfj_gbv8MnEzH",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:44.057Z",
          "clock" : 8
        },
        "sort" : [
          1593818324057
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "RDb5FnMBfj_gbv8MsExQ",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:49.057Z",
          "clock" : 9
        },
        "sort" : [
          1593818329057
        ]
      },
      {
        "_index" : "heartbeat-sequence",
        "_type" : "_doc",
        "_id" : "Rzb5FnMBfj_gbv8MxEwD",
        "_score" : null,
        "_source" : {
          "@timestamp" : "2020-07-03T23:18:54.057Z",
          "clock" : 10
        },
        "sort" : [
          1593818334057
        ]
      }
    ]
  }
}

Here the “clock” field holds incrementing numbers, with one event sent every 5 seconds. This is also helpful for detecting missed events.

Generator Input plugin

Sometimes, for testing or other purposes, it is handy to have specific data inserted on demand. The requirements vary: you might want particular data generated continuously for a long time, or only for a certain number of iterations. Of course, we could write custom scripts or programs to generate such data and send it to a file, or even to Logstash directly, but there is a much simpler way to do this within Logstash itself: the generator input plugin, which generates random or custom log events for us. 

Let’s dive into an example first. 

The configuration for the example can be seen in the link

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/generator/generator.conf

The configuration would look like this:

Original image link here

In the configuration, two JSON documents are given under the “lines” setting, and the “codec” is set to json so that Logstash parses them as JSON.
The “count” parameter is set to 0, which tells Logstash to keep generating events from the values in the “lines” array indefinitely.
If we set any other number for “count”, Logstash will generate each line that many times, as shown in the sketch below. 
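
A sketch of roughly what generator.conf does (the sample line below is built from the document shown later in this section; the repository file contains its own records and may differ in details):

input {
  generator {
    # each entry in "lines" becomes one event; the json codec parses it into fields
    lines => [
      '{"id":1,"first_name":"Ford","last_name":"Tarn","gender":"Male","email":"ftarn0@go.com","ip_address":"112.29.200.6"}'
    ]
    # 0 = keep generating events forever; any other number repeats each line that many times
    count => 0
    codec => "json"
  }
}

output {
  elasticsearch {
    # assumed local Elasticsearch and the index name used in the query below
    hosts => ["localhost:9200"]
    index => "generator"
  }
}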

Let’s run Logstash with this configuration file using the command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/generator/generator.conf

Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s inspect a single document from the index which we created now by typing in the request:

curl -XGET "https://127.0.0.1:9200/generator/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "size": 1}'

You can see that the document in the result looks like this:

{
        "_index" : "generator",
        "_type" : "_doc",
        "_id" : "oDYqF3MBfj_gbv8MqlUB",
        "_score" : 1.0,
        "_source" : {
          "last_name" : "Tarn",
          "sequence" : 0,
          "gender" : "Male",
          "host" : "es7",
          "@version" : "1",
          "@timestamp" : "2020-07-04T00:12:16.363Z",
          "email" : "ftarn0@go.com",
          "id" : 1,
          "ip_address" : "112.29.200.6",
          "first_name" : "Ford"
        }
}

This plugin is very useful when we need to have documents generated for testing. We can give multiple lines and test grok patterns or filters and see how they are getting indexed using this plugin.

Dead letter queue

This is an important and immensely helpful input plugin. So far we have seen events being processed successfully by Logstash. But what happens when Logstash fails to process an event? Normally the event is dropped, and we lose it. 

In many cases it is helpful to collect those documents for further inspection and take corrective action so that such events are not dropped in the future. 

To make this possible, Logstash provides a queue called the dead letter queue, where the dropped events are collected. 

Once events are in the dead letter queue, the dead_letter_queue input plugin lets us process them with another Logstash configuration, make the necessary changes, and index them back into Elasticsearch.

You can see the flow of unprocessed events into the dead letter queue and then on to Elasticsearch in this diagram.

Original image link here 

Now that we have seen what the dead letter queue is in Logstash, let’s move on to a simple use case involving the dead_letter_queue input plugin. 

Since we are using the dead_letter_queue input plugin, we need to do two things beforehand.

First, we need to enable the dead letter queue.

Second, we need to set the path where its data is stored. 

Let’s create a folder named dlq to store the dead-letter-queue data by typing in the following command

mkdir /home/student/dlq

Now, Logstash creates a user named “logstash” during the installation and performs the actions using this user. So let’s change the ownership of the above folder to the user named “logstash” by typing in the following command

sudo chown -R logstash:logstash /home/student/dlq

Now let’s enable the dead letter queue and specify its path in Logstash’s settings by editing the settings file:

sudo nano /etc/logstash/logstash.yml

Now uncomment the dead_letter_queue.enable setting and set its value to true.  

Also set the path for the queue data by uncommenting path.dead_letter_queue and setting it to /home/student/dlq.    

The sections of the logstash.yml file will look like this after we have made the necessary edits

Original image link here 
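
That is, the two uncommented lines in logstash.yml should read:

dead_letter_queue.enable: true
path.dead_letter_queue: /home/student/dlq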

To save our file, we press CTRL+X, then press Y, and finally ENTER.

Now let’s input a few documents from a JSON file. 

Let’s have a look at the documents in the file by opening the file using the command

cat /etc/logstash/conf.d/logstash-input-plugins/sample-data.json

Original image link here

You can see that in the first 10 documents the “age” field has integer values, but from the 11th document onward it is mistakenly given as a boolean. Because the documents with integer “age” values are indexed first, Elasticsearch assigns the “age” field the “long” data type. When the 11th document arrives with a boolean value for “age”, Elasticsearch rejects it, since a field cannot hold two different data types, and it does the same for all the remaining documents where “age” is a boolean.

So let’s index the sample data and see what happens. 

For that, let’s run a Logstash configuration file. Let’s have a look at it by going to the following link:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/dead-letter-queue/dlq-data-01-ingest.conf

The configuration file would look like this:
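
In case the image does not load, here is a rough sketch of the idea behind dlq-data-01-ingest.conf (the exact file path and any extra settings are assumptions; the repository file is the authoritative version):

input {
  file {
    # read the sample JSON file from the beginning
    path => "/etc/logstash/conf.d/logstash-input-plugins/dead-letter-queue/sample-data-dlq.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  # parse each raw JSON line in "message" into top-level fields (age, full_name, gender)
  json {
    source => "message"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "dlq-sample-data"
  }
}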

You can type in this command to run Logstash and index the sample data: 

sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f /etc/logstash/conf.d/logstash-input-plugins/dead-letter-queue/dlq-data-01-ingest.conf

In the command we typed, there is an extra parameter called “path.settings” with the value “/etc/logstash”. This tells Logstash where the logstash.yml file is located so that it picks up those settings.

Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s check how many documents were inserted into the index. There were 20 documents in total, and only 10 of them should have been indexed, since the other 10 have the mapping issue. 

curl -XGET "https://127.0.0.1:9200/dlq-sample-data/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "track_total_hits": true, size:0}'

You can see that the response looks like this:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dlq-sample-data",
        "_type" : "_doc",
        "_id" : "G2v1GHMBfj_gbv8MTzae",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2020-07-04T08:33:14.322Z",
          "@version" : "1",
          "gender" : "Female",
          "host" : "es7",
          "message" : """{"age":39,"full_name":"Shelley Bangs","gender":"Female"}""",
          "path" : "/etc/logstash/conf.d/logstash-input-plugins/dead-letter-queue/sample-data-dlq.json",
          "age" : 39,
          "full_name" : "Shelley Bangs"
        }
      }
    ]
  }
}

In the response you can see that hits.total.value is equal to 10. 

The remaining 10 documents were not indexed. They were dropped, and since we have enabled the dead letter queue, they should now be sitting in it.

Let’s use the input plugin to read from the dead letter queue. 

The configuration for reading from the dead_letter_queue can be seen in this link:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/dead-letter-queue/dlq.conf

The configuration looks like this

Original image link here 

In the configuration, we point to the path where the dead letter queue is stored. If we run this configuration, Logstash will read the contents of the dead letter queue, process them, and index them into Elasticsearch. 

For the sake of simplicity, I have not added any filters to the configuration. We can certainly add filters depending on the use case before running it. 
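
A sketch of what dlq.conf essentially does (the Elasticsearch hosts are an assumption; the repository file is the authoritative version):

input {
  dead_letter_queue {
    # the same path we configured in logstash.yml
    path => "/home/student/dlq"
    # left at false here, so the queue is re-read on every run (see the note further below)
    commit_offsets => false
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "dlq-01"
  }
}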

Let’s run Logstash with this configuration file using the command:

sudo /usr/share/logstash/bin/logstash --path.settings /etc/logstash -f /etc/logstash/conf.d/logstash-input-plugins/dead-letter-queue/dlq.conf

Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s inspect the number of documents that were indexed now by typing in the request:

curl -XGET "https://127.0.0.1:9200/dlq-01/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "size": 1}'

The response to the request will return this.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "dlq-01",
        "_type" : "_doc",
        "_id" : "O2thGnMBfj_gbv8MJmR4",
        "_score" : 1.0,
        "_source" : {
          "@timestamp" : "2020-07-04T08:33:14.502Z",
          "full_name" : "Robbyn Narrie",
          "gender" : "Female",
          "@version" : "1",
          "path" : "/etc/logstash/conf.d/logstash-input-plugins/dead-letter-queue/sample-data-dlq.json",
          "host" : "es7",
          "message" : """{"age":false,"full_name":"Robbyn Narrie","gender":"Female"}""",
          "age" : false
        }
      }
    ]
  }
}

You can see that 10 documents were indexed, and the sample document returned is one that has a boolean value in the “age” field. 

Note that with this configuration the queue contents are not cleared after Logstash runs. If you don’t want already-read documents to be indexed again, set the “commit_offsets” flag to true, just under the “path” setting.

Http_poller plugin

The http_poller plugin is a very useful input plugin that ships with Logstash. As the name indicates, it polls HTTP endpoints periodically and lets us store the responses in Elasticsearch. 

This is helpful if our applications expose health APIs that need periodic status monitoring. We can also use it to periodically collect data such as weather or match details. 

To understand the plugin, let’s poll two HTTP APIs. First we will call an external API with a POST request every 5 seconds and store the responses in an index named “http-poller-api”. We will also call Elasticsearch’s cluster health API, which is a GET request, every second, and store the responses in an index named “http-poller-es-health”. 

Let’s first familiarize ourselves with both APIs. 

The first API is a simple online free API, which gives some response when we give a POST request. 

Open the website apitester.com and a web page will open like this. We open this website simply to test one of the APIs we are trying to call.

Original image link here 

Now change the value GET to POST from the dropdown and also click on the “Add Request Header”. The resulting window would look like this:

Original image link here

Now fill the box marked URL with this address: https://jsonplaceholder.typicode.com/posts
In the post data section, add the following JSON:

{ "title": "foo", "body": "bar", "userId": "1"}

And add  “content-type” and “application/json” as the key value pairs for the boxes named “name” and “value” respectively.
The filled form details would look like this: 

Original image link here 

Now press the “Test” button and you can see the response details as below:

Original image link here

What this API does is create a record in the remote database and return the details, including the record id, as the response body. You can also see a fairly long set of response headers along with the response. 
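
If you prefer the terminal, the same request can be made with curl, using exactly the URL, header, and body we filled into the form:

curl -XPOST "https://jsonplaceholder.typicode.com/posts" -H 'Content-Type: application/json' -d '{ "title": "foo", "body": "bar", "userId": "1"}'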

Our second API is a GET endpoint supplied by Elasticsearch itself, which gives us the cluster status. You can test it by typing the following request in the terminal: 

curl -XGET "https://localhost:9200/_cluster/health"


This will result in the following response:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 56,
  "active_shards" : 56,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 50,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 52.83018867924528
}

Now that we are familiar with both API calls, let’s have a look at the Logstash configuration. You can see it at this link:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/http-poller/http-poller.conf

The configuration would look like this:

Original image link here

In the configuration you can see two “http_poller” sections: one named “external_api” and the other named “es_health_status”.
In the “external_api” section, you can see the URL we called in the first example, along with the method and the content-type given in the “headers” section.
Apart from that, I have added the identifier “external-api” under “tags” to mark the responses coming from this section. 

The periodic scheduling, every 5 seconds here, is specified under the “schedule” section.
Another important setting is “metadata_target”, which I have set to “http_poller_metadata”. This ensures that the request and response headers are stored in a field named “http_poller_metadata” when the event is written to Elasticsearch. 

Similar settings are applied to the “es_health_status” section. The differences are that the method is GET, the “schedule” is given a cron expression as its value, and the value of “tags” is “es_health”. 

Coming to the output section, there is a check on the “tags” field. If an event carries the “external-api” tag, it is stored in the “http-poller-api” index; if its “tags” value is “es_health”, it is stored in the index named “http-poller-es-health”. A sketch of the whole configuration follows below. 
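
Putting the pieces described above together, here is a rough sketch of http-poller.conf (the cron expression, the exact option layout, and the Elasticsearch hosts are assumptions; the repository file is the authoritative version):

input {
  http_poller {
    urls => {
      external_api => {
        method => post
        url => "https://jsonplaceholder.typicode.com/posts"
        body => '{ "title": "foo", "body": "bar", "userId": "1"}'
        headers => {
          "content-type" => "application/json"
        }
      }
    }
    tags => ["external-api"]
    # poll every 5 seconds
    schedule => { every => "5s" }
    # store request/response details under this field
    metadata_target => "http_poller_metadata"
    codec => "json"
  }

  http_poller {
    urls => {
      es_health_status => "http://localhost:9200/_cluster/health"
    }
    tags => ["es_health"]
    # cron-style schedule (shown here as an assumed example expression)
    schedule => { cron => "* * * * * UTC" }
    metadata_target => "http_poller_metadata"
    codec => "json"
  }
}

output {
  if "external-api" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "http-poller-api"
    }
  } else if "es_health" in [tags] {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "http-poller-es-health"
    }
  }
}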

Now that we have a good understanding of the Logstash configuration, let’s run it by typing in this command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/http-poller/http-poller.conf

Wait for some time for Logstash to run. Since this is a polling operation, Logstash won’t finish on its own but will keep running, so after about 5 minutes you can stop it by hitting CTRL+C. 

Now let’s check whether we have data in both indices. Let’s first query the “http-poller-api” index:

curl -XGET "https://localhost:9200/http-poller-api/_search?pretty=true" -H 'Content-Type: application/json' -d'{
  "query": {
    "match_all": {
    }
  },
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}'

This will return a single document like this:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 463,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "http-poller-api",
        "_type" : "_doc",
        "_id" : "YGw8InMBfj_gbv8MmFqD",
        "_score" : null,
        "_source" : {
          "userId" : "1",
          "id" : 101,
          "@timestamp" : "2020-07-06T03:47:43.258Z",
          "http_poller_metadata" : {
            "request" : {
              "headers" : {
                "content-type" : "application/json"
              },
              "body" : """{ "title": "foo", "body": "bar", "userId": "1"}""",
              "method" : "post",
              "url" : "https://jsonplaceholder.typicode.com/posts"
            },
            "response_headers" : {
              "server" : "cloudflare",
              "expires" : "-1",
              "x-ratelimit-limit" : "1000",
              "content-length" : "67",
              "access-control-expose-headers" : "Location",
              "access-control-allow-credentials" : "true",
              "x-content-type-options" : "nosniff",
              "via" : "1.1 vegur",
              "pragma" : "no-cache",
              "etag" : "W/"43-e0UvNeXth+6+06UFNnGIVUOlAcw"",
              "date" : "Mon, 06 Jul 2020 03:47:43 GMT",
              "location" : "https://jsonplaceholder.typicode.com/posts/101",
              "cf-ray" : "5ae6588feab20000-SIN",
              "cache-control" : "no-cache",
              "connection" : "keep-alive",
              "content-type" : "application/json; charset=utf-8",
              "x-powered-by" : "Express",
              "x-ratelimit-reset" : "1594007283",
              "cf-cache-status" : "DYNAMIC",
              "expect-ct" : "max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"",
              "vary" : "Origin, X-HTTP-Method-Override, Accept-Encoding",
              "x-ratelimit-remaining" : "991",
              "cf-request-id" : "03c3d5adf3000100d3bf307200000001"
            },
            "code" : 201,
            "name" : "external_api",
            "response_message" : "Created",
            "times_retried" : 0,
            "runtime_seconds" : 0.576409,
            "host" : "es7"
          },
          "@version" : "1",
          "body" : "bar",
          "title" : "foo",
          "tags" : [
            "external-api"
          ]
        },
        "sort" : [
          1594007263258
        ]
      }
    ]
  }
}

In the response, you can see the “http_poller_metadata” field with the necessary details of both the request and response headers. Also there is the field named “id” returned by the server. 

Now, for the “http-poller-es-health” index, let’s run the same kind of query:

curl -XGET "https://localhost:9200/http-poller-es-health/_search?pretty=true" -H 'Content-Type: application/json' -d'{
  "query": {
    "match_all": {
      
    }
  },
  "size": 1,
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}'

This will return a document like this:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 39,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "http-poller-es-health",
        "_type" : "_doc",
        "_id" : "PWw7InMBfj_gbv8M8Fqg",
        "_score" : null,
        "_source" : {
          "number_of_pending_tasks" : 0,
          "task_max_waiting_in_queue_millis" : 0,
          "active_shards_percent_as_number" : 52.83018867924528,
          "status" : "yellow",
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 50,
          "active_primary_shards" : 56,
          "active_shards" : 56,
          "delayed_unassigned_shards" : 0,
          "timed_out" : false,
          "@version" : "1",
          "tags" : [
            "es_health"
          ],
          "cluster_name" : "elasticsearch",
          "@timestamp" : "2020-07-06T03:47:00.279Z",
          "number_of_in_flight_fetch" : 0,
          "number_of_nodes" : 1,
          "http_poller_metadata" : {
            "request" : {
              "headers" : {
                "Accept" : "application/json"
              },
              "method" : "get",
              "url" : "https://localhost:9200/_cluster/health"
            },
            "response_headers" : {
              "content-type" : "application/json; charset=UTF-8"
            },
            "code" : 200,
            "name" : "es_health_status",
            "response_message" : "OK",
            "times_retried" : 0,
            "runtime_seconds" : 0.0057480000000000005,
            "host" : "es7"
          },
          "number_of_data_nodes" : 1
        },
        "sort" : [
          1594007220279
        ]
      }
    ]
  }
}

Here too you can see the same kind of detail.
So, using this input plugin, we can make HTTP calls periodically and index the responses, with the data we need, into a single index or into multiple indices. 

Twitter plugin

Another interesting input plugin provided by Logstash is the Twitter plugin. It allows us to stream Twitter events directly to Elasticsearch or to any other output that Logstash supports. 

Let’s explore the Twitter input plugin and see it in action. 

For this to work, you need to have a Twitter account. If you don’t have one, don’t worry, just go to Twitter and create one using the signup option. 

Now once you have your Twitter account, go to
https://developer.twitter.com/en/apps 

This is the developer portal of Twitter. You can see the screen like this

Original image link here 

Click on the “create an app” button. If you are creating an app for the first time, you will be asked to “apply”, and it might take a day or two to get approval. Once you have the approval, clicking “create an app” will take you to a screen like this:

Original image link here

After you press the “create” button at the bottom, you will be asked once more to confirm the updated terms and conditions:

Original image link here 

Clicking on the “create” button will create the app for us and you will see the screen like this

Original image link here 

On this screen you can see three tabs. Press the “keys and tokens” tab and you will see the following screen, which contains the necessary keys and tokens:

Original image link here

From this screen copy the “api key” and “api secret key” values to a notepad or so.

In the screen, press the “generate” button and the “access token” and “access token secret” will be generated for you like this:

Original image link here

Now copy this information and keep them along with the “api key” and “api secret key” which you copied earlier in the notepad.

Now the information collected in the notepad will have the following details

Original image link here 

Now we have the necessary information to proceed to use the Twitter plugin for Logstash. 

Let’s look directly at the configuration file of this setup in this link:

https://raw.githubusercontent.com/2arunpmohan/logstash-input-plugins/master/twitter/twitter.conf

The configuration would look like this

original image link here 

Now fill in the values we copied to the notepad:

the “api key” goes into “consumer_key”

the “api key secret” goes into “consumer_secret”

the “access token” goes into “oauth_token”

the “access token secret” goes into “oauth_token_secret” 

You can enter any strings in the “keywords” setting to receive the tweets that contain them. Here I have given “covid” and “corona”, which will fetch tweets that contain either “covid” OR “corona” in their body. 

The “full_tweet” value is set to true to get the full information about each tweet as provided by Twitter.
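
A sketch of twitter.conf with placeholders where the credentials go (replace the angle-bracket strings with the values from your notepad; the Elasticsearch hosts and index name are assumptions based on the query below):

input {
  twitter {
    consumer_key       => "<api key>"
    consumer_secret    => "<api key secret>"
    oauth_token        => "<access token>"
    oauth_token_secret => "<access token secret>"
    # stream tweets containing either of these strings
    keywords           => ["covid", "corona"]
    # keep the complete tweet object as returned by Twitter
    full_tweet         => true
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "twitter"
  }
}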

Let’s run Logstash with this configuration file using the command:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-input-plugins/twitter/twitter.conf

Give Logstash some time to start. Once it is running, wait a little while and then hit CTRL+C to stop it.

Let’s inspect a single document from the index which we created now by typing in the request:

curl -XGET "https://127.0.0.1:9200/twitter/_search?pretty=true" -H 'Content-Type: application/json' -d'{  "size": 1}'

You can see that the document in the result looks like this:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 115,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "_doc",
        "_id" : "OZNS-XIBngqXVfHUK0_d",
        "_score" : 1.0,
        "_source" : {
          "quote_count" : 0,
          "reply_count" : 0,
          "lang" : "en",
          "@version" : "1",
          "retweeted" : false,
          "coordinates" : null,
          "text" : "RT @beatbyanalisa: My biggest fear about contracting COVID is not getting sick, it's me unknowingly passing it along to someone else... And...",
          "in_reply_to_screen_name" : null,
          "is_quote_status" : false,
          "geo" : null,
          "favorite_count" : 0,
          "favorited" : false,
          "created_at" : "Sun Jun 28 05:06:43 +0000 2020",
          "retweeted_status" : {
            "quote_count" : 6,
            "reply_count" : 8,
            "lang" : "en",
            "retweeted" : false,
            "coordinates" : null,
            "text" : "My biggest fear about contracting COVID is not getting sick, it's me unknowingly passing it along to someone else..... https://t.co/Hh3mMNUewS",
            "in_reply_to_screen_name" : null,
            "is_quote_status" : false,
            "geo" : null,
            "extended_tweet" : {
              "full_text" : "My biggest fear about contracting COVID is not getting sick, it's me unknowingly passing it along to someone else... And then that someone may not be able to handle it as well as I may be able to. That's what scares me the most.",
              "display_text_range" : [
                0,
                228
              ],
              "entities" : {
                "hashtags" : [ ],
                "symbols" : [ ],
                "user_mentions" : [ ],
                "urls" : [ ]
              }
            },
            "favorite_count" : 87,
            "favorited" : false,
            "created_at" : "Sun Jun 28 01:06:16 +0000 2020",
            "truncated" : true,
            "retweet_count" : 46,
            "in_reply_to_status_id_str" : null,
            "in_reply_to_user_id" : null,
            "place" : null,
            "id" : 1277045619723558914,
            "filter_level" : "low",
            "user" : {
              "followers_count" : 1386,
              "friends_count" : 1032,
              "translator_type" : "none",
              "profile_background_color" : "000000",
              "profile_image_url" : "https://pbs.twimg.com/profile_images/1247544253015867394/wNFFsE6l_normal.jpg",
              "default_profile" : false,
              "lang" : null,
              "profile_background_image_url" : "https://abs.twimg.com/images/themes/theme9/bg.gif",
              "screen_name" : "beatbyanalisa",
              "profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme9/bg.gif",
              "geo_enabled" : true,
              "utc_offset" : null,
              "name" : """""",
              "listed_count" : 24,
              "profile_link_color" : "F76BB5",
              "statuses_count" : 70995,
              "profile_text_color" : "241006",
              "profile_banner_url" : "https://pbs.twimg.com/profile_banners/66204811/1581039038",
              "following" : null,
              "created_at" : "Sun Aug 16 22:09:45 +0000 2009",
              "time_zone" : null,
              "notifications" : null,
              "contributors_enabled" : false,
              "verified" : false,
              "id" : 66204811,
              "favourites_count" : 4986,
              "is_translator" : false,
              "profile_background_tile" : false,
              "profile_sidebar_fill_color" : "C79C68",
              "profile_sidebar_border_color" : "FFFFFF",
              "protected" : false,
              "description" : """• freelance mua • cheese connoisseur • catfish • IG:BeatByAnalisa • DM or email beatbyanalisa@gmail.com for appt or PR inquiries! ✨""",
              "id_str" : "66204811",
              "default_profile_image" : false,
              "profile_use_background_image" : true,
              "location" : "San Antonio, TX",
              "follow_request_sent" : null,
              "url" : "https://www.instagram.com/beatbyanalisa/?hl=en",
              "profile_image_url_https" : "https://pbs.twimg.com/profile_images/1247544253015867394/wNFFsE6l_normal.jpg"
            },
            "in_reply_to_status_id" : null,
            "in_reply_to_user_id_str" : null,
            "source" : """

Twitter for iPhone

""",
            "id_str" : "1277045619723558914",
            "contributors" : null,
            "entities" : {
              "hashtags" : [ ],
              "symbols" : [ ],
              "user_mentions" : [ ],
              "urls" : [
                {
                  "display_url" : "twitter.com/i/web/status/1...",
                  "indices" : [
                    117,
                    140
                  ],
                  "url" : "https://t.co/Hh3mMNUewS",
                  "expanded_url" : "https://twitter.com/i/web/status/1277045619723558914"
                }
              ]
            }
          },
          "truncated" : false,
          "retweet_count" : 0,
          "@timestamp" : "2020-06-28T05:06:43.000Z",
          "in_reply_to_status_id_str" : null,
          "in_reply_to_user_id" : null,
          "place" : null,
          "id" : 1277106130813140993,
          "filter_level" : "low",
          "user" : {
            "followers_count" : 239,
            "friends_count" : 235,
            "translator_type" : "none",
            "profile_background_color" : "F5F8FA",
            "profile_image_url" : "https://pbs.twimg.com/profile_images/1276215312912957440/HBjApFTt_normal.jpg",
            "default_profile" : true,
            "lang" : null,
            "profile_background_image_url" : "",
            "screen_name" : "briannaaamariie",
            "profile_background_image_url_https" : "",
            "geo_enabled" : true,
            "utc_offset" : null,
            "name" : """""",
            "listed_count" : 0,
            "profile_link_color" : "1DA1F2",
            "statuses_count" : 555,
            "profile_text_color" : "333333",
            "profile_banner_url" : "https://pbs.twimg.com/profile_banners/1269087963373342723/1592204564",
            "following" : null,
            "created_at" : "Sat Jun 06 02:09:13 +0000 2020",
            "time_zone" : null,
            "notifications" : null,
            "contributors_enabled" : false,
            "verified" : false,
            "id" : 1269087963373342723,
            "favourites_count" : 1856,
            "is_translator" : false,
            "profile_background_tile" : false,
            "profile_sidebar_fill_color" : "DDEEF6",
            "profile_sidebar_border_color" : "C0DEED",
            "protected" : false,
            "description" : """, ,  ,   , """,
            "id_str" : "1269087963373342723",
            "default_profile_image" : false,
            "profile_use_background_image" : true,
            "location" : "San Antonio, TX",
            "follow_request_sent" : null,
            "url" : "https://instagram.com/briannaaamarieeee",
            "profile_image_url_https" : "https://pbs.twimg.com/profile_images/1276215312912957440/HBjApFTt_normal.jpg"
          },
          "timestamp_ms" : "1593320803727",
          "in_reply_to_status_id" : null,
          "in_reply_to_user_id_str" : null,
          "source" : """

Twitter for iPhone

""",
          "id_str" : "1277106130813140993",
          "contributors" : null,
          "entities" : {
            "hashtags" : [ ],
            "symbols" : [ ],
            "user_mentions" : [
              {
                "name" : """""",
                "id" : 66204811,
                "id_str" : "66204811",
                "indices" : [
                  3,
                  17
                ],
                "screen_name" : "beatbyanalisa"
              }
            ],
            "urls" : [ ]
          }
        }
      }
    ]
  }
}

As you can see, there is a lot of information per tweet. It would be interesting to collect tweets on a topic you are interested in and run some analytics on them.

Cleaning up 

Now let’s clean up the indices we have created in this post. You can keep them if you want. 

curl -XDELETE "http://localhost:9200/heartbeat"

curl -XDELETE "http://localhost:9200/heartbeat-epoch"

curl -XDELETE "http://localhost:9200/heartbeat-sequence"

curl -XDELETE "http://localhost:9200/generator"

curl -XDELETE "http://localhost:9200/dlq-sample-data"

curl -XDELETE "http://localhost:9200/dlq-01"

curl -XDELETE "http://localhost:9200/http-poller-api"

curl -XDELETE "http://localhost:9200/http-poller-es-health"

curl -XDELETE "http://localhost:9200/twitter"

Conclusion

In this post, we have seen some of the most common and helpful input plugins out there. There are many more input plugins in Logstash’s arsenal, but most of them follow similar patterns of the ones we have seen.

A Practical Guide to Logstash: Parsing Common Log Patterns with Grok

In a previous post, we explored the basic concepts behind using Grok patterns with Logstash to parse files. We saw how versatile this combo is and how it can be adapted to process almost anything we want to throw at it. But the first few times you use something, it can be hard to figure out how to configure for your specific use case. Looking at real-world examples can help here, so let’s learn how to use Grok patterns in Logstash to parse common logs we’d often encounter, such as those generated by Nginx, MySQL, Elasticsearch, and others.

First, Some Preparation

We’ll take a look at a lot of example logs and Logstash config files in this post so, if you want to follow along, instead of downloading each one at each step, let’s just copy all of them at once and place them in the “/etc/logstash/conf.d/logstash” directory.

First, install Git if it’s not already installed:

sudo apt update && sudo apt install git

Now let’s download the files and place them in the directory:

sudo git clone https://github.com/coralogix-resources/logstash.git /etc/logstash/conf.d/logstash

NGINX Access Logs

NGINX and Apache are the most popular web servers in the world. So, chances are, we will often have contact with the logs they generate. These logs reveal information about visits to your site like file access requests, NGINX responses to those requests, and information about the actual visitors, including their IP, browser, operating system, etc. This data is helpful for general business intelligence, but also for monitoring for security threats by malicious actors. 

Let’s see how a typical Nginx log is structured.

We’ll open the following link in a web browser https://raw.githubusercontent.com/coralogix-resources/logstash/master/nginx/access.log and then copy the first line. Depending on your monitor’s resolution, the first line might actually be broken into two lines, to fit on the screen (otherwise called “line wrapping”). To avoid any mistakes, here is the exact content of the line we will copy:

73.44.199.53 - - [01/Jun/2020:15:49:10 +0000] "GET /blog/join-in-mongodb/?relatedposts=1 HTTP/1.1" 200 131 "https://www.techstuds.com/blog/join-in-mongodb/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"

Original image link here

Next, let’s open the Grok Debugger Tool at https://grokdebug.herokuapp.com/ to help us out. In the first field, the input section, we’ll paste the previously copied line.

Original image link here

Now let’s have a look at the Logstash config we’ll use to parse our Nginx log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/nginx/nginx-access-final.conf

From here, we’ll copy the Grok pattern from the “match” section. This is the exact string we should copy:

%{IPORHOST:remote_ip} - %{DATA:user_name} \[%{HTTPDATE:access_time}\] "%{WORD:http_method} %{DATA:url} HTTP/%{NUMBER:http_version}" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} "%{DATA:referrer}" "%{DATA:agent}"

Original image link 

We go back to the https://grokdebug.herokuapp.com/ website and paste the Grok pattern in the second field, the pattern section. We’ll also tick the “Named captures only” checkbox and then click the “Go” button.

Note: For every line you copy and paste, make sure there are no empty lines before (or after) the actual text in the pattern field. Depending on how you copy and paste text, sometimes an empty line might get inserted before or after the copied string, which will make the Grok Debugger fail to parse your text. If this happens, just delete the empty line(s).

Original image link here

This tool is useful to test if our Grok patterns work as intended. It makes it convenient to try out new patterns, or modify existing ones and see in advance if they produce the desired results.
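
For reference, here is a minimal sketch of how a pattern like this sits inside a Logstash pipeline. The actual nginx-access-final.conf in the repository also does more (for example setting read_timestamp and parsing the date), so treat this as an outline rather than the exact file:

input {
  file {
    path => "/etc/logstash/conf.d/logstash/nginx/access.log"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  grok {
    # the same pattern we tested in the Grok Debugger
    match => { "message" => '%{IPORHOST:remote_ip} - %{DATA:user_name} \[%{HTTPDATE:access_time}\] "%{WORD:http_method} %{DATA:url} HTTP/%{NUMBER:http_version}" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} "%{DATA:referrer}" "%{DATA:agent}"' }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "nginx-access-logs-02"
  }
}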

Now that we’ve seen that this correctly separates and extracts the data we need, let’s run Logstash with the configuration created specifically to work with the Nginx log file:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/nginx/nginx-access-final.conf

The job should finish in a few seconds. When we notice no more output is generated, we can close Logstash by pressing CTRL+C.

Now let’s see how the log file has been parsed and indexed:

curl -XGET "https://localhost:9200/nginx-access-logs-02/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

We’ll see a response similar to the following:

      {
        "_index" : "nginx-access-logs-02",
        "_type" : "_doc",
        "_id" : "vvhO2XIBB7MjzkVPHJhV",
        "_score" : 0.0,
        "_source" : {
          "access_time" : "01/Jun/2020:15:49:10 +0000",
          "user_name" : "-",
          "url" : "/blog/join-in-mongodb/?relatedposts=1",
          "path" : "/etc/logstash/conf.d/logstash/nginx/access.log",
          "body_sent_bytes" : "131",
          "response_code" : "200",
          "@version" : "1",
          "referrer" : "https://www.techstuds.com/blog/join-in-mongodb/",
          "http_version" : "1.1",
          "read_timestamp" : "2020-06-21T23:54:33.738Z",
          "@timestamp" : "2020-06-21T23:54:33.738Z",
          "agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
          "http_method" : "GET",
          "host" : "coralogix",
          "remote_ip" : "73.44.199.53"
        }

We can see the fields and their associated values neatly extracted by the Grok patterns.

IIS Logs

While we’ll often see Apache and Nginx web servers on Linux, Microsoft Windows includes its own web server, IIS (Internet Information Services). It generates its own logs, which can be helpful for monitoring the state and activity of applications. Let’s learn how to parse logs generated by IIS.

Just as before, we will take a look at the sample log file and extract the first useful line: https://raw.githubusercontent.com/coralogix-resources/logstash/master/iis/u_ex171118-sample.log.

We’ll ignore the first few lines starting with “#” as that is a header, and not actual logged data. The line we’ll extract is the following:

2017-11-18 08:48:20 GET /de adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 301 0 0 624 543 46

Original image link here

Once again, to take a closer look at how our specific Grok patterns will work, we’ll paste our log line into the Grok Debugger Tool, in the first field, the input section.

Original image link here

The config file we’ll use to parse the log can be found at https://raw.githubusercontent.com/coralogix-resources/logstash/master/iis/iis-final-working.conf.

Original image link here 

Once again, let’s copy the Grok pattern within:

%{TIMESTAMP_ISO8601:time} %{WORD:method} %{URIPATH:uri_requested} %{NOTSPACE:query} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:client_ip} %{NOTSPACE:http_version} %{NOTSPACE:user_agent} %{NOTSPACE:cookie} %{URI:referrer_url} %{IPORHOST:host} %{NUMBER:http_status_code} %{NUMBER:protocol_substatus_code} %{NUMBER:win32_status} %{NUMBER:bytes_sent} %{NUMBER:bytes_received} %{NUMBER:time_taken}

…and paste it to the second field in the https://grokdebug.herokuapp.com/ website, the pattern section:

Original Image Link here

Let’s run Logstash and parse this IIS log:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/iis/iis-final-working.conf

As usual, we’ll wait for a few seconds until the job is done and then press CTRL+C to exit the utility.

Let’s look at the parsed data:

curl -XGET "https://localhost:9200/iis-log/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

A response similar to the following shows us that everything is neatly structured in the index.

      {
        "_index" : "iis-log",
        "_type" : "_doc",
        "_id" : "6_i62XIBB7MjzkVPS5xL",
        "_score" : 0.0,
        "_source" : {
          "http_version" : "HTTP/2.0",
          "query" : "adpar=12345&gclid=1234567890",
          "bytes_received" : "543",
          "read_timestamp" : "2020-06-22T01:52:43.628Z",
          "user_agent" : "Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39",
          "uri_requested" : "/de",
          "username" : "-",
          "time_taken" : "46",
          "referrer_url" : "https://www.google.de/",
          "client_ip" : "149.172.138.41",
          "http_status_code" : "301",
          "bytes_sent" : "624",
          "time" : "2017-11-18 08:48:20",
          "cookie" : "-",
          "method" : "GET",
          "@timestamp" : "2017-11-18T06:48:20.000Z",
          "protocol_substatus_code" : "0",
          "win32_status" : "0",
          "port" : "443"
        }

MongoDB Logs

While not as popular as MySQL, the MongoDB database engine still has a fairly significant market share and is used by many leading companies. The MongoDB logs can help us track the database performance and resource utilization to help with troubleshooting and performance tuning. 

Let’s see what a MongoDB log looks like: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mongodb/mongodb.log.

Original image link

We can see fields are structured in a less repetitive and predictable way than in a typical Nginx log.

Let’s copy the first line from the log and paste it into the first field of the Grok Debugger Tool website.

2019-06-25T10:08:01.111+0000 I CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'

Original image link

The config file we will use for Logstash, to parse our log, can be found at https://raw.githubusercontent.com/coralogix-resources/logstash/master/mongodb/mongodb-final.conf.

Original image link

And here is the Grok pattern we need to copy:

%{TIMESTAMP_ISO8601:timestamp}\s+%{NOTSPACE:severity}\s+%{NOTSPACE:component}\s+(?:\[%{DATA:context}\])?\s+%{GREEDYDATA:log_message}

As usual, let’s paste it to the second field in the https://grokdebug.herokuapp.com/ website.

Original image link 

Let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/mongodb/mongodb-final.conf

When the job is done, we press CTRL+C to exit the program and then we can take a look at how the data was parsed:

curl -XGET "https://localhost:9200/mongo-logs-01/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

The output should be similar to the following:

      {
        "_index" : "mongo-logs-01",
        "_type" : "_doc",
        "_id" : "0vjo2XIBB7MjzkVPS6y9",
        "_score" : 0.0,
        "_source" : {
          "log_message" : "Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'",
          "@timestamp" : "2020-06-22T02:42:58.604Z",
          "timestamp" : "2019-06-25T10:08:01.111+0000",
          "context" : "main",
          "component" : "CONTROL",
          "read_timestamp" : "2020-06-22T02:42:58.604Z",
          "@version" : "1",
          "path" : "/etc/logstash/conf.d/logstash/mongodb/mongodb.log",
          "host" : "coralogix",
          "severity" : "I"
        }

User Agent Mapping and IP to Geo Location Mapping in Logs

Very often, when a web browser requests a web page from a web server, it also sends a so-called “user agent”. This can contain information such as the operating system used by a user, the device, the web browser name and version and so on. Obviously, this can be very useful data in certain scenarios. For example, it can help you find out if users of a particular operating system are experiencing issues.

Web servers also log the IP addresses of the visitors. While that’s useful to have in raw logs, those numbers themselves are not always useful to humans. They might be nice to have when trying to debug connectivity issues, or block a class of IPs, but for statistics and charts, it might be more relevant to have the geographic location of each IP, like country/city and so on.

Logstash can “transform” user agents like

Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39

to the actual names of the specific operating system, device and/or browser that was used, and other info which is much more easy to read and understand by humans. Likewise, IP addresses can be transformed to estimated geographical locations. The technical term for these transformations is mapping.

Let’s take a look at an Apache access log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/apache/access_log.

Original image link

We notice IP addresses and user agents all throughout the log. Now let’s see the Logstash config we’ll use to do our mapping magic with this information: https://raw.githubusercontent.com/coralogix-resources/logstash/master/apache/apache-access-enriched.conf.

The interesting entries here can be seen under “useragent” and “geoip“.

Original image link 

In the useragent filter section, we simply instruct Logstash to take the contents of the agent field, process them accordingly, and map them back to the agent field.

In the geoip filter, we instruct Logstash to take the information from the clientip field, process it, and then insert the output in a new field, called geoip.
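
As a rough sketch of what those two filter entries can look like (the linked apache-access-enriched.conf remains the reference; agent and clientip are the field names described above, produced by the Apache access log parsing earlier in the pipeline):

filter {
  useragent {
    # Parse the raw user-agent string and write the structured result
    # (browser name, OS, device, ...) back into the "agent" field.
    source => "agent"
    target => "agent"
  }
  geoip {
    # Look up the client IP and store the estimated geographic location
    # (country, city, coordinates, ...) in a new "geoip" field.
    source => "clientip"
    target => "geoip"
  }
}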

Let’s run Logstash with this config and see what happens:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/apache/apache-access-enriched.conf

We’ll need to wait for a longer period of time for this to be done as there are many more lines the utility has to process (tens of thousands). As usual, when it’s done, we’ll press CTRL+C to exit.

Now let’s explore how this log was parsed and what was inserted to the index:

curl -XGET "https://localhost:9200/apache-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1,
  "track_total_hits": true,
  "query": {
  "bool": {
    "must_not": [
      {
        "term": {
          "tags.keyword": "_grokparsefailure"
        }
      }
    ]
  }
  }
}'

The output will be similar to the following:

      {
        "_index" : "apache-logs",
        "_type" : "_doc",
        "_id" : "4vgC2nIBB7MjzkVPhtPl",
        "_score" : 0.0,
        "_source" : {
          "verb" : "GET",
          "host" : "coralogix",
          "response" : "200",
          "agent" : {
            "name" : "Firefox",
            "build" : "",
            "device" : "Other",
            "os" : "Windows",
            "major" : "34",
            "minor" : "0",
            "os_name" : "Windows"
          },
          "clientip" : "178.150.5.107",
          "ident" : "-",
          "bytes" : "5226",
          "geoip" : {
            "continent_code" : "EU",
            "timezone" : "Europe/Kiev",
            "country_code3" : "UA",
            "country_name" : "Ukraine",
            "location" : {
              "lat" : 50.4547,
              "lon" : 30.5238
            },
            "region_name" : "Kyiv City",
            "city_name" : "Kyiv",
            "country_code2" : "UA",
            "ip" : "178.150.5.107",
            "postal_code" : "04128",
            "longitude" : 30.5238,
            "region_code" : "30",
            "latitude" : 50.4547
          },
          "referrer" : ""-"",
          "auth" : "-",
          "httpversion" : "1.1",
          "read_timestamp" : "2020-06-22T03:11:37.715Z",
          "path" : "/etc/logstash/conf.d/logstash/apache/access_log",
          "@timestamp" : "2017-04-30T19:16:43.000Z",
          "request" : "/wp-login.php",
          "@version" : "1"
        }
      }

Looking good. We can see the newly added geoip and agent fields are very detailed and very easy to read.

Elasticsearch Logs

We explored many log types, but let’s not forget that Elasticsearch generates logs too, which help us troubleshoot issues, such as figuring out why a node hasn’t started. Let’s look at a sample: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_logs/elasticsearch.log.

Original link here

Now, this is slightly different from what we’ve worked with up until now. In all the other logs, each line represented one specific log entry (or message). That meant we could process them line by line and reasonably expect that each logged event is contained within a single line, in its entirety. 

Here, however, we sometimes encounter multi-line log entries. This means that a logged event can span across multiple lines, not just one. Fortunately, though, Elasticsearch clearly signals where a logged event begins and where it ends. It does so by using opening [ and closing ] square brackets. If you see that a line opens a square bracket [ but doesn’t close it on the same line, you know that’s a multi-line log entry and it ends on the line that finally uses the closing square bracket ].

Logstash can easily process these logs by using the multiline input codec.

Let’s take a look at the Logstash config we’ll use here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_logs/es-logs-final.conf.

Original link here

In the codec => multiline section of our config, we define the pattern that instructs Logstash on how to identify multiline log entries. Here, we use a RegEx pattern, but of course, we can also use Grok patterns when we need to.

With negate set to true, the multiline rule is applied to lines that do not match the pattern. By default, negate is set to false, in which case the rule is applied to lines that do match the pattern.

“what” can be assigned a value of “previous” or “next”. For example, if we have a match, negate is set to false, and what has a value of previous, this means that the current matched line belongs to the same event as the previous line.

In a nutshell, what we are doing for our scenario here is telling Logstash that if a line does not start with an opening square bracket [ then the line in the log file is a continuation of the previous line, so these will be grouped in a single event. Logstash will apply a “multiline” tag to such entries, which can be useful for debugging, or other similar purposes if we ever need to know which entry was contained in a single line, and which on multiple lines.
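
Expressed as configuration, the input section looks roughly like this (a sketch, not the exact contents of the linked file; the path is the sample log we cloned, and start_position => "beginning" is assumed so the whole file is read):

input {
  file {
    path => "/etc/logstash/conf.d/logstash/elasticsearch_logs/elasticsearch.log"
    start_position => "beginning"
    codec => multiline {
      # A line that does NOT start with an opening square bracket is a
      # continuation of the previous line, so append it to that event.
      pattern => "^\["
      negate => true
      what => "previous"
    }
  }
}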

In the filter section we use a typical Grok pattern, just like we did many times before, and replace the message field with the parsed content.

Finally, a second Grok pattern will process the content in the message field even further, extracting things like the logged node name, index name, and so on.

Let’s run Logstash and see all of this in action:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/elasticsearch_logs/es-logs-final.conf

After the program does its job, we press CTRL+C to exit.

Logstash has now parsed both single-line events and multiline events. We will now see how useful it can be that multiline events have been tagged appropriately. Because of this tag, we can now search entries that contain only single-line events. We do this by specifying in our cURL request that the matches must_not contain the tags called multiline.

curl -XGET "https://localhost:9200/es-test-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "tags": "multiline"
          }
        }
      ]
    }
  }
}'

The output will look something like this:

      {
        "_index" : "es-test-logs",
        "_type" : "_doc",
        "_id" : "9voa2nIBB7MjzkVP7ULy",
        "_score" : 0.0,
        "_source" : {
          "node" : "node-1",
          "source" : "o.e.x.m.MlDailyMaintenanceService",
          "host" : "coralogix",
          "@timestamp" : "2020-06-22T03:38:16.842Z",
          "@version" : "1",
          "message" : "[node-1] triggering scheduled [ML] maintenance tasks",
          "timestamp" : "2020-06-15T01:30:00,000",
          "short_message" : "triggering scheduled [ML] maintenance tasks",
          "type" : "elasticsearch",
          "severity" : "INFO",
          "path" : "/etc/logstash/conf.d/logstash/elasticsearch_logs/elasticsearch.log"
        }

Now let’s filter only the multiline entries:

curl -XGET "https://localhost:9200/es-test-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tags": "multiline"
          }
        }
      ]
    }
  }
}'

Output should look similar to this:

      {
        "_index" : "es-test-logs",
        "_type" : "_doc",
        "_id" : "Kfoa2nIBB7MjzkVP7UPy",
        "_score" : 0.046520013,
        "_source" : {
          "node" : "node-1",
          "source" : "r.suppressed",
          "host" : "coralogix",
          "@timestamp" : "2020-06-22T03:38:16.968Z",
          "@version" : "1",
          "message" : "[node-1] path: /.kibana/_count, params: {index=.kibana}norg.elasticsearch.action.search.SearchPhaseExecutionException: all shards failedntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]ntat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]ntat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]ntat java.lang.Thread.run(Thread.java:832) [?:?]",
          "timestamp" : "2020-06-15T17:13:35,457",
          "short_message" : "path: /.kibana/_count, params: {index=.kibana}norg.elasticsearch.action.search.SearchPhaseExecutionException: all shards failedntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.0.jar:7.7.0]ntat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]ntat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]ntat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]ntat java.lang.Thread.run(Thread.java:832) [?:?]",
          "type" : "elasticsearch",
          "severity" : "WARN",
          "tags" : [
            "multiline"
          ],
          "path" : "/etc/logstash/conf.d/logstash/elasticsearch_logs/elasticsearch.log"
        }

Elasticsearch Slow Logs

Elasticsearch can also generate another type of log, called slow logs, which are used to optimize search and indexing operations. These are easier to process since they don’t contain multiline messages.

Let’s take a look at a slow log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_slowlogs/es_slowlog.log.

Original image link 

As we did in previous sections, let’s copy the first line and paste it into the first (input) field of the https://grokdebug.herokuapp.com/ website.

[2018-03-13T00:01:09,810][TRACE][index.search.slowlog.query] [node23] [inv_06][1] took[291.9micros], took_millis[0], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[105], source[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}],

Original image link

Now let’s take a look at the Logstash config we’ll use: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_slowlogs/es-slowlog-final.conf.

Original image link

Let’s copy the Grok pattern within this config and paste it to the second (pattern) field of the https://grokdebug.herokuapp.com/ website.

%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:level}\]\[%{HOSTNAME:type}\]%{SPACE}\[%{HOSTNAME:[node_name]}\]%{SPACE}\[%{WORD:[index_name]}\]%{NOTSPACE}%{SPACE}took\[%{NUMBER:took_micro}%{NOTSPACE}\]%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}search_type\[%{WORD:search_type}\]%{NOTSPACE}%{SPACE}total_shards\[%{NUMBER:total_shards}\]%{NOTSPACE}%{SPACE}source%{GREEDYDATA:query}\Z

Original image link 

Now that we saw how this Grok pattern works, let’s run Logstash with our new config file.

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/elasticsearch_slowlogs/es-slowlog-final.conf

As usual, once the parsing is done, we press CTRL+C to exit the application.

Let’s see how the log file was parsed and added to the index:

curl -XGET "https://localhost:9200/es-slow-logs/_search?pretty" -H 'Content-Type: application/json' -d'{  "size": 1}'

The output will look something like this:

      {
        "_index" : "es-slow-logs",
        "_type" : "_doc",
        "_id" : "e-JzvHIBocjiYgvgqO4l",
        "_score" : 1.0,
        "_source" : {
          "total_shards" : "105",
          "message" : """[2018-03-13T00:01:09,810][TRACE][index.search.slowlog.query] [node23] [inv_06][1] took[291.9micros], took_millis[0], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[105], source[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}], """,
          "node_name" : "node23",
          "index_name" : "inv_06",
          "level" : "TRACE",
          "type" : "index.search.slowlog.query",
          "took_micro" : "291.9",
          "timestamp" : "2018-03-13T00:01:09,810",
          "query" : """[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}], """,
          "search_type" : "QUERY_THEN_FETCH"
        }
      }

MySQL Slow Logs

MySQL can also generate slow logs to help with optimization efforts. However, these logged events span multiple lines, so we’ll need to use the multiline codec once again.

Let’s look at a log file: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mysql_slowlogs/mysql-slow.log.

Original image link here

Now let’s look at the Logstash config file: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mysql_slowlogs/mysql-slowlogs.conf.

Original image link

In the multiline codec configuration, we use a Grok pattern. Simply put, we instruct Logstash that if the line doesn’t begin with the “# Time:” string, followed by a timestamp in the TIMESTAMP_ISO8601 format, then this line should be grouped together with previous lines in this event. This makes sense, since all logged events in this slow log begin with that specific timestamp, and then describe what has happened at that time, in the next few lines. Consequently, whenever a new timestamp appears, it signals the end of the current logged event and the beginning of the next.
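
In configuration terms, that rule can be expressed roughly like this (a sketch rather than the verbatim mysql-slowlogs.conf; the path points at the sample slow log we cloned, and start_position => "beginning" is assumed):

input {
  file {
    path => "/etc/logstash/conf.d/logstash/mysql_slowlogs/mysql-slow.log"
    start_position => "beginning"
    codec => multiline {
      # Every slow-log event starts with "# Time: <ISO8601 timestamp>".
      # Lines that do not match are appended to the previous event.
      pattern => "^# Time: %{TIMESTAMP_ISO8601}"
      negate => true
      what => "previous"
    }
  }
}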

Let’s run Logstash with this config:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/mysql_slowlogs/mysql-slowlogs.conf

As always, after the parsing is done, we press CTRL+C to exit the utility.

Let’s look at how the slow log was parsed:

curl -XGET "https://localhost:9200/mysql-slowlogs-01/_search?pretty" -H 'Content-Type: application/json' -d'{

  "size":1,
  "query": {
    "bool": {
    "must_not": [
      {
        "term": {
          "tags.keyword": "_grokparsefailure"
        }
      }
    ]
  }
  }

}'

The output should look like this:

      {
        "_index" : "mysql-slowlogs-01",
        "_type" : "_doc",
        "_id" : "Zfo42nIBB7MjzkVPGUfK",
        "_score" : 0.0,
        "_source" : {
          "tags" : [
            "multiline"
          ],
          "host" : "localhost",
          "user" : "root",
          "lock_time" : "0.000000",
          "timestamp" : "2020-06-03T06:04:09.582225Z",
          "read_timestamp" : "2020-06-22T04:10:08.892Z",
          "message" : " Time: 2020-06-03T06:04:09.582225Z  User@Host: root[root] @ localhost []  Id:     4  Query_time: 3.000192  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 0 SET timestamp=1591164249; SELECT SLEEP(3);",
          "query_time" : "3.000192",
          "rows_examined" : "0",
          "path" : "/etc/logstash/conf.d/logstash/mysql_slowlogs/mysql-slow.log",
          "sql_id" : "4",
          "@version" : "1",
          "rows_sent" : "1",
          "@timestamp" : "2020-06-22T04:10:08.892Z",
          "command" : "SELECT SLEEP(3)"
        }
      }

AWS ELB

AWS Elastic Load Balancer is a popular service that intelligently distributes traffic across a number of instances. ELB provides access logs that capture detailed information about requests sent to your load balancer. Each ELB log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses.

Let’s look at an example of such a log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_elb/elb_logs.log

Original image link here

Once again, let’s copy the first line of this log and paste it into the first (input) field of the https://grokdebug.herokuapp.com/ website.

2020-06-14T17:26:04.805368Z my-clb-1 170.01.01.02:39492 172.31.25.183:5000 0.000032 0.001861 0.000017 200 200 0 13 "GET https://my-clb-1-1798137604.us-east-2.elb.amazonaws.com:80/ HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36" - -

Original image link here

The Logstash config we’ll use is this one: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_elb/aws-elb.conf.

Original image link here 

From this config, we can copy the Grok pattern and paste it into the second (pattern) field of the https://grokdebug.herokuapp.com/ website.

%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:loadbalancer} %{IP:client_ip}:%{NUMBER:client_port} (?:%{IP:backend_ip}:%{NUMBER:backend_port}|-) %{NUMBER:request_processing_time} %{NUMBER:backend_processing_time} %{NUMBER:response_processing_time} (?:%{NUMBER:elb_status_code}|-) (?:%{NUMBER:backend_status_code}|-) %{NUMBER:received_bytes} %{NUMBER:sent_bytes} "(?:%{WORD:verb}|-) (?:%{GREEDYDATA:request}|-) (?:HTTP/%{NUMBER:httpversion}|-( )?)" "%{DATA:userAgent}"( %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol})?

Original image link here
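
For completeness, the output side of such a config typically points the parsed events at the local Elasticsearch instance and names the index we query below. Here is a sketch (the stdout block is an optional addition for watching progress, not necessarily part of the linked file):

output {
  elasticsearch {
    # Index the parsed ELB events into the local Elasticsearch instance.
    hosts => ["localhost:9200"]
    index => "aws-elb-logs"
  }
  # Optionally print each event to the console while Logstash runs.
  stdout { codec => rubydebug }
}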

Let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_elb/aws-elb.conf

We press CTRL+C once it finishes its job and then take a look at the index to see how the log has been parsed:

curl -XGET "https://localhost:9200/aws-elb-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": {
    "bool": {
      "must_not": [
        {
        "term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

The output should look similar to this:

      {
        "_index" : "aws-elb-logs",
        "_type" : "_doc",
        "_id" : "avpQ2nIBB7MjzkVPIEc-",
        "_score" : 0.0,
        "_source" : {
          "request_processing_time" : "0.000032",
          "timestamp" : "2020-06-14T17:26:05.145274Z",
          "sent_bytes" : "232",
          "@version" : "1",
          "userAgent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
          "elb_status_code" : "404",
          "ssl_protocol" : "-",
          "path" : "/etc/logstash/conf.d/logstash/aws_elb/elb_logs.log",
          "response_processing_time" : "0.000016",
          "backend_processing_time" : "0.002003",
          "client_port" : "39492",
          "verb" : "GET",
          "received_bytes" : "0",
          "backend_ip" : "172.31.25.183",
          "backend_status_code" : "404",
          "client_ip" : "170.01.01.02",
          "backend_port" : "5000",
          "host" : "coralogix",
          "loadbalancer" : "my-clb-1",
          "request" : "https://my-clb-1-1798137604.us-east-2.elb.amazonaws.com:80/favicon.ico",
          "ssl_cipher" : "-",
          "httpversion" : "1.1",
          "@timestamp" : "2020-06-22T04:36:23.160Z"
        }
      }

AWS ALB

Amazon also offers an Application Load Balancer that generates its own logs. These are very similar to the ELB logs and we can see an example here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_alb/alb_logs.log.

Original image link here

The config file we will use can be seen here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_alb/aws-alb.conf.

Original image link here

If you want to test things out in the https://grokdebug.herokuapp.com/ website, the input line you can copy and paste into the first field is the following:

h2 2015-11-07T18:45:33.575333Z elb1 195.142.179.105:55857 10.0.2.143:80 0.000025 0.0003 0.000023 200 200 0 3764 "GET https://example.com:80/favicons/favicon-160x160.png HTTP/1.1" "Mozilla/5.0 (Linux; Android 4.4.2; GT-N7100 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36" - - arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354"

And the Grok pattern is:

%{NOTSPACE:request_type} %{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:alb-name} %{NOTSPACE:client}:%{NUMBER:client_port} (?:%{IP:backend_ip}:%{NUMBER:backend_port}|-) %{NUMBER:request_processing_time} %{NUMBER:backend_processing_time} %{NOTSPACE:response_processing_time:float} %{NOTSPACE:elb_status_code} %{NOTSPACE:target_status_code} %{NOTSPACE:received_bytes:float} %{NOTSPACE:sent_bytes:float} %{QUOTEDSTRING:request} %{QUOTEDSTRING:user_agent} %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol} %{NOTSPACE:target_group_arn} %{QUOTEDSTRING:trace_id}

Original image link here

Once again, let’s run Logstash with the new config:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_alb/aws-alb.conf

We’ll press CTRL+C, once it’s done, and then take a look at how the log has been parsed and imported to the index:

curl -XGET "https://localhost:9200/aws-alb-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": {
    "bool": {
      "must_not": [
        {"term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

The output should look something like this:

      {
        "_index" : "aws-alb-logs",
        "_type" : "_doc",
        "_id" : "dvpZ2nIBB7MjzkVPF0ex",
        "_score" : 0.0,
        "_source" : {
          "client" : "78.164.152.56",
          "path" : "/etc/logstash/conf.d/logstash/aws_alb/alb_logs.log",
          "client_port" : "60693",
          "ssl_protocol" : "-",
          "target_group_arn" : "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067",
          "backend_port" : "80",
          "trace_id" : ""Root=1-58337262-36d228ad5d99923122bbe354"",
          "backend_processing_time" : "0.001005",
          "response_processing_time" : 2.6E-5,
          "@timestamp" : "2020-06-22T04:46:09.813Z",
          "@version" : "1",
          "request_processing_time" : "0.000026",
          "received_bytes" : 0.0,
          "sent_bytes" : 33735.0,
          "alb-name" : "elb1",
          "log_timestamp" : "2015-11-07T18:45:33.578479Z",
          "request_type" : "h2",
          "user_agent" : ""Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36"",
          "request" : ""GET https://example.com:80/images/logo/devices.png HTTP/1.1"",
          "elb_status_code" : "200",
          "ssl_cipher" : "-",
          "host" : "coralogix",
          "backend_ip" : "10.0.0.215",
          "target_status_code" : "200"
        }
      }

AWS CloudFront

Amazon’s CloudFront content delivery network generates useful logs that help ensure availability and performance, and that support security audits.

Here is a sample log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_cloudfront/cloudfront_logs.log.

Original image link here

The Logstash config file can be viewed here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_cloudfront/aws-cloudfront.conf.

Original image link

Once again, if you want to test how things work in the https://grokdebug.herokuapp.com/ website, the input line you can copy and paste into the first field is this one:

2020-06-16	11:00:04	MAA50-C2	7486	2409:4073:20a:8398:c85d:cc75:6c7a:be8b	GET	dej1k5scircsp.cloudfront.net	/css/style/style.css	200	https://dej1k5scircsp.cloudfront.net/	Mozilla/5.0%20(X11;%20Linux%20x86_64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/77.0.3865.75%20Safari/537.36	-	-	Miss	P9QcGJ-je6GoPCt-1KqOIgAHr6j05In8FFJK4E8DbZKHFyjp-dDfKw==	dej1k5scircsp.cloudfront.net	http	376	0.102	-	-	-	Miss	HTTP/1.1	-	-	38404	0.102	Miss	text/css	7057	-	-

And the Grok pattern is:

%{DATE:date}[ \t]%{TIME:time}[ \t]%{DATA:x_edge_location}[ \t](?:%{NUMBER:sc_bytes}|-)[ \t]%{IP:c_ip}[ \t]%{WORD:cs_method}[ \t]%{HOSTNAME:cs_host}[ \t]%{NOTSPACE:cs_uri_stem}[ \t]%{NUMBER:sc_status}[ \t]%{GREEDYDATA:referrer}[ \t]%{NOTSPACE:user_agent}[ \t]%{GREEDYDATA:cs_uri_query}[ \t]%{NOTSPACE:cookie}[ \t]%{WORD:x_edge_result_type}[ \t]%{NOTSPACE:x_edge_request_id}[ \t]%{HOSTNAME:x_host_header}[ \t]%{URIPROTO:cs_protocol}[ \t]%{INT:cs_bytes}[ \t]%{NUMBER:time_taken}[ \t]%{NOTSPACE:x_forwarded_for}[ \t]%{NOTSPACE:ssl_protocol}[ \t]%{NOTSPACE:ssl_cipher}[ \t]%{NOTSPACE:x_edge_response_result_type}[ \t]%{NOTSPACE:cs_protocol_version}[ \t]%{NOTSPACE:fle_status}[ \t]%{NOTSPACE:fle_encrypted_fields}[ \t]%{NOTSPACE:c_port}[ \t]%{NOTSPACE:time_to_first_byte}[ \t]%{NOTSPACE:x_edge_detailed_result_type}[ \t]%{NOTSPACE:sc_content_type}[ \t]%{NOTSPACE:sc_content_len}[ \t]%{NOTSPACE:sc_range_start}[ \t]%{NOTSPACE:sc_range_end}

Original image link here

Now let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_cloudfront/aws-cloudfront.conf

As always, we press CTRL+C once it finishes its job.

Once again, let’s take a look at how the log has been parsed and inserted into the index:

curl -XGET "https://localhost:9200/aws-cloudfront-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": [
        {"term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

Part of the output should be similar to the following:

{
        "_index" : "aws-cloudfront-logs",
        "_type" : "_doc",
        "_id" : "Da1s4nIBnKKcJetIb-p9",
        "_score" : 0.0,
        "_source" : {
          "time_to_first_byte" : "0.000",
          "cs_uri_stem" : "/favicon.ico",
          "x_edge_request_id" : "vhpLn3lotn2w4xMOxQg77DfFpeEtvX49mKzz5h7iwNXguHQpxD6QPQ==",
          "sc_bytes" : "910",
          "@version" : "1",
          "cs_host" : "dej1k5scircsp.cloudfront.net",
          "c_ip" : "2409:4073:20a:8398:c85d:cc75:6c7a:be8b",
          "user_agent" : "Mozilla/5.0%20(X11;%20Linux%20x86_64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/77.0.3865.75%20Safari/537.36",
          "sc_range_start" : "-",
          "c_port" : "57406",
          "x_edge_result_type" : "Error",
          "referrer" : "https://dej1k5scircsp.cloudfront.net/",
          "x_edge_location" : "MAA50-C2",
          "path" : "/etc/logstash/conf.d/logstash/aws_cloudfront/cloudfront_logs.log",
          "cs_protocol" : "http",
          "time_taken" : "0.001",
          "x_forwarded_for" : "-",
          "time" : "10:58:07",
          "cookie" : "-",
          "sc_status" : "502",
          "date" : "20-06-16",
          "sc_range_end" : "-",
          "x_edge_detailed_result_type" : "Error",
          "ssl_cipher" : "-",
          "cs_method" : "GET",
          "x_host_header" : "dej1k5scircsp.cloudfront.net",
          "sc_content_len" : "507",
          "ssl_protocol" : "-",
          "fle_status" : "-",
          "@timestamp" : "2020-06-23T18:24:15.784Z",
          "fle_encrypted_fields" : "-",
          "cs_bytes" : "389",
          "x_edge_response_result_type" : "Error",
          "host" : "coralogix",
          "cs_uri_query" : "-",
          "sc_content_type" : "text/html",
          "cs_protocol_version" : "HTTP/1.1"
        }
      }

Cleaning Up

Before continuing with the next lesson, let’s clean up the resources we created here.

First, we’ll delete the directory where we stored our sample log files and Logstash configurations:

sudo rm -r /etc/logstash/conf.d/logstash/

Next, let’s delete all the new indices we created:

curl -XDELETE localhost:9200/nginx-access-logs-02

curl -XDELETE localhost:9200/iis-log

curl -XDELETE localhost:9200/mongo-logs-01

curl -XDELETE localhost:9200/apache-logs

curl -XDELETE localhost:9200/es-test-logs

curl -XDELETE localhost:9200/es-slow-logs

curl -XDELETE localhost:9200/mysql-slowlogs-01

curl -XDELETE localhost:9200/aws-elb-logs

curl -XDELETE localhost:9200/aws-alb-logs

curl -XDELETE localhost:9200/aws-cloudfront-logs

Conclusion

I hope this arsenal of Grok patterns for common log types is useful for most of your future Logstash needs. Keep in mind that if the log you encounter is just slightly different, only slight changes need to be made to these patterns, which you can use as your starting templates.

A Practical Guide to Logstash: Syslog Deep Dive

Syslog is a popular standard for centralizing and formatting log data generated by network devices. It provides a standardized way of generating and collecting log information, such as program errors, notices, warnings, status messages, and so on. Almost all Unix-like operating systems, such as those based on Linux or BSD kernels, use a Syslog daemon that is responsible for collecting log information and storing it. 

They’re usually stored locally, but they can also be streamed to a central server if the administrator wants to be able to access all logs from a single location. By default, port 514 and UDP are used for the transmission of Syslogs. 

Note: It’s recommended to avoid UDP whenever possible, as it doesn’t guarantee that all logs will be sent and received; when the network is unreliable or congested, some messages could get lost in transit.

For more security and reliability, port 6514 is often used with TCP connections and TLS encryption.

In this post, we’ll learn how to collect Syslog messages from our servers and devices with Logstash and send them to Elasticsearch. This will allow us to take advantage of its super-awesome powers of ingesting large volumes of data, which we can then quickly and efficiently search for what we need.

We’ll explore two methods. One involves using the Syslog daemon to send logs through a TCP connection to a central server running Logstash. The other method uses Logstash to monitor log files on each server/device and automatically index messages to Elasticsearch.

Getting Started

Let’s take a look at what typical syslog events look like. These are usually collected locally in a file named /var/log/syslog.

To display the first 10 lines, we’ll type:

sudo head -10 /var/log/syslog

Original image link

Let’s analyze how a syslog line is structured.

Original image link

We can see the line starts with a timestamp, including the month name, day of month, hour, minute and second at which the event was recorded. The next entry is the hostname of the device generating the log. Next is the name of the process that created the log entry, its process ID number, and, finally, the log message itself.

Logs are very useful when we want to monitor the health of our systems or debug errors. But when we have to deal with tens, hundreds, or even thousands of such systems, it’s obviously too complicated to log into each machine and manually look at syslogs. By centralizing all of them into Elasticsearch, it makes it easier to get a birds-eye view over all of the logged events, filter only what we need and quickly spot when a system is misbehaving.

Collecting syslog Data with Logstash

In this post, we’ll explore two methods with which we can get our data into Logstash logs, and ultimately into an Elasticsearch index:

  1. Using the syslog service itself to forward logs to Logstash, via TCP connections.
  2. Configuring Logstash to monitor log files and collect their contents as soon as they appear within those files.

Forwarding Syslog Messages to Logstash via TCP Connections

The syslog daemon has the ability to send all the log events it captures to another device, through a TCP connection. Logstash, on the other hand, has the ability to open up a TCP port and listen for incoming connections, looking for syslog data. Sounds like a perfect match! Let’s see how to make them work together.

For simplicity, we will obviously use the same virtual machine to send the logs and also collect them. But in a real-world scenario, we would configure a separate server with Logstash to listen for incoming connections on a TCP port. Then, we would configure the syslog daemons on all of the other servers to send their logs to the Logstash instance.

Important: In this exercise, we’re configuring the syslog daemon first, and Logstash last, since we want the first captured logged events to be the ones we intentionally generate. But in a real scenario, configure Logstash listening on the TCP port first. This is to ensure that when you later configure the syslog daemons to send their messages, Logstash is ready to ingest them. If Logstash isn’t ready, the log entries sent while you configure it, won’t make it into Elasticsearch.

We will forward our syslogs to TCP port 10514 of the virtual machine. Logstash will listen to port 10514 and collect all messages.

Let’s edit the configuration file of the syslog daemon.

sudo nano /etc/rsyslog.d/50-default.conf

Above the line “#First some standard log files. Log by facility” we’ll add the following:

*.*                         @@127.0.0.1:10514

Original image link here

*.* indicates that all messages should be forwarded. @@ instructs the rsyslog utility to transmit the data through TCP connections.

To save the config file, we press CTRL+X, after which we type Y and finally press ENTER.

We’ll need to restart the syslog daemon (called “rsyslogd”) so that it picks up on our desired changes.

sudo systemctl restart rsyslog.service

If you don’t have a git tool available on your test system, you can install it with:

sudo apt update && sudo apt install git

Now let’s clone the repo which contains the configuration files we’ll use with Logstash.

sudo git clone https://github.com/coralogix-resources/logstash-syslog.git /etc/logstash/conf.d/logstash-syslog

Let’s take a look at the log entries generated by the “systemd” processes.

sudo grep "systemd" /var/log/syslog

Original image link here

We’ll copy one of these lines and paste it to the https://grokdebug.herokuapp.com/ website, in the first field, the input section.

Original image link here

Now, in a new web browser tab, let’s take a look at the following Logstash configuration: https://raw.githubusercontent.com/coralogix-resources/logstash-syslog/master/syslog-tcp-forward.conf.

Original image link

We can see in the highlighted “input” section how we instruct Logstash to listen for incoming connections on TCP port 10514 and look for syslog data.

To test how the Grok pattern we use in this config file matches our syslog lines, let’s copy it

%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}

and then paste it to the https://grokdebug.herokuapp.com/ website, in the second field, the pattern section.

Original image link

We can see every field is perfectly extracted.
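
Putting the pieces together, a pared-down version of that configuration looks something like the sketch below. It is not the verbatim syslog-tcp-forward.conf (see the linked file for the full version), but it captures the essentials described above: listen on TCP port 10514, parse each line with the Grok pattern we just tested, record when and from where the event arrived, and index into syslog-received-on-tcp on the local Elasticsearch:

input {
  tcp {
    # Listen for syslog messages forwarded by rsyslog over TCP.
    port => 10514
    type => "syslog"
  }
}

filter {
  grok {
    match => {
      "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"
    }
    # Record when the event was received and from which host.
    add_field => {
      "received_at" => "%{@timestamp}"
      "received_from" => "%{host}"
    }
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "syslog-received-on-tcp"
  }
}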

Now, let’s run Logstash with this configuration file.

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-syslog/syslog-tcp-forward.conf

Since logs are continuously generated and collected, we won’t stop Logstash this time with CTRL+C. We’ll just leave it running until we see this:

Original image link

Specifically, we’re looking for the “Successfully started Logstash” message.

Let’s leave Logstash running in the background, collecting data. Leave its terminal window open (so you can see it catching syslog events) and open up a second terminal window to enter the next commands.

It’s very likely that at this point no syslog events have been collected yet, since we just started Logstash. Let’s make sure to generate some log entries first. A simple command such as

sudo ls

will ensure we’ll generate a few log messages. We’ll be able to see in the window where Logstash is running that sudo generated some log entries and these have been added to the Elasticsearch index.

Let’s take a look at an indexed log entry.

curl -XGET "https://localhost:9200/syslog-received-on-tcp/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 1}'

The output we’ll get will contain something similar to this:

        {
        "_index" : "syslog-received-on-tcp",
        "_type" : "_doc",
        "_id" : "fWJ7QXMB9gZX17ukIc6D",
        "_score" : 1.0,
        "_source" : {
          "received_at" : "2020-07-12T05:24:14.990Z",
          "syslog_message" : " student : TTY=pts/1 ; PWD=/home/student ; USER=root ; COMMAND=/bin/ls",
          "syslog_timestamp" : "2020-07-12T05:24:14.000Z",
          "message" : "<85>Jul 12 08:24:14 coralogix sudo:  student : TTY=pts/1 ; PWD=/home/student ; USER=root ; COMMAND=/bin/ls",
          "syslog_hostname" : "coralogix",
          "port" : 51432,
          "type" : "syslog",
          "@timestamp" : "2020-07-12T05:24:14.990Z",
          "host" : "localhost",
          "@version" : "1",
          "received_from" : "localhost",
          "syslog_program" : "sudo"
        }

Awesome! Everything worked perfectly. Now let’s test out the other scenario.

Monitoring syslog Files with Logstash

We’ll first need to stop the Logstash process we launched in the previous section. Switch to the terminal where it is running and press CTRL+C to stop it.

Let’s open up this link in a browser and take a look at the Logstash config we’ll use this time: https://raw.githubusercontent.com/coralogix-resources/logstash-syslog/master/logstash-monitoring-syslog.conf.

Original image link

We can see that the important part here is that we tell it to monitor the “/var/log/syslog” file.
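
The corresponding input and output sections boil down to something like this (a sketch; the filter section, which uses the same syslog Grok pattern as before, is omitted here, and the index name matches the one we query below):

input {
  file {
    # Tail the local syslog file and pick up new lines as they appear.
    path => "/var/log/syslog"
    type => "syslog"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "syslog-monitor"
  }
}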

Let’s run Logstash with this config.

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash-syslog/logstash-monitoring-syslog.conf

As usual, we’ll wait until it finishes its job and then press CTRL+C to exit the process.

Let’s see the data that has been parsed.

curl -XGET "https://localhost:9200/syslog-monitor/_search?pretty" -H 'Content-Type: application/json' -d'{"size": 1}'

We will get an output similar to this:

        {
        "_index" : "syslog-monitor",
        "_type" : "_doc",
        "_id" : "kmKYQXMB9gZX17ukC878",
        "_score" : 1.0,
        "_source" : {
          "type" : "syslog",
          "@version" : "1",
          "syslog_message" : " [origin software="rsyslogd" swVersion="8.32.0" x-pid="448" x-info="https://www.rsyslog.com"] rsyslogd was HUPed",
          "syslog_hostname" : "coralogix",
          "message" : "Jul 12 05:52:46 coralogix rsyslogd:  [origin software="rsyslogd" swVersion="8.32.0" x-pid="448" x-info="https://www.rsyslog.com"] rsyslogd was HUPed",
          "received_at" : "2020-07-12T05:55:49.644Z",
          "received_from" : "coralogix",
          "host" : "coralogix",
          "syslog_program" : "rsyslogd",
          "syslog_timestamp" : "2020-07-12T02:52:46.000Z",
          "path" : "/var/log/syslog",
          "@timestamp" : "2020-07-12T05:55:49.644Z"
        }

Clean-Up Steps

To clean up what we created in this exercise, we just need to delete the two new indexes that we added

curl -XDELETE "https://localhost:9200/syslog-received-on-tcp/"

curl -XDELETE "https://localhost:9200/syslog-monitor/"

and also delete the directory where we placed our Logstash config files.

sudo rm -r /etc/logstash/conf.d/logstash-syslog

Conclusion

As you can see, it’s fairly easy to gather all of your logs in a single location, and the advantages are invaluable. For example, besides making everything more accessible and easier to search, think about servers failing. It happens a little bit more often than we like. If logs are kept on the server, once it fails, you lose the logs. Or, another common scenario, is that hackers delete logs once they compromise a machine. By collecting everything into Elasticsearch, though, you’ll have the original logs, untouched and ready to review to see what happened before the machine experienced problems.

Running ELK on Kubernetes with ECK – Part 3

This is the last installment of our 3-part series on running ELK on Kubernetes with ECK. If you’re just getting started, make sure to check out Part 1 and Part 2.

With that, let’s jump right in.

Using Persistent Volumes

When dealing with a Kubernetes cluster, containers can appear and disappear at any time. As a container gets removed, the data contained within it is lost too. That’s no problem for stateless apps, but Elasticsearch is a stateful one, and needs to preserve some of its data. Let’s learn how to do this with Persistent Volumes, which we’ll sometimes call PVs, for short, throughout this post.

First, we’ll delete the last Elasticsearch node we created in the previous post:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/04_single_node_es_plugin_install.yaml

Let’s clean it up even more and delete Kibana:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/02_kibana.yaml

And now we’ll remove the Dashboard:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/03_k8s_dashboard-not-safe-for-production.yaml

Next, let’s create a 5GB persistent volume we’ll call “es-data-holder“:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/05_persistent-volume.yaml

The output confirms that the PV has been created:

Link to original image

To display a list of persistent volumes available, we can use this next command:

kubectl get pv

In our case, the output should look like this:

Link to original image

Let’s analyze the YAML file we used to create this PV.

Link to original image

We can see the name we chose for the PV, in the metadata section. Under capacity, we specified 5Gi for the storage attribute. While other tools may use MB or GB to represent megabytes and gigabytes respectively, Kubernetes uses Mi and Gi to represent so-called mebibytes and gibibytes. As an example, a kilobyte is made out of 1000 bytes, while a kibibyte represents 2 to the power of 10 (2^10) bytes, which equals 1024. Similarly, a mebibyte is 2^20 bytes, a gibibyte is 2^30 and so on.

We set accessModes to ReadWriteOnce, which means that the volume may be mounted in read-write mode by only one node.

hostPath instructs Kubernetes to use the local directory specified by path, for this PV. 

Note that we’re using this here since it’s a convenient way to quickly start testing, without spending hours to setup network shared filesystems, or similar solutions. However, in production environments, you should never use a local directory for a PV. Instead, use filesystems that are available to all nodes, such as NFS shares, AWS storage, GlusterFS and so on.

Although our persistent volume has 5 gibibytes of total space available, we don’t have to allocate all of it to a single pod. Just like we can partition a disk to be used by multiple operating systems, so can we allocate different portions of the PV to different pods. In this case, we will claim 2 gibibytes for our pod.

Let’s look at the YAML file we will use for this purpose.

Link to original image

We can see a few new additions, compared to the YAML files we used previously for our Elasticsearch node. In the section named volumeClaimTemplates we request 2Gi of space from one of the persistent volumes existing in Kubernetes. Kubernetes will decide which PV serves the claim.

Let’s apply the settings in this YAML file:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/06_es-with-persistent-volume-enabled.yaml

As usual, we should wait for the Pod to be created. We can check the status with the same command we used before:

kubectl get pods

We will continue only when the quickstart-es-default-0 pod displays Running under its STATUS column and it’s also READY 1/1.

Now let’s see, did our Pod successfully claim the persistent storage it needed? We can check, with the next command:

kubectl get pvc

Note that “pvc” here stands for Persistent Volume Claim.

The output will show that the claim was successful. We can tell, because we see Bound under the STATUS column:

Link to original image

Deploying a Multi-Node Elasticsearch Cluster

For simplicity, up until now we’ve only played around with a single Elasticsearch node. But, under normal circumstances, Elasticsearch forms a cluster out of multiple nodes to achieve its performance and resiliency. Let’s see how we can create such a cluster.

This time, we will provision two persistent volumes, named “es-data-holder-01” and “es-data-holder-02“.

Afterwards, we will create an Elasticsearch cluster composed of two nodes. Each node will have one of the PVs allocated to it. Normally, Kubernetes will allocate whichever PV it decides is convenient. In most scenarios, multiple pods may use the same persistent volume. However, since we’ll use the ReadWriteOnce option, only one pod may use a PV, hence, one PV will be allocated to one pod and the other will be allocated to the other pod.

Ok, now let’s deploy our multi-node cluster. First, let’s delete the single node Elasticsearch setup we created earlier:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/06_es-with-persistent-volume-enabled.yaml

Let’s also delete the persistent volume we created:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/05_persistent-volume.yaml

Now, we’ll create two new PVs.

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/07-pv-for-multi-nodes.yaml

The output will confirm their creation:

Link to original image

Let’s check the available PVs and confirm everything looks as it should:

kubectl get pv

Link to original image

Now let’s take a look at the YAML config we used, where we’ll learn a new trick.

Since we wanted two separate PVs, we would normally have used two different YAML files. But in this case, we only used one because it’s more convenient.

How did we do this? Notice the three minus signs (---) in the middle of this config. This separator basically allows us to logically define two different YAML specifications in a single file; pretty simple and effective!

At this point, we can instruct Kubernetes about the multi-node Elasticsearch cluster we want to create.

Again, let’s analyze the contents in the YAML file that we’ll use.

Link to original image

We now have two nodeSets, one with master-nodes and another with data-nodes. We can see that the Elasticsearch master nodes will also serve as data nodes, since node.data is also set to true.

For our exercise, the count of master nodes is one. We used the same count for the data nodes as well, but in a real-world scenario we can easily scale up an Elasticsearch cluster by simply setting a higher count for whichever nodeSet we want to have more nodes in.

Finally, notice that we configured it so that each pod in the nodeSet will claim one gibibyte of storage space on the persistent volumes.

We’re ready to apply this YAML specification:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/08-multinode-es.yaml

Once again, we’ll check the status of the pods until we notice that quickstart-es-data-nodes-0 and quickstart-es-master-nodes-0 are both Running and READY 1/1:

kubectl get pods

Link to original image

Let’s see how the Persistent Volume Claims look this time:

kubectl get pvc

Link to original image

Now we want to make some cURL requests to Elasticsearch, so we’ll need access to the same type of password we retrieved in previous posts:

PASSWORD=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

Let’s send a request to the _cat API to list information about the nodes available in our Elasticsearch cluster:

curl -u elastic:$PASSWORD -k https://localhost:31920/_cat/nodes?v

We see an output similar to this:

Link to original image

The node.role column is useful here. The letter d indicates that this is a data node. The letter i indicates it is also an ingest node. m means it is eligible for the master role. We can also see an indication of master status in the master column: an asterisk (*) denotes the currently elected master node, while a hyphen (-) signals that the node is not the elected master.

Elasticsearch Hot-Warm Architecture 

In a typical scenario, some data in our Elasticsearch cluster may end up being searched more often, while other data is rarely accessed. In this case, it makes sense to implement the so-called hot-warm architecture.

In such a setup, our often-searched data would end up on hot nodes, while less-frequently searched data would end up on warm nodes. This way, we can do useful things, such as use faster servers for our hot nodes, so that search results are returned quicker for a better user experience. It also helps cut down on cloud-related costs. 

You can read more about the hot-warm-cold architecture in this great blog post.

Implementing this in Kubernetes is quite easy, since it does all the heavy lifting for us. We just need to assign the right values to the node.attr settings in our YAML specifications.

Link to original image

We already used the required node attributes in our previously applied YAML, so the foundation for our hot-warm architecture is already set up at this point. 

Our master nodes have the node.attr.temp set to hot and our data nodes have the node.attr.temp set to warm. We implemented this in advance, to avoid having to repeat the steps to delete the Elasticsearch nodes and recreate them, as this can be time consuming on some systems.

Let’s index some data on the nodes and test this hot-warm architecture.

First, we’ll create two indices, named “logs-01” and “logs-02“. We’ll assume “logs-02” will contain fresh new data that is often searched while “logs-01” will contain rarely searched for data.

Ok, let’s jump right in! When we create the first index, we set “index.routing.allocation.require.temp” to “warm”, to ensure that it will be assigned to a warm node:

curl -u elastic:$PASSWORD -XPUT -k "https://localhost:31920/logs-01/" -H 'Content-Type: application/json' -d'{
  "settings":{
"index.routing.allocation.require.temp": "warm"
}
}'

Link to original image

Creating the “logs-02” index is very similar, the difference being that we’ll set the routing allocation parameter to “hot“.

curl -u elastic:$PASSWORD -XPUT -k "https://localhost:31920/logs-02/" -H 'Content-Type: application/json' -d'{
  "settings":{
"index.routing.allocation.require.temp": "hot"
}
}'

Let’s go ahead and send a request to the _cat API and see if the shards from the “logs-01” index were placed on the right node:

curl -u elastic:$PASSWORD -XGET -k https://localhost:31920/_cat/shards/logs-01?v

Link to original image

We can see that the primary shard of the index ended up on quickstart-es-data-nodes-0. We designated that our data nodes should be warm, so everything went according to plan. We also notice that the replica is UNASSIGNED, but that’s normal here. Elasticsearch wants to place it on another warm node, but we have only one available in our setup. However, in a configuration with multiple warm nodes, this would need to be properly assigned.

Now let’s check the same thing for the “logs-02” index:

curl -u elastic:$PASSWORD -XGET -k https://localhost:31920/_cat/shards/logs-02?v

Link to original image

Great, we can see that the primary shard was properly assigned to the master node, which is configured as hot.

Upgrade Management

Scaling

The demand put on our Elasticsearch cluster will normally fluctuate. During certain periods, it could be hit by a higher than usual number of requests, and the nodes available at that time may be unable to respond quickly enough.

To handle such spikes in requests, we could increase the number of nodes, otherwise known as “scaling-up”. 

Kubernetes can take care of putting these nodes on different physical servers, so that they can all work in parallel and respond to requests more efficiently.

Let’s say we’d want to increase the number of data nodes in our Elasticsearch cluster. All we would need to do in this scenario, is increase the count parameter in our YAML file, from 1, to a higher number.

Link to original image

Let’s test this out, by applying the following YAML configuration, which is identical to the previous “08-multinode-es.yaml” we used, except this time we’re configuring “count: 2” for the data-nodes and master-nodes nodeSets:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/09-multinode-es-with-2-data-nodes.yaml

Now, let’s check out the pods:

kubectl get pods

We will see two extra pods pop up, called quickstart-es-data-nodes-1 and quickstart-es-master-nodes-1

It’s worth mentioning that since we used ReadWriteOnce with our persistent volumes, these are exclusively claimed by the previous two pods. This means there aren’t any other PVs available, so the new pods won’t start and will remain in a “Pending” state. Once the administrator creates new persistent volumes, the pods can claim them and start running.

Also, in our case, the pods won’t be able to start since we don’t have enough CPU and memory resources available in our Kubernetes cluster. We can explore why a pod is stuck in a Pending state with a command such as:

kubectl describe pods quickstart-es-data-nodes-1

Version Upgrade

As with any software, new versions of Elasticsearch are released and made available to the public. These may include performance improvements, bug fixes, and security patches. It’s natural that we’d want to periodically upgrade to stay up-to-date. 

In our case, Kubernetes will use the Elasticsearch operator to take care of the necessary steps to upgrade software across nodes.

Up to this point, we’ve used Elasticsearch version 7.6.2. We can check the current version number with this request:

curl -u elastic:$PASSWORD -k https://localhost:31920

Link to original image

Let’s upgrade Elasticsearch to version 7.7.0 by simply applying a new YAML specification that contains the line “version: 7.7.0” under “spec”:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/10-multinode-es-upgrade-to-7.7.0.yaml

Link to original image

We’ll wait until the upgrade process completes. Then, we can check for the version number again:

curl -u elastic:$PASSWORD -k https://localhost:31920

If you still notice the old version, wait a few more minutes and try the command again.

Link to the original image

It’s important to note that in a “live” production cluster that’s receiving data and serving requests, you should take appropriate measures before doing an upgrade. Always snapshot the data first, create backups, stop indexing new data to the cluster, or index to two clusters simultaneously – whatever’s necessary to ensure a smooth process.
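
As a rough sketch of what that preparation might look like, the following requests register a shared filesystem snapshot repository and take a snapshot before upgrading. The repository name, location, and snapshot name are only examples, and the location must already be listed under path.repo in the Elasticsearch configuration:

curl -u elastic:$PASSWORD -k -XPUT "https://localhost:31920/_snapshot/pre_upgrade_repo" \
-H 'Content-Type: application/json' \
-d '{ "type": "fs", "settings": { "location": "/mnt/backups" } }'

curl -u elastic:$PASSWORD -k -XPUT "https://localhost:31920/_snapshot/pre_upgrade_repo/snapshot-1?wait_for_completion=true"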

I hope you’ve found this series helpful to get started running ELK on Kubernetes.

Hands-on-Exercises: Mapping Exceptions with Elasticsearch

Mapping is an essential foundation of an index that can generally be considered the heart of Elasticsearch. So you can be sure of the importance of a well-managed mapping. But just as it is with many important things, sometimes mappings can go wrong. Let’s take a look at various issues that can arise with mappings and how to deal with them.

Before delving into the possible challenges with mappings, let’s quickly recap some key points about Mappings. A mapping essentially entails two parts:

  1. The Process: A process of defining how your JSON documents will be stored in an index
  2. The Result: The actual metadata structure resulting from the definition process

The Process

If we first consider the process aspect of the mapping definition, there are generally two ways this can happen:

  • An explicit mapping process, where you define what fields and their datatypes you want to store, along with any additional parameters.
  • A dynamic mapping process, where Elasticsearch automatically attempts to determine the appropriate datatypes and updates the mapping accordingly.

The Result

The result of the mapping process defines what we can “index” via individual fields and their datatypes, and also how the indexing happens via related parameters.

Consider this mapping example:

{
"mappings": {
"properties": {
"timestamp": { "type": "date" },
"service": { "type": "keyword" },
"host_ip": { "type": "ip" },
"port": { "type": "integer" },
"message": { "type": "text" }
}
}
}

It’s a very simple mapping example for a basic logs collection microservice. The individual logs consist of the following fields and their associated datatypes:

  • Timestamp of the log mapped as a date
  • Service name which created the log mapped as a keyword
  • IP of the host on which the log was produced mapped as an ip datatype
  • Port number mapped as an integer
  • The actual log Message mapped as text to enable full-text searching
  • More… Since we have not disabled the default dynamic mapping process, we’ll be able to see how we can introduce new fields arbitrarily and they will be added to the mapping automatically.

The Challenges

So what could go wrong :)?

There are generally two potential issues that many will end up facing with Mappings:

  • If we create an explicit mapping and fields don’t match, we’ll get an exception if the mismatch falls beyond a certain “safety zone”. We’ll explain this in more detail later.
  • If we keep the default dynamic mapping and then introduce many more fields, we’re in for a “mapping explosion” which can take our entire cluster down.

Let’s continue with some interesting hands-on examples where we’ll simulate the issues and attempt to resolve them.

Hands-on Exercises

Field datatypes – mapper_parsing_exception

Let’s get back to the “safety zone” we mentioned before when there’s a mapping mismatch.

We’ll create our index and see it in action. We are using the exact same mapping that we saw earlier:

curl --request PUT 'https://localhost:9200/microservice-logs' \
--header 'Content-Type: application/json' \
--data-raw '{
   "mappings": {
       "properties": {
           "timestamp": { "type": "date"  },
           "service": { "type": "keyword" },
           "host_ip": { "type": "ip" },
           "port": { "type": "integer" },
           "message": { "type": "text" }
       }
   }
}'

A well-defined JSON log for our mapping would look something like this:

{"timestamp": "2020-04-11T12:34:56.789Z", "service": "ABC", "host_ip": "10.0.2.15", "port": 12345, "message": "Started!" }

But what if another service tries to log its port as a string and not a numeric value (notice the double quotation marks)? Let’s give that a try:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "XYZ", "host_ip": "10.0.2.15", "port": "15000", "message": "Hello!" }'
>>>
{
...
 "result" : "created",
...
}

Great! It worked without throwing an exception. This is the “safety zone” I mentioned earlier.

But what if that service logged a string that has no relation to numeric values at all into the Port field, which we earlier defined as an Integer? Let’s see what happens:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "XYZ", "host_ip": "10.0.2.15", "port": "NONE", "message": "I am not well!" }'
>>>
{
 "error" : {
   "root_cause" : [
     {
       "type" : "mapper_parsing_exception",
       "reason" : "failed to parse field [port] of type [integer] in document with id 'J5Q2anEBPDqTc3yOdTqj'. Preview of field's value: 'NONE'"
     }
   ],
   "type" : "mapper_parsing_exception",
   "reason" : "failed to parse field [port] of type [integer] in document with id 'J5Q2anEBPDqTc3yOdTqj'. Preview of field's value: 'NONE'",
   "caused_by" : {
     "type" : "number_format_exception",
     "reason" : "For input string: "NONE""
   }
 },
 "status" : 400
}

We’re now entering the world of Elasticsearch mapping exceptions! We received a code 400 and the mapper_parsing_exception that is informing us about our datatype issue. Specifically, it failed to parse the provided value of “NONE” to the type integer.

So how do we solve this kind of issue? Unfortunately, there isn’t a one-size-fits-all solution. In this specific case we can “partially” resolve the issue by defining an ignore_malformed mapping parameter.

Keep in mind that this parameter is non-dynamic, so you either need to set it when creating your index or you need to: close the index → change the setting value → reopen the index. Something like this:

curl --request POST 'https://localhost:9200/microservice-logs/_close'
 
curl --location --request PUT 'https://localhost:9200/microservice-logs/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
   "index.mapping.ignore_malformed": true
}'
 
curl --request POST 'https://localhost:9200/microservice-logs/_open'

Now let’s try to index the same document:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "XYZ", "host_ip": "10.0.2.15", "port": "NONE", "message": "I am not well!" }'

Checking the document by its ID will show us that the port field was omitted from indexing. We can see it in the “_ignored” section of the response.

curl 'https://localhost:9200/microservice-logs/_doc/KZRKanEBPDqTc3yOdTpx?pretty'
{
...
 "_ignored" : [
   "port"
 ],
...
}

The reason this is only a “partial” solution is that this setting has its limits, and they are quite considerable. Let’s reveal one in the next example.

A developer might decide that when a microservice receives some API request it should log the received JSON payload in the message field. We already mapped the message field as text and we still have the ignore_malformed parameter set. So what would happen? Let’s see:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "ABC", "host_ip": "10.0.2.15", "port": 12345, "message": {"data": {"received":"here"}}}'
>>>
{
...
       "type" : "mapper_parsing_exception",
       "reason" : "failed to parse field [message] of type [text] in document with id 'LJRbanEBPDqTc3yOjTog'. Preview of field's value: '{data={received=here}}'"
...
}

We see our old friend, the mapper_parsing_exception! This is because ignore_malformed can’t handle JSON objects on the input, which is a significant limitation to be aware of.

Now, when speaking of JSON objects, be aware that all the mapping ideas remain valid for their nested parts as well. Continuing our scenario, after losing some logs to mapping exceptions, we decide it’s time to introduce a new payload field of the type object where we can store the JSON at will.

Remember we have dynamic mapping in place so you can index it without first creating its mapping:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "ABC", "host_ip": "10.0.2.15", "port": 12345, "message": "Received...", "payload": {"data": {"received":"here"}}}'
>>>
{
...
 "result" : "created",
...
}

All good. Now we can check the mapping and focus on the payload field.

curl --request GET 'https://localhost:9200/microservice-logs/_mapping?pretty'
>>>
{
...
       "payload" : {
         "properties" : {
           "data" : {
             "properties" : {
               "received" : {
                 "type" : "text",
                 "fields" : {
                   "keyword" : {
                     "type" : "keyword",
                     "ignore_above" : 256
                   }
...
}

It was mapped as an object with (sub)properties defining the nested fields. So apparently the dynamic mapping works! But there is a trap. The payloads (or generally any JSON object) in the world of many producers and consumers can consist of almost anything. So you can guess what will happen with a different JSON payload that also contains a payload.data.received field, but with a different type of data:

curl --request POST 'https://localhost:9200/microservice-logs/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw '{"timestamp": "2020-04-11T12:34:56.789Z", "service": "ABC", "host_ip": "10.0.2.15", "port": 12345, "message": "Received...", "payload": {"data": {"received": {"even": "more"}}}}'

…again we get the mapper_parsing_exception!

So what else can we do?

  • Engineers on the team need to be made aware of these mapping mechanics. You can also establish shared guidelines for the log fields.
  • Secondly, you may consider what’s called a Dead Letter Queue pattern that would store the failed documents in a separate queue. This either needs to be handled on an application level, or by employing the Logstash DLQ, which allows us to still process the failed documents (see the pipeline sketch below).
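
As a rough illustration of the second option, here's a minimal Logstash pipeline sketch that reads failed events back from the dead letter queue. It assumes the DLQ has been enabled with dead_letter_queue.enable: true in logstash.yml and that the path below matches your installation:

cat <<'EOF' > dlq-reprocess.conf
input {
  dead_letter_queue {
    path => "/usr/share/logstash/data/dead_letter_queue"
    commit_offsets => true
  }
}
output {
  # Inspect the failed events before deciding how to reprocess them
  stdout { codec => rubydebug }
}
EOF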

Limits – illegal_argument_exception

Now the second area of caution in relation to mappings is limits. Even from the super-simple examples with payloads, you can see that the number of nested fields can start accumulating pretty quickly. Where does this road end? At 1,000, which is the default limit on the number of fields in a mapping.

Let’s simulate this exception in our safe playground environment before you unwillingly meet it in your production environment.

We’ll start by creating a large dummy JSON document with 1001 fields, POST it and then see what happens.

To create the document, you can either use the example command below with the jq tool (apt-get install jq if you don’t already have it) or create the JSON manually if you prefer:

thousandone_fields_json=$(echo {1..1001..1} | jq -Rn '( input | split(" ") ) as $nums | $nums[] | . as $key | [{key:($key|tostring),value:($key|tonumber)}] | from_entries' | jq -cs 'add')
 
echo "$thousandone_fields_json"
{"1":1,"2":2,"3":3, ... "1001":1001}

We can now create a new plain index:

curl --location --request PUT 'https://localhost:9200/big-objects'

And if we then POST our generated JSON, can you guess what’ll happen?

curl --request POST 'https://localhost:9200/big-objects/_doc?pretty' \
--header 'Content-Type: application/json' \
--data-raw "$thousandone_fields_json"
>>>
{
 "error" : {
   "root_cause" : [
     {
       "type" : "illegal_argument_exception",
       "reason" : "Limit of total fields [1000] in index [big-objects] has been exceeded"
     }
...
 "status" : 400
}

… straight to the illegal_argument_exception! This informs us that the limit of total fields has been exceeded.

So how do we handle that? First, you should definitely think about what you are storing in your indices and for what purpose. Secondly, if you still need to, you can increase the 1,000-field limit. But be careful, as with bigger complexity might come a much bigger price in potential performance degradation and high memory pressure (see the docs for more info).

Changing this limit can be performed with a simple dynamic setting change:

curl --location --request PUT 'https://localhost:9200/big-objects/_settings' \
--header 'Content-Type: application/json' \
--data-raw '{
 "index.mapping.total_fields.limit": 1001
}'
>>>
{"acknowledged":true}

Now that you’re more aware of the dangers lurking with Mappings, you’re much better prepared for the production battlefield 🙂

Learn More

Running ELK on Kubernetes with ECK – Part 2

This is part 2 of a 3-part series on running ELK on Kubernetes with ECK. If you’re just getting started, make sure to check out Part 1.

Setting Up Elasticsearch on Kubernetes

Picking up where we left off, our Kubernetes cluster is ready for our Elasticsearch stack. We’ll first create an Elasticsearch Node and then continue with setting up Kibana.

Importing Elasticsearch Custom Resource Definitions (CRD) and Operators

Currently, Kubernetes doesn’t yet know how it should create and manage our various Elasticsearch components. We would have to spend a lot of time manually creating the steps it should follow. But we can extend Kubernetes’ understanding and functionality with Custom Resource Definitions and Operators.

Luckily, the Elasticsearch team provides a ready-made YAML file that defines the necessary resources and operators. This makes our job a lot easier, as all we have to do is feed this file to Kubernetes.

Let’s first log in to our master node:

vagrant ssh kmaster

Note: if your command prompt displays “vagrant@kmaster:~$“, it means you’re already logged in and you can skip this command.

With the next command, we import and apply the structure and logic defined in the YAML file:

kubectl apply -f https://download.elastic.co/downloads/eck/1.1.1/all-in-one.yaml

Optionally, by copying the “https” link from the previous command and pasting it into the address bar of a browser, we can download and examine the file. 

Many definitions have detailed descriptions which can be helpful when we want to understand how to use them.

We can see in the command’s output that a new namespace was created, named “elastic-system”.

Let’s go ahead and list all namespaces in our cluster:

kubectl get ns

Now let’s look at the resources in this namespace:

kubectl -n elastic-system get all

The “-n elastic-system” option selects the namespace we want to work with and “get all” displays the resources.

The output of this command will be useful when we need to check on things like which Pods are currently running, what services are available, at which IP addresses they can be reached, and so on.

If the “STATUS” for “pod/elastic-operator-0” displays “ContainerCreating”, then wait a few seconds and repeat the previous command until you see the status change to “Running”. 

We need the operator to be active before we continue.

Launching an Elasticsearch Node in Kubernetes

Now it’s time to tell Kubernetes about the state we want to achieve. 

The Kubernetes Operator will then proceed to automatically create and manage the necessary resources to achieve and maintain this state. 

We’ll accomplish this with the help of a YAML file. Let’s analyze its contents before passing it to the kubectl command:

Link to image file

  1. kind here means the type of object that we’re describing and intend to create
  2. Under metadata, the name, a value of our choosing, helps us identify the resources that’ll be created
  3. Under nodeSets, we define things like:
  • The name for this set of nodes. 
  • In count, we choose the number of Elasticsearch nodes we want to create. 
  • Finally, under config, we define how the nodes should be configured. In our case, we’re choosing a single Elasticsearch instance that should be both a Master Node and a Data Node. We’re also using the config option “node.store.allow_mmap: false”, to quickly get started. Note, however, that in a production environment, this section should be carefully configured. For example, in the case of the allow_mmap config setting, users should read Elasticsearch’s documentation about virtual memory before deciding on a specific value.
  4. Under podTemplate we have spec (or specifications) for containers:
  • Under env we’re passing some environment variables. These ultimately reach the containers in which our applications will run and some programs can pick up on those variables to change their behavior in some way. The Java Virtual Machine, running in the container and hosting our Elasticsearch application, will notice our variable and change the way it uses memory by default.
  • Also, notice that under resources we define requests with a cpu value of “0.5”. This decreases the CPU priority of this pod.
  5. Under http, we define a service of the type NodePort. This creates a service that will be accessible even from outside of Kubernetes’ internal network. In this lesson, we will analyze why this option is important and when we’d want to use it.

Under the ports section we find:

  • port tells the Service on which port to accept connections. Only apps running inside the Kubernetes cluster can connect to this port, so no external connections are allowed here. For external connections, nodePort will be used.
  • targetPort is the port, inside one of the Pods, to which requests received by the Service on the previously defined port get redirected. Of course, the application running in that Pod/Container will also need to listen on this port to be able to receive the requests. For example, if a program makes a request on port 12345, the service will redirect the request to a Pod, on targetPort 54321.
  • Kubernetes runs on Nodes, that is, physical or virtual machines. Each physical or virtual machine can have its own IP address, on which other computers can communicate with it. This is called the Node’s IP address or external IP address. nodePort opens up a port, on every node in your cluster, that can be accessed by computers outside of Kubernetes’ internal network. For example, if one node were using a publicly accessible IP address, we could connect to that IP and the specified nodePort, and Kubernetes would accept the connection and redirect it to the targetPort on one of the Pods. A generic NodePort Service sketch follows this list.
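
To make these three settings concrete, here is a generic, standalone NodePort Service sketch. It is not the ECK-generated service from our setup; the names, labels, and port numbers are just examples:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: example-nodeport         # example name
spec:
  type: NodePort
  selector:
    app: example-app             # routes traffic to Pods carrying this label
  ports:
    - port: 80                   # port for clients inside the cluster
      targetPort: 8080           # port the container listens on
      nodePort: 30080            # port opened on every Kubernetes node
EOF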

As mentioned earlier, we can find a lot of the Elasticsearch-specific objects defined in the “all-in-one.yaml” file we used to import Custom Resource Definitions. For example, if we open the file and search for “nodeSets”, we’ll see the following:

Link to image file

With that out of the way, let’s finally pass this desired state to Kubernetes:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/01_single-node-elasticsearch.yaml

This will take a while, but we can verify progress by looking at the resources available. Initially, the status for our Pod will display “Init:0/1”.

kubectl get all

When the Pod containing our Elasticsearch node is finally created, we should notice in the output of this command that “pod/quickstart-es-default-0” has availability of “1/1” under READY and a STATUS of “Running“. 

Link to image file

Now we’re set to continue.

Retrieving a Password from Kubernetes Secrets

First, we’ll need to authenticate our cURL requests to Elasticsearch with a username and password. Storing this password in the Pods, Containers, or other parts of the filesystem would not be secure, as, potentially, anyone and anything could freely read it. 

Kubernetes has a special location where it can store sensitive data such as passwords, keys or tokens, called Secrets.

To list all secrets protected by Kubernetes, we use the following command:

kubectl get secrets

In our case, the output should look something like this:

We will need the “quickstart-es-elastic-user” secret. With the following command we can examine information about the secret:

kubectl describe secret quickstart-es-elastic-user

We’ll get the following output:

Let’s extract the password stored here and save it to a variable called “PASSWORD”.

PASSWORD=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

To display the password, we can type:

echo $PASSWORD

Making the Elasticsearch Node Publicly Accessible

Let’s list the currently available Kubernetes services:

kubectl get svc

Here’s an example output we will analyze:

A lot of IP addresses we’ll see in Kubernetes are so-called internal IP addresses. This means that they can only be accessed from within the same network. In our case, this would imply that we can connect to certain things only from our Master Node or the other two Worker Nodes, but not from other computers outside this Kubernetes cluster.

When we run a Kubernetes cluster on physical servers or virtual private servers, these will all have external IP addresses that can be accessed by any device connected to the Internet. By using the previously discussed NodePort service, we open up a certain port on all Nodes. This way, any computer connected to the Internet can get access to services offered by our pods, by sending requests to the external IP address of a Kubernetes node and the specified NodePort number.

Alternatively, instead of NodePort, we can also use a LoadBalancer type of service to make something externally available.

In our case, we can see that all incoming requests, on the external IP of the Node, to port 31920/TCP will be routed to port 9200 on the Pods.

We extracted the necessary password earlier, so now we can fire a cURL request to our Elasticsearch node:

curl -u elastic:$PASSWORD -k https://localhost:31920

Since we made this request from the “kmaster” Node, it still goes through Kubernetes’ internal network. 

So to see if our service is indeed available from outside this network, we can do the following.

First, we need to find out the external IP address of the Node we’ll use. We can list the addresses of all worker Nodes with this command (in this Vagrant setup, the address we need is reported under the type “InternalIP”):

kubectl get nodes --selector=kubernetes.io/role!=master -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}' ; echo

Alternatively, we can use another method:

ip addr

And look for the IP address displayed under “eth1”, like in the following:

However, this method requires closer attention, as the external IP may become associated with a different adapter name in the future. For example, the identifier might start with the string “enp”.

In our case, the IP we extracted here belongs to the VirtualBox machine that is running this specific Node. If the Kubernetes Node would run on a server instead, it would be the publicly accessible IP address of that server.

Now, let’s assume for a moment that the external IP of our node is 172.42.42.100. If you want to run this exercise, you’ll need to replace this with the actual IP of your own Node, in case it differs. 

You will also need to replace the password, with the one that was generated in your case.

Let’s display the password again:

echo $PASSWORD

Select and copy the output you get since we’ll need to paste it in another window.

In our example, the output is 3sun1I8PB41X2C8z91Xe7DGy, but you shouldn’t use this. We brought attention to this value just so you can see where your password should be placed in the next command.

Next, minimize your current SSH session or terminal window, don’t close it, as you’ll soon return to that session. 

Windows: If you’re running Windows, open up a Command Prompt and execute the next command. 

Linux/Mac: On Linux or Mac, you would need to open up a new terminal window instead. 

Windows 10 and some versions of Linux have the cURL utility installed by default. If it’s not available out of the box for you, you will have to install it before running the next command. 

Remember to replace highlighted values with what applies to your situation:

curl -u "elastic:3sun1I8PB41X2C8z91Xe7DGy" -k "https://172.42.42.100:31920"

And there it is, you just accessed your Elasticsearch Node that’s running in a Kubernetes Pod by sending a request to the Kubernetes Node’s external IP address. 

Now let’s close the Command Prompt or the Terminal for Mac users and return to the previously minimized SSH session, where we’re logged in to the kmaster Node.

Setting Up Kibana

Creating the Kibana Pod

As we did with our Elasticsearch node, we’ll declare to Kubernetes what state we want to achieve, and it will take the necessary steps to bring up and maintain a Kibana instance.

Let’s look at a few key points in the YAML file that we’ll pass to the kubectl command:

Image link

  1. The elasticsearchRef entry is important, as it points Kibana to the Elasticsearch cluster it should connect to.
  2. In the service and ports sections, we can see it’s similar to what we had with the Elasticsearch Node, making it available through a NodePort service on an external IP address.

Now let’s apply these specifications from our YAML file:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/02_kibana.yaml

It will take a while for Kubernetes to create the necessary structures. We can check its progress with:

kubectl get pods

The name of the Kibana pod will start with the string “quickstart-kb-“. If we don’t see “1/1” under READY and a STATUS of Running, for this pod, we should wait a little more and repeat the command until we notice that it’s ready.

Accessing the Kibana Web User Interface

Let’s list the services again to extract the port number where we can access Kibana.

kubectl get svc

We can see the externally accessible port is 31560. We also need the IP address of a Kubernetes Node. 

The procedure is the same as the one we followed before and the external IPs should also be the same:

kubectl get nodes --selector=kubernetes.io/role!=master -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}' ; echo

Finally, we can now open up a web browser, where, in the URL address bar we type “https://” followed by the IP address and the port number. The IP and port should be separated by a colon (:) sign. 

Here’s an example of how this could look:

https://172.42.42.100:31560

Since Kibana currently uses a self-signed SSL/TLS security certificate, not validated by a certificate authority, the browser will automatically refuse to open the web page. 

To continue, we need to follow the steps specific to each browser. For example, in Chrome, we would click on “Advanced” and then at the bottom of the page, click on “Proceed to 172.42.42.100 (unsafe)“. 

On production systems, you should use valid SSL/TLS certificates, signed by a proper certificate authority. The Elasticsearch documentation has instructions about how we can import our own certificates when we need to.

Finally, the Kibana dashboard appears:

Under username, we enter “elastic” and the password is the same one we retrieved in the $PASSWORD variable. If we need to display it again, we can go back to our SSH session on the kmaster Node and enter the command:

echo $PASSWORD

Inspecting Pod Logs

Now let’s list our Pods again:

kubectl get pods

By copying and pasting the pod name to the next command, we can look at the logs Kubernetes keeps for this resource. We also use the “-f” switch here to “follow” our log, that is, watch it as it’s generated.

kubectl logs quickstart-es-default-0 -f

Whenever we open logs in this “follow” mode, we’ll need to press CTRL+C when we want to exit.

Installing The Kubernetes Dashboard

So far, we’ve relied on the command line to analyze and control various things in our Kubernetes infrastructure. But just like Kibana can make some things easier to visualize and analyze, so can the Kubernetes Web User Interface.

Important Note: Please note that the YAML file used here is meant just as an ad-hoc, simple solution to quickly add Kubernetes Web UI to the cluster. Otherwise said, we used a modified config that gives you instant results, so you can experiment freely and effortlessly. But while this is good for testing purposes, it is NOT SAFE for a production system as it will make the Web UI publicly accessible and won’t enforce proper login security. If you intend to ever add this to a production system, follow the steps in the official Kubernetes Web UI documentation.

Let’s pass the next YAML file to Kubernetes, which will do the heavy lifting to create and configure all of the components necessary to create a Kubernetes Dashboard:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/03_k8s_dashboard-not-safe-for-production.yaml

As usual, we can check with the next command if the job is done:

kubectl get pods

Once the Dashboard Pod is running, let’s list the Services, to find the port we need to use to connect to it:

kubectl get svc

In our example output, we see that Dashboard is made available at port 30000.

Just like in the previous sections, we use the Kubernetes Node’s external IP address, and port, to connect to the Service. Open up a browser and type the following in the address bar, replacing the IP address and port, if necessary, with your actual values:

https://172.42.42.100:30000

The following will appear:

Since we’re just testing functionality here, we don’t need to configure anything and we can just click “Skip” and then we’ll be greeted with the Overview page in the Kubernetes Web UI.

Installing Plugins to an Elasticsearch Node Managed by Kubernetes

We might encounter a need for plugins to expand Elasticsearch’s basic functionality. Here, we will assume we need the S3 plugin to access Amazon’s object storage service.

The process we’ll go through looks like this:

Storing S3 Authentication Keys as Kubernetes Secrets

We previously explored how to extract values from Kubernetes’ secure Secret vault. Now we’ll learn how to add sensitive data here.

To make sure that only authorized parties can access them, S3 buckets will ask for two keys. We will use the following fictional values.

AWS_ACCESS_KEY=123456

AWS_SECRET_ACCESS_KEY=123456789

If, in the future, you want to adapt this exercise for a real-world scenario, you would just copy the key values from your Amazon Dashboard and paste them in the next two commands.

To add these keys, with their associated values, to Kubernetes Secrets, we would enter the following commands:

kubectl create secret generic awsaccesskey --from-literal=AWS_ACCESS_KEY_ID=123456

and:

kubectl create secret generic awssecretkey --from-literal=AWS_SECRET_ACCESS_KEY=123456789

Each command will output a message, informing the user that the secret has been created.

Let’s list the secrets we have available now:

kubectl get secrets

Notice our newly added entries:

We can also visualize these in the Kubernetes Dashboard:

Installing the Elasticsearch S3 Plugin

When we created our Elasticsearch node, we described the desired state in a YAML file and passed it to Kubernetes through a kubectl command. To install the plugin, we simply describe a new, changed state, in another YAML file, and pass it once again to Kubernetes.

The modifications to our original YAML config are highlighted here:

Image Link

The first group of changes we added are as follows:

                - name: AWS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: awsaccesskey
                      key: AWS_ACCESS_KEY_ID
                - name: AWS_SECRET_ACCESS_KEY
                  valueFrom:
                    secretKeyRef:
                      name: awssecretkey
                      key: AWS_SECRET_ACCESS_KEY

Here, we create environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, inside the Container. We assign them the values of our secret keys, extracted from the Kubernetes Secrets vault.

In the second group of changes we added this:

          initContainers:
            - name: install-plugins
              command:
                - sh
                - -c
                - |
                  bin/elasticsearch-plugin install --batch repository-s3
                  echo $AWS_ACCESS_KEY_ID | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key
                  echo $AWS_SECRET_ACCESS_KEY | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.secret_key

Here, we simply instruct Kubernetes to execute certain commands when it initializes the Containers. The commands will first install the S3 plugin and then configure it with the proper secret key values, passed along through the $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY environment variables.

To apply the new configuration, let’s first delete the Elasticsearch node from our Kubernetes cluster, by removing its associated YAML specification:

kubectl delete -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/01_single-node-elasticsearch.yaml

If we now check the status of the Pods, with:

kubectl get pods

We can see that the Elasticsearch Pod has a STATUS of “Terminating“.

Finally, let’s apply our latest desired state for our Elasticsearch Node, with the S3 plugin installed:

kubectl apply -f https://raw.githubusercontent.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/master/04_single_node_es_plugin_install.yaml

After a while, we can check the status of the Pods again to see if Kubernetes finished setting up the new configuration:

kubectl get pods

As usual, a STATUS of “Running” means the job is complete:

Verifying Plugin Installation

Since we’ve created a new Elasticsearch container, this will use a newly generated password to authenticate cURL requests. Let’s retrieve it, once again, and store it in the PASSWORD variable:

PASSWORD=$(kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}')

It’s useful to list the Services again, to check which port we’ll need to use in order to send cURL requests to the Elasticsearch Node:

kubectl get svc

Take note of the port displayed for “quickstart-es-http” since we’ll use it in the next command:

Finally, we can send a cURL request to Elasticsearch to display the plugins it is using: 

curl -XGET -u elastic:$PASSWORD -k https://localhost:31920/_cat/plugins

Now the output will show that the repository-s3 plugin is active. 

In the third and final post of this series (coming next week), we’ll:

  • Use persistent volumes for storage
  • Setup a multi-node cluster deployment
  • Setup a Hot-warm architecture 
  • Learn about upgrade management

Running ELK on Kubernetes with ECK – Part 1

More and more employers are looking for people experienced in building and running Kubernetes-based systems, so it’s a great time to start learning how to take advantage of the new technology. Elasticsearch consists of multiple nodes working together, and Kubernetes can automate the process of creating these nodes and taking care of the infrastructure for us, so running ELK on Kubernetes can be a good option in many scenarios.

We’ll start with an overview of Kubernetes and how it works behind the scenes. Then, armed with that knowledge, we’ll try some practical hands-on exercises to get our hands dirty and see how we can build and run Elastic Cloud on Kubernetes, or ECK for short.

What we’ll cover:

  • Fundamental Kubernetes concepts
  • Use Vagrant to create a Kubernetes cluster with one master node and two worker nodes
  • Create Elasticsearch clusters on Kubernetes
  • Extract a password from Kubernetes secrets
  • Publicly expose services running on Kubernetes Pods to the Internet, when needed.
  • How to install Kibana
  • Inspect Pod logs
  • Install the Kubernetes Web UI (i.e. Dashboard)
  • Install plugins on an Elasticsearch node running in a Kubernetes container

System Requirements: Before proceeding further, we recommend a system with at least 12GB of RAM, 8 CPU cores, and a fast internet connection. If your computer doesn’t meet the requirements, just use a VPS (virtual private server) provider. Google Cloud is one service that meets the requirements, as it supports nested virtualization on Ubuntu (VirtualBox works on their servers).

There’s a trend, lately, to run everything in isolated little boxes, either virtual machines or containers. There are many reasons for doing this, so we won’t get into them here, but if you’re interested, you can read Google’s motivation for using containers.

Let’s just say that containers make some aspects easier for us, especially in large-scale operations.

Managing one, two, or three containers is no big deal and we can usually do it manually. But when we have to deal with tens or hundreds of them, we need some help. 

This is where Kubernetes comes in.

What is Kubernetes?

By way of analogy, if containers are the workers in a company, then Kubernetes would be the manager, supervising everything that’s happening and taking appropriate measures to keep everything running smoothly.

After we define a plan of action, Kubernetes does the heavy lifting to fulfill our requirements.

Examples of what you can do with K8s:

  • Launch hundreds of containers, or whatever number is needed, with much less effort
  • Set up ways that containers can communicate with each other (i.e. networking)
  • Automatically scale up or down. When demand is high, create more containers, even on multiple physical servers, so that the stress of the high demand is distributed across multiple machines, making it easier to process. As soon as demand goes down, it can remove unneeded containers, as well as the nodes that were hosting them (if they’re sitting idle).
  • If there are a ton of requests coming in, Kubernetes can load balance and evenly distribute the workload to multiple containers and nodes.
  • Containers are carefully monitored with health checks, according to user-defined specifications. If one stops working, Kubernetes can restart it, create a new one as a replacement, or kill it entirely. If a physical machine running containers fails, those containers can be moved to another physical machine that’s still working correctly.

Kubernetes Cluster Structure

Let’s analyze the structure from the top down to get a good handle on things before diving into the hands-on section.

First, Kubernetes must run on computers of some kind. It might end up being on dedicated servers, virtual private servers, or virtual machines hosted by a capable server. 

Multiple such machines running Kubernetes components form a Kubernetes cluster, which is considered the whole universe of Kubernetes, because everything, from containers to data, to monitoring systems and networking exists here. 

In this little universe, there has to be a central point of command, like the “brains” of Kubernetes. We call this the master node. This node assumes control of the other nodes, sometimes also called worker nodes. The master node manages the worker nodes, while these, in turn, run the containers and do the actual work of hosting our applications, services, processing data, and so on.

Master Node

Basically, we’re the master of our master node, and it, in turn, is the master of every other node.

We instruct our master node about what state we want to achieve, and it then proceeds to take the necessary steps to fulfill our demands. 

Simply put, it automates our plan of action and tries to keep the system state within set parameters, at all times.

Nodes (or Worker Nodes)

The Nodes are like the “worker bees” of a Kubernetes cluster and provide the physical resources, such as CPU, storage space, memory, to run our containers.

Basic Kubernetes Concepts

Up until this point, we kept things simple and just peeked at the high-level structure of a Kubernetes cluster. So now let’s zoom in and take a closer look at the internal structure so we better understand what we’re about to get our hands dirty with.

Pods

Pods are like the worker ants of Kubernetes – the smallest units of execution. They are where applications run and do their actual work, processing data. A Pod has its own storage resources, and its own IP address and runs a container, or sometimes, multiple containers grouped together as a single entity.

Services

Pods can appear and disappear at any moment, each time with a different IP address. It would be quite hard to send requests to Pods since they’re basically a moving target. To get around this, we use Kubernetes Services.

A K8s Service is like a front door to a group of Pods. The service gets its own IP address. When a request is sent to this IP address, the service then intelligently redirects it to the appropriate Pod. We can see how this approach provides a fixed location that we can reach. It can also be used as a mechanism for things like load balancing. The service can decide how to evenly distribute all incoming requests to appropriate Pods.

Namespaces

Physical clusters can be divided into multiple virtual clusters, called namespaces. We might use these for a scenario in which two different development teams need access to one Kubernetes cluster. 

With separate namespaces, we don’t need to worry if one team screws up the other team’s namespace since they’re logically isolated from one another.

Deployments

In Deployments, we describe a state that we want to achieve. Kubernetes then proceeds to work its magic to achieve that state (a minimal example follows the list below). 

Deployments enable:

  • Quick updates – all Pods can gradually be updated, one-by-one, by the Deployment Controller. This gets rid of having to manually update each Pod, a tedious process no one enjoys.
  • Maintain the health of our structure – if a Pod crashes or misbehaves, the controller can replace it with a new one that works.
  • Recover Pods from failing nodes – if a node should go down, the controller can quickly launch working Pods in another, functioning node.
  • Automatically scale up and down based on the CPU utilization of Pods.
  • Rollback changes that created issues. We’ve all been there 🙂
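
As a minimal, generic sketch of describing such a desired state (the name, image, and replica count are just examples):

cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-web              # example name
spec:
  replicas: 3                    # desired number of Pods
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.19      # example image
          ports:
            - containerPort: 80
EOF

Kubernetes will keep three such Pods running, replacing any that fail.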

Labels and Selectors

First, things like Pods, services, namespaces, volumes, and the like, are called “objects”. We can apply labels to objects. Labels help us by grouping and organizing subsets of these objects that we need to work with. 

The way Labels are constructed is with key/value pairs. Consider these examples:

app:nginx

site:example.com

Applied to specific Pods, these labels can easily help us identify and select those that are running the Nginx web server and are hosting a specific website.

And finally, with a selector, we can match the subset of objects we intend to work with. For example, a selector like

app = nginx

site = example.com

This would match all the Pods running Nginx and hosting “example.com”.
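
With kubectl, such a selector can be expressed using the -l flag, for example:

kubectl get pods -l app=nginx,site=example.com

This lists only the Pods carrying both labels.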

Ingress

In a similar way that Kubernetes Services sit in front of Pods to redirect requests, Ingress sits in front of Services to load balance between different Services, terminate SSL/TLS to encrypt web traffic, or route requests using name-based virtual hosting. 

Let’s take an example to explain name-based hosting. Say there are two different domain names, for example, “a.example.com” and “b.example.com” pointing to the same ingress IP address. Ingress can be made to route requests coming from “a.example.com” to service A and requests from “b.example.com” to service B.

Stateful Sets

Deployments assume that applications in Kubernetes are stateless, that is, they start and finish their job and can then be terminated at any time – with no state being preserved. 

However, we’ll need to deal with Elasticsearch, which needs a stateful approach. 

Kubernetes has a mechanism for this called StatefulSets. Pods are assigned persistent identifiers, which makes it possible to do things like:

  • Preserve access to the same volume, even if the Pod is restarted or moved to another node.
  • Assign persistent network identifiers, even if Pods are moved to other nodes.
  • Start Pods in a certain order, which is useful in scenarios where Pod2 depends on Pod1 so, obviously, Pod1 would need to start first, every time.
  • Rolling updates in a specific order.

Persistent Volumes

A persistent volume is simply storage space that has been made available to the Kubernetes cluster. This storage space can be provided from the local hardware, or from cloud storage solutions.

Normally, when a Pod is deleted, its associated volume data is also deleted. As the name suggests, persistent volumes preserve their data, even after a Pod that was using them disappears. Besides keeping data around, they also allow multiple Pods to share the same data.

Before a Pod can use a persistent volume, though,  it needs to make a Persistent Volume Claim on it.
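
Here is a generic sketch of such a claim (the name and requested size are just examples; in our ECK setup, the operator creates these claims for the Elasticsearch data volumes automatically):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-claim            # example name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi               # example size
EOF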

Headless Service

We previously saw how a Service sits in front of a group of Pods, acting as a middleman, redirecting incoming requests to a dynamically chosen Pod. But this also hides the Pods from the requester, since it can only “talk” with the Service’s IP address. 

If we remove this IP, however, we get what’s called a Headless Service. At that point, the requester could bypass the middle man and communicate directly with one of the Pods. That’s because their IP addresses are now made available to the outside world.

This type of service is often used with Stateful Sets.
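
For illustration, a headless Service is declared simply by setting clusterIP to None; everything else looks like a regular Service (the names and port are just examples):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: example-headless         # example name
spec:
  clusterIP: None                # this makes the Service headless
  selector:
    app: example-app
  ports:
    - port: 9300
EOF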

Kubectl

Now, we need a way to interact with our entire Kubernetes cluster. The kubectl command-line tool allows us to issue the commands we need. It then interacts with the Kubernetes API, and all of the other components, to execute our desired actions.

Let’s look at a few simple commands. 

For example, to check the cluster information, we’d enter:

kubectl cluster-info

If we wanted to list all nodes in the cluster, we’d enter:

kubectl get nodes

We’ll take a look at many more examples in our hands-on exercises.

Operators

Some operations can be complex. For example, upgrading an application might require a large number of steps, verifications, and decisions on how to act if something goes wrong. This might be easy to do with one installation, but what if we have 1000 to worry about? 

In Kubernetes, hundreds, thousands, or more containers might be running at any given point. Manually performing a similar operation on all of them would quickly become unmanageable, which is why we’d want to automate it.

Enter Operators. We can think of them as a sort of “software operators,” replacing the need for human operators. These are written specifically for an application, to help us, as service owners, to automate tasks.

Operators can deploy and run the many containers and applications we need, react to failures and try to recover from them, automatically backup data, and so on. This essentially lets us extend Kubernetes beyond its out-of-the-box capabilities without modifying the actual Kubernetes code.

Custom Resources

Since Kubernetes is modular by design, we can extend the API’s basic functionality. For example, the default installation might not have appropriate mechanisms to deal efficiently with our specific application and needs. By registering a new Custom Resource Definition, we can add the functionality we need, custom-tailored for our specific application. In our exercises, we’ll explore how to add Custom Resource Definitions for various Elasticsearch applications.

Hands-On Exercises

Basic Setup

Ok, now the fun begins. We’ll start by creating virtual machines that will be added as nodes to our Cluster. We will use VirtualBox to make it simpler.

1. Installing VirtualBox

1.1 Installing VirtualBox on Windows

Let’s go to the download page: https://www.virtualbox.org/wiki/Downloads and click on “Windows Hosts”.

We can then open the setup file we just downloaded and click “Next” in the installation wizard, keeping the default options selected.

After finishing with the installation, it’s a good idea to check if everything works correctly by opening up VirtualBox, either from the shortcut added to the desktop, or the Start Menu.

If everything seems to be in order, we can close the program and continue with the Vagrant setup.

1.2 Installing VirtualBox on Ubuntu

First, we need to make sure that the Ubuntu Multiverse repository is enabled.

Afterward, we install VirtualBox with the next command:

sudo apt-get update && sudo apt-get install virtualbox-qt

Let’s try to run VirtualBox to ensure the install was successful:

virtualbox

Once the app opens up, we can close it and continue with Vagrant.

1.3 Installing VirtualBox on macOS

Let’s download the setup file from https://www.virtualbox.org/wiki/Downloads and click on “OS X hosts.”

We can now open the DMG file, execute the PKG inside and run the installer. We keep the default options selected and continue with the steps in the install wizard.

Let’s open up the terminal and check if the install was successful.

virtualbox

If the application opens up and everything seems to be in order, we can continue with the Vagrant setup.

2. Installing Vagrant

It would be pretty time-consuming to set up each virtual machine for use with Kubernetes. Instead, we will use Vagrant, a tool that automates this process, making our work much easier.

2.1 Installing Vagrant on Windows

Installing on Windows is easy. We just need to visit the following address, https://www.vagrantup.com/downloads.html, and click on the appropriate link for the Windows platform. Nowadays, it’s almost guaranteed that everyone would need the 64-bit executable. Only download the 32-bit program if you’re certain your machine has an older, 32-bit processor.

Now we just need to follow the steps in the install wizard, keeping the default options selected.

If at the end of the setup you’re prompted to restart your computer, please do so, to make sure all components are configured correctly.

Let’s see if the “vagrant” command is available. Click on the Start Menu, type “cmd” and open up “Command Prompt”. Next, type:

vagrant --version

If the program version is displayed, we can move on to the next section and provision our Kubernetes cluster.

2.2 Installing Vagrant on Ubuntu

First, we need to make sure that the Ubuntu Universe repository is enabled.

If that’s enabled, installing Vagrant is as simple as running the following command:

sudo apt-get update && sudo apt-get install vagrant

Finally, let’s double-check that the program was successfully installed, with:

vagrant --version

2.3 Installing Vagrant on macOS

Let’s first download the setup file from https://www.vagrantup.com/downloads.html, which, at the time of this writing, would be found at the bottom of the page, next to the macOS icon.

Once the download is finished, let’s open up the DMG file, execute the PKG inside, and go through the steps of the install wizard, leaving the default selections as they are.

Once the install is complete, we will be presented with this window.

But we can double-check if Vagrant is fully set up by opening up the terminal and typing the next command:

vagrant --version

Provisioning the Kubernetes Cluster 

Vagrant will interact with the VirtualBox API to create and set up the required virtual machines for our cluster. Here’s a quick overview of the workflow.

Once Vagrant finishes the job, we will end up with three virtual machines. One machine will be the master node and the other two will be worker nodes.

Let’s first download the files that we will use with Vagrant, from https://github.com/coralogix-resources/elastic-cloud-on-kubernetes-webinar/raw/master/k8s_ubuntu.zip

Credit for files: https://bitbucket.org/exxsyseng/k8s_ubuntu/src/master/

Next, we have to extract the directory “k8s_ubuntu” from this ZIP file.

Now let’s continue, by entering the directory we just unzipped. You’ll need to adapt the next command to point to the location where you extracted your files. 

For example, on Windows, if you extracted the directory to your Desktop, the next command would be “cd Desktop\k8s_ubuntu”. 

On Linux, if you extracted to your Downloads directory, the command would be “cd Downloads/k8s_ubuntu”.

cd k8s_ubuntu

We’ll need to be “inside” this directory when we run a subsequent “vagrant up” command.

Let’s take a look at the files within. On Windows, enter:

dir

On Linux/macOS, enter:

ls -lh

The output will look something like this:

We can see a file named “Vagrantfile”. This is where the main instructions exist, telling Vagrant how it should provision our virtual machines.

Let’s open the file, since we need to edit it:

Note: In case you’re running an older version of Windows, we recommend you edit in WordPad instead of Notepad. Older versions of Notepad have trouble interpreting EOL (end of line) characters in this file, making the text hard to read since lines wouldn’t properly be separated.

Look for the text “v.memory” found under the “Kubernetes Worker Nodes” section. We’ll assign this variable a value of 4096, to ensure that each Worker Node gets 4 GB of RAM, because Elasticsearch requires at least this amount to function properly with the four nodes we will add later on. We’ll also change “v.cpus” and assign it a value of 2 instead of 1.

After we save our edited file, we can finally run Vagrant:

vagrant up

Now, this might take a while since there are quite a few things that need to be downloaded and set up. We’ll be able to follow its progress in the output and we may get a few prompts to accept some changes.

When the job is done, we can SSH into the master node by typing:

vagrant ssh kmaster

Let’s check if Kubernetes is up and running:

kubectl get nodes

This will list the nodes that make up this cluster:

Pretty awesome! We are well on our way to implementing the ELK stack on Kubernetes. So far, we’ve created our Kubernetes cluster and just barely scratched the surface of what we can do with such automation tools. 

Stay tuned for more about Running ELK on Kubernetes with the rest of the series!

Part 2 – Coming December 22nd, 2020

Part 3 – Coming December 29th, 2020

Kibana Canvas: An In-Depth Guide

When we look at information, numbers, percentages, statistics, we tend to have an easier time understanding and interpreting them if they’re also represented by corresponding visual cues. Kibana Canvas is a tool that helps us present our Elasticsearch data with infographic-like dashboards – fully visual, dynamic, and live.

What is Kibana Canvas?

Kibana Canvas

A Kibana Canvas presentation is a bit like a PowerPoint presentation. We can generate bar charts, plots, and graphs or fully customized visualizations to showcase pertinent data that address specific needs. But Canvas goes way beyond PowerPoint. Kibana Canvas extracts whatever live data from our Elasticsearch cluster we need, processes it to prepare it for our visualizations, and then generates the graphical representations that we define.

Kibana Canvas Basics

The first step in Canvas is to create a so-called workpad, which is basically our workspace where we build the graphical representation of our Elasticsearch data. Workpads can be composed of a single page, or multiple pages. You could think of these pages as dashboards. Within the pages, we add graphical elements to represent our data any way we need to. 

Let’s go through a few examples of the elements that Kibana Canvas has to offer:

  • Charts which include area, bubble, coordinate, donut, bar and other types of charts
  • Shapes and text boxes that are formatted with Markdown, which is a plain-text-formatting syntax
  • Images that can either repeat a number of times or be partially revealed, based on the data they represent.
  • Other supporting elements such as dropdown filters or time filters, which give us a way to limit what data is considered for graphical representation. For example, we can select to only include data inserted in the last month. This means we can create workpads that change their content on the fly based on user input – making it more of an app-like experience.

Canvas Data Sources

Each element needs a way to extract the data that will be represented. The data sources an element can use include:

  • Elasticsearch SQL queries. The SQL syntax gives us a lot of flexibility in terms of what data we want to pull from Elasticsearch and how we want to present it to Canvas.
  • Timelion expressions to work with time-based data.
  • Raw documents pulled directly from indexed documents.

Piping Functions

Everything that happens in Canvas is driven by a custom expression language, similar to Kibana Timelion. We chain defined functions by piping results (which are also known as contexts). Piping simply means that we take the result from one function and send it to another function, to be further processed. All of this gives us a lot of freedom in choosing how to extract the data we need and transform it in a way that is usable and meaningful for our infographics.

Here, we have an example of a custom expression for Canvas:

In this example of a Metric element definition we can see that:

  • It begins with the “filters” function to ensure that any global filter defined in the workpad, for example, a date picker, will also apply to this element. 
  • Next, “essql” is specified as the data source, which then defines an Elasticsearch SQL query.
  • The “math” function calculates the value for the metric that will be displayed. Here, it simply passes along the value of “count_documents” which, as its name implies, holds the total number of documents returned by our defined query.
  • The next line, starting with the “metric” function, defines the element type we want to use along with its formatting.
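Since the screenshot of that expression isn’t reproduced here, here is a minimal sketch of what such a Metric definition could look like (the index name and label are illustrative, not taken from the original example):

filters
| essql query="SELECT COUNT(*) AS count_documents FROM my_index"
| math "count_documents"
| metric "Total documents"
| render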

To get a more in-depth look at Timelion functions, check out this article on the topic which also includes a summary of Elasticsearch SQL functions for quick reference.

Well, that covers the basics of how Canvas works. So now let’s learn to create our own Kibana Canvas workpad and see what it can do for our users.

Hands-on Exercises

Sample Data

To get started with the exercise, we need to get some sample data to work with. We’ll accomplish this with the NGINX web server log provided by Elastic.

To get the sample data:

  • Download the log from GitHub repository, with the wget command
  • Convert the “nginx_json_logs” file with awk to a form that can be processed by the Elasticsearch _bulk API, saving the converted file as “nginx_json_logs_bulk”
  • Create the index and basic mappings. We especially need to map Nginx’s timestamps to a datetime format, so they can be understood and used by Elasticsearch
  • Finally, we’ll index the data with the _bulk API

Here are the commands, in the order they were specified:

wget https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/nginx_json_logs/nginx_json_logs
 
awk '{print "{\"index\":{}}\n" $0}' nginx_json_logs > nginx_json_logs_bulk
 
 
curl --request PUT "https://localhost:9200/nginx" \
--header 'Content-Type: application/json' \
-d '{
   "settings": {
       "number_of_shards": 1,
       "number_of_replicas": 0
   },
   "mappings": {
       "properties": {
           "time": {"type":"date","format":"dd/MMM/yyyy:HH:mm:ss Z"},
           "response": {"type":"keyword"}
       }
   }
}'
 
curl --silent --request POST 'https://localhost:9200/nginx/_doc/_bulk' \
--header 'Content-Type: application/x-ndjson' \
--data-binary '@nginx_json_logs_bulk' | jq '.errors'

After the last command, we should look for the message “false” in the output, which signals that there were no errors.

Tips for an Efficient Workflow in Canvas

In Canvas, we’ll often use Elasticsearch SQL to retrieve the data for our visualizations. To prepare our queries and test out different variations, we can use Kibana DevTools. To do this, we just need to wrap the SQL query in a simple JSON object and then POST it to the _sql API endpoint. We’ll also need to add “?format=txt” as a query parameter to our POST request, to see the returned result nicely formatted in a text-based table.
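For example, assuming the nginx index we just created, a quick test of a query in DevTools could look like this:

POST _sql?format=txt
{
  "query": "SELECT response, COUNT(*) AS count_responses FROM nginx GROUP BY response"
}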

While working with an element, we can click the Expression editor in the bottom-right corner. This gives us much more flexibility in changing how the element works. Once we get comfortable with the expression language used in Canvas, we’ll find it a more efficient way to work than clicking through buttons and menus in the interface to edit various functionality.

The documentation for Canvas functions and TinyMath functions will come in handy when you dive deeper into the expression editor.

Getting Started

Let’s create our first workpad in Canvas and learn how to build an infographic like the one in this example:

nginx

Adding a New Element

In the top-right corner, above our workpad, notice the Add element button. Let’s click on it and reveal a list of available elements with clear visual representations and descriptions. 

Let’s add a new element and pick the Shape type from that list. We’ll use this to create a green background for the right side of our infographic.

Let’s open the Display section in the right pane and find the Fill box. We’ll use the following color code for the green:

#0b974d

Metrics

Now it’s time to add our metrics on the right side. To do this, we add a new element and select the Metric type from that list. After we drag it to its place, we also change the Metric and Label text color to white.

Finally, we get to the most important part, bringing in the data from our Nginx index. 

After we select the element we want to work with (in this case, the metric we just added), we look in the right pane of Canvas for Selected element, switch from the Display tab to the Data tab (as we see in this image), and change the data source to Elasticsearch SQL. To change the data source, we click on the underlined, bold text, which might initially display “Demo data” on some installations.

elasticsearch sql data

We’ll use the following SQL query, which will return the total number of documents in our index:

SELECT COUNT(*) AS count_documents FROM nginx

After hitting Save we’ll see a warning sign on the screen. The metric doesn’t yet have a source value to use to display data. To fix this, we need to switch back to the Display tab and select count_documents as our Value (as seen in this image). “count_documents” is generated by our previous SQL query.

display count documents value

We should also enter the text “Logs”, to use as our Label.

The general procedure for the rest of the metrics is the same.

On the right side, in the Selected element pane, we can see the three vertical dots menu option, in the top-right corner. By clicking on it, we can then Clone the element to quickly add similar ones. We can also do this by simply copying with CTRL+C and then pasting with CTRL+V. There’s also an option for saving elements, which we find in the My elements tab when adding a new element.

element options

To add the rest of the metrics, clone the previous one three times and then add the following SQL queries under each one’s Data source.

For the first of the cloned metrics, we’ll use the following:

SELECT SUM(bytes) as bytes FROM nginx

And for the second metric:

SELECT COUNT(DISTINCT remote_ip) as remote_ip FROM nginx

Finally, for the third metric:

SELECT COUNT(DISTINCT agent) as agents FROM nginx

Don’t forget to follow the same procedure as we did with our first metric:

  • Point the Value field to use the column resulting from the SQL query.
  • And change the label of the metric.

Image and Image Repeat

We have two image elements in our infographic.

The first one will be used as the logo. To add the logo, we add an Image element and then change the link to point to the Nginx logo.

nginx site url

Next, we’ll add an Image repeat element. As the name suggests, this repeats the defined image, for a number of times, based on underlying data. Similarly, Image reveal shows a portion (or percentage) of the image, based on underlying data.

We’ll use a simple globe icon from Wikipedia to represent the 136 unique agents in the green section on the right side of our workpad. After adding the element, we need to explicitly set the Image Size property and adjust it so that all the repeated images fit in the available space. We then use the same SQL query in the Data section as we used for our “Unique agents” metric. 

image size

For reference, here is that SQL query again:

SELECT COUNT(DISTINCT agent) as agents FROM nginx

Now the globe image should appear 136 times. Pretty cool! But there’s a lot more. 

Text

Now, let’s add the Text elements.

After inserting the first one, we navigate to the Display section to edit the text, where we copy and paste the following Markdown-formatted content:

## REQUESTS STATISTICS - **NUMBER OF REQUESTS**

For the next text element, we add this content:

## TOP 5 IP ADDRESSES - **TRANSFERRED BYTES**

Tables

Now it’s time to add the two Data table elements. The first one will display our request statistics. 

The following SQL will select the data we need, grouping results based on the request field and ordering them in descending order, according to the count of requests:

SELECT request, COUNT(*) AS count_requests FROM nginx GROUP BY request ORDER BY count_requests DESC

For nicer formatting, we can go to the Display section and hide the Pagination and column Header.

The second table will be very similar, but we’ll learn how to use functions as well.

First, let’s insert the following SQL content for our data source:

SELECT remote_ip, SUM(bytes) AS total_transferred FROM nginx GROUP BY remote_ip ORDER BY total_transferred DESC NULLS LAST LIMIT 5

Now let’s open up the Expression editor, from the bottom-right corner, and use the mapColumn function. Paired with getCell and formatnumber functions, it allows us to change the number format of each value in the total_transferred column and display them in so-called “human-readable” formats, such as GB, MB, KB for gigabytes, megabytes and kilobytes.

So this is the expression we’ll add to do that:

| mapColumn "total_transferred" fn={ getCell column="total_transferred" | formatnumber  "0.00b"}

It should be placed right after the ESSQL source definitions, after the SQL query line.

The function uses the NumeralJS library, which defines the formatting notations that can be applied.

Charts

The final elements that we’ll add are the charts.

Bar Charts

First, let’s add the Horizontal bar chart to visualize the request statistics from our table. We’ll use this SQL query in the Data section:

SELECT request, COUNT(*) AS count_requests FROM nginx GROUP BY request ORDER BY count_requests DESC

Now, we need to link it with the data returned by the query:

  • For the X-axis, we pick Value and count_requests
  • For the Y-axis, we also pick Value and then request.

We can also disable the axis labels, to give it a cleaner look. To do that, we click on the sliding buttons next to X-axis and Y-axis.

Under the Default style, we’ll pick a green fill color to make it fit with the rest of the design.

display values

Progress Charts

Let’s try taking this even further using Canvas expressions. We’ll apply it to a new element that we’ll add, a Progress Gauge chart. 

This progress chart will show the percentage of traffic transferred to the top 5 IP addresses. 

After inserting this chart element, we apply this SQL query to its Data section:

SELECT SUM(bytes) AS total_transferred_top5 FROM nginx GROUP BY remote_ip ORDER BY total_transferred_top5 DESC NULLS LAST LIMIT 5

This query alone, however, isn’t enough. That’s because the Progress Gauge chart expects a single value between 0 and 1, where something like 0.56 would represent 56%. Once again, we’ll achieve what we need with the very useful Expression editor.

The expression will divide the total transferred to the top 5 IPs (which we get from the first SQL query) by the grand total transferred to all IPs, which we get from a second essql query.

The whole expression will look like this:

And this is the important part that calculates the number between 0 and 1 (the percentage), for the progress gauge element.

| math {string "sum(total_transferred_top5)/" {filters | essql query="SELECT SUM(bytes) AS total_transferred FROM nginx GROUP BY remote_ip ORDER BY total_transferred DESC NULLS LAST" | math "sum(total_transferred)"}}

After adding this to the Expression editor, we click Run.

We should also choose the same green color fill for the element and also pick a bigger font size.

We finally created the entire infographic example we initially saw.

Congratulations, you’re ready to explore the incredible capabilities that Kibana Canvas has to offer – beautiful dynamic and interactive representations of your data in exactly the form your users need.

Cleaning Up

To make sure we can continue with the next lesson in a clean environment, let’s remove the changes we created in this lesson.

First, we can remove the Nginx log files like so:

rm nginx_*

And then, we remove the index where we saved the Nginx sample log data:

curl --request DELETE localhost:9200/nginx

Finally, we remove the workpad we created in Kibana Canvas. To do this, we click in the bottom-left corner, on My Canvas Workpad.

Afterward, we check the box to the left of My Canvas Workpad. Finally, we click on the Delete button. And that gets us back to where we first started. 


8 Common Elasticsearch Configuration Mistakes That You Might Have Made

Elasticsearch was designed to allow its users to get up and running quickly, without having to understand all of its inner workings. However, more often than not, it’s only a matter of time before you run into configuration troubles. 

Elasticsearch is open-source software that indexes and stores information in a NoSQL database and is based on the Lucene search engine. Elasticsearch is also part of the ELK Stack. Despite its increasing popularity, there are several common and critical mistakes that users tend to make while using the software. 

Below are the most common Elasticsearch mistakes when setting up and running an Elasticsearch instance and how you can avoid making them.

1. Elasticsearch bootstrap checks failed

Bootstrap checks inspect various settings and configurations before Elasticsearch starts to make sure it will operate safely. If bootstrap checks fail, they can prevent Elasticsearch from starting if you are in production mode or issue warning logs in development mode. Familiarize yourself with the settings enforced by bootstrap checks, noting that they are different in development and production modes. By setting the system property es.enforce.bootstrap.checks to true, you can force the bootstrap checks to run even in development mode, so problems surface before you move to production.
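As a minimal sketch, assuming an installation where you can edit the JVM options file, the property can be added like this:

# in config/jvm.options (location depends on your installation)
-Des.enforce.bootstrap.checks=true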

2. Oversized templating

Large templates are directly related to large mappings. In other words, if you create a large mapping for Elasticsearch, you will have issues syncing it across the nodes in your cluster, even if you apply it as an index template. 

The issues with big index templates are mainly practical. You might need to do a lot of manual work, with the developer acting as a single point of failure, and the problem can also affect Elasticsearch itself. You will always need to remember to update your template when you make changes to your data model.

Solution

A solution to consider is the use of dynamic templates. Dynamic templates can automatically add field mappings based on your predefined mappings for specific types and names. However, you should always try to keep your templates small in size. 
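As an illustration, here is a minimal dynamic template that maps every new string field to a keyword while keeping the template itself small (the template name and index pattern are just examples, and this uses the legacy _template API):

curl -XPUT "localhost:9200/_template/logs_defaults" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["logs-*"],
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": { "type": "keyword" }
        }
      }
    ]
  }
}'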

3. Elasticsearch configuration for capacity provisioning

Provisioning can help to equip and optimize Elasticsearch for operational performance. Elasticsearch is designed to keep nodes up, stop memory from growing out of control, and prevent unexpected actions from shutting down nodes. However, with inadequate resources, no amount of optimization will save you.

Solution

Ask yourself: ‘How much space do you need?’ You should first simulate your use-case. Boot up your nodes, fill them with real documents, and push them until the shard breaks. You can then start defining a shard’s capacity and apply it throughout your entire index. 

It’s important to understand resource utilization during the testing process. This allows you to reserve the proper amount of RAM for nodes, configure your JVM heap space, configure your CPU capacity, provision through scaling larger instances with potentially more nodes, and optimize your overall testing process. 

4. Not defining Elasticsearch configuration mappings

Elasticsearch relies on mapping, also known as schema definitions, to handle data properly according to its correct data type. In Elasticsearch, mapping defines the fields in a document and specifies their corresponding data types, such as date, long, and string. 

In cases where an indexed document contains a new field without a defined data type, Elasticsearch uses dynamic mapping to estimate the field’s type, converting it from one type to another when necessary. While this may seem ideal, Elasticsearch mappings are not always accurate. If, for example, you choose the wrong field type, then indexing errors will pop up. 

Solution

To fix this issue, you should define mappings, especially in production-based environments. It’s a best practice to index several documents, let Elasticsearch guess the field, and then grab the mapping it creates. You can then make any appropriate changes that you see fit without leaving anything up to chance.
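A minimal sketch of that workflow, with hypothetical index names (“test-logs” for the throwaway index, “logs-prod” for the real one):

# 1. index a few sample documents into a throwaway index, then inspect the guessed mapping
curl -XGET "localhost:9200/test-logs/_mapping?pretty"

# 2. create the production index with an explicit, corrected mapping
curl -XPUT "localhost:9200/logs-prod" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date" },
      "status":    { "type": "keyword" },
      "bytes":     { "type": "long" }
    }
  }
}'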

5. Combinatorial data ‘explosions’

Combinatorial data explosions are computing problems that can cause exponential growth in bucket generation for certain aggregations and can lead to uncontrolled memory usage. Elasticsearch’s ‘terms’ aggregation builds buckets according to your data, but it cannot predict in advance how many buckets will be created. This can be problematic for parent aggregations that are made up of more than one child aggregation.

Solution

Collection modes can be used to control how child aggregations are computed. The default collection mode of an aggregation is ‘depth-first’: it first builds the whole data tree and then trims the edges. Elasticsearch allows you to change the collection mode of specific aggregations to something more appropriate, such as ‘breadth-first’, which builds and trims the tree one level at a time, keeping combinatorial data explosions under control.
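As a sketch, switching a terms aggregation to breadth-first collection only takes the collect_mode parameter; the index and field names here are hypothetical:

curl -XGET "localhost:9200/movies/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "actors": {
      "terms": { "field": "actor", "size": 10, "collect_mode": "breadth_first" },
      "aggs": {
        "co_actors": {
          "terms": { "field": "actor", "size": 5 }
        }
      }
    }
  }
}'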

6. Search timeout errors

If you don’t receive an Elasticsearch response within the specified search period, the request fails and returns an error message. This is called a search timeout. Search timeouts are common and can occur for many reasons, such as large datasets or memory-intensive queries.

Solution

To eliminate search timeouts, you can increase the Elasticsearch Request Timeout configuration, reduce the number of documents returned per request, reduce the time range, tweak your memory settings, and optimize your query, indices, and shards. You can also enable slow search logs to monitor search run time and scan for heavy searches.
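For instance, slow search logs can be enabled per index with a few dynamic settings (the index name and thresholds below are arbitrary examples):

curl -XPUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}'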

7. Process memory locking failure

As memory runs out in your JVM, it will begin to use swap space on the disk. This has a devastating impact on the performance of your Elasticsearch cluster.

Solution

The simplest option is to disable swapping. You can do this by setting bootstrap.memory_lock to true. You should also ensure that you’ve set up memory locking correctly by consulting the Elasticsearch configuration documentation.
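A minimal sketch of that setting and a quick way to verify it after a restart (paths and output vary by installation):

# in elasticsearch.yml
bootstrap.memory_lock: true

# after restarting Elasticsearch, check that memory locking is active
curl -XGET "localhost:9200/_nodes?filter_path=**.mlockall&pretty"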

8. Shards are failing

When searching in Elasticsearch, you may encounter ‘shards failure’ error messages. This happens when a read request fails to get a response from a shard. This can happen if the data is not yet searchable because the cluster or node is still in an initial start process, or when the shard is missing, or in recovery mode and the cluster is red.

Solution

To ensure better management of shards, especially when dealing with future growth, you are better off reindexing the data and specifying more primary shards in newly created indexes. To optimize your use case for indexing, make sure you designate enough primary shards so that you can spread the indexing load evenly across all of your nodes. You can also consider disabling merge throttling, increasing the size of the indexing buffer, and refreshing less frequently by increasing the refresh interval.
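As a rough sketch, with hypothetical index names and shard counts, the reindexing approach could look like this:

# create a new index with more primary shards
curl -XPUT "localhost:9200/my-index-v2" -H 'Content-Type: application/json' -d'
{ "settings": { "number_of_shards": 6, "number_of_replicas": 1 } }'

# copy the data over with the _reindex API
curl -XPOST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{ "source": { "index": "my-index" }, "dest": { "index": "my-index-v2" } }'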

Summary

When set up, configured, and managed correctly, Elasticsearch is a powerful distributed full-text search and analytics engine. It enables multiple tenants to search through their entire data sets, regardless of size, at unprecedented speeds. Elasticsearch also doubles as an analytics system and distributed database. While these capabilities are impressive on their own, Elasticsearch combines all of them to form a real-time search and analytics application that can keep up with customer needs.

Errors, exceptions, and mistakes arise while operating Elasticsearch. To avoid them, pay close attention to initial setup and configuration and be particularly mindful when indexing new information. You should have strong monitoring and observability in your system, which is the first basic component of quickly and efficiently getting to the root of complex problems like cluster slowness. Instead of fearing their appearance, you can treat errors, exceptions, or mistakes, as an opportunity to optimize your Elasticsearch infrastructure.

SIEM Tutorial: What should a good SIEM Provider do for you?

Modern day Security Information and Event Management (SIEM) tooling is enterprise security technology that combines systems for a comprehensive view of IT security. This can be tricky, so we’ve put together a simple SIEM tutorial to help you understand what a great SIEM provider will do for you.

A SIEM’s responsibility is to collect, store, analyze, investigate and report on log and other data for incident response, forensics and regulatory compliance purposes. This SIEM tutorial will explain the technical capabilities that a SIEM provider should have.

Technical Capabilities of Modern SIEM Providers

Data Collection

One of the best understood SIEM capabilities, data collection or log management, collects and stores the log files from multiple disparate hosts in a centralized location. This allows your security teams to easily access this information. Furthermore, the log management process can reformat the data it receives so that it is all consistent, making analysis less tedious and confusing.

A SIEM can collect data in several ways and the most reliable methods include:

  • Using an agent installed on the device to collect from
  • By directly connecting to a device using a network protocol or API call
  • By accessing log files directly from storage, typically in Syslog format or CEF (Common Event Format)
  • Via an event streaming protocol like Netflow or SNMP

A core function of a SIEM is the ability to collect data from multiple devices, standardize it, and save it in a format that enables analysis.

Advanced Analytics

A SIEM tool needs to analyze security data in real time. Part of this process involves using ‘Advanced Analytics’. SIEM or ‘Next Gen SIEM’ tools of today have been extending their capabilities to more frequently include analytics functions. These automated analytics run in the background to proactively identify possible security breaches within businesses’ systems. An effective SIEM provides advanced analytics by using statistics, descriptive and predictive data mining, machine learning, simulation, and optimization. Together they produce additional critical insights. Key advanced analytics methods include anomaly detection, peer group profiling, and entity relationship modeling.

New methods, modeling, or machine learning techniques for generating these analytics can be defined using SIEM tooling services such as query editors (using KQL – Kibana Query Language), UEBA (User and Entity Behavior Analytics), and EBA (Entity Behavior Analysis) models. All of these are configurable through user interfaces and can be managed using prebuilt or custom incident timelines. They can flag anomalies and display details of an incident for the full scope of the event and its context. 

As these analytics functions become more standard, some SIEM vendors are pairing the traditional log collection with threat detection and response automation. This is key to producing insights from high volumes of data, where machine learning can automate this analysis to identify hidden threats.

Advanced Threat Detection

Security threats will continually evolve. A ‘Next Gen’ SIEM should be able to adapt to new advanced threats by implementing network security monitoring, endpoint detection and response, sandboxing, and behavior analytics in combination with one another to identify and quarantine new potential threats. Security teams need specialized tools to monitor, analyze and detect threats across the cyber kill chain. A cyber kill chain is the series of steps that trace the stages of a cyberattack, from early reconnaissance to the exfiltration of data. By configuring machine learning and search context analytics services, you can implement a risk-based monitoring strategy that automatically identifies and prioritizes attacks and threats, so the security team can quickly spot and investigate true dangers to your environment.

These are configurations that can be defined in modern day SIEMs and will result in real-time threat detection monitoring of endpoints, covering everything from anti-virus logs and insecure ports to cloud services. The goal of threat detection should be not only to detect threats, but also to determine their scope by identifying where a specific advanced threat may have moved after being initially detected, how that threat should be contained, and how information should be shared.

Incident Response

Our SIEM tutorial wouldn’t be complete without this feature. Incident response is an organizational process that allows security teams to contain security incidents or cyber attacks and prevent or control damages. Many organizations need to develop proactive incident response capabilities that will actively search corporate systems for signs of a cyber attack. Threat hunting is the core activity of proactive incident response, which is carried out by security analysts. It typically involves querying security data using a SIEM, and running vulnerability scans or penetration tests against organizational systems. The objective is to discover suspicious activity or anomalies that represent a security incident.

An effective incident response strategy needs a robust SIEM platform to identify, track and reassign incidents. This can add the capability of automation and orchestration to your SOC (Security Operation Center) making your cyber security incident response team more productive. A SOC can use customizable tools in the SIEM, designed for security incidents, ensuring that threats do not slip through the cracks. 

Other key capabilities should include the ability to aggregate events either manually or automatically, support for APIs that can be used to pull data from or push information to third-party systems, the ability to gather legally admissible forensic evidence, and playbooks that provide organizations with guidance on how to respond to specific types of incidents. SIEM solutions should display the security information in a simple format, such as graphics or dashboards.

SOC Automation

Modern day SIEMs will leverage Security Orchestration, Automation and Response (SOAR) technology that helps identify and automatically respond to security incidents, and supports incident investigation by SOC team members. The foundational technology of a SOC is a SIEM system. The SIEM will use statistical models to find security incidents and raise alerts about them. These alerts come with crucial contextual information. A SIEM functions as a ‘single pane of glass’ which enables the SOC to monitor enterprise systems.

Monitoring is a key function of tools used in the SOC. The SOC is responsible for enterprise-wide monitoring of IT systems and user accounts, and also monitoring of the security tools themselves. An example is ensuring antivirus is installed and updated on all organizational systems. The main tool that orchestrates monitoring is the SIEM.

SOC automation benefits from SOAR integration with SIEM. A SOAR platform will take things a step further by combining comprehensive data gathering, case management, standardization, workflow and analytics to provide organizations the ability to implement sophisticated defense-in-depth capabilities. A SOAR’s main benefit to a SOC, is that it can automate and orchestrate time-consuming, manual tasks using playbooks, including opening a ticket in a tracking system, such as JIRA, without requiring any human intervention. This would allow engineers and analysts to better use their specialized skills.

All in all…

SIEM tools are proving to be more important than ever in modern cyber security. They are becoming undeniably useful in drawing together data and threats from across your IT environment into a single easy-to-use dashboard. Many ‘Next Gen SIEM’ tools are configured to flag suspect patterns on their own and sometimes even resolve the underlying issue automatically. The best SIEM tools are adept at using past trends to differentiate between actual threats and legitimate use, enabling you to avoid false alarms while simultaneously ensuring optimal protection.

This SIEM tutorial was aimed at providing context and clarity to the role of an effective SIEM provider. Any provider that is missing these capabilities should be investigated, to ensure they’re the right fit for you and your security operations.

Elasticsearch Hadoop Tutorial with Hands-on Examples

In this Hadoop Tutorial lesson, we’ll learn how we can use Elasticsearch Hadoop to process very large amounts of data. For our exercise, we’ll use a simple Apache access log to represent our “big data”. We’ll learn how to write a MapReduce job to ingest the file with Hadoop and index it into Elasticsearch.

What Is Hadoop?

When we need to collect, process/transform, and/or store thousands of gigabytes of data, thousands of terabytes, or even more, Hadoop can be the appropriate tool for the job. It is built from the ground up with ideas like this in mind:

  • Use multiple computers at once (forming a cluster) so that it can process data in parallel, to finish the job much faster. We can think of it this way. If one server needs to process 100 terabytes of data, it might finish in 500 hours. But if we have 100 servers, each one can just take a part of the data, for example, server1 can take the first terabyte, server2 can take the second terabyte, and so on. Now they each have only 1 terabyte to process and they can all work on their own section of data, at the same time. This way, the job can be finished in 5 hours instead of 500. Of course, this is theoretical and imaginary, as in practice we won’t get a 100 times reduction in the time it takes, but we can get pretty close to that if conditions are ideal.
  • Make it very easy to adjust the computing power when needed. Have a lot more data to process, and the problem is much more complex? Add more computers to the cluster. In a sense, it’s like adding more CPU cores to a supercomputer.
  • Data grows and grows, so Hadoop must be able to easily and flexibly expand its storage capacity too, to keep up with demand. Every computer we add to the cluster expands the total storage space available to the Hadoop Distributed File System (HDFS).
  • Unlike other software, it doesn’t just try to recover from hardware failure when it happens. The design philosophy actually assumes that some hardware will certainly fail. When having thousands of computers, working in parallel, it’s guaranteed that something, somewhere, will fail, from time to time. As such, Hadoop, by default, creates replicas of chunks of data and distributes them on separate hardware, so nothing should be lost when the occasional server goes up in flames or a hard-disk or SSD dies.

To summarize, Hadoop is very good at ingesting and processing incredibly large volumes of information. It distributes data across the multiple nodes available in the cluster and uses the MapReduce programming model to process it on multiple machines at once (parallel processing).

But this may sound somewhat similar to what Elasticsearch data ingestion tools do. Although they’re made to deal with rather different scenarios, they may sometimes overlap a bit. So why and when would we use one instead of the other?

Hadoop vs Logstash/Elasticsearch

First of all, we shouldn’t think in terms of which one is better than the other. Each excels at the jobs it’s created for. Each has pros and cons.

To try to paint a picture and give you an idea of when we’d use one or the other, let’s think of these scenarios:

  • When we’d need to ingest data from billions of websites, as a search engine like Google does, we’d find a tool like Elasticsearch Hadoop very useful and efficient.
  • When we need to store data and index it in such a way that it can later be searched quickly and efficiently, we’ll find something like Elasticsearch very useful.
  • And, finally, when we want to gather real time data, like the price of USD/EUR from many exchanges available on the Internet, we’d find a tool like Logstash is good for the job.

Of course, if the situation allows it, Hadoop and Elasticsearch can also be teamed up, so we can get the best of both worlds. Remember the scenario of scanning information on billions of websites? Hadoop would be great at collecting all that data and sending it to be stored in Elasticsearch. Elasticsearch would then be great at quickly returning results to the users that search through that data.

With Elasticsearch, you can think: awesome search capabilities, good enough in the analytics and data visualization department.

With Elasticsearch Hadoop, you can think: capable of ingesting and processing mind-blowing amounts of data, in a very efficient manner, and allow for complex, fine-tuned data processing.

How MapReduce Works

As mentioned, while tools like Logstash or even Spark are easier to use, they also confine us to the methods they employ. That is, we can only fine-tune the settings they allow us to adjust and we can’t change how their programming logic works behind the scenes. That’s not usually a problem, as long as we can do what we want.

With Hadoop, however, we have more control over how things work at a much lower level, allowing for much more customization and more importantly, optimization. When we deal with petabytes of data, optimization can matter a lot. It can help us reduce the time needed for a job, from months to weeks, and significantly reduce operation costs and resources needed.

Let’s take a first look at MapReduce, which adds complexity to our job but also allows for the higher level of control mentioned earlier.

A MapReduce procedure typically consists of three main stages: Map, Shuffle and Reduce.

Initially, data is split into smaller chunks that can be spread across different computing nodes. Next, every node can execute a map task on its received chunk of data. This kind of parallel processing greatly speeds up the procedure. The more nodes the cluster has, the faster the job can be done.

Pieces of mapped data, in the form of key/value pairs, now sit on different servers. All the values with the same key need to be grouped together. This is the shuffle stage. Next, shuffled data goes through the reduce stage.

This image exemplifies these stages in action on a collection of three lines of words.

Here, we assume that we have a simple text file and we need to calculate the number of times each word appears within.

The first step is to read the data and split it into chunks that can be efficiently sent to all processing nodes. In our case, we assume the file is split into three lines.

Map Stage

Next comes the Map stage. Lines are used as input for the map(key, value, context) method. This is where we’ll have to program our desired custom logic. For this word count example, the “value” parameter will hold the line input (line of text from file). We’ll then split the line, using the space character as a word separator, then iterate through each of the splits (words) and emit a map output using context.write(key, value). Here, our key will be the word, for example, “Banana” and the value will be 1, indicating it’s a single occurrence of the word. From the image above we can see that for the first line we get <Banana, 1>, <Apple, 1>, <Mango, 1> as key/value pairs.

Shuffle Stage

The shuffle stage is responsible for taking <key, value> pairs from the mapper and, based on a partitioner, deciding which reducer each pair goes to.

From the image showing each stage in action, we can see that we end up with five partitions in the reduce stage. Shuffling is done internally by the framework, so we will not have any custom code for that here.

Reduce Stage

The output of the shuffle stage is fed into the reduce stage: as its input, each reducer receives one of the groups formed in the shuffle stage. This consists of a key and a list of values related to the key. Here, we again have to program custom logic we want to be executed in this stage. In this example, for every key, we have to calculate the sum of the elements in its value list. This way, we get the total count of each key, which ultimately represents the count for each unique word in our text file.

The output of the reduce stage also follows the <key, value> format. As mentioned, in this example, the key will represent the word and the value the number of times the word has been repeated.
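As a loose, single-machine analogy (not how Hadoop is implemented internally), the same word count can be expressed as a shell pipeline, where tr plays the role of the map stage, sort the shuffle, and uniq -c the reduce; “words.txt” is a hypothetical input file:

# map: split lines into words, shuffle: sort them, reduce: count each unique word
tr ' ' '\n' < words.txt | sort | uniq -c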

Hands-On Exercise

OpenJDK Prerequisite

Wow! There’s a lot of theory behind Hadoop, but practice will help us cement the concepts and understand everything better.

Let’s learn how to set up a simple Hadoop installation.

Since Elasticsearch is already installed, the appropriate Java components are already installed too. We can verify with:

java -version

This should show us an output similar to this:

openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)

OpenJDK is required by Hadoop and on an instance where this is not available, you can install it with a command such as “sudo apt install default-jdk”.

Create a Hadoop User

Now let’s create a new user called “hadoop”. Hadoop related processes, such as the MapReduce code we’ll use, will run under this user. Remember the password you set for this user, as it’s needed later when logging in and when using sudo commands while logged in.

sudo adduser hadoop

We’ll add the user to the sudo group, to be able to execute some later commands with root privileges.

sudo usermod -aG sudo hadoop

Let’s log in as the “hadoop” user.

su - hadoop

Install Hadoop Binaries 

Note: For testing purposes commands below can be left unchanged. On production systems, however, you should first visit https://www.apache.org/dyn/closer.cgi/hadoop/common/stable and find out which Hadoop version is the latest stable one. Afterward, you will need to modify “https” links to point to the latest stable version and change text strings containing “hadoop-3.2.1” in commands used below to whatever applies to you (as in, change “3.2.1” version number to current version number). It’s a very good idea to also follow instructions regarding verifying integrity of downloads with GPG (verify signatures).

While logged in as the Hadoop user, we’ll download the latest stable Hadoop distribution using the wget command.

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Next, let’s extract the files from the tar archive compressed with gzip.

tar -xvzf hadoop-3.2.1.tar.gz

Once this is done, we’ll move the extracted directory to “/usr/local/hadoop/”.

sudo mv hadoop-3.2.1 /usr/local/hadoop

With the method we followed, the “/usr/local/hadoop” directory should already be owned by the “hadoop” user and group. But to make sure this is indeed owned by this user and group, let’s run the next command.

sudo chown -R hadoop:hadoop /usr/local/hadoop

Hadoop uses environment variables to orient itself about the directory paths it should use. Let’s set these variables according to our setup.

nano ~/.bashrc

Let’s scroll to the end of the file and add these lines:

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"


To quit the nano editor and save our file we’ll first press CTRL+X, then type “y” and finally press ENTER.

To make the environment variables specified in the “.bashrc” file take effect, we’ll use:

source ~/.bashrc

Configure Hadoop Properties

Hadoop needs to know where it can find the Java components it requires. We point it to the correct location by using the JAVA_HOME environment variable.

Let’s see where the “javac” binary is located:

 readlink -f $(which javac)

In the case of OpenJDK 11, this will point to “/usr/lib/jvm/java-11-openjdk-amd64/bin/javac“.

We’ll need to copy the path starting with “/usr” and ending with “openjdk-amd64“, which means we exclude the last part: “/bin/javac” in this case.

In the case of OpenJDK 11, the path we’ll copy is:

/usr/lib/jvm/java-11-openjdk-amd64

and we’ll paste it after “export JAVA_HOME=” in the next step.

Let’s open the “hadoop-env.sh” file in the nano editor and add this path to the JAVA_HOME variable.

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

We’ll scroll to the end of the file and add this line:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Remember, if the OpenJDK version you’re using is different, you will need to paste a different string of text after “export JAVA_HOME=“.

Once again, we’ll press CTRL+X, then type “y” and finally press ENTER to save the file.

Let’s test if our setup is in working order.

hadoop version

We should see an output similar to this

Hadoop 3.2.1
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
Compiled by rohithsharmaks on 2019-09-10T15:56Z
Compiled with protoc 2.5.0
From source with checksum 776eaf9eee9c0ffc370bcbc1888737
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

Creating MapReduce Project

In this exercise, we’ll index a sample access log file which was generated in the Apache Combined Log Format. We’ll use the maven build tool to compile our MapReduce code into a JAR file.

In a real scenario, you would have to follow a few extra steps:

  • Install an integrated development environment (IDE) that includes a code editor, such as Eclipse, to create a project and write the necessary code.
  • Compile project with maven, on local desktop.
  • Transfer compiled project (JAR), from local desktop to your Hadoop instance.

We’ll explain the theory behind how you would create such a project, but we’ll also provide a GitHub repository containing a ready-made, simple Java project. This way, you don’t have to waste time writing code for now, and can just start experimenting right away and see MapReduce in action. Furthermore, if you’re unfamiliar with Java programming, you can take a look at the sample code to better understand where all the pieces go and how they fit.

So, first, let’s look at the theory and see how we would build MapReduce code, and what is the logic behind it.

Setting Up pom.xml Dependencies

To get started, we would first have to create an empty Maven project using the code editor we prefer. Both Eclipse and IntelliJ have built-in templates to do this. We can skip archetype selection when creating the maven project; an empty maven project is all we require here.

Once the project is created, we would edit the pom.xml file and use the following properties and dependencies. Some version numbers specified below may need to be changed in the future, when new stable versions of Hadoop and Elasticsearch are used.

 
<properties>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
</properties>

<dependencies>
   <dependency>
       <groupId>org.apache.hadoop</groupId>
       <artifactId>hadoop-client</artifactId>
       <version>3.2.1</version>
   </dependency>
   <dependency>
       <groupId>org.elasticsearch</groupId>
       <artifactId>elasticsearch-hadoop-mr</artifactId>
       <version>7.8.0</version>
   </dependency>
   <dependency>
       <groupId>commons-httpclient</groupId>
       <artifactId>commons-httpclient</artifactId>
       <version>3.1</version>
   </dependency>
</dependencies>

The hadoop-client library is required to write MapReduce jobs. In order to write to an Elasticsearch index, we are using the official elasticsearch-hadoop-mr library. commons-httpclient is needed too, because elasticsearch-hadoop-mr uses it to make REST calls to the Elasticsearch server over the HTTP protocol.

Defining the Logic of Our Mapper Class

We’ll define AccessLogMapper and use it as our mapper class. Within it, we’ll override the default map() method and define the programming logic we want to use.

import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class AccessLogIndexIngestion {

   public static class AccessLogMapper extends Mapper {
       @Override
       protected void map(Object key, Object value, Context context) throws IOException, InterruptedException {
       }
   }

   public static void main(String[] args) {

   }
}

As mentioned before, we don’t need a reducer class in this example.

Defining the Elasticsearch Index for Hadoop

Here is a sample of what the log file looks like

77.0.42.68 - - [17/May/2015:23:05:48 +0000] "GET /favicon.ico HTTP/1.1" 200 3638 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0"
77.0.42.68 - - [17/May/2015:23:05:32 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "https://www.semicomplete.com/projects/xdotool/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0"
77.0.42.68 - - [18/May/2015:00:05:08 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "https://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0"
207.241.237.101 - - [18/May/2015:00:05:42 +0000] "GET /blog/geekery/find-that-lost-screen-session.html HTTP/1.0" 200 11214 "https://www.semicomplete.com/blog/tags/tools" "Mozilla/5.0 (compatible; archive.org_bot +https://www.archive.org/details/archive.org_bot)"
120.202.255.147 - - [18/May/2015:00:05:57 +0000] "GET /files/logstash/logstash-1.1.0-monolithic.jar HTTP/1.1" 304 - "-" "Mozilla/5.0 Gecko/20100115 Firefox/3.6"
207.241.237.104 - - [18/May/2015:00:05:43 +0000] "GET /geekery/find-that-lost-screen-session-2.html HTTP/1.0" 404 328 "https://www.semicomplete.com/blog/tags/tools" "Mozilla/5.0 (compatible; archive.org_bot +https://www.archive.org/details/archive.org_bot)"

We’ve dealt with theory only up to this point, but here, it’s important we execute the next command.

Let’s send this curl request to define the index in Elasticsearch. For the purpose of this exercise, we ignore the last two columns in the log in this index structure.

curl -X PUT "localhost:9200/logs?pretty" -H 'Content-Type: application/json' -d'
{
	"mappings" : {
    	"properties" : {
        	"ip" : { "type" : "keyword" },
        	"dateTime": {"type" : "date", "format" : "dd/MMM/yyyy:HH:mm:ss"},
        	"httpStatus": {"type" : "keyword"},
        	"url" : { "type" : "keyword" },
        	"responseCode" : { "type" : "keyword" },
        	"size" : { "type" : "integer" }

    	        }
	}
}
'

Having the dateTime field defined as a date is essential since it will enable us to visualize various metrics using Kibana. Of course, we also needed to specify the date/time format used in the access log, “dd/MMM/yyyy:HH:mm:ss”, so that values passed along to Elasticsearch are parsed correctly.

Defining map() Logic

Since our input data is a text file, we use the TextInputFormat.class. Every line of the log file will be passed as input to the map() method.

Finally, we can define the core logic of the program: how we want to process each line of text and get it ready to be sent to the Elasticsearch index, with the help of the EsOutputFormat.class.

The value input parameter of the map() method holds the line of text currently extracted from the log file and ready to be processed. We can ignore the key parameter for this simple example.

import org.elasticsearch.hadoop.util.WritableUtils;
import org.apache.hadoop.io.NullWritable;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

@Override
protected void map(Object key, Object value, Context context) throws IOException, InterruptedException {

   String logEntry = value.toString();
   // Split on space
   String[] parts = logEntry.split(" ");
   Map<String, String> entry = new LinkedHashMap<>();

   // Combined LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
   entry.put("ip", parts[0]);
   // Cleanup dateTime String
   entry.put("dateTime", parts[3].replace("[", ""));
   // Cleanup extra quote from HTTP Status
   entry.put("httpStatus", parts[5].replace("\"", ""));
   entry.put("url", parts[6]);
   entry.put("responseCode", parts[8]);
   // Set size to 0 if not present
   entry.put("size", parts[9].replace("-", "0"));
   context.write(NullWritable.get(), WritableUtils.toWritable(entry));
}

We split the line into separate pieces, using the space character as a delimiter. Since we know that the first column in the log file represents an IP address, we know that parts[0] holds such an address, so we can prepare that part to be sent to Elasticsearch as the ip field. Similarly, we can send the rest of the columns from the log, but some of them need special processing beforehand. For example, when we split the input string, using the space character as a delimiter, the time field got split into two entries, since it contains a space between the seconds and the timezone (+0000 in our log). Because the index mapping we defined expects the format “dd/MMM/yyyy:HH:mm:ss” without a timezone, we keep only parts[3], strip its leading “[” character, and ignore the timezone in parts[4].

The EsOutputFormat.class ignores the “key” of the Mapper class output, hence in context.write() we set the key to NullWritable.get().

MapReduce Job Configuration

We need to tell our program where it can reach Elasticsearch and what index to write to. We do that with conf.set(“es.nodes”, “localhost:9200”); and conf.set(“es.resource”, “logs”);.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;


public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
   Configuration conf = new Configuration();
   conf.setBoolean("mapred.map.tasks.speculative.execution", false);
   conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
   conf.set("es.nodes", "localhost:9200");
   conf.set("es.resource", "logs");

   Job job = Job.getInstance(conf);
   job.setInputFormatClass(TextInputFormat.class);
   job.setOutputFormatClass(EsOutputFormat.class);
   job.setMapperClass(AccessLogMapper.class);
   job.setNumReduceTasks(0);

   FileInputFormat.addInputPath(job, new Path(args[0]));
   System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Under normal circumstances, speculative execution in Hadoop can sometimes optimize jobs. But, in this case, since output is sent to Elasticsearch, it might accidentally cause duplicate entries or other issues. That’s why it’s recommended to disable speculative execution for such scenarios. You can read more about this, here: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration-runtime.html#_speculative_execution. These lines disable the feature:

conf.setBoolean(“mapred.map.tasks.speculative.execution”, false);
conf.setBoolean(“mapred.reduce.tasks.speculative.execution”, false);

Since the MapReduce job will essentially read a text file in this case, we use the TextInputFormat class for our input: job.setInputFormatClass(TextInputFormat.class);

And, since we want to write to an Elasticsearch index, we use the EsOutputFormat class for our output: job.setOutputFormatClass(EsOutputFormat.class);

Next, we set the Mapper class we want to use, to the one we created in this exercise: job.setMapperClass(AccessLogMapper.class);

And, finally, since we do not require a reducer, we set the number of reduce tasks to zero: job.setNumReduceTasks(0);

Building the JAR File

Once all the code is in place, we have to build an executable JAR. For this, we use the maven-shade-plugin, so we would add the following to “pom.xml“.

< build >
   < plugins >
       < plugin >
           < groupId >org.apache.maven.plugins< /groupId >
           < artifactId >maven-shade-plugin< /artifactId >
           < version >3.2.4< /version >
           < executions >
               < execution >
                   < phase >package< /phase >
                   < goals >
                       < goal >shade< /goal >
                   < /goals >
                   < configuration >
                       < transformers >
                           < transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer" >
                               < manifestEntries >
                                   < Main-Class >com.coralogix.AccessLogIndexIngestion< /Main-Class >
                                   < Build-Number >123< /Build-Number >
                               < /manifestEntries >
                           < /transformer >
                       < /transformers >
                   < /configuration >
               < /execution >
           < /executions >
       < /plugin >
   < /plugins >
< /build >  

Let’s pull in the finished project from GitHub. In case you don’t already have git installed on your machine, first install it with:

sudo apt update && sudo apt install git

Next, let’s download our Java project.

git clone https://github.com/coralogix-resources/elasticsearch-with-hadoop-mr-lesson.git

Let’s enter into the directory of this project.

cd elasticsearch-with-hadoop-mr-lesson/

We’ll need to install maven.

sudo apt install maven

And, finally, we can build our JAR file.

mvn clean package

We’ll see a lot of output and files being pulled in, and, when the process is done, we should see a “BUILD SUCCESS” message.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:23 min
[INFO] Finished at: 2020-07-25T22:11:41+03:00
[INFO] ------------------------------------------------------------------------

Running the MapReduce Job

Let’s download the Apache access log file that will represent the data we want to process with Hadoop.

wget https://raw.githubusercontent.com/linuxacademy/content-elastic-log-samples/master/access.log

Now let’s copy the JAR file we compiled earlier, to the same location where our access log is located (include the last dot “.” in this command, as that tells the copy command that “destination is current location”).

cp target/eswithmr-1.0-SNAPSHOT.jar .

Finally, we can execute the MapReduce job.

hadoop jar eswithmr-1.0-SNAPSHOT.jar access.log

When the job is done, the last part of the output should look similar to this:

File System Counters
		FILE: Number of bytes read=2370975
		FILE: Number of bytes written=519089
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=10000
		Map output records=10000
		Input split bytes=129
		Spilled Records=0
		Failed Shuffles=0
		Merged Map outputs=0
		GC time elapsed (ms)=33
		Total committed heap usage (bytes)=108003328
	File Input Format Counters 
		Bytes Read=2370789
	File Output Format Counters 
		Bytes Written=0
	Elasticsearch Hadoop Counters
		Bulk Retries=0
		Bulk Retries Total Time(ms)=0
		Bulk Total=10
		Bulk Total Time(ms)=1905
		Bytes Accepted=1656164
		Bytes Received=40000
		Bytes Retried=0
		Bytes Sent=1656164
		Documents Accepted=10000
		Documents Received=0
		Documents Retried=0
		Documents Sent=10000
		Network Retries=0
		Network Total Time(ms)=2225
		Node Retries=0
		Scroll Total=0
		Scroll Total Time(ms)=0

We should pay close attention to the Map-Reduce Framework section. In this case, we can see everything went according to plan: we had 10,000 input records and we got 10,000 output records.

To verify the records are indexed into Elasticsearch, let’s run the following command:

curl 'localhost:9200/_cat/indices?v'

We should see a docs.count matching the number of records, 10,000 in this case.

health status index uuid               	pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logs  WEPWCieYQXuIRp2LlZ_QIA   1   1  	10000        	0  	1.1mb      	1.1mb

Visualizing Hadoop Data with Kibana

Creating an Index Pattern in Kibana

In a web browser, let’s open up this address:

http://localhost:5601/app/kibana#/management/kibana/index_pattern?_g=()

We’ll create a new Index Pattern named “logs*”.

After clicking on “Next step“, from the drop-down list titled “Time Filter field name” we choose “dateTime” and then click on “Create index pattern“.

We’ll land on a screen like this:

Visualize Data in Kibana

In the left side menu, let’s navigate to the Discover page.

Now let’s set the time range from 16th of May to 21st of May 2015 and then click the “Update” button.

The visualized data should look like this:

From the “Available fields” section on the left, highlight “httpStatus”, “url” and “size“, and hit the “Add” button that appears next to them. Now we only see the metrics we’re interested in and get much cleaner output.

Filtering Data in Kibana

Since we have set the “size” property of the index to be of type integer, we can run filters based on the size. Let’s view all requests which returned data larger than 5MB.

In the Search box above, type

size >= 5000000

and press ENTER.

Above the bar chart, we can click on the drop-down list displaying “Auto” and change that time interval to “Hourly“. Now each bar displayed represents data collected in one hour.

Clean Up Steps

Let’s remove the index we have created in this lesson:

curl -XDELETE 'localhost:9200/logs'

In the terminal window where we are still logged in as the “hadoop” user, we can also remove the files we created, such as the JAR file, the Java code, access log, and so on. Of course, if you want to keep them and continue experimenting, you can skip the next command.

To remove all the files, we run:

cd && rm -rf elasticsearch-with-hadoop-mr-lesson/

And, finally, we remove the Hadoop installation archive:

rm hadoop-3.2.1.tar.gz

Conclusion

These are the basic concepts behind writing, compiling and executing MapReduce jobs with Hadoop. Although setting up a multi-node cluster is a much more complex operation, the concepts behind creating a MapReduce algorithm and running it, in parallel, on all the computers in the cluster, instead of a single machine, remain almost the same.

Source Code

package com.coralogix;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;
import org.elasticsearch.hadoop.util.WritableUtils;

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class AccessLogIndexIngestion {

   // Raw (non-generic) Mapper: input key/value are handled as plain Objects,
   // and each output record is a (NullWritable, Writable) pair holding the parsed log entry
   public static class AccessLogMapper extends Mapper {
       @Override
       protected void map(Object key, Object value, Context context) throws IOException, InterruptedException {

           String logEntry = value.toString();
           // Split on space
           String[] parts = logEntry.split(" ");
           Map<String, String> entry = new LinkedHashMap<>();

           // Combined LogFormat "%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"" combined
           entry.put("ip", parts[0]);
           // Cleanup dateTime String
           entry.put("dateTime", parts[3].replace("[", ""));
           // Cleanup extra quote from HTTP Status
           entry.put("httpStatus", parts[5].replace(""",  ""));
           entry.put("url", parts[6]);
           entry.put("responseCode", parts[8]);
           // Set size to 0 if not present
           entry.put("size", parts[9].replace("-", "0"));

           context.write(NullWritable.get(), WritableUtils.toWritable(entry));
       }
   }

   public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
       Configuration conf = new Configuration();
       // Disable speculative execution so duplicate tasks cannot index documents twice
       conf.setBoolean("mapred.map.tasks.speculative.execution", false);
       conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
       // Point the es-hadoop connector at the local Elasticsearch node and the "logs" index
       conf.set("es.nodes", "localhost:9200");
       conf.set("es.resource", "logs");

       Job job = Job.getInstance(conf);
       job.setInputFormatClass(TextInputFormat.class);    // read the access log as plain text
       job.setOutputFormatClass(EsOutputFormat.class);    // send mapper output to Elasticsearch
       job.setMapperClass(AccessLogMapper.class);
       job.setNumReduceTasks(0);                          // map-only job, no reducer

       // The first command-line argument (access.log) is the input path
       FileInputFormat.addInputPath(job, new Path(args[0]));
       System.exit(job.waitForCompletion(true) ? 0 : 1);
   }

}