Logstash CSV: Import & Parse Your Data [Hands-on Examples]
August 12, 2020
The CSV file format is widely used across the business and engineering world as a common format for data exchange. Its basic concepts are fairly simple, but unlike JSON, which is more standardized, you’re likely to encounter various flavors of CSV data. This lesson shows how to import and parse CSV data with Logstash before it is indexed into Elasticsearch.
What is a CSV?
CSV, short for Comma-Separated Values, is a popular file format that stores tabular data and is used by spreadsheet software like Microsoft Excel and Google Sheets.
Here’s a quick rundown of the general rules for the CSV format:
The delimiter sets the boundaries between the items in each row. The name CSV stands for comma-separated values because these files mostly use commas as the delimiter.
CSV organizes data so that every row has the same number of items, even if some of those are empty. This enables a program to reliably parse each item or quickly search across an entire column, instead of row-by-row.
Each row in the file is separated by a newline character.
Column headers, which describe each column’s data, will typically appear as the first row in the CSV file, but it’s an optional row. All other rows that follow include the actual data.
Some parts like the header rows are optional, and while many CSVs use commas to separate the items contained in the file, in reality, this separator can be any character, so the same parsing configuration may not work for all CSV files.
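To see why one parsing configuration may not fit every file, here is a minimal Python sketch (outside of Logstash) that parses the same data with two different delimiters. The helper name parse_rows and the sample values are made up for illustration.

```python
import csv
import io

def parse_rows(text, delimiter=","):
    """Split CSV text into rows of items using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

# The same (made-up) data, once comma-separated and once semicolon-separated.
comma_rows = parse_rows("name,age\nJohn,30\n")
semicolon_rows = parse_rows("name;age\nJohn;30\n", delimiter=";")

# Parsing semicolon-separated text with the default comma delimiter
# leaves every row as a single, unsplit item.
wrong_rows = parse_rows("name;age\nJohn;30\n")

print(comma_rows)
print(semicolon_rows)
print(wrong_rows)
```

The last call shows the failure mode: with the wrong delimiter nothing raises an error, the rows are simply not split.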
Let’s take a look at some sample CSV data:
name,age,gender,country
The first row here is the “header row” and defines the names of the fields. The following rows display the actual values of those header row fields.
A program that processes this CSV data will associate the first field of “name” with the first value on every subsequent row after the header. In other words, John, Basil and Bella would all be associated with the “name” field.
This process of interpreting data, extracting values and converting them in the process to another format, is usually called parsing.
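The association between header fields and row values can be sketched in a few lines of Python. This is only an illustration of the idea, not what Logstash runs internally. The header and the three names come from the text above; every other value in the sample is invented.

```python
import csv
import io

# Hypothetical sample: only the header and the three names come from the text;
# the remaining values are made up for illustration.
SAMPLE = """name,age,gender,country
John,34,male,Canada
Basil,28,male,Ireland
Bella,41,female,Spain
"""

def parse_with_header(text):
    """Return one dict per data row, keyed by the header row's field names."""
    return list(csv.DictReader(io.StringIO(text)))

records = parse_with_header(SAMPLE)

# The first value of every data row is associated with the "name" field.
names = [record["name"] for record in records]
print(names)
```

Note that every parsed value is still a string at this point, including the ages. Converting values to other types is a separate step.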
This will display the contents of the file:
id,timestamp,paymentType,name,gender,ip_address,purpose,country,age
1,2019-08-29T01:53:12Z,Amex,Giovanna Van der Linde,Female,18.104.22.168,Industrial,Philippines,55
5,2019-06-26T08:53:59Z,Visa,Sydney Garlett,Male,22.214.171.124,Computers,South Korea,25
Press the DOWN arrow key until you reach the last line, then delete any empty lines by pressing BACKSPACE. It’s important that there are no empty lines in this CSV file. Otherwise, Logstash would try to parse them, and the rows would still be indexed, but without any associated fields. In effect, the empty lines would create empty documents in Elasticsearch and needlessly increase the document count.
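Deleting empty lines by hand works for a small file. For larger files the same cleanup could be scripted. Here is a minimal Python sketch with a hypothetical helper, drop_blank_lines, applied to a cut-down sample that keeps only the id and age columns of the file above.

```python
def drop_blank_lines(text):
    """Return text with empty or whitespace-only lines removed."""
    kept = [line for line in text.splitlines() if line.strip()]
    # Re-join the remaining lines and end the last one with a single newline.
    return ("\n".join(kept) + "\n") if kept else ""

# Cut-down sample with blank lines in the middle and at the end.
raw = "id,age\n1,55\n\n5,25\n\n\n"
cleaned = drop_blank_lines(raw)
print(cleaned)
```

The final newline that is kept only terminates the last row, so it does not count as an empty line.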
Here, the “file” subsection indicates that a file will be used for the input. Within, we use these options:
path which indicates the full path to the input file (in our case, the CSV file we prepared earlier).
start_position is where we instruct Logstash to read the file from the beginning. This setting only applies to files that are read for the first time. By default, Logstash reads from the end, since it expects to process a file that has data appended to it periodically. This way, it imports only the new data whenever it is added to the end of the CSV file. If the file has been seen before, the next parameter decides what happens.
sincedb_path points to a database file that keeps track of how far Logstash has already read into an input file (in this scenario, the CSV file). The next time the input file is parsed, processing continues from the position recorded in the sincedb file. Here, we set this to “/dev/null” so that Logstash can’t record the position it last reached in that particular file. This way, it will process the entire file each time, from the beginning, rather than continue where it left off.
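The idea behind this bookkeeping can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not Logstash’s implementation: a plain dict stands in for the sincedb file, files that have never been seen are always read from the beginning, and the helper name read_new_lines is made up.

```python
import os
import tempfile

def read_new_lines(path, offsets):
    """Return lines added to `path` since the byte offset remembered in `offsets`.

    `offsets` is a plain dict standing in for the sincedb file (an assumption
    of this sketch, not how Logstash stores it).
    """
    start = offsets.get(path, 0)          # never seen before: start at byte 0
    with open(path, "rb") as handle:
        handle.seek(start)
        data = handle.read()
        offsets[path] = handle.tell()     # remember where we stopped
    return data.decode("utf-8").splitlines()

# Demo with a temporary file holding a header and one data row.
with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "sample.csv")
    with open(csv_path, "w", encoding="utf-8") as out:
        out.write("id,age\n1,55\n")

    remembered = {}
    first_run = read_new_lines(csv_path, remembered)    # reads both lines

    with open(csv_path, "a", encoding="utf-8") as out:
        out.write("5,25\n")
    second_run = read_new_lines(csv_path, remembered)   # reads only the new line

    # A fresh, empty dict plays the role of "/dev/null": nothing is remembered.
    forgetful_run = read_new_lines(csv_path, {})        # reads everything again
```

With the remembered offsets, the second run only returns the appended row. With nothing remembered, the whole file is read again, which is the behavior we want while experimenting with the same CSV file repeatedly.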
Now let’s break down the filter section of the configuration file.
We’re going to use “csv” as a subsection to specify that this is the type of file we intend to parse. Within that, we declare the following options:
separator where we pass the comma character “,”. This way, Logstash will know that commas are used as a delimiter to separate fields.
skip_header is important here, telling the parser to ignore the first line, which only indicates column names and not the actual data fields we intend to import.
columns where we specify the names of the columns that will be imported. In this case, we are telling our parser that the first field on each line is an ID, the second is a timestamp, and so on, until the last one which is an age.
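To make the effect of these three options concrete, here is a rough Python model of them, using the header and the two data rows shown earlier. It is a sketch, not the plugin’s code: the function parse_lines is hypothetical, and it assumes that skipping the header means dropping any row that is identical to the list of column names.

```python
import csv
import io

# Column names, in the order the fields appear on each line.
COLUMNS = ["id", "timestamp", "paymentType", "name", "gender",
           "ip_address", "purpose", "country", "age"]

CSV_TEXT = """id,timestamp,paymentType,name,gender,ip_address,purpose,country,age
1,2019-08-29T01:53:12Z,Amex,Giovanna Van der Linde,Female,18.104.22.168,Industrial,Philippines,55
5,2019-06-26T08:53:59Z,Visa,Sydney Garlett,Male,22.214.171.124,Computers,South Korea,25
"""

def parse_lines(text, columns, separator=",", skip_header=True):
    """Turn each line into a dict that maps column names to field values."""
    events = []
    for row in csv.reader(io.StringIO(text), delimiter=separator):
        if skip_header and row == columns:
            continue  # assumption: a row equal to the column names is the header
        events.append(dict(zip(columns, row)))
    return events

events = parse_lines(CSV_TEXT, COLUMNS)
for event in events:
    print(event["id"], event["name"], event["age"])
```

Without the header check, the first line would be turned into an event of its own, with “id” stored as the value of the id field, and so on.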
Now let’s explore the final section in our configuration file, the “output” section:
There’s a lot of useful information in this output. For example, it shows us the file that was used for the imported data, column names, field values, and so on.
Logstash Mutate Filter Plugin
So far, we’ve only played around with the basics of importing CSV files but we can already see that it’s pretty straightforward.
But that’s only because we haven’t been picky about the exact nature of the data. Other times though, we may need to be specific with what data we use and how. For example, we may need to skip importing some fields that are unnecessary, change the mapping of the selected fields, convert the data type of some values or process them in some other way, by changing all characters to uppercase, for example.
These kinds of scenarios can be handled by adding another entry, called mutate, to the filter section of our Logstash configuration.
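Before looking at the actual configuration, it may help to see these kinds of transformations as plain code. The following Python sketch mimics three of them on a single parsed row: dropping a field, converting a value to an integer and changing a value to uppercase. The function and its parameter names are invented for this illustration and are not the option names used by Logstash.

```python
def mutate(event, remove=(), to_integer=(), to_upper=()):
    """Return a transformed copy of `event`; the input dict is left unchanged."""
    # Skip importing the fields we do not need.
    result = {key: value for key, value in event.items() if key not in remove}
    # Convert the data type of selected values from string to integer.
    for field in to_integer:
        if field in result:
            result[field] = int(result[field])
    # Change all characters of selected values to uppercase.
    for field in to_upper:
        if field in result:
            result[field] = result[field].upper()
    return result

# A few fields from the second data row of our CSV file, as parsed strings.
event = {"name": "Sydney Garlett", "gender": "Male",
         "country": "South Korea", "age": "25"}

changed = mutate(event, remove=["gender"], to_integer=["age"], to_upper=["country"])
print(changed)
```

The important detail is the age field: after parsing it is the string “25”, and only after the conversion step does it become a number.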
To make the next part easier, you can download the configuration file which we are going to be working with. You can do so by typing in the following command:
A response such as the following shows us that, indeed, the age field is of the integer data type we specified. Notice that it’s called “long” here. In Elasticsearch, long is a 64-bit signed integer type, so it can hold much larger values than the smaller integer types.