Importing Legacy CSV Data into Elasticsearch


I use Salesforce at work quite a bit, and one of the things I find endlessly frustrating about it is the lack of good reporting functions.  Often I end up just dumping all the data I need into a CSV file and opening it up in Excel to build the reports I need.  Recently I was trying to run some analysis on cases and case events over time.  As usual, Salesforce “reports” (which are really little more than filtered lists with some limited predefined object joins) were falling well short of what I needed.  I’ve been playing around with Elasticsearch for some other purposes, such as graphing and analyzing air quality measurements taken over time, and I’ve also seen people on my team at Alfresco use Elasticsearch with Logstash for content analytics work.  Elasticsearch and Kibana lend themselves well to analyzing the kind of time-series data I was working with in Salesforce.

The Data

Salesforce reporting makes it simple to export your data as a CSV file.  It’s a bit “lowest common denominator”, but it will work for my purposes.  What I’m dumping into that CSV is a series of support-case-related events, such as state transitions, comments, etc.  What I want out of it is the ability to slice and dice that data in a number of ways to analyze how customers and support reps are interacting with each other.  How to get that CSV data into Elasticsearch?  Logstash.  Logstash has a csv filter plugin that makes pulling CSV data into an index simple (because of course it does).

The Configuration

Importing CSV data into Elasticsearch using Logstash is pretty straightforward.  To do this, we need a configuration file for Logstash that defines where the input data comes from, how to filter it, and where to send it.  The example below is a version of the config file that I used.  It assumes that the output will be an Elasticsearch instance running locally on port 9200, and it will stash the data in an index named “supportdata”.  It will also output the data to stdout for debugging purposes.  That’s not recommended for production if you have a huge volume, but in my case it’s handy to see.  The filter section contains the list of columns that will be imported, and the csv filter’s options give you some fine-grained control over this behavior.

input {
  file {
    # Read the exported Salesforce report from the start of the file
    path => ["/path/to/my/file.csv"]
    start_position => "beginning"
  }
}

filter {
  csv {
    # Names to assign to the parsed CSV columns, in order
    columns => [
      "EventDateTime",
      "User",
      "ElapsedTime",
      "CustomerName",
      "…",
      "…"
    ]
  }
}

output {
  # Index into the local Elasticsearch instance, and echo to stdout for debugging
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "supportdata"
  }
  stdout {}
}
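
With the config saved, running the import is a one-liner.  Here’s a rough sketch, assuming Logstash is unpacked locally and the config above is saved as supportdata.conf (the file name is just my own placeholder); the second command is an easy way to confirm that documents actually landed in the index:

# Run the pipeline with the config above (file name is a placeholder)
bin/logstash -f supportdata.conf

# Sanity check: ask Elasticsearch how many documents made it into the index
curl "http://localhost:9200/supportdata/_count?pretty"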

Debugging

No project goes perfectly right the first time, and this was no exception.  I use a Mac for work, and when I first tried to get Logstash to import the data it would run, but nothing would show up in the index.  I turned on debugging to see what was happening, and saw the following output:

[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.inputs.file ] each: file grew:/path/to/my/file.csv: old size 0, new size 4674844
[DEBUG][logstash.pipeline ] Pushing flush onto pipeline

This block just repeated over and over again.  So what’s going wrong?  Obviously Logstash can see the file and can read it.  It is properly picking up the fact that the file has changed, but it isn’t turning the CSV rows into events and moving them into Elasticsearch.  Turns out that the file input is sensitive to line ending characters: if it never sees a newline, it treats the file as one ever-growing, incomplete line and emits nothing.  Simply opening the CSV in TextWrangler and saving it with Unix line endings fixed the problem.
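
If you don’t have TextWrangler handy, the same conversion can be done from the terminal.  This is just a sketch assuming the export uses classic Mac (CR) or Windows (CRLF) line endings, and the file names are placeholders:

# Classic Mac endings: convert bare carriage returns into newlines
tr '\r' '\n' < file.csv > file-unix.csv

# Windows endings: strip the carriage returns, leaving just newlines
tr -d '\r' < file.csv > file-unix.csv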

Now that I can easily get my CSV formatted event data into Elasticsearch, the next step is to automate all of this so that I can just run my analysis without having to deal with manually exporting the report.  It looks like this is possible via the Salesforce Reports and Dashboards REST API.  I’m just getting my head around this particular Salesforce API, and at first glance it looks like there is a better way to do this than with CSV data.  I’m also looking into TreasureData as an option, since it appears to support pulling data from Salesforce and pushing it into Elasticsearch.  As that work progresses I’ll be sure to share whatever shakes out of it!
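
For reference, pulling report results straight from the Reports and Dashboards REST API looks roughly like the sketch below.  The instance URL, API version, report ID, and access token are all placeholders, and this reflects my first read of the docs rather than something I have wired up yet:

# Hypothetical example: fetch a report's results as JSON from the Reports and Dashboards REST API
curl -H "Authorization: Bearer $SF_ACCESS_TOKEN" \
  "https://yourInstance.salesforce.com/services/data/v41.0/analytics/reports/00OREPORTIDHERE"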

 
