How to use Elasticsearch with SQL Server
- select the contributor at the end of the page -
If you administer an SQL Server database but you'd like to expose all the data you've collected in more interesting and effective ways, you're in the right place.
Think about it like this: Adding Elasticsearch as a secondary index to your primary SQL Server data store is like adding Google or Bing to your application. Using Elasticsearch to index data from another data store lets you to add new features to your application including suggestive “more like this” searching, scoring of search results, fast aggregations and statistics, geo distance filtering and more. Oh, and get this: Its' free. So, let's take a look at some of the useful things Elasticsearch can do that SQL Server can't.
Syncing data between SQL Server and Elasticsearch
Of the many plugins available for Elasticsearch, one of the most useful is the river. An Elasticsearch river targets another primary data store and streams any additions or changes made into its own index. You can stream data from MongoDB, CouchDB, an SQL-based database, or even directly from Twitter! Any additions or changes made to the document or data row on the primary store will automatically be updated in the Elasticsearch index. Since the primary data store information is never altered during the sync, this affords you plenty of flexibility with different ways to setup your Elasticsearch index. You can always easily rebuild the index with new parameters or data types by simply changing your mappings and recreating your river! So, let's install the Elasticsearch river plugin and configure a river to work with SQL Server.
By the way, if you're new to Elasticsearch or need some help installing it, you're in good company. You can check out my Getting Started with Elasticsearch for .NET Developers course here on Pluralsight for more information on getting started.
Our SQL Server data
I've created a basic database in SQL Server for illustrative purposes. It has one table called “People,” with a few rows of data about fictional people.
Installing the _river plugin on Elasticsearch
The first thing we'll need to do is install the Elasticsearch river plugin that works with JDBC. Elasticsearch is written in Java, so to access SQL Server we'll need to go through a JDBC driver. The Elasticsearch JDBC river plugin is maintained here, but can be installed through the normal Elasticsearch plugin script.
- From the bin folder of your Elasticsearch installation, run
plugin --install jdbc --url.
- Next, since we're trying to communicate with SQL Server, we'll need the appropriate vendor JDBC drivers. You can download Microsoft's latest drivers thta work with SQL Server here.
- Unzip the JDBC drivers and find the .jar files in the sqljdbc_4.x/enu folder.
- Copy the sqljdbc4.jar and sqljdbc41.jar files to your plugins/jdbc folder in the Elasticsearch installation path. The folder should have already been created by the river installation. Do NOT copy the sqljdbc.jar file. This is for an older version of the JVM that you're probably not running, and will cause the river plugin to fail.
- Now, restart your Elasticsearch node or cluster.
Configuring the _river
Now that we've installed the river plugin and appropriate JDBC drivers, we're ready to instantiate a new river to stream data. You'll notice that we haven't created an index mapping or any templates in Elasticsearch. We're going to let Elasticsearch choose the data types for us when it creates the index. This will allow for the most flexibility, and for the purposes of this demo, will be the quickest to get us started. Alternatively, you could pre-configure your index mappings or use a template to dictate index mapping data types.
To create the new river, we'll need to send an http request to Elasticsearch. For illustrative purposes, I'll use the “Postman” plugin in Google Chrome, which is an excellent http client. The http request we want to send will be an http POST with a json payload that will contain all of our river parameters including where to find the primary data store, how to login to the database and how to select the data from SQL Server.
"schedule" : "0 0-59 0-23 ? * *",
"sql":"select ID as _id, FirstName, LastName, DateOfBirth from People",
There are more options available to the JDBC river but these will get you started. Now, let's talk about what each of these means:
- The url is pointing to localhost, you should change this to whatever you're using for your Elasticsearch node url. Port 9200 is the standard Elasticsearch listening port.
- The _river/ segment of the url refers to the river plugin. my_jdbc_river is the type name of the river we want to create. You should choose a name that pertains to your operation. Finally, the _meta segment lets the plugin know that this is a configuration document meant as parameters for the river.
- In the JSON payload, we have a number of interesting pieces of data:
- type: The kind of river we're instantiating in Elasticsearch; in this case, a JDBC driver.
- schedule: Since SQL Server has no form of push mechanism and the river plugin can't read the transaction log, our plugin will poll SQL Server periodically for changes in the data. The schedule in this example will cause our river to poll Elasticsearch once per minute, every hour of every day. You can modify this to your needs, according to the readme at the JDBC _river Github repository.
- jdbc: This section is fairly self explanatory. The “driver” line tells the river plugin to use the SQL Server jdbc driver we downloaded from Microsoft. The “url” is the connection string to your SQL Server. “User” and “password” are the login you've created on your SQL Server database for our river to use. The login should have select privileges.
- The “sql” parameter bears some explanation. This is the SQL that the river will use to select data from your database. As you may imagine, you can select as few or as many columns as you like to add to your index. I've created a select statement with an alias that tells SQL Server to return the primary key field “ID” as “_id”. This is the default key convention that Elasticsearch uses for all documents. It's important to keep this nomenclature when selecting your data so that Elasticsearch knows to update a document and not create a new one each time it polls.
When I push “Send” in Postman, the http request will be made to Elasticsearch and the _river will be created. As it is instantiated, it will automatically create the index that it is targeting and will begin to poll our SQL Server database.
Below, you can see that our _river index in the overview and the people index it generated for us. By browsing this data, I can see that our _river is successfully pulling documents over to Elasticsearch. If I modify any of the data in SQL Server, the updated data will appear in our Elasticsearch index almost instantly.
There are many considerations and interesting options when using Elasticsearch as a secondary search index. Depending on how often your data changes, you may not need to poll SQL Server for changes very often. You may also want to play around with different analyzers and recreate your index to find the best ones suited for your data. Finally, if you'd like to institute a strict schema for your data, you can explore using templates, so that each time you create a new index, the fields are created using predictable types. All of these options can help make your Elasticsearch index run more effectively and will make it easier to integrate into your environment.