Use Postgres with Elastic using a bridge

Deepjyoti Barman @deepjyoti30
Aug 20, 2021 2:51 PM UTC

For the past few days, I had been looking into Elasticsearch to set it up with my Postgres database. I just wanted a way to search my database of thousands of records in the blink of an eye, and what better than Elasticsearch for that?

The problem

After I read a bit about setting up Elasticsearch, I figured out the first and only problem in my way. Elasticsearch only works with NoSQL data. But, yeah, you guessed it, Postgres is an SQL database.

I figured there would be a simple fix for that, but it's actually a bit more complex than that. So in my case, I had an SQL database that I wanted to connect to Elasticsearch, which works with NoSQL entities only.

This means I needed some kind of bridge in between that would convert the contents of the SQL database (Postgres) to NoSQL so that Elasticsearch could understand and index it. That would actually solve the issue.

We need a bridge between Postgres and Elasticsearch that converts the SQL data into NoSQL data (JSON) and passes it to Elasticsearch.

Possible Solutions

ZomboDB

One of the earliest things I tried was a native solution that wouldn't actually be a bridge between Elasticsearch and Postgres. I came across ZomboDB, which is a Postgres extension. I tried setting it up, but it had its own issues.

It comes with a prebuilt executable only if you sponsor the developer, which (BTW, I am not against it) is a big leap in itself for something I don't even know will work. They do provide a free version, but the catch is that it has to be built by the user before it can be used. Now, I am not against building something on my own (I am not a Gentoo user either, though!), but I didn't really see how I could use my server with 1GB of RAM to build a tool that required a hefty number of toolchains to build properly.

After taking a look at the toolchains that needed to be installed in order to build it, I gave up on the idea of going with ZomboDB altogether.

Logstash

After a bit of Google searching on connecting Elasticsearch to Postgres, one result showed up a lot: Logstash. I looked into Logstash, and it turns out it's a data processing pipeline. What this means is that Logstash would act as the bridge I mentioned above. It would run every few minutes, convert the data from Postgres to NoSQL, and feed it to Elasticsearch.

Elasticsearch, in turn, will index the data, and the end user will be able to search it in a nice way without a lot of response overhead. At the time, I thought setting up Logstash would require a whole different server instance, since it would run periodically and do the conversion (it will use resources for that, duhh), so I gave up on the idea of using it.

One more reason I didn't go with it was that I was under the impression it wasn't free to use.

Kafka

My next stop in finding a solution was GitHub. I had this idea that there would be some kind of repository on GitHub that would serve as a template for a Postgres and Elasticsearch stack, along with a bridge, all in a nicely done, one-click setup (read: Docker Compose).

I did stumble across a few repos. Some of them were way too complex, not properly tested, and required a lot of time (which I didn't have). After a while, I came across sync-postgresql-with-elasticsearch-example.

This repo actually checked a lot of boxes:

  • It had a compose file
  • It had a decent number of stars and contributors
  • The README was straightforward and the setup was easy

Here I came across Kafka, which does what Logstash does. The advantage here, for me (and possibly for Kafka), was that I could set it all up using the Compose file that other developers had tested and refined over the years.

Do give the repo a star though, it's a nicely done repo!

The Solution

I won't repeat what the README of the repo already explains. The repo is simple enough. The stack uses Debezium and Kafka along with an Elasticsearch connector to act as the bridge.

  1. Debezium reads the SQL data from Postgres and passes it to Kafka.
  2. Kafka accepts the data passed by Debezium and passes it on to the Elastic Connector.
  3. Elasticsearch gets the SQL data in a proper NoSQL form (JSON) from the Elasticsearch connector and indexes it.

The following diagram explains the setup. (It is taken from the README of the repo above, all credit to them.)

                   +-------------+
                   |             |
                   |  PostgreSQL |
                   |             |
                   +------+------+
                          |
                          |
                          |
          +---------------v------------------+
          |                                  |
          |           Kafka Connect          |
          |    (Debezium, ES connectors)     |
          |                                  |
          +---------------+------------------+
                          |
                          |
                          |
                          |
                  +-------v--------+
                  |                |
                  | Elasticsearch  |
                  |                |
                  +----------------+
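To make the steps concrete: a row in a hypothetical users table in Postgres would reach Elasticsearch as a JSON document along these lines (the table and field names here are assumptions for illustration, not taken from the repo):

```json
{
  "id": 42,
  "name": "Jane Doe",
  "email": "jane@example.com"
}
```

Elasticsearch then indexes each such document, which is what makes the fast searches possible.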


Gotchas in the repo

While the code in the repo can be used as is with a few modifications (the Postgres password, etc.), a few things might need to be changed:

Use the latest Docker images

The Docker images used in the repo's Compose file are pinned to specific versions. This is obviously important for stability; however, in my case, I was getting an error from Kafka Connect that the message (from Debezium) wasn't understood. I changed the image tags to latest and the setup started working fine.

I would suggest taking a look at the latest available versions of the images and using those, or latest if you're cool with that.
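For reference, the change amounts to editing the image tags in the Compose file. A minimal sketch (the service name and original pinned tag here are illustrative, not copied from the repo):

```yaml
# Hypothetical excerpt of the repo's docker-compose.yml
services:
  connect:
    # was pinned to a specific version, e.g. debezium/connect:1.0
    image: debezium/connect:latest
```

The same tag change applies to the other images in the stack.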

Add some extra options to the config files

The config files present in the repo work nicely if the Postgres setup is a container as well. However, in most cases, the database will probably live on some remote server. In that case, Kafka Connect will start throwing an error regarding UTC time. This happens because the time on the server uses UTC and no specific timezone is set up.

In order to get rid of that, add the following line to the es-sink-*.json file inside the config object:

"timezone": "UTC"

Also, Kafka Connect should use pgoutput as the plugin name by default, but if the server is remote, it defaults to something else (I don't remember the value now), which will result in an error as well. In order to get rid of that, add the following line to the source.json file in the config object:

"plugin.name": "pgoutput"

Test your remote Postgres connection

Before running the Compose file, test the Postgres connection to your server in order to check whether your database is accepting incoming requests from your machine.

Run the following command from shell:

psql 'postgresql://<server_ip>:5432/<db_name>?user=<db_user>&password=<db_password>'

You need to have the Postgres client installed to use psql.

If the above connection goes well, you'll see a Postgres shell in your console. This means you're good to go with the setup.

If you see an error from the above command, it might be because your server is not accepting requests on port 5432. Look up how to open ports, and then open port 5432 for your machine's IP. For other errors, Stack Overflow always helps, as do Stack Exchange and other similar forums.
