A Brief Introduction to Elasticsearch

How did Elasticsearch come into existence?

To understand Elasticsearch, we have to go back to 1999, when Lucene came out. Doug Cutting originally wrote it, and it was available on SourceForge at the time. Lucene joined the Apache Software Foundation in 2001 and became a top-level project in February 2005. Lucene also spawned well-known projects such as Nutch and Mahout, work you may now know through Hadoop (HDFS) and Apache Mahout. It was the nature of Lucene that helped search engines index the data they were ingesting from the internet and provided reasonable ways of retrieving information based on fuzzy matching. That means if you searched for something, e.g. funny cat videos, it would return indexed documents or websites containing information related to your search. The real innovation was how the search engine could extract the text from almost any type of content and let you retrieve it without knowing the exact terms. A few years later, Shay Banon created Compass, which was built on top of Lucene and provided essentially the same services but in a more scalable manner. The idea was to provide a distributed search solution that used common web transfer protocols and document formats. This is where Elasticsearch was born.

Elasticsearch is a distributed, RESTful search and analytics engine that helps with all kinds of use cases in today's technology landscape. For its data format, Elasticsearch uses JSON, and for its interface it uses HTTP; both are incredibly common on the web. Elasticsearch is developed in Java and is open source under the Apache license. Clients are available in Java, .NET, Python, and many other languages. Elasticsearch is now the most popular enterprise search engine in the world, and because of its incredible ability to scan documents and find information, it has become useful for data scientists and analysts as well.

The company Elastic provides the open source ELK stack, which is Elasticsearch, Logstash, and Kibana, as well as some additional add-on components that do require licensing. The ELK stack itself is open source, but the other components they offer, such as security, monitoring, and alerting, are paid and known collectively as X-Pack. They also offer hosting, training, and consulting, which are typically grouped under professional services. This is a common model for many successful open source companies, similar to what Cloudera offers with Hadoop or what Teradata does for Presto.

 

The Elastic Stack

At the bottom of the stack, you have two components focused on getting data into your cluster. The first is Logstash, which has been around for a while, is great at ingesting log data, and has evolved into a full-fledged ETL (extract, transform, and load) platform.

If you're running Elasticsearch, you will most likely want Logstash as a good way to ingest data. The other platform for getting data into your cluster is Beats. This component helps you ingest data in real time by watching for transactions occurring in a database or new data being written to a file you're following. It is a lightweight ingestion engine that keeps your cluster fresh with near real-time data. Sitting on top of both of these is Elasticsearch. A minimal Logstash example is sketched just below.
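As a rough illustration of the ingestion side, here is a minimal sketch of running Logstash with an inline pipeline that reads lines from standard input and writes them to a local Elasticsearch node. The index name demo-logs and the directory name are assumptions made for this example, not part of the article:

# cd logstash-6.5.4

# ./bin/logstash -e 'input { stdin { } } output { elasticsearch { hosts => ["localhost:9200"] index => "demo-logs" } stdout { } }'

Anything you then type into the terminal is parsed as an event, printed back to the console, and indexed into the demo-logs index on the local cluster.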

The focus of Elasticsearch is to ingest the data it is given, create indexes of those documents, and distribute them across all the nodes in the cluster. Elasticsearch provides an HTTP interface to search the documents and has many built-in algorithms to score them, allowing intense customisation of your search results. When it comes to using Elasticsearch as an analytics platform or doing more advanced tuning, it may make sense to add Kibana to your platform. Kibana offers another layer that you can build your applications on top of.
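To make the HTTP interface concrete, here is a hedged sketch of indexing a single document and then searching for it with curl against a local node on port 9200. The customer index, the _doc type, and the field values are hypothetical examples, not taken from the article:

# curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/customer/_doc/1?pretty' -d '{ "name": "Jane Doe" }'

# curl 'localhost:9200/customer/_search?q=name:jane&pretty'

The first call stores a JSON document under ID 1 (creating the index on the fly), and the second runs a simple query-string search over HTTP and returns the matching document with its relevance score.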

In this course we will use Kibana because it provides a nice web user interface and we can run all of the code through its console. Outside of these components is X-Pack (Extension Pack), which is paid. X-Pack has many components that you may want if you're running this in an enterprise environment: security integration (for example with Kerberos), alerting on data-driven events, monitoring of your cluster, reporting of results, and a graph engine to help with certain types of search. All of this is also available in Elastic Cloud, which is their hosted option. We'll take a look at that later so you can get a sense of how it looks.

The four key advantages of the Elastic Stack:

  1. Scalability – You can scale your Elasticsearch cluster out to handle virtually any amount of data.
  2. Near Real Time – It offers near real-time results with the Logstash and Beats ingestion tools, which helps you detect anomalies and identify fraud as it is happening.
  3. Schemaless – This is a NoSQL platform, which means you don't need a schema when you ingest the data. You can figure out the schema later, and this really helps with ingestion speed because you don't have to define the structure upfront.
  4. Advanced Query Language – There is also an advanced query language, the Query DSL, built into Elasticsearch (a quick sketch of these last two points follows this list).
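As a small sketch of both the schemaless and query-language points: you can index a JSON document without defining any mapping first, ask Elasticsearch for the mapping it inferred, and then search with the JSON Query DSL. The products index and the field names below are made up for illustration:

# curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/products/_doc?pretty' -d '{ "title": "funny cat video", "views": 1200 }'

# curl 'localhost:9200/products/_mapping?pretty'

# curl -H 'Content-Type: application/json' -XPOST 'localhost:9200/products/_search?pretty' -d '{ "query": { "match": { "title": "cat" } } }'

Elasticsearch infers the field types from the document itself, and the new document becomes searchable after the near real-time refresh, which happens roughly every second by default.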

 

The most common use cases people use Elasticsearch for:

A big one is security and log analytics. It's incredibly common to have Logstash pulling in your web server logs or security logs from your firewall and throwing them into an index, where you have dashboards, alerting, and all kinds of things you may want to look at to see what's going on with your traffic, whether someone is trying to log in to your website where they shouldn't be, or whether someone is launching an attack on your network. In any event, it's a great tool for analysing these things because it works in near real time and can provide alerting and anomaly detection.

Another use case, which may not be totally obvious, is marketing. Following along with the idea that we have web logs in Elasticsearch that we can run queries against, we can use this data to find things like drill paths, how people found our website and where they came from, even what device they're using or what part of the world they're in. So the marketing team can gain a lot of insight about their efforts and their reach simply by looking at the data from our web logs.

Elasticsearch also helps with operational needs. If you're monitoring the server health of your cluster, or maybe you have a web app and want to check on its response time, you can pull in all that data since it is already being generated naturally by the system, and then build fully automated workflows on top of it to help you identify these things. It doesn't have to be a web-based company or a technology focus, either: this also applies to manufacturing. In a manufacturing plant you have tons of machines working all day, and those machines can generate information that helps you understand how your line is running. So while we typically think of these things as tech-focused, they can also be used in many other physical settings. And of course, there's search. Elasticsearch was built with the idea of providing a great search engine, and today it is the most widely used one in the enterprise. One of the key points of focus here is how easily the platform can parse search queries and retrieve relevant results based on any type of data you might be looking for.

 

General understanding of how all the pieces fit together:

First we have our cluster, which is a collection of nodes. A cluster has a unique name, with a default of elasticsearch. It's totally okay to have a cluster with a single node. A node is the part of your cluster that stores the data; it provides the search and index capabilities and has its own unique name. Nodes contain indexes, and an index is a collection of similar documents, such as customer data or product information. Index names must be all lower case.

You can have as many indexes as you want. When you're doing almost anything in Elasticsearch, you're going to be referencing an index, so it's important to have a consistent naming pattern for all of them in your cluster. Inside an index you have a type, which is a category or partition of your index. Historically you could have multiple types within a single index; for example, one index might be for orders but hold separate types for product information and shipping information (recent versions of Elasticsearch restrict new indexes to a single type). At the base unit you have a document, which might represent a single customer, order, or event, e.g. on your website. Documents are in JSON format and physically reside in your index.

For an index to be scalable it has to be distributed, and it does this using shards and replicas. A shard is a portion of an index, and a replica is a copy of a shard. By its nature, a replica can never be located on the same node as the primary shard it backs up. The default when creating an index is five shards and one replica, which amounts to five primary shards and five replica shards distributed across two different nodes.

If we visualise this, we start with our cluster, and inside the cluster we have multiple nodes. Inside a node we have our index, which is our primary index, and inside that we have our types: a customer type and an order type. From there we have our documents, which actually hold the data. We also have a shard, a partition of our index, which contains the same types and documents. Additionally, we have our replicas, which are backup copies in case we have any issues with the primary shards of our index. A concrete example of creating an index with explicit shard and replica settings is shown below.
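To connect these concepts to the API, here is a small sketch that creates a hypothetical orders index with explicit shard and replica settings and then lists its shards. The index name and settings are illustrative only:

# curl -H 'Content-Type: application/json' -XPUT 'localhost:9200/orders?pretty' -d '{ "settings": { "number_of_shards": 5, "number_of_replicas": 1 } }'

# curl 'localhost:9200/_cat/shards/orders?v'

On a single-node cluster the five replica shards will show up as unassigned, because a replica cannot live on the same node as its primary shard.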

 

Installing Elasticsearch and Kibana Locally:

Go to https://www.elastic.co/products and download the products first. Click on the links for the products and download Elasticsearch and Kibana.

You can see that there are versions for Windows and Linux/Mac, so download the Linux or Windows version as per your requirement.

 

Installing Elasticsearch on Linux:

1. Download it and run the commands below to unzip and start Elasticsearch.

# tar xvzf elasticsearch-6.5.4.tar.gz

# cd elasticsearch-6.5.4

# cd bin

# ./elasticsearch

Note: Make sure Java 8 or above is installed and that you have set the JAVA_HOME environment variable. Also, Elasticsearch will not run as the root user, so make sure to switch to a non-root user before starting it.
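For reference, here is a quick sketch of checking the Java version and creating a dedicated non-root user; the user name elastic is just an example and assumes the archive was extracted into the current directory:

# java -version

# useradd -m elastic

# chown -R elastic:elastic elasticsearch-6.5.4

# su elastic

The java -version output should report version 1.8 or higher before you start Elasticsearch as the new user.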

2. Once it's started, run the command "curl http://localhost:9200/" in your CLI and it should give you output like the one below:

{
  "name" : "Z4wHOZg",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "voC1ciP2R2KrLvlZpEX7vQ",
  "version" : {
    "number" : "6.5.4",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "d2ef93d",
    "build_date" : "2018-12-17T21:17:40.758843Z",
    "build_snapshot" : false,
    "lucene_version" : "7.5.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

That means it is set up, and you can check the same in your browser by going to http://localhost:9200.

 

Below are the steps for Kibana:

1. Download it and run the commands below to unzip and start Kibana.

# tar xvzf kibana-6.5.4-linux-x86_64.tar.gz

# cd kibana-6.5.4-linux-x86_64/

# cd bin

# ./kibana

2. While running, it’ll show you “Server running at http://localhost:5601” in the output.

3. Point your browser at http://localhost:5601 and you will see the Kibana dashboard.

 

Loading sample data into Elasticsearch:

# curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/bank/account/_bulk?pretty' --data-binary @accounts.json

This loads data into Elasticsearch using curl and the bulk API.

We're doing a POST to localhost:9200, passing in the index bank and the type account, then calling the _bulk API and asking for pretty-printed output. The --data-binary flag points at the file itself, indicated by @accounts.json. When we issue this command, Elasticsearch automatically creates the bank index and indexes every document in the file; we just send the data to the right endpoint and it does everything else for us.
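For context, the file passed to _bulk is newline-delimited JSON (NDJSON): each document is preceded by an action line. The two documents below are made up purely to illustrate the format and are not the actual contents of accounts.json:

{ "index" : { "_id" : "1" } }
{ "account_number" : 1, "balance" : 1000, "firstname" : "Test", "lastname" : "User" }
{ "index" : { "_id" : "2" } }
{ "account_number" : 2, "balance" : 2500, "firstname" : "Another", "lastname" : "User" }

Because the index (bank) and type (account) are already in the URL, the action lines only need an optional _id, and the file must end with a trailing newline.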

Switch over to Kibana to see what you actually have. In Kibana's Dev Tools console, type "GET /_cat/indices" and hit enter. You will see the new bank index along with all the other indices. You can also type "GET bank" to see its settings and mappings.
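From the same console you can also run a first search against the loaded data. This is a minimal sketch using a match_all query, which does not depend on the exact fields inside accounts.json:

GET bank/_search
{
  "query": { "match_all": {} },
  "size": 3
}

The response shows the total number of hits in the bank index along with the first three documents.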

Now you can go to the "Management" tab and create a new index pattern there. Just click on "Create Index Pattern", enter the index name (bank in this case), then click Next and Create. It will create a new index pattern for you and automatically parse the data you just loaded, showing you the data type of each field and whether it is aggregatable, analyzed, etc.

That's it for this article, hope you enjoyed it. Please share it if you think it's good.
