mle chang

Elasticsearch is an open-source distributed document store and search engine, used by organizations like Wikipedia and Stack Overflow. It specializes in full-text search, which it makes fast with a data structure known as an inverted index.

What the heck is an inverted index, you ask?

Well, let's say we're storing every blog post in an 'index' called 'blogs'. Each blog post is a document (think of it as a row) in that index. Every time we store a blog post, Elasticsearch breaks its text into terms and records, for every unique term, which documents contain it and how often. That term-to-documents mapping is the inverted index.

For example, let's say we have a document with the words "I like yellow cake" and a second document with the words, "I hate yellow cake." (As a lover of yellow cake, it hurt me to type those words, but I did it for you, dear reader.) Our inverted index would look like this:

Term	Doc 1	Doc 2
I	X	X
cake	X	X
hate		X
like	X	
yellow	X	X

Now that we have our inverted index ready, a search for "hate yellow" will bring up both documents, but doc 2 will score higher since it contains both of the terms in the query, while doc 1 only contains "yellow".
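To make the idea concrete, here is a toy version of that table in plain Python. This is only a sketch to show the shape of the data structure: the function names `build_inverted_index` and `search` are made up for this post, and counting matched terms is a simplification of how Elasticsearch actually scores results.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            # Drop trailing punctuation so "cake." and "cake" are the same term.
            index[term.strip('.,!?')].add(doc_id)
    return index


def search(index, query):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Sort by descending score, then by doc id to keep ties stable.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


docs = {1: "I like yellow cake", 2: "I hate yellow cake."}
index = build_inverted_index(docs)
results = search(index, "hate yellow")
print(results)  # [(2, 2), (1, 1)]
```

Doc 2 comes first because it matched two terms, and doc 1 still shows up because it matched "yellow".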

Cool, right?

OK, so why pyelasticsearch? Why not just Elasticsearch?

When I'm learning something new, the more I can focus on one thing, the better. Elasticsearch is written in Java, but it has some great clients for programming languages like PHP and Python. I decided to try the one for Python.

Two of the main Python clients are installed via pip install elasticsearch and pip install pyelasticsearch. I started with the first one but discovered that pyelasticsearch is slightly more intuitive for me.

Below are some cool basic actions you can do with pyelasticsearch. Note: make sure Elasticsearch is running before you try any of them. To start it, navigate to the directory where you downloaded Elasticsearch and run bin/elasticsearch. This tutorial assumes it is running at the default address, localhost:9200.
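If you're not sure whether Elasticsearch is actually up, a quick check from Python can save some head-scratching. This is a minimal sketch using only the standard library; the helper name `is_elasticsearch_up` is my own invention, and it assumes that a running node answers a plain GET on its root URL with a 200.

```python
import http.client
import urllib.request


def is_elasticsearch_up(url='http://localhost:9200/', timeout=2):
    """Return True if something answers `url` with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a garbled reply.
        return False


print(is_elasticsearch_up())
```

If it prints False, go back and start bin/elasticsearch before continuing.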

from pyelasticsearch import ElasticSearch   #note the capital S
es = ElasticSearch('http://localhost:9200/')

#index a document
es.index('blogs', 'post', {'title': 'how to use Elasticsearch', 'tags': ['python', 'elasticsearch', 'programming']}, id=1)
#this creates an index called `blogs` (if it doesn't exist already) with a type of `post` and stores a document with a title and tags with an id of 1.

#refresh an index
es.refresh('blogs')

#get by id
es.get('blogs', 'post', 1)

#search an index
es.search('python', index='blogs')

#flush an index
es.flush('blogs')

#get a mapping
es.get_mapping('blogs')
#or do es.get_mapping() without any arguments to get mappings for all indexes

#delete an index (do this last: once the index is gone, calls against 'blogs' will fail)
es.delete_index('blogs')
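One more thing worth trying: the search earlier in this post passes a plain string, but as far as I can tell pyelasticsearch's search also accepts a dictionary written in Elasticsearch's query DSL, which lets you target a specific field. Here is a small sketch of building such a query. The helper `build_match_query` is hypothetical (it is not part of pyelasticsearch), and the commented-out call assumes the `es` client from above and an existing 'blogs' index.

```python
def build_match_query(field, text):
    """Build a query DSL body that full-text matches `text` against one field."""
    return {'query': {'match': {field: text}}}


query = build_match_query('title', 'elasticsearch')
print(query)

# With Elasticsearch running and the `es` client from earlier in the post:
# results = es.search(query, index='blogs')
```

The function only builds a dictionary, so you can print it and inspect it before sending anything to Elasticsearch.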