@andrewvc
Last active December 18, 2015 11:58

Revisions

  1. andrewvc revised this gist Jul 24, 2013. No changes.
  2. andrewvc revised this gist Jul 24, 2013. 1 changed file with 28 additions and 18 deletions.
    46 changes: 28 additions & 18 deletions laruby-elasticsearchtalk.md
    @@ -157,26 +157,36 @@ ElasticSearch can report counts of common terms in documents, frequently seen on

    ## Let's Facet

    ```ruby
    # Create a mapping for bands, with a 'name' and a 'genre'
    server.index(:bands).create(mappings: {band: {properties: {name: {type: :string}, genre: {type: :string, index: :not_analyzed} }}})

    #Import some docs
    [["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"], ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]].
    each_with_index {|b,i|
    server.index(:bands).type(:band).put(i, {name: b[0], genre: b[1]})
    }

    # Perform a search
    server.index(:bands).search(facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
    # => [["Post Rock", 2], ["IDM", 2], ["madchester", 1], ["Calypso", 1]]

    # A more specific search
    server.index(:bands).search(query: {match: {name: "Boards"}}, facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
    # => [["IDM", 1]]
    ```
    ```
    POST /bands
    PUT /bands/band/_mapping
    {"band":{"properties":{"name":{"type":"string"},"genre":{"type":"string","index":"not_analyzed"}}}}
    POST /_bulk
    {"index": {"_index": "bands", "_type": "band", "_id": 1}}
    {"name": "Stone Roses", "genre": "madchester"}
    {"index": {"_index": "bands", "_type": "band", "_id": 2}}
    {"name": "Aphex Twin", "genre": "IDM"}
    {"index": {"_index": "bands", "_type": "band", "_id": 4}}
    {"name": "Boards of Canada", "genre": "IDM"}
    {"index": {"_index": "bands", "_type": "band", "_id": 5}}
    {"name": "Mogwai", "genre": "Post Rock"}
    {"index": {"_index": "bands", "_type": "band", "_id": 6}}
    {"name": "Godspeed", "genre": "Post Rock"}
    {"index": {"_index": "bands", "_type": "band", "_id": 7}}
    {"name": "Harry Belafonte", "genre": "Calypso"}
    // Perform a search
    POST /bands/band/_search
    {"size": 0, "facets":{"bands":{"terms":{"field":"genre"}}}}
    // A more specific search
    POST /bands/band/_search
    {"size": 5, "query": {"match": {"name": "Harry"}}, "facets":{"bands":{"terms":{"field":"genre"}}}}
    ```

    ## Integrating With Rails
    ## Integrating With an App Server

    ## Key Rails Integration Criteria

  3. andrewvc revised this gist Jul 17, 2013. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion laruby-elasticsearchtalk.md
    @@ -1,4 +1,4 @@
    # An Elasticsearch in Ruby Crash Course!
    # An Elasticsearch Crash Course!

    ### By Andrew Cholakian

  4. andrewvc revised this gist Jun 14, 2013. 1 changed file with 1 addition and 2 deletions.
    3 changes: 1 addition & 2 deletions laruby-elasticsearchtalk.md
    @@ -293,11 +293,10 @@ Good because:

    ### Links

    * This talk: http://bit.ly/142wv13
    * http://www.elasticsearch.org/
    * http://exploringelasticsearch.com (my free book on elasticsearch)
    * https://github.com/PoseBiz/stretcher
    * Paramedic Cluster Monitoring tool: https://github.com/karmi/elasticsearch-paramedic
    * This presentation: https://gist.github.com/andrewvc/ebbe0e832cdd2ff7b431


    ## This Page Intentionally Left Blank
  5. andrewvc created this gist Jun 14, 2013.
    303 changes: 303 additions & 0 deletions laruby-elasticsearchtalk.md
    @@ -0,0 +1,303 @@
    # An Elasticsearch in Ruby Crash Course!

    ### By Andrew Cholakian

    *All examples use the Stretcher Ruby gem*

    ## What is Elasticsearch?

    * An Information Retrieval (IR) System
    * A way to search your data in terms of natural language, and so much more
    * A distributed version of Lucene with a JSON API
    * A fancy clustered, eventually consistent database

    ## Elasticsearch and Lucene

    Lucene is an information retrieval library providing full-text indexing and search. Elasticsearch provides a RESTish HTTP interface, clustering support, and other tools on top of it.

    ## Modeling Data

    * Data is stored in an **index**, similar to an SQL DB
    * Each index can store multiple **types**, similar to an SQL table
    * Items inside the index are **documents** that have a type
    * All documents are nested JSON data
    * Strongly typed schema

    ## Creating a Schema

    ```ruby
    # Setup our server
    server = Stretcher::Server.new('http://localhost:9200')
    # Create the index with its schema
    server.index(:foo).create(mappings: {
      tweet: {
        properties: {
          text: {type: 'string', analyzer: 'snowball'}
        }
      }
    }) rescue nil
    ```

    ## Create some fake data

    ```ruby
    words = %w(Many dogs dog cat cats candles candleizer abscond rightly candlestick monkey monkeypulley deft deftly)
    words.each.with_index {|w,idx|
    server.index(:foo).type(:tweet).put(idx+1, {text: w })
    }
    ```

    * The document is a simple JSON hash: `{"text": "word" }`
    * Each document has a unique ID
    * We use `put` because Elasticsearch has a RESTish API
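
    As a rough sketch of what that `put` looks like on the wire, the snippet below builds (but never sends) the HTTP request using only Ruby's standard library. The `/index/type/id` path reuses the index and type names from above; `build_put_request` is a hypothetical helper, not part of Stretcher.

    ```ruby
    require 'json'
    require 'net/http'

    # Build the request that stores one document at /index/type/id.
    # Nothing is sent; we only construct the request object.
    def build_put_request(index, type, id, doc)
      req = Net::HTTP::Put.new("/#{index}/#{type}/#{id}")
      req['Content-Type'] = 'application/json'
      req.body = JSON.generate(doc)
      req
    end

    request = build_put_request(:foo, :tweet, 1, { text: 'Many' })
    puts "#{request.method} #{request.path}"
    puts request.body
    ```

    The document ID lives in the URL and the document itself is the JSON body.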

    ## And Perform a Search!

    ```ruby
    # A simple search
    server.index(:foo).search(query: {match: {text: "abscond"}}).results.map(&:text)
    # => ["abscond"]
    ```

    * our query is actually a JSON object
    * our response is also JSON!
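
    To make the JSON on both sides concrete, here is a small sketch with no server involved. The response string is a hand-written sample in the general shape of a search response (hits nested under `hits.hits`, the stored document under `_source`); its values are illustrative, not captured output.

    ```ruby
    require 'json'

    # The query hash serializes directly to the JSON request body.
    query = { query: { match: { text: 'abscond' } } }
    request_body = JSON.generate(query)

    # Hand-written sample response (illustrative values only).
    sample_response = '{"hits":{"total":1,"hits":[{"_id":"8","_source":{"text":"abscond"}}]}}'

    # Pull the stored documents back out of the response.
    texts = JSON.parse(sample_response)['hits']['hits'].map { |hit| hit['_source']['text'] }

    puts request_body
    puts texts.inspect
    ```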

    ## What is Analysis?

    Analysis is the process whereby words are transformed into tokens.
    The Snowball analyzer, for instance, turns English words into tokens based on their stems.

    ![An Analyzer in Action](https://www.evernote.com/shard/s46/sh/d5eb1481-b9a1-459f-ba93-8ebd9bcae64f/dd6870867a5a06fb6f561b8eade356ef/deep/0/analysis-rollerblading.png)
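
    A toy version of the idea, in plain Ruby. This is NOT the Snowball algorithm, just a minimal sketch of "text in, normalized tokens out"; the suffix list and the `toy_*` names are made up for illustration.

    ```ruby
    # A toy analyzer: lowercase, split on non-letters, strip a few suffixes.
    # This is NOT Snowball, only an illustration of what analysis does.
    TOY_SUFFIXES = %w(ness ly s).freeze

    def toy_stem(word)
      # Only strip a suffix if at least three letters of stem remain.
      suffix = TOY_SUFFIXES.find { |s| word.end_with?(s) && word.length > s.length + 2 }
      suffix ? word[0...-suffix.length] : word
    end

    def toy_analyze(text)
      text.downcase.scan(/[a-z]+/).map { |word| toy_stem(word) }
    end

    p toy_analyze('deft deftly deftness')
    p toy_analyze('Many dogs')
    ```

    Because indexing and querying both go through the same analysis, "deftly" in a document and "deft" in a query end up as the same token.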

    ## Analysis Using the API

    ```ruby
    server.analyze("deft", analyzer: :snowball).tokens.map(&:token)
    # => ["deft"]
    server.analyze("deftly", analyzer: :snowball).tokens.map(&:token)
    # => ["deft"]
    server.analyze("deftness", analyzer: :snowball).tokens.map(&:token)
    # => ["deft"]
    server.analyze("candle", analyzer: :snowball).tokens.map(&:token)
    # => ["candl"]
    server.analyze("candlestick", analyzer: :snowball).tokens.map(&:token)
    # => ["candlestick"]
    ```

    ## Analysis in Action

    ```ruby
    # Will match deft and deftly
    server.index(:foo).search(query: {match: {text: "deft"}}).results.map(&:text)
    # => ["deft", "deftly"]
    # Will match candles, but not candlestick
    server.index(:foo).search(query: {match: {text: "candle"}}).results.map(&:text)
    # => ["candles"]
    ```

    ## More kinds of Analysis

    ```ruby
    # NGram
    server.analyze("news", tokenizer: "ngram", filter: "lowercase").tokens.map(&:token)
    # => ["n", "e", "w", "s", "ne", "ew", "ws"]

    # Stop word
    server.analyze("The quick brown fox jumps over the lazy dog.", analyzer: :stop).tokens.map(&:token)
    #=> ["quick", "brown", "fox", "jumps", "over", "lazy", "dog"]

    # Path Hierarchy
    server.analyze("/var/lib/racoons", tokenizer: :path_hierarchy).tokens.map(&:token)
    # => ["/var", "/var/lib", "/var/lib/racoons"]
    ```
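
    Two of these tokenizers are simple enough to imitate in plain Ruby. The sketch below is not Elasticsearch's implementation; `ngrams` and `path_hierarchy` are hypothetical helpers, with gram lengths of 1 and 2 chosen to match the output above.

    ```ruby
    # Emit all grams of length min..max, shortest first, after lowercasing.
    def ngrams(text, min = 1, max = 2)
      chars = text.downcase.chars
      (min..max).flat_map { |n| chars.each_cons(n).map(&:join) }
    end

    # Emit each ancestor of a slash-delimited path, shallowest first.
    def path_hierarchy(path)
      parts = path.split('/').reject(&:empty?)
      (1..parts.length).map { |depth| '/' + parts.first(depth).join('/') }
    end

    p ngrams('news')
    # => ["n", "e", "w", "s", "ne", "ew", "ws"]
    p path_hierarchy('/var/lib/racoons')
    # => ["/var", "/var/lib", "/var/lib/racoons"]
    ```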

    ## Searching With An NGram

    ```ruby
    # Create the index
    server.index(:users).create(settings: {analysis: {analyzer: {my_ngram: {type: "custom", tokenizer: "ngram", filter: 'lowercase'}}}}, mappings: {user: {properties: {name: {type: :string, analyzer: :my_ngram}}}})

    # Store some fake data
    users = %w(bender fry lela hubert cubert hermes calculon)
    users.each_with_index {|name,i| server.index(:users).type(:user).put(i, {name: name}) }

    # Our analyzer in action
    server.index(:users).analyze("hubert", analyzer: :my_ngram).tokens.map(&:token)
    # => ["h", "u", "b", "e", "r", "t", "hu", "ub", "be", "er", "rt"]

    # Some queries

    # Exact
    server.index(:users).search(query: {match: {name: "Hubert"}}).results.map(&:name)
    # => ["hubert", "cubert", "bender", "hermes", "fry", "calculon", "lela"]

    # A Mis-spelled query
    server.index(:users).search(query: {match: {name: "Calclulon"}}).results.map(&:name)
    # => ["calculon", "lela", "cubert", "bender", "hubert"]
    ```
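
    Why does a misspelled query still find "calculon"? The query and the name share most of their grams. The sketch below ranks names by the number of distinct shared grams. That is a deliberate simplification and not how Elasticsearch actually scores hits; `gram_set` and `rank_by_overlap` are made-up helpers.

    ```ruby
    require 'set'

    # Distinct grams of length min..max for a lowercased string.
    def gram_set(text, min = 1, max = 2)
      chars = text.downcase.chars
      (min..max).flat_map { |n| chars.each_cons(n).map(&:join) }.to_set
    end

    # Rank names by how many distinct grams they share with the query,
    # dropping names that share none.
    def rank_by_overlap(query, names)
      query_grams = gram_set(query)
      scored = names.map { |name| [name, (gram_set(name) & query_grams).size] }
      scored.reject { |_, shared| shared.zero? }.sort_by { |name, shared| [-shared, name] }
    end

    users = %w(bender fry lela hubert cubert hermes calculon)
    ranking = rank_by_overlap('Calclulon', users)
    p ranking.first
    p ranking.map(&:first)
    ```

    "fry" and "hermes" share no grams with the query and drop out, consistent with the result above. Every other name matches on at least one single letter, which is why n-gram search is so forgiving (and so noisy).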

    ## Boosting

    ```ruby
    # Individual docs can be boosted
    server.index(:users).type(:user).put(1000, {name: "boiler", "_boost" => 1_000_000})

    server.index(:users).search(query: {match: {name: "bender"}}).results.map(&:name)
    # Wha?
    # => ["boiler", "bender", "hermes", "cubert", "hubert", "calculon", "fry", "lela"]

    server.index(:users).search(query: {match: {name: "lela"}}).results.map(&:name)
    # Sweet Zombie Jesus!
    # => ["boiler", "lela", "calculon", "bender", "hermes", "cubert", "hubert"]
    ```

    ## Faceting

    Elasticsearch can report counts of common terms across documents. These counts, frequently seen on the left-hand side of websites, are called 'facets'.

    ![Facets on Amazon](https://www.evernote.com/shard/s46/sh/dcc9a51c-9296-40ac-83b3-ae0ad66379d5/7b1a9c6e980f87c6adc8c3dfed93993a/deep/0/Amazon.com.jpg)

    ## Let's Facet

    ```ruby
    # Create a mapping for bands, with a 'name' and a 'genre'
    server.index(:bands).create(mappings: {band: {properties: {name: {type: :string}, genre: {type: :string, index: :not_analyzed} }}})

    #Import some docs
    [["Stone Roses", "madchester"], ["Boards of Canada", "IDM"], ["Aphex Twin", "IDM"], ["Mogwai", "Post Rock"], ["Godspeed", "Post Rock"], ["Harry Belafonte", "Calypso"]].
    each_with_index {|b,i|
    server.index(:bands).type(:band).put(i, {name: b[0], genre: b[1]})
    }

    # Perform a search
    server.index(:bands).search(facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
    # => [["Post Rock", 2], ["IDM", 2], ["madchester", 1], ["Calypso", 1]]

    # A more specific search
    server.index(:bands).search(query: {match: {name: "Boards"}}, facets: {bands: {terms: {field: :genre}}}).facets.bands.terms.map {|f| [f[:term], f[:count]]}
    # => [["IDM", 1]]
    ```
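
    Conceptually, a terms facet on a `not_analyzed` field is a group-and-count over the documents the query matched. A plain-Ruby sketch of that idea, with no server involved (`terms_facet` is a hypothetical helper, and the substring filter only stands in for the `match` query):

    ```ruby
    bands = [
      { name: 'Stone Roses', genre: 'madchester' },
      { name: 'Boards of Canada', genre: 'IDM' },
      { name: 'Aphex Twin', genre: 'IDM' },
      { name: 'Mogwai', genre: 'Post Rock' },
      { name: 'Godspeed', genre: 'Post Rock' },
      { name: 'Harry Belafonte', genre: 'Calypso' }
    ]

    # Count documents per exact field value, most common first.
    # Ties are broken alphabetically here, which is an arbitrary choice.
    def terms_facet(docs, field)
      counts = docs.each_with_object(Hash.new(0)) { |doc, acc| acc[doc[field]] += 1 }
      counts.sort_by { |term, count| [-count, term] }
    end

    all_counts = terms_facet(bands, :genre)
    boards_only = terms_facet(bands.select { |b| b[:name].include?('Boards') }, :genre)
    p all_counts
    p boards_only
    ```

    Since `genre` is not analyzed, "Post Rock" counts as one term instead of being split into "post" and "rock".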

    ## Integrating With Rails

    ## Key Rails Integration Criteria

    * Generally use an RDBMS(SQL) as primary store
    * Elasticsearch data should respond correctly to RDBMS transactions
    * Elasticsearch data can be rebuilt from RDBMS any time
    * ActiveRecord objects do not necessarily map 1:1 w/ ES objects
    * ES should fail gracefully whenever possible. If ES dies, your app should degrade, not stop.

    ## What NOT to do!

    ```ruby
    after_save do
    es_client.put(self.id, self.as_json)
    end
    ```

    Bad because:

    * If another after_save block fails and causes a transaction rollback, the Elasticsearch write won't be rolled back
    * If ES goes down, your app goes down
    * Even if you handle ES going down, you have to figure out which records need re-indexing when it comes back up

    ## How We Solved This at Pose

    ```ruby
    after_save do
    # Add to RDBMS queue of objects needing indexing
    IndexRequest.create(self)
    end
    ```

    Good because:

    * Processed in background
    * Transaction safe
    * If ES dies, our queue backs up
    * BONUS: Efficient bulk update now possible
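
    The queue behavior can be sketched in memory. `IndexQueue` below is a hypothetical stand-in for the `IndexRequest` table and its background worker, not Pose's actual code; the lambdas play the role of the Elasticsearch client.

    ```ruby
    # Hypothetical in-memory stand-in for the IndexRequest table.
    class IndexQueue
      def initialize
        @pending = []
      end

      # Queue a record ID once, even if it is saved several times.
      def enqueue(id)
        @pending << id unless @pending.include?(id)
      end

      def size
        @pending.size
      end

      # Hand one batch to the indexer. If the indexer raises, the batch
      # stays queued and 0 is returned; otherwise the batch is removed.
      def drain(indexer, batch_size = 100)
        batch = @pending.first(batch_size)
        return 0 if batch.empty?
        indexer.call(batch)
        @pending.shift(batch.size)
        batch.size
      rescue StandardError
        0
      end
    end

    queue = IndexQueue.new
    [1, 2, 3].each { |id| queue.enqueue(id) }

    failing = ->(_ids) { raise 'ES is down' }
    indexed = []
    working = ->(ids) { indexed.concat(ids) }

    queue.drain(failing) # ES down: the backlog is kept
    queue.drain(working) # ES back: the backlog goes out as one batch
    p indexed
    ```

    In the real setup the queue lives in the RDBMS, so enqueueing commits or rolls back together with the record itself.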

    ## Queue Visualized

    ![The Queue](http://blog.andrewvc.com/assets/images/elasticsearch_model_pipeline.png)

    ## How We Implemented Bulk Updates

    * Indexes are rebuilt w/o using queue
    * Multiple DelayedJob workers run mod sharded queries over table
    * High-speed, parallel re-imports possible
    * New content will use queue
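
    "Mod sharded" means each worker takes the rows whose ID leaves a particular remainder. A minimal sketch over an in-memory list of IDs (the real workers would presumably push the same condition into their SQL query; `ids_for_worker` is a made-up name):

    ```ruby
    # IDs handled by one worker: those where id % worker_count == worker_index.
    def ids_for_worker(ids, worker_index, worker_count)
      ids.select { |id| id % worker_count == worker_index }
    end

    all_ids = (1..10).to_a
    worker_count = 3
    slices = (0...worker_count).map { |w| ids_for_worker(all_ids, w, worker_count) }
    p slices
    # => [[3, 6, 9], [1, 4, 7, 10], [2, 5, 8]]
    ```

    The slices are disjoint and together cover every ID, so workers never need to coordinate.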

    ## Complex Schema Update Problems

    * No (good) way in ES to change a field's type
    * Delete / rebuild may leave the site inoperable for too long

    ## Complex Schema Update Solutions

    * Allow N indexes per model
    * All indexes are updated in real-time, IndexRequest queue centralizes reqs
    * Batch job runs in background retroactively adding new records
    * When new index caught up, point queries at it, delete old
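
    The bookkeeping behind that switch-over can be sketched as a small registry. `IndexSet` and the `users_v1` / `users_v2` names are hypothetical; this only tracks which indexes receive writes and which one serves queries.

    ```ruby
    # Hypothetical registry of the indexes backing one model.
    class IndexSet
      attr_reader :read_index, :write_indexes

      def initialize(name)
        @read_index = name
        @write_indexes = [name]
      end

      # Start writing to a new index while still reading from the old one.
      def add(name)
        @write_indexes << name
      end

      # Once the new index has caught up, read from it and stop writing
      # to the others. Returns the retired index names.
      def promote(name)
        raise ArgumentError, "unknown index #{name}" unless @write_indexes.include?(name)
        retired = @write_indexes - [name]
        @read_index = name
        @write_indexes = [name]
        retired
      end
    end

    indexes = IndexSet.new('users_v1')
    indexes.add('users_v2')
    p indexes.write_indexes # both receive real-time updates
    retired = indexes.promote('users_v2')
    p indexes.read_index
    p retired # now safe to delete
    ```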

    ## Requirements For Multi-Schema Solutions

    * Ability to map models:indexes 1:n. We implemented m:n
    * Simultaneous bulk range and real-time indexing
    * Fast enough bulk operations that you don't take ∞ time

    ## Problem: Building BIG Queries

    * Some queries will be large and programmatically generated
    * Our largest query > 100 lines expanded JSON
    * Sometimes need to run A/B tests between queries

    ## Solution: Class Per Query

    * Each query gets own class
    * Plenty of space for DRY helpers within classes
    * When running A/B tests, subclassing for variations
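
    A minimal sketch of the class-per-query idea. The class names and the query itself are invented for illustration and are not Pose's actual code; the variant overrides only the part that differs.

    ```ruby
    # Hypothetical query class: builds the search body as a Ruby hash.
    class UserSearchQuery
      def initialize(term)
        @term = term
      end

      def to_hash
        { query: { match: { name: match_options } }, size: page_size }
      end

      private

      # Helpers keep the pieces of a large query small and reusable.
      def match_options
        { query: @term }
      end

      def page_size
        10
      end
    end

    # A/B variation: require all query terms to match.
    class StrictUserSearchQuery < UserSearchQuery
      private

      def match_options
        super.merge(operator: 'and')
      end
    end

    p UserSearchQuery.new('bender').to_hash
    p StrictUserSearchQuery.new('bender').to_hash
    ```

    The resulting hash can be passed to `search` in the same way as the inline hashes in the earlier examples.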

    ## Search API Class Structure

    ![class structure](http://blog.andrewvc.com/assets/images/elasticsearch-classes.png)

    ## Does ElasticSearch Support Clustering?

    ## You're Damn Right it Supports Clustering!

    ![ES Clustering](https://www.evernote.com/shard/s46/sh/85bb4d5b-0b8f-4bb0-bf1e-3d5ed01b6048/41e5d0d37143ce4276d6783d61da6b4f/deep/0/Paramedic%20%7C%20pose-cluster.png)

    ## The Clustering Story

    * All queries run across all shards in the cluster
    * Shards are allocated automatically to nodes and rebalanced
    * A query to any node will work; the actual queries will be executed on the proper shard / node
    * Shards are rack aware
    * Indexes have a configurable number of replicas; set this based on your failure tolerance
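
    "Runs across all shards" is a scatter-gather: every shard returns its own best hits, and the node that received the request merges them. A heavily simplified in-memory sketch, with made-up scores and a hypothetical `scatter_gather` helper:

    ```ruby
    # Each inner array is one shard's scored hits (values are made up).
    shards = [
      [{ id: 1, score: 0.9 }, { id: 4, score: 0.2 }],
      [{ id: 2, score: 0.7 }],
      [{ id: 3, score: 0.8 }, { id: 5, score: 0.1 }]
    ]

    # Ask every shard for its best `size` hits, then merge into a global top list.
    def scatter_gather(shards, size)
      per_shard = shards.map { |hits| hits.sort_by { |h| -h[:score] }.first(size) }
      per_shard.flatten.sort_by { |h| -h[:score] }.first(size)
    end

    top = scatter_gather(shards, 3)
    p top.map { |h| h[:id] }
    # => [1, 3, 2]
    ```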


    ## The Ops Side of elasticsearch

    * elasticsearch is easy to set up!
    * Just a Java jar; all you need is Java installed
    * Has a .deb package available

    ## Clustering just works

    * Clustering just works...
    * If on a LAN they will find each other and figure everything out
    * If on EC2, install the EC2 plugin and they will find each other
    * There is no built-in security, but putting nginx in front as a proxy works well

    ## Thank You for Listening!

    ### Links

    * http://www.elasticsearch.org/
    * http://exploringelasticsearch.com (my free book on elasticsearch)
    * https://github.com/PoseBiz/stretcher
    * Paramedic Cluster Monitoring tool: https://github.com/karmi/elasticsearch-paramedic
    * This presentation: https://gist.github.com/andrewvc/ebbe0e832cdd2ff7b431


    ## This Page Intentionally Left Blank