Kacper Golinski The Blog

Optimizing full-text search with Postgres materialized view in Rails


My recent side project is an aggregator for remote dev jobs https://remotestack.club. To keep things simple, I decided to use Postgres full-text search. It offers powerful search capabilities. More than enough for a side project and early startups.

I built the project with Ruby on Rails and I used pg_search gem to handle PostgreSQL’s full-text search. I wanted to offer a search across the job’s details, skills’ names, and the company’s name.

It is easy to search columns on associated models. Unfortunately, there is no simple solution to speed up those searches. This article shows how to optimize search with Postgres materialized view.

Full-text search is a technique for searching natural-language documents that satisfy a query. In our case, a query is a text provided by a user.

A document is the unit of searching in a full text search system; for example, a magazine article or email message

Postgres text search intro

In PostgreSQL, a document usually is a text field or a combination of fields. Possibly stored across multiple tables. During the search, each document is converted into tsvector.

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word

Text search data types

Below we can see a tsvector in action.

postgres=# SELECT to_tsvector('Ruby on Rails, is a server-side web application framework');
                                       to_tsvector
-----------------------------------------------------------------------------------------
 'applic':10 'framework':11 'rail':3 'rubi':1 'server':7 'server-sid':6 'side':8 'web':9
(1 row)


Schema

class CreateJobPostsSkillsAndCompanies < ActiveRecord::Migration
  def change
    create_table :companies do |t|
      t.string :name
    end

    create_table :job_posts do |t|
      t.string :title
      t.text :description
      t.belongs_to :company

      t.timestamps
    end

    create_table :skills do |t|
      t.string :name

      t.timestamps
    end

    create_table :job_post_skills do |t|
      t.belongs_to :skill
      t.belongs_to :job_post

      t.timestamps
    end
  end
end


Seed data

Let’s create some seed data. 10_000 job posts should be enough.

https://github.com/caspg/optimizing-postgresql-full-text-search-rails/blob/master/db/seeds.rb


Full text with pg_search gem

Pg_search gem builds ActiveRecord named scopes that take advantage of PostgreSQL’s full-text search. We can configure a search scope using pg_search_scope. The first parameter is a scope that we will use for full-text search.

We want to search against columns in JobPost but also against columns on associated models, Skill and Company. pg_search supports searching through associations with :associated_against options.

class JobPost < ApplicationRecord
  has_many :job_post_skills, dependent: :destroy
  has_many :skills, through: :job_post_skills
  belongs_to :company

  include PgSearch::Model
  pg_search_scope(
    :search,
    against: [:title, :description],
    associated_against: { skills: :name, company: :name },
    using: {
      tsearch: {
        dictionary: 'english',
      },
    },
  )
end

After adding a couple of lines of code, we can already use a full-text search. As we can see below, performance is not that great.

> JobPost.search("ruby on rails")
JobPost Load (1076.5ms)  SELECT  "job_posts".* FROM "job_posts" ...


Potential optimizations

We can use database indexes to speed up data retrieval. Postgres gives us two types of indexes for full-text searches.

  • GIN (Generalized Inverted Index)
  • GiST (Generalized Search Tree)

Acording to the documentation, GIN indexes are the preferred type.

CREATE INDEX name ON table USING GIN (column);
-- or
CREATE INDEX name ON table USING GIN (to_tsvector('english', column));

The column must be of tsvector type or must be converted to this type with to_tsvector function. We can populate the column of tsvector type using database triggers. With searches across associated tables, we have to do some extra work to build such indexes.

We could use database denormalization and triggers to ensure data integrity. This would give us up to date indexes but would introduce extra complexity and would slow down updates. Another solution is materialized view.


Quick intro to materialized views

A View is a virtual table created by a query based on one or more tables. It can be used for wrapping commonly used complex queries. Materialized Views are special kind of View that persist results in table-like form.

They give us faster access to data but increase database size and data are not always current. They are perfect in scenarios when data does not have to be always fresh or when we have more or less static data. For example, a job aggregator which imports new posts a couple of times per day. In this case, we can refresh data after each import.

We can generate fresh data with:

REFRESH MATERIALIZED VIEW my_materialized_view;


Materialized views in Rails with Scenic gem

Scenic gem adds methods to create and manage database views (and materialized views) in Rails.

rails generate scenic:view job_post_search --materialized
      create  db/views/job_post_searches_v01.sql
      create  db/migrate/[TIMESTAMP]_update_job_post_searches_to_version_1.rb

job_post_searches_v01.sql defines a query we will use to build a materialized view. We have to build a view with two columns, job_post_id and tsv_document. tsv_document is a combination of associated fields in tsvector data type.

-- db/views/job_post_searches_v01.sql

SELECT
  job_posts.id AS job_post_id,
  (
    to_tsvector('english', coalesce(job_posts.title, ''))
    || to_tsvector('english', coalesce(job_posts.description, ''))
    || to_tsvector('english', coalesce(companies.name, ''))
    || to_tsvector('english', coalesce(string_agg(skills.name, ' ; '), ''))
  ) AS tsv_document
FROM job_posts
JOIN companies ON companies.id = job_posts.company_id
JOIN job_post_skills ON job_post_skills.job_post_id = job_posts.id
JOIN skills ON skills.id = job_post_skills.skill_id
GROUP BY job_posts.id, companies.id;

The above query returns the following results:

 job_post_id |                                                                                                            tsv_document
-------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
           1 | 'applic':17 'block':31 'dach':30 'develop':5 'framework':18 'licens':25 'mit':24 'parker':28 'rail':4,8,10 'rubi':2,6,21,32 'senior':1 'server':14 'server-sid':13 'side':15 'smith':27 'spacex':26 'von':29 'web':16 'written':19

...more rows

Scenic adds create_view method. It creates a materialized view based on the corresponding SQL statement. Finally, we can also create a GIN index on tsv_document column.

class CreateJobPostSearches < ActiveRecord::Migration
  def change
    create_view :job_post_searches, materialized: true

    # below line added by us
    add_index :job_post_searches, :tsv_document, using: :gin
  end
end


Full-text search using materialized view

Thanks to ActiveRecord, a model can be backed by a view. We can define search scope on such model in the same way we did with JobPost model. This time, we want to search against tsvector type column, instead of using an expression (which is used by default). It won’t create tsvector during each search and will use a previously created index.

class JobPostSearch < ApplicationRecord
  self.primary_key = :job_post_id

  include PgSearch::Model
  pg_search_scope(
    :search,
    against: :tsv_document,
    using: {
      tsearch: {
        dictionary: 'english',
        # specify tsvector column name
        tsvector_column: 'tsv_document',
      },
    },
  )
end

class JobPost < ApplicationRecord
  ... # omitted lines

  def self.faster_search(query)
    # protip: when using `select` instead of `pluck` we have one query less
    where(id: JobPostSearch.search(query).select(:job_post_id))
  end
end


Results

> JobPost.search("ruby on rails")
JobPost Load (1076.5ms)  SELECT  "job_posts".* FROM "job_posts" ...

> JobPost.faster_search("ruby on rails")
JobPost Load (1.2ms)  SELECT  "job_posts".* FROM "job_posts" ...


Refreshing materialied view

There is one more thing that we will have to take care of. We will have to refresh the materialized view periodically. Scenic gives us a handy method to do that.

When the refresh is running in nonconcurrent mode, the view is locked for selects. We can avoid that with the concurrent mode. The concurrent mode requires at least PostgreSQL 9.4 and view to have at least one unique index that covers all rows.

class JobPostSearch < ApplicationRecord
  ... # omitted lines

  def self.refresh_materialized_view
    Scenic.database.refresh_materialized_view(
      :job_post_searches,
      concurrently: true,
    )
  end
end

We can add an index to our view as to any other table.

class AddUniqueIndexToJobPostSerches < ActiveRecord::Migration
  def change
    add_index :job_post_searches, :job_post_id, unique: true
  end
end

Now we can run the below method when we want to generate fresh data.

> JobPostSearch.refresh_materialized_view
(1411.2ms)  REFRESH MATERIALIZED VIEW CONCURRENTLY "job_post_searches";


Comments

You can reach my via email or discuss on Twitter.