Ever checked out PostgreSQL’s full text search feature? It’s easy to use, and, together with Go, makes it possible to quickly build web apps like this one here – Scenes of Shakespeare – that can search through all of Shakespeare’s plays.

Check out the app now! Like it? Let’s see how to build it!

Storing the data

Each of Shakespeare’s 37 plays are divided into acts, and each act into scenes. We’ll store each scene as a row in the scenes table, which looks like this:

    workid text,       -- denotes the name of the play (like "macbeth")
    act integer,       -- the act (like 1)
    scene integer,     -- the scene within the act (like 7)
    description text,  -- short desc of the scene (like "Macbeth's castle.")
    body text          -- full text of the scene

We want to perform full text searches over the “description” and the “body” fields, and give a higher weightage to the “description” field. The way to do this in Postgres is to add a column of type tsvector. As the contents of this column, we’ll store the weighted lexemes of “description” and “body” columns.

Say what? A lexeme is a normalized representation of piece of English text, so that a query for “move” can match texts like “moving”, “moved” etc. Check this out:

main=> select to_tsvector('moving'), to_tsvector('moved'), to_tsvector('move');
 to_tsvector | to_tsvector | to_tsvector
 'move':1    | 'move':1    | 'move':1
(1 row)

to_tsvector is how you convert plain text to tsvector. See how all the three words end up as the same lexeme? There’s more about this stuff that we cannot cover here, but read this if you want to know more.

We also want to give a higher weightage to the scene description, relative to the text of the scene, so that a match in the description is ranked higher than a match in the text. Postgres supports four weights, called unimaginatively as A, B, C and D. By default, A ranks higher than B, which ranks higher than C etc. You add a weight to a tsvector using the setweight function, like this:

setweight(tsvec, 'A')

Now let’s put it all together. First let’s add the tsvector column, which we’ll call “tsv”:

ALTER TABLE scenes ADD COLUMN tsv tsvector;

Next, we’ll populate it with the weighted lexemes of the other columns. We’ll give a weight of ‘A’ to the description and ‘B’ to the body:

UPDATE scenes SET tsv =
    setweight(to_tsvector(description), 'A') ||
    setweight(to_tsvector(body), 'B');

As a last step, we’ll create an index on the tsv column. This will let Postgres run our full text queries against the index rather than the table rows. Postgres has different types of indexes, and for our purpose we’ll use GIN (Generalized Inverted Index) – read this if you want to know why.

CREATE INDEX ix_scenes_tsv ON scenes USING GIN(tsv);

The actual data

The site Open Source Shakespeare has downloadable databases of structured Shakespeare texts. While we can’t use this directly, here is a manipulated version of it, with the tables like we described above.

If you’re following along at home, be sure to download init.sql.gz and execute it on an empty database.

Querying the data

Now we’re all set to query the data. The query goes something like this:

SELECT * FROM scenes WHERE tsv @@ q;

We see a @@ operator, and an operand called q. The @@ operator is a boolean operator that returns true when the right-hand-side value of type tsquery matches the left-hand-side value of type tsvector. Here q is a value of type tsquery.

A tsquery, described here, holds lexemes and operators that are used to match the lexemes against a tsvector. For e.g., a tsquery of 'heaven' & 'earth' means that both the lexemes “heaven” and “earth” should be present in the tsvector. There are other operators, including the “followed by” operator that was just introduced in 9.6. However, we’ll not dive deeper into tsquery, and will stick with a simple way of producing tsquery values from plain text:

SELECT * FROM scenes WHERE tsv @@ plainto_tsquery('heaven earth');

or equivalently:

  workid, act, scene, description, body
  scenes, plainto_tsquery('heaven earth') q
  tsv @@ q;

Here the plainto_tsquery function is used to create a tsquery value from simple text by joining it’s lexemes with the AND operator.

The query is perfectly valid. Run it now if you have the database ready. It should return 157 rows. However, it makes sense to return only the top 10 or so – but wait! hast not thou forgotten something? What about the ranking?

There are two rank functions – ts_rank and ts_rank_cd, described here – that can quantify “how well does the tsquery match the tsvector?”. For example, here are two lines that are matched against “smell sweet”, and although both match, one matches “better”:

ftdemo=# select ts_rank(
ftdemo(#   to_tsvector('By any other name would smell as sweet;'),
ftdemo(#   plainto_tsquery('smell sweet'));
(1 row)

ftdemo=# select ts_rank(
ftdemo(#   to_tsvector('I smell sweet savours, and I feel soft things.'),
ftdemo(#   plainto_tsquery('smell sweet'));
(1 row)

Let’s add this into our query, along with a limit of 10 rows:

  workid, act, scene, description, body, ts_rank(tsv, q) as rank
  scenes, plainto_tsquery('heaven earth') q
  tsv @@ q
  rank DESC

There’s just one more thing left to do. As it stands, the query returns the whole of the body column. What we really want are highlighted snippets from the body, just like how a Google search returns snippets from a page with parts highlighted. For this, use the ts_headline function:

SELECT ..., ts_headline(body, q), ...

This will add HTML bold start/end tags (this is the default, you can change this) around matching text, and returns only snippets of relevant text:

<b>heaven</b> to <b>earth</b>, from <b>earth</b> to <b>heaven</b>;
And as imagination bodies forth
The forms of things

However, just fetching ts_headline(body, q) instead of body has a gotcha: ts_headline will be called for each row in the table, before the WHERE clause is applied. And ts_headline is expensive. It works by lexemizing the whole text, matching it with the tsquery and then inserting the HTML tags. So instead, we’ll use a subquery:

  workid, act, scene, description, ts_headline(body, q)
    workid, act, scene, description, body, ts_rank(tsv, q) as rank, q
    scenes, plainto_tsquery('heaven earth') q
    tsv @@ q
    rank DESC
  rank DESC;

The inner query will select 10 rows, each returning the full body, and the outer query will convert the body to ts_headline(body). This way the cost of ts_headline remains proportional to the search result limit of 10, rather than the number of rows in the table.

And that is our final query. (Well almost, since the actual code also has another table that maps “workid” to the full name of the play.)

Making a Web App

Let’s put together a Go-based web app to let people interactively search through the scene text. Thanks to Go’s built-in HTTP server and HTML templates, making a 3-page web app is rather trivial.

Connecting to the Postgres database is done via lib/pq, which provides a pure-Go database/sql driver for Postgres. This is the only dependency of our app.

The whole code, excluding the templates and the SQL queries is only about 120 LoC – read it right here. It has a couple of handler functions that display the web pages, rendered via templates. The error check is deliberately simple, and any errors result in a 404.

The app is deployable to Heroku, but remember to setup the database first. Here it is on Heroku, running on a free tier.

The code is available on Github here, and is MIT-licensed. Feel free to fork it and use it for all your text search needs!

Closing notes

It’s fairly easy to build a tool like this using Postgres and Go, and arguably simpler to run and maintain than a comparable Elasticsearch/Solr-based design.

PostgreSQL supports full text search in other languages too, like German, Italian, Spanish etc. Use \dF at the psql prompt to see the full list.

We also didn’t exploit all things that tsquery is capable of. In 9.6, tsquery got another operator (<N>) that can be used to match phrases, which is really much nicer than &.

The PostgreSQL documentation about Full Text Search is the best place to start digging further.

Happy searching!