Automatic Summary Tables in PostgreSQL

Has your app begun to slow down over time, whether you're seeing a large increase in traffic or not? It might be that those complex database queries that were snappy when you first launched are now taking longer and longer to execute. The first step is proper indexing, but you likely did this when you created your tables. It might be time for summary tables!

What are Summary Tables?

Summary tables are a simplified view of your complicated data that can be queried more quickly and easily. Are they magic? Absolutely, but not how you might think. Real-life magic requires a real-life magician pulling strings behind the scenes, and that's what you'll have to do. You will setup the trick, and your app's users will enjoy the show. But with just a little extra knowledge, you can be magic and lazy at the same time.

The Process

The trick follows this formula:

  • Create a table with the summary data as you'd like to see it.
  • Build a stored procedure that updates the summary table.
  • Setup a trigger that will run when your original tables are updated.

We're going to begin with a simple data model: a table for species of animals, including how many legs each species has, and a table for pets, who belong to a specific species. We want to easily query the database for the number of pets who have, say, four legs. Here's the structure for our animal tables, minus any keys:

CREATE TABLE species (
    id integer NOT NULL,
    name text,
    leg_count integer
);

CREATE TABLE pets (
    id integer NOT NULL,
    name text,
    species_id integer
);

And here's the query that gives us a tally of how many pets have each number of legs:

SELECT species.leg_count as leg_count, count(*) as pet_count
FROM pets
JOIN species on pets.species_id = species.id
GROUP BY species.leg_count

Now, to the magic.

Create the Summary Table

Our summary table is easy; it has the same format as the output of our complex join query:

CREATE TABLE leg_counts (
    id integer NOT NULL,
    leg_count integer,
    pet_count integer
);

Each record stores a leg count, and the number of pets that have that many legs.

Build the Summary Table

Here's the code:

CREATE FUNCTION update_leg_counts() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
    DECLARE
     pet_leg_count INTEGER;
     pet_count INTEGER;
    BEGIN
      SELECT INTO pet_leg_count species.leg_count
        FROM species
        WHERE species.id = NEW.species_id;
    
      SELECT INTO pet_count count(*)
        FROM pets
        JOIN species ON species.id = pets.species_id
        WHERE species.leg_count = pet_leg_count;
    
      DELETE FROM leg_counts where leg_counts.leg_count = pet_leg_count;
      INSERT INTO leg_counts (leg_count, pet_count) values (pet_leg_count, pet_count);
  
      RETURN NULL;
    END;
    $$;

Now that's a mouthful! We start by creating the function, with a return type of "trigger". This particular stored procedure has no parameters - it will be triggered by the creation of a new "pets" record, and that new record will be available within the function as "NEW". We set the language of the stored procedure as plpgsql, and declare a couple of variables that will be used in our code.

We set pet_leg_count to the number of legs that our new pet has. This is important, because we don't want to update ALL the records in our summary table, just the one that keeps track of our pet's same-legged friends. Next, we set pet_count to the number of pets with this many legs. Finally, we delete any summary table records with the same number of legs, and recreate it with the new leg count. We do it this way so our code works whether a record for that many legs previously existed, or not.

Setup the Trigger

Now for probably the easiest part of this process: telling Postgres to execute the stored procedure whenever a new pet is added:

    CREATE TRIGGER update_leg_counts
    AFTER INSERT ON pets
    FOR EACH ROW EXECUTE PROCEDURE update_leg_counts();

I gave the trigger the same name as the stored procedure, sorry if that's confusing. I didn't see a good reason to give them distinct names.

The Results

The difference is...magic? As we add pet records, we can measure the speed of querying pet counts by number of legs, both before and after the summary table. These were my actual performance numbers, which you can recreate with the code in the git repository I provided at the top of this article.

Number of Pets Query Time w/o Summary Query Time with summary Speed Improvement
100 0.0006 0.0002 65.98%
500 0.0007 0.0002 71.71%
1,000 0.0013 0.0002 79.86%
2,000 0.0019 0.0002 88.52%
3,000 0.0026 0.0002 91.04%
4,000 0.0033 0.0002 92.66%
5,000 0.0040 0.0002 93.41%
6,000 0.0049 0.0002 94.91%
7,000 0.0055 0.0002 95.77%
8,000 0.0062 0.0002 96.34%
9,000 0.0069 0.0002 96.46%
10,000 0.0094 0.0002 97.12%

The speed of querying the summary table remains constant no matter how many pet records we add to the database, while performing the complex join query gets increasingly sluggish. Even at just a hundred pet records, there is marked benefit. By the time we're storing 10,000 pets, we're saving over 97% of the time it takes to find out how many same-legged friends Fido has. Magic!

Posted in Database Performance and tagged , , , , , , , , .

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>