Fast Postgres from .NET

It's often said that abstractions slow down your program: they add layers, and layers make applications slower.
While this is generally correct, it's not always true.
Removing layers can buy some performance, but the biggest wins come from changing algorithms.

So let's see how we can beat the performance of hand-written SQL and hand-rolled object materialization, since it is (wrongly) taken as common knowledge that doing this by hand is the fastest way to talk to the database.

First use case, simple single table access

CREATE TABLE Post
(
  id UUID PRIMARY KEY,
  title VARCHAR NOT NULL,
  created DATE NOT NULL
)
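On the .NET side, let's assume a matching POCO; the materialization snippets below use these member names:

```csharp
using System;

// POCO matching the Post table; the materialization
// examples in this post assume these member names.
class Post
{
    public Guid id;
    public string title;
    public DateTime created;
}
```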

So – the standard pattern to access such a table would be:

SELECT * FROM Post

ignoring (for now) that it would probably be better style to explicitly name the columns. Alternatively, in Postgres we can also write:

SELECT p FROM Post p

which returns a single composite (tuple) value for each row in the table.

For the first query, without going too deep into the actual Postgres protocol, we would get three "columns", each with a length and content. Parsing such a response would look something like this:

IDataReader dr = ...
return new Post { 
    id = dr.GetGuid(0), 
    title = dr.GetString(1),
    created = dr.GetDateTime(2) 
};

The second query, on the other hand, has only one "column" with length and content. Parsing it requires knowledge of the Postgres rules for tuple assembly and is similar to parsing JSON. The code would look like this:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Post>(dr.GetValue(0));

In the TEXT protocol, the response from Postgres would look like this:

(f4d84c89-c179-4ae4-991a-e2e6bc12d879,"some text",2015-03-12)
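As a rough illustration, splitting such a flat record into its raw elements can be sketched as below. This is a minimal sketch with a hypothetical `RecordParser` helper; a real driver (such as Revenj's converters) must also handle NULLs, nested records and arrays:

```csharp
using System.Collections.Generic;
using System.Text;

// Minimal sketch: split a flat Postgres TEXT-protocol record such as
// (f4d84c89-...,"some text",2015-03-12) into its raw string elements.
// Quoted elements may contain commas; a doubled "" is an escaped quote.
static class RecordParser
{
    public static List<string> Split(string record)
    {
        var body = record.Substring(1, record.Length - 2); // strip ( and )
        var parts = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < body.Length; i++)
        {
            char c = body[i];
            if (inQuotes)
            {
                if (c == '"')
                {
                    if (i + 1 < body.Length && body[i + 1] == '"')
                    {
                        current.Append('"'); // escaped quote
                        i++;
                    }
                    else inQuotes = false; // closing quote
                }
                else current.Append(c);
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                parts.Add(current.ToString());
                current.Clear();
            }
            else current.Append(c);
        }
        parts.Add(current.ToString());
        return parts;
    }
}
```

Splitting the example row above yields three elements; turning each element into a Guid, string and DateTime is then a matter of per-type converters.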

So, now we can raise a couple of questions:

  • is it faster or slower for Postgres to return the second version?
  • can we parse the second response faster than the first response on the client side?

To make things more interesting, let's also investigate how the two compare when we talk to Postgres using the BINARY protocol in the first case and the TEXT protocol in the second. Common knowledge tells us that binary protocols are much faster than textual ones, but this isn't always true either:


(DSL Platform – serialization benchmark)

Verdict: for such a simple table, the performance of both approaches is similar.


(DSL Platform DAL benchmark – single table)

Second use case, master-detail table access

A common pattern in DB access is reading from two tables to reconstruct an object on the client side. While there are several possible approaches, let's use the "standard" one, which first reads from one table and then from the other. Unless we change the isolation level, this can sometimes read inconsistent data.

For this example, let's use Invoice and Item tables:

CREATE TABLE Invoice
(
  NUMBER VARCHAR(20) PRIMARY KEY,
  dueDate DATE NOT NULL,
  total NUMERIC NOT NULL,
  paid TIMESTAMPTZ,
  canceled BOOL NOT NULL,
  version BIGINT NOT NULL,
  tax NUMERIC(15,2) NOT NULL,
  reference VARCHAR(15),
  createdAt TIMESTAMPTZ NOT NULL,
  modifiedAt TIMESTAMPTZ NOT NULL
);
 
CREATE TABLE Item
(
  invoiceNumber VARCHAR(20) REFERENCES Invoice,
  _index INT,
  PRIMARY KEY (invoiceNumber, _index),
  product VARCHAR(100) NOT NULL,
  cost NUMERIC NOT NULL,
  quantity INT NOT NULL,
  taxGroup NUMERIC(4, 1) NOT NULL,
  discount NUMERIC(6, 2) NOT NULL
);
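On the client side, let's again assume matching POCOs; only the members used in the materialization example below are shown, the remaining columns are elided:

```csharp
using System;
using System.Collections.Generic;

// POCOs matching the Invoice and Item tables (only the members
// used in the materialization example; other columns elided).
class Invoice
{
    public string number;
    public DateTime dueDate;
    public List<Item> Items = new List<Item>();
}

class Item
{
    public string product;
    public decimal cost;
}
```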

To make things more interesting, we'll also investigate how performance compares if we use a type instead of a table for the items property. In that case we don't need a join, or two queries, to reconstruct the whole object.

So let's say that we want to read several invoices and their details. We would usually write something along the lines of:

SELECT * FROM Invoice
WHERE NUMBER IN ('invoice-1', 'invoice-2', ...)
 
SELECT * FROM Item 
WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...)

and if we wanted to simplify materialization we could add ordering:

SELECT * FROM Invoice
WHERE NUMBER IN ('invoice-1', 'invoice-2', ...)
ORDER BY NUMBER
 
SELECT * FROM Item
WHERE invoiceNumber IN ('invoice-1', 'invoice-2', ...)
ORDER BY invoiceNumber, _index

While this is slightly more taxing on the database, with a more complicated search it would be much easier to process the rows in order using the second version.

On the other hand, by combining records into one big object directly on the database, we can load it in a single query:

SELECT inv, (
  SELECT ARRAY_AGG(it ORDER BY it._index)
  FROM Item it
  WHERE it.invoiceNumber = inv.NUMBER) AS items
FROM Invoice inv
WHERE inv.NUMBER IN ('invoice-1', 'invoice-2', ...)

The above query actually returns two columns, but it could be changed to return only one column.

Materialization of such objects on the client for the first version would look like this:

IDataReader master = ...
IDataReader detail = ...
var memory = new Dictionary<string, Invoice>();
while (master.Read())
{
  var head = new Invoice { 
    number = master.GetString(0), 
    dueDate = master.GetDateTime(1), ... 
  };
  memory.Add(head.number, head);
}
while (detail.Read())
{
  var invoice = memory[detail.GetString(0)];
  var item = new Item { 
    product = detail.GetString(2),
    cost = detail.GetDecimal(3) ... 
  };
  invoice.Items.Add(item);
}

The Postgres native format would be materialized as in the first example, along the lines of:

IDataReader dr = ...
return PostgresDriverLibrary.Parse<Invoice>(dr.GetValue(0));

The Postgres response in the TEXT protocol starts to suffer from nesting and escaping, and would look something like:

(invoice-1,2015-03-16,"{""(invoice-1,1,""""product name"""",...)...}",...)

With each nesting layer, more and more space is spent on escaping. By developing parsers optimized for this specific Postgres TEXT response, we can still parse it very quickly.
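To see why space grows so quickly, note that each nesting level wraps an element in quotes and doubles every quote already present, so a literal quote roughly doubles in size per level. One level of this escaping can be sketched with a hypothetical helper:

```csharp
// One level of Postgres TEXT record escaping (sketch): wrap the
// element in quotes and double every quote it already contains.
static class TupleEscaping
{
    public static string EscapeOneLevel(string element)
    {
        return "\"" + element.Replace("\"", "\"\"") + "\"";
    }
}
```

Applying it repeatedly to a single `"` yields 4, 10, then 22 characters, so deeply nested values quickly become dominated by quote characters.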

Verdict: manual coding of SQL and materialization has become non-trivial. Joins introduce a noticeable performance difference. The manual approach is losing ground.


(DSL Platform DAL benchmark – parent/child)

Third use case, master-child-detail table access

Sometimes we have nesting two levels deep. Since Postgres has rich type support, this is something we can leverage. So, what would our object-oriented modeling approach look like if we had to store bank account data in a database?

CREATE TYPE Currency AS ENUM ('EUR','USD','Other');
 
CREATE TYPE TRANSACTION AS
(
  DATE DATE,
  description VARCHAR(200),
  currency Currency,
  amount NUMERIC(15,2)
);
 
CREATE TYPE Account AS
(
  balance NUMERIC(15,2),
  NUMBER VARCHAR(40),
  name VARCHAR(100),
  notes VARCHAR(800),
  transactions TRANSACTION[]
);
 
CREATE TABLE BankScrape
(
  id INT PRIMARY KEY,
  website VARCHAR(1024) NOT NULL,
  at TIMESTAMPTZ NOT NULL,
  info HSTORE NOT NULL,
  externalId VARCHAR(50),
  ranking INT NOT NULL,
  tags VARCHAR(10)[] NOT NULL,
  createdAt TIMESTAMPTZ NOT NULL,
  accounts Account[] NOT NULL
);

Our SQL queries and materialization code will look similar to before (although the complexity has increased drastically). The escaping issue is even worse than before, and while reading transactions we are mostly skipping escaped characters. Not to mention that, due to large object heap (LOH) issues, we can't just process one big string; it must be done with a TextReader.
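The streaming idea can be sketched as below. This is a hypothetical helper, not the actual Revenj code; it consumes one quoted element directly from a TextReader, so large responses are never materialized as one big string:

```csharp
using System.IO;
using System.Text;

// Sketch: read one quoted element from a TextReader without first
// building the entire response string (responses over ~85 KB would
// otherwise land on the large object heap). Assumes the opening quote
// was already consumed; a doubled "" is an escaped quote, a lone "
// ends the element.
static class StreamingParser
{
    public static string ReadQuoted(TextReader reader)
    {
        var sb = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (c == '"')
            {
                if (reader.Peek() == '"')
                {
                    sb.Append('"');
                    reader.Read(); // consume the second quote
                }
                else break; // closing quote
            }
            else sb.Append((char)c);
        }
        return sb.ToString();
    }
}
```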

Verdict: manual coding of SQL and materialization is really complex. Joins introduce a noticeable performance difference. The manual approach is not competitive in any test:


(DSL Platform DAL benchmark – parent/child/child)

Conclusions

  • Although we have only looked at simple reading scenarios here, insert/update performance is perhaps even more interesting.
  • The approach taken by Revenj and its backing compiler is not something that can realistically be reproduced by manual coding.
  • Postgres suffers from parsing complex tuples – but with smart optimizations that can still yield a net win. There are also a few "interesting" Postgres behaviors which required various workarounds.
  • It would be interesting to compare the BINARY and TEXT protocols on deeply nested aggregates.
  • JSON might have similar performance to the Postgres native format, but it's probably more taxing on Postgres.

5 thoughts on “Fast Postgres from .NET”

  1. Chris

    Thank you for researching this. I’ve been wondering about this myself. I was hoping you would include more of the .NET code you used for parsing the array_agg in the third scenario.

    Also, I’m curious on what the performance would be for wide tables, since there would be a lot more columns to handle with more escapes. Hopefully your next blog post will be about that 🙂

    One other thing that I’ve been thinking about since I first heard about Google ProtoBuf: is there a quick way to have Postgres return a parent-child-detail (e.g. your 3rd use case) in ProtoBuf format? That may help with the escapes; I just don’t know how performant it would be for the db server to do. It could, however, reduce the wire transfer bytes. The only issue is that the index would matter, but it could probably use what is defined in the table / custom type.

    Example ProtoBuf:

    SELECT inv, PROTOBUF(ARRAY_AGG(
    SELECT it FROM Item it
    WHERE it.invoiceNumber = inv.number
    ORDER BY it._index)) as items
    FROM Invoice inv
    WHERE inv.number IN (‘invoice-1’, ‘invoice-2’, …)

    -Chris

    1. Rikard Pavelic

      Hi Chris,

      Actually, the source is available at: https://github.com/ngs-doo/dal-benchmark
      The used framework is BSD licensed, so you pretty much can dig into how it’s done.

      When I find some time I’ll look into how tuples are transferred over binary protocol. If they don’t quote stuff, it might be somewhat faster.

      1. Chris

        Thanks.

        If tuples could be transferred over binary protocol without quotes/escapes that would be better. Assuming it can be efficiently translated on the web server (or client) that would be the best approach!

        Awesome stuff!!

    2. ppafford

      Off topic (kinda): I'm looking for an ASP.NET Identity Provider for PostgreSQL; since this covers both .NET and PostgreSQL I thought I would ask.
