Embedding model with data?

JSON is an example of a simple data structure that has its model embedded with the data.
While it’s very popular as an interchange format for good reasons, it’s a bad fit as a primary data storage format: it’s wasteful and has limited type support.
For a model with a mostly fixed structure, JSON only makes sense for the small dynamic parts of that model.

Postgres stores the data structure definition separately from the data.
In SQL, the Data Definition Language (DDL) is used to define explicit data structures (which can have dynamic parts in them – like JSON).

    Let’s review a few use cases for evolving the data structure:

  • adding/removing fields in new objects
  • changing the type of existing fields
  • renaming of existing fields

In JSON, adding new fields is as simple as it gets.
Just add a new field and it will be embedded as a name:value pair.
When working with SQL more overhead is required: we have to declare the new field along with its data type using DDL.
This is done using ALTER statements.
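
For example (table and column names here are hypothetical), where a JSON document simply gains a new name:value pair, the equivalent change in SQL is an explicit statement:

-- JSON: just start writing {"name": "foo", "price": 10} instead of {"name": "foo"}
-- SQL: the new column must be declared first
ALTER TABLE product ADD COLUMN price NUMERIC;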

Since most model changes during early development consist of adding new fields, it seems that the approach which favors less formalism should be preferred.
Early in development, deep model insights happen often.
They result in renaming existing fields, changing types and removing unnecessary information.
In the early development stage old data is mostly non-existent, which means that ignoring it is a valid option.

When data starts getting used, the rules change. The application must cope with data changes.

The standard approach in JSON-powered and many event sourcing solutions is to embed version information with the data and redirect processing to the correct version handler. If required, old data is converted to the new schema during a maintenance window.

    In SQL there are a lot of options for handling such changes, but most of them require the use of SQL as more than just a dumb data store:

  • backward compatible views (sketched below)
  • multiple stored procedures for data persistence
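
As a minimal sketch of the first option (table and column names are made up), the old column layout can be kept alive through a view while the underlying table evolves:

-- old readers keep selecting from customer_v1 and never see the new column
CREATE VIEW customer_v1 AS SELECT id, first_name, last_name FROM customer;
ALTER TABLE customer ADD COLUMN middle_name VARCHAR;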

In heterogeneous environments both approaches are equally complicated and beyond the scope of this article.
But if we consider only a homogeneous environment, where a single application consumes the data and needs to be upgraded along with the model changes, what is the difference for non-trivial data sets?

1) Adding a new optional field is the same for both approaches
In Postgres only the table metadata is changed and no table rewrite happens, as long as the field being added is nullable.
Old code should work as expected, as long as best database practices were followed:

INSERT INTO foo(column1, column2) VALUES(:v1, :v2);
SELECT t.column1, t.column2 FROM foo t;
 
-- instead of
 
INSERT INTO foo VALUES(:v1, :v2);
SELECT * FROM foo;

2) Removing a field is better with Postgres
Since Postgres keeps column definitions as metadata in its system catalog, …

ALTER TABLE foo DROP COLUMN column2;

… doesn’t require a table rewrite. The column is simply marked as dropped in the metadata.

3) Changing the type of an existing field is slightly better with Postgres
Since the data is stored optimally and there is a fixed schema with metadata defined outside the data, there is more room for optimization and less processing power is required.
Both approaches can use various tricks to reduce their downtime (such as a helper table which is swapped with the original at the end).
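
For example, assuming column1 in the earlier foo table is an INT, widening it is a single statement (it still rewrites the table and holds a lock for the duration):

ALTER TABLE foo ALTER COLUMN column1 TYPE BIGINT USING column1::BIGINT;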

4) Renaming an existing field is vastly better with the SQL approach
Since the data is separated from the metadata, databases can rename a field without much hassle, while a JSON-like approach must rewrite all affected data.
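
In Postgres this is a metadata-only change, e.g. (again with a hypothetical column):

ALTER TABLE foo RENAME COLUMN column1 TO amount; -- no data is rewritten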

While some of these points are unique to Postgres (dropping an existing column without a rewrite), the important point is the separation of data and metadata.

If this is the case, why isn’t Postgres recognized as a better solution?
Because, as most technologies do, it has its own set of “problems”.
Postgres is very strict about its data types, dependencies and invalidation.
While this is a very good thing from a data safety perspective, in practice it means that managing Postgres changes requires a lot more work when views are involved.
When you create a view on a table, you can’t do an ALTER TYPE or DROP COLUMN without recreating the view. If another view or stored procedure depends on that view, you’ll have to recreate a lot of objects. This is hard to set up correctly:

CREATE TABLE foo (i INT, j INT);
CREATE VIEW foo_view AS SELECT * FROM foo;
ALTER TABLE foo ADD z FLOAT; -- works, but view will still have only i and j
ALTER TABLE foo DROP COLUMN j; -- doesn't work since this column is used in a view
ALTER TABLE foo ALTER j TYPE FLOAT; -- doesn't work since column is used in a view
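
The usual workaround is to drop and recreate the dependent view around the change (a sketch, assuming nothing else depends on foo_view):

DROP VIEW foo_view;
ALTER TABLE foo ALTER j TYPE FLOAT; -- now succeeds
CREATE VIEW foo_view AS SELECT * FROM foo;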

If you are using a type in a table (and without that it’s almost useless), you are restricted in the changes you can make to it. Sometimes the only solution is to add a new attribute to the type, copy the old data and remove the old attribute.

CREATE TYPE complex AS (i FLOAT, j FLOAT);
CREATE TABLE numbers(f FLOAT, c complex);
ALTER TYPE complex ADD ATTRIBUTE z FLOAT; -- works and is visible in column c, but now you have to
-- be careful about your database dumps, since in plain SQL dumps the metadata about the type is missing
ALTER TYPE complex ALTER ATTRIBUTE i TYPE INT; -- doesn't work, since the type is used in a column
-- instead we have to manage it by hand
ALTER TYPE complex ADD ATTRIBUTE i_tmp INT;
UPDATE numbers SET c.i_tmp = (c).i;
ALTER TYPE complex DROP ATTRIBUTE i;
ALTER TYPE complex RENAME ATTRIBUTE i_tmp TO i;

It becomes obvious why such an approach is often not considered viable: it requires a lot of maintenance.
But what if there existed a unicorn which could do all these things automatically for us, so that we could have the data safety of a database and the ease of development/maintenance?

Well, you are reading the blog of such a unicorn.
The DSL platform does this behind the scenes and we are very proud of how it works in practice.

Postgres – bridge between worlds

    NoSQL databases promise to solve:

  • scaling
  • easier database interaction

Scaling often means automatic sharding, replication and multiple write points.
Easier database interaction usually means schema-less data, simpler migrations and removal of object-relational impedance mismatch.

Postgres is an ORDBMS, not just a plain relational database, which means that many of the points NoSQL proponents raise about relational databases don’t apply to it.
In practice this means that Postgres can have the best of both worlds.

While the “joins are slow” argument probably originates from people not too familiar with databases and relational theory, let’s look at the real world scenario people usually have in mind.
If our model looks something like

post {
  string name
  string[] tags
}

in a relational database we must implement tags in a separate table and do a join to reconstruct the post. In Postgres you can do it in a more application-developer-friendly way using the varchar[] data type, i.e.:

CREATE TABLE post (name VARCHAR, tags VARCHAR[])

and if we wish to look up posts by tags, Postgres supports indexes optimized for arrays.
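
For example, a GIN index on the tags column makes containment queries efficient (the index name is arbitrary):

CREATE INDEX post_tags_idx ON post USING GIN (tags);
SELECT name FROM post WHERE tags @> ARRAY['postgres']::VARCHAR[];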

If we don’t really know in advance what will be in our data and need something like XML, key-value or JSON, Postgres with its advanced type support doesn’t fall short.
XML, HSTORE and even JSON (while not really useful at the moment) are available.
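
A small HSTORE sketch (assuming the hstore contrib module is installed; the table and keys are made up):

CREATE TABLE prefs (id INT PRIMARY KEY, attrs HSTORE);
INSERT INTO prefs VALUES (1, 'theme => dark, lang => en');
SELECT attrs -> 'theme' FROM prefs WHERE id = 1; -- returns 'dark'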

What about some non-trivial schemas, for example:

actor {
  string name
  date born
  movie[] playedIn
  address[] knownAddresses
}

address {
  string country
  string town
  string street
}

movie {
  string name
  date released
  string[] categories
}

where both actor and movie are objects with identity.
Postgres has support for both types and tables, where each table also defines a type.
When we don’t need identity we can use composite types:

CREATE TYPE address AS (country VARCHAR, town VARCHAR, street VARCHAR)

When we need identity we will use a table with a primary key (for this example let’s use name to keep it simple):

CREATE TABLE actor (
  name VARCHAR PRIMARY KEY,
  born DATE,
  knownAddresses address[]
);
CREATE TABLE movie (
  name VARCHAR PRIMARY KEY,
  released DATE,
  categories VARCHAR[]
);
CREATE TABLE actor_playedIn (
  actor_name VARCHAR REFERENCES actor,
  movie_name VARCHAR REFERENCES movie,
  PRIMARY KEY(actor_name, movie_name)
);

Finally, we need to create this complex object (using joins, of course):

CREATE VIEW actor_document AS
SELECT a.name, a.born, played.playedIn, a.knownAddresses
FROM actor a
LEFT JOIN
(
  SELECT p.actor_name, array_agg(m) AS playedIn
  FROM actor_playedIn p
  JOIN movie m ON m.name = p.movie_name
  GROUP BY p.actor_name
) played ON played.actor_name = a.name

So this document (aggregate root) can be used as a key -> value pair, where the actor’s primary key is the key and the whole object is the value.
This makes it cache friendly, so we can store it in Memcached and bypass the database on requests for an actor by its id.
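
Fetching the value for the cache can then be as simple as selecting the whole row by its key (using the same parameter style as the earlier snippets):

SELECT d FROM actor_document d WHERE d.name = :name;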

In the next post we will analyze the easier migration argument and common misconceptions around it.