JSON is an example of a simple data format whose model is embedded within the data.
While it’s very popular as an interchange format, and for good reasons, it’s a poor fit as a primary data storage format: it’s wasteful and has limited type support.
For a model with a mostly fixed structure, JSON only makes sense for small dynamic parts of that model.
Postgres stores data structure definition separately from data.
In SQL, Data Definition Language is used to define explicit data structures (which can have dynamic parts in them – like JSON).
Let’s review a few use cases for evolving the data structure:
- adding/removing fields in new objects
- changing the type of existing fields
- renaming existing fields
In JSON, adding new fields is as simple as it gets.
Just add a new field and it will be embedded as a name:data pair.
When working with SQL, more overhead is required: we have to declare the field along with its data type using DDL.
This is done using ALTER statements.
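As a minimal sketch (the table and column names here are made up for illustration), such an ALTER statement looks like:

```sql
-- hypothetical table: add a new, nullable column to an existing model
ALTER TABLE customer ADD COLUMN middle_name VARCHAR;
```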
Since most model changes during early development consist of adding new fields, it seems that the approach with less formalism should be preferred.
Early in development, deep model insights will happen often.
They result in renaming of existing fields, changing types and removing unnecessary information.
While in the early development stage, old data is mostly non-existent, which means that ignoring it is a valid option.
Once data starts getting used, the rules change: the application must cope with data changes.
The standard approach in JSON-powered and many event sourcing solutions is to embed version information with the data and redirect processing to the correct version handler. If required, old data is converted to the new schema during a maintenance window.
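Assuming a hypothetical `events` table with a `jsonb` payload (both the table and the `upgrade_v1_to_v2` handler function are invented for illustration), such version dispatch might be sketched as:

```sql
-- each document embeds its own schema version and is upgraded on read
SELECT id,
       CASE payload->>'version'
         WHEN '1' THEN upgrade_v1_to_v2(payload) -- hypothetical handler function
         ELSE payload
       END AS payload
FROM events;
```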
In SQL there are a lot of options for handling such changes, but most of them require using SQL as more than just a dumb data store:
- backward compatible views
- multiple stored procedures for data persistence
In heterogeneous environments both approaches are equally complicated and beyond the scope of this article.
But if we consider only homogeneous environment where there is only one application consuming the data and it needs to be upgraded, along with model changes, what is the difference for non-trivial data sets?
1) Adding a new optional field is the same in both approaches
In Postgres only the table metadata is changed and no table rewrite happens, as long as the new field is nullable.
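For example (the column name is assumed for illustration):

```sql
-- adding a nullable column only touches the catalog; existing rows are untouched
ALTER TABLE foo ADD COLUMN column3 INT;
```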
Old code should continue to work, as long as database best practices were followed:
```sql
INSERT INTO foo(column1, column2) VALUES(:v1, :v2);
SELECT t.column1, t.column2 FROM foo t;
-- instead of
INSERT INTO foo VALUES(:v1, :v2);
SELECT * FROM foo;
```
2) Removing a field is better with Postgres
Since Postgres has a unique MVCC architecture,

```sql
ALTER TABLE foo DROP COLUMN column2;
```

doesn’t require a table rewrite; the column is simply marked as dropped in the metadata.
3) Changing the type of an existing field is slightly better with Postgres
Since data is stored compactly and there is a fixed schema with metadata defined outside the data, there is more room for optimization and less processing power is required.
Both approaches can use various tricks to reduce their downtime (such as a helper table which is swapped with the original at the end).
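One such trick, sketched here with assumed table and column names, is the helper-table swap mentioned above:

```sql
-- hypothetical minimal-downtime type change via a helper table
CREATE TABLE foo_new (LIKE foo INCLUDING ALL);
ALTER TABLE foo_new ALTER COLUMN column2 TYPE NUMERIC;
INSERT INTO foo_new SELECT * FROM foo;  -- values are cast into the new type
BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
COMMIT;
```

In a real migration, writes arriving during the copy would also have to be captured and replayed before the swap.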
4) Renaming an existing field is vastly better with the SQL approach
Since data is separated from the metadata, the database can rename a field without much hassle, while a JSON-like approach must rewrite all affected data.
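In Postgres, for instance:

```sql
-- a rename only updates the catalog entry; no data is rewritten
ALTER TABLE foo RENAME COLUMN column2 TO amount;
```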
While some of these points are unique to Postgres (dropping an existing column), the important point is the separation of data and metadata.
If this is the case, why isn’t Postgres recognized as a better solution?
Because, as most technologies do, it has its own set of “problems”.
Postgres is very strict about its data types, dependencies and invalidation.
While this is a very good thing from a data safety perspective, in practice it means that managing Postgres changes requires a lot more work when views are involved.
When you create a view on a table, you can’t run an ALTER TYPE or DROP COLUMN without recreating the view. If another view or stored procedure depends on that view, you’ll have to recreate a lot of objects. This is hard to set up correctly:
```sql
CREATE TABLE foo (i INT, j INT);
CREATE VIEW foo_view AS SELECT * FROM foo;
ALTER TABLE foo ADD z FLOAT;        -- works, but the view will still have only i and j
ALTER TABLE foo DROP COLUMN j;      -- doesn't work, since this column is used in a view
ALTER TABLE foo ALTER j TYPE FLOAT; -- doesn't work, since the column is used in a view
```
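One way out, at the cost of extra ceremony, is to drop and recreate the dependent objects around the change:

```sql
DROP VIEW foo_view;                  -- must go first, since it depends on foo.j
ALTER TABLE foo ALTER j TYPE FLOAT;  -- now allowed
CREATE VIEW foo_view AS SELECT * FROM foo;
```

With long dependency chains, every object downstream of the view has to be recreated in the right order, which is exactly the maintenance burden described here.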
If you are using a type in a table (and without that, a type is almost useless), you are restricted in the changes you can make to it. Sometimes the only solution is to add a new attribute to the type, copy the old data and remove the old attribute:
```sql
CREATE TYPE complex AS (i FLOAT, j FLOAT);
CREATE TABLE numbers(f FLOAT, c complex);
ALTER TYPE complex ADD ATTRIBUTE z FLOAT;
-- works and is visible in column c, but now you have to be careful about your
-- database dumps, since plain SQL dumps are missing the metadata about the type
ALTER TYPE complex ALTER ATTRIBUTE i TYPE INT;
-- doesn't work, since the type is used in a column;
-- instead we have to manage it by hand:
ALTER TYPE complex ADD ATTRIBUTE i_tmp INT;
UPDATE numbers SET c.i_tmp = (c).i;
ALTER TYPE complex DROP ATTRIBUTE i;
ALTER TYPE complex RENAME ATTRIBUTE i_tmp TO i;
```
It becomes obvious why managing such changes by hand is not a viable solution: it requires a lot of maintenance.
But what if there existed a unicorn which could do all these things automatically for us, so that we could have the data safety of a database and the ease of development/maintenance?
Well, you are reading the blog of such a unicorn.
DSL platform does this behind the scenes, and we are very proud of how it works in practice.