Type Constraints in 65 lines of SQL

2023-02-17

10 minute read

PostgreSQL has a rich and extensible type system. Beyond enums and composite types, we can:

  • apply data validation rules
  • override comparison operators like = / + / -
  • create custom aggregations
  • define casting rules between types

With a little effort, a user-defined type can feel indistinguishable from a built-in. In this article we focus on validation and ergonomics while quickly touching on a few other concepts.

To illustrate, we’ll create an semver data type to represent Semantic Versioning values. We’ll then add validation rules to make invalid states unrepresentable.

SemVer

A (very) loose primer on SemVer:

SemVer is a specification for representing software versions that communicate information about backwards compatibility. The type is typically represented as a string with 5 components.

image

Where pre-release and metadata are optional.

The intent of each component is outside the scope of this article but, as an example, incrementing the major version number notifies users that the release includes at least one backwards incompatible change.

For a concise representation of the full spec, check out the grammar.

SQL

For our purposes, we’ll assume that the SemVer type is a critical component of the application that needs to be queried flexibly and efficiently.

Storing Components

To that end, we’ll store each component of the version as a separate field on a composite type.


_10
create type semver_components as (
_10
major int,
_10
minor int,
_10
patch int,
_10
pre_release text[],
_10
build_metadata text[]
_10
);

We can create an instance of this type in SQL by casting a tuple as the semver_components type.


_10
select
_10
(1, 2, 3, array['beta', '1'], array['meta'])::semver_components
_10
-- returns: (1,2,3,{'beta','1'},{'meta'})

Unfortunately, our definition is far too permissive.


_10
select
_10
(null, -500, null, array['?'], array[''])::semver_components
_10
-- returns: (,-500,,{'?'},{''

Our data type has no problem accepting invalid components. To list a few of the SemVer rules we violated:

  • Major version must not be null
  • Minor version must be ≥ 0
  • Patch version must not be null
  • Pre-release elements must only include characters [A-z0-9]
  • Build metadata elements may not be empty strings

We need to add some validation rules to meet our “make invalid states unrepresentable” goal.

Validation

Domains are Postgres’ solution for optionally layering constraints over a data type. Domains are to types what check constraints are to tables. If you’re not familiar with check constraints, you can think of them as equivalent to zod/pydantic in javascript/python.

Let's codify some SemVer rules, layer them on the semver_components type, and give the new domain a friendly name.


_18
create domain semver
_18
as semver_components
_18
check (
_18
-- major: non-null positive integer
_18
(value).major is not null and (value).major >= 0
_18
-- minor: non-null positive integer
_18
and (value).minor is not null and (value).minor >= 0
_18
-- patch: non-null positive integer
_18
and (value).patch is not null and (value).patch >= 0
_18
and semver_elements_match_regex(
_18
(value).pre_release,
_18
'^[A-z0-9]{1,255}$'
_18
)
_18
and semver_elements_match_regex(
_18
(value).build_metadata,
_18
'^[A-z0-9\.]{1,255}$'
_18
)
_18
);

which references a helper function:


_23
create or replace function semver_elements_match_regex(
_23
parts text[],
_23
regex text
_23
)
_23
returns bool
_23
language sql
_23
as $$
_23
-- validates that *parts* nullable array of non-empty strings
_23
-- where each element of *parts* matches *regex*
_23
select
_23
$1 is null
_23
or (
_23
(
_23
select (
_23
bool_and(pr_arr.elem is not null)
_23
and bool_and(pr_arr.elem ~ $2)
_23
)
_23
from
_23
unnest($1) pr_arr(elem)
_23
)
_23
and array_length($1, 1) > 0
_23
)
_23
$$;

Now, if we repeat our positive and negative test cases using the semver type (vs semver_components) we still accept valid states:


_10
-- Success Case
_10
select
_10
(1, 2, 3, array['beta', '1'], array['meta'])::semver,
_10
-- returns: (1,2,3,{'beta','1'},{'meta'})

while invalid states are rejected with an error:


_10
-- Failure Case
_10
select
_10
(null, -500, null, array['?'], array[''])::semver
_10
-- ERROR: value for domain semver violates check constraint "semver_check"
_10
-- SQL state: 23514

Testing

Our validation doesn’t have to be called manually. The semver domain can be used anywhere you’d use the semver_components type and the validations are automatically applied.


_19
-- A table with a semver column
_19
create table package_version(
_19
id bigserial primary key,
_19
package_name text not null,
_19
package_semver semver not null -- semver column
_19
);
_19
_19
-- Insert some valid records
_19
insert into package_version( package_name, package_semver )
_19
values
_19
('supabase-js', (2, 2, 3, null, null)),
_19
('supabase-js', (2, 0, 0, array['rc', '1'], null)
_19
);
_19
_19
-- Attempt to insert an invalid record (major is null)
_19
insert into package_version( package_name, package_semver )
_19
values
_19
('invalid-js', (null, 1, 0, array['asdf'], null));
_19
-- ERROR: value for domain semver violates check constraint "semver_check"

Good stuff!

We’re 48 lines of SQL in and have solved for making invalid states unrepresentable. Now lets think about ergonomics.

Displaying

Now that our data type is well constrained, you might notice that selecting values from a semver typed column returns a tuple, rather than the SemVer string we’re used to seeing.


_10
select
_10
*
_10
from
_10
package_version
_10
/*
_10
id | package_name | package_semver
_10
-------------------------------------
_10
1 | supabase-js | (2,2,3,,)
_10
2 | supabase-js | (2,0,0,"{rc,1}",)
_10
*/

For example: (2,0,0,"{rc,1}",) vs 2.0.0-rc.1

We could work around that problem with some custom casts, but I’d recommend keeping everything explicit with a function call.


_16
create or replace function semver_to_text(semver)
_16
returns text
_16
immutable
_16
language sql
_16
as $$
_16
select
_16
format('%s.%s.%s', $1.major, $1.minor, $1.patch)
_16
|| case
_16
when $1.pre_release is null then ''
_16
else format('-%s', array_to_string($1.pre_release, '.'))
_16
end
_16
|| case
_16
when $1.build_metadata is null then ''
_16
else format('+%s', array_to_string($1.build_metadata, '.'))
_16
end
_16
$$;

Which allows us to query the package_version table and retrieve a string representation of the data.


_12
select
_12
id,
_12
package_name,
_12
semver_to_text(package_semver) as ver -- cast as text
_12
from
_12
package_version
_12
/*
_12
id | package_name | ver
_12
------------------------------
_12
1 | supabase-js | 2.2.3
_12
2 | supabase-js | 2.0.0-rc.1
_12
*/

Or, better yet, use a generated column


_10
create table package_version(
_10
id bigserial primary key,
_10
package_name text not null,
_10
package_semver semver not null,
_10
semver_text text generated always as (semver_to_text(package_semver)) stored
_10
);

so the text representation is persisted along with the semver type and incurs no query/filter penalty.

Other Tricks

Postgres provides all the tools you could want to make your data types/domains work with SQL as seamlessly as builtins.

For example, you could:

to name a few.

Aligning the right parts of your business’ logic with the database can dramatically improve throughput, decrease IO, and simplify application code.

Conclusion

Admittedly, building performant and ergonomic custom data types in Postgres involves a lot of ceremony.

That said, in cases where:

  • the type’s data integrity is critical
  • the type is well specified
  • the type’s spec does not change (or changes infrequently)

Teaching Postgres to have first class support for your custom type can be transformative for data integrity and performance.

Share this article

Build in a weekend, scale to millions