The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.

Thursday, September 06, 2007

Managing Trees with Arrays

One of my favorite problems in databases is working with recursive structures. In particular, I've made it my quest to debunk the prevailing myth that recursive structures are not 'relational' and thus should be handled in application code or some alternative format. One strategy in this war is demonstrating how easy it is to deal with recursive structures with a little thought and some clever queries. What follows is a demonstration of how one might use arrays of integers in PostgreSQL in a variation of the classic 'materialized path' method of organizing data in trees, along with some scaffolding to reduce the complexity from the point of view of the client application. This method is efficient, scalable, and useful for many distributions of data. Since trees usually store heterogeneous elements, we will make sure to allow for that in the examples.

-- 8.3 brings enum support. in 8.2, we use a foreign key or a check constraint
create type item_type as enum('root', 'foo', 'bar');

-- the classic way
create table item
(
item_id serial primary key,
parent_id int references item(item_id) on delete cascade,
item_type item_type
);
create index item_parent_id_idx on item(parent_id);

insert into item values
(default, null, 'root'),
(default, 1, 'foo'),
(default, 1, 'foo'),
(default, 3, 'bar'); -- etc

While this is very elegant and expressive, the major problem is that pulling any data out of the system based on hierarchy requires recursion, either in the application or in a stored procedure framework. This scales poorly for large sets, especially with deep structures. Let's see how things might look using arrays to materialize the path to the parent:


-- arrays ftw
create table item
(
item_id serial primary key,
parents int[], -- one minor disadvantage is losing the RI constraint
item_type item_type
);

create index item_parents_idx on item(parents);

insert into item values
(default, '{1}', 'root'),
(default, '{1, 2}', 'foo'),
(default, '{1, 3}', 'foo'),
(default, '{1, 3, 4}', 'bar');

Here, we materialize the path (series of parents) to the item in an array, including the item's own id. Looking up the family for the 'bar' element (id #4) is done by:

select p.* from item p
join
(
select explode_array(parents) as item_id -- explode_array is not built in: a user-defined function returning one row per array element
from item where item_id = 4
) q using (item_id);
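
To make the mechanics of that join concrete, here is a minimal sketch in Python rather than SQL. It models the item table as a dictionary and mimics what happens when the exploded array is joined back to item. The names ITEMS and family are invented for this illustration and are not part of the schema above.

```python
# Toy model of the item table: item_id -> (parents path, item_type).
# ITEMS and family() are hypothetical names used only for this sketch.
ITEMS = {
    1: ([1], "root"),
    2: ([1, 2], "foo"),
    3: ([1, 3], "foo"),
    4: ([1, 3, 4], "bar"),
}


def family(item_id):
    """Return (item_id, item_type) for every element on the path to item_id."""
    path, _ = ITEMS[item_id]
    # "Exploding" the array yields one id per row; each id is then looked up
    # by primary key, which is what the join back to item does.
    return [(ancestor, ITEMS[ancestor][1]) for ancestor in path]


print(family(4))
```

The point of the sketch is that no recursion is needed: one lookup fetches the path, and every ancestor is then a direct key lookup.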

Let's go ahead and generalize this into a view:

create view item_tree as
select q.item_lookup, p.* from item p
join
(
select item_id as item_lookup,
explode_array(parents) as item_id
from item
) q using (item_id);

We can use this to specifically look at item #4 by:

select item_id, item_type from item_tree where item_lookup = 4;
item_id | item_type
---------+-----------
1 | root
3 | foo
4 | bar

That's pretty nifty. The above query will use the index on item_id for the id lookup as well as when it goes back to item to fetch the elements. Let's expand our view to include child items of the item of interest, and add a flag that can be queried in case that's desirable (this is where the array is particularly helpful):

create view item_tree as
select q.item_lookup, is_child, p.* from item p
join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);
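
The range condition in the second branch of the union is the heart of the trick, so here is a small Python model of it. This is a sketch, assuming PostgreSQL orders one-dimensional integer arrays element by element with a shorter array sorting before a longer one that shares its prefix, which is how Python compares lists. It also assumes ids are positive, so appending 0 gives a lower bound. PARENTS and descendants are hypothetical names.

```python
INT_MAX = 2147483647  # largest value of a 4-byte PostgreSQL int

# Toy copy of the parents column, keyed by item_id (hypothetical data model).
PARENTS = {
    1: [1],
    2: [1, 2],
    3: [1, 3],
    4: [1, 3, 4],
}


def descendants(item_id):
    """Ids whose path sorts between (path || 0) and (path || INT_MAX)."""
    path = PARENTS[item_id]
    low, high = path + [0], path + [INT_MAX]
    # Python lists compare element by element, and a shorter list sorts
    # before a longer list that starts with the same elements.
    return sorted(i for i, p in PARENTS.items() if low <= p <= high)


print(descendants(1))
print(descendants(3))
```

Because the item's own path is shorter than path || 0, it sorts below the range and is excluded, so only true descendants match. Expressing the test as a single range is also what lets a btree index on parents serve the lookup.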

The above query creates a remarkably efficient plan, and PostgreSQL is smart enough to optimize the case where you are only interested in child or non-child items. The array term properly utilizes the index when necessary. Well, how do we add elements to the structure? While arrays are neat, dealing with them on the client can be a pain (all those strings) and it would be nice to not have to construct them on the fly. Let's expand our view to do this for us:

create view item_tree as
select
q.item_lookup,
p.parents[array_upper(p.parents, 1) - 1] as parent_id, -- null on zero reference
is_child,
p.* from item p

join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

create or replace rule insert_item_tree as on insert to item_tree do instead
insert into item (item_id, parents, item_type)
select
coalesce(new.item_id, nextval('item_item_id_seq')),
(select i.parents || coalesce(new.item_id, currval('item_item_id_seq'))::int from item i where item_id = new.parent_id),
new.item_type;

insert into item_tree(item_id, parent_id, item_type) values (null, 4, 'bar'); -- we can pass in null to get the sequence to assign a value
insert into item_tree(item_id, parent_id, item_type) values (null, 5, 'foo'); -- other columns of view are ignored

select * from item_tree where item_lookup = 3;
item_lookup | parent_id | is_child | item_id | parents | item_type
-------------+-----------+----------+---------+-------------+-----------
3 | | f | 1 | {1} | root
3 | 1 | f | 3 | {1,3} | foo
3 | 3 | t | 4 | {1,3,4} | bar
3 | 4 | t | 5 | {1,3,4,5} | bar
3 | 5 | t | 6 | {1,3,4,5,6} | foo
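
What the rule does to the path can be sketched in a few lines of Python: the stored path is the parent's path with the new item's id appended, and parent_id is just the second-to-last element of that path. This is only a model with hypothetical names (items, insert_item, parent_of). In particular, max(items) + 1 stands in for the sequence, and the sketch raises KeyError for an unknown parent rather than trying to reproduce what the rule does in that case.

```python
# Toy model of the insert rule. All names here are hypothetical.
items = {
    1: ([1], "root"),
    2: ([1, 2], "foo"),
    3: ([1, 3], "foo"),
    4: ([1, 3, 4], "bar"),
}


def insert_item(items, parent_id, item_type, item_id=None):
    """Add a child of parent_id; item_id=None mimics passing null to the view."""
    if item_id is None:
        item_id = max(items) + 1        # stand-in for nextval() on the sequence
    parent_path, _ = items[parent_id]   # KeyError if the parent does not exist
    items[item_id] = (parent_path + [item_id], item_type)
    return item_id


def parent_of(items, item_id):
    """Second-to-last path element, or None for the root."""
    path, _ = items[item_id]
    return path[-2] if len(path) > 1 else None


first = insert_item(items, 4, "bar")       # child of item 4
second = insert_item(items, first, "foo")  # child of the item just added
print(items[second], parent_of(items, second))
```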


This is a functional example that could be used to build real applications. Many portions are left as an exercise to the reader, including performance testing (it's pretty good), extending the base item table into properties and specific subtables, a more robust constraint system, and better error handling. Lately I am coming to the perspective that it is better to try to preserve an SQL-ish interface to application-facing structures, as opposed to providing an API to build recursive structures that the application must interact with. While this is a bit more difficult and involves some hacky things, the complexity is neatly tucked away and the user is free to build recursive structures using familiar paradigms (insert, select, etc).

merlin

Tuesday, September 04, 2007

PostgreSQL 8.3 Features: Plan Invalidation

As previously stated, PostgreSQL 8.3 is shaping up to be a great release of the software. Among the biggest beneficiaries of the new feature set are pl/pgsql developers. 8.3 stands to be the best thing to happen to pl/pgsql since 8.0 brought dollar quoting to the table...before which some might argue serious development with the language bordered on the impractical. The major new features are vastly improved cursor handling (including UPDATE/DELETE), improved handling for set returning functions, arrays of composite types (for efficient passing of data to/from functions), and, especially, automatic invalidation of cached plans. While plan invalidation solves many tangential issues, its greatest impact will certainly be in server side development of stored procedures.

When a pl/pgsql function executes for the first time in a session, the server 'compiles' it by parsing the static (not passed through EXECUTE) sql and generating plans for all the queries. This is an essential mechanism for fast repeated execution of server side functions because it allows many tedious, cpu intensive portions of query execution to be optimized. One part of this optimization involves looking up various database objects involved in the query and storing identifiers that are internal to the database. While this is useful and good, it has an unfortunate side effect: if the structures referenced in the plan are no longer valid, the function plan itself is no longer valid and may raise errors if executed. There are various workarounds for this, which plan invalidation makes mostly obsolete (plan invalidation doesn't catch references to functions yet).

Here is the canonical example (and, arguably, the justification for the effort) demonstrating how plan invalidation works:

create or replace function test_func() returns void as
$$
begin
create temp table tt as select 'hello world!'::text;
perform * from tt;
drop table tt;
end;
$$ language plpgsql;

select test_func();
select test_func();

ERROR: relation with OID 87831 does not exist

The first invocation of test_func() generates a plan for the function. The second time around, the plan is pointing to a stale reference to 'tt' and the function fails. In PostgreSQL 8.3, this function succeeds without error. When tt is dropped, the database determines that there are plans referencing it and throws them out. Historically, this was a huge impediment to pl/pgsql development because there was no particularly safe way to create and drop temporary tables from within a function. It is natural to want to create a table for temporary storage, do all kinds of things to it, and release it when done -- but sadly this was not generally possible without various workarounds. For this reason, historically it was often better to use cursors inside pl/pgsql functions for holding working data, which pushes code into a more procedural style. Now there is room for more set oriented style data processing, which should be an attractive alternative to any dba.

Monday, September 03, 2007

PostgreSQL 8.3 Features: Arrays of Compound Types

This will be the first in what will hopefully be a series of entries describing the new and interesting features in PostgreSQL 8.3. The database is entering the final stages of the development cycle before going into beta, with all the major patches having been submitted weeks ago (with one possible stunning exception, lazy xid assignment). Thus, aside from the hard decisions that are going to have to be made on the HOT patch (more on that later), 8.3 is shaping up nicely. On a scale of one to ten, this release is a strong eight -- and if all the outstanding patches get in -- a vicious '10'. Excited yet? If not, let me prime the pump with my favorite new feature, Arrays of Composite Types.

David Fetter laid the groundwork for this new feature, which was ultimately accepted with help from many others, including PostgreSQL core developers Tom Lane and Andrew Dunstan. The feature combines two neat features of PostgreSQL: the ability to build complex types out of Plain Old Data Types, and arrays.

create table foo(id int, b text);
create table bar(id int, foos foo[]); -- create an array of the type 'foo' in bar
create table baz(id int, bars bar[]); -- create an array of the type 'bar' in baz

insert into foo values(1, 'abc');
insert into foo values(2, 'def');
insert into bar values(1, array(select foo from foo)); -- there are two elements in this array
insert into bar values(2, array(select foo from foo)); -- and this
insert into bar values(3, array(select foo from foo)); -- and this
insert into baz values(1, array(select bar from bar)); -- three elements in here

select ((bars[1]).foos[2]).b from baz; -- use parentheses to get elements out of composites
postgres=# select ((bars[1]).foos[2]).b from baz;
b
-----
def
(1 row)

Being able to nest composite types in array structures is a powerful feature. While abusive in the extreme to good normalization and relational principles, it allows flexible approaches to some difficult problems:

  • Input Data to Functions: sql and pl/pgsql functions have some limitations in how you can get data into the function. There are various workarounds for this that all have application in specific circumstances, but composite arrays allow you to pass a heap of data to a function that doesn't happen to be in a table. While this could have scalability issues for large arrays (> 10k elements), it can be convenient.
  • Sometimes, Arrays are Just Better. Although this is rarely a good idea, sometimes it's beneficial to store data in arrays, and the ability to nest complex structures is helpful in this regard. Just remember that in most cases the entire array has to be read from and written back to disk (the server only understands the array as a field and treats it as such).
  • Targeted Denormalization: When passing data between functions or to the client, it may be efficient to accumulate a particular column into an array to reduce duplication of data in the other columns. We may now do this with composites:


drop table bar, baz cascade;       -- we would rarely rely directly on the generated types
alter table foo add column data text default 'data';
insert into foo values (1, 'ghi'); -- create some duplicate ids in foo
insert into foo values (2, 'jlk');
create type foo_data as (b text, data text);
select id, array_accum ((b, data)::foo_data) from foo group by id order by id;

id | array_accum
----+-----------------------------
1 | {"(abc,data)","(ghi,data)"}
2 | {"(def,data)","(jlk,data)"}
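
The shape of that result is easy to model outside the database. The following Python sketch groups the same sample rows by id and collects the remaining columns into a list per group, which is the idea behind accumulating a composite into an array. The names rows and accumulate are made up for the illustration.

```python
from collections import defaultdict

# Rows of foo as (id, b, data); mirrors the sample data used in this post.
rows = [
    (1, "abc", "data"),
    (2, "def", "data"),
    (1, "ghi", "data"),
    (2, "jlk", "data"),
]


def accumulate(rows):
    """Group by id and collect the (b, data) pairs for each id, ordered by id."""
    grouped = defaultdict(list)
    for id_, b, data in rows:
        grouped[id_].append((b, data))
    return sorted(grouped.items())


for id_, pairs in accumulate(rows):
    print(id_, pairs)
```

Each id now appears once, with its varying columns packed alongside it instead of repeated on separate rows.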


Arrays perform best as a server side feature. Because it is currently only recommended to render them to the client as text, Arrays of Composites introduce some interesting parsing issues going back and forth from the server. While this is ok in limited cases like the above example, it's inefficient and error prone to build and properly escape a nested array structure. While PostgreSQL has the ability to transfer arrays in binary over the protocol, there is no API to access the custom format data on the client side. Until that changes, I expect this feature will be most useful to stored procedure developers.

While relatively short on substance, I hope that this writing provides a good introduction to this interesting feature. Some of you will probably seize on this feature for its utility in solving a narrow but important class of problems. It's one more weapon in the mighty arsenal of the PostgreSQL developer who is already armed with the most practical, versatile, and generally kick-ass piece of software for solving the world's data management issues.