The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.

Thursday, September 06, 2007

Managing Trees with Arrays

One of my favorite problems in databases is working with recursive structures. In particular, I've made it my quest to debunk the prevailing myth that recursive structures are not 'relational' and thus should be handled in application code or some alternative format. One strategy in this war is demonstrating how easy it is to deal with recursive structures with a little thought and some clever queries. What follows is a demonstration of how one might use the arrays of integers in PostgreSQL in a variation of the classic 'materialized path' method of organizing data in trees, along with some scaffolding to reduce the complexity from the point of view of the client application. This method is efficient, scalable, and useful for many distributions of data. Since trees usually store heterogeneous elements, we will make sure to allow for that in the examples.

-- 8.3 brings enum support. in 8.2, we use a foreign key or a check constraint
create type item_type as enum('root', 'foo', 'bar');

-- the classic way
create table item
(
item_id serial primary key,
parent_id int references item(item_id) on delete cascade
item_type item_type
);
create index item_parent_id_idx on item(parent_id);

insert into item values
(default, null, 'root'),
(default, 1, 'foo'),
(default, 1, 'foo'),
(default, 3, 'bar'); -- etc

While this is very elegant and expressive, the major problem is that to pull any data out of the system based on hierarchy requires recursion, either in the application or in a stored procedure framework. This scales poorly for large sets, especially with deep structures. Let's see how things might look using arrays to materialize the path to the parent:


-- arrays ftw
create table item
(
item_id serial primary key,
parents int[], -- one minor disadvantage is losing the RI constraint
item_type item_type
);

create index item_parents_idx on item(parents);

insert into item values
(default, '{1}', 'root'),
(default, '{1, 2}', 'foo'),
(default, '{1, 3}', 'foo'),
(default, '{1, 3, 4}', 'bar');

Here, we materialize the path (series of parents) to the item in an array, including the item's id. Looking up the family for the 'bar' element (id #4), is done by:

select p.* from item p
join
(
select explode_array(parents) as item_id
from item where item_id = 4
) q using (item_id);

Let's go ahead and generalize this into a view:

create view item_tree as
select q.item_lookup, p.* from item p
join
(
select item_id as item_lookup,
explode_array(parents) as item_id
from item
) q using (item_id);

We can use this to specifically look at item #4 by:

select item_id, type from item_tree where item_lookup = 4;
item_id | item_type
---------+-----------
1 | root
3 | foo
4 | bar

That's pretty nifty. The above query will use the index on item_id for the id lookup as well as when it goes back to item to fetch the elements. Let's expand our view to include child items of the item of interest, and add a flag that can be queried in case that's desirable (this is where the array is particularly helpful):

create view item_tree as
select q.item_lookup, is_child, p.* from item p
join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

The above query creates a remarkably efficient plan and postgresql is smart enough to optimize the case where you are only interested in child or non-child items. The array term is properly utilizes index when necessary. Well, how do we add elements to the structure? While arrays are neat, dealing with them on the client can be a pain (all those strings) and it would be nice to not have to construct them on the fly. Let's expand our view to do this for us:

create view item_tree as
select
q.item_lookup,
p.parents[array_upper(p.parents, 1) - 1] as parent_id, -- null on zero reference
is_child,
p.* from item p

join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

create or replace rule insert_item_tree as on insert to item_tree do instead
insert into item (item_id, parents, item_type)
select
coalesce(new.item_id, nextval('item_item_id_seq')),
(select i.parents || coalesce(new.item_id, currval('item_item_id_seq'))::int from item i where item_id = new.parent_id),
new.item_type;

insert into item_tree(item_id, parent_id, item_type) values (null, 4, 'bar'); -- we can pass in null to get the sequence to assign a value
insert into item_tree(item_id, parent_id, item_type) values (null, 5, 'foo'); -- other columns of view are ignored

select * from item_tree where item_lookup = 3;
item_lookup | parent_id | is_child | item_id | parents | item_type
-------------+-----------+----------+---------+-------------+-----------
3 | | f | 1 | {1} | root
3 | 1 | f | 3 | {1,3} | foo
3 | 3 | t | 4 | {1,3,4} | bar
3 | 4 | t | 5 | {1,3,4,5} | bar
3 | 5 | t | 6 | {1,3,4,5,6} | foo


This is a functional example that could be used to build real applications. Many portions are left as an exercise to the reader, including performance testing (it's pretty good), extending the base item table into properties and specific subtables, a more robust constraint system, and better error handling. Lately I am coming to the perspective that is better to try and preserve sql-ish interface to application facing structures as opposed to providing an API to build recursive structures that the application must interact with. While this is a bit more difficult and involves some hacky things, the complexity is neatly tucked away and the user is free to build recursive structures using familiar paradigms (insert, select, etc).

merlin

5 comments:

Anonymous said...

Ok, it's your idea it's fine...
I've added a "level" column to the view (array_upper(p.parents, 1) AS "level") (as you have in connect by)

But, the problem with this is when you update or delete...

How can you change the parent of an item which has children?

So now, i don't see anything usable in posgresql to do hiearchical queries... (In fact we use the connect by patch : http://gppl.moonbone.ru/)

Merlin Moncure said...

Deleting is no problem...assuming you want to delete an item's children when it goes the query essentially boils down to delete from item where parents between a and b. This beats classic recursion methods.

Updating is a little more complicated, but not bad. When you want to move an item from parent a to parent b, your query will look something like

update item set parents = newparent.prefix + item.suffix where parents between a and b.

This is a very efficient query for searching purposes but means every item that is getting moved must be updated (in classic approach, only first level does). This is a problem common with all materialized path methods and should be taken into consideration. For tables that are updated frequently and have very large and/or deep trees, this could mean another apporach might be better suited.

Materialized path approaches should beat hierarchal methods in all types of retrieval however which is generally what people are interested in.

Anonymous said...

Unfortunately Moonbone ſeems to be more often off ðan online. And what we really need is the implementation of recurſive WITH queries, as per the ſtandard.

Einhverfr said...

Out of curiosity, have you looked at contrib/tablefunc and its connectby() function? We use it and the parent_id solution in LSMB and it works pretty well. It also allows for a cleaner semantics because we don't have to store the full path in the record (this and "level" are both generated by connectby!).

Best Wishes,
Chris Travers
LSMB Core Team.

Merlin Moncure said...

Decided to follow up your comments on a new post...