One of my favorite problems in databases is working with recursive structures. In particular, I've made it my quest to debunk the prevailing myth that recursive structures are not 'relational' and thus should be handled in application code or some alternative format. One strategy in this war is demonstrating how easy it is to deal with recursive structures with a little thought and some clever queries. What follows is a demonstration of how one might use the arrays of integers in PostgreSQL in a variation of the classic 'materialized path' method of organizing data in trees, along with some scaffolding to reduce the complexity from the point of view of the client application. This method is efficient, scalable, and useful for many distributions of data. Since trees usually store heterogeneous elements, we will make sure to allow for that in the examples.
-- 8.3 brings enum support. in 8.2, we use a foreign key or a check constraint
create type item_type as enum('root', 'foo', 'bar');
-- the classic way
create table item
(
item_id serial primary key,
parent_id int references item(item_id) on delete cascade,
item_type item_type
);
create index item_parent_id_idx on item(parent_id);
insert into item values
(default, null, 'root'),
(default, 1, 'foo'),
(default, 1, 'foo'),
(default, 3, 'bar'); -- etc
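Pulling the ancestry out of this classic layout already takes something like the following recursive helper -- a hypothetical sketch of the usual approach (note `return query` requires 8.3; on 8.2 you would loop and `return next`):

```sql
-- hypothetical sketch: walk parent_id links recursively to fetch an
-- item and all of its ancestors, one query per level
create or replace function item_ancestors(_id int) returns setof item as
$$
declare
  r item;
begin
  select * into r from item where item_id = _id;
  if not found then
    return;
  end if;
  if r.parent_id is not null then
    return query select * from item_ancestors(r.parent_id);
  end if;
  return next r;
  return;
end;
$$ language plpgsql;

-- select * from item_ancestors(4); -- one function call (and query) per level of depth
```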
While this is very elegant and expressive, the major problem is that to pull any data out of the system based on hierarchy requires recursion, either in the application or in a stored procedure framework. This scales poorly for large sets, especially with deep structures. Let's see how things might look using arrays to materialize the path to the parent:
-- arrays ftw
create table item
(
item_id serial primary key,
parents int[], -- one minor disadvantage is losing the RI constraint
item_type item_type
);
create index item_parents_idx on item(parents);
insert into item values
(default, '{1}', 'root'),
(default, '{1, 2}', 'foo'),
(default, '{1, 3}', 'foo'),
(default, '{1, 3, 4}', 'bar');
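One caveat: explode_array, used in the queries below, is not a built-in (unnest does not arrive until 8.4). A common definition, assumed here:

```sql
-- turn an array into a set of rows, one per element
create or replace function explode_array(anyarray) returns setof anyelement as
$$
  select ($1)[i] from generate_series(array_lower($1, 1), array_upper($1, 1)) i;
$$ language sql immutable;
```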
Here, we materialize the path (series of parents) to the item in an array, including the item's own id. Looking up the family for the 'bar' element (id #4) is done by:
select p.* from item p
join
(
select explode_array(parents) as item_id
from item where item_id = 4
) q using (item_id);
Let's go ahead and generalize this into a view:
create view item_tree as
select q.item_lookup, p.* from item p
join
(
select item_id as item_lookup,
explode_array(parents) as item_id
from item
) q using (item_id);
We can use this to specifically look at item #4 by:
select item_id, item_type from item_tree where item_lookup = 4;
item_id | item_type
---------+-----------
1 | root
3 | foo
4 | bar
That's pretty nifty. The above query will use the index on item_id for the id lookup as well as when it goes back to item to fetch the elements. Let's expand our view to include child items of the item of interest, and add a flag that can be queried in case that's desirable (this is where the array is particularly helpful):
create view item_tree as
select q.item_lookup, is_child, p.* from item p
join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0) and (l.parents || 2147483647)
) q using (item_id);
The above query creates a remarkably efficient plan, and PostgreSQL is smart enough to optimize the case where you are only interested in child or non-child items. The array term properly utilizes the index when necessary. Well, how do we add elements to the structure? While arrays are neat, dealing with them on the client can be a pain (all those strings) and it would be nice to not have to construct them on the fly. Let's expand our view to do this for us:
create view item_tree as
select
q.item_lookup,
p.parents[array_upper(p.parents, 1) - 1] as parent_id, -- null on zero reference
is_child,
p.* from item p
join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0) and (l.parents || 2147483647)
) q using (item_id);
create or replace rule insert_item_tree as on insert to item_tree do instead
insert into item (item_id, parents, item_type)
select
coalesce(new.item_id, nextval('item_item_id_seq')),
(select i.parents || coalesce(new.item_id, currval('item_item_id_seq'))::int from item i where item_id = new.parent_id),
new.item_type;
insert into item_tree(item_id, parent_id, item_type) values (null, 4, 'bar'); -- we can pass in null to get the sequence to assign a value
insert into item_tree(item_id, parent_id, item_type) values (null, 5, 'foo'); -- other columns of view are ignored
select * from item_tree where item_lookup = 3;
item_lookup | parent_id | is_child | item_id | parents | item_type
-------------+-----------+----------+---------+-------------+-----------
3 | | f | 1 | {1} | root
3 | 1 | f | 3 | {1,3} | foo
3 | 3 | t | 4 | {1,3,4} | bar
3 | 4 | t | 5 | {1,3,4,5} | bar
3 | 5 | t | 6 | {1,3,4,5,6} | foo
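With the is_child flag in place, restricting a lookup to just the descendants of an item is a single predicate against the view (same data as above):

```sql
-- descendants of item #3 only
select item_id, item_type from item_tree where item_lookup = 3 and is_child;
```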
This is a functional example that could be used to build real applications. Many portions are left as an exercise to the reader, including performance testing (it's pretty good), extending the base item table into properties and specific subtables, a more robust constraint system, and better error handling. Lately I am coming to the perspective that it is better to try to preserve a SQL-ish interface to application-facing structures, as opposed to providing an API to build recursive structures that the application must interact with. While this is a bit more difficult and involves some hacky things, the complexity is neatly tucked away and the user is free to build recursive structures using familiar paradigms (insert, select, etc).
merlin
The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.
Thursday, September 06, 2007
Tuesday, September 04, 2007
PostgreSQL 8.3 Features: Plan Invalidation
As previously stated, PostgreSQL 8.3 is shaping up to be a great release of the software. One of the biggest beneficiaries of the new feature set is pl/pgsql developers. 8.3 stands to be the best thing to happen to pl/pgsql since 8.0 brought dollar quoting to the table...before which some might argue serious development with the language bordered on the impractical. The major new features are vastly improved cursor handling (including UPDATE/DELETE), improved handling for set returning functions, arrays of composite types (for efficient passing of data to/from functions), and, especially, automatic invalidation of cached plans. While plan invalidation solves many tangential issues, its greatest impact will certainly be in server side development of stored procedures.
When a pl/pgsql function executes for the first time in a session, the server 'compiles' it by parsing the static (not passed through EXECUTE) sql and generating plans for all the queries. This is an essential mechanism for fast repeated execution of server side functions because it allows many tedious, cpu intensive portions of query execution to be optimized. One part of this optimization involves looking up various database objects involved in the query and storing identifiers that are internal to the database. While this is useful and good, it has an unfortunate side effect: if the structures the plan references are no longer valid, the function plan itself is no longer valid and may raise errors if executed. There are various workarounds to address this that are now mostly obsolete (plan invalidation doesn't catch references to functions yet).
Here is the canonical example (and, arguably, the justification for the effort) of demonstrating how plan invalidation works:
create or replace function test_func() returns void as
$$
begin
create temp table tt as select 'hello world!'::text;
perform * from tt;
drop table tt;
end;
$$ language plpgsql;
select test_func();
select test_func();
ERROR: relation with OID 87831 does not exist
The first invocation of test_func() generates a plan for the function. The second time around, the plan is pointing to a stale reference to 'tt' and the function fails. In PostgreSQL 8.3, this function succeeds without error. The database determines when tt is dropped that there are plans referencing it and throws them out. Historically, this was a huge impediment to pl/pgsql development because there was no particularly safe way to create and drop temporary tables from within a function. It is natural to want to create a table for temporary storage, do all kinds of things to it, and release it when done -- but sadly this was not generally possible without various workarounds. For this reason, it was historically often better to use cursors inside pl/pgsql functions for holding working data, which pushes code into a more procedural style. Now there is room for more set oriented data processing, which should be an attractive alternative to any dba.
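For the record, the classic pre-8.3 workaround was to push the temp table references through EXECUTE, since dynamic sql is planned at execution time and never cached (a sketch of the usual pattern, not the only one):

```sql
-- pre-8.3 workaround: dynamic sql avoids caching a plan against tt
create or replace function test_func() returns void as
$$
begin
  execute 'create temp table tt as select ''hello world!''::text';
  execute 'select * from tt'; -- planned fresh each call, so no stale OID
  execute 'drop table tt';
end;
$$ language plpgsql;
```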
Monday, September 03, 2007
PostgreSQL 8.3 Features: Arrays of Compound Types
This will be the first in what will hopefully be a series of entries describing the new and interesting features in PostgreSQL 8.3. The database is entering the final stages of the development cycle before going into beta, with all the major patches having been submitted weeks ago (with one possible stunning exception, lazy xid assignment, described here). Thus, aside from the hard decisions that are going to have to be made on the HOT patch (more on that later), 8.3 is shaping up nicely. On a scale of one to ten, this release is a strong eight -- and if all the outstanding patches get in -- a vicious '10'. Excited yet? If not, let me prime the pump with my favorite new feature, Arrays of Composite Types.
David Fetter laid the groundwork for this new feature, which was ultimately accepted with the help of many others, including PostgreSQL core developers Tom Lane and Andrew Dunstan. The feature combines two neat features of PostgreSQL: the ability to build complex types out of Plain Old Data Types, and arrays (follow the links for an overview of both features).
create table foo(id int, b text);
create table bar(id int, foos foo[]); -- create an array of the type 'foo' in bar
create table baz(id int, bars bar[]); -- create an array of the type 'bar' in baz
insert into foo values(1, 'abc');
insert into foo values(2, 'def');
insert into bar values(1, array(select foo from foo)); -- there are two elements in this array
insert into bar values(2, array(select foo from foo)); -- and this
insert into bar values(3, array(select foo from foo)); -- and this
insert into baz values(1, array(select bar from bar)); -- three elements in here
select ((bars[1]).foos[2]).b from baz; -- use parentheses to get elements out of composites
 b
-----
 def
(1 row)
Being able to nest composite types in array structures is a powerful feature. While abusive in the extreme to good normalization and relational principles, it allows flexible approaches to some difficult problems:
- Input Data to Functions: sql and pl/pgsql functions have some limitations in how you can get data into the function. There are various workarounds for this that all have application in specific circumstances, but composite arrays allow you to pass a heap of data to a function that doesn't happen to be in a table. While this could have scalability issues for large arrays (> 10k elements), it can be convenient.
- Sometimes, Arrays are Just Better: Although this is rarely a good idea, sometimes it's beneficial to store data in arrays, and the ability to nest complex structures is helpful in this regard. Just remember that in most cases the entire array has to be read from and written back to disk (the server only understands the array as a field and treats it as such).
- Targeted Denormalization: When passing data between functions or to the client, it may be efficient to accumulate a particular column into an array to reduce duplication of data in the other columns. We may now do this with composites:
drop table bar, baz cascade; -- we would rarely rely directly on the generated types
alter table foo add column data text default 'data';
insert into foo values (1, 'ghi'); -- make some duplicates on foo by id
insert into foo values (2, 'jlk');
create type foo_data as (b text, data text);
select id, array_accum ((b, data)::foo_data) from foo group by id order by id;
id | array_accum
----+-----------------------------
1 | {"(abc,data)","(ghi,data)"}
2 | {"(def,data)","(jlk,data)"}
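Note that array_accum is not a built-in aggregate either; it is the example aggregate from the PostgreSQL documentation, which you must define yourself:

```sql
-- accumulate elements into an array, one group at a time
create aggregate array_accum (anyelement)
(
  sfunc = array_append,
  stype = anyarray,
  initcond = '{}'
);
```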
Arrays perform best as a server side feature. Because it is currently only recommended to render them to the client as text, Arrays of Composites introduce some interesting parsing issues going back and forth from the server. While this is ok in limited cases like the above example, it's inefficient and error prone to build and properly escape a nested array structure. While PostgreSQL has the ability to transfer arrays in binary over the protocol, there is no API to access the custom format data on the client side. Until that changes, I expect this feature will be most useful to stored procedure developers.
While relatively short on substance, I hope that this writing provides a good introduction to this interesting feature. Some of you will probably seize on this feature for its utility in solving a narrow, but important, class of problems. It's one more weapon in the mighty arsenal of the PostgreSQL developer, who is already armed with the most practical, versatile, and generally kick-ass piece of software for solving the world's data management issues.
Friday, August 31, 2007
Dell md1000 bonnie++ benchmarks
Following are the results of our testing of the Dell MD1000 with two raid controllers (Perc 5/e). The server was a Dell 2950 with 8gb ram. Pay close attention to the seeks figure, as this is the important figure for databases (the sequential rates are not all that important). We tried various combinations of software raid and hardware raid in both active/active and active/passive configurations.
scroll down for the results! :-)
Description | Size | Seq Output K/sec | %CPU | Rewrite K/sec | %CPU | Seq Input K/sec | %CPU | Random Seeks /sec
single drive perc 5i ext2 wal | 16G | 56600 | 5 | 32088 | 4 | 90382 | 6 | 444.9
dual controller 5 drive hardware raid 0 striped in software ext3 | 16G | 361790 | 75 | 177948 | 33 | 368836 | 31 | 1115.2
single controller 10 drive raid 0 hardware xfs | 16G | 405602 | 47 | 109537 | 15 | 510528 | 34 | 1111.8
single controller 5 drive hardware raid 0 ext3 | 16G | 162477 | 34 | 110203 | 19 | 423957 | 26 | 1062.0
dual controller 5 drive hardware raid 0 mirrored in software xfs | 16G | 255660 | 30 | 110531 | 18 | 379800 | 32 | 1471.4
single controller 10 drive hardware raid 10 xfs | 16G | 299039 | 36 | 104697 | 14 | 402601 | 27 | 1460.9
dual controller 10 drive hardware jbod software raid 0 xfs | 16G | 243036 | 25 | 126913 | 16 | 793798 | 52 | 1081.5
dual controller 10 drive hardware jbod software raid 10 | 16G | 244808 | 27 | 90410 | 12 | 355593 | 24 | 1509.9
RUN1 dual controller 5 drive raid 0 xfs separate volumes dual bonnie | 16G | 287469 | 36 | 95400 | 13 | 423656 | 31 | 1246.3
RUN2 | 16G | 291124 | 37 | 109373 | 14 | 367032 | 28 | 755.4
RUN1 single controller 5 drive raid 0 xfs dual bonnie same volume | 16G | 140861 | 21 | 53683 | 8 | 129144 | 9 | 506.6
RUN2 | 16G | 137158 | 21 | 60206 | 9 | 111646 | 8 | 487.2
single controller 5 drive raid 5 xfs | 16G | 292219 | 34 | 83425 | 11 | 262651 | 17 | 1282.4
RUN 1 dual controller 2x4 drive hardware raid 5 | 16G | 245259 | 31 | 83647 | 11 | 229561 | 16 | 767.0
RUN 2 | 16G | 248950 | 32 | 74472 | 10 | 261117 | 18 | 1025.7
dual controller 2x5 drive hardware raid 5 striped in software | 16G | 423363 | 44 | 80409 | 11 | 195092 | 12 | 1063.6
dual controller 2x5 drive hardware raid 5 mirrored in software | 16G | 40920 | 8 | 26515 | 4 | 75975 | 5 | 375.0
Wednesday, December 13, 2006
Advisory Locks Part 2
Previously, I wrote about PostgreSQL Advisory Locks and how important and useful they are. One thing I forgot to mention is how, in very specific cases, they can improve concurrency in your application. This is because they are both long-term (lock duration can exceed a single transaction) and short-term (lock duration can be shorter than a single transaction). This entry will describe the short-term aspect and what it can be used for.
PostgreSQL uses a high performance, high concurrency MVCC engine to manage multiple sessions reading and writing to the database. All locks in PostgreSQL use this engine and must obey transaction semantics -- except for advisory locks. While advisory locks can be used for various tricks, their most important role is to provide a database managed mutex of precisely controlled duration irrespective of what else is going on in the system at the time. The obvious use for this is for long term pessimistic locks, but how else might they be used?
One scenario is to manage contention for resources that also do not follow transactional semantics. One such resource is a sequence -- even in a serializable transaction, other sessions can access that resource and alter your expected results. For example, if you pulled 10 values from a sequence in a pl/pgsql FOR loop, there is no guarantee that those values are contiguous, except to lock a resource before the first value is fetched and hold that lock for the duration of the transaction. Advisory locks provide the tools to give that guarantee with a shorter term lock.
Imagine you had a pl/pgsql function that was long, or was part of a long transaction, with the following requirements:
- the sequence s must be incremented
- after incrementing s, execute an operation with a possible (handled) failure condition. If that fails, return s to the value before incrementing
- until the failure state of the operation is checked, no other sessions may increment s
Our transaction timeline is:
B-TS----------C
B=Begin, T=Standard SQL transaction lock, S=increment s, C=Commit
Using the transaction lock, we hold a mutex guarding s all the way from T to C.
With advisory locks, a better solution emerges: lock some number that represents s, and unlock that number once the failure state is checked and handled. Now our transaction timeline is:
B-LSU---------C
L=Advisory lock, U=Advisory unlock
Obviously, this lowers contention and improves concurrency. In certain rare cases, this can provide enormous benefits.
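The timeline above can be sketched in SQL terms -- a hypothetical sketch, where the lock key 1 standing in for s is an application-defined convention, not anything built in:

```sql
-- short-term advisory mutex around a sequence bump
begin;
select pg_advisory_lock(1);     -- L: only one session may touch s now
select nextval('s');            -- S: increment the sequence
-- ... attempt the operation; on handled failure, setval('s', ...) to undo ...
select pg_advisory_unlock(1);   -- U: mutex released mid-transaction
-- ... the rest of the (possibly long) transaction runs unguarded ...
commit;
```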
Sunday, December 10, 2006
About the author
Who am I?
My name is Merlin Moncure and I am a computer programmer living in Florida. I have a beautiful wife and two beautiful children and am pretty successful by most reasonable measures. This blog is going to be mostly about things related to computers and technology, but also about the various things I am concerned about. Particular areas of focus will be the economy, PostgreSQL, a smattering of politics, and all kinds of other long winded pontificating about whatever else comes to mind.
I previously had some PostgreSQL related information on planetpostgresql, but I found the blogging software in use there, Serendipity, very frustrating and difficult to use. Also, Blogger integrates with my GMail account in two ways: first, I can log in with my Google credentials, and second, I can publish entries from Google Docs, which is a much nicer interface than the inline editor I use here. So, here I am. My first real entry on this blog is actually a post that Serendipity ate.
I would like to give a quick shout out to a few of my favorite places on the internet. If you are in the programming business, do yourself a favor and check out PostgreSQL. Also, if you like to fret about the world economy like me, you absolutely have to read Mish's blog. Brilliant stuff. If you live in West Michigan and need furniture, be sure to get it from Longstreet Furniture and tell them Merlin sent you, and you will get a good deal. Also, if you are into RF engineering, check out RadioSoft, one of the biggest names in the business.
merlin
Saturday, December 09, 2006
PostgreSQL 8.2 Advisory Locks
Recently released PostgreSQL 8.2 is a treasure trove of helpful features for the database developer. Dubbed a performance release by the advocacy team, the most powerful open source database engine has once again shown that it is greater than the sum of its parts, and the breadth and scope of the project is greater than its developers. Many internal structural improvements were made, and several user level features were sprinkled in for good measure. One such feature was an overhaul of one of the more useful (and overlooked) contrib modules. Previously known as userlocks, the module was re-branded 'advisory locks' and brought into the core PostgreSQL architecture.
Advisory locks are 'long-term, cooperative' locks. They are long-term in the sense that they do not follow mvcc rules, and are in fact completely orthogonal to transactional locks. This is a slightly long-winded way of saying that advisory locks are not released upon transaction commit/rollback and can be held outside of a transaction. This is noteworthy because advisory locks are one of the very few things that can be used to maintain session state without using persistent storage (prepared statements, temporary tables, and listen/notify also qualify here), and especially noteworthy because they are the only mechanism in PostgreSQL that I know that can affect the behavior of another session without consideration of either session's transactional state. Advisory locks are cooperative in that all users of them are on the honor system -- this is quite different from standard mvcc locks which follow very strict rules. Anything can be locked at any time. Advisory locks can be thought of as a low level interface to the internal locking engine and will generally need some scaffolding to make them useful.
These properties make advisory locks practical for a number of useful things:
- Emulation of 'pessimistic locks' common in non-MVCC databases
- High-performance simple session IPC (like setting a boolean in a table, but faster and MVCC penalty free)
Advisory locks are very simple to use:
Administrator=# select pg_advisory_lock(1);
pg_advisory_lock
------------------
(1 row)
Administrator=# select locktype, classid, objid, pid, mode, granted from pg_locks where locktype = 'advisory';
locktype | classid | objid | pid | mode | granted
----------+---------+-------+------+---------------+---------
advisory | 0 | 1 | 3896 | ExclusiveLock | t
(1 row)
Administrator=# select pg_advisory_unlock(1);
pg_advisory_unlock
--------------------
t
(1 row)
The above commands acquired an advisory lock, demonstrated it in the pg_locks view, and released it using the built in interface. The meaning of the number '1' is completely application defined. It could represent the ID of a table, a special application mutex, a session boolean flag, or just about anything. Keep in mind that if all sessions do not agree on what the number one represents, problems are bound to follow. Regardless of how the session terminates, the locks are released immediately upon disconnect.
If you want to lock an ID of a table, the lock command can be inlined into the select statement:
Administrator=# select pg_advisory_lock(id), * from foo where id = 1;
pg_advisory_lock | id | f
------------------+----+-----
| 1 | abc
(1 row)
And the unlock is just as easy. Advisory locks are extremely fast:
Administrator=# explain analyze select pg_advisory_lock(v) from generate_series(1,1000) v;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series v  (cost=0.00..17.50 rows=1000 width=4) (actual time=0.304..4.365 rows=1000 loops=1)
 Total runtime: 6.244 ms
(2 rows)
If I read that right, the time on my iMac to acquire a lock is about 6 microseconds (6.244 ms / 1000 locks)...not bad! A note of caution here: advisory locks are drawn out of shared memory, and if they are abused the results can be catastrophic:
Administrator=# select pg_advisory_lock(v) from generate_series(1,10000) v;
WARNING: out of shared memory
ERROR: out of shared memory
HINT: You may need to increase max_locks_per_transaction.
Administrator=# select * from pg_locks;
WARNING: out of shared memory
When shared memory is used up, the server basically cannot do anything at all. Quitting the session returned everything to normal however. All the same, extreme care must be taken not to leak locks. Also, they are an extremely obvious Denial of Service target following an SQL injection attack. If this is an issue you may want to consider locking them down:
Administrator=# revoke execute on function pg_advisory_lock(bigint) from public;
REVOKE
Advisory locks provide a 64 bit space for locks. You have the choice of locking two 32 bit integers or a single 64 bit integer (these are drawn from the same lock pool and are thus only superficially different). In addition, there are the 'shared' flavors of the advisory locks which do not block other shared lockers but block and are blocked by exclusive lockers. Best of all, there are 'try' flavors of the functions which do not wait for the lock to be held if another session hold it -- this is similar to the NOWAIT clause of some other SQL statements that can hold locks. The full listing of all the advisory lock functions is here.
Merlin
Advisory locks are 'long-term, cooperative' locks. They are long-term in the sense that they do not follow MVCC rules and are in fact completely orthogonal to transactional locks: advisory locks are not released on transaction commit/rollback and can be held outside of a transaction. This is noteworthy because advisory locks are one of the very few things that can be used to maintain session state without using persistent storage (prepared statements, temporary tables, and listen/notify also qualify here), and especially noteworthy because they are the only mechanism in PostgreSQL that I know of that can affect the behavior of another session without regard to either session's transactional state. Advisory locks are cooperative in that all users of them are on the honor system -- this is quite different from standard MVCC locks, which follow very strict rules. Anything can be locked at any time. Advisory locks can be thought of as a low level interface to the internal locking engine and will generally need some scaffolding to make them useful.
These properties make advisory locks practical for a number of useful things:
- Emulation of 'pessimistic locks' common in non-MVCC databases
- High-performance simple session IPC (like setting a boolean in a table, but faster and MVCC penalty free)
- Cooperative locking of resources without using long duration transactions (which are evil!)
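As a sketch of the first use case, a client could take a 'try' lock on a record's ID before presenting it for editing, without holding a transaction open for the duration of the edit (the table foo used here is purely for illustration):

-- attempt to 'check out' record 1 of foo for editing; returns
-- false immediately if another session already holds the lock
select pg_try_advisory_lock(id) as acquired from foo where id = 1;

-- ...the user edits the record in the application...

-- release the lock when the edit is saved or abandoned
select pg_advisory_unlock(1);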
Advisory locks are very simple to use:
Administrator=# select pg_advisory_lock(1);
 pg_advisory_lock
------------------

(1 row)
Administrator=# select locktype, classid, objid, pid, mode, granted from pg_locks where locktype = 'advisory';
locktype | classid | objid | pid | mode | granted
----------+---------+-------+------+---------------+---------
advisory | 0 | 1 | 3896 | ExclusiveLock | t
(1 row)
Administrator=# select pg_advisory_unlock(1);
pg_advisory_unlock
--------------------
t
(1 row)
The above commands acquired an advisory lock, demonstrated it in the pg_locks view, and released it using the built-in interface. The meaning of the number '1' is completely application defined. It could represent the ID of a table, a special application mutex, a session boolean flag, or just about anything. Keep in mind that if all sessions do not agree on what the number '1' represents, problems are bound to follow. Regardless of how the session terminates, the locks are released immediately upon disconnect.
If you want to lock an ID of a table, the lock command can be inlined into the select statement:
Administrator=# select pg_advisory_lock(id), * from foo where id = 1;
pg_advisory_lock | id | f
------------------+----+-----
| 1 | abc
(1 row)
And the unlock is just as easy. Advisory locks are extremely fast:
Administrator=# explain analyze select pg_advisory_lock(v) from generate_series(1,1000) v;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series v  (cost=0.00..17.50 rows=1000 width=4) (actual time=0.304..4.365 rows=1000 loops=1)
 Total runtime: 6.244 ms
(2 rows)
If I read that right, the time to acquire a lock on my iMac is about 6 microseconds... not bad! A note of caution here: advisory locks are drawn out of shared memory, and if they are abused the results can be catastrophic:
Administrator=# select pg_advisory_lock(v) from generate_series(1,10000) v;
WARNING: out of shared memory
ERROR: out of shared memory
HINT: You may need to increase max_locks_per_transaction.
Administrator=# select * from pg_locks;
WARNING: out of shared memory
When shared memory is used up, the server basically cannot do anything at all. Quitting the session returned everything to normal, however. All the same, extreme care must be taken not to leak locks. They are also an extremely obvious denial of service target following an SQL injection attack; if this is an issue, you may want to consider locking them down:
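One safety valve worth knowing about when worrying over leaked locks: a session can drop every advisory lock it holds in one shot, which is handy in application error handlers:

-- release all advisory locks held by the current session
select pg_advisory_unlock_all();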
Administrator=# revoke execute on function pg_advisory_lock(bigint) from public;
REVOKE
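With the public grant revoked, access can be re-opened through a wrapper that constrains the keyspace. The function name and range below are made up for illustration; the point is that security definer lets ordinary users lock only keys the application sanctions:

-- hypothetical wrapper: only permits locks in a reserved range
create function acquire_app_lock(key bigint) returns void as
$$
begin
  if key < 0 or key > 1000000 then
    raise exception 'lock key % out of range', key;
  end if;
  perform pg_advisory_lock(key);
end;
$$ language plpgsql security definer;

grant execute on function acquire_app_lock(bigint) to public;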
Advisory locks provide a 64 bit space for locks. You have the choice of locking two 32 bit integers or a single 64 bit integer (these are drawn from the same lock pool and are thus only superficially different). In addition, there are 'shared' flavors of the advisory locks, which do not block other shared lockers but block and are blocked by exclusive lockers. Best of all, there are 'try' flavors of the functions, which do not wait for the lock if another session holds it -- similar to the NOWAIT clause of some other SQL statements that take locks. The full listing of all the advisory lock functions is here.
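For reference, the variations just described look like this in practice (all of these are standard built-in functions):

select pg_advisory_lock(0, 1);       -- two 32 bit keys instead of one 64 bit key
select pg_advisory_lock_shared(1);   -- shared: coexists with other shared lockers
select pg_try_advisory_lock(1);      -- try: returns false instead of waiting
select pg_advisory_unlock_shared(1); -- each flavor has a matching unlock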
Merlin