Merlin's stuff

The writings of Merlin Moncure, professional database developer, about work, life, family, and everything else.

Monday, February 10, 2014

Bitcoin is fine; Surrogate keys are the problem.

If you've been following the bitcoin saga you may have heard about Mt Gox's halting of currency withdrawals.  Well, it's come out that due to their (completely preventable) improper transaction tracking bad actors have been gaming them.  Oops.

Mt Gox (bitcoin exchange) uses surrogate key improperly and pays the price.

I've determined a long time ago that overuse of surrogate keys is a huge problem in the database industry.  They encourage lazy model design and poor understanding of database fundamentals.  Their main advantage over natural keys is they facilitate cheap updates of the key since it isn't used for purposes of relating.  Unfortunately that can lead to disastrous consequences since the data underlying the key is often important and that is exactly what happened here.

Wednesday, September 12, 2012

psql -- now with a splash of color!

I was tired of looking at psql's dull grey output and decided to see if I could work in some color.   I was able to do this -- it's pretty hacky but seems to work well enough for me to publish here.  first, if you haven't already, please read the instructions here for a good primer for how to drastically improve psql data browsing through the less utility (this will work on windows, but only through cygwin based terminals like the excellent rxvt).

My .bashrc is now including this:

export YELLOW=`echo -e '\033[1;33m'`
export LIGHT_CYAN=`echo -e '\033[1;36m'`
export NOCOLOR=`echo -e '\033[0m'`

export LESS="-iMSx4 -FXR"

PAGER="sed \"s/\([[:space:]]\+[0-9.\-]\+\)$/${LIGHT_CYAN}\1$NOCOLOR/;" 
PAGER+="s/\([[:space:]]\+[0-9.\-]\+[[:space:]]\)/${LIGHT_CYAN}\1$NOCOLOR/g;" 
PAGER+="s/|/$YELLOW|$NOCOLOR/g;s/^\([-+]\+\)/$YELLOW\1$NOCOLOR/\" 2>/dev/null  | less"
export PAGER


My .psqlrc typically looks like this (you might want to get in the habit for disabling it with psql -X for output sensitive scripts):

\pset pager always
\timing

you should see output like this:













pretty cool, huh?  it works for table properties too:

















relying on sed of course is pretty sketchy -- if just the right combination of characters come up you'll see garbled output -- but it seems to work pretty well.  of course, disabling the pager is always an option.  Ideally, we'd have tighter control over output formatting for rich color displays (like, having stderr ERROR messages in red for example).

edit: blogspot was garbling the string...breaking it up fixed it
edit2: see comments for method of non-global adjustment of LESS (in case other uses of LESS are impacted by the settings)


Monday, December 17, 2007

Advisory Locks, an Update

Update: more testing has shown this to not work as well as I thought...details will follow
Update 2: this doesn't work at all. I need to give it some more thought...M

I've made no secret that 'advisory locks' is one one of the coolest features in PostgreSQL (see my posts here and here). As a general purpose database mutex, you can do all sorts of wonderful things with them...long term locks, cooperative locks, sub transaction locks, etc.

But what about sub statement locks?

Consider the gapless sequence as described here (normal sequences produce 'holes' in the number line when rollbacks or database crashes occur). In the example, an external record is used to track the counter and provide concurrency protection. Well, how about this:

create table t(id int primary key);

insert into t select coalesce(id, 1) from
(
select pg_advisory_lock(1),
id + 1 as id,
pg_advisory_unlock(1) from t
order by id desc limit 1
) q;

The above statement (ab)uses the left to right order of execution of the fields of the statement. No external object is locked, and the outer select statement could do a bunch of other time consuming things without tying up the lock and killing concurrency. There is no chance of two transactions grabbing the same value of id in read committed mode (there may be a small chance of a race to grab the first value of id). This may not be the best or even a good way to do this, but it's an interesting demonstration of the technique.

Sunday, December 16, 2007

12 Days of Christmas, PostgreSQL style

In case you have trouble remembering the lyrics...

select v || case
when v=1 then ' partridge in a pear tree'
when v=2 then ' turtle doves'
when v=3 then ' french hens'
when v=4 then ' calling birds'
when v=5 then ' GO-O-OLDEN RINGS!'
when v=6 then ' swans a swimming'
when v=7 then ' geese a laying'
when v=8 then ' maids a milking'
when v=9 then ' ladies dancing'
when v=10 then ' lords a leaping'
when v=11 then ' pipers piping'
when v=12 then ' drummers drumming'
end from
(select generate_series(generate_series(1,12), 1, -1) as v) q;

Happy Holidays!

[update: now the lyrics are in the right order for day #2 and beyond...also decided to correctly spell 'pear' against my better judgement]

Managing Trees with Arrays, Part 2

Einhverfr writes:
Out of curiosity, have you looked at contrib/tablefunc and its connectby() function? We use it and the parent_id solution in LSMB and it works pretty well. It also allows for a cleaner semantics because we don't have to store the full path in the record (this and "level" are both generated by connectby!).

Thanks for your interest.

I have looked at connectby, and I have a lot of issues with it. It's a good approach and probably better if your data is highly volatile (updates are much cheaper), but less flexible. If it meets your requirements perfectly I'd use it, but I've tried a few times and end up bumping into limitations. The materialized path approach as advertised scales extremely well, except for the insert problem or if your trees are extremely deep because the index gets huge.

First, the planner knows zero about the connectby function. This can lead to planning problems for complicated joins that it interfaces with. The materialized path approach is mostly wide open to the planner. I find myself often writing complex queries, and absolutely detest pushing data issues into the application layer. This is why I favor views over functions generally when there is a choice. The view abstracts the interface a bit for application without hobbing the planner. The view abstraction approach I used is what really sells the idea, btw.

Secondly, connectby takes a single key as a parameter only. This can lead to very awkward situations if you need to generalize the input over a range of records (give me all the families for items in range of (a,b)), and the query will be slow because the table is unordered. Since the materialized path approach orders the table over the key, the query is simple and fast.

Third, the indexing strategies are different...all operations traversing the tree with the parent/child index relationship require recursion. The connectby function hides the recursion from you but the recursion is still there. This means that for many operations the net result will be less efficient than when the path is handed to you.

Fourth, while it's kind of a pain because you have to create an opclass (and other reasons), the array approach at least opens the door to composite keys.

Fifth, connectby gives you a delimited text string, which has to be parsed and wrapped in a view (my examples above do this already). You also need to wrap if you want to be able to traverse up the tree in a single operation. I would advise doing this...it's probably not a good idea to have the application invoke the connectby function directly.

In fact, sacrificing a little bit of cpu overhead, I could write the parent child relationship abstracted into a few recursive sql functions that traverse the tree. While this can get dangerous as you only want to do really heavy recursion in C as iteration, it's probably good enough for most circumstances and sidesteps some of the land mines the planner throws you with connectby.

Thanks again for commenting. I really should write some benchmarks demonstrating the various techniques. Maybe, if there is a enough interest, I'll take the time to do that.

merlin

Friday, October 19, 2007

A better psql with less

psql is a great tool, but not very good at browsing data. Or is it? The following settings will make psql much more usable.

in your user profile:
export PAGER=less
export LESS="-iMSx4 -FX"

in your .psqlrc (make it if it's not already there)

\timing
\pset pager always

try it!! now wide query results will paginate much better and searching is much easier.

update: fixed quotes for LESS, thanks depesz

Thursday, September 06, 2007

Managing Trees with Arrays

One of my favorite problems in databases is working with recursive structures. In particular, I've made it my quest to debunk the prevailing myth that recursive structures are not 'relational' and thus should be handled in application code or some alternative format. One strategy in this war is demonstrating how easy it is to deal with recursive structures with a little thought and some clever queries. What follows is a demonstration of how one might use the arrays of integers in PostgreSQL in a variation of the classic 'materialized path' method of organizing data in trees, along with some scaffolding to reduce the complexity from the point of view of the client application. This method is efficient, scalable, and useful for many distributions of data. Since trees usually store heterogeneous elements, we will make sure to allow for that in the examples.

-- 8.3 brings enum support. in 8.2, we use a foreign key or a check constraint
create type item_type as enum('root', 'foo', 'bar');

-- the classic way
create table item
(
item_id serial primary key,
parent_id int references item(item_id) on delete cascade
item_type item_type
);
create index item_parent_id_idx on item(parent_id);

insert into item values
(default, null, 'root'),
(default, 1, 'foo'),
(default, 1, 'foo'),
(default, 3, 'bar'); -- etc

While this is very elegant and expressive, the major problem is that to pull any data out of the system based on hierarchy requires recursion, either in the application or in a stored procedure framework. This scales poorly for large sets, especially with deep structures. Let's see how things might look using arrays to materialize the path to the parent:


-- arrays ftw
create table item
(
item_id serial primary key,
parents int[], -- one minor disadvantage is losing the RI constraint
item_type item_type
);

create index item_parents_idx on item(parents);

insert into item values
(default, '{1}', 'root'),
(default, '{1, 2}', 'foo'),
(default, '{1, 3}', 'foo'),
(default, '{1, 3, 4}', 'bar');

Here, we materialize the path (series of parents) to the item in an array, including the item's id. Looking up the family for the 'bar' element (id #4), is done by:

select p.* from item p
join
(
select explode_array(parents) as item_id
from item where item_id = 4
) q using (item_id);

Let's go ahead and generalize this into a view:

create view item_tree as
select q.item_lookup, p.* from item p
join
(
select item_id as item_lookup,
explode_array(parents) as item_id
from item
) q using (item_id);

We can use this to specifically look at item #4 by:

select item_id, type from item_tree where item_lookup = 4;
item_id | item_type
---------+-----------
1 | root
3 | foo
4 | bar

That's pretty nifty. The above query will use the index on item_id for the id lookup as well as when it goes back to item to fetch the elements. Let's expand our view to include child items of the item of interest, and add a flag that can be queried in case that's desirable (this is where the array is particularly helpful):

create view item_tree as
select q.item_lookup, is_child, p.* from item p
join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

The above query creates a remarkably efficient plan and postgresql is smart enough to optimize the case where you are only interested in child or non-child items. The array term is properly utilizes index when necessary. Well, how do we add elements to the structure? While arrays are neat, dealing with them on the client can be a pain (all those strings) and it would be nice to not have to construct them on the fly. Let's expand our view to do this for us:

create view item_tree as
select
q.item_lookup,
p.parents[array_upper(p.parents, 1) - 1] as parent_id, -- null on zero reference
is_child,
p.* from item p

join
(
select
item_id as item_lookup,
false as is_child,
explode_array(parents) as item_id
from item
union all
select
l.item_id as item_lookup,
true as is_child,
r.item_id
from item l
join item r on r.parents between (l.parents || 0)
and (l.parents || 2147483647)
) q using (item_id);

create or replace rule insert_item_tree as on insert to item_tree do instead
insert into item (item_id, parents, item_type)
select
coalesce(new.item_id, nextval('item_item_id_seq')),
(select i.parents || coalesce(new.item_id, currval('item_item_id_seq'))::int from item i where item_id = new.parent_id),
new.item_type;

insert into item_tree(item_id, parent_id, item_type) values (null, 4, 'bar'); -- we can pass in null to get the sequence to assign a value
insert into item_tree(item_id, parent_id, item_type) values (null, 5, 'foo'); -- other columns of view are ignored

select * from item_tree where item_lookup = 3;
item_lookup | parent_id | is_child | item_id | parents | item_type
-------------+-----------+----------+---------+-------------+-----------
3 | | f | 1 | {1} | root
3 | 1 | f | 3 | {1,3} | foo
3 | 3 | t | 4 | {1,3,4} | bar
3 | 4 | t | 5 | {1,3,4,5} | bar
3 | 5 | t | 6 | {1,3,4,5,6} | foo


This is a functional example that could be used to build real applications. Many portions are left as an exercise to the reader, including performance testing (it's pretty good), extending the base item table into properties and specific subtables, a more robust constraint system, and better error handling. Lately I am coming to the perspective that is better to try and preserve sql-ish interface to application facing structures as opposed to providing an API to build recursive structures that the application must interact with. While this is a bit more difficult and involves some hacky things, the complexity is neatly tucked away and the user is free to build recursive structures using familiar paradigms (insert, select, etc).

merlin

Tuesday, September 04, 2007

PostgreSQL 8.3 Features: Plan Invalidation

As previously stated, PostgreSQL 8.3 is shaping up to be a great release of the software. One of the biggest beneficiaries of the new feature set are pl/pgsql developers. 8.3 stands to be the best thing to happen to pl/pgsql since 8.0 brought dollar quoting to the table...before which some might argue serious development with the language bordered on the impractical. The major new features are vastly improved cursor handling (including UPDATE/DELETE), improved handling for set returning functions, arrays of composite types (for efficient passing of data to/from functions), and, especially, automatic invalidation of cached plans. While plan invalidation solves many tangential issues their greatest impact will certainly be in server side development of stored procedures.

When a pl/pgsql function executes for the first time in a session, the server 'compiles' it by parsing the static (not passed through EXECUTE) sql and generating plans for all the queries. This is an essential mechanism for fast repeated execution of server side functions because it allows many tedious, cpu intensive portions of query execution to be optimized. One part of this optimization involves looking up various database objects involved in the query and storing identifiers that are internal to the database. While this is useful and good, it has an unfortunate side effect in that if the structures the database were referencing in the plan were no longer valid, the function plan itself is no longer valid and may raise errors if executed. There are various workarounds to address this that are mostly obsolete (plan invalidation doesn't catch references to functions yet).

Here is the canonical example (and, arguably, the justification for the effort) of demonstrating how plan invalidation works:

create or replace function test_func() returns void as
$$
begin
create temp table tt as select 'hello world!'::text;
perform * from tt;
drop table tt;
end;
$$ language plpgsql;

select test_func();
select test_func();

ERROR: relation with OID 87831 does not exist

The first invocation of test_func() generates a plan for the function. The second time around, the plan is pointing to a stale reference to 'tt' and the function fails. In PostgreSQL 8.3, this function succeeds without error. The database determines when tt is dropped that there are plans referencing it and throws them out. Historically, this was a huge impediment to pl/pgsql development because there was no particularly safe way to create and drop temporary tables from within a function. It is natural to want to create a table for temporary storage, do all kinds of things to it, and release it when done -- but sadly was not generally possible without various workarounds. For this reason, historically it was often better to use cursors inside pl/pgsql functions for holding working data which pushes code into a more procedural style. Now there is room for more set oriented style data processing which should be an attractive alternative to any dba.