Labels

R (15) Admin (12) programming (11) Rant (6) personal (6) parallelism (4) HPC (3) git (3) linux (3) rstudio (3) spectrum (3) C++ (2) Modeling (2) Rcpp (2) SQL (2) amazon (2) cloud (2) frequency (2) math (2) performance (2) plotting (2) postgresql (2) DNS (1) Egypt (1) Future (1) Knoxville (1) LVM (1) Music (1) Politics (1) Python (1) RAID (1) Reproducible Research (1) animation (1) audio (1) aws (1) data (1) economics (1) graphing (1) hardware (1)

06 June 2009

Postgresql Poetry? Aggregate median with PL/R

No, no poetry yet, but i think i'm getting closer to thinking like a query planner. Today i learned aggregates. Have i mentioned how much i love the postgresql documentation? Unknownst to myself before today, postgres's aggregates are smart - they process as a stream when at all possible - so average keeps track of the running sum and number, and calculates the average at the end. Which makes median tricky. It depends on the length of input. I have a median implimentation in PL/Perl that queries the table directly, asking it for length, and then extracts the needed records. While this is efficient, it's much closer to magic than i'm comfortable with. Armed with PL/R, I create a median function with the proper type, and then I create the aggregate using that function as the final calc, and a simple accumulator as the transistion function. This isn't very efficient when the "group by" clause give large groups to be passed into the aggregate function, since the whole group has to be stored in memory. But for lots of small groups (say, aggregating 4 samples per hour or 24 samples per day over a year or two), it gives ~20% performance gains. Throw a "limit 5" clause in, and the performance gain increases to ~100%. Here's a link to some truly amazing stuff folks are doing with PL/R. -- sql code follows -- first create the function, then the aggregate
CREATE OR REPLACE FUNCTION median(vals numeric[]) RETURNS float AS '
median(vals)
' LANGUAGE 'plr' STRICT;

CREATE AGGREGATE median (numeric)
(
sfunc = array_append,
stype = numeric[],
initcond = '{}',
finalfunc = median
);

--And usage
select sn, median(dtwm), count(dtwm), date_trunc('day', timestamp) as day from
gwdata where timestamp < '2003-01-01' group by sn, day limit 5;   
-- sn  | median | count |          day  -----+--------+-------+------------------------  
-- 5526 |   1.59 |    24 | 2002-12-26 00:00:00-07  
-- 5539 | 1.7605 |    24 | 2002-02-27 00:00:00-07  
-- 5522 |  0.737 |    24 | 2001-10-03 00:00:00-07  
-- 5517 |   0.96 |    24 | 2001-11-05 00:00:00-07  
-- 5513 | 1.3855 |    24 | 2001-09-07 00:00:00-07  
-- Time: 247.126 ms     

-- Here's the PL/Perl median() function.  
-- Much less straightforwards, included for historical interest BEGIN { strict->import(); }
my ($tname,$cname) = @_;
my $SQL = "SELECT count($cname) AS t FROM $tname";
my $rc = spi_exec_query($SQL);
my $total = $rc->{rows}[0]{'t'};
$total < offset =" ($total-1)/2;" sql = "SELECT $cname AS median FROM $tname     ORDER BY $cname OFFSET $offset LIMIT 1" sql = "SELECT avg($cname) AS median FROM   (SELECT $cname FROM $tname    ORDER BY $cname      OFFSET $offset LIMIT 2   ) AS foo" rc =" spi_exec_query($SQL);">{rows}[0]{median};

No comments:

Post a Comment