Russian version
Add to Del.icio.us
English version
Digg It!

 Old-School dkLab | Constructor | dkLab PostgreSQL patch: work with very huge integer arrays 

Site map :: Project Orphus :: Constructor


2008-05-20

Download the PostgreSQL intarray and intagg patch:
dklab_postgresql_patch_2008-05-20.tgz

dkLab PostgreSQL patch is a small patchset for PostgreSQL 8.3 intarray and intagg contrib modules. It adds 3  fast and CPU-efficient functions written in C to work with a very large set of pre-sorted arrays.

To apply these patches, execute the following commands and then rebuild PostgreSQL from source codes, as usual:

Listing 1: Applying patches
$ cd postgresql-8.3.*
$ patch < /your/path/to/dklab_postgresql_patch_intagg.patch
$ patch < /your/path/to/dklab_postgresql_patch_intarray.patch
$ ./configure && make install

View the patch source codes:
dklab_postgresql_patch_intagg.patch
dklab_postgresql_patch_intarray.patch

intagg.int_array_append_aggregate(int[]): merge arrays into one large list

Special aggregate function int_array_append_aggregate() allows you to merge a very large set of large arrays together in one aggregate query. Merging is very fast, because the function uses a number of memory allocation and copy optimizations.

Listing 2: int_array_append_aggregate usage sample
CREATE TABLE doc(doc_id INTEGER, doc_word_ids INTEGER[]);
INSERT INTO doc(doc_id, doc_word_ids) VALUES(1, '{123, 456, 789, 111}');
...
SELECT int_array_append_aggregate(doc_word_ids) FROM doc;
/* Returns all document word IDs glued together. */

Even if doc table contains 10000 rows and each row contains 1000 word IDs in doc_word_ids, you will get merged result in no more than a couple of milliseconds.

For dummies 

There is a standard function int_array_aggregate that uses the same speed optimization and produces an integer array containing exactly the integers it is fed. But it merges a lot of integers into an array, but not a lot of input arrays into one large array.

intarray._int_group_count_sort(int[], bool): frequency-based sorting

INTEGER[][] _int_group_count_sort(list_sorted INTEGER[], sort_asc BOOLEAN)

Suppose you have a very large sorted array of integers and values are not unique within this array. You need to determine which values are most frequent an which — least frequent. Function _int_group_count_sort() helps you to do that with a very little CPU cost even for large arrays. In some cases the function is more than 1000 times faster than native "GROUP BY ... ORDER BY COUNT(*)" query.

Listing 3: _int_group_count_sort usage sample
list_sorted := '{ 1, 1, 1,   6, 6,   7, 7, 7, 7 }';
groups := _int_group_count_sort(list_sorted);
/* The result is: { { 4, 7 },  { 3, 1 },  { 2, 6 } } */

You see, the function returns two-dimension array. The first element of each sub-array is a frequency of an element, the second — the element's value itself. The result is sorted descendingly (sort_asc = false) or ascendingly (sort_asc = true) by elements frequency.

Note that input array MUST be sorted (use intarray.sort() function), else _int_group_count_sort() will work incorrectly.

For dummies 

The name _int_group_count_sort looks possibly strange, but
it's a common naming conversion for intarray contrib module...

intarray.bidx(int[], int): binary search in a sorted array

INTEGER bidx(INTEGER[] list_sorted, INTEGER value)

This function implements binary search algorythm for a sorted array. It returns position of the element value in sorted array list_sorted or 0 if no element is found at all. The algorythm is very fast and may process thousands of arrays in a millisecods. The function may be used if you need to post-filter a large set of IDs in your own stored procedure or SQL query. Here is an usage example:

Listing 4: bidx usage sample
list := ARRAY(SELECT * FROM generate_series(2,1000000,1));
SELECT bidx(list, 765432); /* returns 765431, the position of 765432 */
SELECT bidx(list, -100);   /* returns 0, no element is found */

Note that input array MUST be sorted (use intarray.sort() function), else bidx() will work incorrectly.

Conclusions

dkLab PostgreSQL patch may be helpful in a heavy-loaded project, when pure relational SQL features are not enough, but you need to work with large sets of integers (e.g. object IDs).





Dmitry Koterov, Dk lab. ©1999-2017
GZip
Add to Del.icio.us   Digg It!   Reddit