During ETL extraction and loading into the star-schema based warehouse, we generally confront with the idea of using outer join or correlated subqeries to add dimension table primary key into the fact table. to clarify more, take an example below:
INSERT INTO w_emp_f
SELECT a.row_wid loc_wid,
b.ename,
b.address
FROM w_emp_fs b
LEFT OUTER JOIN w_location_d a
ON a.datasource_num_id = b.datasource_num_id
AND a.integration_id = b.integration_id
In the example above, row_wid is the unique identifier of the row in Location dimension which needs to be added to Employee Fact table. In reality, there could be up to 30-40 dimensions which need to be joined to Fact table. Outer joining these dimensions is one of the way, but it could be a big overhead if the dimension tables are very large e.g. >2million records. Outer join may not perform.
This is when correlated subqueries come to rescue. We can frame the above load query as follows:
INSERT INTO w_emp_f
SELECT (SELECT row_wid
FROM w_location_d a
WHERE a.datasource_num_id - b.data_source_num_id
AND a.integration_id = b.integration_id) loc_wid,
b.ename,
b.address
FROM w_emp_fs b
Correlated query would only perform good if the dimension columns used in join condition are indexed. This way DB memory can cache the index and return the records from the dimension table fast for each of the record in fact table.
We can choose to segregate dimensions attached to the fact table based on their size. Depending on the memory configuration of the DB system we can hit&trial the size threshold. Dimensions having size lower than this threshold can be outer joined to the fact table and the ones with size higher than the threshold can be pushed to correlated subqueries.
Sunday, January 11, 2009
Subscribe to:
Post Comments (Atom)
2 comments:
Hi Nitin,
Good one...small correction tho..
These are known as Scalar sub-queries, not correlated sub-queries.
Thanks,
Ashok
Hi Ashok,
Please refer to:
http://publib.boulder.ibm.com/infocenter/rbhelp/v6r3/index.jsp?topic=/com.ibm.redbrick.doc6.3/sqlrg/sqlrg44.htm
Excerpts:
------------------------
A scalar subquery returns a single scalar value (one column, one row) and can occur either in a select list or in a condition as an argument of a comparison operator.
A correlated subquery, however, contains cross-references to the parent query that can force the execution of the subquery each time the parent retrieves a new row.
Post a Comment