Incremental updates in HDInsight


How do you load dimension tables into Hadoop incrementally? The best option would be to use Sqoop and run an incremental load. Sqoop is a great tool, but it runs only within the cluster. If your cluster has access to all the needed data, great. But what if, instead of pulling data, you are only allowed to receive pushes? Files containing new changes are pushed from source systems, and it's up to you to merge them with the base table. That's where Hive comes into the picture. This short post describes how to merge incremental loads with the base load efficiently.
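To make the scripts below concrete, here is a minimal sketch of the two tables involved. The column names (field1) and the HDFS location are assumptions for illustration; only id and modified_date matter for the merge logic:

-- base_table holds the full dimension snapshot.
CREATE TABLE base_table (
    id INT,
    field1 STRING,
    modified_date TIMESTAMP
)
STORED AS ORC;

-- incremental_table points at the directory where source systems push change files.
CREATE EXTERNAL TABLE incremental_table (
    id INT,
    field1 STRING,
    modified_date TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/incremental_table';  -- hypothetical path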

Hortonworks suggests this setup in Four-step strategy for incremental updates. In the RECONCILE_VIEW section you'll find this "merge" script:

CREATE VIEW reconcile_view AS
SELECT t1.*
FROM (SELECT * FROM base_table
      UNION ALL
      SELECT * FROM incremental_table) t1
JOIN (SELECT id, MAX(modified_date) max_modified
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) t2
      GROUP BY id) s
ON t1.id = s.id AND t1.modified_date = s.max_modified;
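In the same tutorial the view is then materialized in a compaction step, so that downstream queries read a plain table instead of re-running the merge. A minimal sketch, assuming the view above:

-- Rebuild the reporting table from the reconciled view.
DROP TABLE IF EXISTS reporting_table;
CREATE TABLE reporting_table AS
SELECT * FROM reconcile_view;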

Is there a way to make it more efficient? Yes! Joins are heavy, slow and … did I mention slow? In another blog post, 5 ways to make Hive queries run faster, Hortonworks uses Hive's OLAP functionality (OVER and RANK) to achieve the same thing, but without a join.

CREATE VIEW reconcile_view AS
SELECT t1.*
FROM (SELECT t2.*,
             -- newest row per id ranks first
             RANK() OVER (PARTITION BY id ORDER BY modified_date DESC) AS rank
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) t2) t1
-- filter in an outer query: a window-function alias can't be
-- referenced in the WHERE clause of the same SELECT
WHERE t1.rank = 1;
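When materializing this version, remember that the view carries the helper rank column along. A minimal sketch, assuming the hypothetical columns and the reporting_table from above:

-- Refresh the reporting table, dropping the helper rank column.
INSERT OVERWRITE TABLE reporting_table
SELECT id, field1, modified_date
FROM reconcile_view;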

