My goal is to load transactions only for those customers that are present in the input view, and then do some processing on them later.
The current implementation is:
1. HDFS has data for customers with columns (cust_id, date, uuid).
First, I read this data and create an input view on it.
2. Later in the pipeline, I create a view from the transactions table of the DB, which has the schema
(transaction_id, customer_id, date, transaction_amt, tran_type, current_amt).
3. At this point I have both views, input and transactions. Then I run Spark SQL on them (the full pipeline is sketched below): "SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type FROM transactions t JOIN input i ON i.cust_id = t.customer_id AND i.date = t.date"
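For reference, the current pipeline looks roughly like this (a minimal sketch; the HDFS path, file format, JDBC URL, and credentials are placeholders, not the real values):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("txn-pipeline").getOrCreate()

// Step 1: read the customer data from HDFS and register it as the "input" view.
// Path and format are placeholders for illustration only.
val inputDf = spark.read.parquet("hdfs:///data/customers")
inputDf.createOrReplaceTempView("input")

// Step 2: read the whole transactions table over JDBC and register it as a view.
// URL and credentials are placeholders.
val txnDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "transactions")
  .option("user", "user")
  .option("password", "password")
  .load()
txnDf.createOrReplaceTempView("transactions")

// Step 3: join the two views with Spark SQL.
val joined = spark.sql(
  """SELECT i.cust_id, t.transaction_id, t.transaction_amt, t.tran_type
    |FROM transactions t
    |JOIN input i
    |  ON i.cust_id = t.customer_id AND i.date = t.date""".stripMargin)
```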
The problem is that Spark loads the entire transactions table from the database just to create the view, and only applies the join filter afterwards, which is not efficient.
I want to achieve filter pushdown into the RDBMS as well, similar to the pushdown Spark does for HDFS sources, so that only the matching customers' transactions are read from the database.
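The closest workaround I am aware of is to collect the join keys from the input view and embed them in a subquery passed through the JDBC dbtable option, roughly like the sketch below (connection details are placeholders again, and it assumes the distinct key set is small enough to inline in a literal IN list). Is there a cleaner way to get Spark to push the join filter down itself?

```scala
// Collect the distinct join keys from the input view. This assumes cust_id is
// a string column and that the distinct key set is small and non-empty
// (an empty IN () list would be invalid SQL).
val custIds = spark.table("input")
  .select("cust_id")
  .distinct()
  .collect()
  .map(_.getString(0))
  .mkString("'", "','", "'")

// Pass a subquery as "dbtable" so the WHERE clause executes inside the
// database and only matching customers' rows cross the wire. The date
// condition is still applied by the Spark join afterwards.
val pushdownQuery =
  s"(SELECT * FROM transactions WHERE customer_id IN ($custIds)) AS t"

val filteredTxns = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder URL
  .option("dbtable", pushdownQuery)
  .option("user", "user")
  .option("password", "password")
  .load()
filteredTxns.createOrReplaceTempView("transactions")
```

As far as I can tell, this only helps when the key list is small; for large key sets it seems the choices are staging the keys in a temp table on the database side or accepting the full scan.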