Contributor

How to filter duplicate records and take latest record

Hi Have below scenario --

There are multiple entries for single employee from source . I need to pick latest record and create hierarchy based on that . Below is the example from source --

 Emp id Date Country Dept Sal 1 1/1/2018 Australia HR 100 1 10/1/2019 Australia HR 150 1 11/22/2020 Australia BD 200 2 1/1/2018 Australia HR 300 2 10/1/2020 Australia HR 350

Now for hierarchy creation I want to take  the latest entry ,in this example i want to take 3rd entry for emp 1 and 2nd entry for emp 2

Under australia ,emp 1 should have date as 11/22/2020 and dept BD

emp 2 should have date as 10/1/2020 and dept HR

Now for Calculation purpose ,one calculation should be based on latest entry < 1st Nov 20 and other calculation should be based on latest entry irrespective of date .

In this case ,

for emp 1 calculation one should take salary from second entry (with date 10/1/2019) and calculation 2 should take salary from 3rd entry (with date 11/22/2020)

for emp 2 calculation one should take salary from second entry (with date 10/1/2020) and calculation 2 should also take salary from 2nd entry (with date 10/1/2020)

Any help is much appreciated .Thanks in advance !!

3 REPLIES 3
Community Boss

Re: How to filter duplicate records and take latest record

For hierarchy purpose create a rank line item with this formula:
RANK(Date, DESCENDING, MINIMUM, TRUE, Emp ID)
By using the Emp ID in the rank group argument, the ranks will be split for each employee. So the first three entries will have the rank of 3, 2, 1 respectively. The next employee will have its own ranking as well

Now to create the hierarchy, create line items that exclude the metadata without the rank 1. For example the dates line item would be:

IF Rank Hierarchy = 1 THEN Dates ELSE BLANK. Create similar line items for Country, Dept and Salary. Then use those line items to create the employee list after excluding the blanks

For the calculation create another rank line item called Rank Calculation with the formula:

RANK(Date, ASCENDING, MINIMUM, Dates <= START(Time.'Nov 20'), Emp ID)

Now this formula will automatically exclude the entries after Nov 20 (They will have a rank of NaN) and just as before all the entries before 11/01/2020 will have their individual rank. I have used November 2019 in my screenshot:

Again create line items which exclude all entries except rank = 1. Dates formula would be:

IF Rank Calculation = 1 THEN Dates ELSE BLANK. Create similar line items for Country, Dept and Salary. Then use those line items to do your calculation 1

For calculation 2 use the same line items that you used to create the hierarchy

Let me know if you need more help with this

Regards,

Anirudh

Contributor

Re: How to filter duplicate records and take latest record

Thanks Anirudh for detailed explanation  , will try this and let you know .

Moderator

Re: How to filter duplicate records and take latest record

It would be better if you load the data dimensionally to a module.  Essentially, by loading the data in the above picture, you are making Time be the unique member and this exponentially increases your list.  For more information on this, please take a look at this article:  Data Hubs: Purpose and Peak Performance

To get around this, create a list of Employee ID's.  Create a module dimensionalized by Employee ID and Time (day level), with line items for Country, Department, and Salary.  Will this be more sparse?  Yes, but it will allow for more scalability in the future and will allow Hyperblock to work more efficiently.

Rob