How is the data being duplicated in the data hub? To answer the question: yes, it is possible to avoid duplicate data, but it is usually better and more efficient if duplicate data never enters the data hub in the first place.
1) Understand the data and metadata. Ask the extraction team to provide a unique key in the file; if need be, concatenate several fields in the extraction itself to form the unique key. If there are still duplicates, then the key is not truly unique. Highlight this to the extraction team. I would not let duplicate data enter the data hub. Keep refining the unique key until it is truly unique.
2) Keep formulas in the data hub to a minimum. In particular, try to avoid concatenations; if possible, push them to the extraction layer itself.
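As a rough sketch of step 1, the check below concatenates several fields into a candidate unique key and reports any key that occurs more than once, so duplicates can be flagged back to the extraction team before the file is loaded. The field names (`region`, `sku`, `month`) and the `|` delimiter are illustrative assumptions, not from the original post.

```python
from collections import Counter

def check_unique_key(rows, key_fields):
    # Build a candidate unique key by concatenating the chosen fields
    # (hypothetical field names; "|" is an arbitrary delimiter).
    keys = ["|".join(str(row[f]) for f in key_fields) for row in rows]
    # Any key seen more than once means the key is not truly unique.
    return {key: count for key, count in Counter(keys).items() if count > 1}

rows = [
    {"region": "EMEA", "sku": "A1", "month": "Jan"},
    {"region": "EMEA", "sku": "A1", "month": "Jan"},  # duplicate record
    {"region": "APAC", "sku": "A1", "month": "Jan"},
]
print(check_unique_key(rows, ["region", "sku", "month"]))
# {'EMEA|A1|Jan': 2}
```

If this returns a non-empty result, the file should go back to the extraction team for refinement rather than being loaded into the data hub.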