Filter condition and deduplication #2678
Setup

I am wondering how to correctly specify a filter condition in blocking rules when deduplicating a dataset. Specifically, suppose I observe data on individuals over different time periods, where the period is indicated by two columns: the first year of observation (`year_first`) and the last year of observation (`year_last`). My approach is to define a filter condition that restricts comparisons to individuals observed over different time periods (as I know that there are no duplicates within a single time period), i.e., `l.year_last < r.year_first`.

Question

With […] I believe that a potential workaround is to include two separate blocking rules, one with a filter condition […]

A secondary comment is that it took me a while to understand why Splink did not perform some comparisons with my filter condition. Unless I am misunderstanding/abusing filter conditions, it would be nice to have better documentation on the fact that […]
Replies: 1 comment 1 reply
I think the answer is simply to ensure that `unique_id` is sorted in ascending order according to `[year_first, year_last]` (in which case the `link_type_join_condition` enforced by Splink doesn't affect the filter condition `l.year_last < r.year_first`).
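To illustrate the suggestion, here is a minimal sketch that does not call Splink at all. It assumes that the join condition Splink applies when deduplicating is `l.unique_id < r.unique_id` (my reading of the reply, not something confirmed in this thread), and reproduces that with a plain SQLite self-join. The toy records, the table name `df`, and the helpers `candidate_pairs` and `reassign_ids` are all hypothetical.

```python
import sqlite3

# Hypothetical toy records: (unique_id, year_first, year_last).
# The ids are deliberately NOT in chronological order.
records = [
    (1, 2010, 2012),
    (2, 2000, 2003),
    (3, 2005, 2008),
]

# Assumed dedupe join condition (l.unique_id < r.unique_id) combined with
# the filter condition from the question (l.year_last < r.year_first).
PAIRS_SQL = """
SELECT l.unique_id, r.unique_id
FROM df AS l
JOIN df AS r
  ON l.unique_id < r.unique_id
 AND l.year_last < r.year_first
ORDER BY 1, 2
"""


def candidate_pairs(rows):
    """Return the (left_id, right_id) pairs that survive both conditions."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE df (unique_id INTEGER, year_first INTEGER, year_last INTEGER)"
    )
    con.executemany("INSERT INTO df VALUES (?, ?, ?)", rows)
    pairs = con.execute(PAIRS_SQL).fetchall()
    con.close()
    return pairs


def reassign_ids(rows):
    """Assign new ids 1..n in ascending (year_first, year_last) order."""
    ordered = sorted(rows, key=lambda row: (row[1], row[2]))
    return [(new_id, yf, yl) for new_id, (_, yf, yl) in enumerate(ordered, start=1)]


pairs_original = candidate_pairs(records)
pairs_resorted = candidate_pairs(reassign_ids(records))

print(pairs_original)  # only one of the three non-overlapping pairs survives
print(pairs_resorted)  # all three non-overlapping pairs survive
```

With the original ids, two of the three pairs with non-overlapping periods are dropped because the earlier record happens to have the larger `unique_id`. After re-assigning ids in `(year_first, year_last)` order, any pair satisfying `l.year_last < r.year_first` also satisfies `l.unique_id < r.unique_id`, provided `year_first <= year_last` on every row, so nothing is lost.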