Near-realtime people deduplication with Postgres (DuckDB) #2748
- You are my hero and probably saved me a month or two. Just starting down the path and was going to attempt the Postgres backend (rationale: we use it for everything else).
- Just to say thanks so much for this - really appreciate the write-up!
- Thanks for the write-up as well. I am currently trying to use that function. What does this mean? The function has a `blocking_rules` parameter; is that parameter not working? Is it a bug or intended?
Hi,
Foremost I want to thank all the authors and contributors of this great tool! One of the selling points for me was the extensive documentation, which describes both the library usage aspects and the math background behind it. It was only later, when I started using it, that I came to appreciate the additional tooling it provides for analysing the model, assessing the results, etc.
I want to share our experience implementing near-realtime dedupe for people records stored in Postgres, since some aspects seemed doubtful to me at the beginning.
What we had
Our path
After we had a more or less stable model tested with artificial data loaded into DuckDB, we tried to port it to native Postgres. While Postgres has all the necessary functions (fuzzystrmatch + [pg_similarity](https://github.com/eulerto/pg_similarity)), their performance wasn't acceptable at all.
We didn't dig much into what exactly was slow. Instead we switched back to DuckDB with the Postgres extension.
What we did
We decided to approach dealing with existing data and incremental updates separately, since existing data could be handled in bulk more effectively.
Also, for the sake of keeping dependencies separate and not affecting the monolith's performance, we decided to implement dedupe as a micro-service.
```mermaid
architecture-beta
    group mono(cloud)[Monolith]
    group splink(cloud)[Dedupe Service]

    service db(database)[Postgres] in mono
    service monolith(server)[Monolith] in mono
    db:L -- R:monolith

    service dedupe(server)[Web Service] in splink
    service duck(database)[DuckDB] in splink
    db:L -- R:duck
    duck:L -- R:dedupe
```
Bulk dedupe
This part is well documented; we followed the docs and it worked well. It took ~3 hrs on a GCP c3-highcpu-44 instance.
This stage produced a JSON file with the mapping, which we then imported into our monolith app.
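For context, a minimal sketch of what that bulk stage could look like with the DuckDB Postgres extension; the table name, connection string, model path and thresholds below are illustrative assumptions rather than our exact setup:

```python
import duckdb
from splink import DuckDBAPI, Linker

# Pull the whole people table from Postgres into DuckDB once, in bulk.
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=app user=app host=pg' AS pg_db (TYPE postgres, READ_ONLY);")
con.execute("CREATE TABLE people AS SELECT * FROM pg_db.people")

# Trained model exported earlier (e.g. with linker.misc.save_model_to_json).
linker = Linker("people", "people_dedupe_model.json", db_api=DuckDBAPI(connection=con))

# Score candidate pairs, cluster them, and export a record -> cluster mapping.
predictions = linker.inference.predict(threshold_match_probability=0.9)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
clusters.as_pandas_dataframe()[["unique_id", "cluster_id"]].to_json(
    "dedupe_mapping.json", orient="records"
)
```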
Realtime incremental dedupe
This required some more manual work. It appeared that `linker.inference.find_matches_to_new_records` doesn't take blocking rules into account :-( and fetching 4M rows from Postgres into the app on every request doesn't sound realtime-ish. So we implemented our own blocking-rules layer, which takes data from Postgres and moves it to a temporary DuckDB table.
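A rough sketch of what that per-request flow could look like; the connection string, column names, blocking keys and model path are illustrative assumptions, not our exact implementation:

```python
import duckdb
from splink import DuckDBAPI, Linker

# One connection per service process; pg_db is the monolith's Postgres.
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=app user=app host=pg' AS pg_db (TYPE postgres, READ_ONLY);")


def find_matches(new_record: dict):
    """Block candidates in Postgres, then score them with Splink in DuckDB."""
    # Our own blocking layer: the filter is sent to Postgres via postgres_query(),
    # so only plausible candidates are copied into a temporary DuckDB table.
    # (Values are inlined for brevity; real code must escape/validate them.)
    con.execute(f"""
        CREATE OR REPLACE TEMP TABLE blocked_people AS
        SELECT * FROM postgres_query('pg_db', $$
            SELECT id AS unique_id, first_name, last_name, email, dob
            FROM people
            WHERE dob = '{new_record['dob']}'
               OR lower(email) = lower('{new_record['email']}')
        $$)
    """)

    # Score the new record only against the blocked candidates.
    linker = Linker(
        "blocked_people",
        "people_dedupe_model.json",  # trained model saved from the bulk stage
        db_api=DuckDBAPI(connection=con),
    )
    return linker.inference.find_matches_to_new_records(
        [new_record]  # dict with the same columns as blocked_people, incl. unique_id
    ).as_pandas_dataframe()
```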
We then run `find_matches_to_new_records` on that smaller table (in our case its size varies from a couple hundred rows up to a dozen thousand).
What we managed to achieve
In terms of realtime matching performance and resource utilisation, we handle a load of ~70 requests per minute with acceptable average response latency, on 0.3 vCPU and 1.5 GB RAM on a GCP n2d-standard-8 node. This works pretty well for our purposes, and it could be sped up by providing more resources.
Caveats
Fuzzy match on Postgres
Is damn slow, as I mentioned above. I suspect `pg_similarity`: we make heavy use of Jaro-Winkler and Levenshtein, and they're unable to use indexes. Where DuckDB took at most a few seconds to find a match for a record in a large table, Postgres took several dozen seconds or even a few minutes to do the same.
DuckDB extension for Postgres
Can't effectively push down query filters. Well, it can, but it is very limited in doing so.
This effectively means that once you do `SELECT * FROM pg_db.some_large_table WHERE column LIKE '%@gmail.com'`, it will fetch the whole table from Postgres first and only then filter it. The same is true for JSON fields: `SELECT data->'$.family' FROM pg_db.table_with_json_column` will fetch the whole `data` field into DuckDB and only after that apply the `->` operator. The workaround for both cases is to do this work manually and push all of that down into the Postgres query.
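For illustration, a minimal sketch of that workaround, reusing a DuckDB connection `con` with the Postgres database attached as `pg_db` (table and column names are made up): the inner query passed to `postgres_query()` is executed by Postgres itself, so only the matching rows, with the JSON key already extracted, are transferred to DuckDB.

```python
# Naive forms: DuckDB fetches the whole table (or the whole json column) first,
# then filters / extracts locally.
con.sql("SELECT * FROM pg_db.some_large_table WHERE email LIKE '%@gmail.com'")
con.sql("SELECT data->'$.family' FROM pg_db.table_with_json_column")

# Manual pushdown: the inner queries run inside Postgres, so only matching rows
# and the already-extracted JSON key come back. Note the Postgres jsonb operator
# data->'family' instead of DuckDB's JSONPath form data->'$.family'.
con.sql("""
    SELECT * FROM postgres_query('pg_db',
        $$ SELECT * FROM some_large_table WHERE email LIKE '%@gmail.com' $$)
""")
con.sql("""
    SELECT * FROM postgres_query('pg_db',
        $$ SELECT data->'family' AS family FROM table_with_json_column $$)
""")
```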
I hope this post saves somebody a day.