Near-realtime people deduplication with Postgres (DuckDB) #2748
- You are my hero and probably saved me a month or two. Just starting down the path and was going to attempt the Postgres backend (rationale: we use it for everything else).
- Just to say thanks so much for this - really appreciate the write-up!
- Thanks for the write-up as well. I am currently trying to use that function. What does this mean? The function has a `blocking_rules` parameter; is that parameter not working? Is it a bug or intended?
Hi,
Foremost I want to thank all the authors and contributors of this great tool! One of the selling points for me was the extensive documentation, which describes both the library usage aspects and the math background behind it. It was only later, when I started using it, that I came to appreciate the additional tooling it provides for analysing the model, assessing the results, etc.
I want to share our experience implementing near-realtime dedupe for people records stored in Postgres, since some aspects seemed doubtful to me at the beginning.
What we had
Our path
After we had a more or less stable model tested with artificial data loaded into DuckDB, we tried to port it to native Postgres. While Postgres has all the necessary functions (fuzzystrmatch + [pg_similarity](https://github.com/eulerto/pg_similarity)), their performance wasn't acceptable at all.
We didn't dig much into what exactly was slow. Instead we switched back to DuckDB with the Postgres extension.
What we did
We decided to approach dealing with existing data and incremental updates separately, since existing data could be handled in bulk more effectively.
Also, for the sake of keeping dependencies separate and not affecting the monolith's performance, we decided to implement dedupe as a micro-service.
```mermaid
architecture-beta
    group mono(cloud)[Monolith]
    group splink(cloud)[Dedupe Service]

    service db(database)[Postgres] in mono
    service monolith(server)[Monolith] in mono
    db:L -- R:monolith

    service dedupe(server)[Web Service] in splink
    service duck(database)[DuckDB] in splink
    db:L -- R:duck
    duck:L -- R:dedupe
```
Bulk dedupe
This part is well documented; we followed the docs and it worked well. It took ~3 hrs on a GCP c3-highcpu-44 instance.
This stage produced a JSON file with the mapping, which we then imported into our monolith app.
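For context, a minimal sketch of what that bulk stage could look like with the DuckDB Postgres extension; the table name, connection string, model path and thresholds below are illustrative assumptions rather than our exact setup:

```python
import duckdb
from splink import DuckDBAPI, Linker

# Pull the whole people table from Postgres into DuckDB once, in bulk.
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=app user=app host=pg' AS pg_db (TYPE postgres, READ_ONLY);")
con.execute("CREATE TABLE people AS SELECT * FROM pg_db.people")

# Trained model exported earlier (e.g. with linker.misc.save_model_to_json).
linker = Linker("people", "people_dedupe_model.json", db_api=DuckDBAPI(connection=con))

# Score candidate pairs, cluster them, and export a record -> cluster mapping.
predictions = linker.inference.predict(threshold_match_probability=0.9)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)
clusters.as_pandas_dataframe()[["unique_id", "cluster_id"]].to_json(
    "dedupe_mapping.json", orient="records"
)
```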
Realtime incremental dedupe
This required some more manual work. It appeared that `linker.inference.find_matches_to_new_records` doesn't take blocking rules into account :-( and fetching 4M rows from Postgres into the app on every request doesn't sound realtime-ish. So we implemented our own blocking-rules layer, which takes data from Postgres and moves it to a temporary DuckDB table.
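A rough sketch of what that per-request flow could look like; the connection string, column names, blocking keys and model path are illustrative assumptions, not our exact implementation:

```python
import duckdb
from splink import DuckDBAPI, Linker

# One connection per service process; pg_db is the monolith's Postgres.
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=app user=app host=pg' AS pg_db (TYPE postgres, READ_ONLY);")


def find_matches(new_record: dict):
    """Block candidates in Postgres, then score them with Splink in DuckDB."""
    # Our own blocking layer: the filter is sent to Postgres via postgres_query(),
    # so only plausible candidates are copied into a temporary DuckDB table.
    # (Values are inlined for brevity; real code must escape/validate them.)
    con.execute(f"""
        CREATE OR REPLACE TEMP TABLE blocked_people AS
        SELECT * FROM postgres_query('pg_db', $$
            SELECT id AS unique_id, first_name, last_name, email, dob
            FROM people
            WHERE dob = '{new_record['dob']}'
               OR lower(email) = lower('{new_record['email']}')
        $$)
    """)

    # Score the new record only against the blocked candidates.
    linker = Linker(
        "blocked_people",
        "people_dedupe_model.json",  # trained model saved from the bulk stage
        db_api=DuckDBAPI(connection=con),
    )
    return linker.inference.find_matches_to_new_records(
        [new_record]  # dict with the same columns as blocked_people, incl. unique_id
    ).as_pandas_dataframe()
```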
We then run `find_matches_to_new_records` on that smaller table (in our case its size varies from a couple hundred rows up to a dozen thousand).
What we managed to achieve
In terms of realtime matching performance and resource utilisation, we handle a load of ~70 requests per minute with acceptable average response latency, on 0.3 vCPU and 1.5 GB RAM on a GCP n2d-standard-8 node. This works pretty well for our purposes, and it could be sped up by providing more resources.
Caveats
Fuzzy match on Postgres
Is damn slow, as I mentioned above. I suspect `pg_similarity`: we make heavy use of Jaro-Winkler and Levenshtein, and they're unable to use indexes. Where DuckDB took at most a few seconds to find a match for a record in a large table, Postgres took several dozen seconds or even a few minutes to do the same.
DuckDB extension for Postgres
Can't effectively push down query filters. Well, it can, but it is very limited in doing so.
This effectively means that once you do `SELECT * FROM pg_db.some_large_table WHERE column LIKE '%@gmail.com'`, it will fetch the whole table from Postgres first and only then filter it. The same is true for JSON fields: `SELECT data->'$.family' FROM pg_db.table_with_json_column` will fetch the whole `data` field into DuckDB and only after that apply the `->` operator. The workaround for both cases is to do this work manually and push all of that down into the Postgres query.
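For illustration, a minimal sketch of that workaround, reusing a DuckDB connection `con` with the Postgres database attached as `pg_db` (table and column names are made up): the inner query passed to `postgres_query()` is executed by Postgres itself, so only the matching rows, with the JSON key already extracted, are transferred to DuckDB.

```python
# Naive forms: DuckDB fetches the whole table (or the whole json column) first,
# then filters / extracts locally.
con.sql("SELECT * FROM pg_db.some_large_table WHERE email LIKE '%@gmail.com'")
con.sql("SELECT data->'$.family' FROM pg_db.table_with_json_column")

# Manual pushdown: the inner queries run inside Postgres, so only matching rows
# and the already-extracted JSON key come back. Note the Postgres jsonb operator
# data->'family' instead of DuckDB's JSONPath form data->'$.family'.
con.sql("""
    SELECT * FROM postgres_query('pg_db',
        $$ SELECT * FROM some_large_table WHERE email LIKE '%@gmail.com' $$)
""")
con.sql("""
    SELECT * FROM postgres_query('pg_db',
        $$ SELECT data->'family' AS family FROM table_with_json_column $$)
""")
```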
I hope this post saves somebody a day.