Reference Deduplication#

Terminology & Concepts #

Duplicate references are grouped together. Each group has one canonical reference and zero or more duplicates.

When we search, we look for a canonical reference to which to attach the incoming reference as a duplicate. If we cannot find a canonical reference, the incoming reference becomes a canonical reference with no duplicates. We search for canonical references when we Import a Reference (with work pending to also search when we ingest a relevant enhancement).

A canonical reference is the primary reference of a group of duplicates. The choice of canonical reference is arbitrary, at least for now. The Deduplicated Projection is the same regardless of canonical choice. A canonical reference has an active duplicate decision of Canonical.
A duplicate reference is a reference which has been determined to be a duplicate of a canonical reference. A duplicate reference has an active duplicate decision of Duplicate.
A duplicate decision is the outcome of the deduplication process for a given reference. A reference has at most one active duplicate decision, but may have multiple historical decisions. For instance, a canonical reference has an active decision of canonical. See ReferenceDuplicateDecision for more.
A candidate duplicate is a reference which has been identified as a potential canonical of an incoming reference, but has not yet been compared in detail.
An exact duplicate is a reference for which a supersetting reference (one that wholly contains it) is already present in the repository. These are not imported, but a duplicate decision with a determination of Exact Duplicate is still registered for them.

It may also help to think of a group of duplicating references as a star graph. The canonical reference is the center of the star, and all duplicates point to it. Duplicates do not point to other duplicates (more on that in Action Decision).

        flowchart BT
    D1(Duplicate)
    D2(Duplicate)
    D3(Duplicate)
    D4(Duplicate)
    D5(Duplicate)
    D6(Duplicate)
    D7(Duplicate)
    C[Canonical]

    D1 --> C
    D2 --> C
    D3 --> C
    D4 --> C
    D5 --> C
    D6 --> C
    D7 --> C

Note also that deduplication doesn’t necessarily occur at import time; it may also be triggered manually or by a new enhancement.
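The star-graph structure described above can be sketched as a small data structure. This is only an illustration: `DuplicateGroup`, `attach`, and `edges` are hypothetical names and are not part of the repository's code.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class DuplicateGroup:
    """Hypothetical star graph: one canonical at the centre, duplicates pointing at it."""

    canonical_id: UUID
    duplicate_ids: set[UUID] = field(default_factory=set)

    def attach(self, reference_id: UUID) -> None:
        """Attach a reference directly to the canonical, never to another duplicate."""
        if reference_id == self.canonical_id:
            raise ValueError("The canonical reference cannot duplicate itself.")
        self.duplicate_ids.add(reference_id)

    def edges(self) -> list[tuple[UUID, UUID]]:
        """Return (duplicate, canonical) pairs; every edge ends at the canonical."""
        return [(dup, self.canonical_id) for dup in sorted(self.duplicate_ids, key=str)]


canonical = uuid4()
group = DuplicateGroup(canonical_id=canonical)
group.attach(uuid4())
group.attach(uuid4())
```

Because every edge targets the canonical, the group never forms chains of duplicates pointing at other duplicates.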

High Level Process #

        flowchart LR

    R[[Repository Process]]
    P[(Register Pending Decision)]
    T[Initiate Duplicate Decision]
    IS[[Identifier Shortcut]]
    CS[[Candidate Selection]]
    DD[[Deep Deduplication]]
    A[[Action Decision]]
    DP>Deduplicated Projection]

    T-->IS
    IS-->|Shortcut|A
    IS-->|No shortcut|CS
    CS-->|No candidates found|A
    CS-->|Candidates found|DD
    DD-->A
    R-->P
    P-.->|Queue|T
    A~~~DP

There are six key steps:

Exact Duplicates - an early-exit check to see if the incoming reference is an exact duplicate of an existing reference. This occurs outside the main deduplication flow.
Identifier Shortcut - a fast-path check to see if the incoming reference has any unique identifiers that match an existing reference.
Candidate Selection - a high-recall, low-precision search to find potential canonical references.
Deep Deduplication - a high-precision comparison of the incoming reference against each candidate to determine if it duplicates the candidate.
Action Decision - deciding what to do with the reference based on the deduplication results.
Deduplicated Projection - the output of the process, the final representation of the deduplicated reference.
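As a rough illustration of how the steps in the flowchart chain together, the routing can be written as a small function. The function name and its parameters are assumptions made for this sketch; the real flow is driven by the service methods listed in the Function Reference.

```python
def steps_for(shortcut_matched: bool, candidate_count: int) -> list[str]:
    """Return the steps a queued duplicate decision passes through, in order."""
    steps = ["Identifier Shortcut"]
    if shortcut_matched:
        # A shortcut match skips straight to actioning the decision.
        return steps + ["Action Decision"]
    steps.append("Candidate Selection")
    if candidate_count > 0:
        # Deep deduplication only runs when there is something to compare against.
        steps.append("Deep Deduplication")
    steps.append("Action Decision")
    return steps
```

For example, `steps_for(False, 0)` skips Deep Deduplication, which corresponds to the incoming reference becoming canonical because no candidates were found.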

Identifier Shortcut #

        flowchart LR

    D["Duplicate Decision"]
    I["Get Unique Identifiers"]
    E[("Search for Identifiers")]
    CS["Go to Candidate Selection"]
    C1{"Matches Found?"}
    A["Go to Action Decision"]
    M["Manual Review"]
    C2{"Multiple Matches?"}
    C3{"Different Canonicals?"}
    F["For each unmapped match"]

    D-->I-->E-->C1
    C1-->|No|CS
    C1-->|Yes|C2
    C2-->|Yes|C3
    C2-->|No|A
    C3-->|Yes|M
    C3-->|No|F
    F-->A
    F-->A
    F-->A
    F-->A
    F-->A

The identifier shortcut is a high-precision, low-recall step that attempts to quickly determine the duplicate decision for an incoming reference based on its unique identifiers. These identifiers are configured by trusted_unique_identifier_types.

This is a very powerful operation that should be enabled with caution. It relies on both the uniqueness of the identifiers and the accuracy of the incoming data. An instance where it is suitable to be used is with OpenAlex IDs and OpenAlex imports, where we can verify both those assumptions.

There are a handful of possible outcomes, documented more fully in shortcut_deduplication_using_identifiers(), but in summary:

If no matches are found or no unique identifiers exist, we proceed to Candidate Selection.
If any matches are found, we build a duplicate decision tree for all of them - any undeduplicated references that are matched are included.
If the above is unresolvable (i.e. we find more than one existing duplicate decision tree), we raise the decision for manual review. This provides an important sense-check of our core assumptions.
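The outcomes above can be sketched as a classification over the matched references. This is a simplified illustration, not the implementation of shortcut_deduplication_using_identifiers(): `ShortcutOutcome`, `classify_shortcut`, and the shape of the `matches` argument are all assumptions.

```python
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class ShortcutOutcome(Enum):
    CONTINUE = "proceed to candidate selection"
    NEW_GROUP = "incoming reference becomes canonical of a new group"
    JOIN_GROUP = "attach to the single existing group's canonical"
    MANUAL_REVIEW = "more than one existing group matched"


def classify_shortcut(matches: dict[UUID, Optional[UUID]]) -> ShortcutOutcome:
    """Classify the shortcut result.

    `matches` maps each matched reference ID to the canonical ID of its group,
    or None if that reference has not been deduplicated yet (assumed shape).
    """
    if not matches:
        return ShortcutOutcome.CONTINUE
    canonicals = {canonical for canonical in matches.values() if canonical is not None}
    if len(canonicals) > 1:
        # Disconnected groups share a "unique" identifier: the assumption is broken.
        return ShortcutOutcome.MANUAL_REVIEW
    if len(canonicals) == 1:
        # Undeduplicated matches are folded into the same group.
        return ShortcutOutcome.JOIN_GROUP
    return ShortcutOutcome.NEW_GROUP
```

Folding the "all matches in one group" and "some matches in one group" situations into a single outcome keeps the sketch short; the real function distinguishes them because the second also generates decisions for the unmapped matches.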

Candidate Selection #

        flowchart LR

    D["Duplicate Decision"]
    SF["Project Search Fields"]
    ES[("Search Against ES")]
    C{"One or more candidates?"}
    CR["Decision = Canonical"]
    DD["Deep Dedup"]

    D-->SF
    SF-->ES
    ES-->C
    C-->|Yes|DD
    C-->|No|CR

Candidate selection employs a high-recall, low-precision approach to identify potential canonical references. The goal is to ensure that all possible canonicals are considered, even if it means including some false positives.

If no candidates are found, the incoming reference is immediately designated as a canonical reference.

At this stage, only canonical references are considered as candidates.

The search strategy is a work in progress, but will likely involve a combination of projected fields (defined in CandidateCanonicalSearchFields) and a fuzzy Elasticsearch query:

async ReferenceESRepository.search_for_candidate_canonicals(search_fields: CandidateCanonicalSearchFields, reference_id: UUID) → list[ESScoreResult]

Fuzzy match candidate fingerprints to existing references.

This is a high-recall search strategy.

NOT TESTED/EVALUATED. Thrown together as a proof of concept, this must be polished and evaluated before use.

The proof of concept does:

MUST: fuzzy match on title (requires 50% of terms to match)
SHOULD: partial match on authors list (requires 50% of authors to match)
FILTER: publication year within ±1 year range (non-scoring)

Parameters:

search_fields (CandidateCanonicalSearchFields) – The search fields of the potential duplicate.
reference_id (UUID) – The ID of the potential duplicate.

Returns:

A list of search results with IDs and scores.

Return type:

list[ESScoreResult]
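To make the proof-of-concept strategy more concrete, the following sketch builds an Elasticsearch query body with the same MUST / SHOULD / FILTER shape. The field names (`title`, `authors`, `publication_year`) and the helper `build_candidate_query` are assumptions for illustration, and the actual query in search_for_candidate_canonicals may differ.

```python
def build_candidate_query(title: str, authors: list[str], publication_year: int) -> dict:
    """Build a bool query: MUST title, SHOULD authors, FILTER year (non-scoring)."""
    bool_query: dict = {
        # MUST: fuzzy title match where at least half of the terms match.
        "must": [
            {"match": {"title": {"query": title, "fuzziness": "AUTO", "minimum_should_match": "50%"}}}
        ],
        # FILTER: year within +/- 1; filter clauses do not contribute to the score.
        "filter": [
            {"range": {"publication_year": {"gte": publication_year - 1, "lte": publication_year + 1}}}
        ],
    }
    if authors:
        # SHOULD: boosts the score when at least half of the authors match.
        bool_query["should"] = [
            {
                "bool": {
                    "should": [{"match_phrase": {"authors": author}} for author in authors],
                    "minimum_should_match": "50%",
                }
            }
        ]
    return {"query": {"bool": bool_query}}


query = build_candidate_query("Deep learning for citation matching", ["A. Smith", "B. Jones"], 2020)
```

Keeping the year constraint in `filter` rather than `must` means it narrows the result set without affecting relevance scores, which matches the "non-scoring" note above.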

Deep Deduplication #

        flowchart LR

    D[Duplicate Decision]
    R[Get References]
    DD[[Perform Deep Dedup]]
    C[Canonical Found?]
    A[Proceed to Actioning]
    M([Raise for Manual Review])

    D-->R-->DD-->C
    C-->|Yes|A
    C-->|No|A
    C-->|"Ambiguous/Uncertain"|M

If candidate canonicals are found, each is compared in detail against the incoming reference to determine if they are true duplicates. This step prioritizes precision over recall, aiming to minimize false positives.

This algorithm is still being built out. For now, we have a placeholder that we will update in the future:

async DeduplicationService.__placeholder_duplicate_determinator(reference_duplicate_decision: ReferenceDuplicateDecision) → ReferenceDuplicateDeterminationResult

Implement a basic placeholder duplicate determinator.

Temporary implementation: takes the first candidate as the duplicate. This is the one with the highest score in the candidate nomination stage. This completes the flow but should not be used in production.

Parameters: reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to determine duplicates for.
Returns: The result of the duplicate determination.
Return type: ReferenceDuplicateDeterminationResult

Manual Resolution #

Duplicate decisions that are Decoupled or Unresolved can be handled here. This is not yet implemented.

Action Decision #

Once the deduplication process is complete, the decision must be actioned. In essence, this involves activating the new decision unless there is a particular reason not to.

Special cases #

The bold lines in the flowchart indicate what we expect to be the nominal flow.

        flowchart LR

    N["New Decision (N)"]
    C1{"Active Decision Exists? (A)"}
    C2{A == N?}
    C3{A Canonical & N Duplicate?}
    C4{N is Canonical?}
    C5{N's Canonical is Canonical?}
    T[[Activate New Decision]]
    M([Mark for Manual Handling])

    N ==> C1
    C1 ==>|Yes| C2
    C1 -->|No| C4
    C2 ==>|Yes| T
    C2 -->|No| C3
    C3 ==>|Yes| T
    C3 -->|No| M
    C4 -->|No| C5
    C4 ==>|Yes| T
    C5 -->|No| M
    C5 ==>|Yes| T
    M ~~~ T

There are two cases where the new decision is not automatically activated:

The active decision is duplicate and the new decision is canonical or a duplicate of a different reference.
The new decision is duplicate but its canonical reference is not itself canonical.

Both of these can be handled automatically, but manual review allows us to highlight and understand the frequency and nature of these cases. The commentary around these is changing frequently, so we are not documenting them in detail here, but please reach out if you want more information!
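The branching in the flowchart above can be sketched as a single predicate. This is a minimal illustration under assumed names (`Determination`, `Decision`, `should_activate`); the real logic lives in map_duplicate_decision() and may handle more determinations than the two shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class Determination(Enum):
    CANONICAL = "canonical"
    DUPLICATE = "duplicate"


@dataclass(frozen=True)
class Decision:
    determination: Determination
    canonical_reference_id: Optional[UUID] = None  # set only for duplicates


def should_activate(
    new: Decision,
    active: Optional[Decision],
    new_canonical_is_canonical: bool = True,
) -> bool:
    """Return True to activate the new decision, False to mark it for manual handling."""
    if active is not None:
        if active == new:
            return True  # nothing changes
        # Only a canonical -> duplicate transition is activated automatically.
        return (
            active.determination is Determination.CANONICAL
            and new.determination is Determination.DUPLICATE
        )
    if new.determination is Determination.CANONICAL:
        return True
    # A new duplicate must point at a reference that is itself canonical.
    return new_canonical_is_canonical
```

For example, an active duplicate decision combined with a new canonical decision returns `False`, which corresponds to the first manual-handling case.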

Deduplicated Projection #

The end product of deduplication is a rich database with each individual reference, linked together with their duplicate decision history. However, this is not the most convenient format for most use cases. To this end, the default view for interfacing with the repository is the deduplicated projection.

The deduplicated projection is simply a consolidated Reference object, with enhancements and identifiers of its duplicates included in the canonical reference. This is the view that is indexed into Elasticsearch, and likely the view that most robots and users will interact with.

Also note that this projected view is reversible: data provenance is preserved through the reference_id field on each enhancement and identifier.
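A minimal sketch of the projection and its reversibility follows. `SimpleReference`, `SimpleEnhancement`, `project`, and `unproject` are hypothetical stand-ins for the real models, and identifiers are omitted for brevity since they would be merged in the same way.

```python
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(frozen=True)
class SimpleEnhancement:
    reference_id: UUID  # provenance: the reference this enhancement was attached to
    content: str


@dataclass
class SimpleReference:
    id: UUID
    enhancements: list[SimpleEnhancement] = field(default_factory=list)


def project(canonical: SimpleReference, duplicates: list[SimpleReference]) -> SimpleReference:
    """Fold the duplicates' enhancements into a copy of the canonical reference."""
    merged = list(canonical.enhancements)
    for duplicate in duplicates:
        merged.extend(duplicate.enhancements)
    return SimpleReference(id=canonical.id, enhancements=merged)


def unproject(projection: SimpleReference, reference_id: UUID) -> list[SimpleEnhancement]:
    """Recover the enhancements that originally belonged to one reference."""
    return [e for e in projection.enhancements if e.reference_id == reference_id]
```

Because each enhancement carries its original `reference_id`, filtering the projection by that field recovers exactly what each individual reference contributed.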

Exact Duplicates #

Exact duplicates are references which are wholly represented by an existing reference in the repository; that is, the existing reference is a superset of the incoming one. This does not form part of the main deduplication flow, but provides an early-exit optimisation for importers and enhancement processors.

Exact duplication is performed on individual references, not the deduplicated projection. This preserves any implied contextual information from the incoming reference.
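The superset idea can be illustrated with plain sets. The authoritative definition is Reference.is_superset(); the sketch below is only one plausible reading, and it assumes that a reference can be reduced to named sets of comparable values. `covers` and `first_exact_duplicate` are hypothetical helpers.

```python
from typing import Optional

# Assumed shape: each reference is a mapping such as
# {"identifiers": {...}, "enhancements": {...}} holding comparable values.
SketchReference = dict[str, set[str]]


def covers(existing: SketchReference, incoming: SketchReference) -> bool:
    """True if everything on the incoming reference is already on the existing one."""
    return all(values <= existing.get(key, set()) for key, values in incoming.items())


def first_exact_duplicate(
    incoming: SketchReference, repository: list[SketchReference]
) -> Optional[SketchReference]:
    """Return the first existing reference that covers the incoming one, if any."""
    return next((ref for ref in repository if covers(ref, incoming)), None)


existing = {"identifiers": {"doi:10.1000/x", "pm_id:1"}, "enhancements": {"abstract:v1"}}
incoming = {"identifiers": {"doi:10.1000/x"}}
match = first_exact_duplicate(incoming, [existing])
```

The check is asymmetric: the incoming reference here is covered by the existing one, but not the other way round, so importing `existing` into a repository that only held `incoming` would not be an early exit.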

Function Reference #

class app.domain.references.services.deduplication_service.DeduplicationService(anti_corruption_service: ReferenceAntiCorruptionService, sql_uow: AsyncSqlUnitOfWork, es_uow: AsyncESUnitOfWork)

Service for managing reference duplicate detection.

async determine_canonical_from_candidates(reference_duplicate_decision: ReferenceDuplicateDecision) → ReferenceDuplicateDecision

Determine a canonical reference from its candidates.

Parameters: reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to determine duplicates for.
Returns: The updated decision with the determination result.
Return type: ReferenceDuplicateDecision

async find_exact_duplicate(reference: Reference) → Reference | None

Find exact duplicate references for the given reference.

This is not part of the regular deduplication flow but is used to circumvent importing and processing redundant references.

Exact duplicates are defined in app.domain.references.models.models.Reference.is_superset(). A reference may have more than one exact duplicate; this method returns only the first.

Parameters: reference (app.domain.references.models.models.Reference) – The reference to find duplicates for.
Returns: The supersetting reference, or None if no duplicate was found.
Return type: app.domain.references.models.models.Reference | None

async map_duplicate_decision(new_decision: ReferenceDuplicateDecision) → tuple[ReferenceDuplicateDecision, bool]

Apply the persistence changes from the new duplicate decision.

If the new decision is not terminal, it is not made active.

Parameters: new_decision (ReferenceDuplicateDecision) – The new decision to apply.
Returns: The applied decision and whether it changed.
Return type: tuple[ReferenceDuplicateDecision, bool]

async nominate_candidate_canonicals(reference_duplicate_decision: ReferenceDuplicateDecision) → ReferenceDuplicateDecision

Nominate candidate canonical references for the given decision.

This uses the search strategy in app.domain.references.repository.ReferenceESRepository.search_for_candidate_canonicals.

Parameters: reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to find candidates for.
Returns: The updated decision with candidate IDs and status.
Return type: ReferenceDuplicateDecision

async register_duplicate_decision_for_reference(reference_id: UUID, enhancement_id: UUID | None = None, duplicate_determination: Literal[DuplicateDetermination.EXACT_DUPLICATE] | None = None, canonical_reference_id: UUID | None = None) → ReferenceDuplicateDecision

Parameters:

reference_id (uuid.UUID) – The ID of the reference to register the duplicate decision for.
enhancement_id (uuid.UUID | None, optional) – The ID of the enhancement triggering the duplicate decision, defaults to None
duplicate_determination (Literal[DuplicateDetermination.EXACT_DUPLICATE] | None, optional) – Flag indicating if a reference was an exact duplicate and not imported, defaults to None
canonical_reference_id (uuid.UUID | None, optional) – The canonical reference ID this reference is an exact duplicate of, defaults to None

Returns:

The registered duplicate decision

Return type:

ReferenceDuplicateDecision

async shortcut_deduplication_using_identifiers(reference_duplicate_decision: ReferenceDuplicateDecision, trusted_unique_identifier_types: set[ExternalIdentifierType]) → list[ReferenceDuplicateDecision] | None

Deduplicate the given reference using trusted unique identifiers.

This shortcuts the regular deduplication flow and is only run on import.

This is a very powerful operation and should only be used with identifier types that are certain to be unique and reliable. Misuse can lead to incorrect duplicate relationships that are hard to correct.

The search will likely return multiple references (“candidates”), which are handled as follows:

Terminal Cases:

A. If they all belong to the same duplicate relationship graph, the given reference will be marked as duplicate of that graph’s canonical reference.

B. If they belong to more than one duplicate relationship graph, the given reference is marked as decoupled for manual review, as it indicates disconnected duplicate relationship graphs and undermines the assumption of the shortcut.

C. If none of them belong to a duplicate relationship graph, the given reference becomes the canonical of a new duplicate relationship graph including all candidates.

D. If some of them belong to a single duplicate relationship graph and some don’t, the non-graph references are marked as duplicates of the canonical of the graph.

Non-terminal Cases:

E. Finally, if the given reference has no trusted identifiers or no candidates are found, no action is taken and regular deduplication continues.

Parameters:

reference_duplicate_decision (ReferenceDuplicateDecision) – The duplicate decision for the reference to deduplicate.
trusted_unique_identifier_types (set[ExternalIdentifierType]) – The identifier types considered trusted unique identifiers.

Returns:

The generated duplicate decisions, if any.

Return type:

list[ReferenceDuplicateDecision] | None