Reference Deduplication#

Terminology & Concepts#

Duplicate references are grouped together. Each group has one canonical reference and zero or more duplicates.

When we search, we look for a canonical reference to which to attach the incoming reference as a duplicate. If we cannot find a canonical reference, the incoming reference becomes a canonical reference with no duplicates. We search for canonical references when we Import a Reference (with work pending to also search when we ingest a relevant enhancement).

  • A canonical reference is the primary reference of a group of duplicates. The choice of canonical reference is arbitrary, at least for now. The Deduplicated Projection is the same regardless of canonical choice. A canonical reference has an active duplicate decision of Canonical.

  • A duplicate reference is a reference which has been determined to be a duplicate of a canonical reference. A duplicate reference has an active duplicate decision of Duplicate.

  • A duplicate decision is the outcome of the deduplication process for a given reference. A reference has at most one active duplicate decision, but may have multiple historical decisions. For instance, a canonical reference has an active decision of canonical. See ReferenceDuplicateDecision for more.

  • A candidate duplicate is a reference which has been identified as a potential canonical of an incoming reference, but has not yet been compared in detail.

  • An exact duplicate is a reference which has an identical supersetting reference already present in the repository. Exact duplicates are not imported, but a duplicate decision with a determination of Exact Duplicate is still registered for them.

It may also help to think of a group of duplicating references as a star graph. The canonical reference is the center of the star, and all duplicates point to it. Duplicates do not point to other duplicates (more on that in Action Decision).

        flowchart BT
    D1(Duplicate)
    D2(Duplicate)
    D3(Duplicate)
    D4(Duplicate)
    D5(Duplicate)
    D6(Duplicate)
    D7(Duplicate)
    C[Canonical]

    D1 --> C
    D2 --> C
    D3 --> C
    D4 --> C
    D5 --> C
    D6 --> C
    D7 --> C
    

Note also that deduplication doesn’t necessarily occur at import time. It may also be triggered manually or by a new enhancement.

High Level Process#

        flowchart LR

    R[[Repository Process]]
    P[(Register Pending Decision)]
    T[Initiate Duplicate Decision]
    CS[[Candidate Selection]]
    CF{"Candidate(s) Found?"}
    DD[[Deep Deduplication]]
    A[[Action Decision]]
    DP>Deduplicated Projection]

    T-->CS
    CS-->CF
    CF-->|No|A
    CF-->|Yes|DD
    DD-->A
    R-->P
    P-.->|Queue|T
    A~~~DP
    

There are four key steps:

  • Candidate Selection - a high-recall, low-precision search to find potential canonical references.

  • Deep Deduplication - a high-precision comparison of the incoming reference against each candidate to determine if it duplicates the candidate.

  • Action Decision - deciding what to do with the reference based on the deduplication results.

  • Deduplicated Projection - the output of the process, the final representation of the deduplicated reference.

Candidate Selection#

        flowchart LR

    D["Duplicate Decision"]
    SF["Project Search Fields"]
    ES[("Search Against ES")]
    C{"One or more candidates?"}
    CR["Decision = Canonical"]
    DD["Deep Dedup"]

    D-->SF
    SF-->ES
    ES-->C
    C-->|Yes|DD
    C-->|No|CR
    

Candidate selection employs a high-recall, low-precision approach to identify potential canonical references. The goal is to ensure that all possible canonicals are considered, even if it means including some false positives.

If no candidates are found, the incoming reference is immediately designated as a canonical reference.

At this stage, only canonical references are considered as candidates.

The search strategy is a work in progress, but will likely combine projected fields (defined in CandidateCanonicalSearchFields) with a fuzzy Elasticsearch query:

async ReferenceESRepository.search_for_candidate_canonicals(search_fields: CandidateCanonicalSearchFields, reference_id: UUID) -> list[ESSearchResult]

Fuzzy match candidate fingerprints to existing references.

This is a high-recall search strategy.

NOT TESTED/EVALUATED. Thrown together as a proof of concept, this must be polished and evaluated before use.

The proof of concept does:

  • MUST: fuzzy match on title (requires 50% of terms to match)

  • SHOULD: partial match on authors list (requires 50% of authors to match)

  • FILTER: publication year within ±1 year range (non-scoring)

Parameters:
  • search_fields (CandidateCanonicalSearchFields) – The search fields of the potential duplicate.

  • reference_id (UUID) – The ID of the potential duplicate.

Returns:

A list of search results with IDs and scores.

Return type:

list[ESSearchResult]
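For orientation, the MUST / SHOULD / FILTER clauses described for the proof of concept could map onto an Elasticsearch bool query roughly as follows. This is a sketch, not the repository's actual query: `build_candidate_query`, the field names `title`, `authors`, and `publication_year`, and the `must_not` clause that excludes the incoming reference are all assumptions.

```python
from typing import Any


def build_candidate_query(
    title: str, authors: list[str], publication_year: int, exclude_id: str
) -> dict[str, Any]:
    """Build a hypothetical bool query body for candidate selection."""
    return {
        "query": {
            "bool": {
                # MUST: fuzzy title match, at least 50% of terms must match.
                "must": [
                    {
                        "match": {
                            "title": {
                                "query": title,
                                "fuzziness": "AUTO",
                                "minimum_should_match": "50%",
                            }
                        }
                    }
                ],
                # SHOULD: one clause per author, at least 50% of them matching.
                "should": [
                    {
                        "bool": {
                            "should": [
                                {"match": {"authors": author}} for author in authors
                            ],
                            "minimum_should_match": "50%",
                        }
                    }
                ],
                # FILTER: non-scoring publication year window of +/- 1 year.
                "filter": [
                    {
                        "range": {
                            "publication_year": {
                                "gte": publication_year - 1,
                                "lte": publication_year + 1,
                            }
                        }
                    }
                ],
                # Assumption: the incoming reference should not match itself.
                "must_not": [{"ids": {"values": [exclude_id]}}],
            }
        }
    }


query = build_candidate_query(
    "Deduplicating bibliographic records", ["A. Smith", "B. Jones"], 2020, "ref-1"
)
```

Placing the year range under `filter` rather than `must` keeps it out of the relevance score, which matches the "non-scoring" note above.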

Deep Deduplication#

        flowchart LR

    D[Duplicate Decision]
    R[Get References]
    DD[[Perform Deep Dedup]]
    C[Canonical Found?]
    A[Proceed to Actioning]
    M([Raise for Manual Review])

    D-->R-->DD-->C
    C-->|Yes|A
    C-->|No|A
    C-->|"Ambiguous/Uncertain"|M
    

If candidate canonicals are found, each is compared in detail against the incoming reference to determine if they are true duplicates. This step prioritizes precision over recall, aiming to minimize false positives.

This algorithm is still being built out. For now, we have a placeholder that we will update in the future:

async DeduplicationService.__placeholder_duplicate_determinator(reference_duplicate_decision: ReferenceDuplicateDecision) -> ReferenceDuplicateDeterminationResult

Implement a basic placeholder duplicate determinator.

Temporary implementation: takes the first candidate as the duplicate. This is the one with the highest score in the candidate nomination stage. This completes the flow but should not be used in production.

Parameters:

reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to determine duplicates for.

Returns:

The result of the duplicate determination.

Return type:

ReferenceDuplicateDeterminationResult

Manual Resolution#

Duplicate decisions that are Decoupled or Unresolved can be handled here. This is not yet implemented.

Action Decision#

Once the deduplication process is complete, the decision must be actioned. In essence, this involves activating the new decision unless there is a particular reason not to.

Special cases#

The bold lines in the flowchart indicate the expected nominal flow.

        flowchart LR

    N["New Decision (N)"]
    C1{"Active Decision Exists? (A)"}
    C2{A == N?}
    C3{A Canonical & N Duplicate?}
    C4{N is Canonical?}
    C5{N's Canonical is Canonical?}
    T[[Activate New Decision]]
    M([Mark for Manual Handling])

    N ==> C1
    C1 ==>|Yes| C2
    C1 -->|No| C4
    C2 ==>|Yes| T
    C2 -->|No| C3
    C3 ==>|Yes| T
    C3 -->|No| M
    C4 -->|No| C5
    C4 ==>|Yes| T
    C5 -->|No| M
    C5 ==>|Yes| T
    M ~~~ T
    

There are two cases where the new decision is not automatically activated:

  1. The active decision is duplicate and the new decision is canonical or a duplicate of a different reference.

  2. The new decision is duplicate but its canonical reference is not itself canonical.

Both of these can be handled automatically, but manual review allows us to highlight and understand the frequency and nature of these cases. The commentary around these is changing frequently, so they are not documented in detail here, but please reach out if you want more information!

Deduplicated Projection#

The end product of deduplication is a rich database with each individual reference, linked together with their duplicate decision history. However, this is not the most convenient format for most use cases. To this end, the default view for interfacing with the repository is the deduplicated projection.

The deduplicated projection is simply a consolidated Reference object, with enhancements and identifiers of its duplicates included in the canonical reference. This is the view that is indexed into Elasticsearch, and likely the view that most robots and users will interact with.

Also note that this projected view is reversible: data provenance is preserved through the reference_id field on each enhancement and identifier.

See also:

classmethod DeduplicatedReferenceProjection.get_from_reference(reference: Reference) -> Reference

Get the deduplicated reference from a reference.
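To make the consolidation and its reversibility concrete, here is a minimal sketch using simplified stand-in models. The `Enhancement`, `Identifier`, and `Reference` classes below are hypothetical reductions of the real domain models, and `project` is not the actual `get_from_reference` implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Enhancement:
    reference_id: str  # provenance: the reference this enhancement came from
    content: str


@dataclass(frozen=True)
class Identifier:
    reference_id: str  # provenance: the reference this identifier came from
    value: str


@dataclass(frozen=True)
class Reference:
    id: str
    enhancements: tuple[Enhancement, ...] = ()
    identifiers: tuple[Identifier, ...] = ()
    duplicates: tuple["Reference", ...] = ()


def project(canonical: Reference) -> Reference:
    """Fold the duplicates' enhancements and identifiers into the canonical."""
    enhancements = list(canonical.enhancements)
    identifiers = list(canonical.identifiers)
    for duplicate in canonical.duplicates:
        enhancements.extend(duplicate.enhancements)
        identifiers.extend(duplicate.identifiers)
    return replace(
        canonical,
        enhancements=tuple(enhancements),
        identifiers=tuple(identifiers),
        duplicates=(),
    )


dup = Reference(
    "dup-1",
    enhancements=(Enhancement("dup-1", "abstract"),),
    identifiers=(Identifier("dup-1", "doi:10.1000/example"),),
)
canon = Reference(
    "canon-1",
    enhancements=(Enhancement("canon-1", "authors"),),
    duplicates=(dup,),
)
projected = project(canon)
# Reversal: filter by reference_id to recover what each reference contributed.
from_dup = [e for e in projected.enhancements if e.reference_id == "dup-1"]
```

Filtering on `reference_id` is what makes the projection reversible in this sketch, since each item still records which reference it came from.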

Exact Duplicates#

Exact duplicates are references which are wholly represented by an existing reference in the repository. Exact duplicate detection does not form part of the main deduplication flow, but provides an early-exit optimisation for importers and enhancement processors.

Exact duplicate detection is performed on individual references, not the deduplicated projection. This preserves any implied contextual information from the incoming reference.

See also: app.domain.references.services.deduplication_service.DeduplicationService.find_exact_duplicate.

Function Reference#

class app.domain.references.services.deduplication_service.DeduplicationService(anti_corruption_service: ReferenceAntiCorruptionService, sql_uow: AsyncSqlUnitOfWork, es_uow: AsyncESUnitOfWork)

Service for managing reference duplicate detection.

async determine_canonical_from_candidates(reference_duplicate_decision: ReferenceDuplicateDecision) -> ReferenceDuplicateDecision

Determine a canonical reference from its candidates.

Parameters:

reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to determine duplicates for.

Returns:

The updated decision with the determination result.

Return type:

ReferenceDuplicateDecision

async find_exact_duplicate(reference: Reference) -> Reference | None

Find exact duplicate references for the given reference.

This is not part of the regular deduplication flow but is used to circumvent importing and processing redundant references.

Exact duplicates are defined in app.domain.references.models.models.Reference.is_superset(). A reference may have more than one exact duplicate; this method returns only the first.

Parameters:

reference (app.domain.references.models.models.Reference) – The reference to find duplicates for.

Returns:

The supersetting reference, or None if no duplicate was found.

Return type:

app.domain.references.models.models.Reference | None

async map_duplicate_decision(new_decision: ReferenceDuplicateDecision) -> tuple[ReferenceDuplicateDecision, bool]

Apply the persistence changes from the new duplicate decision.

If the new decision is not terminal, it is not made active.

Parameters:

new_decision (ReferenceDuplicateDecision) – The new decision to apply.

Returns:

The applied decision and whether it changed.

Return type:

tuple[ReferenceDuplicateDecision, bool]

async nominate_candidate_canonicals(reference_duplicate_decision: ReferenceDuplicateDecision) -> ReferenceDuplicateDecision

Nominate candidate canonical references for the given decision.

This uses the search strategy in app.domain.references.repository.ReferenceESRepository.search_for_candidate_canonicals.

Parameters:

reference_duplicate_decision (ReferenceDuplicateDecision) – The decision to find candidates for.

Returns:

The updated decision with candidate IDs and status.

Return type:

ReferenceDuplicateDecision

async register_duplicate_decision_for_reference(reference_id: UUID, enhancement_id: UUID | None = None, duplicate_determination: Literal[DuplicateDetermination.EXACT_DUPLICATE] | None = None, canonical_reference_id: UUID | None = None) -> ReferenceDuplicateDecision

Register a duplicate decision for a reference.

Parameters:
  • reference_id (uuid.UUID) – The ID of the reference to register the duplicate decision for.

  • enhancement_id (uuid.UUID | None, optional) – The ID of the enhancement that triggered the duplicate decision, defaults to None

  • duplicate_determination (Literal[DuplicateDetermination.EXACT_DUPLICATE] | None, optional) – Flag indicating if a reference was an exact duplicate and not imported, defaults to None

  • canonical_reference_id (uuid.UUID | None, optional) – The canonical reference ID this reference is an exact duplicate of, defaults to None

Returns:

The registered duplicate decision

Return type:

ReferenceDuplicateDecision