The Provenance Project

Methodology

How scattered archives become a knowledge graph — and how that graph becomes a cited lead.

This page documents the approach in detail, for archivists, provenance researchers, and technical readers. It errs toward transparency: where the method is uncertain or limited, it says so.

Contents
1 · The problem: evidence exists, but scattered
2 · A claims graph, not a facts database
3 · The pipeline, stage by stage
4 · Entity resolution across archives
5 · Analytics: gaps, motifs, networks
6 · Data sources & licensing
7 · Limitations, ethics & calibration

1 · The problem: evidence exists, but scattered

Between 1933 and 1945, an estimated 600,000 artworks were looted across Europe; roughly 100,000 remain unaccounted for. Unlike many historical mysteries, the documentary trail is unusually rich — the Nazi bureaucracy and the Allied recovery effort both kept meticulous records. A single object might appear in an ERR seizure inventory in Paris, a transport manifest in Koblenz, a Munich Central Collecting Point property card, a post-war auction catalogue in Cologne, a restitution claim in a fourth country, and a museum acquisition file decades later — each in a different language, archive, and identifier scheme.

The barrier was never a shortage of evidence. It was that no human researcher could hold millions of pages across dozens of repositories in mind at once and notice that record A and record F describe the same painting. That cross-referencing — at scale, across languages — is what large language models and graph analytics now make tractable.

2 · A claims graph, not a facts database

The central design decision: the system does not store facts. It stores claims. Every assertion — "this work was seized from X in 1941", "this 1954 lot is the same painting as that 1941 seizure" — is recorded together with the document it came from, an extraction-confidence score, and the model and prompt version that produced it. Competing and contradictory claims are kept side by side.
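As a minimal sketch of what "storing claims, not facts" can look like, the snippet below keeps every assertion with its source and extraction metadata, and groups competing values instead of resolving them. The class names, field names, and identifiers are illustrative assumptions, not the project's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # All field names here are hypothetical, chosen for illustration.
    subject: str         # record the claim is about
    predicate: str       # what is being asserted, e.g. "owner_1941"
    value: str           # the asserted value
    source_doc: str      # document the claim was read from
    confidence: float    # extraction confidence in [0, 1]
    model_version: str   # model that produced the claim
    prompt_version: str  # prompt that produced the claim


class ClaimStore:
    """Append-only store: adding a claim never overwrites another."""

    def __init__(self):
        self._claims = []

    def add(self, claim):
        self._claims.append(claim)

    def contradictions(self):
        """Return {(subject, predicate): [claims]} where values disagree."""
        grouped = defaultdict(list)
        for claim in self._claims:
            grouped[(claim.subject, claim.predicate)].append(claim)
        return {
            key: claims
            for key, claims in grouped.items()
            if len({c.value for c in claims}) > 1
        }
```

Because nothing is overwritten, a seizure record and a later catalogue entry that disagree about the same work both survive, and the disagreement itself can be listed for review.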

This matters because provenance is adversarial: ownership histories were deliberately falsified. A work seized in Paris in 1941 might resurface in 1954 with a fabricated history reading "Swiss private collection since 1920". A conventional database would treat that as a conflict to resolve and silently pick one value. Here, the contradiction is preserved and ranked as a lead: an inconsistency between a seizure record and a later sale catalogue is precisely the fingerprint of laundering.

Four rules are non-negotiable: every claim cites its source; identities are never hard-merged; contradictions are surfaced, not smoothed; and a human makes the final call on anything published.

Never hard-merge

When two records appear to describe the same object, the system does not fuse them. It records a same_as edge with a confidence score and the features that justified it. A "merged" view is materialised from those edges and can always be reverted to the underlying records. A wrong merge would manufacture a false custody chain, which is the single fastest way to produce a confident, cited, and completely wrong "finding". Reversibility is the safeguard.
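One way to picture a reversible merged view is to compute clusters on demand from the same_as edges and never write them back to the records. The sketch below uses a simple union-find for this; the function name, the edge tuple layout, and the 0.9 threshold are assumptions made for illustration.

```python
def materialise_clusters(records, same_as_edges, threshold=0.9):
    """Group record ids connected by same_as edges at or above `threshold`.

    `same_as_edges` is an iterable of (record_a, record_b, confidence).
    The input records and edges are left untouched, so the view can be
    rebuilt with a different threshold or with an edge removed.
    """
    parent = {record: record for record in records}

    def find(node):
        # Walk to the cluster root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b, confidence in same_as_edges:
        if confidence >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for record in records:
        clusters.setdefault(find(record), []).append(record)
    return sorted(sorted(members) for members in clusters.values())
```

Dropping or down-weighting an edge and calling the function again yields the un-merged view, which is the sense in which the merge is reversible.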

The shape of the graph

Custody is modelled as events, not as a mutable "owner" field — an acquisition, a transfer, a seizure, a sale, a restitution, each with actors, a place, and a time span. Ownership at any moment is derived from the event sequence. This event-based shape follows the museum world's own standard (CIDOC-CRM / Linked Art), which keeps the data interoperable with institutional systems.

(:Document)-[:ATTESTS {confidence}]->(:Claim)
(:Claim)-[:ASSERTS]->(:Event {type, date_range, place})
(:Event)-[:INVOLVES {role}]->(:Actor | :Artwork)
(:Artwork)-[:SAME_AS {confidence, method, decided_by}]->(:Artwork)   // never merged
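To illustrate how ownership can be derived from events without a stored "owner" field, the sketch below replays a sorted event list up to a given day. It simplifies each event to a single date and a single receiving actor, whereas the graph above uses date ranges and role-typed actors; the names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustodyEvent:
    kind: str      # "acquisition", "transfer", "seizure", "sale", "restitution"
    when: date     # simplification: one date instead of a date range
    to_actor: str  # actor who holds the work after the event


def holder_on(events, day):
    """Return the actor holding the work on `day`, or None if unknown."""
    holder = None
    for event in sorted(events, key=lambda e: e.when):
        if event.when > day:
            break
        holder = event.to_actor
    return holder
```

A real implementation would have to cope with overlapping or uncertain date ranges, which is where contradictions between claims become visible.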

3 · The pipeline, stage by stage

Each stage writes to durable, inspectable storage, so any step can be re-run and audited independently.

acquire  Open data first

Source adapters pull from bulk datasets and public archives, each producing a raw snapshot plus a manifest recording the retrieval date, URL, and licence. The project uses sanctioned bulk-download channels — published open datasets — rather than scraping live systems behind bot-walls or logins. Where a corpus is only available by agreement, that is a partnership conversation, not a workaround.
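A manifest of this kind might be built as in the sketch below. The retrieval date, URL, and licence come from the description above; the SHA-256 digest and byte count are added assumptions that make a snapshot verifiable later, and the function name is hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def build_manifest(raw_bytes, url, licence, retrieved_at=None):
    """Describe one raw snapshot so it can be cited and re-verified."""
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "url": url,
        "licence": licence,
        "retrieved_at": retrieved_at.isoformat(),
        # Assumed extras: a content hash and size to detect silent changes.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "size_bytes": len(raw_bytes),
    }
```

Storing the manifest next to the snapshot means a later re-run can tell whether the upstream dataset changed between retrievals.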

extract  Reading records into claims

Structured records (auction databases, museum catalogues) map deterministically to claims. Free-text and scanned records are read by language models that emit typed claims validated against strict schemas — people, organisations, artworks, events, and the wartime identifiers that join archives together (ERR codes, Munich and Wiesbaden collecting-point numbers, Linz inventory numbers). Every extracted claim carries its source span and a confidence score.
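The validation step could look roughly like the following, which rejects a model-emitted claim unless its event type is known, its confidence is a number in [0, 1], and its source span actually lies inside the document text. The dictionary keys and the function name are assumptions; the event types are the ones listed in section 2.

```python
ALLOWED_EVENT_TYPES = {"acquisition", "transfer", "seizure", "sale", "restitution"}


def validate_claim(raw, document_text):
    """Return a list of problems; an empty list means the claim passes."""
    errors = []

    if raw.get("event_type") not in ALLOWED_EVENT_TYPES:
        errors.append("unknown event_type")

    confidence = raw.get("confidence")
    is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
    if not is_number or not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be a number in [0, 1]")

    span = raw.get("source_span")
    span_ok = (
        isinstance(span, (list, tuple))
        and len(span) == 2
        and all(isinstance(i, int) for i in span)
        and 0 <= span[0] < span[1] <= len(document_text)
    )
    if not span_ok:
        errors.append("source_span must be [start, end) inside the document")

    return errors
```

Checking the span against the text is what lets a reviewer jump from any claim straight to the words it was read from.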

On scanned documents: modern multimodal models read 1940s typescript at least as accurately as dedicated OCR tools, and excel at the entity extraction that follows. Old German handwriting (Kurrent / Sütterlin) is still hard, with results measurably worse than specialist handwriting models, so the project sequences typescript corpora first and treats handwriting as a distinct, later problem.

graph  Accumulating across sources

Claims land in an embedded property graph. Because the graph is claims-based, ingesting a new archive never overwrites what came before — it adds assertions that may corroborate or contradict existing ones, and the analytics layer notices both.

analyse & review

Graph algorithms rank candidate leads; a language-model pass then re-reads the underlying documents for the top candidates and tries to refute each one before a human ever sees it. Adjudicated decisions are recorded as a committed, human-auditable artifact and re-applied on every rebuild, so the graph can never silently lose a verdict.
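Re-applying recorded decisions on each rebuild can be sketched as an overlay: freshly computed candidate edges are filtered and annotated by a verdict table that lives outside the generated graph. The verdict format ("accept" / "reject" keyed by record pair) and the function names are assumptions for illustration.

```python
def pair_key(a, b):
    """Order-independent key for an undirected pair of record ids."""
    return tuple(sorted((a, b)))


def apply_verdicts(candidates, verdicts):
    """Overlay human verdicts on freshly computed candidate edges.

    `candidates` maps (a, b) -> model score; `verdicts` maps (a, b) ->
    "accept" or "reject". Returns {pair: {"confidence", "decided_by"}}.
    """
    decided = {pair_key(*pair): verdict for pair, verdict in verdicts.items()}
    resolved = {}

    for (a, b), score in candidates.items():
        key = pair_key(a, b)
        verdict = decided.get(key)
        if verdict == "reject":
            continue  # a human refuted this match; never resurrect it
        who = "human" if verdict == "accept" else "model"
        resolved[key] = {"confidence": score, "decided_by": who}

    # Accepted pairs survive even if the matcher no longer proposes them.
    for key, verdict in decided.items():
        if verdict == "accept" and key not in resolved:
            resolved[key] = {"confidence": None, "decided_by": "human"}

    return resolved
```

Keeping the verdict table under version control, separate from the rebuilt graph, is what allows a rebuild to change every model score without losing a single human decision.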

4 · Entity resolution across archives

The hardest and most valuable step is deciding when two records describe the same object or person. The same painting appears as "La rue Saint-Rustique", "Rue Saint Rustique, Montmartre", and a German catalogue's "Strassenszene", with drifting attributions and dimensions rounded differently each time.

Resolution runs in three steps. Blocking narrows the comparison space using normalised artist keys (diacritic- and order-insensitive, so "DÜRER, Albrecht" meets "Albrecht Durer"), parsed dimensions converted to millimetres, and multilingual title keys. Pairwise adjudication scores each candidate pair on hard evidence (dimensional agreement within tolerance, including matches where height and width are swapped, and title similarity), with explicit penalties for contradictions and era mismatches. Collective propagation then compounds relational evidence: two artwork records that share an owner, a dealer, and compatible dates reinforce each other. Every decision becomes a scored, reversible edge, never a silent merge, and the score can always be expanded into the features that produced it.
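Two of the ingredients named above, the normalised artist key and the orientation-tolerant dimension check, can be sketched in a few lines. This is an illustration under stated assumptions: the function names and the 10 mm tolerance are hypothetical, and the key handles combining diacritics only (it does not, for example, map "ue" to "ü").

```python
import re
import unicodedata


def artist_key(name):
    """Diacritic- and order-insensitive blocking key for an artist name."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z]+", stripped.lower())
    return " ".join(sorted(tokens))


def dimensions_agree(a_mm, b_mm, tolerance_mm=10):
    """True if two (height, width) pairs in millimetres match within
    tolerance, either directly or with height and width swapped."""
    (a_h, a_w), (b_h, b_w) = a_mm, b_mm
    direct = abs(a_h - b_h) <= tolerance_mm and abs(a_w - b_w) <= tolerance_mm
    swapped = abs(a_h - b_w) <= tolerance_mm and abs(a_w - b_h) <= tolerance_mm
    return direct or swapped
```

With these definitions, `artist_key("DÜRER, Albrecht")` and `artist_key("Albrecht Durer")` both produce `"albrecht durer"`, so the two records fall into the same block before any pairwise scoring happens.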

5 · Analytics: gaps, motifs, networks

Technique | What it produces
Custody-gap detection | The core product: every work with a documented seizure and no documented restitution, destruction, or return — ranked by how strong and recent the trail is.
Temporal-spatial constraints | Contradiction flags — a painting in two cities at once, a sale dated after the owner's deportation. Each is either an extraction error or a wartime lie; both deserve a human look.
Laundering-motif matching | Anti-money-laundering patterns pointed at the 1940s market: seizure → anonymous consignor → neutral-country sale.
Network centrality | The looted-art trade ran through a small network of dealers and auction houses. Centrality yields both a risk score and an ingestion-priority list — which un-digitised archive would unlock the most chains.
Red-flag name scan | Provenance text scanned for the known art-trade figures of the period and the major despoiled collections. A work naming both is a priority lead.
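The custody-gap rule in the first row reduces to a simple predicate over each work's events: a seizure with no later closing event. The sketch below shows only that predicate, assuming events are (kind, year) pairs; the ranking by trail strength and recency is left out, and all names are hypothetical.

```python
CLOSING_EVENTS = {"restitution", "destruction", "return"}


def custody_gaps(events_by_artwork):
    """Return ids of works with a seizure and no later closing event.

    `events_by_artwork` maps an artwork id to a list of (kind, year).
    """
    gaps = []
    for artwork, events in events_by_artwork.items():
        seizure_years = [year for kind, year in events if kind == "seizure"]
        if not seizure_years:
            continue
        first_seizure = min(seizure_years)
        closed = any(
            kind in CLOSING_EVENTS and year >= first_seizure
            for kind, year in events
        )
        if not closed:
            gaps.append(artwork)
    return sorted(gaps)
```

A gap found this way says only that no closing record has been ingested yet, which may reflect an undigitised archive as easily as an unresolved case.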

A worked validation: pointed at museum provenance narratives with no prior knowledge, the red-flag scan independently surfaced documented looted-art cases — including the Gutmann Degas Landscape with Smokestacks. Rediscovering known cases is how a method is validated before it is trusted on the unexamined long tail. Separately, measuring auction-catalogue consignor anonymity reproduced the historical arc of the forced-sale market — ~40% anonymous in the late-Weimar baseline rising to ~76% at the 1942–44 peak — from open data alone.

6 · Data sources & licensing

The project prefers openly-licensed bulk data (CC0 or public-domain) for anything that feeds the graph. Per-item records that carry "in copyright" rights — many archival page images — are used as cited evidence for research and linked at their source, never rehosted or redistributed. Person-level records carry privacy obligations and are handled under the terms of the holding archive.

Source | Role | Licence
Getty Provenance Index (sales, Knoedler) | art-market layer, 1900–1971 | CC0
French MNR / Rose Valland (POP) | never-restituted French holdings | open data
Joconde — musées de France | museum holdings + former ownership | open data
Art Institute of Chicago; The Met | present-day collections + provenance | CC0
Wikidata | cross-archive identity hub | CC0
Heidelberg digitised auction catalogues | primary-source evidence pages (IIIF) | linked, not rehosted

In progress, via public APIs and partnership: the U.S. National Archives Catalog (Munich Central Collecting Point property cards, OSS Art Looting Investigation Unit reports), the German Lost Art Foundation registry, Dutch restitution data, and the Arolsen Archives.

7 · Limitations, ethics & calibration

This is a research aid, not an oracle. AI extraction and matching make mistakes — misread dates, false title matches, homonymous names. That is the reason every claim is shown with its sources and a human makes the final determination. Outputs are ranked leads, subject to revision.

Calibrated language is a discipline, not a disclaimer. "Offered at auction in Hamburg in 1937" is a fact with a scan behind it. "Looted" is a legal conclusion the project does not draw. A dealer's name in a provenance line is a signal worth investigating — not evidence of wrongdoing; many dealers of the period operated legitimately, and the despoiled collectors were the victims.

People deserve care. The project does not speculate publicly about present-day private owners, and it treats living people connected to these histories — heirs above all — with appropriate restraint.

Reproducibility is the standard. Anything the project publishes can be traced back to the cited records and reconstructed. The site's case pages are generated from the graph, not hand-written — regenerating them from the evidence is itself the integrity guarantee.