Entities and experiments. Corpus exploration, proto indexing and enrichment workflows
Digital Scholarly Editions "Arcipelago Ceresa"
Named Entities in Digital Editions. Between Structured Databases and Context-Specific Annotation
Open Editions Workshop, University of Zurich
07.03.2026
Levyn Bürki, Data Science Lab, University of Bern
Peter Dängeli, Data Science Lab, University of Bern

Edition workflow: General approach
██ Corpus definition
██ Image digitisation (Swiss National Library)
██ IIIF upload
██ Transkribus (raw transcriptions)
▓▓▓ Generation of IIIF manifests (as sequences of canvasses/images per document)
▓▓▓ IIIF manifest-based upload to Transkribus
▓▓▓ Transcription (primarily automated for print/typescripts, manual for manuscripts)
▓▓▓ Document export (incl. transformation to project data structure)
██ Transcription and annotation in oXygen XML editor (with project framework)
██ Web app development (SvelteKit, CETEIcean)
Arcipelago Ceresa Named Entities in Digital Editions, Roundtable “Workflows”, 07.03.2026 2 / 12
Edition workflow: General approach
██ Corpus definition
██ Image digitisation (Swiss National Library)
██ IIIF upload
██ Transkribus (raw transcriptions)
▓▓▓ Generation of IIIF manifests (as sequences of canvasses/images per document)
▓▓▓ IIIF manifest-based upload to Transkribus
▓▓▓ Transcription (primarily automated for print/typescripts, manual for manuscripts)
▓▓▓ Document export (incl. transformation to project data structure)
Named entity detection and identification
██ Transcription and annotation in oXygen XML editor (with project framework)
██ Web app development (SvelteKit, CETEIcean)
Edition workflow: Named entities
Goal: facilitate manual tagging by creating an automatically compiled list of possible entries (including authority references)
Approach:
• use an LLM to evaluate the raw Transkribus output of each transcribed document and
    • detect entities,
    • try to link them to one or more authority records, and
    • deduplicate the resulting list
• in order to offer the entries in the linking utility in oXygen
We call this a "proto index": an imperfect index of entities that likely occur in the corpus. It is meant as a shortcut for the editors that saves them from querying authority databases manually. The actual entity tagging is done by the editors.
State of work: experimental, explorative (but not too far from production-ready)
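The deduplication step named in the approach above can be illustrated roughly as follows. This is a minimal sketch, not the project's code: it assumes each detected mention is a dict with the hypothetical keys `surface`, `category` and `qids`, and it treats two mentions as duplicates when category and a case-folded, whitespace-collapsed surface form agree.

```python
import unicodedata


def normalise(surface: str) -> str:
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    text = unicodedata.normalize("NFC", surface)
    return " ".join(text.split()).casefold()


def deduplicate(mentions: list[dict]) -> list[dict]:
    """Merge mentions that share a category and a normalised surface form.

    Each mention is assumed (hypothetical structure) to look like
    {"surface": str, "category": str, "qids": list[str]}.
    """
    merged: dict[tuple[str, str], dict] = {}
    for mention in mentions:
        key = (mention["category"], normalise(mention["surface"]))
        entry = merged.setdefault(
            key,
            {"category": mention["category"], "surfaces": [], "qids": []},
        )
        # Keep every distinct spelling so editors can still see the variants.
        if mention["surface"] not in entry["surfaces"]:
            entry["surfaces"].append(mention["surface"])
        # Collect candidate authority identifiers without repeating them.
        for qid in mention.get("qids", []):
            if qid not in entry["qids"]:
                entry["qids"].append(qid)
    return list(merged.values())
```

Keeping all surface forms and all candidate identifiers per merged entry fits the idea of an imperfect proto index: nothing is decided automatically, and the editors choose the correct record when tagging.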
Demo
██ Step 1
Feed documents to LLM and ask to detect entities, then identify them using Wikidata knowledge.
Using a slim local helper service that acts like a plug-in the model can call (via the Model Context Protocol, a simple way for tools to talk to each other). This makes it possible to search (e.g.) Wikidata directly (quickly, in memory) without extra servers or manual wiring.
██ Step 2
Take output of step 1 and enrich it with Wikidata information, then generate a spreadsheet (csv).
Simple Python pipeline executing SPARQL queries (no LLM used).
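The enrichment idea of step 2 (SPARQL queries, CSV output, no LLM) might be sketched as below. This is an illustration under assumptions, not the pipeline shown in the demo: it uses the public Wikidata Query Service endpoint, fetches only label and description, and all function names and CSV columns are hypothetical.

```python
import csv
import io
import json
import re
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def build_query(qids: list[str], languages: str = "it,en") -> str:
    """Build one SPARQL query that fetches label and description for all QIDs."""
    for qid in qids:
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a Wikidata QID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?itemLabel ?itemDescription WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        f'  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{languages}". }}\n'
        "}"
    )


def run_query(query: str) -> dict:
    """Send the query to the endpoint and return SPARQL JSON results (needs network)."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "proto-index-sketch/0.1 (illustrative example)",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def bindings_to_rows(results: dict) -> list[dict]:
    """Flatten SPARQL JSON bindings into plain dicts, one per result row."""
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append({
            # The entity URI ends in the QID, e.g. .../entity/Q42.
            "qid": binding["item"]["value"].rsplit("/", 1)[-1],
            "label": binding.get("itemLabel", {}).get("value", ""),
            "description": binding.get("itemDescription", {}).get("value", ""),
        })
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["qid", "label", "description"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A run would chain the functions, for example `rows_to_csv(bindings_to_rows(run_query(build_query(qids))))`. Batching all identifiers into one `VALUES` clause keeps the number of requests to the endpoint low.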
Demo
██ Step 1
Feed documents to LLM and ask to detect entities, then identify them using Wikidata knowledge.

import json  # needed for json.dumps in the extraction prompt below

ENTITY_CLASSES = [
    "persons", "places", "institutions", "publishers", "works", "events", "citations"
]
ENTITY_SEARCH_SYSTEM_PROMPT = f"""
You are a semantic annotation assistant for a Digital Humanities project working with
historical writings (Italian).
Your task: read the text below and extract ALL named entities you can find, grouped by
category.
Return a JSON object with the following top-level keys, each mapping to an array of
exact surface forms as they appear in the text:
{json.dumps({category: [] for category in ENTITY_CLASSES}, indent=2)}
Rules:
- Use the exact string as it appears in the source text (do not normalize or modernize)
- If a category has no entries, return an empty array
"""
ENTITY_LINKING_SYSTEM_PROMPT = """
You are an entity linking assistant for a Digital Humanities project working with
historical writings (Italian).
Your task: link the given entity to a knowledge base (e.g. Wikidata) using the provided
context for disambiguation.
Return a JSON object with a 'candidates' key containing a list of matches, each with a
'qid' and a 'confidence' of high, medium, or low. If no candidates are found, return an
empty list.
"""
TEXT_SUMMARIZATION_SYSTEM_PROMPT = """
[...] historical writings (Italian).
[...]
"""
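Both the extraction prompt and the linking prompt ask the model for JSON, so a caller has to parse and check the replies before using them. The following is a minimal sketch under assumptions: the reply is taken to be a bare JSON string, the function names are hypothetical, and `ENTITY_CLASSES` is repeated only to keep the example self-contained.

```python
import json

ENTITY_CLASSES = [
    "persons", "places", "institutions", "publishers", "works", "events", "citations"
]
# Lower number = stronger match; mirrors the three levels named in the linking prompt.
CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}


def parse_entities(reply: str, source_text: str) -> dict[str, list[str]]:
    """Parse the extraction reply and keep only surface forms found verbatim in the text."""
    data = json.loads(reply)
    entities = {}
    for category in ENTITY_CLASSES:
        forms = data.get(category, [])
        # The prompt demands exact strings, so anything not in the text is dropped.
        entities[category] = [
            form for form in forms
            if isinstance(form, str) and form in source_text
        ]
    return entities


def parse_candidates(reply: str) -> list[dict]:
    """Parse the linking reply and sort candidates from high to low confidence."""
    candidates = json.loads(reply).get("candidates", [])
    valid = [
        candidate for candidate in candidates
        if isinstance(candidate, dict)
        and "qid" in candidate
        and candidate.get("confidence") in CONFIDENCE_ORDER
    ]
    return sorted(valid, key=lambda candidate: CONFIDENCE_ORDER[candidate["confidence"]])
```

Checking each surface form against the source text is a cheap guard against normalised or invented strings, which the extraction prompt explicitly forbids.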
pip install -r requirements.txt
python main.py > output.txt
For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=43
pip install -r requirements.txt
python -m main --input ../../demo/step1/output.txt --output entities_enriched.csv --log pipeline.log
For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=165
csvlens demo/step2/entities_enriched.csv
For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=194
██ Result: facilitated manual tagging based on the automatically compiled list
The generated entries are used to populate the entity spreadsheet(s) of the project (with manual checks). The oXygen framework queries the spreadsheet and offers the entities for comfortable linking.

██ Next steps and further considerations
▓▓▓ Workflow decisions
We need to define how frequently and with what degree of automation this recognition/identification task is executed. One idea is to integrate it into the Transkribus document export. In part, this depends on how fixed the decisions around named entities are (and with them the structure of the spreadsheet).
Technical partner of the Arcipelago Ceresa project
Data Science Lab -- https://dsl.unibe.ch
https://youtu.be/afXUHAUZ4dk