Entities and experiments. Corpus exploration, proto indexing and enrichment workflows                  
                                     Digital Scholarly Editions "Arcipelago Ceresa"                                     
                                                                                                                        
                                                                                                                        
            Named Entities in Digital Editions. Between Structured Databases and Context-Specific Annotation            
                                      Open Editions Workshop, University of Zurich                                      
                                                       07.03.2026                                                       
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                   Levyn Bürki, Data Science Lab, University of Bern                                    
                                  Peter Dängeli, Data Science Lab, University of Bern                                   
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                 Edition workflow: General approach                                                     
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                 ██ Corpus definition                                                                                   
                                                                                                                        
                                                                                                                        
                 ██ Image digitisation (Swiss National Library)                                                         
                                                                                                                        
                                                                                                                        
                 ██ IIIF upload                                                                                         
                                                                                                                        
                                                                                                                        
                 ██ Transkribus (raw transcriptions)                                                                    
                                                                                                                        
                  ▓▓▓ Generation of IIIF manifests (as sequences of canvases/images
                  per document)
                                                                                                                        
                 ▓▓▓ IIIF manifest-based upload to Transkribus                                                          
                                                                                                                        
                 ▓▓▓ Transcription (primarily automated for print/typescripts,                                          
                 manual for manuscripts)                                                                                
                                                                                                                        
                 ▓▓▓ Document export (incl. transformation to project data                                              
                 structure)                                                                                             
                                                                                                                        
                                                                                                                        
                 ██ Transcription and annotation in oXygen XML editor (with project                                     
                 framework)                                                                                             
                                                                                                                        
                                                                                                                        
                 ██ Web app development (SvelteKit, CETEIcean)                                                          
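
                  The manifest generation step listed above (one manifest per document,
                  holding a sequence of canvases with one image each) could look roughly
                  like the following sketch. It is not the project's actual script. It
                  assumes the IIIF Presentation API 3.0 layout; the function name
                  build_manifest, the example.org URLs and the image dimensions are
                  hypothetical placeholders.

```python
import json


def build_manifest(manifest_id, label, images):
    """Build a minimal IIIF Presentation 3.0 manifest with one canvas per image.

    `images` is a list of dicts with the keys "url", "width" and "height".
    """
    canvases = []
    for number, image in enumerate(images, start=1):
        canvas_id = f"{manifest_id}/canvas/{number}"
        canvases.append({
            "id": canvas_id,
            "type": "Canvas",
            "label": {"none": [str(number)]},
            "width": image["width"],
            "height": image["height"],
            "items": [{
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/annotation",
                    "type": "Annotation",
                    "motivation": "painting",
                    # The image is "painted" onto the canvas it targets.
                    "body": {
                        "id": image["url"],
                        "type": "Image",
                        "format": "image/jpeg",
                        "width": image["width"],
                        "height": image["height"],
                    },
                    "target": canvas_id,
                }],
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": manifest_id,
        "type": "Manifest",
        "label": {"none": [label]},
        "items": canvases,
    }


# Hypothetical document with two page images (all URLs are placeholders).
manifest = build_manifest(
    "https://example.org/iiif/doc-001/manifest",
    "doc-001",
    [
        {"url": "https://example.org/iiif/doc-001/p1/full/max/0/default.jpg",
         "width": 2000, "height": 3000},
        {"url": "https://example.org/iiif/doc-001/p2/full/max/0/default.jpg",
         "width": 2000, "height": 3000},
    ],
)
print(json.dumps(manifest, indent=2))
```

                  The order of the canvases in "items" defines the page sequence that an
                  importing tool sees.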
                                                                                                                        
                                                                                                                        
  Arcipelago Ceresa      Named Entities in Digital Editions, Roundtable “Workflows”, 07.03.2026                 2 / 12  
                                                                                                                        
                                                                                                                        
                                                                                                                        
                  Named entity detection and identification is inserted into the
                  workflow above between the Transkribus document export and the
                  transcription and annotation in the oXygen XML editor.

                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                  Edition workflow: Named entities                                                      
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                 Goal: facilitate manual tagging by creating an automatically                                           
                 compiled list with possible entries (including authority                                               
                 references)                                                                                            
                                                                                                                        
                 Approach:                                                                                              
                                                                                                                        
                     •  use an LLM to evaluate the raw Transkribus output of each
                        transcribed document and detect entities,
                     •  try to link them to one or more authority records, and
                     •  deduplicate the resulting list,

                  in order to offer the entries in the linking utility in oXygen.
                                                                                                                        
                 We call this a "proto index", an imperfect index of entities that                                      
                 likely occur in the corpus.                                                                            
                                                                                                                        
                 It is meant to be a shortcut for the editors that saves them                                           
                 manual querying of authority databases.                                                                
                                                                                                                        
                 The actual entity tagging is done by the editors.                                                      
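
                  The deduplication step of the approach above can be illustrated with a
                  small sketch. It is an assumption about how such a merge might work, not
                  the project's code: detections are merged on their best-ranked authority
                  candidate (using the high/medium/low confidence levels), and unlinked
                  entities are merged on their case-folded surface form. The function
                  name, the record layout and the sample data are invented for
                  illustration.

```python
# Ranking used to pick the best authority candidate per detection.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def build_proto_index(detections):
    """Merge per-document detections into one entry per entity.

    Each detection is a dict with "document", "category", "surface" and
    "candidates" (a list of {"qid", "confidence"} dicts, possibly empty).
    """
    index = {}
    for detection in detections:
        candidates = sorted(
            detection["candidates"],
            key=lambda candidate: CONFIDENCE_RANK[candidate["confidence"]],
            reverse=True,
        )
        best_qid = candidates[0]["qid"] if candidates else None
        # Linked entities merge on their QID, unlinked ones on the surface form.
        key = (detection["category"], best_qid or detection["surface"].casefold())
        entry = index.setdefault(key, {
            "category": detection["category"],
            "qid": best_qid,
            "surface_forms": set(),
            "documents": set(),
        })
        entry["surface_forms"].add(detection["surface"])
        entry["documents"].add(detection["document"])
    return [
        {**entry,
         "surface_forms": sorted(entry["surface_forms"]),
         "documents": sorted(entry["documents"])}
        for entry in index.values()
    ]


# Invented sample detections from three hypothetical documents.
detections = [
    {"document": "doc-001", "category": "places", "surface": "Roma",
     "candidates": [{"qid": "Q220", "confidence": "high"}]},
    {"document": "doc-002", "category": "places", "surface": "Rome",
     "candidates": [{"qid": "Q220", "confidence": "medium"}]},
    {"document": "doc-002", "category": "persons", "surface": "Signor X",
     "candidates": []},
    {"document": "doc-003", "category": "persons", "surface": "signor x",
     "candidates": []},
]
proto_index = build_proto_index(detections)
for entry in proto_index:
    print(entry)
```

                  Keeping all surface forms and source documents per entry lets editors
                  see why an entry was proposed before they accept or reject it.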
                                                                                                                        
                                                                                                                        
                  State of work: experimental, exploratory (but not too far from
                  production-ready)
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                 Demo                                                                                   
                                                                                                                        
                                                                                                                        
                                                                                                                        
                  ██ Step 1

                  Feed documents to an LLM and ask it to detect entities, then
                  identify them using Wikidata knowledge.

                  This uses a slim local helper service that acts like a plug-in
                  the model can call (via the Model Context Protocol, a simple way
                  for tools to talk to each other). It allows the model to search
                  (e.g.) Wikidata directly (quickly, in memory) without extra
                  servers or manual wiring.

                  ██ Step 2

                  Take the output of step 1 and enrich it with Wikidata
                  information, then generate a spreadsheet (CSV).

                  This is a simple Python pipeline executing SPARQL queries (no
                  LLM used).
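
                  Step 2 can be pictured with the following minimal sketch, which is an
                  assumption and not the pipeline shown in the demo. It builds a SPARQL
                  query for a list of QIDs, can send it to the public Wikidata Query
                  Service, and flattens the JSON result into CSV. The function names,
                  the selected columns and the User-Agent string are hypothetical; the
                  sample result is hand-written so that the example runs without network
                  access.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(qids, language="it"):
    """Build a SPARQL query that fetches label and description per QID."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return f"""
SELECT ?item ?itemLabel ?itemDescription WHERE {{
  VALUES ?item {{ {values} }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language},en". }}
}}
"""


def run_query(query):
    """Send the query to the Wikidata Query Service and return the JSON result."""
    url = WIKIDATA_SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"}
    )
    request = urllib.request.Request(
        url, headers={"User-Agent": "proto-index-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def bindings_to_csv(result):
    """Flatten SPARQL JSON bindings into CSV text (qid, label, description)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["qid", "label", "description"])
    for row in result["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        label = row.get("itemLabel", {}).get("value", "")
        description = row.get("itemDescription", {}).get("value", "")
        writer.writerow([qid, label, description])
    return buffer.getvalue()


# Hand-written sample in the SPARQL JSON results format (no network call).
sample_result = {
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q220"},
         "itemLabel": {"type": "literal", "xml:lang": "it", "value": "Roma"}},
    ]}
}
query = build_query(["Q220", "Q490"])
csv_text = bindings_to_csv(sample_result)
print(csv_text)
```

                  Against the live service, bindings_to_csv(run_query(query)) would
                  replace the hand-written sample.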
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                          Demo                                                          
                                                                                                                        
                                                                                                                        
     ██ Step 1                                                                                                          
                                                                                                                        
      Feed documents to an LLM and ask it to detect entities, then identify them using Wikidata knowledge.
                                                                                                                        
                                                                                                                        
                 import json

                 # Entity categories the model is asked to extract.
                 ENTITY_CLASSES = [
                     "persons", "places", "institutions", "publishers", "works", "events", "citations"
                 ]

                 # System prompt for entity detection. The expected JSON skeleton is
                 # generated from ENTITY_CLASSES, so prompt and categories stay in sync.
                 ENTITY_SEARCH_SYSTEM_PROMPT = f"""
                 You are a semantic annotation assistant for a Digital Humanities project working with
                 historical writings (Italian).
                 Your task: read the text below and extract ALL named entities you can find, grouped by
                 category.
                 Return a JSON object with the following top-level keys, each mapping to an array of
                 exact surface forms as they appear in the text:
                 {json.dumps({category: [] for category in ENTITY_CLASSES}, indent=2)}
                 Rules:
                 - Use the exact string as it appears in the source text (do not normalize or modernize)
                 - If a category has no entries, return an empty array
                 """

                 # System prompt for linking a single detected entity to authority records.
                 ENTITY_LINKING_SYSTEM_PROMPT = """
                 You are an entity linking assistant for a Digital Humanities project working with
                 historical writings (Italian).
                 Your task: link the given entity to a knowledge base (e.g. Wikidata) using the provided
                 context for disambiguation.
                 Return a JSON object with a 'candidates' key containing a list of matches, each with a
                 'qid' and a 'confidence' of high, medium, or low. If no candidates are found, return an
                 empty list.
                 """
                                                                                                                        
                 TEXT_SUMMARIZATION_SYSTEM_PROMPT = """
                 [...]
                 historical writings (Italian).
                 [...]
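
                 The entity search prompt above asks for a JSON object with one array per
                 category and for exact surface forms. A response to it could be checked
                 along the following lines before it enters the proto index. This is a
                 sketch under assumptions, not part of the demo code: the function name
                 parse_entity_response, the sample sentence and the sample response are
                 invented.

```python
import json

ENTITY_CLASSES = [
    "persons", "places", "institutions", "publishers", "works", "events", "citations"
]


def parse_entity_response(raw_response, source_text):
    """Parse the model's JSON answer and keep only verifiable surface forms.

    Unknown categories are dropped, missing categories become empty lists,
    and a surface form is kept only if it occurs verbatim in the source text.
    """
    data = json.loads(raw_response)
    cleaned = {}
    for category in ENTITY_CLASSES:
        values = data.get(category, [])
        if not isinstance(values, list):
            values = []
        kept = []
        for value in values:
            if isinstance(value, str) and value in source_text and value not in kept:
                kept.append(value)
        cleaned[category] = kept
    return cleaned


# Invented sample text and model response, for illustration only.
source_text = "Ieri sono arrivato a Roma e ho incontrato Maria."
raw_response = json.dumps({
    "persons": ["Maria"],
    "places": ["Roma", "Milano"],  # "Milano" does not occur in the text
    "dates": ["Ieri"],             # not one of the requested categories
})
entities = parse_entity_response(raw_response, source_text)
print(entities)
```

                 The verbatim check follows from the prompt's rule that strings must not
                 be normalized or modernized, so a surface form that cannot be found in
                 the text is likely a hallucination or a silent correction.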
                                                                                                                        
                                                                                                                        
                                     pip install -r requirements.txt                                                    
                                     python main.py > output.txt                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                          For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=43                          
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                pip install -r requirements.txt
                python -m main --input ../../demo/step1/output.txt \
                    --output entities_enriched.csv --log pipeline.log
                                                                                                                        
                                                                                                                        
                                                                                                                        
                         For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=165                          
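
The call above passes three options: --input, --output and --log. As a rough sketch of how such an interface might be declared with Python's standard argparse module (the actual main.py of the project is not shown on the slides and may differ; `build_parser` and the help texts are assumptions):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three options used in the command on the slide."""
    parser = argparse.ArgumentParser(
        prog="main",
        description="Enrich a list of extracted entities (illustrative sketch).",
    )
    parser.add_argument("--input", required=True,
                        help="text file produced by the previous step")
    parser.add_argument("--output", required=True,
                        help="CSV file to write the enriched entities to")
    parser.add_argument("--log", default=None,
                        help="optional log file; assumed to be optional here")
    return parser


# Parse the same arguments as in the command on the slide.
args = build_parser().parse_args([
    "--input", "../../demo/step1/output.txt",
    "--output", "entities_enriched.csv",
    "--log", "pipeline.log",
])
print(args.input, args.output, args.log)
```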
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                     csvlens demo/step2/entities_enriched.csv                                           
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                         For executed code see: https://asciinema.org/a/CvUyVktwWlx02NTE?t=194                          
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                 ██ Result: facilitated manual
                 tagging based on the automatically
                 compiled list
                                                                                                                        
                 The generated entries are used to                                                                      
                 populate the entity spreadsheet(s)                                                                     
                 of the project (with manual checks).                                                                   
                                                                                                                        
                 The oXygen framework queries the
                 spreadsheet and offers the entities
                 for convenient linking.
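
The lookup described above (query the spreadsheet, offer matching entities) can be illustrated with a small Python sketch. This is not the oXygen framework code; the column names `id`, `name` and `type`, the sample rows and the function names are hypothetical placeholders.

```python
import csv
import io

# Hypothetical spreadsheet export; columns and rows are placeholders.
SAMPLE_CSV = """id,name,type
pers_001,Example Person A,person
place_001,Example Place,place
pers_002,Another Example,person
"""


def load_entities(handle) -> list[dict]:
    """Read all spreadsheet rows as dictionaries keyed by column name."""
    return list(csv.DictReader(handle))


def suggest(entities: list[dict], query: str, limit: int = 5) -> list[dict]:
    """Return up to `limit` rows whose name contains the query.

    Matching ignores case; names that start with the query come first,
    the rest follow, each group in alphabetical order.
    """
    needle = query.casefold().strip()
    if not needle:
        return []
    hits = [row for row in entities if needle in row["name"].casefold()]
    hits.sort(key=lambda row: (
        not row["name"].casefold().startswith(needle),
        row["name"].casefold(),
    ))
    return hits[:limit]


entities = load_entities(io.StringIO(SAMPLE_CSV))
for row in suggest(entities, "example"):
    print(row["id"], row["name"])
```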
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                 ————————————————————————————————————                                                                   
                                                                                                                        
                                                                                                                        
                 ██ Next steps and further                                                                              
                 considerations                                                                                         
                                                                                                                        
                 ▓▓▓ Workflow decisions                                                                                 
                                                                                                                        
                 We need to define how often and
                 with what degree of automation
                 this recognition/identification task
                 is run.
                                                                                                                        
                 One idea is to integrate it into the                                                                   
                 Transkribus document export. In                                                                        
                 part, this depends on how fixed the                                                                    
                 decisions around named entities are                                                                    
                 (and with them the structure of the                                                                    
                 spreadsheet).                                                                                          
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                   Technical partner of the Arcipelago Ceresa project                                   
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                        Data Science Lab -- https://dsl.unibe.ch                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                              https://youtu.be/afXUHAUZ4dk                                              
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        
                                                                                                                        