Project Documentation - Overview
This page documents the editorial rationale, technical workflow, and design choices underlying Varela Digital. It describes how documentary sources are transformed into structured representations that support navigation, exploration, and analysis across the website.
Scope and rationale
Varela Digital as a scholarly edition and documentation-driven project
Varela Digital is implemented as a documentation-driven scholarly digital edition whose primary objective is to expose the structure, entities, and relations implicit in a historical correspondence corpus. The project integrates textual encoding, metadata normalisation, and semantic modelling within a coherent technical framework designed to support inspection, reuse, and exploratory analysis.
At its core, the project treats TEI/XML not only as a publication format, but as the authoritative representation of editorial decisions. All subsequent layers—indexes, visualisations, and semantic exports—are derived from explicitly encoded structures rather than from ad hoc data extraction. This ensures traceability between documentary evidence and the computational representations presented on the website.
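As a sketch of what "derived from explicitly encoded structures" means in practice, the following Python fragment builds an entity index by walking TEI markup rather than scraping plain text. The TEI fragment, the `@ref` identifiers, and the names are invented for illustration:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Invented TEI fragment; the @ref identifiers and text are illustrative only.
sample = f"""
<TEI xmlns="{TEI_NS}">
  <text>
    <body>
      <div type="letter">
        <p>Letter from <persName ref="#p001">Mariano de Mattos</persName>,
           sent from <placeName ref="#pl001">Porto Alegre</placeName>.</p>
      </div>
    </body>
  </text>
</TEI>
"""

def entity_index(tei_xml: str) -> dict:
    """Build an index of (ref, surface form) pairs from encoded entities."""
    root = ET.fromstring(tei_xml)
    index = {"persons": [], "places": []}
    for el in root.iter(f"{{{TEI_NS}}}persName"):
        index["persons"].append((el.get("ref"), "".join(el.itertext()).strip()))
    for el in root.iter(f"{{{TEI_NS}}}placeName"):
        index["places"].append((el.get("ref"), "".join(el.itertext()).strip()))
    return index

print(entity_index(sample))
```

Because the index is read off the markup, every entry is traceable back to an explicit editorial decision (a `@ref` attribute) rather than to a heuristic match in the running text.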
The scope of the project extends beyond document display. Persons, places, institutions, and relations referenced in the corpus are modelled as first-class entities and documented through authority files and standoff annotations. These entities form the basis for cross-document navigation, network construction, and spatial analysis, enabling users to move systematically between individual documents and aggregated views of the corpus.
From a technical perspective, Varela Digital is designed as a static, reproducible system. The entire platform is deployed without server-side dependencies, relying instead on pre-generated HTML, structured datasets, and client-side visualisation libraries. This design choice prioritises transparency, long-term maintainability, and ease of reuse over dynamic querying or runtime data generation.
The project’s documentation plays a central role in this architecture. Rather than treating the website as a black-box interface, Varela Digital exposes its data models, ontologies, and knowledge base artefacts as inspectable components. This documentation-oriented approach positions the edition not only as a reading environment, but also as a technical reference for scholars interested in data-driven approaches to historical correspondence.
Editorial and interpretive stance
Varela Digital adopts an explicit and documented editorial stance that distinguishes clearly between source material, editorial mediation, and interpretive modelling. The textual layer of the project is based on the printed transcriptions published in the Anais do Arquivo Histórico do Rio Grande do Sul, which are treated as the authoritative source for encoding.
As a consequence, the project does not aim to produce a diplomatic or critical edition of the original manuscripts. Orthographic, palaeographic, and material features of the handwritten documents are not reconstructed. Instead, the edition preserves the editorial logic of the printed source and documents this dependency explicitly as part of the project’s scope.
Editorial intervention is limited to structural encoding, normalisation required for computational processing, and the identification of entities and relations relevant to historical analysis. These interventions are applied consistently and recorded through TEI/XML markup, authority files, and standoff annotations, allowing them to be isolated from the textual transcription itself.
Interpretive statements—such as the identification of social roles, institutional affiliations, or relationships between actors—are not embedded directly in the running text. Instead, they are modelled as structured assertions linked to documentary evidence. This separation allows interpretive choices to remain explicit, traceable, and revisable, without compromising the integrity of the encoded text.
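As an illustration, such an assertion might take the form of a TEI standoff relation, in which the actors and the documentary evidence are identified by pointers rather than written into the transcription. All identifiers, the property name, and the evidence pointer below are invented for illustration, not taken from the project's files:

```xml
<!-- Hypothetical standoff fragment; identifiers and attribute values
     are illustrative only. -->
<listRelation xmlns="http://www.tei-c.org/ns/1.0">
  <relation name="hrao:correspondedWith"
            active="#p001" passive="#p002"
            source="#doc042"/>
</listRelation>
```

Removing or revising such a statement touches only the standoff file; the encoded text it points to remains unchanged.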
This editorial configuration positions Varela Digital as a documentation-oriented scholarly edition: one in which transparency of sources, clarity of modelling decisions, and technical reproducibility take precedence over exhaustive textual reconstruction or interpretive closure.
From texts to structured data
The Varela Digital workflow is designed as a controlled, multi-stage pipeline that transforms documentary sources into structured, reusable data while preserving editorial responsibility and traceability. Each stage produces explicit artefacts that serve as inputs for subsequent processes, ensuring that computational representations remain grounded in documented editorial decisions.
The workflow is intentionally split into two complementary paths originating from the TEI/XML layer: a text-oriented publication pipeline, dedicated to reading and validation, and a data-oriented extraction pipeline, dedicated to semantic modelling and visualisation. This separation allows the project to support both close textual engagement and corpus-level analytical exploration.
Workflow stages
| Stage | Input | Process | TEI / Data structures | Outcome / Purpose |
|---|---|---|---|---|
| 1. Source acquisition | Printed Anais do Arquivo Histórico do Rio Grande do Sul (vol. II, 1978) | High-resolution scanning and OCR (Python workflow), followed by manual revision | Raw text files (intermediate, non-authoritative) | Establish a clean working transcription faithful to the printed edition, without producing an authoritative textual layer |
| 2. Structural TEI encoding | Revised OCR text | Encoding of document structure (letter boundaries, paragraphs, openers, closers, postscripts) | `<TEI>`, `<teiHeader>`, `<text>`, `<body>`, `<div type="letter">`, `<p>` | Produce a stable, machine-readable representation of each document as a textual unit |
| 3. Archival and editorial metadata | Anais metadata and institutional documentation | Encoding of archival context, editorial responsibility, and representational constraints | `<sourceDesc>`, `<msDesc>`, `<editorialDecl>`, `<revisionDesc>` | Represent the Anais as a transmission stratum and document editorial assumptions and limits |
| 4. Entity identification (inline anchoring) | TEI body text | Anchoring textual references to persons, places, organisations, dates, and events | `<persName>`, `<placeName>`, `<orgName>`, `<date>` with `@ref` | Anchor textual mentions to identifiable entities while preserving textual readability |
| 5. Standoff entity modelling | Inline entity references | Creation and maintenance of external authority files | Separate TEI/XML standoff files (persons, places, organisations, events) | Enable correction, enrichment, and disambiguation without altering the base transcription |
| 6. Editorial intervention marking | Printed Anais text | Explicit encoding of normalisation, uncertainty, paratext, and editorial notes | `<note type="editorial">`, `<note type="historical">`, `<sic>`, folio markers | Preserve the editorial layer of the printed edition transparently, without claiming diplomatic fidelity |
| 7. Relation annotation (standoff) | Curated historical interpretation | Encoding of interpersonal, institutional, and event-based relations | `<relation>` elements in standoff files (FOAF, HRAO, PRO, Schema) | Make historical relations explicit, queryable, and analysable as networks |
| 8. Validation and consistency checks | TEI corpus and standoff files | Schema validation, identifier consistency, and cross-reference checks | Project-specific constraints and guidelines | Ensure internal coherence, referential integrity, and long-term maintainability |
| 9. Version control and documentation | All project files | Git-based versioning and documentation of editorial decisions | Public repository, changelogs, encoding guidelines | Guarantee transparency, reproducibility, and future extensibility |
| 10. Semantic transformation and publication | Structured metadata and standoff layers | Generation of RDF/Turtle and JSON-LD, both per document and as an aggregated graph | FRBR/FaBiO, FOAF, PRO, HiCO, SAN, HRAO | Enable knowledge graph publication, data exploration, and analytical reuse |
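Stages 2, 4, and 6 can be illustrated together with a minimal, hypothetical fragment that combines structural markup, inline entity anchoring, and an editorial note. The identifiers, dates, and note text are invented for illustration:

```xml
<!-- Illustrative sketch of stages 2, 4, and 6; all content is invented. -->
<div type="letter" xml:id="doc042">
  <opener>
    <dateline>
      <placeName ref="#pl001">Porto Alegre</placeName>,
      <date when="1839-03-12">12 March 1839</date>
    </dateline>
  </opener>
  <p>I have received the letter of
     <persName ref="#p002">Domingos José de Almeida</persName>
     <note type="editorial">Reading normalised from the printed edition.</note>.</p>
  <closer>
    <signed><persName ref="#p001">Mariano de Mattos</persName></signed>
  </closer>
</div>
```

Each layer remains separable: the structure (`<div>`, `<opener>`, `<closer>`) carries the document model, the `@ref` attributes carry entity identification, and the `<note>` carries editorial intervention.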
Reuse and interoperability
Interoperability in Varela Digital is addressed through the explicit reuse of shared standards, controlled vocabularies, and stable identifiers. Rather than introducing a comprehensive project-specific ontology, the data model combines established Semantic Web vocabularies with a minimal domain-oriented extension designed to express historically meaningful relations.
Reused standards and vocabularies
The project adopts widely used standards for textual encoding, metadata representation, and semantic modelling. These choices ensure that the resulting datasets can be understood and reused beyond the project context, while remaining compatible with common Digital Humanities workflows.
- TEI (Text Encoding Initiative): used as the authoritative format for textual and editorial encoding of documents.
- Dublin Core (DC): reused for basic descriptive metadata related to documents, creators, and dates.
- FOAF (Friend of a Friend): employed for modelling persons and social relationships at a general level.
- PRO (Publishing Roles Ontology): used to represent roles and responsibilities in relation to documents and institutional actors.
- FaBiO / FRBR-aligned concepts: applied to distinguish documents as abstract works, expressions, and concrete manifestations where required for bibliographic clarity.
- HiCO (Historical Context Ontology): reused to qualify interpretive statements and contextual assertions derived from historical sources.
- SAN (Scholarly Annotation Namespace): employed to model references and textual anchoring between documents and interpretive assertions.
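To make the division of labour between these vocabularies concrete, a minimal Turtle sketch might describe a letter and its sender as follows. The `vd:` namespace and all identifiers are placeholders, and the exact property choices may differ from the project's published data:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix fabio:   <http://purl.org/spar/fabio/> .
@prefix vd:      <https://example.org/vd/> .

# Placeholder identifiers; illustrative only.
vd:doc042 a fabio:Letter ;
    dcterms:creator vd:p001 ;
    dcterms:date "1839-03-12" .

vd:p001 a foaf:Person ;
    foaf:name "Mariano de Mattos" .
```

Here FaBiO types the document, Dublin Core carries descriptive metadata, and FOAF models the person, with each vocabulary used only for the role it was designed for.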
HRAO as a domain-specific extension
The Historical Relations of Archival Objects (HRAO) vocabulary is reused as a lightweight, domain-specific extension to express relations that are not sufficiently captured by general-purpose ontologies. HRAO introduces a small set of properties designed to model forms of agency, responsibility, and interaction typical of nineteenth-century archival correspondence.
HRAO is maintained in a separate repository and is referenced by the Varela Digital knowledge base as an external dependency. This separation avoids coupling project data too tightly to a single vocabulary, while enabling clear documentation of the semantic commitments introduced by relation modelling.
Data formats and access for reuse
To support reuse, the project publishes its semantic representations in standard, non-proprietary formats. The knowledge base is released as RDF/Turtle and JSON-LD, and is documented alongside the project ontology. These formats enable both human inspection and machine processing, without requiring specialised infrastructure.
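As a sketch of what machine processing "without requiring specialised infrastructure" can look like, the following reads a hypothetical JSON-LD excerpt with Python's standard library and resolves one relation by identifier. The context terms, identifiers, and property names are assumptions, not the project's published schema:

```python
import json

# Illustrative JSON-LD excerpt; context terms and identifiers are invented.
export = """
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": {"@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id"}
  },
  "@graph": [
    {"@id": "vd:p001", "name": "Mariano de Mattos", "knows": "vd:p002"},
    {"@id": "vd:p002", "name": "Domingos José de Almeida"}
  ]
}
"""

data = json.loads(export)
by_id = {node["@id"]: node for node in data["@graph"]}  # id -> node lookup
source = by_id["vd:p001"]
target = by_id[source["knows"]]  # resolve the relation via its identifier
print(source["name"], "knows", target["name"])
```

The same file remains readable by eye and loadable by any RDF-aware tool, which is the point of choosing non-proprietary serialisations.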
While no public SPARQL endpoint is provided in the pilot phase, the published datasets are designed to be compatible with external triple stores and analysis environments. This allows third parties to ingest, query, and extend the data independently of the project website.
Interoperability by design
By combining shared standards, stable identifiers, and a minimal extension strategy, Varela Digital prioritises interoperability as a design constraint rather than a post-hoc feature. The resulting data model supports alignment with external datasets, integration into broader research infrastructures, and reuse in future scholarly projects without requiring access to the original publication platform.
Current state and roadmap
Varela Digital is implemented as a functional pilot project. All core components required to support textual consultation, structured navigation, and data-driven exploration are in place and publicly accessible. The current state reflects a completed editorial and technical workflow rather than a conceptual prototype.
Implemented components
- Textual edition: a corpus of TEI/XML documents encoded from the printed Anais do Arquivo Histórico do Rio Grande do Sul, including structural markup, editorial notes, and inline entity anchoring.
- Document viewer: static HTML pages generated via XSLT, enriched with client-side JavaScript for annotations, navigation, and validation-oriented reading.
- Metadata-driven indexes: CSV-based metadata supporting browsing by document identifier, sender, recipient, date, and places of sending and reception.
- Standoff authority files: external TEI/XML files for persons, places, organisations, events, and relations, enabling controlled enrichment and disambiguation.
- Knowledge base publication: RDF/Turtle and JSON-LD exports documenting entities, relations, and ontological commitments, published as reusable artefacts.
- Visualisation interfaces: interactive map and network views (social, family, and institutional) generated from pre-processed datasets derived from the semantic layer.
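The metadata-driven indexes listed above can be consumed with nothing more than a CSV parser, which is what makes server-free browsing possible. A minimal Python sketch, with hypothetical rows and column names inferred from the browse facets (the actual file layout may differ):

```python
import csv
import io

# Hypothetical rows standing in for the published CSV index; column
# names are assumptions based on the browse facets described above.
rows = io.StringIO(
    "id,sender,recipient,date,place_sent,place_received\n"
    "doc042,Mariano de Mattos,Domingos José de Almeida,1839-03-12,Porto Alegre,Piratini\n"
    "doc043,Domingos José de Almeida,Mariano de Mattos,1839-04-02,Piratini,Porto Alegre\n"
)
index = list(csv.DictReader(rows))

def by_sender(name: str) -> list:
    """Filter document identifiers by sender, entirely client-side."""
    return [r["id"] for r in index if r["sender"] == name]

print(by_sender("Mariano de Mattos"))
```

On the website itself the equivalent filtering happens in client-side JavaScript over the same pre-generated files; no query endpoint is involved.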
Consolidated design decisions
Several architectural and methodological choices are treated as stable in the current version of the project. These include static web deployment (GitHub Pages), the absence of server-side components, the use of TEI/XML as the authoritative textual layer, and the publication of the knowledge base as downloadable files rather than through a live querying endpoint.
These decisions prioritise transparency, reproducibility, and long-term maintainability over dynamic data generation or real-time querying.
Planned extensions
Future work is envisaged primarily as incremental extension rather than architectural revision. This includes the expansion of the corpus, the enrichment of authority files, and the refinement of relation modelling as new historical questions emerge.
Potential technical extensions include the integration of document images through the International Image Interoperability Framework (IIIF), enabling alignment between textual encoding and visual access to manuscript or printed sources where available. Such an extension would introduce an image layer without altering the project’s existing editorial workflow.
Another possible development concerns the exposure of project data through a documented application programming interface (API). Rather than replacing the current static publication model, an API would function as an additional access layer, supporting programmatic reuse of metadata, authority files, and derived datasets.
Finally, future phases may explore the production of diplomatic transcriptions for selected documents, explicitly separated from the current encoded texts. Any such extension would be documented as a parallel editorial layer, allowing comparison between representational strategies without conflating distinct editorial aims.
Out of scope
The following elements are explicitly considered outside the scope of the current project phase: diplomatic transcription of manuscript sources; automated entity recognition; collaborative editing interfaces; and the provision of a public SPARQL endpoint or server-side API.
These exclusions are not interpreted as limitations, but as deliberate constraints that define the project’s focus and ensure coherence between editorial aims and technical implementation.