Current Issue

2026: 20.1

Preview Issue

2026: 20.2

Previous Issues

Indexes

Title
Author

ISSN 1938-4122

Announcements

DHQ: Digital Humanities Quarterly

Editorial

A Semantic Search Platform for Old Church Slavonic Texts and a case study on the Nicene – Constantinopolitan Creed

Marianna Napolitano <mariannanapolitano_at_unimore_dot_it>, , University of Modena and Reggio Emilia

https://orcid.org/0009-0005-1347-973X

Giovanni Puccetti <giovanni_dot_puccetti_at_isti_dot_cnr_dot_it>, , Institute of Science e Technologies of Information “A. Faedo” - CNR

https://orcid.org/0000-0002-8906-0987

DOI: pending

Abstract

Semantic retrieval based on Language Models embeddings is an established approach for information retrieval from large scale textual corpora, however, its application to digital humanities and specifically to ancient languages is still limited. In this work we develop a tool for the semantic retrieval of Old Church Slavonic and Church Slavonic texts and we present an investigation, carried out by experts in Slavic Philology and Slavic Orthodoxy, of the Nicene-Constantinopolitan Creed in these languages.

The main contributions of the work are two: (1) We describe the creation of a reference dataset for semantic retrieval in (Old) Church Slavonic, addressing key challenges such as the limited availability of digital corpora, fragmentation of sources, and character encoding inconsistencies; (2) We develop a tool for semantic retrieval in these languages.

By collecting feedbacks from the experts using the tool to study the Creed we find that it is reliable in supporting semantic analysis of the Creed and its major strengths are: (a) identifying expressions that are semantically related to the query based on context; (b) providing access to diachronic sources and enabling assessment of their impact on the translation and transmission of texts, particularly the Creed, which conveys the core principles of the Christian faith. The experts also identified some limitations, which will serve as a basis for future work and further improvement, particularly when querying texts from manuscript corpora. These issues highlight the need to: (a) enrich the dataset with manuscript sources using OCR and HTR techniques; (b) account for script variation and historical or regional orthographic differences. Our findings highlight how the tool proposed provides both a useful infrastructure for the study of (Old) Church Slavonic as well as a proof of concept for the application of semantic based information retrieval in ancient literature.

The tool is available on HORTUS, the ITSERR project website: ITSERR - Prototyping Lab

1. Introduction

The goal of this paper is to present a tool for the semantic analysis of texts in Old Church Slavonic and Church Slavonic. The need for a tool capable of enabling semantic analysis of texts in this historical language emerged within the framework of a collaborative project that aims to apply Natural Language Processing (NLP) techniques to religious texts in various languages, using the Nicene-Constantinopolitan Creed as a case study. This research is part of a broader scholarly initiative dedicated to the 1700th Anniversary of the Ecumenical Council of Nicaea [1] (325-2025).

The Creed is a doctrinal text agreed by bishops convened in council that conveys the fundamental tenets of the Christian faith. As a text regularly recited in the liturgy and quoted in catechesis it exerts a significant impact on the target culture. Accordingly, its semantic implications, particularly the theological significance of linguistic and terminological variation, are of primary relevance. As shown by Gezen’s analysis of the Creed in Istoriia slavianskogo perevoda Simvola very (published in 1884; second edition: 2015), the examination of manuscript evidence reveals a wide range of textual variation, which changes according to the historical period and the regional context. More specifically, Gezen’s study demonstrates not only the multiplicity of these variations, but also makes it possible to classify the Slavic translation of the Nicene-Constantinopolitan Creed into three distinct editions. Among these, the second edition survives in two different redactions which, according to the scholar, differ both in terms of text and script: one is written in Glagolitic characters, the other in Cyrillic. Furthermore, variation is also attested within the same redaction. This is the case, for example, with the third edition, in which both adjectives истиннагѡ (istinnago - true) and животворѧщаго (zhivotvoriashchago - life-giving) appear in the eighth article ([Napolitano 2025]), as will be discussed in greater detail in the following sections of this paper. Building on Gezen’s analysis, it emerges that up to the seventeenth century three different versions of the Creed could still be found in liturgical texts. This view seems to be confirmed by the comparative analysis aimed at assessing their linguistic and historical significance. Additional differences become evident when comparing, for instance, pre-reform texts other than those used by Meyendorff, particularly those originating from the Polish-Lithuanian Commonwealth, where Catholic influence was more prominent.

The present study is study thus originates from the intent to investigate the Slavonic translations of the Creed (Сѵмволъ вѣры Символ веры - Simvol very) and to explore the impact of different geographic areas and historical phases on the language’s transmission. A particularly important point of reference is the revision of the Creed undertaken during the liturgical reform of Patriarch Nikon, [2] with a specific focus on assessing the role of the Greek liturgical tradition in the implementation of these revisions.

Following a review of the relevant literature on NLP for Old Church Slavonic (OCS) and Church Slavonic in paragraph 2, the paper discusses the creation of a reference dataset for the semantic retrieval tool (paragraph 3). This phase is particularly complex due to the extreme scarcity of resources available for this language, the fragmentation of digitized texts across multiple unsystematized digital libraries, and the persistent issues related to character standardization, especially the widespread use of non-Unicode characters. The following paragraphs will present the methodology adopted for the development of the tool for the semantic retrieval (paragraph 4), not only from a technical perspective but also through the analysis of evaluation forms completed by a selection of expert scholars (paragraph 5). This testing group includes six researchers specializing in Slavic philology, many of whom are actively involved in digital humanities projects related to OCS and Church Slavonic manuscripts. Their feedback will serve to assess the perceived reliability of the tool’s output with respect to the primary objective of analyzing the Nicene-Constantinopolitan Creed, and to identify the strengths and limitations of this novel contribution to NLP applied to (Old) Church Slavonic.

2. Advances in NLP for Old Church Slavonic: Methods, Resources, and Open Problems

Despite the growing effectiveness of transformer-based text representations developed through large language models (LLMs), these techniques remain underdeveloped for historical languages such as Old Church Slavonic (OCS) and Church Slavonic, especially due to the lack of training data.

This scarcity of data has had a significant impact on the work conducted in the field of this project. In addition, the focus will be on diachronic information, which is of particular relevance for research on Old Church Slavonic and Church Slavonic.

The following pages will briefly present the most relevant projects and studies in the field of Old Church Slavonic and Church Slavonic research, in order to outline the context in which our tool has been developed. This overview also aims to highlight the current limitations and future challenges that must be addressed to consolidate LLMs in this field of study and to improve their results.

More generally, NLP for OCS and Church Slavonic continues to face severe constraints, primarily due to the absence of annotated datasets sufficient for training robust computational models. An additional obstacle stems from the limited number of historical linguists capable of providing expert human evaluation of model performances, which hinders rigorous qualitative assessment. This remains a critical bottleneck, despite recent advances that have improved performance in low-resource NLP through transformer-based architectures, including historical languages.[3]

To date, research efforts have primarily focused on two major historical Slavic varieties: Old Church Slavonic in the strict sense, and Old East Slavic (which represents the attested form used in English to refer to texts written in древнерусский - drevneruskii). The Universal Dependencies framework[4] and the Stanza toolkit[5] have played pivotal roles in providing both annotated corpora and processing tools for these languages.

Previous attempts to develop NLP resources for pre-modern Orthodox Slavic texts have employed a variety of strategies. Rule-based approaches to morphological analysis have been notably explored by [Baranov et al. 2007] and by #moldovan_etal who did this wily while at the Russian Academy of Sciences[6] Alternatively, tagging methodology from Modern Russian has been proposed for developing a new pipeline in this field.[7]

Current state-of-the-art approaches to automatic Part-of-Speech (PoS) tagging and morphological analysis for OCS have achieved results comparable to those in modern NLP applications. While many corpora offer either gold-standard dependency annotations or morphological tagging, some, like the PROIEL Treebank for Old Church Slavonic,[8] provide both.

Addressing this gap is essential for advanced syntactic analysis of historical Slavic texts. In the following, several projects that have implemented these techniques will be discussed.

The BogoSlov project presents a hybrid computational framework for identifying biblical quotations and allusions in OCS texts. The system addresses the challenge of detecting intertextual references that range from verbatim citations to paraphrased allusions. It combines 1) rule-based methods using string similarity and longest common subsequence metrics with 2) semantic methods leveraging embeddings from SentenceTransformers (Sentence-Bert or SBERT) and RoBERTa on corpora from the late antiquity and early middle ages. These methods are integrated into a CTS URN [9] - based model[10] for referencing source texts (primarily the Psalter and Gospels in OCS) and linking them to homiletic and hagiographic target texts such as Vita Constantini and Vita Methodii. The framework includes a graphical annotation interface to support expert-in-the-loop validation, visualization of alignments, and manual metadata enrichment. Designed to be extensible to other historical languages such as Latin or Greek, the framework integrates symbolic and neural approaches to support philologically sensitive annotation at scale.[11] In a related study,[12] the authors propose a mixed-methods, quantitative-qualitative framework for analyzing Church Slavonic texts, using a 16th-century East Slavic manuscript and a 1519 South Slavic printed SluzhebnikSlužebnik. Transcriptions were generated using Transkribus (HTR), followed by a quantitative analysis and traditional philological validation. This hybrid approach revealed differences across regional varieties and between manuscript and printed forms, showcasing the potential of NLP-philology synergies in historical Slavic research. The project forms part of a broader investigation of Orthodox Slavic linguistic varieties (15th - 18th centuries) at the University of Łódź, which uses corpus methods and automatic handwriting recognition to study vernacular-Church Slavonic interactions. The proposed processing workflow includes three components: 1) HTR post-editing and ground truth creation; 2) sentence-level classification with transfer learning models for temporal and regional attribution; 3) token-level annotation of linguistically salient features. The pipeline was evaluated on both HTR-derived and manually corrected corpora, with attention to systematic HTR errors, tool limitations, and accessibility for non-technical scholars. This modular system enhances the state of the art in historical Slavic NLP by integrating automatic correction, supervised classification, and fine-grained annotation in a reproducible and user-adaptable framework.

[Scherrer, Rabus, and Mocken n.d.] laid foundational groundwork with New Developments in Tagging Pre-modern Orthodox Slavic Texts, comparing a statistical tagger (TnT)[13], MarMoT,[14] which is used in particular to address problems with large tagsets, such as full morphological tagging, and a bi-LSTM character-level tagger (CLSTM) on annotated [Eckhoff et al. 2018]; [Eckhoff and Haug 2021]. Their findings show that modern tagging architectures can achieve accuracies up to 95–96% without orthographic normalization.[15] The study also explored cross-linguistic and cross-corpus transfer learning, demonstrating that synthetic or dialectal resources do not reveal substantial performance improvements over models trained on low resources historical languages. This underlines the importance of high-quality annotated corpora and establishes morphosyntactic tagging as a reliable foundation for future semantic modeling in premodern Slavic NLP.

Building on this foundation, more complex PoS tagging systems have been proposed. One of the most advanced systems to date is described in [Berdichevskii, Eckhoff, and Gavrilova 2016], which combines a statistical PoS tagger (TnT)[16] with a rule-based tagging module and a series of pre-processing steps. This hybrid model achieved an accuracy of 92.7% for part-of-speech tagging and 81.5% for morphological features. While these results are encouraging, recent advances in statistical and neural tagging models, as well as in normalization techniques, suggest that further improvements are possible. Subsequent work has built upon BEG16’s framework using the TOROT treebank, which has since been expanded, increasing both training data size and test set diversity. Additional sources, such as the Old Church Slavonic portion of the PROIEL corpus, have been included to broaden the linguistic basis of experimentation. Beyond the original use of the TnT tagger, more experiments have incorporated MarMoT and taggers based on deep neural networks. Both the PROIEL and TOROT datasets have been converted to the Universal Dependencies format, allowing for standardized PoS, morphological, and syntactic annotation: this is a key step for enabling cross-linguistic comparisons and annotation projection. In terms of evaluation, the field has moved beyond simple accuracy and Hamming distance. Micro-averaged F1-scores are being used to provide more granular insight into tagging performance.

Regarding character normalization, earlier approaches focused on lowercasing and removal of diacritics, ligatures, and orthographic variants. More recent experiments updated normalization routines and also explored tagging on unnormalized data, a scenario highly relevant for digital Paleoslavic studies. Efforts have been made to integrate external resources, such as the unannotated PLDR corpus (parallel Old Russian-Modern Russian) and the Modern Russian SynTagRus treebank in UD format, but these have not yielded substantial gains in accuracy.[17].

In parallel to these tagging efforts, the focus of research on OCS and Church Slavonic has increasingly shifted toward approaches using words and sentence embeddings. SuchThese approaches open for more novel downstream tasks such as text classification and embedding evaluation, semantic similarity and semantic retrieval. This paradigm shift also raises the issue of the representation’s quality automatically learned from historical language data.

Whithin In this framework, a major bottleneck for these downstream tasks lies at the tokenization stage. As [Dorkin and Sirts 2024] highlight, standard tokenization tools are poorly suited to the linguistic properties of Old Church Slavonic and Old East Slavic,. These languages, which frequently result in high out-of-vocabulary rates and unrecognized characters. Therefore, the development of custom tokenizers for specialized embedding models is essential for achieving meaningful representations in this domain.

Benchmark datasets and pre-trained models fine-tuned for language-specific tasks are notably sparse for historical Slavic languages. Embedding-based models, originally designed for applications in contemporary languages, often fail to align with the goals of diachronic linguistic research applied to historical languages. In fact, the main objective of diachronic research pursued so far, is to uncover patterns and mechanisms of orthographic and grammatical variation within the data, rather than to facilitate semantic question answering for accessing document content. In contrast, the focus of our study, as outlined in the following sections, is the development of a tool designed to retrieve sentences semantically related to a specific input query.

As [Lendvai et al. 2025] emphasize, techniques such as embedding similarity and retrieval augmented generation, a methodology originally developed by [Lewis et al. 2020], perform well on modern languages but are ill-suited for capturing the phenomena of interest in historical corpora. A further shortcoming of current approaches is the limited attention to temporal and geographic variation within historical Slavic corpora. This specific study also adopts an approach to diachronic linguistics, in which the focus lies in tracing orthographic and grammatical evolution over time rather than in fulfilling semantically-driven queries aimed at producing interpretative outcomes. To address this issue the authors also propose a classifier,[18] using manuscript metadata to assign spatial-temporal attributes, by both copying time and region at the sentence level. Building on this, [Lendvai et al. 2025] assembled a diachronically and dialectally heterogeneous corpus including South Slavic manuscripts from the 10th-11th centuries and East Slavic recensions from the 15th -17th centuries. This dataset enabled the exploration of linguistic influence between traditions. The analytical framework combined string-based representations, such as character n-grams and term frequency - inverse document frequency (TF-IDF)[19] , with computational similarity techniques, including sequence alignment, local string matching, and k-nearest neighbor (kNN) search based on cosine distance. These were systematically contrasted with neural methods using text embeddings derived from SBERT and BERT. Candidate ranking in retrieval tasks was performed via cosine similarity, providing a quantitative basis for evaluating both traditional and neural approaches to modeling linguistic variation.

Semantic similarity, which assesses how similar two sentences are, based on vectorial representations that encoded high-level meanings of the word, has become a major focus of recent research, especially in low-resource language [Deshpande et al. 2023]. Dense text retrieval with LLMs has emerged as a particularly promising avenue, though still relatively new. In particular, for LLMs to create vectorial representations of text that encode semantic meaning, they require training over large scale corpora, notable examples of this approach are the works from [Artetxe and Schwenk 2019] and [Reimers and Gurevych 2019]. Adaptation of these architectures to historical languages, with their unique orthographic variants and fragmentary documentation, remains an open and rapidly evolving research challenge.

As discussed above, in the scope of historical language, orthographic normalization must be approached with caution. Non-normalized forms often carry important information about temporal or regional provenance. These features are essential for capturing linguistic change and for understanding the sociocultural dynamics that underlie orthographic and morphosyntactic variation. Also for this reason, Lendvai’s study of the Vita of Paul and Juliana, a text with Church Slavonic and Old East Slavic versions, is a notable case study.[20] Here, sentences from the earlier recension were used as queries to retrieve aligned content from the later version, preserving historical variation while testing retrieval performance. The dataset for this experiment was created through meticulous manual alignment at the word level. Crucial to this work was the identification of parallel texts across different manuscript sources. This allows for reuse of metadata typically absent or inconsistently annotated in the original documents.

The attribution of texts to specific times and places has also been explored in a work by [Lendvai et al. 2023], who purposely fine-tuned BERT models. Their corpus includes six manually transcribed texts drawn from medieval manuscripts and early printed editions from Southern and Eastern Europe, dating from the 10th to 18th centuries. Dating and regional attribution were supported by codicological, paleographic, and linguistic criteria. All texts were written in Cyrillic and exhibit non-normalized orthographies typical of the religious, non-vernacular literary genre. The corpus spans a wide variety of Old Church Slavonic recensions shaped by distinct cultural influences, reflected in the orthographic, lexical, and morphosyntactic diversity of the material. A particularly salient phenomenon traced here is the so called Second South Slavic influence, which manifests in the 14th -15th century Rus’ scribal tradition through the incorporation of South Slavic stylistic norms into East Slavic manuscript culture.[21] BERT-based models[22] have been applied to sentence-level classification, segmenting texts into sentence-like units. For fine-tuning, the study employed publicly available models on Hugging Face, such as 1) bert-base-multilingual-uncased and Cyrillic-specific architectures like 2) KoichiYasuoka/bert-base-slavic-cyrillic-upos and 3) anon-submission-mk/bert-base-macedonian-bulgarian-cased. BERT achieved high accuracy in three attribution tasks: a) manuscript identification; b) dating by century attribution; c) regional classification, outperforming Cyrillic-specialized models. The study demonstrated the competitiveness of general-purpose multilingual transformers when fine-tuned on historical Slavic data. This is coherent with our findings, as described in our own case study from section 5. (Analysis of results).

The issue of text normalization remains unresolved, along with the need to develop analytical methodologies capable of working with non-normalized texts. Such approaches are essential to preserve the orthographic and stylistic peculiarities of the language, thereby enabling in-depth philological analysis that can engage in dialogue with semantic studies and their historical, theological, and cultural significance.

The following section explores the challenges related to data scarcity and text processing, with a particular focus on data collection aimed at developing a semantic retrieval tool for OCS and Church Slavonic. This development is part of the broader DaMSym project, which supports semantic similarity search not only for OCS and Church Slavonic but also for Latin, Greek, Arabic, Sanskrit, and a combined Latin/Greek multilingual configuration.

3. Data Collection

As previously discussed, the scarcity of texts in Old Church Slavonic and Church Slavonic has posed, and continues to pose, a significant challenge for the development of NLP tools for these languages, and more broadly, for the training of large language models (LLMs). This challenge can be traced to three main issues concerning the availability of plain-text resources:

Texts are scattered across various online repositories, stored in heterogeneous formats, and are not encoded using a consistent Unicode font, which makes their aggregated use difficult;
The editorial criteria adopted are often unclear, and there is no shared agreement on their accuracy or reliability;
Progress in OCR and HTR methodologies remains limited, and the amount of manual intervention still required to improve the performance of tools such as Transkribus and eScriptorium is too substantial to enable a large-scale creation of high-quality resources.

We accounted for these issues when collecting the diachronic dataset and when developing the semantic retrieval tool presented in this work.

For this reason, we initially chose to focus exclusively on Old Church Slavonic and Church Slavonic textual resources written in the Cyrillic script, thereby excluding those preserved in Glagolitic[23]. This decision is directly related to the approach taken in the development of the dataset and the selection of source texts. We retained the original transcriptions provided by the selected editions but opted not to include transliterations. Although the texts in question have already undergone editorial handling especially in terms of normalizationprocessing, we argue that preserving the original script allows for a more faithful and nuanced analysis, both linguistically and semantically.

This principle also informed our decision to exclude the Corpus Cyrillo-Methodianum Helsingiense (CCMH)[24] which was made available through the Language Bank of Finland (Kielipankki) and the CLARINO infrastructure.[25] While the CCMH corpus offers transliterated versions of Old Church Slavonic texts that may serve particular use cases, it does not meet the requirements of our framework for semantic modeling and linguistic annotation, which relies on maintaining the original Cyrillic representations.

Given these considerations, texts written in the Glagolitic alphabet were excluded due to the substantial challenges involved in standardizing such sources for qualitative analysis, particularly those arising from the use of Private Use Area (PUA) characters.[26] As previously noted, even working exclusively with Cyrillic-script materials has required the application of multiple Unicode fonts and, in some cases, the implementation of customized mappings between PUA and Unicode code points to ensure accurate text rendering. This process, which is essential for conducting reliable qualitative analyses, would become significantly more complex with the inclusion of additional scripts.

A further example of texts that were intentionally excluded, is the Orthodox portal Azbuyka,[27] particularly the section Богослужение – Священное Писание (Bogosluzhenie – Sviashchennoe Pisanie; Divine Liturgy – Holy Scripture). These texts represent modern editions of manuscripts reflecting liturgical language from approximately the 16th-17th centuries, a later phase in the historical development of Church Slavonic. However, they were excluded from the corpus due to the lack of reliable metadata[28], particularly concerning the precise dating and provenance of the editions.

Therefore, our current focus remains limited to Cyrillic sources for which precise metadata information could be retrieved. Once the primary issues related to PUA encoding are resolved, we intend to expand the dataset to include Glagolitic texts, enriching both its historical scope and linguistic depth in support of the broader objectives of this diachronic resource.

As previously mentioned, the inclusion of Cyrillic sources underwent further refinement based on the issue of Private Use Area (PUA) characters. For this reason, only texts with available and compatible fonts enabling accurate rendering were included. In the case of the Cyrillomethodiana,[29] in addition to the font itself, PUA-to-Unicode mappings were introduced to facilitate proper front-end visualization. These mappings were designed not only to support display within the semantic retrieval tool, but also to serve as the foundation for a Unicode character converter currently under development.

The mapping table was compiled through meticulous cross-referencing with the charts found in Unicode Technical Note #41 [30] specifically the sections detailing the Glagolitic Supplement and Cyrillic Unicode blocks, as defined in Unicode Standard Version 16.0.[31] Furthermore, for diacritical marks, comparative analysis was carried out with the Unicode implementation for Ancient Greek, using the IFAOGrec Unicode[32] resource as a reference. The map includes 43 characters, organized according to the following information: 1) appearance on website; 2) PUA – PUA Unicode; 3) Cyrillic Char.; 4) Cyrillic Unicode. Among these, 16 are diacritical signs and 23 are characters. Of the latter, 2 belong to the Glagolitic alphabet block: 1) U+2C51 - Glagolitic Small Letter Yat; 2) U+2C49 - Glagolitic Small Letter Out. Both have been added in Unicode version 4.1 (2005). and belong to the Glagolitic block of the Basic Multilingual Plane. [33]

In conclusion, the dataset used for the development of the semantic analysis tool is compiled from the following repositories and sources:

Cyrillomethodiana (uni-sofia.bg);
Syntacticus (syntacticus.org);
Old Russian Hagiographic Literature (spbu.ru);
a subset of the National Corpus of the Russian Language (ruscorpora.ru), specifically the texts available through Universal Dependencies;
a sample from the Ruthenian Corpus (UD_Old_East_Slavic-Ruthenian);
the Cyrillic manuscript transcriptions from the 11th century and the transcriptions of manuscripts from the Kazanskaya Collection, available through the manuscript.ru project.

In total the documents are: 683. The semantic retrieval carried out by the tool is performed on these texts. Following the principle that the text to be retrieved should not be contained in the documents on which the tool operates, among these there are four documents that are known to contain witnesses of the Creed.

4. Methodology

This chapter describes the technical aspects underlying the functionalities of the retrieval search system. Here, we describe the methodology used, the details of the text parsing and preparation, and the models used and the impact of these choices on the final tool.

4.1 Data Preparation

Each document contains the following fields:

Title;
Latin title;
Original title;
Language - categorized as follows: -
- a. Old Church Slavonic (OCS): 9th–11th centuries;
- b. Church Slavonic (CS): 12th–17th centuries (including regional redactions: Bulgarian, East Slavic, Serbian);
- c. New Church Slavonic (NCS): 18th century;
- d. Ruthenian / Ruska mova (Rut): 15th–18th centuries;
Century (10th–18th);
Geographic area;
Historical and regional variant;
Source,
Notes,
Content, which contains the full text in Old Church Slavonic and Church Slavonic.

The documents span a range of centuries and originate from different regions within the Slavic Orthodox cultural sphere. The accompanying metadata supports advanced filtering and semantic analysis.

Among the most useful metadata fields there are 1) Century;, 2) Area;, and 3) Source, which help contextualize the texts both chronologically and geographically. However, the primary focus of semantic processing is the Content field. These texts vary significantly in length and complexity. The average document contains approximately 7133 words, with a minimum of 5 and a maximum of 280,403 words. This wide distribution required a robust chunking strategy to make the data suitable for embedding and semantic search. The texts originate from a variety of repositories and digital libraries. As indicated above, a total of 6 unique source domains are represented in the dataset: this diversity introduces variability in formatting and textual conventions, a problem that is addressed through the use of each charset based on the input text.

4.2 Data Processing

To prepare the Old Church Slavonic and Church Slavonic textual data for semantic embedding and search, we adopted a chunking strategy aimed at segmenting long texts into semantically meaningful units. Chunking is crucial not only for complying with the token limitations of transformer-based models but also for enhancing semantic retrieval accuracy and contextual coherence.

We used a rule-based sentence segmentation approach using spaCy2’s multilingual blank model (spacy.blank("xx")). Since there is no dedicated model for Old Church Slavonic in spaCy, we added a lightweight rule-based splitter component to split the text using a punctuation-based methodology. To ensure manageable and semantically coherent segments, we implemented a chunking routine with the following configuration:

Target chunk size: 75 words
Minimum size: 50 words
Maximum size: 100 words
Overlap strategy:
- For chunks within size bounds, we attached 2 sentences before and after the chunk to provide contextual continuity.
- For oversized chunks, we used a sliding window to split the chunk into smaller sub- chunks, each followed by left/right context built from adjacent tokens.

Each chunk record includes:

The chunked text
The context_before (up to 2 preceding sentences)
The context_after (up to 2 following sentences)
All original metadata fields from the document source

This approach provides flexibility and ensures that both short and long passages retain enough semantic context to be effectively indexed and retrieved during search.

4.3 Model Implementation

To enable semantic retrieval, each chunk of Old Church Slavonic and Church Slavonic texts was converted into a high-dimensional numerical representation known as an embedding. These embeddings allow us to compare textual passages based not on exact keyword overlap but on semantic similarity, which is crucial for working with a historic and morphologically rich language like (Old) Church Slavonic.

Embeddings are central to our semantic pipeline because they allow us to:

Measure conceptual similarity between phrases, even if they differ in vocabulary.
Support natural language queries that retrieve thematically related content.
Enable fast vector-based search via cosine similarity in vector databases like Qdrant.

Initially, we tested bert-base-multilingual-uncased, a general-purpose multilingual model. However, during the evaluation phase, the results yielded low similarity scores and imprecise semantic matches. This indicated that the model lacked the representational nuance needed for the historical and liturgical nature of our corpus. As a result, we transitioned to jinaai/jina-embeddings-v3 (Sturua et al., 2024), one of the most downloaded and performant multilingual models currently available for text embedding. It is designed for high-quality retrieval use cases and better aligned with our requirement to embed non-standard historical scripts.

The cleaned and chunked dataset was processed in batches of 16 to optimize GPU memory usage. For each batch:

Text chunks were tokenized and fed into the Jina model.
The embedding vector was extracted from the model’s [CLS] token (i.e., outputs.last_hidden_state[:, 0, :]), cast to float32 to avoid type issues.
A custom text_to_display field was constructed using the pattern:
- Context_before <mark> Chunk </mark> Context_after
This format enhances readability and context comprehension in the frontend interface by highlighting the core chunk while preserving its semantic surroundings.
A payload dictionary was built for each chunk, containing all metadata from the original dataset and the newly created text_to_display field
Each entry was wrapped in a Qdrant PointStruct, assigned a unique UUID, and uploaded to the local Qdrant vector store.

In total, every chunk from the corpus was embedded, wrapped with metadata and context, and indexed in Qdrant. This embedding infrastructure provides the semantic backbone of the search engine, enabling linguists and scholars to explore Old Church Slavonic and Church Slavonic documents with precision and contextual awareness, even when queries do not directly match the original wording.

4.4 User Interface

Figure 1.

User Interface showing the results for a given query.

The User Interface provides the following key features:

Query panel
- Text area to input the search query
- Select boxes to choose the corpus and the corresponding model
- Similarity Score and Top-K Results inputs to let users define the strictness of the results. And enable filtering based on semantic proximity
- Search button to initiate the request
Filter Panel
- Dynamically populated autocomplete filters based on metadata returned from the backend .
- Filter and Reset buttons allow users to refine or reset their search easily.
Results Display Panel
- Results are shown as a list of box items, paginated in groups of 5.
- Each item displays the highlighted chunk of text using <mark> tags to emphasize the matched portion.
- An Accordion expands to reveal metadata for each result.

To ensure accurate visual rendering of Old Church Slavonic texts, the frontend applies specific fonts dynamically based on the domain listed in the Source metadata field. Since the documents were aggregated from various online repositories each relying on its own typographic conventions to represent extended Cyrillic and historical glyphs this approach preserves the visual fidelity of the original sources. The platform supports five custom fonts, each associated with a corresponding domain:

Custom Fonts	Platform Interface Label
github-ocs font for texts from syntacticus.org; UniversalDependencies/UD_Old_East_Slavic-RNC; UniversalDependencies/UD_Old_East_Slavic-Ruthenian	Syntactivus_
cb10u font for texts from histdict.uni-sofia.bg	Cyrillica_Bulgarian_Conv
menaion font for texts from manuscripts.ru	Menaion
AGIO font for texts from project.phil.spbu.ru	Agio

Table 1.

These fonts have been integrated into the platform : their identification and implementation in the tool were essential requirements to enable the correct visualization of results. In some cases, as already mentioned, the integration of Unicode fonts must be accompanied by the PUA - Unicode mapping, since only this combination ensures correct rendering also on the frontend. This approach has already been applied to the texts of the Cyrillomethodiana web portal, while it is still under development for other texts whose character maps are currently being defined. In its final version, As shown in the table (Platform Interface Label), tthe tool will include a specific filter allowing users to select the combination of Unicode font + PUA map - Unicode. [34] The font filter has been designed as a field that can be dynamically configured by incorporating fonts available online in open access, as well as additional mapping tables between Private Use Area (PUA) characters and Unicode characters created manually. Members of the project’s editorial board can access the dashboard and add new fonts by specifying the label designation, the source domain, and the file in one of the supported formats (.woff, .woff2, .ttf). Once saved, the font becomes selectable from the font list available prior to the text input field used for searchable queries.

This domain-aware font assignment ensures that special characters unique to Old Church Slavonic are correctly displayed in the user interface, enhancing readability and historical authenticity.

Figure 2.

List of fonts currently installed on the platform

5. Analysis of Results

[Feliksov 2018] has noted that the language of Orthodox doctrine, although well consolidated in liturgical practice and written tradition, lacks a structured grammar or a comprehensive descriptive model capable of accounting for the syntactic, semantic, and morphological peculiarities of its core terms. From his perspective, this absence leads to:

difficulties in the linguistic analysis of vocabulary: as Orthodox vocabulary contains semantically dense units deeply rooted in theological meaning, many of which do not easily conform to standard grammatical categories. The main challenges involve:
- the polysemy of key terms (e.g., благодать –blagodat- grace),
- their strong interdependence on liturgical context,
- the lack of direct correspondences in modern and everyday language.
difficulties in grammatical and semantic encoding: the current linguistic description of Orthodox lexicon is fragmented and often limited to encyclopedic dictionaries or glossaries, which typically do not offer coherent morphological or syntactic analysis. This creates obstacles for structured teaching and for computational formalization (e.g., annotation, parsing, semantic processing) of this type of vocabulary.

To address these issues, a semantic approach is proposed, that aims to describe the complexity and richness of the vocabulary of Orthodox faith in Church Slavonic, by incorporating:

analysis of syntactic roles within the sentence,
discourse function in liturgical usage,
semantic and theological relations among terms.

The linguistic methodology underpinning this study is based on the ontological theory of language, grounded in the ideas of Orthodox doctrine on being and the word, as developed by [Florenskii 1990]; [Losev 1993]; [Kamchatnov 1998]; [Bulgakov 2007]; [Aksakov 2011] and [Postovalova 2022]. Central to this theory is the concept of the “eidosphere”, understood as an instrument to designate the set of lexical and phraseological units, as well as non-predicative (nominative) word combinations "generated" by a given eidos through the act of naming. The study adopts the term “lexico-eidetic group” (LEG), as introduced by Kamchatnov. This notion is rooted in the Christian intuition that God “creates simultaneously both the thing and its idea, in their indivisibility and inseparability” [Kamchatnov and Nikolina 2008, 37].

The ontological approach to language fundamentally relies on the deductive method of semantic, accordingly, the first step toward constructing a semantic classification of Orthodox faith vocabulary should consist in a systematic description of the “world” of eidē that constitute the content of the eidosphere of “Orthodox Faith.”

Our work on the study of the semantic value of sentence and word is based on a different methodological premise, although it aims to capture the polysemy of specific terms and to investigate the diachronic evolution of Orthodox vocabulary.

First and foremost, our research aims to analyze individual sentences starting from the context in which they appear, thus proceeding from semantic retrieval at the phrasal level. Additionally, the research is conducted with reference to a specific text, namely the Nicene-Constantinopolitan Creed, a brief doctrinal and liturgical formulation that conveys the core principles of the Christian faith.

Gezen, as previously mentioned, in his book Istoriia slavianskogo perevoda Simvola very Istoria slavianskova perevoda Simvol veri (published in 1884; second edition: 2015), notes that the Slavic translation of the Creed predates the missionary work of Saints Cyril and Methodius, although its final redaction could only occur after the invention of the Slavic alphabet. As a text regularly recited during the liturgy, the Creed has exerted a profound influence on the target culture. Therefore, due to its catechetical role, the semantic implications of its vocabulary, particularly the theological significance of linguistic and terminological variants, are of special relevance.

This research, consequently, is motivated by the intention to trace the influence of the Greek liturgical tradition on the Slavonic versions of the Creed, and to study their reception across different regions and historical stages in the development and use of the language.

The research sets out to examine two key questions:

the ability of the tool to retrieve phrases that are semantically related to selected articles from the Nicene-Constantinopolitan Creed;
the tool’s ability to provide useful references for analyzing the textual revisions introduced in the Creed during the liturgical reform of Patriarch Nikon (1653-1656).

These revisions, which aimed to restore consistency between the Russian Orthodox liturgical tradition and the Byzantine tradition, involved not only orthographic corrections, but also terminological changes whose historical and theological significance has been widely discussed in the scholarly literature.

For this reason, the presentation of results will proceed in two parts:

The first section will be based on an in-depth analysis of two selected articles from the Creed and will address both objectives.
The second section will examine the results of user tests conducted independently by six scholars and professors with expertise in the field of Slavic studies and digital humanities.

These tests will focus on:

the functionality and usefulness of the tool
the accuracy of its outputs in relation to the self-selected article and version of the Nicene-Constantinopolitan Creed.

5.1 Analysis of Results - Section One

A particularly significant example, from this perspective, concerns the verbal tense used for the verb “to be.” In the seventh article of the Creed, which refers to the Kingdom of God, the original Slavonic formulation нѣсть конца несть конца (nest’ kontsa – “there is no end”) was revised during Patriarch Nikon’s liturgical reform to не будет конец (ne budet konez – “there will be no end”).

This modification implied a theological shift: the Kingdom of Christ was no longer understood as an eternal reality already inaugurated through His incarnation, crucifixion, and resurrection, but rather as a reality postponed to the future. The pre-reform version of this article corresponds to the formulation found in the earliest Slavic recension of the CreedSymbol of Faith #gezen1884.

A comparison of the results returned by the tool for the queries “нѣсть кѡнца несть конца” and “не бꙋдетъ не будет конец” reveals important insights into how the linguistic retrieval system operates, particularly its ability to function on both formal and semantic levels.

In the case of the first query, нѣсть кѡнца несть конца (Table 1), the tool does not detect any literal occurrence of this phrase in the examined texts. However, it correctly retrieves a semantically equivalent expression - не будет кѡнца конца - attested in the Codex Laurentianus (12th century, Kyiv Pechersk Lavra, Old Church Slavonic).

The fact that the tool does not retrieve the expression is rather unexpected, given that the dataset is known to contain at least four versions of the Creed in its pre-Nikonian form that include the expression “нⷭ҇ѣ кѡнца̀”. It is not possible to determine with certainty whether these are the only instances present in the dataset; however, even in this case, the absence of any reference to these texts among the results must be regarded as a limitation of the tool. This issue clearly underscores a weakness of the system, confirming the findings that also emerged from user feedbacks. Several factors may explain the observed limitation. First, it is important to emphasize that the goal of the tool is to retrieve semantically related sentences, rather than sentences that merely correspond at the linguistic level. This objective has significantly influenced the preparation of Old Church Slavonic texts for embedding and semantic retrieval, leading to the adoption of a chunking strategy aimed at segmenting longer textual passages into semantically meaningful units. Such an approach was necessary both to comply with the token limitations of transformer models and to ensure semantic accuracy, as well as coherence with the broader textual context.

However, although there is a grammatical difference between the two phrases (present vs. future tense), both convey the same eschatological meaning: the eternity of Christ’s Kingdom. The fact that the tool retrieves a morphosyntactic variant that does not exactly match the query demonstrates its ability to go beyond literal pattern matching, incorporating a semantic component capable of identifying differently formulated yet semantically equivalent expressions.

This pattern is also confirmed in the second query, не бꙋдетъ кѡнец, (Table 2), where in addition to retrieving the exact textual match (again in the Codex Laurentianus), the system also identifies other doctrinally relevant passages, even in the absence of the explicit phrase - such as those found in the Poouchenie za spasenieto na dushata (14th century, Russia) and the Dialogi (15th century). In both cases, the retrieved texts address similar eschatological themes, the eternal Kingdom, judgment, and salvation, although these are formulated paraphrastically, e.g. “царство, уготованное от созданїѧ мирацарство, уготованное от создания мира” (“the Kingdom prepared from the foundation of the world”) or “црⷭ҇твовати вѣчно и бесм҃ртно (“to reign eternally and immortally”).

As a result, it is evident that the tool can retrieve sentences that are semantically related to the initial query, based on contextual similarity. In some cases, however, the tool is also capable of detecting literal similarity. These capabilities are further supported by the qualitative assessment made by specialists in the humanities.

Such behavior proves particularly valuable in the analysis of religious or patristic texts, where doctrinal content is often conveyed through variable formulas that remain thematically recognizable. The tool demonstrates suitability for scholars seeking semantically related results. Nevertheless, literal matches are preserved as well, since the tool does not impose any antinomy between them. This feature allows for its implementation in philological, theological, and diachronic lexicographic research.

The replacement of the formula нѣсть кѡнца несть конца with не бꙋдетъ кѡнца in the seventh article of the Symbol of FaithCreed, implemented during Patriarch Nikon’s liturgical reform, does not constitute a mere morphosyntactic variation, but rather a profound doctrinal shift: a transition from an understanding of the Kingdom as already present and active in history to a future and deferred conception. In this context, the use of automated semantic query tools, such as the one under examination, offers a novel and concrete perspective for verifying how and where these two formulations are actually documented in Slavonic texts, and which of the two is attested across different historical periods and ecclesiastical settings.

The tool’s main strength, in this regard, lies in its ability to recognize and connect semantically related expressions, even when they exhibit morphological or syntactic variation. Starting from the query нѣсть несть кѡонца, the pre-Nikonian form, the system retrieves не бꙋдетъ кѡнца, attested in an authoritative text such as the Codex Laurentianus (12th century), thereby revealing that the future-tense formulation was already in use well before the reform, albeit not yet canonized liturgically. This information, which emerged precisely through automated retrieval, challenges the narrative of an absolute rupture introduced by Nikon, confirming instead a process of selection and formalization of already existing variants.

In the eighth article of the Creed, the removal of the adjective истиннагѡ истиннаго (istinnago - “true”) referring to the Holy Spirit is regarded by some scholars as the only truly significant revision introduced by Patriarch Nikon. [Shakhov 1997], in his study analyzing the Old Believers’ stance on the revisions enacted through Nikon’s liturgical reform, emphasizes that the Old Russian translation, which included the adjective “true” as a descriptor of the Holy Spirit, was intended to convey the theological assertion that the Spirit shared equal status with the other persons of the Trinity. Nikon’s decision to omit the adjective thus undermined this assertion, subordinating it to the objective of achieving grammatical and stylistic conformity with the Greek version of the Creed.

A comparison of the tool’s output in response to queries based on the formula “И въ Дꙋха Свѧтаго, Господа животворѧщаго, иже от Отца исходѧщаго И в Духа Святаго Господа животворящаго иже от Отца исходящаго - I v Dukha Sviatago, Gospoda zhivotvoriashchago, izhe ot Ottsa iskhodiashchago - And in the Holy Spirit, the Lord, the Giver of life, who proceeds from the Father” (Table 3) yields relevant data regarding the dissemination, reformulation, and lexical stability of this article of the Symbol of FaithCreed within the Slavic Orthodox tradition. In none of the analyzed texts, dating from the 12th to the 15th century, does the formula appear in its complete canonical form. Nonetheless, the tool makes it possible to trace the gradual appearance of its constituent elements across various liturgical, homiletic, and theological contexts.

In Table 3, for instance, the word Господа (“Lord”) is consistently attested with reference to Christ, but not to the Holy Spirit, while the term истиннагѡ истиннаго (“true”) appears only once, and likewise refers to the Son. The component животворѧщаго животворящаго (“life-giving”), which is central to the pneumatological articulation of the Creed, is entirely absent, except perhaps in implicit or descriptive form.

A similar pattern emerges in Table 4, where terminology related to the Holy Spirit appears more vaguely and fragmentarily: for example, in the Uchitelno evangelie Учително евангелие (12th century), the Holy Spirit is referred to as д͠хъ с͠тыи ( dukh sviatyi “holy spirit”), and His withdrawal is described as marking the end of prophecy. However, neither the title “Lord” nor the epithet “life-giving” is used. Only in the later liturgical text Oustav bozh(ʹ)estvennyia sluzhby sviatago apostola Iakova, which appear as result in both searches, does a formulation occur that is nearly complete and semantically equivalent to that of the Nicene-Constantinopolitan Creed: here, the Spirit is referred to as животворѧщимъ… дꙋхомъ (“life-giving... Spirit”) and is explicitly associated with the Father and the Son in a solemn liturgical context that aligns with Orthodox Trinitarian doctrine.

From a methodological standpoint, the qualitative analysis conducted by a humanities specialist indicates that the tool employed demonstrates a solid capacity to recognize semantically equivalent or related structures, even when the queried formulation is not attested in a literal form. Its accuracy is particularly evident in the identification of the 17th -century liturgical source, but also in its ability to retrieve elements dispersed across time and space. This provides a valuable foundation for investigating the historical development of the requested formula, as well as the ways in which it was received and adapted in textual and liturgical practice.

The research shows that the adjective истиннагѡ истиннаго (“true”) is rarely attested in the examined texts and, when it does appear, it is applied to the Son or to God in a general sense, not to the Holy Spirit. The tool accurately highlights this absence, demonstrating that the complete formulation with истиннагѡ истиннаго referring to the истиннагѡ referring to the дꙋхъ (Spirit) is neither widespread nor stable within the pre-reform tradition. Consequently, Nikon’s revision can be interpreted not only from a theological perspective but also as a redactional normalization of an expression that had little support in the earlier liturgical corpus.

At the same time, the tool demonstrates the ability to recognize partial, semantically or liturgically related formulations, identifying fragments that testify to the progressive doctrinal elaboration of the formula. In particular, the retrieval of the sentence “и животворѧщимъ твоимъ Дꙋхомъ и животворящим твоим Духом - i zhivotvoriashchim tvoim Dukhom - and with your life-giving Spirit” only in late liturgical texts (17th century) may help in the definition of a concrete mapping of the moment when the formula became stabilized in its new-canonical version. On this basis, a more comprehensive study could be undertaken in the future, extending the scope to all articles of the Creed, as well as to other texts, in order to analyze their relationship to the broader historical context that may have influenced both their translation and the revisions introduced in the 17th seventeenth century. Such research would naturally align with the previously mentioned studies devoted to the historical and theological significance of Patriarch Nikon’s reform.

Query	Results	Exact text: “несть конца”	Equivalent Expression	Semantic Relevance
и паки грядущаго со славою судити живымъ и мертвымъ, Егоже Царствию несть конца.
	84 и ѡтсуду иꙁбавлѣнью сподобимъсѧ . и тамо вѣчнꙑꙗ̇ жиꙁни наслѣдници будемъ небоно дх҃овноѥ̇ подвиꙁаньє или на супротивнаго побѣдѣ страⷭ҇ю̇ . и смр҃тью бꙑваѥть стражющеи и оумерщьвѧюще оудꙑ телѣснꙑꙗ хⷭ҇а ради ѥдинодш҃ьно побѣдимъ супротивьнаго . тѣмьже всѧкꙑ скерби и напасти . и въстаньѥ вселукаваго . нѣ въ болѣꙁнехъ имѣти . но въ доброврѣмиѥ трепѣниꙗ желающе. 85 Title: Сборник поучений Ефрема Сирина Сборник поучений Ефрема Сирина ("Паренесис Ефрема Сирина") 86 Century: 13th; 87 Source: http://manuscripts.ru/mns/main?p_text=86892772 Language: Church Slavonic	✘ No	✘ No	Talks about resurrection, suffering, and judgment: "и тамо вѣчнꙑꙗ̇ жиꙁни наслѣдници будемъ""and there we shall become heirs of eternal life"
	88 тогда жалостью въскричитьвсѧка дш҃а и всѧкъ чл҃вкъ годъѹпустившє и ѡⷮвѣщаѥть имъправєдныи судии ижє ны многотєрпѣ идѣтє проклѧтии ѡⷮ мєнєвъ огнь вѣчныи пакы жє къст҃мъ грѧдѣтє блгⷭ̄нии о҃ца моѥгопридѣтє въ ѹготовано вамъцрⷭ̄тво ѡⷮ съставлєниꙗ мира всєго.ѥмужє слава и вєлєлѣпьѥ чєстьжє и миръ и блгнⷭ̄иѥ ѡⷮ всєꙗтвари съ прєст҃ымь оц҃мьи животворѧщимь дх҃омь нынѧи приⷭ̄. 89 Original_title: Поучение за спасението на душата; 90 Century: 14th; 91 Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_53 92 Language: Church Slavonic	✘ No	✘ No	Refers to the kingdom “prepared since the foundation of the world”, but no mention of unending reign."придѣтє въ ѹготовано вамъ црⷭ̄тво ѡⷮ съставлєниꙗ мира всєго""Enter into the kingdom prepared for you from the foundation of the world"
	93 не будет конца сица же будть мч҃нья иже не вѣруеть къ б҃у нашему іс҃у хсу мч҃ми будут в огни иже сѧ не кртсить и се рекъ показа володимеру запону на неиже бѣ написно судище гсне показываше ему ѡ десну прв҃дныя в весельи предъидуща в раи а ѡ шююю грѣшники идуща в муку володимеръ же вздохнувъ реч добро симъ ѡ десную горе же симъ ѡ шююю ѡнъ же реч аще хощеши ѡ десную съ првд҃нми стат 94 Title: The Primary Chronicle, Codex Laurentianus 95 Century: 12th; Source: https://github.com/torottreebank/treebank-releases/blob/master/lav.conll 96 Language: Old Church Slavonic	✘ No	✔ "не будет конца"	Direct future-tense equivalent of “несть конца” — “there will be no end”. Fully aligns with the phrase.

Table 2.

Text of the Nicene – Costantinopolitan Creed is from Sledovannaya Psatir'- 1653 (For further information on the transcription, see [Napolitano 2025]).

Query	Results	Exact text "не будет конец"	Equivalent Expression	Semantic Relevance
И паки грядущаго со славою судити живым и мертвым, егоже царствию не будет конец.
	97 тогда жалостью въскричитьвсѧка дш҃а и всѧкъ чл҃вкъ годъѹпустившє и ѡⷮвѣщаѥть имъправєдныи судии ижє ны многотєрпѣ идѣтє проклѧтии ѡⷮ мєнєвъ огнь вѣчныи пакы жє къст҃мъ грѧдѣтє блгⷭ̄нии о҃ца моѥгопридѣтє въ ѹготовано вамъцрⷭ̄тво ѡⷮ съставлєниꙗ мира всєго.ѥмужє слава и вєлєлѣпьѥ чєстьжє и миръ и блгнⷭ̄иѥ ѡⷮ всєꙗтвари съ прєст҃ымь оц҃мьи животворѧщимь дх҃омь нынѧи приⷭ̄. 98 Title: Поучение за спасението на душата; 99 Century: 14th; 100 Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_53; 101 Language: Church Slavonic;	✖ No	✔ Reference to the return of Christ and to the Kingdom prepared "от создания мира"“from the foundation of the world”	Semantically related (Second Coming, Judgment, Kingdom)
	102 не будет конца сица же будть мч҃нья иже не вѣруеть къ б҃у нашему іс҃у хсу мч҃ми будут в огни иже сѧ не кртсить и се рекъ показа володимеру запону на неиже бѣ написно судище гсне показываше ему ѡ десну прв҃дныя в весельи предъидуща в раи а ѡ шююю грѣшники идуща в муку володимеръ же вздохнувъ реч добро симъ ѡ десную горе же симъ ѡ шююю ѡнъ же реч аще хощеши ѡ десную съ првд҃нми статTitle: The Primary Chronicle - Codex Laurentianus; Century: 12th; Source: https://github.com/torottreebank/treebank-releases/blob/master/lav.conll Language: Old Church Slavonic;	✔ не будет конец		Partially related (Judgment, salvation/damnation)
	103 по настающиⷨ вѣцѣ. нетлѣннїи и бесм҃ртнии. въстающе прⷭ҇но пребываеⷨ. оле саⷨ тъ ѹбо всѣⷨ. исконныи хытрець и строитель. къ ѡ҃цѹ пакы дошеⷣ не пребѹдет ли. какоⷤ҇ и црⷭ҇твовати вѣчно и бесм҃ртно съ ниⷨ чл҃вкоⷨ ѡбѣщаваеⷮ. а саⷨ того лишаеⷨ. и кончаниеⷨ црⷭ҇тво ѿлагаѧ по сѹемысльныиⷨ. но ѡстани сѧ беꙁѹмиѧ того молю тѧ. обличиⷮ бо ихъ въ древнїиⷯ дв҃дъ ѡ х҃ѣ поа ц҃рьствѣ. исповѣдѧть ти сѧ гⷭ҇и всѧ дѣла твоа. 104 Title: Диалози; 105 Century: 15th; 106 Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_32; Language: Church Slavonic;	✖ No	✔"црⷭ҇твовати вѣчно и бесм҃ртно"	Semantically related: eternity of the Kingdom

Table 3.

ext of the Nicene – Costantinopolitan Creed is from Acts of the 1654, 1655, 1656 Councils (For further information on the transcription, see [Napolitano 2025])

Query	Results	Exact word “Господа”	Exact word “истиннаго”	Exact word “животворящаго” / semantico equivalente	Equivalent expression	Semantic Relevance
И въ Духа Святаго Господа истиннаго и животворящаго, иже от Отца изходящаго
	107 тѹне и дадите ѡ сихъ бо и самъ г҃ь рече вѣрѹꙗи въ мꙗ дѣла ꙗже азъ творю и тъ сътворитъ и больша тѣхъ нъ о блаженаꙗ страстотьрпьца хр҄сва не забꙑваита отьчьства идеже пожила ѥста въ тели ѥгоже всегда посѣтъмь не оставлꙗета тако же и въ мл҃твахъ вьсегда молита сꙗ о насъ да не придеть на нꙑ зъло и рана да не пристѹть къ телеси твоѥмѹ и рабъ ваю вама бо дана бысть бл҃годать да молита 108 Title: Uspenskij sbornik; 109 Century: 13th; 110 Language: Church Slavonic; 111 Source: https://github.com/torottreebank/treebank-releases/blob/master/usp-sbor.conll	✔ “г҃ь” (“Господь”)	✘ No	✘ No		Reference to Christ and to prayer, but no specific mention of the Holy Spirit or the attributes “true” (истиннаго) and “life-giving” (Животворящаго)
	112 и оутвръждаѧй въсѧ вѣроуѫщѧѧ къразоумоу неблазньномоу и исповѣданїꙋ опасномоу и слоужбѣблагочьстивѣй и покланѣнїꙋ доуховномоу и истинномоубога и ѡтца и єдинороднааго єго сына, господа и боганашего Ісоуса Христа, и своемоу коемоуждо именю именꙋемомоу187свойство ꙗвѣ намь благорасѫждаѫщоу и о коемждоѡт именоуемыих въсѣко нѣкымь изрѧднымь свойствомьблагочьстивѣ зримо, ѡтцꙋ оубо въ своиствѣ ѡтчьстѣмь,сыноу же въ свойствѣ сыновнѣмь, свѧтомомоу же доухоувъ своемь свойствѣ, ни свѧтомоу доухоу ѡт себе глаголѧщоу,ниже сыноу ѡт себе что творѧщоу и ѡтцꙋ оубопосилаѫщоу сына, сыноу жеOriginal_title: Похвално слово за Йоан, епископ Поливотски; 113 Century: 15th; 114 Language: Church Slavonic; 115 Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_240	✔ “господа… Ісоуса Христа”	✔ “истинномоу”	✘ No		It refers to Christ, not to the Holy Spirit. ‘Истинному’ is used in reference to God and the Son, not to the “Духа” (Spirit).”
	116 и тѣлеса,ꙗкѡ да достойни бꙋдемь причастницы и ѡбещницы пречистымьтвоимь тайнамь во ѡставленїе грѣхѡв и в жизньвѣчнꙋю –Возгласъ: Благодатїю и щедротами и человѣколюбїемединороднагѡ сына твоегѡ, с нимже благословеньеси с пресвѧтымъ и благимъ и животворѧщимъ твоимъдоухомь нынѣ и приснѡ.Молитва: Свѧтый, иже во свѧтыхъ почиваѧй,господи боже нашъ, ѡсвѧти нась словомь твоеѧ благодатии наитїемъ пресвѧтагѡ доуха, самь бо рекль еси: Свѧтибꙋдите, ꙗкѡ азъ свѧть есмь, господь богъ вашь. 117 Title: Оуставъ бож(ь)ственныѧ слꙋжбы свѧтаго апостола Іакѡва, брата господнѧ; 118 Century: 17th; 119 Language: Church Slavonic; 120 Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_210	✔ “господи боже нашъ” / “господь богъ”	✔ No	✘ No	121 “животворѧщимъ твоимъ доухомь”	Direct correspondence: this passage explicitly refers to the Holy Spirit as “life-giving”, in a liturgical context consistent with the Creed.“с пресвѧтымъ и благимъ и животворѧщимъ твоимъ доухомь”

Table 4.

Text of the Nicene – Costantinopolitan Creed is from Sledovannaya Psatir'- 1653

Query	122 Results	Exact word “Господа”	Exact word “животворящаго” / equivalente	Reference to “Holy Spirit”	Semantic relevance
И в Духа Святаго Господа животворящаго иже от Отца исходящаго, иже со Отцем и Сыном спокланяема и сославима, глаголавшаго пророки.
	123 ꙗко бѣ благодать ѥго отѧта. сего ради рече. не оубо бѣ д͠хъс͠тыи данъ. небѣ бо оуже пророкъ въ нихъ. ниприꙁираше благодать с͠та ихь.и понѥже оуже оудрьжалъсѧ бѣ д͠хъ с͠тыи.49cхотѣаше бо богатьно благодатьиꙁлити. сего жераꙁдааниꙗ начатъкъ. по пропѧтии бысть.сего ради рече ꙗко ї͠с нѣ оу бѣ прославленъ. славоуже пропѧтьꙗ е наричеть. 124 Title: Учително евангелие 125 Century: 12th; Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_154; Language: Church Slavonic	✘ No	✔ “д͠хъ с͠тыи”	✔ Semantically related – the passage explicitly refers to the Holy Spirit, focusing on the theme of divine grace	126 ✘ Partial – correct context, but without the required terms
	127 и̇просвьтет сѧ ꙗ̇ко-и̇ сл͠нце. а̇. 2грѣшници бѫⷣть тьмни, и̇ пакӹрѣхь г͠и. вси ли крьщенӹи̇ вь е̇динѫ мѫкѫ грѫⷣть. цр͠ие̇ и̇ патриа̇рс̋и, и̇ кнѧꙁӹ и̇ богатӹи̇ и̇ ꙋбоꙁӹи̇. слӹши праведнӹ їѡⷡ҇анє.ꙗ̇коже реⷱ҇ пророкь дав͠дь. трьпѣниє̇ оу̇б͠гӹхь не погӹбнетьдо конц̋а. а̇ патриа̇рс̋и и̇ кнѧꙃїи̇ ꙋбоꙁӹ и̇ богатїи̇. погӹбнѫтьꙗ̇ко-и̇ скоть, и̇ вьсплачѫть сѧ.ꙗ̇ко-и̇ младенц̋и. 128 Title: Berlinski sbornik 129 Original_title: Берлински сборник 130 Century: 14th; Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_173; Language: Church Slavonic	✘ No	✘ No	✘ No	✘ Partially related – the eschatological context is compatible, but there is no explicit reference to the Holy Spirit or to the Trinitarian elements of the Creed.
	131 и тѣлеса,ꙗкѡ да достойни бꙋдемь причастницы и ѡбещницы пречистымьтвоимь тайнамь во ѡставленїе грѣхѡв и в жизньвѣчнꙋю –Возгласъ: Благодатїю и щедротами и человѣколюбїемединороднагѡ сына твоегѡ, с нимже благословеньеси с пресвѧтымъ и благимъ и животворѧщимъ твоимъ доухомь нынѣ и приснѡ. Молитва: Свѧтый, иже во свѧтыхъ почиваѧй,господи боже нашъ, ѡсвѧти нась словомь твоеѧ благодатии наитїемъ пресвѧтагѡ доуха, самь бо рекль еси: Свѧтибꙋдите, ꙗкѡ азъ свѧть есмь, господь богъ вашь. 132 Title: Оуставъ бож(ь)ственныѧ слꙋжбы свѧтаго апостола Іакѡва, брата господнѧ; Century: 17th; Source: https://histdict.uni-sofia.bg/textcorpus/show/doc_210; Language: Church Slavonic;	✔ “господи боже нашъ”	✔ “животворѧщимъ…доухомь”	✔	✔

Table 5.

Text of the Nicene – Costantinopolitan Creed is from Acts of the 1654, 1655, 1656 Councils

5.2 Analysis of Results – Section Two

133

The analysis of responses to the structured test shows that the tool developed for semantic retrieval of sentences in (Old) Church Slavonic was evaluated positively in terms of its core functionality and its usefulness for specialized research. Nonetheless, it still shows room for improvement, particularly in the conceptual coherence of its results. Participants conducted their queries based on various versions of the Nicene-Constantinopolitan Creed, including the Sledovannaya Psaltir’ of 1653, an early manuscript witness (Zogh. 105, f. 141r), the modern version taken from Wikipedia, the PSGP.ru website, as well as a reformulation recalled from memory.

134

When asked to evaluate the semantic relevance of the top 10 retrieved results, the majority of participants gave positive ratings: three users reported 7 out of 10, one reported 6 out of 10, while two others gave lower scores (3 out of 10 and 2 out of 10). In at least three cases, users acknowledged the system’s ability to identify semantically related, though not always central, results. The comments and feedback provided by the participants are summarized in the table below, along with an analysis based on the tool’s characteristics and stage of development.

135

The comments collected through the questionnaires (see the table below for details) indicate that the proposed functionalities are regarded as innovative, despite the limitations highlighted in this feedback and discussed throughout the article. From the perspective of advancing research in NLP and (Old) Church Slavonic, the results can be considered positive based on the collected opinions. Future work is expected to address the identified issues, both in terms of the tool’s functionality and the application of NLP techniques to Old Church Slavonic texts. Overall, the structured testing confirms that the tool is already capable of providing semantically coherent results, useful for philological and theological inquiry, even in the presence of complex linguistic variants. Its strength lies in the combination of broad retrieval coverage and conceptual recognition, while the observed limitations offer pathways for further development rather than indicating structural flaws. The system thus presents itself as a promising platform that can significantly contribute to research on premodern Slavic texts, provided it is supported by a clearly structured user guide tailored to the needs of specialist users.

136

The issue raised in the test by the response“The tool gave plenty of answers but some of them seemed not related to the original quotation (И въ Д ꙋ ха Свѧтаго, Господа истиннагѡ и животворѧщаго, иже от Отца исходѧщаго. Иже со Отцемъ и съ Сыномъ спокланѧема и сславима, глаголавшаго пророки.И въ Духа Святаго Господа истиннаго и животворящаго, иже от Отца изходящаго. иже со Отцемъ и съ Сыномъ съ покланяема и съ славима, глаголавшаго пророки.)” is of our particular interest, as the evaluations were conducted independently and at different times by the author of the this paper and by external academic participants. This convergence reinforces the necessity, already highlighted above, of complementing the tool with a methodological vademecum to guide its effective use. In this case, it remains unclear what similarity score was applied by the user during testing, and a notable methodological discrepancy emerges: whereas participants were instructed to assess semantic relevance across the top ten retrieved passages, the author’s own analysis was limited to the first three results. This methodological divergence was intentional, aimed at generating differentiated insights that reflect multiple and autonomous modes of interaction with the tool.

Comments and Feedback	Analysis
137 “The tool gave plenty of answers but some of them were too far from the intended sense of the query” 138 “In general, the model suggests close, semantically related contexts” 139 “They were not errors per se, but some inconsistencies in the connection between query and retrieved text”	All of these responses, related to the results obtained from the initial query, highlight a limitation that is still inherent in the tool and will need to be addressed in future work. Beyond the need to train and fine-tune a model specifically for Old Church Slavonic, what appears to be the main reason why the results are not always semantically consistent with the initial query—or only partially so—is the amount of data on which the model is trained. LLM typically rely on substantially larger corpora; therefore, expanding and improving the dataset will be a necessary prerequisite for achieving greater accuracy in the results.
“When searching the text from manuscript corpora, the query failed entirely”	The issue identified in this case—namely, that when searching texts from manuscript corpora the query failed entirely—highlights a limitation that is related to, yet distinct from, those discussed previously. While data scarcity is certainly a relevant factor, this problem also points to challenges associated with OCR and HTR processes, as noted in earlier sections. In addition, it raises the issue of language standardization during OCR processing, which does not allow for the preservation of linguistic features that vary according to historical period and geographical origin. Future work will aim to develop a dataset comprising texts derived from OCR and HTR processes applied to manuscript sources.
“The blocks of text are too vague to make any fine-tuned assessment”	This issue is identified in relation to the ability to detect errors or inconsistencies in the results suggested by the model. To address this problem, efforts have been made to improve the clarity of the interface, thereby making it easier for users to access a larger portion of text through the “more details” label. In this context, the sentences identified by the model as semantically related are highlighted in yellow, and users can gain a clearer understanding of the surrounding context as well as access the full text by following the corresponding link provided within the “more details” window.
140 “Very well” 141 Not very well	The issues previously discussed are clearly reflected in these two assessments, both concerning the model’s ability to capture the contextual meaning of sentences in (Old) Church Slavonic. This interpretation is further corroborated by the following response, which, like the other two, refers to the aforementioned question: “I honestly think that there's a long way to go to add corpora (of course this takes a lot of work and time), but the premises are good.”
142 “This model, to my knowledge, is the first to find semantically related passages in OCS” 143 “I think it can do it the future, with further work. Then, it would definitely do it. It would be super useful for research” 144 “I am not sure, because I have no prior experience with this type of tool”	145 The proposed functionalities are considered innovative despite the limitations highlighted both in the previous comments and throughout this paper. These findings indicate that the perception of innovativeness is particularly strong among users familiar with Slavic studies and computational tools, whereas it remains less defined in the absence of comparative experience. 146 From the perspective of research advancement in the field of NLP and (Old) Church Slavonic, the results can be regarded as positive, according to the opinions collected. It is hoped that future work will address the identified issues, both in relation to the functioning of the tool and to the application of NLP techniques to Old Church Slavonic texts.
“A brief vademecum should be added to show how queries must be formulated”	This highlights a concrete need for user orientation: the formulation of queries, in terms of morphosyntactic structure, lemma selection, and management of textual variants, has a direct impact on the quality of the results. The presence of a vademecum, even a basic one, could significantly enhance the user experience by making interaction with the system more accessible, effective, and replicable. It is also thanks to suggestions such as these that a user manual has been included in the Support section of HORTUS, the website where the results of all the project’s work packages are published.

Table 6.

Figure 3.

Survey results regarding the model’s ability to identify sentences semantically related to the initial query

Figure 4.

Survey results on the need to support the release of the tool with specific user training

Figure 5.

DaMSym User Manual on the platform

6. Conclusions

147

The development of a semantic tool for Old Church Slavonic (OCS) and Church Slavonic texts, presented in this study, took place within a research context marked by significant methodological and technical challenges. The main difficulties concern the a) scarcity of digitalized linguistic resources available in plain text; b) the fragmentation of existing resources across multiple online repositories; c) the fact that texts in these repositories are not always encoded using Unicode-compliant characters; d) the absence of shared normalization standards, and the intrinsic complexity of the Slavonic manuscript tradition, often characterized by fragmentary documentation and insufficient metadata. Moreover, the particularities of the scripts, orthographic features, and diacritical signs, though difficult to preserve in standard NLP tools, are essential for conducting studies on historical, geographical, and linguistic variation. Such variation is foundational not only for Slavic philological research but also for broader investigations into the history of these regions and the cultural influences and contaminations that shaped them.

148

One of the most important takeaways from this overview is that the issue of text normalization remains unsolved. There is an urgent need to develop analytical methodologies capable of processing non-normalized texts. Such approaches are crucial in preserving the orthographic and stylistic idiosyncrasies of the language, which are essential not only for diachronic linguistic research, but also for rigorous philological analysis that can engage meaningfully with the theological, cultural, and semantic dimensions of the texts.

149

In addition, the literature review clearly shows that, although transformer-based language models have achieved strong performance in modern contexts and, to some extent, in high-resource historical languages, they remain insufficiently adapted to low-resources corpora marked by high diachronic, orthographic, and dialectal variation, such as Old Church Slavonic and Church Slavonic. The solutions proposed thus far have predominantly relied on hybrid (rule-based and statistical) approaches, yielding promising results in part-of-speech tagging and morphological analysis, yet still falling short in deep semantic modeling and diachronic variation. As noted, current research efforts in the field have primarily focused on two key objectives: a) dating manuscripts and determining their geographic provenance, and b) identifying biblical citations and allusions within OCS texts.

150

In this context, the tool introduced in our study represents an original and methodologically grounded contribution to computational research on (Old) Church Slavonic texts. The platform enables semantic queries that go beyond literal matching, identifying conceptually related expressions even when they exhibit morphosyntactic variation or reformulation. The research has focused on semantic similarity analysis using the Nicene-Constantinopolitan Creed as a case study. Due to its brevity and doctrinal centrality, the Creed functions as a core liturgical formula with high doctrinal content, providing an ideal reference point for analyzing both its source culture and patterns of reception across regions and centuries.

151

The results, particularly in comparing different versions of the Nicene-Constantinopolitan Creed, demonstrate the tool’s capacity to support lexical change analysis over time, offering results that are both linguistically and semantically significant.

152

The adoption of a semantic-oriented embedding model (Jina Embeddings V3) has proven to be a sound choice for the liturgical-Slavonic domain, where vocabulary is highly connotative and theologically dense. The system has shown its ability to return relevant outputs even in the absence of exact textual matches, such as in the replacement of “нѣсть несть кѡнца конца” with “не бꙋдетъ кѡнца не будет конца,” or the disappearance of the adjective “истиннагѡистиннаго” (“true”) referring to the Holy Spirit in pre-Nikonian reform texts.

153

Structured testing conducted with experts in Slavic studies and digital humanities confirmed the tool’s usefulness, while also highlighting areas for improvement, particularly regarding the conceptual coherence of retrieved results and the system’s interpretability when confronted with non-standard query formulations. These observations underscore the necessity of supplementing the platform with a methodological guide to assist users in query construction and in critically interpreting the system’s outputs.

154

Overall, the tool emerges as an innovative resource for semantic inquiry into (Old) Church Slavonic texts, capable of offering new perspectives in theological, historical, and linguistic research. It not only facilitates information retrieval but also serves as a conceptual exploration instrument, enabling the tracing of lexical and semantic variants across time. This approach lays the groundwork for deeper integration between traditional philology and semantic technologies in the study of premodern religious texts.

155

The development of our tool has confirmed the need to address the limitations outlined above, particularly the ongoing data scarcity, which remains a major bottleneck. It has also highlighted the importance of improving OCR and HTR methodologies to ensure the acquisition of reliable digital resources. Although, as noted in Pedrazzini’s 2020 study, computational methods for early Slavic texts are making steady progress, particularly in relation to the digitization of primary sources[35] (Rabus and Petrov, 2023; Lendvai et al., 2024), the use of HTR and OCR techniques for the digitization of Old Church Slavonic and Church Slavonic texts remains an unresolved challenge. The contributions of these scholars represent a critical foundation for future work, but additional resources are still required to optimize model performance in Transkribus and to advance alternative frameworks such as eScriptorium. Improvements in this area will be essential for increasing the availability of high-quality textual data, thereby enabling the eventual training of LLMs for historical Slavic languages.

156

A further significant challenge was the widespread use of PUA (Private Use Area) characters. These characters frequently manifested in the dataset as what is colloquially referred to as “tofu”, placeholder glyphs (e.g., boxes or question marks) that indicate missing font support.

157

Accordingly, future development phases of our tool will also require the integration of extended PUA-Unicode mappings to ensure consistent and accurate frontend rendering. Although this issue does not affect the tool’s semantic retrieval capabilities, it can nonetheless lead to confusion for users and compromise the overall user experience.

Acknowledgments

158

This paper is the outcome of a collaborative work developed within WP4 of the ITSERR project, entitled DaMSym - Data Mining: the Nicene-Constantinopolitan Symbolum, and focused on the specific case study of Old Church Slavonic and Church Slavonic. In particular, sections 1 and 6 were jointly authored by both contributors; sections 2, 3, and 5 were written by Marianna Napolitano; and section 4 was authored by Giovanni Puccetti.

159

This work was supported by the PNRR (National Recovery and Resilience Plan) project Italian Strengthening of ESFRI RI Resilience (ITSERR) founded by the European Union—NextGenerationEU (CUP:B53C22001770006).

Notes

[1] https://www.itserr.it/itserr-its-work-packages/wp4-damsym/; PRIN 2020 The Nicene-Constantinopolitan Creed and its Translations: First Exploration and Methodological Test of a Transdisciplinary Research on the Councils’ Symbol in History, Culture and Society (4th–20th Century).

[2] See: [Napolitano 2025], “La riforma liturgica del patriarca Nikon (1653–1656): la riformulazione del Credo in russo”, in C. Bianchi/A. Melloni/M. Proietti (ed.), Il concilio e il Credo, 325–2025. Storia e trasmissione dei simboli di Nicea e di Costantinopoli (Bologna: EDB) 309–323.

[3] [Scherrer, Rabus, and Mocken n.d.]; [Pedrazzini 2020]; [Lendvai et al. 2025].

[4] [De Marneffe et al. 2021]

[5] [Qi et al. 2020]

[6] (http://bases.ruslang.ru/).

[7] [Meyer 2011].

[8] https://dev.syntacticus.org/proiel.html.

[9] The Canonical Text Services protocol defines a standard for text localization. These are strings taking the form of Uniform Resource Names (URNs), which include a document identifier in the first part and a reference to a specific passage in the second — typically expressed as a range. The URN follows the structure urn:cts:NAMESPACE:WORK:PASSAGE, where the PASSAGE component may include start and end references, separated by a dash (“-”), and optionally finer-grained subdivisions indicated with the at-sign (“@”) [Ruskov et al. 2025, 3].

[10] [Berti et al. 2016].

[11] [Ruskov et al. 2025].

[12] [Rabus and Petrov 2023].

[13] [Brants 2000]

[14] müller2013

[15] The issue of orthographic normalization is also addressed here, with reference to the Library of Sacred Literature- Orthodox Gymnasium in the name of St. Sergius of Radonezh (http://www.orthlib.ru/authors.html), an open-access electronic library dedicated to patristic Christian literature, which includes texts in Old Church Slavonic.

[16] [Brants 2000]

[17] [Scherrer, Rabus, and Mocken n.d.]

[18] A text classifier for four linguistic classes has also been recently developed within the framework of the project carried out by WP4 - ITSERR. A reference can be found in Cassese et al., 2025 (Slavic NLP 2025 and forthcoming in the proceedings of the 10th edition of the SlavicNLP workshopACL Anthology).

[19] [Papineni 2001]

[20] [Lendvai et al. 2025]

[21] [Talev 1973]; [Lendvai et al. 2023]

[22] [Vaswani et al. 2017]; [Devlin 2019]

[23] [Napolitano 2025], “From Characters to Meaning: Challenges and Limitations in the Semantic Analysis of Church Slavonic Texts”, in A. Melloni/F. Cadeddu (eds.), The Digital Turn in Religious Studies. Research, Services, Infrastructures (Göttingen: V&R), forthcoming.

[24] [http://urn.fi/urn:nbn:fi:lb-2021041521.

[25] https://clarino.uib.no/comedi/editor/lb-20140730106.

[26] [Birnbaum 1996]

[27] https://azbyka.ru/

[28] A further example of the challenges encountered in the precise definition of metadata is provided by the texts available on the obdurodon.org site, a platform for the development of digital projects curated by David J. Birnbaum. This resource includes normalized texts encoded in XML/TEI (http://bdinski.obdurodon.org/). In this case, the issue is not one of reliability, but rather of difficulty in accessing the relevant metadata. Future contact with the project’s curator is planned, with the aim of including these materials in our dataset. These transcriptions are particularly valuable in that they allow for direct comparison with the manuscript images, which are made available on the same page as the transcription.

[29] See https://histdict.uni-sofia.bg/

[30] [Andreev 2015]

[31] https://www.unicode.org/charts/PDF/U0400.pdf

[32] https://www.ifao.egnet.net/publications/outils/polices/

[33] A partial list of this mapping will be published in the forthcoming essay [Napolitano 2025], “From Characters to Meaning: Challenges and Limitations in the Semantic Analysis of Church Slavonic Texts”, in A. Melloni/F. Cadeddu (eds.), The Digital Turn in Religious Studies. Research, Services, Infrastructures (Göttingen: V&R).

[34] This specific PUA character mapping work is carried out by Marianna Napolitano in collaboration with Usman Nawaz (University of Palermo).

[35] (Rabus and Petrov, 2023; Lendvai et al., 2024).

Works Cited

Aksakov 2011 Aksakov, K.S. (2011) Opyt russkoi grammatiki: Sravnitelnyi vzgliad: Indoevropeiskie i slavianskie iazyki (reprintnoe izd.; Moskva: URSS/Librokom), 303 s. (Lingvisticheskoe nasledieXIX veka).

Andreev 2015 Andreev, A., Shardt. Y., and Simmons, N. (2015) “Church Slavonic typography in unicode”, Unicode Technical Note pp. 41, 1–97.

Artetxe and Schwenk 2019 Artetxe, M. and Schwenk. H. (2019) “Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond”, Transactions of the Association for Computational Linguistics 7, pp. 597–610.

Baranov et al. 2003 Baranov, V., Birnbaum, D., Cleminson, R., Rabus, A. and Miklas, H. (2003) ‘Proposal for a Unified Encoding of Early Cyrillic Glyphs in the Unicode Private Use Area’, E-Scripta 3, pp. 25–52.

Baranov et al. 2007 Baranov, V., Mironov, A., Lapin, A., Mel’nikova, I., Sokolova, A., and Korepanova, E., (2007) “Avtomaticheskii morfologicheskii analizator drevnerusskogo iazyka: lingvisticheskie i tekhnologicheskie resheniia”, in Proceedings of the 10th Anniversary International Conference “EVA 2007 Moskva”, Moscow, 2007) pp. 61–68.

Baranov et al. 2010 Baranov, V., Birnbaum, D. J., Cleminson, R.,  Miklas, H., and  Rabus, A., (2010) “Proposal for a unified encoding of Early Cyrillic glyphs in the Unicode Private Use Area”, Scripta & e‑Scripta 8–9, pp. 9–26.

Berdichevskii, Eckhoff, and Gavrilova 2016 Berdichevskii , A., Eckhoff, H. M., and Gavrilova, T. (2016) “The beginning of a beautiful friendship: Rule-based and statistical analysis of Old Russian”, in Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference Dialogue 2016 (Moscow: RSUH), pp. 9–16.

Berti et al. 2016 Berti, M., Blackwell, C., Daniels, M., Strickland, S., and Vincent-Dobbins, K., (2016) “Documenting Homeric text-reuse in the deipnosophistae of Athenaeus of Naucratis”, Bulletin of the Institute of Classical Studies 59, pp. 121–139.

Biagetti, Zanchi, and Luraghi 2021a Biagetti, E., Zanchi, C., and Luraghi, S. (eds.). (2021) “Building new resources for historical linguistics (Prima edizione)”. Pavia University Press.

Biagetti, Zanchi, and Luraghi 2021b Biagetti, E., Zanchi, C., and Luraghi, S. (2021) “Introduction: The value of digital resources for historical linguistics”, in Biagetti, E., Zanchi, C., and Luraghi, S. (eds) Building New Resources for Historical Linguistics. Pavia University Press, pp. 1–20.

Birnbaum 1996 Birnbaum, D. J. (1996) “Standardizing characters, glyphs, and SGML entities for encoding early Cyrillic writing”, Computer Standards and Interfaces 18, pp. 201–252.

Brants 2000 Brants, T. (2000) “TnT – A statistical part-of-speech tagger”, in Proceedings of the Sixth Applied Natural Language Processing Conference. Association for Computational Linguistics, pp. 224–231.

Bulgakov 2007 Bulgakov, S.N. (2007) Filosofiia imeni, https://azbyka.ru/otechnik/Sergij_Bulgakov/filosofija-imeni/. Aaccessed 19 July 2025.

De Marneffe et al. 2021 De Marneffe, M.-C., Manning, C.D., Nivre, J., and Zeman, D. (2021) “Universal Dependencies”, Computational Linguistics 47, pp. 1–54.

Deshpande et al. 2023 Deshpande, A., Jimenez, C., Chen, H., Murahari, V., Graf, V., Rajpurohit, T., Kalyan, A., Chen, D., and Narasimhan, K. (2023) “C-STS: Conditional semantic textual similarity”, in Bouamor, H., Pino, J., and Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pp. 5669–5690.

Devlin 2019 Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019) “BERT: Pre-training of deep bidirectional transformers for language understanding”, in Burstein, J., Doran, C., and Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 1: Long and Short Papers.Association for Computational Linguistics, pp. 4171–4186.

Dorkin and Sirts 2024 Dorkin, A. and Sirts, K. (2024) “TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for ancient and historical languages”, in Hahn, M., Sorokin, A., Kumar, R., Shcherbakov, A., Otmakhova, Y., Yang, J., Serikov, O., Rani, P., Ponti, E.M., Muradoğlu, S., Gao, R., Cotterell, R., and Vylomova, E. (eds.) Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP. Association for Computational Linguistics, pp. 120–130.

Eckhoff and Haug 2021 Eckhoff, H.M. and Haug, D.T.T. (2021) “Annotation schemes, tools and data in the PROIEL Treebank family”, in Biagetti, E., Zanchi, C., and Luraghi, S. (eds.) Building New Resources for Historical Linguistics. Pavia University Press, pp. 21–30.

Eckhoff et al. 2018 Eckhoff, H.M., Bech, K., Bouma, G., Eide, K., Haug, D., Haugen, O.E., and Jøhndal, M. (2018) “The PROIEL treebank family: a standard for early attestations of Indo-European languages”, Language Resources and Evaluation 52 (1), pp. 29–65.

Feliksov 2018 Feliksov, S.V. (2018) “Semantic aspect of the linguistic description of lexicon of orthodox dogma in church Slavonic language”, RUDN Journal of Language Studies, Semiotics and Semantics 9(2), pp. 335–350.

Fernández-González and Gómez-Rodríguez 2019 Fernández-González, D. and Gómez-Rodríguez, C. (2019) “Faster shift-reduce constituent parsing with a non-binary, bottom-up strategy”, Artificial Intelligence 275, pp. 559–574.

Ferro, Salmon, and Ziffer 2018 Ferro, M.C., Salmon, L., and Ziffer, G. (eds.) (2018) Contributi italiani al XVI Congresso Internazionale degli Slavisti: Belgrado 20–27 agosto 2018, 1st ed.; 40. Firenze University Press.

Florenskii 1990 Florenskii, P. A. (1990) U vodorazdelov mysli, https://imwerden.de/publ-1677. Accessed 19 July 2025.

Garzaniti 2019 Garzaniti, M. (2019) Gli Slavi: storia, culture e lingue dalle origini ai nostri giorni, 2nd ed., 207 of Manuali universitari. Lingue e letterature straniere (Rome: Carocci).

Janda and Exkhoff 2016 Janda, L. and Eckhoff, H. M. (2016) “Grammatical profiles and aspect in old church Slavonic”, Transactions of the Philological Society 114, pp. 233–255.

Kamchatnov 1998 Kamchatnov, A.M. (1998) Istoriia i germenevtika slavianskoi Biblii. Nauka.

Kamchatnov and Nikolina 2008 Kamchatnov, A.M. and Nikolina, N.A. (2008) Vvdenie v jazykoznanie: učebnoe posobie, 7th ed. Flinta.

Kempgen 2008 Kempgen, K. (2008) “Unicode 5.1, old church Slavonic, remaining problems – and solutions, including OpenType features”, in Slovo: Towards a digital library of south Slavic manuscripts, Proceedings of the International Conference, 21–26 February 2008, Sofia, Bulgaria (Sofia), pp. 200–219.

Kostromin 2012 Kostromin, K. (2012) “Istoriia slavianskogo teksta Simvola very cherez prizmu istorii raskola XVII v”, Khristianskoe chtenie 3, pp. 32-65.

Lendvai et al. 2023 Lendvai, P., Reichel, U., Jouravel, A., Rabus, A., and Renje, E. (2023) “Domain-Adapting BERT for attributing manuscript, century and region in pre-modern Slavic texts”, in Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change. Association for Computational Linguistics, pp. 15–21.

Lendvai et al. 2024 Lendvai, P., Van Gompel, M., Jouravel, A., Renje, E., Reichel, U., Rabus, A., and Arnold, E. (2024) “ A workflow for HTR-Postprocessing, labeling and classifying diachronic and regional variation in pre-modern Slavic texts”, in Calzolari, N., Kan, M.-Y., Hoste, V., Lenci, A., Sakti, S., and Xue, N. (eds.) Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, pp. 2039–2048.

Lendvai et al. 2025 Lendvai, P., Reichel, U., Jouravel, A., Rabus, A., and Renje, E. (2025) “Retrieval of parallelizable texts across church Slavic variants”, in Scherrer, Y., Jauhiainen, T., Ljubešić, N., Nakov, P., Tiedemann, J., and Zampieri, M. (eds.) Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects. Association for Computational Linguistics, pp. 105–114.

Lewis et al. 2020 Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020) “Retrieval‑augmented generation for knowledge‑intensive NLP tasks”, Advances in Neural Information Processing Systems 33, pp. 9459–9474.

Losev 1993 Losev, A.F. (1993) Bytie — imia — kosmos, https://imwerden.de/publ-14599. Accessed 19 July 2025.

Meyendorff1991 Meyendorff, P. (1991) Russia, ritual, and reform: The liturgical reforms of Nikon in the 17th century. St Vladimir’s Seminary Press.

Meyer 2011 Meyer, R. (2011) “New wine in old wineskins? — Tagging Old Russian via annotation projection from modern translations”, Russian Linguistics 35 (2), pp. 267–281.

Napolitano 2025 Napolitano, M. (2025) “La riforma liturgica del patriarca Nikon (1653–1656): la riformulazione del Credo in russo”, in C. Bianchi/A. Melloni/M. Proietti (ed.) Il concilio e il Credo, 325–2025. Storia e trasmissione dei simboli di Nicea e di Costantinopoli. EDB, pp. 309–323.

Papineni 2001 Papineni, K. (2001) ‘Why Inverse Document Frequency?’, Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), pp. 25–32.

Pedrazzini 2020 Pedrazzini, N. (2020) “Exploiting cross-dialectal gold syntax for low-resource historical languages: Towards a generic parser for pre-modern Slavic”, arXiv preprint. arXiv:2011.06467.

Poliakov 2014 Poliakov A. (2014) Korpus tserkovnoslavianskikh tekstov: problemy orfografii i grammatiki. Przegląd wschodnioeuropejski. 1, 245–254. Available at: https://ruslang.ru/doc/church-slav/conf4/05-polyakov.pdf. Accessed 29 April 2026.

Postovalova 2022 Postovalova, V.I (2022) Nauka o jazyke v svete ideala cel’nogo znanija: V poiskach integral’nych paradigm. LENAND.

Qi et al. 2020 Qi, P., Zhang, Y., Zhang, Y., Bolton, J., and Manning, C.D. (2020) “Stanza: A Python natural language processing toolkit for many human languages”, in Celikyilmaz, A. and Wen, T.-H. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Association for Computational Linguistics, pp. 101–108.

Rabus 2019 Rabus, A. (2019) “Recognizing handwritten text in Slavic manuscripts: A neural-network approach using Transkribus”, unpublished paper. Available at: https://www.academia.edu/38835297. Accessed 20 July 2025.

Rabus and Petrov 2023 Rabus, A. and Petrov, I. N. (2023) “Linguistic analysis of church Slavonic documents: A mixed-methods approach”, Scando-Slavica 69(1), pp. 25–38.

Reimers and Gurevych 2019 Reimers, N. and Gurevych, I. (2019) “Sentence-BERT: Sentence embeddings using Siamese BERT-Networks”, in Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, pp. 3982–3992.

Ruskov et al. 2025 Ruskov, T., Mikulka, T., Podtergera, I., Gavrilkov, M., and Thompson, W. (2025) “Quotes at the fingertips: The BogoSlov Project’s combined approach”, in Proceedings of the 21st Conference on Information and Research Science Connecting to Digital and Library Science (IRCDL 2025), pp. 393–407.

Scherrer, Rabus, and Mocken n.d. Scherrer, Y., Rabus, A., and Mocken, S. (n.d.) “New developments in tagging premodern orthodox Slavic texts”, Scripta & e-Scripta 18, pp. 9–33.

Shakhov 1997 Shakhov, M.O. (1997) Filosofskie aspekty staroveriia. Tretii Rim, pp. 45–47.

Sidash 2013 Sidash, T. (2013) Skrizhal: Akty soborov 1654, 1655, 1656 godov. Svoe izdatelstvo.

Sturua et al. 2024 Sturua, S., Mohr, I., Akram, M.K., Günther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., Wang, N., and Xiao, H. (2024) “jina-embeddings-v3: Multilingual Embeddings With Task LoRA”, arXiv preprint. arXiv: 2409.10173.

Talev 1973 Talev, I. (1973) Some problems of the second south Slavic influence in Russia. Available at: http://www.promacedonia.org/en/itsb/index.htm. Accessed 19 July 2025.

Tomelleri 2022 Tomelleri, V.S. (2022) “When church Slavonic meets Latin: Tradition vs. innovation”, in Diachronic Slavonic Syntax. De Gruyter Mouton, pp. 201–232.

Totomanova 2012 Totomanova, A.M. (2012) “Digital presentation of Bulgarian lexical heritage: Towards an electronic historical dictionary”, unpublished paper. Available at: http://dspace.uni.lodz.pl:8080/xmlui/handle/11089/4274. Accessed 20 July 2025.

Vaswani et al. 2017 Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017) “Attention is all you need”, in Advances in Neural Information Processing Systems 30, pp. 5998–6008.

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.

URL: https://dhq.digitalhumanities.org/editorial/000866/000866.html
DOI: pending
Comments:
Published by: and
Affiliated with: Digital Scholarship in the Humanities
DHQ has been made possible in part by the National Endowment for the Humanities.
© 2026 the authors

Unless otherwise noted, the DHQ web site and all DHQ published content are published under a Creative Commons Attribution-NoDerivatives 4.0 International License. Individual articles may carry a more permissive license, as described in the footer for the individual article, and in the article’s metadata.