Diachronic Spanish Sonnet Corpus. Canonical and minor authors in Spanish (Europe, America and Asia): 15th to 20th century
DISCO: Diachronic Spanish Sonnet Corpus
The Diachronic Spanish Sonnet Corpus (DISCO) contains sonnets in Spanish in XML-TEI, between the 15th and the 20th centuries (4530 sonnets by 1216 authors from 22 different countries). It includes well-known authors, but also less canonized ones. Texts and authors are enriched with identifiers and metadata.
The current version of the corpus is v5.0, published in February 2023. Work about the corpus presented or submitted before then refers to earlier releases; see the release history. <br><br>
- Prior Collections of Texts in Spanish
- Description of DISCO
- Versions
- Publication and Preservation
- Licence and Citation
- Resources reusing DISCO
- Future Steps
- References <br><br>
Prior Collections of Texts in Spanish
A fundamental difficulty for Digital Humanities studies on Spanish literature is a scarcity of digital resources (Agenjo, 2015).
Some resources do exist, however. BiDTEA (Gago Jover et al., 2015), ADMYTE (Marcos Marín and Faulhaber, 1992) and ReMetCa (González-Blanco and Rodríguez, 2014) have digitized Spanish Medieval texts. Navarro-Colorado et al. (2015) presented the Corpus of Spanish Golden-Age Sonnets.
For later periods, available collections covering different genres are Textbox (Schöch et al., 2015), BETTE (Santa María Fernández, 2017), Aracne (Álvarez Mellado and Martín-Fuertes, 2015), and Revistas Culturales 2.0 (Ehrlicher and Rißler-Pipka, 2015). Nevertheless, none of these projects focuses on poetry.
Among available sonnet corpora, Sonnet-Archiv (Elf Edition) is organized as a forum, and its coverage is narrower than ours. The “Sonnet Library” (Biblioteca Virtual Miguel de Cervantes, 2007) is organized alphabetically rather than by criteria meaningful for literary scholarship, such as periods. Both are traditional websites. Finally, the Corpus of Spanish Golden Age Sonnets (Navarro-Colorado et al., 2015) covers major authors from the 15th to the 17th century, with an automatic metrical annotation. Author metadata in these corpora are very limited and unavailable in a machine-readable format (see Calvo Tello, 2017, for discussion of related issues).
DISCO complements this growing ecosystem by adding a meaningful representation of sonnets from the 15th to the 20th centuries.
Description of DISCO
Our corpus currently offers a total of 4530 sonnets in Spanish: 202 from the early-mid 20th century, 2919 from the 19th century, 321 from the 18th century and 1088 from the so-called Spanish Golden Age (15th to 17th centuries). There are a total of 1216 authors (from Spain, Latin America and the Philippines). The corpus aims to provide a wide sample, inspired by distant reading approaches (Moretti, 2005). The raw texts were in most cases extracted from Biblioteca Virtual Miguel de Cervantes (1999), with some 18th-century texts coming from Wikisource. More specific sources were also consulted for some authors (the TEI headers always indicate the source). A table in section Data Distribution below summarizes these data.
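As a quick consistency check on the figures above, the following minimal Python sketch sums the per-period counts and derives each period's share of the corpus. The period labels and the function name `period_shares` are illustrative choices, not names used in the corpus files.

```python
# Sonnet counts per period, as stated in the paragraph above.
# The dictionary keys are illustrative labels, not corpus identifiers.
SONNETS_PER_PERIOD = {
    "Golden Age (15th-17th c.)": 1088,
    "18th century": 321,
    "19th century": 2919,
    "Early-mid 20th century": 202,
}


def period_shares(counts):
    """Return each period's share of the corpus as a percentage."""
    total = sum(counts.values())
    return {period: 100 * n / total for period, n in counts.items()}


total = sum(SONNETS_PER_PERIOD.values())
shares = period_shares(SONNETS_PER_PERIOD)

print(f"Total sonnets: {total}")
for period, share in shares.items():
    print(f"{period}: {share:.1f}%")
```

The four subcorpora add up to the 4530 sonnets reported for the whole corpus, with the 19th century accounting for the largest share.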
The corpus is available in plain-text and in TEI formats; XML-TEI P5 was used given this standard’s benefits in terms of reuse, storage, and retrieval. Author metadata (year and place of birth and death, and gender) were extracted or inferred from unstructured content in the sources, and placed in the TEI header, or in a metadata table in the case of the plain-text version. For both TEI and plain-text formats, two versions of the texts are available: one collecting every sonnet per author, the other encoding a single sonnet per file. For corpus preparation, we closely followed the TEI guidelines and RIDE’s criteria for Digital Text Collections (Henny-Krahmer and Neuber, 2017).
Additionally, authors have been assigned VIAF identifiers and described using RDFa attributes. This gives the corpus an entry point to the Linked Open Data cloud, enhancing its findability. The corpus is available as a GitHub repository and archived on Zenodo, in line with good practices for data use, reuse, and conservation.
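To illustrate how a single-sonnet TEI file might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes a generic TEI P5 layout: title and author under `titleStmt`, and verse lines as `<l>` elements grouped in `<lg>` stanzas. The sample document, its placeholder text, and the helper names `make_sample` and `read_sonnet` are hypothetical; the actual DISCO files may organize their headers and line groups differently.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def make_sample():
    """Build a hypothetical TEI sonnet (4+4+3+3 lines) with placeholder text."""
    groups, n = [], 0
    for size in (4, 4, 3, 3):
        lines = []
        for _ in range(size):
            n += 1
            lines.append(f'<l n="{n}">placeholder line {n}</l>')
        groups.append("<lg>" + "".join(lines) + "</lg>")
    return (
        f'<TEI xmlns="{TEI_URI}">'
        "<teiHeader><fileDesc><titleStmt>"
        "<title>Placeholder sonnet</title>"
        "<author>Placeholder Author</author>"
        "</titleStmt></fileDesc></teiHeader>"
        '<text><body><lg type="sonnet">' + "".join(groups) + "</lg></body></text>"
        "</TEI>"
    )


def read_sonnet(xml_text):
    """Return title, author and stanzas (lists of verse lines) from a TEI string."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", namespaces=TEI_NS)
    stanzas = []
    for lg in root.iter(f"{{{TEI_URI}}}lg"):
        # Keep only line groups that directly contain verse lines.
        lines = [line.text or "" for line in lg.findall("tei:l", TEI_NS)]
        if lines:
            stanzas.append(lines)
    return {"title": title, "author": author, "stanzas": stanzas}


sonnet = read_sonnet(make_sample())
print(sonnet["author"], "-", sonnet["title"])
print([len(stanza) for stanza in sonnet["stanzas"]])
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`; the element paths would need to be adjusted to the structure actually found in the corpus files.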
Why Sonnets?
The sonnet has had great importance in European poetry, which makes the corpus relevant for literary scholarship. It is a "manageable" form to treat computationally, since it obeys clear restrictions. Variability stays within bounds, which makes meaningful comparison across poems easier, for instance as regards scansion or rhyme types. Besides, some digital collections of sonnets already exist (with different features from the one presented here, as discussed above), as do automatic analyses of this form. The sonnet has received attention from the computational linguistics community (Navarro Colorado et al., 2015, 2016, 2017; Agirrezabal, 2017), including the ADSO project (Navarro Colorado, 2017). The DISCO corpus will also be useful for that audience. For these reasons, a new sonnet corpus allows us to engage in a dialogue with earlier work in traditional literary studies, in digital corpus development, and in computational poetry analyses.
Data Distribution
We describe the sources and data distribution for each subcorpus in reverse chronological order, starting with the 20th century. A table below summarizes the information.
The 20th-century subcorpus consists of sonnets in Spanish written by Filipino authors. This choice was made given our involvement in a project on Philippine literature in Spanish. It also responds to the corpus goal to cover a breadth of authors, including lesser-known ones. All sonnets found in Biblioteca Virtual Miguel de Cervantes poetry collections by Filipino authors were included in the corpus; the source volumes are specified in the corpus files. It contains 202 sonnets by 9 authors (2 female, 7 male).
The 19th-century subcorpus contains 2919 sonnets, written by 688 authors. The main source for this subcorpus is the set of texts in the digital library Biblioteca Virtual Miguel de Cervantes prepared by Ramón García González in 2006. The same library is the source for sonnets by Filipino authors, and for poems by Peruvian modernista author José Santos Chocano (Poesías Completas volumes 1 and 2, besides the 19th-century anthology by García González just cited). We also included all sonnets by another major modernista poet, Nicaraguan author Rubén Darío; sources are specified in the TEI headers. Approximately half of the texts were written by Spanish authors, and half by Latin American authors, with Cuba as the best-represented country, followed at a large distance by Mexico, Argentina, Colombia and Puerto Rico. Some authors were born in non-Spanish-speaking countries, such as Portugal, Brazil or Haiti. Two Filipino authors are represented. More than 90% of the authors are male.
Note that the 19th-century subcorpus includes about 125 sonnets by 23 authors whose production took place mainly in the early 20th century (with date of death prior to 1936). We kept these authors as they were part of the 19th-century anthology mentioned above, which is our main source for this subcorpus. We will consider creating a separate subcorpus for early 20th-century literature if we collect material for that period more systematically.
The 18th-century subcorpus is based on texts from Biblioteca Virtual Miguel de Cervantes, prepared by Ramón García González in 2005. In addition, some texts come from Wikisource.
The Golden Age subcorpus (15th-17th centuries) is based on texts from Biblioteca Virtual Miguel de Cervantes prepared by Ramón García González in 2006. For this period, we chose mostly minor authors, thus complementing Navarro Colorado's (2015) Golden Age corpus, which focuses on canonical authors.
Although we deliberately included less canonical writers throughout the corpus, fewer than 10% of the authors are female. An active search will be carried out to counteract this lack of diversity.
TABLE 1: Corpus data distribution per period, author gender and primary continent of literary activity<br> Numbers in parentheses indicate authors who were probably active in Europe.
<table> <tr> <th rowspan="2">Period</th> <th rowspan="2">Nbr of Sonnets</th> <th colspan="5">Nbr of Authors</th> <th rowspan="2">Tokens</th> </tr> <tr> <th colspan="2">Gender</th> <th colspan="2">Provenance</th> <th>T</th> </tr> </table>
