Early results from a birchbark letter proof-of-concept

The frustrating thing about attempting a proof-of-concept for the birchbark letters is that many of the calculations I find most interesting need to be run over the entire corpus-- which means a lot more data preparation than I currently have finished. The sample "names in context" (NIC) XML only has entries for 42 documents, so there's a lot of work yet to be done.

Characterizing the context for different kinds of names

I've written the XSLT to do a number of calculations-- how often are women mentioned in the context of some kind of financial transaction? When women are mentioned in a financial context, are they more likely to be referred to by a derivative of their husband's name than when they are mentioned in other contexts? Are there contexts where certain kinds of names-- "church" names, non-Slavic names, female names-- appear with particular frequency, or not at all? Unfortunately, the output is meaningless until the full NIC XML is ready.
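To make the second question concrete, here is a minimal Python sketch of the tally involved. It is an illustration only, not the XSLT mentioned above, and the record keys (gender, context, husband_derived) are hypothetical names chosen for the example rather than anything in the NIC XML.

```python
from collections import Counter


def derived_name_share(mentions):
    """Share of female mentions that use a husband-derived name,
    split into financial vs. all other contexts.

    Each mention is a dict with hypothetical keys:
    "gender" ("f"/"m"), "context" (e.g. "financial", "personal"),
    and "husband_derived" (bool).
    """
    totals = Counter()
    derived = Counter()
    for m in mentions:
        if m["gender"] != "f":
            continue  # only women's names are relevant to this question
        bucket = "financial" if m["context"] == "financial" else "other"
        totals[bucket] += 1
        derived[bucket] += bool(m["husband_derived"])
    # Only contexts with at least one female mention appear in the result.
    return {bucket: derived[bucket] / totals[bucket] for bucket in totals}
```

Comparing the "financial" and "other" shares only becomes meaningful once the full corpus is encoded, which is the limitation described above.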

Building charts and graphs

I can see writing code to enable the user to generate-- and, more importantly, compare-- charts and graphs like those found on pp. 26-33 of Zaliznjak 2004, but with greater detail. Within a given date range, is there a statistically significant difference in how frequently the jer merger, or cokan'e, occurs when the writer has a non-Slavic name vs. a Slavic name? It didn't seem worth doing for this particular proof-of-concept, though; the only data set I could draw on for this would be the index of all the documents and their dates, and a chart showing how many documents were found in each 50-year time span or what-have-you doesn't seem worth the effort.
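One way the significance question could be approached is with a test on a 2x2 table: writers with Slavic vs. non-Slavic names, and documents with vs. without the feature. The sketch below uses Fisher's exact test, which is my own choice for small counts and not something the project has settled on; the function name and argument layout are assumptions made for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table

                     feature present   feature absent
        group 1             a                 b
        group 2             c                 d

    e.g. group 1 = writers with non-Slavic names, group 2 = Slavic names.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k "present" cases in group 1,
        # with all row and column totals held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
```

A small p-value would suggest the two groups of writers differ in how often the feature appears; with only a few documents per date range, large p-values are to be expected.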

Actual results: names and genders

For some usable-- if not the most exciting-- results, I've turned to the fully-prepared index of names, which does allow for certain kinds of calculations. Keep in mind, all of these below refer to unique names, not instances of use ("Khrestina" appears 5 times, but is counted once for these calculations; I don't have the NIC data ready yet to account for multiple instances.)

  • 10.7% of the names with an identified gender are female (87 names).
  • 35.6% of the women's names that are used are derived from a male name, presumably their husband's (31 names). There's an opportunity here to go fishing for the equivalent male names from the same time period and location, but my initial guess at reconstructing the male names suggests that there's no attestation-- contemporary with the woman's name or not-- of the male name in about 25% of the cases.
  • 12% of all names have no identified gender (110 names). One of my goals is to assign probable genders to as many of these as possible, after establishing the contexts where men and women are more likely to appear.

Actual results: Boris

When deciding how to build the sample NIC data set, I chose to include all documents that reference the name Boris. I wanted to see how I could organize the available data to facilitate the task of determining how many Borises are represented in the birchbark letter corpus, and my initial pass at 12th century documents surfaced 5 documents mentioning "Boris".

What information is useful to help differentiate one Boris from another? I've listed each document where the name occurs, and sorted them using the earliest conditional date proposed. I've included the location where the document was found, but to put that information into the right context, it's important to know Boris's role in the document. Particularly where he's a sender or 3rd party, documents referring to the same Boris could easily be found in different locations. The value of the additional information concerning Boris's role varies; one could argue that whether he owes money or not is fairly inconsequential, but information indicating that the reference is to Saint Boris, or that the Boris in question has recently died, is crucial to the task at hand.

  • Борисъ (m)
    • 906
      • 1075 - 1100
      • 3p: Троицк, Е
      • personal: religious
    • 742
      • 1100 - 1120
      • to: Троицк, К
      • financial: owes
      • orders: receiving
    • 237
      • 1160 - 1180
      • from: Нерев, И
      • personal: complaint
    • 806
      • 1160 - 1180
      • 3p: Троицк, Е
    • 581
      • 1180 - 1200
      • to: Троицк, Е
      • financial: gen
    • 671
      • 1180 - 1200
      • 3p: Троицк, Г
      • financial: gen
    • 819
      • 1180 - 1200
      • to: Троицк, Е
      • financial: owed
      • orders: receiving
    • 935
      • 1180 - 1200
      • 3p: Троицк, Е
      • financial: gen
    • 714
      • 1200 - 1220
      • from: Троицк, К
      • personal: gen
    • 343
      • Борисъ Милославовъ
      • 1280 - 1300
      • 3p: Нерев, Д
    • 263
      • Борисъ Пѧнтелѣѥвъ
      • 1360 - 1380
      • 3p: Нерев, Е
      • financial: owes
    • 579
      • 1360 - 1380
      • from: Нутн
      • orders: giving
    • 701
      • Борисъ Петаревъ
      • 1360 - 1380
      • from: Троицк, П
      • orders: giving
    • 744
      • 1360 - 1380
      • 3p: Федоровск
    • 43
      • 1380 - 1400
      • from: Нерев, А
      • orders: giving
    • 49
      • 1410 - 1420
      • 3p: Нерев, Г
      • personal: death

So how many Borises are there? Probably somewhere between 10 and 12.

  • Saint Boris - BBL 906
  • Boris of the early 12th century - BBL 742; 40-60 years between this Boris reference and the next one make it unlikely that it's the same Boris
  • Boris(es) of the late 12th century - BBL 237, 806, 581, 671, 819, 935 all range from 1160-1200. Two letters to a Boris (581, 819) and two where Boris was a third party (806, 935) were found in Troick. E, which may be a reason to connect those with a single individual. Apart from BBL 237, from Nerev., all the documents from this time period are from Troick.
  • Boris of the turn of the 13th century - BBL 714 could plausibly be grouped in with the above if positioned on the earlier side of its 1200-1220 date range.
  • Boris Miloslavov - BBL 343, 1280-1300
  • Boris Pjantelejev - BBL 263, 1360-1380
  • Boris Petarev - BBL 701, 1360-1380
  • Boris(es) of the late 14th century - BBL 579 (Nutn., from), 744 (Fedorovsk., 3rd party); could be either of the two Borises above, the one below, or someone else.
  • Boris and Nastas'ja - BBL 43 (1380-1400) and BBL 49 (1410-1420); BBL 43 features him giving orders to his wife, and BBL 49 is his wife complaining about his death.

The next step-- assuming we had a full data set-- would be to look at the names that co-occur in documents, to try to build up a "social network". (The results may also help fine-tune our identification of individual Borises.) Some kind of "point system" would likely be involved to weight the connections between people, say, 5 points between a writer and an addressee, 3 points between the writer/addressee and each of the third parties in a document, and 1 point between each of the 3rd parties. I haven't by any means worked out the details, and am enthusiastically open to ideas and suggestions for how to handle that part, but there's a lot of data left to be entered before the project reaches that point.
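As a rough illustration of the point system, here is a minimal Python sketch using the 5/3/1 weights suggested above. The function names are made up for the example, and pairs the scheme doesn't cover (say, two co-writers) simply get no points here, which is an assumption on my part. The roles used in the usage note are my reading of the one-line summary of BBL 736а, not encoded data.

```python
from collections import Counter
from itertools import combinations


def pair_weight(role_a, role_b):
    """Points for one pair of people in one document.

    Roles are "from" (writer), "to" (addressee), or "3p" (third party).
    """
    roles = {role_a, role_b}
    if roles == {"from", "to"}:
        return 5  # writer <-> addressee
    if roles == {"3p"}:
        return 1  # third party <-> third party
    if "3p" in roles:
        return 3  # writer or addressee <-> third party
    return 0      # co-writers or co-addressees: not covered by the scheme


def add_document(network, people):
    """Add one document's connections to a running Counter of pair weights.

    people is a list of (name, role) tuples. Plain names are used as keys,
    which is only a stand-in: a real network needs identifiers for
    individuals, since one name can cover several people.
    """
    for (name_a, role_a), (name_b, role_b) in combinations(people, 2):
        if name_a == name_b:
            continue  # the same name listed twice gets no link to itself
        weight = pair_weight(role_a, role_b)
        if weight:
            network[frozenset((name_a, name_b))] += weight
    return network
```

Calling add_document(Counter(), [("Ivan", "from"), ("Dristliv", "to"), ("Pavel", "3p"), ("Prokopii", "3p")]) gives Ivan and Dristliv 5 points, each of them 3 points with Pavel and with Prokopii, and Pavel and Prokopii 1 point. Running it over every document accumulates the weights, so people who co-occur repeatedly end up with stronger links.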


New birchbark XML: names in context (NIC)

To enable a more interesting proof-of-concept for the birchbark letter XML project, I've spent the last week making a new, limited data set (all documents from 1100-1120, plus some documents with the same names from the 12th century, and all the documents that include the name Boris) that lists all the names that occur in a given document and characterizes their role in the document. For the time being, I'm calling it "names in context" (NIC).

I've been adapting the schema every time I come across some new aspect that seems significant. Each document contains one or more elements, which include:
  • A name
  • Optionally, a "second name"-- sometimes a patronymic or city of residence used to specify which Ivan is being referred to
  • A role ("to", "from", or "3p" for 3rd party), and more optional details:
    • financial
      • gen - general, mostly for lists of names and amounts without any context
      • owes
      • owed
    • orders
      • giving
      • receiving
      • report
      • an optional "polite" attribute to indicate particularly deferential language
    • personal -- might get renamed to other if the "scope creep" continues like this
      • advice
      • complaint
      • news
      • death
      • religious-- for when the names refer to saints
      • gen
      • optional "polite" attribute here, too
  • A section for relatives of the person:
    • Their relation (mother, father, brother, etc.)
    • All options for the relative's role, as listed above
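The outline above can be restated as a small data model. This is a Python sketch for illustration only, not the actual XML schema; the class and field names (NameEntry, Detail, Relative, second_name) are assumptions of mine, while the role and detail values are taken from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ROLES = {"to", "from", "3p"}
DETAIL_VALUES = {
    "financial": {"gen", "owes", "owed"},
    "orders": {"giving", "receiving", "report"},
    "personal": {"advice", "complaint", "news", "death", "religious", "gen"},
}
POLITE_ALLOWED = {"orders", "personal"}  # "polite" is only listed for these two


@dataclass
class Detail:
    category: str          # "financial", "orders", or "personal"
    value: str             # one of the values allowed for that category
    polite: bool = False   # particularly deferential language

    def __post_init__(self):
        if self.value not in DETAIL_VALUES.get(self.category, set()):
            raise ValueError(f"{self.category}: {self.value} is not in the outline")
        if self.polite and self.category not in POLITE_ALLOWED:
            raise ValueError(f"'polite' is not defined for {self.category}")


@dataclass
class Relative:
    relation: str                      # mother, father, brother, etc.
    role: Optional[str] = None         # same options as for the person
    details: List[Detail] = field(default_factory=list)


@dataclass
class NameEntry:
    name: str
    role: str                          # "to", "from", or "3p"
    second_name: Optional[str] = None  # e.g. patronymic or city of residence
    details: List[Detail] = field(default_factory=list)
    relatives: List[Relative] = field(default_factory=list)

    def __post_init__(self):
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role}")
```

For instance, the Boris of BBL 742 from the listing in the previous post (to; financial: owes; orders: receiving) would be NameEntry("Борисъ", "to", details=[Detail("financial", "owes"), Detail("orders", "receiving")]).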

To illustrate, here are a few sample entries:

BBL 49: from Nast'ja to her brother

Conveying the news of her husband's death




BBL 736а: Ivan and Dristliv

Ivan tells Dristliv to collect money from Pavel and Prokopii.











The state of the birchbark letters XML project

I was hoping to delay this until I had the Subversion repository ready to distribute the first versions of the XML, but setting it up is taking a bit longer than I anticipated. I'm working on a proof-of-concept for the kinds of analysis that can be done using the name and date indices together; I'm hoping to finish it in about a week or so. Here's the current status of the deliverables as of 20 June 2010:

Date index

A preliminary version is done and ready for release. It contains all the documents and all the different proposed date ranges found in Zaliznjak 2004 and A sample entry:


Name index

Finished, other than the metadata about whether a name is attested elsewhere, compositional, or none of the above. A sample entry:




Names in context

Starting work on this index (which shows all the names found in each document, and the role they play) is the next step in building a proof-of-concept for the work on the BBL names sub-project.

Unicode word index

Completed through page 62 out of 113.

Word index with vowel etymology

Pending completion of Unicode word index.


Modernizing Research through Collaborative Reference Tools: The Medieval Slavic Linguistics Wiki

As a way of setting an actual deadline for myself to make some progress on the Medieval Slavic wiki, I submitted an abstract to the Fifth Annual Meeting of the Slavic Linguistic Society, which follows.

The rise in scholarly materials accessible through the Internet (JSTOR, SpringerLink, and/or PDFs published by individuals) has facilitated the research process for scholars worldwide. For Slavic linguists, however, many of the major reference works were published over 50 years ago, and are unavailable electronically. The success of Wikipedia as a general reference source for laypeople illustrates the potential of wikis as a way of organizing information from diverse sources. This paper aims to make the case for developing a separate wiki as a shared reference resource for Slavic linguists.

While online publication of materials lowers one barrier to access, the research process itself has largely remained the same. It is often necessary for scholars to seek out information from areas where they are less familiar with the literature. This can involve consulting a reference work and tracking the topic through bibliographies. The amount of time necessary to look up facts takes away from the time the scholar can devote to the intellectual content of their research.

A common approach to solving this problem is scanning materials. However, a tremendous amount of tedious work is involved in this process, and copyright is a non-trivial concern. A specialized wiki, compiled by Slavic linguists and Slavic linguistics graduate students, would reduce both tedium and copyright concerns, as the facts and conjectures contained within monographs and articles are not themselves subject to copyright.

A test case is currently being developed, limited to topics relevant to medieval Slavic linguistics. The wiki contains two kinds of content: article/monograph summaries that lay out the major claims of a particular work, and topic-based pages that bring together both undisputed facts and various conflicting scholarly claims on the topic, drawn from articles and monographs. Both kinds of wiki pages include copious, specific footnotes referencing the source material-- both to enable fact-checking and to allow the scholar to cite the original material rather than the wiki if desired.

MediaWiki, the wiki software developed for Wikipedia and also used for this test case, includes a number of features intended both to encourage contribution and to prevent abuse. Each user account comes with a page linking to that user's contributions on each of the pages where they have added (or removed) something. Even scholars without any specialized technology skills can contribute to the wiki, and include the link to their contributions on their CV to show their involvement in a digital humanities project. Every change made to a page is tracked in the database, and can be viewed, discussed, and/or reverted in cases of blatant abuse. The common graduate student assignment of writing article summaries could be redirected slightly towards writing summaries for articles not currently on the wiki. Breaking the information in those articles down into specific claims that can be added to topical pages would also encourage students to develop their analytic skills.

In addition to arguing for the benefits such a wiki could provide the Slavic linguistics scholarly community, this paper will present a live demonstration of the Medieval Slavic linguistics wiki.



