What's a 'word': Multilingual DH and the English Default

Poster for Spectrums of DH talk

The following was my presentation at the McGill DH Spectrums of DH series. The images were displayed as Zoom backgrounds during the talk.

Thanks so much for the introduction, Kate. So, yes, I’m Quinn Dombrowski. In English, I identify as non-binary, but I don’t have any preferences about pronouns because in most of the languages I work with, pronouns don’t actually fix anything. In English, if someone says they use they/them pronouns, and you refer to them using those pronouns, congratulations! You’ve fixed binary gender in discourse! In other languages, where gender also triggers things like adjectival or verbal endings, it’s not nearly so clean-cut — especially when there’s no “neutral” ending, or the neuter-gender ending has been used as a slur. So, I do prefer gender-neutral nouns like “person” and “parent”, but don’t particularly care about pronouns. And that’s way more than you can reasonably fit at the bottom of a conference name tag or email footer. But in some sense, that’s doing multilingual DH in a microcosm: it doesn’t always fit neatly into the categories and rituals offered by English.

I’m so touched to be invited to speak as part of this series in honor of Stéfan Sinclair, who was, in his own quiet way, a lion of multilingual DH. His Voyant has better out-of-the-box support for not only multilingual interfaces, but multilingual text processing infrastructure. And all without big splashy announcements or glory-seeking. Stéfan listened when people encountered challenges. And he made Voyant better. And after seeing Voyant's impact in my non-English DH class last week, I’m so grateful that he did.

I’d also like to thank Cecily for this invitation, even if she did put me in the unenviable position of speaking after Roopika Risam. Still, I’m grateful for this chance to put her perspective and mine into dialogue, in anticipation of the broader discussion among the speakers at the end of the series. Within DH, at times, the focus of our respective work has been put in opposition to one another. Based on the likely attendees here, you probably all agree that linguistic diversity is not a substitute for diversity along axes of identity like race, gender, and sexuality. And that there remains much work to be done to address, in particular, anti-Black racism and the impact of the colonial legacy in our institutions, our classrooms, our organizations.

But what I’m hoping to do today is to offer language as another axis of diversity worth giving some thought and attention, in the spirit of “yes, and”. I have faith that we can care about more than one thing at a time, and take steps towards making the DH that we practice more inclusive for people who work with other languages — without it taking away from more focused efforts elsewhere.

That said, it’s not always easy. Being welcoming towards other languages, and actually supporting students and colleagues who work on non-English languages, takes work and attention. Especially for those of us who do Anglophone scholarship while living in an Anglophone world, it means paying attention to a blind spot we may have never noticed before. So, my hope for today is to point out this blind spot, to make it easier to see next time it comes up, and to try to give you some sense of what it actually looks like, and feels like, to work in non-English DH, or, frankly, to do DH in any language that’s not English or the national language of the country you’re working in. (It’s not necessarily less frustrating to do Japanese DH in Germany than in the US.) I love going on about the amazing work being done in non-English DH worldwide. Earlier this year I briefly started to revisit Alex Gil’s “Around DH in 80 Days” project with more recent work from around the world, before it had to be put on hold indefinitely because COVID-19 has meant I’m always short on hours in the day. But that’s not what I want to focus on today. Because if I do that, I’m afraid that what many of you will walk away from this talk with is “Gee, guess there’s a lot of stuff going on in other languages, that’s cool”, but you WON’T leave with a sense of why it should matter to YOU, or what role YOU can play in supporting, in particular, the next generation of non-English DH scholars.

Just to be clear upfront, with this talk I’m speaking especially to Anglophone DH scholars, in primarily or exclusively Anglophone countries, and to a lesser extent, in Western Europe. I’m going to focus primarily on the experience of doing DH research in a language other than English, including how that plays out for students in a DH classroom. These classrooms may look like a workshop or a formal course, and they’re often constructed as a kind of interdisciplinary space, even though they almost always assume the use of English-language materials. I won’t be focusing on the issues of language as they pertain to scholarly communication: what languages people present in at conferences, or publish in. Domenico Fiormonte has written extensively on the Anglophone dominance in scholarly communication, and there seems to be an increasing trend towards linguistic diversity in presenting one’s research, both at conferences and in writing, including, for instance, the recent launch of Humanistica’s Francophone DH journal. While I absolutely respect what people are doing with that shift, it’s not without consequences. The fact that more DH scholarship is published or presented in languages besides English is not, realistically, going to lead to enrollment spikes in French or Italian 101 driven by Anglophone DH scholars who want to keep up with the field. It means there’s additional friction in communication, which takes either labor (like the GO::DH “whispering” campaign for interpreting non-English talks at DH 2014 that Élika Ortega has written about) or tools to overcome. And many people just won’t bother. Maybe I’ll be grateful that so many of you have turned off your cameras so I don’t see the looks on your faces when I say this, but one of the things I’ve done in my non-English DH class this quarter has been to assign articles in languages that only some of the students can read, such as Italian and French.
I tell them if they can’t read it, they can use Google Translate to get the gist. Would I recommend they cite the Google-translated version? No. Are they getting every nuance? Also no. But machine translation has gotten to be quite good (particularly for European languages, which, yes, still leaves many scholars out). For many language teachers, the way you can tell when a student is cheating with Google Translate is that their writing is TOO good. And realistically, the alternative is simply not engaging with that scholarship. So I’ve been trying to get them in the mind-space of seeking out an imperfect solution, instead of giving up for lack of linguistic knowledge.

All right. On to the experience of doing non-English DH! If you’re an Anglophone scholar in an Anglophone country, you live in a sort of magical bubble where you may never run across friction in your research workflows, because things just… work. At least as well as DH tools and methods ever do. The title of my talk is “what’s a word?” Which in the Anglophone bubble feels almost like a nonsense question. Words are such a basic concept in our tools and methods, many of which come down to just being variations and elaborations on word counts: things like word frequencies, word distributions, word vectors. They all are based on some assumption of what a “word” looks like: particularly, that it’s separated by spaces, and mostly invariant in form. This is a terrible assumption if you’re working on many languages. Chinese is, in some sense, actually one of the easier cases. Yes, you have to artificially insert spaces between “words” (in Classical Chinese, you do it between each character; with modern Chinese, you need more complex algorithms to accommodate compounds). But once you’ve done that, Chinese behaves a lot like English in not having a lot of inflection. On the other hand, Russian (mostly) uses spaces like English does, but an individual word can have 30 different forms, depending on the number, grammatical gender, and case. The computer doesn’t know that those 30 different variants are the same “word”, so any analysis involving word counts just collapses into noise unless you lemmatize, which means changing your grammatical, human-readable text into a more computationally tractable derivative, where every word is replaced with its dictionary form. The resulting lemmatized text may be human readable… kind of, though you can lose things like what’s the subject and what’s the object… but it’s still wrong, in a way that takes extra effort to work through.
Imagine some parallel universe where DH word-count methods assumed that all words had endings added to the roots. And the only way to get English-language texts to work with these methods was if you first transformed them into Pig Latin. But not just any Pig Latin — one where the added vowel changes depending on whether it’s singular or plural, plus various etymological reasons. (For anyone who doesn’t know, Pig Latin is a children’s language game where you move any initial consonants or consonant clusters to the end of the word and add “ay”.) In this universe, you may start with your original text, but what you have to work with sounds more like: “it-ay asway ethay estboy ofyay imestoy , ityay asway ethay orstwiy ofyay aymestiy”. A Tale of Two Cities, right? But it takes a lot more effort to re-connect this text back to the version you’re familiar with.
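
To make the word-count problem above concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a real pipeline: the tiny lemma table `TOY_LEMMAS` is hand-made for this example (a real project would use a trained lemmatizer for the language in question), and the helper names `space_tokens`, `char_tokens`, and `lemma_counts` are hypothetical. The Russian string contains four case forms of "книга" ("book"); the Chinese string means roughly "I like reading books".

```python
from collections import Counter

# Hand-made toy lemma table (an assumption for illustration only):
# several case forms of the Russian noun "книга" mapped to its dictionary form.
TOY_LEMMAS = {
    "книга": "книга",   # nominative singular
    "книги": "книга",   # genitive singular / nominative plural
    "книгу": "книга",   # accusative singular
    "книгой": "книга",  # instrumental singular
}

RUSSIAN = "книга книги книгу книгой"
CHINESE = "我喜欢看书"


def space_tokens(text):
    """The 'English default': a word is whatever sits between spaces."""
    return text.lower().split()


def char_tokens(text):
    """Character-by-character splitting, as described for Classical Chinese."""
    return [ch for ch in text if not ch.isspace()]


def lemma_counts(text, lemmas):
    """Count words after replacing each known form with its dictionary form."""
    return Counter(lemmas.get(token, token) for token in space_tokens(text))


if __name__ == "__main__":
    # Without lemmatizing, each inflected form is counted as a separate "word".
    print(Counter(space_tokens(RUSSIAN)))
    # With the toy table, the four forms collapse into one lemma.
    print(lemma_counts(RUSSIAN, TOY_LEMMAS))
    # Splitting on spaces treats the whole Chinese sentence as a single "word".
    print(space_tokens(CHINESE))
    print(char_tokens(CHINESE))
```

Splitting modern Chinese one character at a time is itself a simplification; as noted above, compounds call for a more complex segmentation step than this sketch attempts.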

You think that’s bad? Consider, if you will, the case of one of my students this quarter, Merve Tekgüler, who works on Ottoman history. (All student examples shared with permission, FWIW.) A thing you should know about Turkish: it’s an agglutinative language, which means that a “word” (or a thing between spaces) can be a whole phrase or sentence in English. They literally have a “word” for “you are reportedly one of those people that we could not make Czechoslovakian” which doubles as a tongue-twister. (In those cases, lemmatizing isn’t enough, and you have to use another method of encoding sub-word components to get anything meaningful out of “word counts”. In case you were wondering.) Another thing you should know: the founding father of the Republic of Turkey, after the fall of the Ottoman Empire, changed the writing system from Arabic script to Latin. Arabic script goes right to left, Latin goes left to right. Latin script writes vowels (which are really important in Turkish); Arabic script does not. Merve wants to make Ottoman Turkish documents available to modern Turks. And so they’re working with Suphan Kirmizialtin at NYU to retrain a model for doing OCR on printed Ottoman Turkish texts into something that can not just OCR but transliterate: essentially, OCR into another alphabet, while adding back in the vowels that are missing. And because the software Merve is using assumes left-to-right, they have to reverse their transcriptions to create training data. Like this.
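
For readers curious what reversing a transcription might involve in code, here is a minimal Python sketch. It is not the workflow Merve and Suphan Kirmizialtin actually use, which isn't described here; the function names `reverse_line` and `reverse_transcription` are hypothetical. The one design choice worth noting is that a plain character-by-character reversal would detach combining marks (such as Arabic vowel signs) from the letters they belong to, so the sketch keeps each base character together with the marks that follow it.

```python
import unicodedata


def reverse_line(line):
    """Reverse one line, keeping combining marks attached to their base letter."""
    clusters = []
    for ch in line:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_transcription(text):
    """Reverse every line of a transcription, preserving the order of lines."""
    return "\n".join(reverse_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    # Latin-script example, so the effect is easy to see.
    print(reverse_transcription("Call me Ishmael.\nSome years ago"))
    # Arabic-script example: the letters of "kitab" (book), reversed.
    print(reverse_line("\u0643\u062a\u0627\u0628"))
```

This handles only simple base-plus-mark sequences; full grapheme-cluster segmentation as defined by Unicode is more involved and is left out of this sketch.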

Of course, this slide is STILL in “easy mode” — it’s still the same alphabet as modern English, and keeps all the vowels. A better comparison would be this:

Any guesses about what the text here is? It’s a familiar one… (Moby Dick)

So these are funny counterfactuals, and every week I teach, I think of more. Take Python on Windows, which to this day, unless you tell it otherwise, defaults to a legacy, locale-specific text encoding (on an English-language system, one built around English and Western European languages) instead of UTF-8, the encoding for Unicode, an international standard that supports almost all languages and writing systems. Imagine how frustrated you’d be if you tried to open a text file using Python, but it failed unless you explicitly specified that you had English or multilingual text, because anything other than Chinese characters would trigger an error.
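
The real-world version of that counterfactual can be sketched in a few lines of Python. This is a minimal illustration under one assumption: that the legacy default is `cp1252`, which is what Python typically picks up on English-language Windows systems (other locales get other legacy encodings). The helper names `write_utf8`, `read_with`, and `demo` are hypothetical. The sketch writes a UTF-8 file, then reads it back twice, once with the legacy encoding and once with `encoding="utf-8"` passed explicitly to `open()`.

```python
import os
import tempfile


def write_utf8(path, text):
    """Save text as UTF-8, the way most corpora are distributed."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_with(path, encoding):
    """Return the decoded text, or the name of the error raised while decoding."""
    try:
        with open(path, encoding=encoding) as handle:
            return handle.read()
    except UnicodeDecodeError as error:
        return type(error).__name__


def demo(text):
    """Read the same UTF-8 file with a legacy encoding and with UTF-8."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        write_utf8(path, text)
        return read_with(path, "cp1252"), read_with(path, "utf-8")


if __name__ == "__main__":
    print(demo("word"))            # plain ASCII reads fine either way
    print(demo("Tekg\u00fcler"))   # legacy read silently garbles the u-umlaut
    print(demo("слово"))           # legacy read fails outright on this Russian word
```

The habit this suggests for teaching is small: always pass `encoding="utf-8"` when opening text files, so that a student's script behaves the same on every operating system.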

And these scenarios are funny for a reason, because I hope you remember them next time you’re teaching. Maybe your class is going great, and all the students seem to be getting the material you’re covering in your workshop or course, but have you asked them what language they use for their own work? What language they want to apply these methods to? Once the quarter’s over, how will your Russian student feel when they apply the word-count based methods they learned in your class, and get garbage back because they don’t know about lemmatizing? Or how will your Ottomanist handle the complete and utter fail they get when trying to run their manuscripts through Adobe Acrobat or ABBYY FineReader for OCR? What will they conclude? Will they realize that the problem is Anglo-centric DH pedagogy, and not them? Will they want to keep doing DH, when it looks like it “doesn’t work” for the texts they care about?

I wish the only problems these students faced were technical. Too often, people who I’m sure are well-meaning do things that further undermine people who work on multilingual DH, who are already struggling with a harder path than their Anglophone peers.

There was the time I was invited to present as part of a seminar in an English department. I was excited about it, because the other presentation was the perfect topical complement to my multilingual talk, and I felt like I had something to contribute in terms of raising questions about how broadly applicable the English-language results were, given the diversity in my results spanning multiple different language corpora. I felt great about the presentation (it’d gone over really well before in non-English contexts), but during the Q&A, every single question was about the other talk. Grad students didn’t want to touch something they weren’t familiar with, with a ten-foot pole. And when I got the pity question from the organizer, it was about a hypothetical corpus I didn’t have, instead of what I’d presented. When my one friend there with a background in non-English languages saw where the winds were blowing and texted me a question, it kicked off a parallel (literally under-the-table) Q&A session that actually engaged with language. We find outlets in unfriendly spaces. I was just grateful that the grad students I work with weren’t there. Because, whatever, so I get angry about the response to a talk. I’ve been doing DH since 2004. This won’t drive me away from the field. But those students? What would have left them with the impression that they have any place in DH?

Another favorite of mine is the “international” or “multilingual” panel scheduling at conferences, where you stuff everyone working on not-English onto a single panel, safely tucked away from the rest of the conference where no one has to deal with you unless they opt into it. Imagine being someone who does fandom studies in English, and going to China for a conference, where you’re put on a panel with someone who does medieval Islamic star charts and someone doing 19th century Russian literature. What’s the problem? All of you use alphabetic scripts in your research. You’ve got things in common! Sometimes it works. Sometimes you’re able to draw out unexpected connections. But most of the time, you’re left wishing someone could have made space for you at a panel on the TOPIC of your work, without getting hung up on the language.

Maybe my very favorite was some of the response to a write-up I’d done about my non-English DH course. Concerns that I should articulate a back-story: surely teaching a DH course that centered non-English languages must be a reaction to some instructor who had done me wrong, and I should tell it as a story of redemption or overcoming challenges. Dude, I work for a non-English literatures department. I teach non-English DH because I DO non-English DH and it really doesn’t need a whole psychological back-story. And then, insisting that I cut the part of the piece that gets right to the heart of what I’m talking about today, to try to make people care, because talking about the challenges students face beyond the technical isn’t appropriate. Here’s the hard truth: whatever challenges your non-English literatures and history students face when trying to apply these methods in the DH classroom pale in comparison to what they face OUTSIDE it. Are you horrified by the state of the job market for English literature folks this year? Honestly, feels very familiar, as someone centered in not-English. One tenure track post every year, or two, or three, depending on the thing you work on? Sounds about right. So many students go into programs wanting to teach literature, and come out of it scrambling to teach a language. Which, if you’ve never done it, imagine teaching a composition class… with students whose proficiency is most comparable with a 4-year-old… and spending the next 30 years discussing favorite foods and colors and what students did over the weekend. I mean, God bless the people who that works for. But it’s not what most students dream of, and it’s probably the most likely outcome for a lot of students who work on other languages. Not to get too neoliberal here, but DH plus language skills give my students options. But only if they have someone to show them how they work together. And that just doesn’t happen in most Anglophone-oriented classrooms.

But many of you, you’re in a position to do something about it. You’re in the position to engage with multilingual DH, as a colleague, or as a teacher.

Also, please don’t think you have to be some kind of polyglot genius to do this? Multilingual DH is the main thing I do, and I’m a failed German heritage speaker with rusty high school and Argentinian exchange student Spanish, Japanese self-taught to the point of failure, who’s taken fourth-year Russian three times, and who’s never spent more than a month outside the US. In college I dreamed of living abroad, but life had other plans.

So here’s some tips.

Name language. This sounds like such a basic thing, but it’s so common to not do this and just treat “English” as a synonym for “literature”, and omit both. Emily Bender, a linguist, is fighting the good fight for what’s been dubbed the “Bender rule”, essentially “always name the language you’re working on”, vis-a-vis computational language processing. I’d love to see the same adopted in DH. And actually, this is a moment where decolonial and multilingual DH intersect: just this last week, I saw that the Cornell University English department had voted to rename itself “Department of Literatures in English” as a step towards de-colonizing the Western university system. Whether they intended it or not, I feel like that brings that English department closer in its framing to, say, the department I studied in at the University of Chicago, the “Department of Slavic Languages and Literatures”. For English literature people, it’s so common to omit the language name, and just leap to period or style. “I study… modernism!” “I study… the 19th century!” But if the scope of those things is English, just say it? Everyone who works on non-English does, and it helps with specificity and equality, avoiding any implication that “English” should be synonymous with “literature”.

Also, get used to looking at languages you can’t read. I find it so strange that people who are generally on board with DH as a collaborative enterprise nonetheless balk when the “collaboration” has to take the shape of relying on a student to read text in a language you don’t know. It’s okay. It’s just another kind of collaboration. Not being able to read another language casts no shade on your expertise, but working with a student who brings linguistic expertise to a project says something about your openness to embrace the challenges we inevitably face if we want to ask big questions.

Because what I really hope is that scholars with the power to do this (not you, grad students, you’ve got a different set of challenges and constraints) can open themselves up to questions bigger than their own expertise has a prayer of answering. If you’re interested in fandom — like, actually, FANDOM — you need to look at how it manifests not only in English. Same with literature. Same with the Gothic. It’s so common to wave your hand and try to extrapolate from what you find in English to LITERATURE writ large, but it doesn’t work that way. There’s so much more literature than you can conceive of just looking at English, and it’s all weird and wonderful in its own way. I feel like this, in some sense, is the gift of having spent my entire career in staff jobs. I haven’t had the luxury of being able to become an expert in any one thing— my role has been to support others in their work, which means having to find something interesting in just about anything that people come up with. Consider cultivating that kind of staff mindset with work outside your own expertise. What can you contribute? What can you learn? How far can you get, together?

The students in my non-English DH class this quarter are smart and brave and persistent and insightful and I can’t wait until all this is over and I can find some way to introduce them to you all in person. Beyond their linguistic capabilities and interests, they bring their own life experiences and the combination is incredible. Victoria Rahbar, who’s spent the last couple weeks OCRing Japanese (which can run vertically or horizontally, or sometimes both on the same book cover), wrote in her reflection piece how she’s concerned with more than just getting the characters and orientation right: she wants the texts she OCRs to be accessible to students with screen readers, which means getting the reading order set correctly too. And Maria Massucco, who’s been working with Italian texts, shared a beautiful reflection about how she overcame her anxiety about deleting the paratext from ebooks before doing computational text analysis, by connecting it to the way words sometimes disappear when you’re translating between languages.

If you want to understand literature, ACTUALLY literature, not just English as a dubious proxy for literature, you’ve got to reach out to your non-English DH colleagues, and the students coming up. Not just talk to them, not just invite them to your seminars while plunging down the usual Anglophone rabbit holes, but make the leap to actually try to understand and engage with their work. Their work is valuable (it has meaning and worth) without being comparative. But they almost certainly recognize the limitations to the scope of the claims that they can make, within the confines of the languages they work in. Do you? And are you willing to reach beyond your own expertise, to embrace what you could learn beyond English?

Thank you.