University of Arizona
3 files

Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research"

posted on 2021-05-20, 19:44 authored by Damian Yukio Romero Diaz, Hanyu Jia, Wei Xu, Hui Wang
This dataset contains anonymized email chains between students and instructors in an active learning-teaching relationship. The data was collected from several departments at the University of Arizona.

The data consists of seven email chains (conversations) containing a total of 27 email texts. All data is presented in JSON text files. Some of the emails contain a few words in languages other than English, so all files are encoded in UTF-8. We have enriched the email chains' metadata with the gender, age range, first language(s), and additional language(s) reported by participants through a questionnaire. For most languages, we use the ISO 639-1 alpha-2 code. Languages that have no ISO 639-1 code are reported as written by the participant. ISO 639-1 language codes can be found at the Library of Congress standards page here:
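As a minimal sketch of how the files might be read, the Python below parses every JSON file in a folder as UTF-8 and separates language values that have the shape of an alpha-2 code from values written out by participants. The function names `load_json_files` and `looks_like_alpha2` are hypothetical, and the sketch assumes nothing about the key names inside the files, which are not documented on this page.

```python
import json
import re
from pathlib import Path

# Shape check only: two lowercase ASCII letters, as used by ISO 639-1 alpha-2 codes.
# It does not verify that the code is actually assigned in the standard.
ALPHA2_SHAPE = re.compile(r"[a-z]{2}")


def load_json_files(directory):
    """Parse every *.json file in `directory` as UTF-8 and return {filename: data}."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.json")):
        # Explicit UTF-8 so non-English words decode the same on every platform.
        with path.open(encoding="utf-8") as handle:
            loaded[path.name] = json.load(handle)
    return loaded


def looks_like_alpha2(value):
    """Return True if `value` has the shape of an alpha-2 code, False for free text."""
    return isinstance(value, str) and ALPHA2_SHAPE.fullmatch(value) is not None
```

Calling `load_json_files` on the folder holding the downloaded files returns the parsed content keyed by file name. The shape check is only a heuristic: a language name that a participant happened to write as two lowercase letters would also pass it.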

The data is a sample taken from the Multilingual College Email Corpus (MCEC), an ongoing collection of authentic academic emails that began in Fall 2020. We expect to publish the full corpus within this deposit at a future, undetermined date.

The researchers have received IRB approval and participants' consent for publishing this data after anonymization under the following protocol name and number:

"Collection and Analysis of Authentic College Emails"

Our anonymization process can be found at:

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to


University of Arizona College of Humanities' COH Graduate Student Research Grants Program, Spring 2020

University of Arizona Department of Linguistics' Human Language Technology Program Funds

