Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research"

dataset

posted on 2021-05-20, 19:44 authored by Damian Yukio Romero Diaz, Hanyu Jia, Wei Xu, Hui Wang

This dataset contains anonymized email chains between students and instructors in an active learning-teaching relationship. The data was collected from several departments at the University of Arizona.

The data comprises seven email chains (conversations) containing a total of 27 email texts. All data is presented in JSON text files. Some of the emails contain a few words in languages other than English, so all files are encoded in UTF-8. We have enriched the email chains' metadata with the gender, age range, first language(s), and additional language(s) as reported by participants through a questionnaire. For most languages, we use the ISO 639-1 alpha-2 code. For languages that are not present in ISO 639-1, we report them as written by the participant. ISO 639-1 language codes can be found at the Library of Congress standards page here: https://www.loc.gov/standards/iso639-2/php/code_list.php
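As an illustration of working with files in this format, the following is a minimal Python sketch, not part of the deposit. It reads every JSON file in a folder as UTF-8 and separates language values that have the shape of an alpha-2 code from values written out as free text. The function names are hypothetical, the internal structure of the JSON files is not assumed, and the code check looks only at the shape of a value (two letters), not at whether the code is actually assigned in ISO 639-1.

```python
import json
import re
from pathlib import Path

# Shape check only: exactly two ASCII letters. This does NOT confirm that
# the value is an assigned ISO 639-1 code.
ALPHA2_SHAPE = re.compile(r"[A-Za-z]{2}")


def load_email_chains(directory):
    """Parse every *.json file in `directory`, reading each one as UTF-8.

    Returns a dict mapping file name to the parsed JSON content. Nothing is
    assumed about the keys inside each file.
    """
    chains = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            chains[path.name] = json.load(handle)
    return chains


def split_language_values(values):
    """Separate code-shaped values from languages written out as free text."""
    codes, free_text = [], []
    for value in values:
        cleaned = value.strip()
        if ALPHA2_SHAPE.fullmatch(cleaned):
            codes.append(cleaned.lower())
        else:
            free_text.append(cleaned)
    return codes, free_text
```

Passing the folder that holds the downloaded files to load_email_chains returns one parsed object per file. Opening the files with an explicit encoding avoids relying on the platform default, which matters for the emails that contain non-English words.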

The data is a sample taken from the Multilingual College Email Corpus (MCEC), an ongoing collection of authentic academic emails whose data collection began in Fall 2020. We expect to publish the full corpus within this deposit at a future, undetermined date.

The researchers have received IRB approval and participants' consent for publishing this data after anonymization under the following protocol name and number:

"Collection and Analysis of Authentic College Emails"

2004533142

Our anonymization process can be found at: https://github.com/MCECorpus/MCEC-DeID

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests or trouble downloading) can be directed to data-management@arizona.edu

Funding

University of Arizona College of Humanities' COH Graduate Student Research Grants Program, Spring 2020

University of Arizona Department of Linguistics' Human Language Technology Program Funds

Keywords

corpus linguistics, corpus design, language data collection, academic writing, computer-mediated communication, academic email, language studies, Linguistics

Licence

CC BY 4.0
