---------------------------------------------

# Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research"

Preferred citation (DataCite format):

Romero Diaz, Damian Yukio; Jia, Hanyu; Xu, Wei; Wang, Hui (2021). Sample data for "Design and Collection Challenges of Building an Academic Email Corpus for Linguistics and Computational Research". University of Arizona Research Data Repository. Dataset. https://doi.org/10.25422/azu.data.14259785

Corresponding Author: Damian Yukio Romero Diaz, University of Arizona, damianiji@email.arizona.edu

License: CC BY 4.0

DOI: https://doi.org/10.25422/azu.data.14259785

---------------------------------------------

## Summary

This dataset contains anonymized email chains between students and instructors in an active learning-teaching relationship. The data was collected from several departments at the University of Arizona.

The data consists of seven email chains (conversations) containing a total of 27 email texts. All data is presented in JSON text files. Some of the emails contain a few words in languages other than English, so all files are encoded in UTF-8.

We have enriched the email chain metadata with the gender, age range, first language(s), and additional language(s) reported by participants through a questionnaire. For most languages, we use the ISO 639-1 alpha-2 code. Languages that are not present in ISO 639-1 are reported as written by the participant. ISO 639-1 language codes are listed on the Library of Congress standards page.

The data is a sample taken from the Multilingual College Email Corpus (MCEC), an ongoing collection of authentic academic emails whose data collection began in Fall 2020. We expect to publish the full corpus within this deposit at a future, undetermined date.

The researchers have received IRB approval and participants' consent for publishing this data after anonymization under the following protocol name and number: "Collection and Analysis of Authentic College Emails" 2004533142

Our anonymization process is described in the repository listed under Additional Notes.

---------------------------------------------

## Files and Folders

- mcec_email_samples.json: 27 individual email records (one per line) in JSON format with UTF-8 encoding. Each email record contains the metadata and message information described below.

Metadata fields include:

- `chain_id`: A random 8-character hexadecimal code unique to each email chain (conversation)
- `message_id`: A random 8-character hexadecimal code unique to each email message
- `langs`: A list of language codes for all languages present in the email. Language codes are based on the ISO 639-1 alpha-2 code
- `college`: University of Arizona college unit where the data was collected

Message fields include:

- `header`: Includes the following subfields: `sender`, `receiver`, `date`
- `subject`: The 'title' of the email as shared with the research team. Note that titles are carried over from the email chains. The research team makes no assumptions about where the title changes, for instance from `Title` to `Re: Title`
- `body`: The body of the message, including the greeting, the main content, and the closing (Sincerely, Best wishes, etc.)
- `footer`: Any information below the closing. Note that most of this information has been redacted, since it often includes personally identifiable information

Note. For the header subfields, `sender` and `receiver` are both random 8-character hexadecimal codes unique to each participant, generated solely for the purpose of anonymization. The `date` field is redacted in all cases and appears with the tag `[[date]]`

#### email_chains.zip: Description of contents

- Seven "email chain record" files in JSON format with UTF-8 encoding. Naming convention: chain_{chain-id}.json, where {chain-id} is a random 8-character hexadecimal code.
Files are pretty printed using 4-space indentation.

These files contain all of the emails in `mcec_email_samples.json`. Each email chain record file corresponds to a chronologically ordered conversation between one instructor and one student. Each email chain record contains the `metadata`, `interlocutors`, and `contained_emails` information described below.

Metadata fields include:

- `chain_id`: See above
- `college`: See above
- `n_of_emails_in_chain`: An integer indicating the number of emails contained in the chain
- `email_ids_chron_order`: The unique email ids, listed in the chronological order of the conversation

Interlocutors fields include:

- `interlocutors`: Always composed of two interlocutors: `interlocutor_1` and `interlocutor_2`. If an email chain is composed of a single email, the fields for `interlocutor_2` are set to "N/A" (this only happens in file `chain_bada05ff.json`)
- `participant_id`: A random 8-character hexadecimal code unique to each participant, generated solely for the purpose of anonymization. It matches the email `sender` or `receiver` field, depending on which role the participant occupies in each email exchange
- `participant_type`: The main participant type, either `student` or `instructor`
- `participant_sub-type`: For students, the current level of studies. For instructors, their position, such as `Professor`, `Graduate Assistant/Associate`, etc.
- `questionnaire_data`: Demographic, sociolinguistic, and language background information. Subfields include: `gender`, `age_range`, `first_languages`, and `additional_languages`

emails_contained fields include:

- `emails_{n}`: Where {n} is the position of the email, counted from 1, in the chronological order of the exchange in that particular email chain. Each `email_n` field contains all the `metadata` and `message` fields described above in the `mcec_email_samples.json` file description.

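The one-record-per-line layout of `mcec_email_samples.json` can be read with the Python standard library alone. The following is a minimal sketch, not part of the dataset tooling: the function names are hypothetical, and it assumes that `chain_id` sits inside a `metadata` object in each record, falling back to a top-level `chain_id` key if no such object exists.

```python
import json
from collections import defaultdict
from pathlib import Path


def read_email_records(path):
    """Read one JSON object per non-empty line from a UTF-8 text file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


def get_chain_id(record):
    """Return the record's chain_id.

    Assumption: chain_id is nested under a 'metadata' key. If that key
    is absent, the record itself is searched instead.
    """
    metadata = record.get("metadata", record)
    return metadata["chain_id"]


def group_by_chain(records):
    """Group email records by chain_id, preserving the input order."""
    chains = defaultdict(list)
    for record in records:
        chains[get_chain_id(record)].append(record)
    return dict(chains)
```

Calling `group_by_chain(read_email_records("mcec_email_samples.json"))` should, according to the description above, yield seven chains holding 27 emails in total. Note that the grouping keeps the order of lines in the file, which is not necessarily the chronological order recorded in `email_ids_chron_order`.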
---------------------------------------------

## Materials & Methods

All files were created using Python 3.9. No additional libraries were required.

---------------------------------------------

## Additional Notes

Links:

- https://github.com/MCECorpus/MCEC-DeID