University of Arizona

Lessons Learned from a Secondary Analysis Using Natural Language Processing and Machine Learning from a Lifestyle Intervention

Posted on 2022-05-04, 19:55; authored by Sarah Freylersythe, Rebecca Sharp, John Culnan, Damian Yukio Romero Diaz, Yiyun Zhao, Hagan Franks, Remo Nitschke, Steven J. Bethard, Tracy E. Crane

This poster was presented April 6–9, 2022, at the 43rd Annual Meeting & Scientific Sessions of the Society of Behavioral Medicine in Baltimore, MD, USA.

We provide the poster in several formats, including svg, pptx, png, pdf, and jpg. We also provide two figures: the "iceberg figure" that illustrates the depth of the untapped data from the original LIvES study (iceberg_figure.png), as well as the QR code that links to additional information such as references (QR - Handout.png).


Submitted Abstract:

Background: Recorded telephone coaching sessions (approximately 24,500) in English and Spanish from 1205 women participating in the Lifestyle Intervention for oVarian cancer Enhanced Survival (LIvES), GOG 0225, study were used for this analysis. The LIvES Study tested whether a lifestyle intervention of increased physical activity and a healthy diet would increase progression-free survival compared to an attention control using trained health coaches and Motivational Interviewing (MI), a directive, patient-centered counseling approach; 323 LIvES Study coaching session recordings were scored for adherence to MI techniques. Here we describe lessons learned from a secondary analysis of LIvES data that uses machine learning and natural language processing to automate fidelity assessment and predict lifestyle behavioral outcomes.

Methods: Numerous steps were necessary to prepare the call recordings for natural language processing. Data were aligned through a combination of participant phone numbers, coach names, participant names, entry dates, and recording dates. Transcription was performed automatically with wav2vec. An annotation interface was developed using Label Studio, and an annotation guideline was adapted from the existing Motivational Interviewing Treatment Integrity (MITI) 3.0. Finally, a pilot annotation of the call recordings was completed and initial inter-rater reliability was measured.
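The abstract does not state which statistic was used to measure inter-rater reliability. As a minimal sketch, assuming two annotators each assign one categorical label per coach turn, Cohen's kappa is one common choice. The function name and the example labels below are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    Assumption: each rater assigns exactly one categorical label per item.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")

    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical turn-level labels from two annotators.
annotator_1 = ["question", "question", "reflection", "reflection"]
annotator_2 = ["question", "reflection", "reflection", "reflection"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the annotations.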

Results: The process of preparing this secondary analysis resulted in a number of lessons learned. First, data infrastructure for the original LIvES study, due to its long-running nature, evolved in ways that lost data continuity. The data alignment process would have been simplified by establishing a single identifier to link calls, outcomes, and MITI scores, and maintaining that identifier over the course of the project. Second, evaluating the quality of automated transcription systems is difficult and could have been streamlined by manually transcribing a small number of study calls to be used for evaluation. Finally, training a machine learning model to assess interviewer turns could have been simplified by establishing a protocol for coding MITI scoring using an annotation tool, resulting in turn-level annotations alongside the holistic scoring typical of MITI.
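To illustrate the second lesson, a small set of manually transcribed calls could serve as references for scoring automatic transcripts. The sketch below computes word error rate (WER) with a word-level edit distance. WER is an assumed choice of metric, since the abstract does not name one, and the function name and example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Assumption: lowercasing and whitespace tokenization are sufficient
    normalization for this illustration.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("Reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical manual transcript versus automatic transcript.
manual = "how many steps did you walk today"
automatic = "how many steps you walked today"
wer = word_error_rate(manual, automatic)
```

Here the automatic transcript drops one word and changes another, so two of the seven reference words are in error.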

Conclusion: Behavioral interventions should engage the support of a computational scientist in the study design planning stage to take advantage of the largely under-utilized data collected in these trials.

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to


LIvES Study (GOG 0225) NCT00719303


NIH/NCI 1R21CA256680-01 (MPI: Tracy E. Crane/Steven J. Bethard)