University of Arizona

Access restricted to University of Arizona affiliates

Reason: Dataset is for research use by University of Arizona affiliates only. Sharing and/or redistribution of any portion of this dataset is prohibited.

Annotated English Gigaword Linguistic Data Consortium (1994-2010)

posted on 2021-12-10, 20:45, authored by University of Arizona Libraries
Dataset available only to University of Arizona affiliates. To obtain access, you must log in to ReDATA with your NetID. Data is for research use by each individual downloader only. Sharing and/or redistribution of any portion of this dataset is prohibited.

Annotated English Gigaword, Linguistic Data Consortium (LDC) Catalog Number LDC2012T21 and ISBN 1-58563-629-0, was developed by Johns Hopkins University's Human Language Technology Center of Excellence. It adds automatically generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07) and also contains an API and tools for reading the dataset's XML files. The goal of the annotation is to provide a standardized corpus for knowledge extraction and distributional semantics, enabling broader researcher involvement in large-scale knowledge acquisition efforts.

Annotated English Gigaword contains the nearly ten million documents (over four billion words) of the original English Gigaword Fifth Edition from seven news sources: Agence France-Presse, Associated Press Worldstream, Central News Agency of Taiwan, Los Angeles Times/Washington Post Newswire Service, Washington Post/Bloomberg Newswire Service, New York Times Newswire Service, and Xinhua News Agency. The following levels of annotation were added: tokenized and sentence-segmented text, Treebank-style constituent parse trees, syntactic dependency trees, named entities, and in-document coreference chains.

The files in Annotated English Gigaword are compressed with gzip (*.gz). If you are using a Windows computer, you may need to install a program that can decompress these files. The data is stored in a form similar to the Gigaword SGML format, with XML annotations containing the additional markup. The included API provides object representations for the contents of the XML files.
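If you prefer not to use the included API, the gzipped XML files can also be streamed directly with standard-library tools. The sketch below is illustrative only: it writes a tiny made-up document to a .gz file and reads it back, and the element names (DOC, sentence, token) and attribute names are assumptions, not the dataset's real schema; consult README.txt and the bundled API documentation for the actual markup.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical Gigaword-style fragment; the real annotation schema differs.
SAMPLE = b"""<FILE>
<DOC id="AFP_ENG_199405.0001">
<sentence id="1"><token pos="NN">example</token></sentence>
</DOC>
</FILE>"""

# Write the sample to a temporary gzip-compressed file.
tmp = tempfile.NamedTemporaryFile(suffix=".xml.gz", delete=False)
tmp.close()
with gzip.open(tmp.name, "wb") as f:
    f.write(SAMPLE)

# Stream the compressed file back without decompressing it to disk first.
with gzip.open(tmp.name, "rb") as f:
    root = ET.parse(f).getroot()
    doc_ids = [doc.get("id") for doc in root.iter("DOC")]
    tokens = [tok.text for tok in root.iter("token")]

print(doc_ids)
print(tokens)
os.remove(tmp.name)
```

Because the uncompressed files are very large, streaming them through gzip.open (or ET.iterparse for document-at-a-time processing) avoids holding an entire decompressed file in memory.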

NOTE: The uncompressed datasets are very large.

Detailed file descriptions and MD5 hash values for each file can be found in the README.txt file.
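To confirm that a download completed without corruption, you can compute a file's MD5 hash and compare it against the value listed in README.txt. A minimal sketch (the file path here is a stand-in; substitute the actual downloaded file):

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks so
    that even very large files never need to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demonstration on a small temporary file; for real dataset files, compare
# the printed digest against the corresponding entry in README.txt.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello gigaword")
tmp.close()
checksum = md5_of_file(tmp.name)
print(checksum)
os.remove(tmp.name)
```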

Portions © 1994-2010 Agence France Presse, © 1994-2010 The Associated Press, © 1997-2010 Central News Agency (Taiwan), © 1994-1998, 2003-2009 Los Angeles Times-Washington Post News Service, Inc., © 1994-2010 New York Times, © 2010 The Washington Post News Service with Bloomberg News, © 1995-2010 Xinhua News Agency, © 2012 Matthew R. Gormley, © 2003, 2005, 2007, 2009, 2011, 2012 Trustees of the University of Pennsylvania

For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to
