Newspaper and News Sources: Text data mining: newspapers
Text and data mining: newspaper corpora
Stanford Libraries has negotiated direct access to newspaper corpora for text and data mining (TDM) research. As we build our collections, we will add information about new resources to this guide. The collections listed in this guide are bulk files intended for use with text data mining methods and tools such as R and Python.
Information about text data mining of ProQuest historical newspaper collections can be found on the ProQuest TDM Studio library guide page. Note: ProQuest text and data mining collections extend beyond newspaper corpora.
New York Times TDM Archive
The New York Times TDM Archive (1980-2020) is now available to Stanford University researchers with SUNet IDs for text and data mining. Researchers can access article text and metadata encoded as XML objects. This 40-year digital text archive of nytimes.com consists of approximately 3 million articles published by The New York Times, including but not limited to news, lifestyle, opinion, and The New York Times Magazine. The collection excludes reader comments, paid obituaries, and the kids section.
For access:
- Stanford researchers must have an active SUNet ID.
- Researchers must agree to the terms of a Data Use Agreement (DUA), available via the SearchWorks record.
- The research must be for non-commercial and academic purposes.
- Dates covered: 1980-2020.
- Instructions for using the XML files are included in the data documentation.
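Because the element names used in the XML files are defined in the data documentation rather than in this guide, a reasonable first step is to survey which tags a file contains. The following is a minimal Python sketch using only the standard library. The function name `count_tags` and the sample XML string are hypothetical and do not reflect the archive's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Count how often each element tag occurs in one XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


# Hypothetical sample for illustration only; consult the data
# documentation for the real element names.
sample = (
    "<article><title>Example</title>"
    "<body><p>One.</p><p>Two.</p></body></article>"
)
tag_counts = count_tags(sample)
print(tag_counts.most_common())
```

Running the same function over a handful of real files, read from disk, should show which elements carry the article text and which carry metadata before any larger analysis is written.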
Please note: The NYT TDM Archive does not include access to the online version of the NYT. Please see the News databases section for information on how to access specific news articles published by The New York Times.
For additional questions, please contact: Regina Roberts
Washington Post Archival Data
Stanford Libraries has access to Washington Post Archival Data for text and data mining.
- The Washington Post Archival Data includes article text and metadata encoded as JSON objects. All articles in the archive were printed in the physical newspaper, and may or may not appear on The Washington Post website.
- Files are compressed using gzip and named by date in YYYY-MM-DD format (e.g. articles-1977-01-27.json.gz). Each file covers a single day of available article data.
- Files are bundled into folders by year. The folders are archived and compressed (e.g. 1977.tar.gz). Each year has its own file manifest, which lists the files associated with that year, their sizes in bytes, and their MD5 checksums.
- Article data goes as far back as 1977. New article data is appended quarterly.
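The file layout described above can be handled with the Python standard library. The sketch below shows one way to read the date from a daily file name, compute a file's MD5 digest for comparison with the manifest, and load a daily file. It rests on assumptions: the function names are hypothetical, the manifest format is not described here (so comparing digests is left to the reader), and each daily file is assumed to hold a single JSON document. If the files turn out to contain one JSON object per line, the loader would need to parse line by line. The README on the GitLab page is the authoritative source.

```python
import gzip
import hashlib
import json
import re
from datetime import date
from pathlib import Path

# Pattern based on the example file name articles-1977-01-27.json.gz.
DAILY_FILE = re.compile(r"articles-(\d{4})-(\d{2})-(\d{2})\.json\.gz$")


def file_date(name: str) -> date:
    """Return the date encoded in a daily file name."""
    match = DAILY_FILE.search(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_daily_file(path: Path):
    """Decompress a daily file and parse it as one JSON document.

    Assumption: the file holds a single JSON value, not JSON Lines.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

Checking each digest against the value listed in the year's manifest before parsing helps catch incomplete or corrupted downloads early.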
To access the data, researchers must agree to and follow the Washington Post Data Use Agreement.
For further information about this source, please see the SearchWorks record and the GitLab page. The GitLab page includes the README file and data documentation.
For information on how to access the Washington Post online edition for daily news, please see "Online U.S. Newspapers (Stanford-wide access)".
For additional questions, please contact: Regina Roberts
- Last Updated: Dec 5, 2024 9:56 AM
- URL: https://guides.library.stanford.edu/newspapers