Newspaper and News Sources: Text data mining: newspapers

Text and data mining: newspaper corpora

Stanford Libraries has negotiated direct access to newspaper corpora for text and data mining (TDM) research. As we build our collections, we will add information about new resources to this guide. The collections listed in this guide are bulk files intended for use with text data mining methods and tools such as R and Python.

Information about text data mining of ProQuest historical newspaper collections can be found on the ProQuest TDM Studio library guide page. Note: ProQuest text and data mining collections extend beyond newspaper corpora.

New York Times TDM Archive

The New York Times TDM Archive (1980-2020) is now available to Stanford University researchers with SUNet IDs for text and data mining. Researchers can access article text and metadata encoded as XML objects. This 40-year digital text archive consists of approximately 3 million articles published by The New York Times, including but not limited to news, lifestyle, opinion, and The New York Times Magazine. The collection excludes reader comments, paid obituaries, and the kids section.

For access:

  • Stanford researchers must have an active SUNet ID.
  • Researchers must agree to the terms of a Data Use Agreement (DUA), which is linked in the SearchWorks record.
  • The research must be for non-commercial and academic purposes.
  • Dates covered: 1980-2020.
  • Instructions for using the XML files are included in the data documentation.
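
Because the articles are delivered as XML, a first exploratory step in Python might be to list which elements a file contains before writing any extraction code. The sketch below is only an illustration: the sample document and its tag names (article, title, body, p) are hypothetical and are not the archive's actual schema, which is described in the data documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_element_tags(xml_text):
    """Count how often each element tag occurs in one XML document."""
    root = ET.fromstring(xml_text)
    return Counter(element.tag for element in root.iter())


def all_text(xml_text):
    """Return all text content of the document with whitespace normalized."""
    root = ET.fromstring(xml_text)
    return " ".join(" ".join(root.itertext()).split())


# Hypothetical sample, NOT the real archive schema.
SAMPLE = (
    "<article><title>Hello</title>"
    "<body><p>One</p><p>Two</p></body></article>"
)

if __name__ == "__main__":
    for tag, count in count_element_tags(SAMPLE).most_common():
        print(tag, count)
    print(all_text(SAMPLE))
```

Running the tag count over a handful of real files should show which elements hold the headline, body text, and metadata, which can then be checked against the documentation.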

Please note: The NYT TDM Archive does not include access to the online edition of the NYT. Please see the News databases section for information on how to access specific articles published by The New York Times.

For additional questions, please contact: Regina Roberts

Washington Post Archival Data

Stanford Libraries has access to Washington Post Archival Data for text and data mining.

  • The Washington Post Archival Data includes article text and metadata encoded as JSON objects. All articles in the archive were printed in the physical newspaper, and may or may not appear on The Washington Post website.
  • Files are compressed using gzip and named by date in YYYY-MM-DD format (e.g. articles-1977-01-27.json.gz). Each file covers a single day of available article data.
  • Files are bundled into folders by year. The folders are archived and compressed (e.g. 1977.tar.gz). Each year has its own file manifest, which lists the files associated with that year, their sizes in bytes, and their MD5 checksums.
  • Article data goes as far back as 1977. New article data is appended quarterly.
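
The file layout described above can be handled with the Python standard library alone. The following is a minimal sketch, not the documented workflow: it assumes a year archive has already been extracted, that each daily file holds either a single JSON document or one JSON object per line, and that the expected checksum is copied by hand from the manifest (the manifest's own format is not described here). The function names are illustrative; the README on the GitLab page is the authoritative reference.

```python
import gzip
import hashlib
import json
from datetime import date
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def date_from_filename(path):
    """Extract the date from a name like articles-1977-01-27.json.gz."""
    name = Path(path).name
    stamp = name[len("articles-"):].split(".")[0]
    return date.fromisoformat(stamp)


def load_daily_file(path):
    """Parse one gzip-compressed daily file.

    Assumption: the content is either one JSON document or
    JSON Lines (one object per line); both cases are tried.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

A typical use would be to compare md5_of_file(path) against the checksum listed in that year's manifest before calling load_daily_file(path), so that a corrupted download is caught before analysis begins.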

To access the data, researchers must agree to and follow the Washington Post Data Use Agreement.

For further information about this source, please see the SearchWorks record and the GitLab page. The GitLab page includes the README file and data documentation.

For information on how to access the Washington Post online edition for daily news, please see "Online U.S. Newspapers (Stanford-wide access)". 

For additional questions, please contact: Regina Roberts