Stanford Libraries has negotiated direct access to newspaper corpora for text and data mining (TDM) research. As we build our collections, we will add information about new resources to this guide. The collections listed in this guide are bulk files intended for use with text and data mining methods and tools such as R and Python.
Information about text and data mining of ProQuest historical newspaper collections can be found on the ProQuest TDM Studio library guide page. Note: ProQuest text and data mining collections extend beyond newspaper corpora.
The New York Times TDM Archive (1980-2020) is now available to Stanford University researchers with SUNet IDs for text and data mining. Researchers can access article text and metadata, encoded as XML objects. This 40-year digital text archive of nytimes.com consists of approximately 3 million articles published by The New York Times, including but not limited to news, lifestyle, opinion, and The New York Times Magazine. The collection excludes reader comments, paid obituaries, and the kids section.
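Because the articles are delivered as XML objects, a first step in most projects is to pull the text and metadata out of each record. The following is a minimal Python sketch of that step, not the archive's documented workflow. The element names (`article`, `headline`, `pub_date`, `body`) and the sample record are hypothetical and used for illustration only; please check the documentation supplied with the archive for the actual schema.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample record. The real archive's element names and
# structure may differ; consult the archive's own documentation.
SAMPLE_XML = """<article>
  <headline>Example Headline</headline>
  <pub_date>1995-06-01</pub_date>
  <body>The city council met on Tuesday. The council approved the budget.</body>
</article>"""


def parse_article(xml_text):
    """Return a dict with the headline, date, and body of one record."""
    root = ET.fromstring(xml_text)
    return {
        "headline": root.findtext("headline", default=""),
        "pub_date": root.findtext("pub_date", default=""),
        "body": root.findtext("body", default=""),
    }


def word_counts(text):
    """Lowercase the text, split it into word tokens, and count them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


article = parse_article(SAMPLE_XML)
counts = word_counts(article["body"])
print(article["headline"], article["pub_date"])
print(counts.most_common(3))
```

For a full corpus, the same two functions could be applied to each file in turn, with the results collected into a table for analysis in Python or R.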
For access:
Please note: This NYT archive does not include access to the online version of the NYT. Please see the News databases section for information on how to access specific news articles published by The New York Times.
For additional questions, please contact: Regina Roberts
Stanford Libraries has access to Washington Post Archival Data for text and data mining.
To access the data, researchers must agree to and follow the Washington Post Data Use Agreement.
For further information about this source, please see the SearchWorks record and the GitLab page. The GitLab page includes the README file and data documentation.
For information on how to access the Washington Post online edition for daily news, please see "Online U.S. Newspapers (Stanford-wide access)".
For additional questions, please contact: Regina Roberts