Go to https://tdmstudio.proquest.com and create an account using your institution/university email address.
Users can text and data mine against Stanford University Libraries' subscriptions and perpetually licensed content on the ProQuest Platform, including current and historical newspapers; dissertations and theses; and scholarly journals. For more information, see the available titles.
Yes, up to 5 researchers can collaborate on the same workbench. However, users should note that data storage and export limits are fixed, and do not increase when multiple researchers are attached to the same workbench. Contact contact-cidr@stanford.edu to add collaborators to your workbench.
Yes, as long as the lead researcher is affiliated with Stanford, research colleagues from other institutions can be added to the workbench. Contact contact-cidr@stanford.edu to add collaborators to your workbench.
What are the data storage limits?
The workbench has two components, each with different storage limits: the dashboard and the Jupyter Notebook. The dashboard can hold up to 10 datasets of up to 2 million documents each. Note that this limit is fixed and does not increase if there are multiple researchers attached to the workbench. In contrast, the Jupyter Notebook can hold up to 100 GB of data. To maximize dashboard space, users can copy datasets to the Jupyter Notebook. The original copies can then be deleted from the dashboard, creating room for new datasets.
What are the data export limits?
There is a rolling 7-day export limit of 15 MB. The export limit is shared by all users affiliated with the workbench. ProQuest sometimes makes exceptions for larger file exports; contact contact-cidr@stanford.edu for more information.
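To make the rolling window concrete, here is a minimal sketch of how a team might track its shared export budget. The export log, the function name, and the reading of 1 MB as 1,000,000 bytes are all assumptions for illustration; TDM Studio enforces the actual limit on its side.

```python
from datetime import datetime, timedelta

# Assumption: 1 MB is counted as 1,000,000 bytes. ProQuest may count differently.
EXPORT_LIMIT_BYTES = 15 * 1000 * 1000
WINDOW = timedelta(days=7)


def remaining_export_bytes(past_exports, now):
    """Return how many bytes are left in the rolling 7-day window.

    past_exports is a hypothetical, self-maintained log of
    (timestamp, size_in_bytes) pairs covering every user on the workbench,
    because the limit is shared.
    """
    cutoff = now - WINDOW
    used = sum(size for when, size in past_exports if when > cutoff)
    return max(EXPORT_LIMIT_BYTES - used, 0)


now = datetime(2024, 3, 15, 12, 0)
log = [
    (datetime(2024, 3, 5, 9, 0), 6_000_000),   # more than 7 days old: no longer counts
    (datetime(2024, 3, 12, 9, 0), 4_000_000),  # inside the window: still counts
]
print(remaining_export_bytes(log, now))
```

Because the window rolls, the budget recovers gradually as older exports age out rather than resetting on a fixed day.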
Downloads are available for 2 hours. If 2 hours have passed and your export is no longer available, you can resubmit the download request.
Can I export full text from ProQuest TDM Studio?
Researchers can export scripts and their outputs, but they cannot export the full text. Stanford University Libraries has offline copies of some (but not all) ProQuest full-text corpora. For more information, contact contact-cidr@stanford.edu.
In what structure are the text data provided?
The text data are provided as XML documents. Specific tags will depend on the content type (e.g. SchoolCodeName would only be appropriate for dissertations and theses). For more information, see the Tag Overview.
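As an illustration, the following sketch reads a few fields from one XML document with Python's standard library. The sample record is invented: the Record root element and the Title tag are assumptions, and GOID is assumed to appear as a tag. Only SchoolCodeName is named above, so check the Tag Overview for the real tag names before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Invented sample; real TDM Studio documents will have a different layout.
SAMPLE = """<Record>
  <GOID>123456789</GOID>
  <Title>Example dissertation title</Title>
  <SchoolCodeName>Example University</SchoolCodeName>
</Record>"""


def first_text(root, tag):
    """Return the stripped text of the first descendant named tag, or None."""
    element = root.find(f".//{tag}")
    if element is None or element.text is None:
        return None
    return element.text.strip()


def parse_record(xml_string, tags=("GOID", "Title", "SchoolCodeName")):
    """Collect the requested tags into a dict; tags that are absent map to None."""
    root = ET.fromstring(xml_string)
    return {tag: first_text(root, tag) for tag in tags}


record = parse_record(SAMPLE)
print(record)
```

Returning None for a missing tag, rather than raising an error, suits mixed datasets, since a tag such as SchoolCodeName is simply absent from newspaper articles.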
What is the difference between the Text and HiddenText tags?
HiddenText elements are derived from OCR. Text elements are full text received electronically, for example through daily feeds from publishers. Because HiddenText elements are derived from scanning, they may be more likely to include errors.
When full text isn’t available for a particular publication or database, can I still access the other metadata?
Yes. If users pull documents from publications or databases for which their institution does not have full-text access, those documents will not have Text or HiddenText elements, but will contain all other relevant metadata.
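One way to handle all three cases in a script is sketched below: prefer Text, fall back to HiddenText, and treat a document with neither as metadata-only. The function name and the sample records are hypothetical, and the sketch assumes the full text can be read as the element's text content.

```python
import xml.etree.ElementTree as ET


def full_text(root):
    """Return (text, source_tag), or (None, None) for a metadata-only document."""
    # Text is tried first because it was received electronically;
    # HiddenText comes from OCR and may contain more errors.
    for tag in ("Text", "HiddenText"):
        element = root.find(f".//{tag}")
        if element is not None:
            text = "".join(element.itertext()).strip()
            if text:
                return text, tag
    return None, None


# Invented sample documents for illustration only.
electronic = ET.fromstring("<Record><Text>Received from the publisher.</Text></Record>")
scanned = ET.fromstring("<Record><HiddenText>Rec0gnized by OCR.</HiddenText></Record>")
metadata_only = ET.fromstring("<Record><Title>No full-text access</Title></Record>")

for document in (electronic, scanned, metadata_only):
    print(full_text(document))
```

Keeping the source tag alongside the text lets later steps treat OCR-derived text more cautiously, for instance by running extra cleaning on it.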
Sometimes I see several versions of a newspaper with overlapping dates. How can I ensure that no duplicates appear in my dataset?
In the case of newspapers, the historical, recent and contemporary files are created separately from different sources, so the same article in two sources will not share a GOID (the unique ID that identifies an article in the ProQuest database). The only reliable way to de-duplicate content is to compare article titles.
ProQuest recommends that researchers using multiple titles with overlapping coverage create multiple datasets with no overlapping dates. Then the researchers can either combine the datasets or process them sequentially in the Jupyter Notebook.
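Where overlapping dates cannot be avoided, title comparison can be sketched as follows. The dictionary layout, field names, and normalization rules (case, punctuation, whitespace) are assumptions chosen for illustration, not a ProQuest-recommended procedure.

```python
import re


def normalize_title(title):
    """Lowercase, drop punctuation, and collapse whitespace so near-identical titles match."""
    text = title.casefold()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(*datasets):
    """Process datasets in order, keeping the first article seen for each normalized title."""
    seen = set()
    unique = []
    for dataset in datasets:
        for article in dataset:
            key = normalize_title(article["title"])
            if key in seen:
                continue
            seen.add(key)
            unique.append(article)
    return unique


# Invented examples: the same article arrives from two sources with different GOIDs.
historical = [{"goid": "1", "title": "City Council Approves Budget"}]
recent = [
    {"goid": "2", "title": "City council  approves budget."},
    {"goid": "3", "title": "Storm Closes Schools"},
]

merged = deduplicate(historical, recent)
print([article["goid"] for article in merged])
```

Recurring headlines such as "Letters to the Editor" would collapse into a single entry under this key, so it is worth inspecting what gets dropped before discarding it.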
What programming languages are available for use in the Jupyter Notebook?
The R and Python programming languages are available for use in the Jupyter Notebook.
If I close my browser, will my script continue to run in the Jupyter Notebook?
Yes, the process will run on the workbench notebooks for up to 48 hours. ProQuest recommends logging in each day during long processing jobs to ensure that the workbench remains active.
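For jobs that might approach the 48-hour window, one defensive pattern is to save progress as the job runs so that a restarted script can pick up where it stopped. This is a general-purpose sketch, not a TDM Studio feature; the function name and checkpoint file are hypothetical.

```python
import json
import os


def run_with_checkpoint(items, process, checkpoint_path):
    """Process items in order, recording progress after each one.

    Returns the number of items processed in this call. If the checkpoint
    file already exists, items completed in an earlier run are skipped.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)["done"]
    for index in range(done, len(items)):
        process(items[index])
        with open(checkpoint_path, "w") as handle:
            json.dump({"done": index + 1}, handle)
    return len(items) - done
```

A call such as run_with_checkpoint(file_names, analyze, "progress.json"), where analyze is your own function, would resume from the last recorded position after an interruption. Writing the checkpoint after every item is simple but slow for very large datasets; recording progress every few hundred items is a common compromise.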
At this time, ProQuest TDM Studio does not have a summary statistics feature.
To see the total number of documents associated with a publication, users can (1) select the title(s) of interest; and (2) in the Refine Content step, view the total number of documents in parentheses after each title. Since many publications have daily feeds, this number could change every day.
In order to see the total number of documents associated with a database, users can (1) select the database of interest; and (2) in the Refine Content step, refer to the number that appears above the sample documents.
Email contact-cidr@stanford.edu with questions or concerns.