Generating a Dataset Using Corpus-DB - Important Terms Search

IMPORTANT: Follow the steps below if you are going to generate a dataset based on important words and phrases. If you would like to create a dataset based on Author Open Alex IDs, follow the guide Generating a Dataset Using Corpus-DB - Author Open Alex ID Search instead.

Follow these steps to generate a dataset using Corpus-DB:

  1. Go to https://corpus-db.sdsc.edu.
  2. Enter the username and password you use for https://suave-net.sdsc.edu.
  3. To generate a dataset, click on “Search” and then click “Submit”.
    • If you receive an error after doing this, open a new tab and navigate back to https://corpus-db.sdsc.edu. Make sure you enter your correct username and password.
  4. Make sure the email address shown at the top of the page is correct.
  5. Name the project under “Project Name” with a fitting title.
  6. Next, select “Scope” as your search type (this is the default).
  7. (This step is optional): You can exclude authors who have too many coauthors by setting the coauthor threshold. The default threshold is 25.
  8. Fill out the scope of texts you want in your search. Please follow the on-page directions closely, as they are important for a successful search (e.g., include plurals of words).
    • Select “Allow relaxed search” if you want results to include items that contain the scope words in any order. Do not check this if you want items to exactly match the scope specified.
  9. (This step is optional): Under “Exclusions”, specify any words that you do not want to show up in your dataset.
  10. (This step is optional): Enter keywords for tagging authors. An author is tagged with a keyword when it appears in a title or abstract.
    • (This step is optional): You have the option here to exclude pieces that contain none of the keywords.
  11. (This step is optional): Under “Starting Year” and “Ending Year”, enter the year range for the search results.
  12. (This step is optional): Under “Institutions”, enter the institutions you would like to include in the search, separated by commas. Please include all names for each institution so as not to exclude relevant results (e.g., UC San Diego, UCSD, University of California San Diego).
  13. (This step is optional): Under “Collaborating Institutions”, enter the collaborating institutions you would like to include in the search, separated by commas. As with “Institutions”, please include all names for each institution so as not to exclude relevant results (e.g., UC San Diego, UCSD, University of California San Diego).
  14. Click “Submit”.
  15. An email will be sent to the address you confirmed in step 4. Depending on the number of results, this can take seconds, minutes, or, in the worst case, hours.
  16. Follow the link sent to your email (https://corpus-db.sdsc.edu/collect) and enter the code included in the email.
  17. Your dataset will automatically download to your downloads folder.
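
Steps 12 and 13 ask for every name variant of an institution, separated by commas. If you keep those variants in a list, a small helper can build the text to paste into the form. The sketch below is only an illustration: the function name join_institution_names is hypothetical and is not part of Corpus-DB, and it assumes the form accepts a plain comma-separated string as described above.

```python
def join_institution_names(names):
    """Return unique, trimmed names as one comma-separated string.

    Hypothetical helper (not part of Corpus-DB). Duplicates are
    detected case-insensitively; the first spelling seen is kept.
    """
    seen = set()
    unique = []
    for name in names:
        cleaned = " ".join(name.split())  # collapse stray whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return ", ".join(unique)


# Name variants taken from the example in steps 12 and 13,
# plus one accidental duplicate with different casing.
variants = ["UC San Diego", "UCSD", "University of California San Diego", "ucsd "]
institutions_field = join_institution_names(variants)
print(institutions_field)
```

The printed string can be pasted into the “Institutions” or “Collaborating Institutions” field.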

Checkpoint

  At this point, you should have:
  • A dataset of the network files
  • A Netvis .json file
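
To confirm the checkpoint, it can help to check that the downloaded .json file opens cleanly. The sketch below is a minimal check under stated assumptions: the function name list_parseable_json is hypothetical, the folder location (a Downloads folder in your home directory) is assumed, and nothing is assumed about the file's name or internal structure beyond it being valid JSON.

```python
import json
from pathlib import Path


def list_parseable_json(folder):
    """Return the names of .json files in `folder` that parse as valid JSON.

    Hypothetical helper (not part of Corpus-DB). It only checks that each
    file is well-formed JSON; it does not validate the Netvis structure.
    """
    valid = []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not well-formed JSON
        valid.append(path.name)
    return valid


if __name__ == "__main__":
    downloads = Path.home() / "Downloads"  # assumed download location
    if downloads.is_dir():
        print(list_parseable_json(downloads))
```

If the downloaded file does not appear in the printed list, the download may be incomplete, and collecting the dataset again from the emailed link is a reasonable next step.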