UC Davis DataLab
Winter 2021
Instructors: Tyler Shoemaker <tshoemaker@ucdavis.edu> and
Carl Stahmer <cstahmer@ucdavis.edu>
In this three-part workshop series you will learn the basics of text mining with Python. We will focus primarily on unstructured text data, discussing how to format and clean text to enable the discovery of significant patterns in collections of documents. Sessions will introduce participants to core terminology in text mining/natural language processing and will walk through different methods of ranking terms and documents. We will conclude by using these methods to classify texts and to build models of “topics.” Basic familiarity with Python is required. We welcome students, postdocs, faculty, and staff from a variety of research domains, ranging from health informatics to the humanities.
The data used in this workshop can be downloaded from the web at https://datalab.ucdavis.edu/tm_workshop_data.zip. It is also stored on the DataLab GDrive at `gdrive/teaching/Workshop Series - Text Mining and NLP/workshop_getting_started_with_textual_data.zip`.
The course reader is a live webpage, hosted through GitHub, where you can enter curriculum content and post it to a public-facing site for learners.
To make alterations to the reader:
1. Run `git pull`, or if it's your first time contributing, see the Setup section of this document.
2. Edit an existing chapter file or create a new one. Chapter files are Markdown files (`.md`) in the `chapters/` directory. Enter your text, code, and other information directly into the file. Make sure your file:
   - Follows the naming scheme `##_topic-of-chapter.md` (the only exception is `index.md`, which contains the reader's front page). A sample chapter skeleton appears after this list.
   - Begins with a first-level header (like `# This`). This will be the title of your chapter. Subsequent section headers should be second-level headers (like `## This`) or below.

   Put any supporting resources in `data/` or `img/`. For large files, see the Large Files section of this document. You do not need to add resources generated by your code (such as plots); the next step saves these in `_build/` automatically.
3. Run the command `jupyter-book build .` in a shell at the top level of the repo to regenerate the HTML files in `_build/`.
. -
When you're finished,
git add
:- Any files you edited directly
- Any supporting media you added to
img/
- The
.gitattributes
file (if you added a large file)
Then
git commit
andgit push
. This updates themain
branch of the repo, which contains source materials for the web page (but not the web page itself). -
5. Run the command `ghp-import -n -p -f _build/html` in a shell at the top level of the repo to update the `gh-pages` branch of the repo. This uses the `ghp-import` Python package, which you will need to install first (`pip install ghp-import`). The live web page will update automatically after 1-10 minutes.
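As a sketch of step 2, the shell snippet below creates a new chapter file with the required naming scheme and header structure. The file name and chapter topic are hypothetical, chosen only for illustration:

```
# Create a new chapter file (hypothetical name and topic).
# The name follows ##_topic-of-chapter.md; the file begins with a
# first-level header and uses second-level headers for sections.
cat > chapters/05_topic-models.md << 'EOF'
# Topic Models

Introductory prose for the chapter goes here.

## Fitting a Model

Section text, code, and figures go here.
EOF
```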
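Taken together, a typical editing session runs through the five steps in order. This is a condensed sketch; the chapter file and commit message are placeholders:

```
git pull                                    # step 1: sync with the main branch
# ... edit chapters/05_topic-models.md ...  # step 2: make your changes
jupyter-book build .                        # step 3: regenerate HTML in _build/
git add chapters/05_topic-models.md         # step 4: stage your edits
git commit -m "Add topic models chapter"
git push                                    # update main
ghp-import -n -p -f _build/html             # step 5: publish to gh-pages
```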
## Large Files

If you want to include a large file (say over 1 MB), you should use git LFS. You can register a large file with git LFS with the shell command:
```
git lfs track YOUR_FILE
```
This command updates the `.gitattributes` file at the top level of the repo. To make sure the change is saved, you also need to run:
```
git add .gitattributes
```
Now that your large file is registered with git LFS, you can add, commit, and push the file with git the same way you would any other file, and git LFS will automatically intercede as needed.
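For instance, a hypothetical session that adds a large dataset (the file name is made up for illustration) might look like this:

```
git lfs track data/corpus.zip    # register the file with git LFS
git add .gitattributes           # save the updated tracking rules
git add data/corpus.zip          # stage the large file itself
git commit -m "Add corpus data"
git push                         # git LFS uploads the file contents
```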
GitHub provides 1 GB of storage and 1 GB of monthly bandwidth free per repo for large files. If your large file is more than 50 MB, check with the other contributors before adding it.
## Setup

Install jupyter-book using pip:
```
pip install -U jupyter-book
```
This repo uses Git Large File Storage (git LFS) for large files. If you don't have git LFS installed, download it and run the installer. Then in the shell (in any directory), run:
```
git lfs install
```
Then your one-time setup of git LFS is done. Next, clone this repo with `git clone`. The large files will be downloaded automatically with the rest of the repo.
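In sum, a fresh setup might look like the sketch below. The repository URL is a placeholder (following the `YOUR_FILE` convention above), and `ghp-import` is included here because publishing (step 5 above) requires it:

```
pip install -U jupyter-book ghp-import   # build and publishing tools
git lfs install                          # one-time git LFS setup
git clone YOUR_REPO_URL                  # large files download automatically
```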