This repository has been archived by the owner on Apr 28, 2023. It is now read-only.
# Finding Data
Now that we know what data are, how to work with them in RStudio Cloud, and how to get them *into* RStudio Cloud, a natural question follows: if you have a question you want to answer with data, where do you find data to work with? In some cases you'll have to create your own dataset, but in other cases you can find data that others have already generated and start from there! In this lesson, we'll discuss the difference between public and private data and direct you to a number of resources where you can find helpful datasets for data science projects!
### Public versus Private Data
Before discussing where to find data, we need to know the difference between private and public data. **Private data** are datasets to which a limited number of people or groups have access. There are many reasons why a dataset may remain private. If the dataset has personally-identifiable information within it (addresses, phone numbers, etc.), then the dataset may remain private for privacy reasons. Or, if the dataset has been generated by a company, the company may hold onto it to keep an advantage over its competitors. Often, you will not have access to private data (although sometimes you can request access or pay for it). But that's OK because, in general, **public data** are freely-available. Unlike private data generated by companies, data generated by governments are often made public and are available to anyone for use.
### Publicly-available data
As a data scientist, there's a good chance you may work with private company data as part of your job. However, before you have that job, it's great practice to work with datasets that are publicly-available and waiting for you to use them! In this section, we'll direct you to sources of different datasets where you can find a dataset of interest to you and get working with it!
#### Open Datasets
There are a number of companies dedicated to compiling datasets into a central location and making these data easy to access. Two of the most popular are [Kaggle](https://www.kaggle.com/) and [data.world](https://data.world/). On each site, you'll have to register for a free account. After registering you'll have access to *many* different types of datasets! Explore what's available there and then start playing around with a dataset that interests you!
![kaggle and data.world are *great* places to look for datasets](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/export/png?id=1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8&pageid=g3d79cb93b6_0_0)
Publicly-available datasets are also curated at [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets/blob/master/README.rst), so feel free to look around there as well!
#### Government Data
Government data can provide a wealth of information to a data scientist. Government datasets cover topics from education and student loan debt to climate and weather. They include business and finance datasets as well as law and agriculture data.
Here we provide lists of governments' open data just to give you an *idea* of how many datasets are out there. These lists include only a *tiny* portion of the city and federal government data that are available for you to use. So, if there's a place whose data you want to work with, look on Google for "open data" from that place!
##### US Data
If you're interested in working with government data from the United States, [data.gov](https://www.data.gov/) is the place to get datasets that have been released by the United States government. Here you can find hundreds of thousands of datasets covering many topics, so [data.gov datasets](https://www.data.gov/dataset) is a great place to start!
![data.gov has hundreds of thousands of datasets](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/export/png?id=1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8&pageid=g3d79cb93b6_0_93)
##### Census Data
The [US Census Bureau](https://www.census.gov/) is responsible for collecting data about the people and economy of the United States, including a count of the full population every ten years. These [data](https://www.census.gov/data.html) are accessible online *and* they can be worked with in R using the very helpful [`tidycensus`](https://walkerke.github.io/tidycensus/) package!
![The US Census provides data about the US people and economy](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/export/png?id=1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8&pageid=g3d79cb93b6_0_146)
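As a rough sketch of what working with `tidycensus` can look like, the code below requests the total population of each state from the 2010 decennial census. It assumes you have requested a free Census API key and stored it in the `CENSUS_API_KEY` environment variable, and that you have internet access. The variable code `P001001` (total population in the 2010 summary file) is our choice for illustration; other variables work the same way.

```r
library(tidycensus)

# tidycensus looks for a Census API key in the CENSUS_API_KEY
# environment variable; check whether one has been set
have_key <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (have_key) {
  # Request one row per state with its total population in 2010
  state_pop <- get_decennial(
    geography = "state",
    variables = "P001001",
    year = 2010
  )
  print(head(state_pop))
} else {
  message("No Census API key found, so no data were requested.")
}
```

Checking for the key first means the code explains what is missing instead of failing partway through a request.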
##### Open City Data
The US's federal government is of course not the only place to obtain government data. More and more cities across the world are starting to release open data at the city level. A few of these cities and their respective open city data links are provided below:
* [Baltimore, MD (USA)](https://data.baltimorecity.gov/)
* [Cincinnati, OH (USA)](https://data.cincinnati-oh.gov/)
* [Las Vegas, NV (USA)](https://opendata.lasvegasnevada.gov/)
* [New York City, NY (USA)](https://opendata.cityofnewyork.us/)
* [San Francisco, CA (USA)](https://datasf.org/opendata/)
* [Toronto, Ontario (Canada)](https://www.toronto.ca/city-government/data-research-maps/open-data/)
Additionally, to see a summary of what datasets are available from cities across the USA, check out the [US City Open Data Census](http://us-cities.survey.okfn.org/) from the Sunlight Foundation.
![US City Open Data Census](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/export/png?id=1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8&pageid=g3d79cb93b6_0_109)
##### Global Data
In addition to the United States, there are many other countries providing access to open data with more and more providing access and updated datasets each year. These include (but are not limited to!) datasets from many countries within [Africa](http://dataportal.opendataforafrica.org/) and [Latin America](https://opendatabarometer.org/latin-american-open-data-initiative/) as well as [Canada](https://open.canada.ca/en/open-data), [Ireland](https://data.gov.ie/), [Japan](http://www.data.go.jp/?lang=english), [Taiwan](https://data.cdc.gov.tw/en/), and the [UK](https://data.gov.uk/).
Additionally, to see what datasets are available globally, the [Global Open Data Index](https://index.okfn.org/dataset/) is a great place to start!
![Global Open Data Index](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/export/png?id=1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8&pageid=g3d79cb93b6_0_116)
#### APIs
We've mentioned APIs previously, but it's important to include them here as well. APIs provide access to data you're interested in obtaining from websites. There are APIs for *so* many of the websites you access regularly. [Google](http://developers.google.com/apis-explorer/#p/), [Twitter](https://dev.twitter.com/), [Facebook](https://developers.facebook.com/), and [GitHub](https://developer.github.com/v3/?) (among *many* others) all have APIs that you can access to obtain the dataset you're interested in working with!
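To give a sense of what API data look like once they reach R, here is a minimal sketch using the `jsonlite` package. Many APIs return JSON, and `fromJSON()` turns an array of JSON records into a data frame. The JSON text below is made up for illustration and is not a real API response; the field names `name`, `stars`, and `language` are assumptions.

```r
library(jsonlite)

# Illustrative JSON in the shape many web APIs return: an array of records.
# These values are invented for the example, not fetched from a real API.
response_text <- '[
  {"name": "project-one", "stars": 12, "language": "R"},
  {"name": "project-two", "stars": 3,  "language": "Python"}
]'

# An array of JSON objects becomes a data frame with one row per object
repos <- fromJSON(response_text)

# Sort the records so the most-starred project comes first
repos_sorted <- repos[order(-repos$stars), ]
print(repos_sorted)

# With internet access, fromJSON() can also read directly from a URL,
# for example the GitHub API (replace <username> with a real account):
# repos <- fromJSON("https://api.github.com/users/<username>/repos")
```

Real APIs often require authentication or limit how many requests you can make, so check each API's documentation before relying on it.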
#### Company Data
Finally, we mentioned above that companies often keep their data private for a number of reasons, and that's OK! When companies do release their data, the datasets can often be found on websites like [Kaggle](https://www.kaggle.com/) and [data.world](https://data.world/). If there is a company whose data you're interested in, you can search for the company's data on either of these two data repositories or on the company's website directly to see if they provide the data there or if you can scrape their website to obtain the information you need! There may not always be a way to get the exact dataset you're looking for, but you can often find something that will work!
### Data You Already Have
Sometimes, it's not about finding data someone else has already collected on a bunch of individuals in a population. Rather, getting data sometimes just involves taking a look at things you already have but just haven't yet *realized* are data you can analyze.
For example, MP3 files you've bought and have on your computer are data! They can be analyzed using `tuneR` (which can read MP3 and WAV files) and `seewave`. You could use this type of data to categorize the music in your library or to build a model that uses data on songs that were already big hits to determine which qualities of a song predict that it may become one.
Alternatively, you could scrape the websites you frequently visit (using `rvest`!) to answer interesting questions. For example, if you were interested in writing a *really* great title for the newest video of your pet doing something super cute, you might scrape the web for titles of pet videos that have recently gone viral. You could then craft the perfect title to use when you upload your pet video. Granted, this may not be an example answering the most *important* type of data science question; however, writing up how you did this would make a really great blog post, which is something we'll discuss in a lesson a few courses from now!
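The pet video idea can be sketched with `rvest` as follows. To keep the example self-contained, it parses a small piece of made-up HTML rather than a real website; the `video-title` class and the titles themselves are invented for illustration. This sketch assumes `rvest` version 1.0.0 or later, which provides `html_elements()` and `html_text2()`.

```r
library(rvest)

# A tiny, made-up page standing in for a real website's HTML
html_source <- '<html><body>
  <h3 class="video-title">Cat discovers snow for the first time</h3>
  <h3 class="video-title">Puppy tries to catch its own tail</h3>
  <h3 class="video-title">Parrot sings along to the radio</h3>
</body></html>'

# read_html() accepts a URL, a file path, or literal HTML text
page <- read_html(html_source)

# Select every <h3> element with class "video-title" and extract its text
title_nodes <- html_elements(page, "h3.video-title")
titles <- html_text2(title_nodes)
print(titles)

# A first simple summary: how many characters long is each title?
title_lengths <- nchar(titles)
print(title_lengths)
```

For a real site you would pass its URL to `read_html()` and inspect the page to find the right selector. Before scraping, check that the site's terms of use allow it.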
Finally, social networking websites like Facebook and Twitter collect a lot of data about you as an individual. You have access to this information through the websites' APIs, but you can also download the data directly. After news of the [Facebook and Cambridge Analytica data breach](https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election), many articles were published about [how to download your Facebook data](https://www.wired.com/story/download-facebook-data-how-to-read/). These data can be downloaded and then analyzed to look at trends over time. How many pictures have you uploaded and been tagged in over time, and has that changed? What topics do you most frequently discuss in Messenger? Or, maybe you're interested in mapping the places you've been based on where you've checked in. All of these questions can be answered with data that are already there, just waiting for you to work with them!
In all, sometimes getting the data just means recognizing the data you already have at your disposal, figuring out how to get the data into a format you can use, and then working with the data using the tools you have!
### Summary
In this lesson, our goal was to give you an idea of where to *find data* so that you can start working on interesting data science projects. Once you've located an interesting dataset, use the skills learned throughout this course to get the data into R. Then, get wrangling! Before you know it you'll be more than halfway through an interesting data science project. Often finding and wrangling the data take up the most time!
### Slides and Video
* [Automated Videos](https://www.youtube.com/watch?v=-gAROtsR5dY)
* [Slides](https://docs.google.com/presentation/d/1G0lA8z561VirAggV4MxMXu2dwCudolXjeFZWO6P_3F8/edit?usp=sharing)