Discussion about NYC Taxi Notebook #1

Open
mhmerrill opened this issue Sep 25, 2020 · 8 comments

Comments

@mhmerrill
Contributor

mhmerrill commented Sep 25, 2020

This issue encapsulates discussion about the NYC Taxi data set example using Arkouda.
Notebook here

  • ideas of what to compute from the taxi data set
    • infer/recover probabilistic taxi entities (Kalman filter?)
    • PageRank on location ID graph
    • estimate number of taxis in-flight or waiting using different data fields
    • estimate paths of taxis using join-with-delta-time operation
    • other suggestions or crazier things?
  • examples of using Arkouda operations
  • define helper functions to interoperate with NumPy or pandas
  • other suggestions
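
To make the PageRank idea above concrete, here is a minimal sketch in plain Python (not Arkouda). It assumes each trip reduces to a (PULocationID, DOLocationID) pair; the `pagerank` helper, the damping value, and the sample pairs are all made up for illustration and are not taken from the notebook.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (pickup_zone, dropoff_zone) pairs.

    Repeated pairs act as edge weights, so a busy route passes on more rank.
    """
    weight = defaultdict(float)      # number of trips per (src, dst) pair
    out_weight = defaultdict(float)  # total trips leaving each zone
    nodes = set()
    for src, dst in edges:
        weight[(src, dst)] += 1.0
        out_weight[src] += 1.0
        nodes.update((src, dst))

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by zones with no outgoing trips is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), w in weight.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


# Hypothetical (PULocationID, DOLocationID) pairs, not real trip records.
trips = [(1, 2), (1, 2), (2, 3), (3, 1), (4, 1)]
ranks = pagerank(trips)
```

On the real data the edge list would come from the pickup and drop-off zone columns, and the ranks should sum to 1 so zones can be compared directly.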

Yellow Trips Data Dictionary

NYC Yellow Taxi Trip Records Jan 2020

NYC Taxi Zone Lookup Table

@mhmerrill
Contributor Author

@bradcray @buddha314 @reuster986 @timothyneumann1 @ben-albrecht @jt-halbert
I would love for you guys to chime in with anything. Is there anyone else I should tag?

@bradcray

I must not be a data scientist, because my head goes to things like "compute mean, median (requires sorting, right?), mode travel times" which seem trivial compared to some of your suggestions. On the other end of the spectrum, my head goes to "Figure out who owns all the taxi medallions and how much they paid for them", though I suspect that's not a task for this dataset. :D

I haven't really taken the time to look through what's in the data sets yet, though. Will try to do that tomorrow, I'm being called to dinner ATM.
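
For reference, the mean/median/mode travel times mentioned above could be sketched like this with only the Python standard library. It assumes the pickup and drop-off timestamps (tpep_pickup_datetime and tpep_dropoff_datetime) arrive as "YYYY-MM-DD HH:MM:SS" strings; the `travel_minutes` helper and the sample rows are hypothetical.

```python
import statistics
from datetime import datetime


def travel_minutes(pickup, dropoff, fmt="%Y-%m-%d %H:%M:%S"):
    """Trip duration in minutes from two timestamp strings."""
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 60.0


# Hypothetical (pickup, dropoff) rows, not real trip records.
rows = [
    ("2020-01-01 00:05:00", "2020-01-01 00:15:00"),
    ("2020-01-01 08:30:00", "2020-01-01 08:42:00"),
    ("2020-01-01 09:00:00", "2020-01-01 09:10:00"),
    ("2020-01-01 17:45:00", "2020-01-01 18:25:00"),
]
durations = [travel_minutes(pickup, dropoff) for pickup, dropoff in rows]

mean_minutes = statistics.mean(durations)
median_minutes = statistics.median(durations)
# Round to whole minutes first, otherwise nearly every duration is unique.
mode_minutes = statistics.mode([round(m) for m in durations])
```

On the sorting question: sorting is the simplest route to a median, but a selection algorithm such as quickselect can find it without fully ordering the data.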

@mhmerrill
Contributor Author

Here are a couple of papers and articles about analysis of the NYC Taxi data.

Anonymizing NYC Taxi Data: Does It Matter?

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

@ben-albrecht

Do these datasets contain the same fields as the data from the NYC taxi Kaggle competition?

We might find some interesting ideas in those notebooks.

@mhmerrill
Contributor Author

I think it is the same data. Thanks for the links.

@mhmerrill
Contributor Author

mhmerrill commented Sep 25, 2020

20200925: I updated the notebook a bit and uploaded HTML and PDF versions of the notebook with output.

@mhmerrill
Contributor Author

@bradcray @buddha314 @reuster986 @timothyneumann1 @ben-albrecht @jt-halbert
I would love for you guys to chime in with anything. Is there anyone else I should tag?

@hokiegeek2 I forgot to include you on this.

@bradcray

I like the idea of looking at the notebooks BenA pointed to. Left to my own devices, and looking a bit at the fields that are available, I wondered whether there were correlations that could be drawn about tip amount as a percentage of fare based on length of ride or where the ride originated or time of day. Something that would try to draw some conclusion based on different axes like that. But I don't feel like I'm enough of a data scientist to know whether that's trivial or difficult or interesting. (example hypotheses: tips are more generous as a percentage of total fare for shorter rides and ones originating in Manhattan).
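
One way to start on the tip question is to bucket trips by distance and average the tip as a percentage of fare within each bucket. The sketch below is plain Python rather than Arkouda; it assumes each trip is a dict with the trip_distance, fare_amount, and tip_amount fields, and the `tip_percent_by_bucket` helper, the bucket edges, and the sample rows are all made up for illustration.

```python
from collections import defaultdict


def tip_percent_by_bucket(trips, bucket_edges=(1.0, 3.0, 10.0)):
    """Mean tip as a percentage of fare, grouped by trip_distance bucket.

    Bucket i holds distances below bucket_edges[i]; the last bucket holds
    everything at or above the final edge.
    """
    def bucket(distance):
        for i, edge in enumerate(bucket_edges):
            if distance < edge:
                return i
        return len(bucket_edges)

    totals = defaultdict(lambda: [0.0, 0])  # bucket -> [sum of pct, count]
    for trip in trips:
        fare = trip["fare_amount"]
        if fare <= 0:  # skip refunds and malformed rows
            continue
        entry = totals[bucket(trip["trip_distance"])]
        entry[0] += 100.0 * trip["tip_amount"] / fare
        entry[1] += 1
    return {b: total / count for b, (total, count) in sorted(totals.items())}


# Hypothetical trips, not real records.
trips = [
    {"trip_distance": 0.8, "fare_amount": 5.0, "tip_amount": 1.5},
    {"trip_distance": 0.5, "fare_amount": 4.0, "tip_amount": 1.0},
    {"trip_distance": 2.5, "fare_amount": 10.0, "tip_amount": 2.0},
    {"trip_distance": 12.0, "fare_amount": 40.0, "tip_amount": 6.0},
    {"trip_distance": 1.2, "fare_amount": -3.0, "tip_amount": 0.0},
]
tip_pct = tip_percent_by_bucket(trips)
```

The same grouping could be keyed on pickup zone or pickup hour instead of distance. One caveat, as I read the data dictionary: tip_amount only covers credit-card tips, so cash trips would probably need to be filtered out first.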
