Discussion about NYC Taxi Notebook #1

mhmerrill · 2020-09-25T00:08:49Z

This issue encapsulates discussion about the NYC Taxi data set example using Arkouda.
Notebook here

- ideas of what to compute from the taxi data set
  - infer/recover probabilistic taxi entities (Kalman filter?)
  - PageRank on the location ID graph
  - estimate number of taxis in-flight or waiting using different data fields
  - estimate paths of taxis using a join-with-delta-time operation
  - other suggestions or crazier things?
- examples of using Arkouda operations
- define helper functions to interoperate with NumPy or Pandas
- other suggestions

NYC Yellow Taxi Trip Records Jan 2020

mhmerrill · 2020-09-25T00:13:45Z

@bradcray @buddha314 @reuster986 @timothyneumann1 @ben-albrecht @jt-halbert
I would love for you guys to chime in with anything. Is there anyone else I should tag?

bradcray · 2020-09-25T01:15:24Z

I must not be a data scientist, because my head goes to things like "compute mean, median (requires sorting, right?), mode travel times" which seem trivial compared to some of your suggestions. On the other end of the spectrum, my head goes to "Figure out who owns all the taxi medallions and how much they paid for them", though I suspect that's not a task for this dataset. :D

I haven't really taken the time to look through what's in the data sets yet, though. Will try to do that tomorrow, I'm being called to dinner ATM.
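
For reference, the mean/median/mode of travel times mentioned above could be sketched as follows in plain Python (not Arkouda). The timestamp format and the sample pickup/dropoff pairs are assumptions made for illustration, and `trip_minutes` is a hypothetical helper name. On the sorting question: `statistics.median` does sort its input, although a selection algorithm can find a median without a full sort.

```python
from datetime import datetime
from statistics import mean, median, multimode


def trip_minutes(pickup, dropoff, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the trip duration in minutes from two timestamp strings."""
    delta = datetime.strptime(dropoff, fmt) - datetime.strptime(pickup, fmt)
    return delta.total_seconds() / 60.0


# Hypothetical (pickup, dropoff) timestamp pairs.
trips = [
    ("2020-01-01 00:10:00", "2020-01-01 00:20:00"),
    ("2020-01-01 00:15:00", "2020-01-01 00:40:00"),
    ("2020-01-01 01:00:00", "2020-01-01 01:10:00"),
    ("2020-01-01 02:00:00", "2020-01-01 02:07:00"),
]

durations = [trip_minutes(pickup, dropoff) for pickup, dropoff in trips]

# The mode is only meaningful on discretized values, so round to whole minutes.
whole_minutes = [round(minutes) for minutes in durations]

summary = {
    "mean": mean(durations),
    "median": median(durations),
    "mode": multimode(whole_minutes),  # list, since there may be ties
}
print(summary)
```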

mhmerrill · 2020-09-25T13:54:09Z

Here are a couple of papers and articles about analysis of the NYC Taxi data.

Anonymizing NYC Taxi Data: Does It Matter?

Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

ben-albrecht · 2020-09-25T13:57:51Z

Do these datasets contain the same fields as the data from the NYC taxi kaggle competition?

We might find some interesting ideas in those notebooks.

mhmerrill · 2020-09-25T14:03:26Z

I think it is the same data. Thanks for the links.

mhmerrill · 2020-09-25T16:06:25Z

20200925: I updated the notebook a bit and uploaded HTML and PDF versions of the notebook with output.

mhmerrill · 2020-09-25T16:09:30Z

@bradcray @buddha314 @reuster986 @timothyneumann1 @ben-albrecht @jt-halbert
I would love for you guys to chime in with anything. Is there anyone else I should tag?

@hokiegeek2 I forgot to include you on this.

bradcray · 2020-09-25T19:44:59Z

I like the idea of looking at the notebooks BenA pointed to. Left to my own devices, and looking a bit at the fields that are available, I wondered whether there were correlations that could be drawn about tip amount as a percentage of fare based on length of ride or where the ride originated or time of day. Something that would try to draw some conclusion based on different axes like that. But I don't feel like I'm enough of a data scientist to know whether that's trivial or difficult or interesting. (example hypotheses: tips are more generous as a percentage of total fare for shorter rides and ones originating in Manhattan).
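
One way to start probing the first hypothesis (tip percentage versus ride length) is sketched below in plain Python (not Arkouda). It assumes each trip provides a distance in miles, a fare amount, and a tip amount; the sample values, the bucket edges, and the function names are hypothetical. It only reports a per-bucket mean, which is a first look rather than a test of correlation.

```python
from collections import defaultdict


def distance_bucket(miles, edges=(1.0, 3.0, 10.0)):
    """Label a trip by the first distance edge it falls under."""
    for edge in edges:
        if miles < edge:
            return f"<{edge:g} mi"
    return f">={edges[-1]:g} mi"


def mean_tip_pct_by_bucket(trips):
    """Average tip as a percentage of fare, grouped by distance bucket."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for distance, fare, tip in trips:
        if fare <= 0:
            continue  # skip trips where a percentage is undefined
        bucket = distance_bucket(distance)
        sums[bucket] += 100.0 * tip / fare
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}


# Hypothetical (distance in miles, fare amount, tip amount) rows.
trips = [
    (0.8, 5.0, 1.5),
    (0.5, 4.0, 1.0),
    (2.5, 10.0, 2.0),
    (12.0, 40.0, 6.0),
    (1.2, 0.0, 0.0),
]

result = mean_tip_pct_by_bucket(trips)
for bucket, pct in result.items():
    print(bucket, round(pct, 1))
```

The same grouping could be keyed on pickup location or hour of day to look at the other axes mentioned.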
