Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dashboard Survey Metrics #966

Merged
merged 7 commits into from
May 29, 2024
Merged

Conversation

jiji14
Copy link
Contributor

@jiji14 jiji14 commented May 5, 2024

Related PR:

e-mission/e-mission-phone#1142

Overview

Get survey data for phone app dashboard - leaderboard, comparison, trip

Related UI

Screenshot 2024-04-24 at 6 02 00 PM
Screenshot 2024-04-24 at 6 02 17 PM
Screenshot 2024-04-24 at 6 02 09 PM

@jiji14
Copy link
Contributor Author

jiji14 commented May 5, 2024

@JGreenlee
I designed the data structure based on your e-mission-common codes and the steps for processing survey data.


[ Processing survey data steps ]

  1. Collect all trip data for a certain study (e.g., dfc-fermata).
  2. Group by userId and separate my survey from others'.
  3. My survey:
    1. Group it by trip label (e.g., 'ev_roaming_trip', 'ev_return_trip', 'gas_car_trip').
    2. Count 'answered', 'unanswered', and 'mismatched' for each group.
  4. Others' survey:
    1. Count 'answered', 'unanswered', and 'mismatched' for each user.
    2. Sort others' survey in survey response order.

I don't know how the survey data will be stored in the database, so I haven't figured out how to code this step. Can you please guide me? I think I can code this shortly once I understand how the survey data will be stored.

Thanks!

@JGreenlee
Copy link
Contributor

The current structure of the server response for aggregate metrics is as follows:

{
  aggregate_metrics: [
    [
      {
        fmt_time: "2024-04-29T00:00:00+00:00",
        label_hybrid_drove_alone: 5004,
        label_not_a_trip: 1451,
        label_unlabeled: 8347,
        local_dt: {minute: 0, hour: 0, second: 0, day: 29, weekday: 0, …},
        nUsers: 4,
        ts: 1714348800
      },
      ...
    ],
    ...
  ]
}

So aggregate metrics is a 2d array. the outer elements correspond to duration, mean_speed, count, and distance in that order.
The inner elements represent days of data for that metric. The actual values are separated by mode under keys prefixed with label_

I propose this structure:

{
  duration: {
    [
      {
        date: "2024-04-29",
        mode_hybrid_drove_alone: 5004,
        mode_not_a_trip: 1451,
        mode_unlabeled: 8347,
        num_users: 4
      },
      ...
    ],
    ...
  }
}
  • The metrics are keyed by name ("duration"), rather than a specific position in an array
  • Instead of using timestamps to represent whole days, use YYYY-MM-DD strings
  • Instead of label_*, use mode_*. Then maybe in the future we can implement purpose_*.

@JGreenlee
Copy link
Contributor

Then we would be able to use a similar structure when we add the surveys section:

{
  ...
  surveys: {
    [
      {
        date: "2024-04-29",
        survey_DfcGasTrip: { answered: 3, unanswered: 5, mismatched: 0 },
        survey_DfcEvRoamingTrip: { answered: 4, unanswered: 2, mismatched: 0 },
        num_users: 4 
      },
      ...
    ]
  }
}

@shankari
Copy link
Contributor

I think that this restructuring looks good.

@shankari
Copy link
Contributor

shankari commented May 14, 2024

@JGreenlee Couple of high-level comments here:

  • At a high level, these metrics are similar to the leaderboard - they are precomputing/memoizing the aggregate metrics for improved performance during access time. So they tradeoff storage for latency and may be doing extra work if people are not accessing the dashboard
  • I had planned on precomputing these values, and I even added some plumbing to make that happen. Please see:
    ### ** BEGIN: metric outputs. These are not currently stored
    • I don't recollect why I did not precompute them; we may need to take a look at history and blame to ensure that we take those "gotchas" into account.
    • One thought that comes to mind immediately is the aggregate chunking and recomputation cadence.
    • If we precompute and store in week-long "chunks", we can pull one set of aggregate data in the dashboard, but then it is not clear how we can support different historical time ranges. Maybe we should precompute in day chunks instead and further combine on the phone?
    • people can go back and label trips at any time; so unlike the classic pipeline, we may have to recompute any object at any time. We should figure out how to handle this (I can think of a couple of options).
    • We should handle the migration gracefully - we will not have these entries for existing programs, so we will need to write a script to fill them in. And we can't migrate the phone until the script has run for all existing programs.

@JGreenlee
Copy link
Contributor

JGreenlee commented May 14, 2024

  • If we precompute and store in week-long "chunks", we can pull one set of aggregate data in the dashboard, but then it is not clear how we can support different historical time ranges. Maybe we should precompute in day chunks instead and further combine on the phone?

Yes, I think it would have to be day-by-day as long as we still allow people to choose specific days with the datepicker, and keep the graphs which show distance, duration, and count by day.
Currently the UI only supports day-by-day requests anyway. If we want to support chunking by month or week, we could still store in day-by-day chunks, and then combine on the server before sending a response (to save bandwidth).

  • So they tradeoff storage for latency and may be doing extra work if people are not accessing the dashboard
  • We should handle the migration gracefully - we will not have these entries for existing programs, so we will need to write a script to fill them in. And we can't migrate the phone until the script has run for all existing programs.

I think that a memoization/caching pattern would work well here (as opposed to precomputing everything). Every request is a range of days. For each day, we'll check if that day is in the cache. If not we'll compute it on the fly and cache it. The cache expires after some time (maybe 24 hours).

The advantages of this:

  • If a day is not requested it will not be computed
  • We will not be re-computing metrics for the same day within a 24-hour period
  • Migration will be easy because the computation happens gradually rather than all at once

Disadvantages:

  • Doesn't lend itself well to the public dashboard

@JGreenlee
Copy link
Contributor

  • User A requests agg metrics for 4/24 – 5/7.
  • There is no existing cache
  • Compute 4/24 – 5/7 on the fly and send back to User A
  • User B requests agg metrics for 4/17 – 4/30
  • We have cache for 4/24 – 4/30 but not 4/17 – 4/23.
  • Compute 4/17 – 4/23 on the fly, combine with cached 4/24 – 4/30, send back to User B

The next day,

  • User C requests agg metrics for 4/25 – 5/8
  • The cache from yesterday has expired, so all days are recomputed, taking into account any new labels

@shankari
Copy link
Contributor

At the meeting last week, we decided to continue computing on the fly initially and defer the question about how to cache until later as a performance optimization. To support backwards compatibility, one option is to use the new structure for the fleet version only.

@JGreenlee had planned to use the local date as part of the change to the structure, so the backwards compat could be "new structure for local date, old structure for timestamp".

@JGreenlee I remembered one issue with using the queries by local date - querying by a single day at a time would be fine, but querying by a range of dates, can run into issues. For example, if we query for "2024-04-28" to "2024-05-03", then the local date query won't return anything because 28 > trip.start_local_dt > 3 will not return any trips.

@JGreenlee
Copy link
Contributor

We met today to divide the tasks. I will be making the server changes and Jiji will be tweaking the UI to accommodate the new structure. If she finishes that before I am done with server changes, she will work on some admin dashboard tasks (e-mission/e-mission-docs#1066)

@JGreenlee I remembered one issue with using the queries by local date - querying by a single day at a time would be fine, but querying by a range of dates, can run into issues. For example, if we query for "2024-04-28" to "2024-05-03", then the local date query won't return anything because 28 > trip.start_local_dt > 3 will not return any trips.

I will have to investigate this. It may explain some of the weirdness I found. I tested the existing metrics endpoint with local_date by doing this:

window.cordova.plugins.BEMServerComm.pushGetJSON('/result/metrics/local_date', (m) => Object.assign(m, {
  aggregate: true,
  end_time: {year: 2024, month: 4, day: 24},
  freq: 'DAILY',
  is_return_aggregate: true,
  metric_list: ["duration", "count", "distance"],
  start_time: {year: 2024, month: 5, day: 24},
}), (...p) => console.log(...p), (...p) => console.error(...p));

And got back an interesting result. I expected every day from Apr 24 to May 24. But it seems like it gave data for Jan 24, Feb 24, Mar 24, and Apr 24

{
  "aggregate_metrics": [
    [
      {
        "ts": 1706079600,
        "label_unlabeled": 25550.522970199585,
        "fmt_time": "2024-01-24",
        "nUsers": 7,
        "label_walk": 4241.722962617874,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 1,
          "day": 24
        }
      },
      {
        "nUsers": 6,
        "label_hybrid_drove_alone": 10569.768949985504,
        "label_walk": 2147.520366668701,
        "ts": 1708758000,
        "label_unlabeled": 26029.928629398346,
        "fmt_time": "2024-02-24",
        "label_hybrid_shared_ride": 2881.8205523490906,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 2,
          "day": 24
        }
      },
      {
        "nUsers": 4,
        "label_walk": 1730.4454126358032,
        "label_bike": 4934.96901011467,
        "ts": 1711252800,
        "label_unlabeled": 10737.807738542557,
        "fmt_time": "2024-03-24",
        "label_e_car_shared_ride": 3664.9217643737793,
        "local_dt": {
          "timezone": "America/New_York",
          "year": 2024,
          "month": 3,
          "day": 24
        }
      },
      {
        "nUsers": 5,
        "label_e-bike": 4831.289476633072,
        "label_hybrid_drove_alone": 10362.993374109268,
        "label_walk": 1366.0865366458893,
        "ts": 1713938400,
        "label_not_a_trip": 3969.932086467743,
        "label_unlabeled": 5386.386969804764,
        "fmt_time": "2024-04-24",
        "label_e_car_shared_ride": 2676.1139929294586,
        "local_dt": {
          "timezone": "America/Denver",
          "year": 2024,
          "month": 4,
          "day": 24
        }
      }
    ],
    ...
}

@shankari
Copy link
Contributor

shankari commented May 15, 2024

That is why we don't use the local date more often. It works as a filter (if you want to see the 24th of each month), or only trips on Tuesdays, but is not great for a range.

see: https://github.com/e-mission/e-mission-server/blob/master/emission/storage/timeseries/tcquery.py
or

def get_range_query(field_name, start_local_dt, end_local_dt):

@JGreenlee
Copy link
Contributor

I got it to work with ranges with these changes: #968

@shankari
Copy link
Contributor

Changes merged, with some comments on requesting additional comments and restoring filter functionality

@JGreenlee
Copy link
Contributor

I implemented the above structure with a few modifications to make it more flexible.

I realized that we basically have 4 metrics we want right now: distance, duration, count, and response_count.
And we want to be able to calculate these metrics based on either 1) the 'mode' label, 2) the BLE sensed mode, or 3) the survey that was prompted. (Maybe in the future, by 'purpose' label, 'replaced_mode' label, or inferred mode)

So I think that when we query for a list of metrics, we should pass the metric name and the 'grouping fields' instead of just a single-dimensional list.
Instead of just asking for "distance", we'll ask for "distance grouped by mode_confirm" or "distance grouped by primary_ble_sensed_mode". Or even "distance grouped by mode_confirm, purpose_confirm, and primary_ble_sensed_mode".

Concretely, this:

{
    "distance": [ 'mode_confirm', 'purpose_confirm', 'primary_ble_sensed_mode' ],
    "duration": [ 'mode_confirm', 'purpose_confirm', 'primary_ble_sensed_mode' ],
    "count": [ 'mode_confirm', 'purpose_confirm', 'primary_ble_sensed_mode' ],
    "response_count": [ 'survey' ]
}

which will get a response like this:

{
    "distance": [
        {
            "date": "2024-04-20",
            "mode_confirm_UNLABELED": 157413.07847061226,
            "purpose_confirm_UNLABELED": 157413.07847061226,
            "primary_ble_sensed_mode_UNKNOWN": 157413.07847061226
        },
        {
            "date": "2024-04-21",
            "mode_confirm_UNLABELED": 138402.30063849426,
            "purpose_confirm_UNLABELED": 138402.30063849426,
            "primary_ble_sensed_mode_UNKNOWN": 111513.57056705082,
            "primary_ble_sensed_mode_CAR": 26888.730071443442
        },
        {
            "date": "2024-04-22",
            "mode_confirm_UNLABELED": 62882.08469231638,
            "purpose_confirm_UNLABELED": 62882.08469231638,
            "primary_ble_sensed_mode_UNKNOWN": 62882.08469231638
        }
    ],
    "duration": [
        {
            "date": "2024-04-20",
            "mode_confirm_UNLABELED": 85459.56064605713,
            "purpose_confirm_UNLABELED": 85459.56064605713,
            "primary_ble_sensed_mode_UNKNOWN": 85459.56064605713
        },
        {
            "date": "2024-04-21",
            "mode_confirm_UNLABELED": 208896.70914292336,
            "purpose_confirm_UNLABELED": 208896.70914292336,
            "primary_ble_sensed_mode_UNKNOWN": 207448.5625026226,
            "primary_ble_sensed_mode_CAR": 1448.1466403007507
        },
        {
            "date": "2024-04-22",
            "mode_confirm_UNLABELED": 8125.852761030197,
            "purpose_confirm_UNLABELED": 8125.852761030197,
            "primary_ble_sensed_mode_UNKNOWN": 8125.852761030197
        }
    ],
    "count": [
        {
            "date": "2024-04-20",
            "mode_confirm_UNLABELED": 9,
            "purpose_confirm_UNLABELED": 9,
            "primary_ble_sensed_mode_UNKNOWN": 9
        },
        {
            "date": "2024-04-21",
            "mode_confirm_UNLABELED": 7,
            "purpose_confirm_UNLABELED": 7,
            "primary_ble_sensed_mode_UNKNOWN": 6,
            "primary_ble_sensed_mode_CAR": 1
        },
        {
            "date": "2024-04-22",
            "mode_confirm_UNLABELED": 10,
            "purpose_confirm_UNLABELED": 10,
            "primary_ble_sensed_mode_UNKNOWN": 10
        }
    ],
    "response_count": [
        {
            "date": "2024-04-20"
        },
        {
            "date": "2024-04-21",
            "survey_DfcGasTrip": {
                "not_responded": 1
            }
        },
        {
            "date": "2024-04-22"
        }
    ]
}

This would be very flexible, allowing us to easily pick and choose what dimensions of the data to dive into.

JGreenlee added a commit to JGreenlee/e-mission-common that referenced this pull request May 20, 2024
The list of metrics input to generate_summaries is now a dict of {metric name : [ list of fields to group by] }.
e-mission/e-mission-server#966 (comment)

Add 'response_count' metric which counts the occurences of 'responded' and 'not_responded', as opposed to the existing metrics which just sum up number values.

app_config and trip_labels as globals to avoid argument/parameter drilling

Add comments and rename variables throughout improve comprehension
JGreenlee added a commit to jiji14/e-mission-phone that referenced this pull request May 20, 2024
Instead of querying by timestamp we can now query by "YYYY-MM-DD" strings.
The yyyy_mm_dd endpoint also has a different format for metric_list: instead of just an array of metrics, it is a mapping of "metric names" to "grouping fields" (e-mission/e-mission-server#966 (comment))
We will specify this in the config as "metrics_list".

The yyyy_mm_dd endpoint will also ask for app_config, so we will pass that through to `fetchMetricsFromServer` and include it in the `query` (note that `survey_info` is the only field that needs to be included)

We will be supporting another metric called "response_count"; added a card to the 'summary' section (if it is configured to show)

Updated types in appConfigTypes.ts
@shankari
Copy link
Contributor

So I think that when we query for a list of metrics, we should pass the metric name and the 'grouping fields' instead of just a single-dimensional list.

If you recall our discussion about the public dashboard, the structure is very similar. Basically, the generic outline is:

filtered_df = df.query(....)
to_plot = filtered_df.groupby(...).agg(...)
# group_by the field (e.g. mode_confirm, purpose_confirm, primary_mode...)
# agg the grouped values (e.g. 'sum', 'count')

e.g.

The filtering is not explicit in your example above, but it is implicit in the date range, which is the only kind of filtering that we currently support for this query.
e-mission/em-public-dashboard#86 (comment)

This is good because it gives us some validation that this structure is consistent with what makes sense for our data model.
It also gives us some thoughts on potentially unifying this query structure, potentially as part of a future restructuring.

@JGreenlee
Copy link
Contributor

@jiji14 Would you mind giving me write access to your fork of e-mission-server so I can update this PR with my changes? If not I will create my own

@JGreenlee
Copy link
Contributor

In the meantime, these are the changes:

support 'yyyy_mm_dd' time type in 'result/metrics' endpoint

The updated in-app dashboard will be using this new time type via /result/metrics/yyyy_mm_dd.
When using this time type, e-mission-common is used for the metrics calculations, which has a new structure (#966 (comment))
start_time and end_time are given as ISO dates (YYYY-MM-DD). app_config also needs to be supplied.

diff --git a/emission/net/api/cfc_webapp.py b/emission/net/api/cfc_webapp.py
index e585d6a2..a5e0d1dd 100644
--- a/emission/net/api/cfc_webapp.py
+++ b/emission/net/api/cfc_webapp.py
@@ -328,14 +328,18 @@ def summarize_metrics(time_type):
     else:
         old_style = True
         is_return_aggregate = True
+
+    app_config = request.json['app_config'] if 'app_config' in request.json else None
+
     time_type_map = {
-        'timestamp': metrics.summarize_by_timestamp,
-        'local_date': metrics.summarize_by_local_date
+        'timestamp': metrics.summarize_by_timestamp, # used by old UI
+        'local_date': metrics.summarize_by_local_date,
+        'yyyy_mm_dd': metrics.summarize_by_yyyy_mm_dd # used by new UI
     }
     metric_fn = time_type_map[time_type]
     ret_val = metric_fn(user_uuid,
               start_time, end_time,
-              freq_name, metric_list, is_return_aggregate)
+              freq_name, metric_list, is_return_aggregate, app_config)
     if old_style:
         logging.debug("old_style metrics found, returning array of entries instead of array of arrays")
         assert(len(metric_list) == 1)
diff --git a/emission/net/api/metrics.py b/emission/net/api/metrics.py
index 9bd5ffda..5385f1ad 100644
--- a/emission/net/api/metrics.py
+++ b/emission/net/api/metrics.py
@@ -10,16 +10,31 @@ import logging
 
 import emission.analysis.result.metrics.time_grouping as earmt
 import emission.analysis.result.metrics.simple_metrics as earms
+import emission.storage.decorations.analysis_timeseries_queries as esda
+import emission.storage.decorations.local_date_queries as esdl
+import emission.storage.timeseries.tcquery as esttc
 
-def summarize_by_timestamp(user_id, start_ts, end_ts, freq, metric_list, include_aggregate):
+import emcommon.metrics.metrics_summaries as emcms
+
+def summarize_by_timestamp(user_id, start_ts, end_ts, freq, metric_list, include_aggregate, app_config):
     return _call_group_fn(earmt.group_by_timestamp, user_id, start_ts, end_ts,
                           freq, metric_list, include_aggregate)
 
-def summarize_by_local_date(user_id, start_ld, end_ld, freq_name, metric_list, include_aggregate):
+def summarize_by_local_date(user_id, start_ld, end_ld, freq_name, metric_list, include_aggregate, app_config):
     local_freq = earmt.LocalFreq[freq_name]
     return _call_group_fn(earmt.group_by_local_date, user_id, start_ld, end_ld,
                           local_freq, metric_list, include_aggregate)
 
+def summarize_by_yyyy_mm_dd(user_id, start_ymd, end_ymd, freq, metric_list, include_agg, app_config):
+    time_query = esttc.TimeComponentQuery(
+        "data.start_local_dt",
+        esdl.yyyy_mm_dd_to_local_date(start_ymd),
+        esdl.yyyy_mm_dd_to_local_date(end_ymd)
+    )
+    trips = esda.get_entries(esda.COMPOSITE_TRIP_KEY, None, time_query)
+    print('found ' + str([e for e in trips]))
+    return emcms.generate_summaries(metric_list, trips, app_config)
+
 def _call_group_fn(group_fn, user_id, start_time, end_time, freq, metric_list, include_aggregate):
     summary_fn_list = [earms.get_summary_fn(metric_name)
                             for metric_name in metric_list]
diff --git a/emission/storage/decorations/local_date_queries.py b/emission/storage/decorations/local_date_queries.py
index 8a0e2d14..d6aca9fb 100644
--- a/emission/storage/decorations/local_date_queries.py
+++ b/emission/storage/decorations/local_date_queries.py
@@ -48,3 +48,15 @@ def get_comparison_query(field_prefix, base_ld, limit_ld, units, gt_or_lt):
         return { "$or": or_conditions }
     else:
         return {}
+
+def yyyy_mm_dd_to_local_date(ymd: str) -> ecwl.LocalDate:
+    return ecwl.LocalDate({
+        'year': int(ymd[0:4]),
+        'month': int(ymd[5:7]),
+        'day': int(ymd[8:10])
+    })
+
+def get_yyyy_mm_dd_range_query(field_name, start_ymd: str, end_ymd: str) -> dict:
+    start_local_date = ymd_to_local_date(start_ymd)
+    end_local_date = ymd_to_local_date(end_ymd)
+    return get_range_query(field_name, start_local_date, end_local_date)
diff --git a/setup/environment36.yml b/setup/environment36.yml
index d0223175..49dc9c1f 100644
--- a/setup/environment36.yml
+++ b/setup/environment36.yml
@@ -19,7 +19,7 @@ dependencies:
 - scipy=1.10.0
 - utm=0.7.0
 - pip:
-  - git+https://github.com/JGreenlee/e-mission-common@0.4.3
+  - git+https://github.com/JGreenlee/e-mission-common@0.5.0
   - pyfcm==1.5.4
   - pygeocoder==1.2.5
   - pymongo==4.3.3

@jiji14
Copy link
Contributor Author

jiji14 commented May 21, 2024

@jiji14 Would you mind giving me write access to your fork of e-mission-server so I can update this PR with my changes? If not I will create my own

Added you as a collaborator! Let me know if you can't still push the code

JGreenlee added 3 commits May 21, 2024 16:49
The updated in-app dashboard will be using this new time type via `/result/metrics/yyyy_mm_dd`.
When using this time type, `e-mission-common` is used for the metrics calculations, which has a new structure (e-mission#966 (comment))
`start_time` and `end_time` are given as ISO dates
Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than that, I don't see anything much in here. Briefly reviewing the common code as well since that is where the secret sauce is.

Comment on lines +331 to +333

app_config = request.json['app_config'] if 'app_config' in request.json else None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, so you are assuming that the client passes in the app config? That sounds fairly bonkers, particularly since the app config is fairly large. We read the app_config directly from the study_config in multiple locations, including in cfc_webapp - why would we not just read in the app_config where it is used (presumably in api/metrics.py) instead of passing in around?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, so you are assuming that the client passes in the app config? That sounds fairly bonkers, particularly since the app config is fairly large.

I agree it's not the best solution. to mitigate the size issue I have the phone sending a partial app_config with only survey_info, since that is what is really needed for the metrics

why would we not just read in the app_config where it is used (presumably in api/metrics.py) instead of passing in around?

If we did this, then every time we query the server, the server has to query github and wait for the config before proceeding, right? I thought this would probably introduce even more latency.

I also didn't think there was a good way to keep the config in memory because I believe cfc_webapp is continuously running and wouldn't receive config updates.

Is there a better way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. The standard way to mitigate the performance issue is through caching. You would read the config once a day, and use the cached version otherwise. That would not necessarily introduce a perceptible performance impact.

However, that may still leave us with a consistency issue. If the phone config has updated, but the server has not yet updated, I suspect we may have issues because the survey names may be inconsistent across them. The standard way to resolve such consistency issues is to send over the version that the phone is using. The server would cache the last several versions and use the version that the phone sends in.

While thinking through this, I just realized that this might be even more complex. For example, if you had surveyNameA, surveyNameB and surveyNameC in May, and switched to surveyNameD, surveyNameE and surveyNameF in October, you should still apply the older version of the config (e.g. use surveyNameA, surveyNameB and surveyNameC when generating metrics for June in November. That will not work with the current approach of passing the config from the phone either.

I will merge this for now to unblock ourselves, and we should say that, after a study has been deployed to real users, we will not support changing the surveys. That is a reasonable restriction to impose, given that changing the survey while the data collection is in progress is not a recommended best-practice from a social science perspective - the results are not necessarily comparable with each other.

And then we should figure out the correct performance optimization vs. complexity tradeoff that we want to be at

@jiji14
Copy link
Contributor Author

jiji14 commented May 23, 2024

@JGreenlee I just checked the server code and I can see the flow. I'd love to discuss the details during the meeting. Thanks for all your work!

Copy link
Contributor

@shankari shankari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I indicated earlier, the main changes here are high level comments on the code in e-mission-common:

  • why are we using composite trips? The user inputs are also available in the confirmed trips, and using confirmed trips prevents us from having to read all the trajectory data in the composite trips, and reduce network demands on the database.
  • what is labels_map used for and why do we use it as fallback if there is no user_input? It is passed in as input to generate_summaries, but the invocation of generate_summaries does not pass it in.
  • Given that we are changing this structure anyway, I wonder if it would be helpful to have this structured as
{ "date": "2024-05-20",
   "mode_confirm": {"bike": 1000, "walk": 500}
    "purpose_confirm":{"home": 1500 }

They will then be similar to the summary objects. We could also add the metric and then they would have an even more similar type.

  • why do we parse out the fmt_time instead of using the local date objects directly?
        date = trip['start_fmt_time'].split('T')[0]
        if date not in days_of_metrics_data:
            days_of_metrics_data[date] = []
        days_of_metrics_data[date].append(trip)

Is it just for the return values?

Most of these are "future fix". The only actual potential bug is the comment below.

emission/storage/decorations/local_date_queries.py Outdated Show resolved Hide resolved
JGreenlee and others added 2 commits May 29, 2024 00:27
Co-authored-by: K. Shankari <shankari@eecs.berkeley.edu>
@shankari shankari merged commit e7d028a into e-mission:master May 29, 2024
5 checks passed
@JGreenlee
Copy link
Contributor

  • why are we using composite trips? The user inputs are also available in the confirmed trips, and using confirmed trips prevents us from having to read all the trajectory data in the composite trips, and reduce network demands on the database.

I was using composite trips because confirmed trips don't have any vehicle-specific info (until e-mission/e-mission-docs#1073 is done). However, I didn't implement vehicle-specific metrics yet anyway, so there's really no reason it needs to be composite trips.
When I do the rest of the metrics work, I can make it compatible with confirmed trips (or composite trips since they inherit from confirmedtrip)


  • what is labels_map used for and why do we use it as fallback if there is no user_input? It is passed in as input to generate_summaries, but the invocation of generate_summaries does not pass it in.

That's for unprocessed labels, which are passed in on the phone but not applicable on the server


  • Given that we are changing this structure anyway, I wonder if it would be helpful to have this structured as
{ "date": "2024-05-20",
   "mode_confirm": {"bike": 1000, "walk": 500}
    "purpose_confirm":{"home": 1500 }

Yes, I think that would be much better.

We could also add the metric and then they would have an even more similar type.

What would this look like in context?

Currently each day is a child of the metric:

{
  "distance": [
    {
      "date": "2024-05-20",
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    ...
  ],
  "duration": [ ... ],
  "count": [ ... ],
}

Are you suggesting this, where the metric is a child of the day? To me, this makes more sense.

[
  {
    "date": "2024-05-20",
    "distance": {
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    "duration": ...,
    "count": ...
  },
  ...
]

  • why do we parse out the fmt_time instead of using the local date objects directly?
    Is it just for the return values?

It's for the date field which is currently an ISO date string (YYYY-MM-DD) and to get that, trip['start_fmt_time'].split('T')[0] is cleaner than
f"{trip['start_local_dt']['year']}-{trip['start_local_dt']['month']}-{trip['start_local_dt']['day']}"

It could be changed to use the full local date objects instead of YYYY-MM-DD strings. It's just more verbose.

Combined with the above changes, it would look like this:

[
  {
    "local_date": {
      "year": 2024,
      "month": 5,
      "day": 20
    },
    "distance": {
      "mode_confirm": {
        "bike": 1000,
        "walk": 500
      },
      "purpose_confirm": {
        "home": 1500
      }
    },
    "duration": ...,
    "count": ...
  },
  ...
]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants