Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

retry and fetch an updated firmware url if the first try 401s #194

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

joshk
Copy link
Contributor

@joshk joshk commented May 24, 2024

this is a fix for #47

if the download returns a 401 it will retry once and then exit

I have also trapped Downloader exits and nil'd UpdateManager state where appropriate

@joshk joshk force-pushed the firmware-update-401-retries branch 2 times, most recently from ca5748b to 7b0414b Compare May 24, 2024 05:21
Copy link
Contributor

@lawik lawik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some notes on port numbering in tests but no major weirdness :)

test/nerves_hub_link/update_manager_test.exs Outdated Show resolved Hide resolved
test/nerves_hub_link/update_manager_test.exs Outdated Show resolved Hide resolved
test/support/http_unauthorized_error_plug.ex Outdated Show resolved Hide resolved
test/support/http_unauthorized_error_plug.ex Outdated Show resolved Hide resolved
test/nerves_hub_link/update_manager_test.exs Outdated Show resolved Hide resolved
test/nerves_hub_link/update_manager_test.exs Outdated Show resolved Hide resolved
lib/nerves_hub_link/update_manager.ex Outdated Show resolved Hide resolved
lib/nerves_hub_link/update_manager.ex Outdated Show resolved Hide resolved
@joshk joshk force-pushed the firmware-update-401-retries branch 3 times, most recently from 63e983b to fcfd50f Compare May 26, 2024 02:50
@joshk
Copy link
Contributor Author

joshk commented May 26, 2024

@jjcarstens I'm wondering if maybe the better way to approach this issue is to have UpdateManager report back to the Socket on the status of the update (rescheduled, ignored, download error), and then Hub can decide if it sends the update again.

Thoughts?

@lawik
Copy link
Contributor

lawik commented May 26, 2024

For time being out of whack, one solution is to have the device include what they think the time is in join.

Then the server can tell whether the client is confused or has an expired thing. It can even pass down the current time. This could be an optional mechanism for getting the time (trust the server) which is helpful for a Pi for example. But mostly it lets one tell the difference between bad auth and being out of sync.

@jjcarstens
Copy link
Collaborator

A noble thought @lawik, but I don't think that will work. Time is a required component of the connect for a couple reasons:

  • Device certs - the SSL stack on device may not even get to a connect attempt because it will check the server cert first and believe it is expired when it compares its terribly off device time
  • Shared secret - Time is a component of the HMAC to assert validity. To detect the time is off and say "o, that's okay. Here's the time you should use" would be an anti-pattern negating the HMAC a bit

However, the middle ground is a decent idea to just include the date header in the rejection always. Though I am unsure if we could get that early enough in the SSL stack. Though Frank is currently working on a time server solution that could be used for conditions like this too which might be the better play to keep the time concern separate

@lawik
Copy link
Contributor

lawik commented May 26, 2024

Shared secret has this issue. The SSL is not failing, it gets a 401 and if we knew what it thinks the time is we can detect it and give a better "why is NervesHubLink not up"-experience.

My PIs have this and log a ton of 401s on every restart. My guess is that they generate times at UTC until NervesTime syncs. Or something. Time is not 1977-01-01 :) SSL works it seems

@joshk
Copy link
Contributor Author

joshk commented May 27, 2024

@jjcarstens I've reworked the logic of this feature to now send updates from the UpdateManager to the pid that called it. These updates are then sent to Hub, which can then update the registry and decide if there are further actions to take, like sending a new update to the device.

I think this is a better setup, and there are options to improve it further.

@joshk joshk requested review from jjcarstens and lawik June 12, 2024 07:47
%State{} = state
) do
state = maybe_update_firmware(update, fwup_public_keys, state)
state = maybe_update_firmware(update, fwup_public_keys, elem(from, 0), state)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjcarstens is it ok to use elem(from, 0)? or is this bad practice?

@joshk joshk force-pushed the firmware-update-401-retries branch from 0e54572 to 39df313 Compare June 13, 2024 22:07
@joshk joshk requested a review from jjcarstens June 13, 2024 22:08
if is_atom(result) do
%{result: result}
else
%{result: elem(result, 0), reason: elem(result, 1)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not doing anything with the {:error portion of the tuple, why have it? We could just simplify and make it an atom return

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That or change the success condition to be {:ok, result} and we always return the same structure of %{result: :ok | :error, reason: reason}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are six events that can be sent from the UpdateManager to Socket are, the update ...

  • started
  • was ignored
  • was rescheduled
  • completed successfully
  • failed due to a 401 error (download_unauthorized)
  • failed due to a non fatal download error (non_fatal)

I had modelled this as ...

  • :started
  • :ignored
  • {:rescheduled, ms, rescheduled_to(ms)}
  • :completed
  • {:error, :download_unauthorized}
  • and {:error, :non_fatal}

I could change :completed to {:ok, :completed} to match the structure of the error cases, although the :ok is kinda superfluous information. It would also mean that :ignored and :started would need to change.

In Hub each of these cases would update registry in slightly different ways.

Thoughts?

I also realized that I hadn't accounted for {:rescheduled, ms, rescheduled_to(ms)}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants