
Voice entry demo #31010

Closed · jakethesnake420 wants to merge 64 commits

Conversation

jakethesnake420 (Contributor) commented Jan 15, 2024

Basic demo starting point. Doesn't do anything useful yet

Demo running on Comma 3 device:

20240117_183950.mp4

jnewb1 linked an issue Jan 15, 2024 that may be closed by this pull request
cone-guy (Contributor)

Offline wake words and speech to text/text to speech are pretty nice right now due to Home Assistant's year of voice:
https://www.home-assistant.io/blog/2022/12/20/year-of-voice/

I found one model that'll run via ONNX, which should be accelerated by the comma 3/x

Would be interested to see how reliable it is over LTE!
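
For reference, here is a minimal sketch (not from this PR) of scoring 16 kHz microphone frames with openWakeWord through its ONNX backend; the model name, threshold, and frame source are placeholder assumptions.

import numpy as np
from openwakeword.model import Model

# "hey_jarvis" is a stock openWakeWord model standing in for a custom wake word;
# the weights must already be available locally.
oww = Model(wakeword_models=["hey_jarvis"], inference_framework="onnx")

def wake_word_detected(frame: np.ndarray, threshold: float = 0.5) -> bool:
    # frame: int16 PCM at 16 kHz, e.g. 1280 samples (~80 ms)
    scores = oww.predict(frame)
    return any(score > threshold for score in scores.values())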

jakethesnake420 marked this pull request as ready for review January 16, 2024 22:31
Contributor

It looks like you didn't use one of the Pull Request templates. Please check the contributing docs. Also make sure that you didn't modify any of the checkboxes or headings within the template.

adeebshihadeh (Contributor)

@jnewb1 can you try this out in a car?

system/micd.py Outdated
self.pm.send('microphone', msg)
self.rk.keep_time()
msg = messaging.new_message('microphoneRaw', valid=True)

Contributor Author

whitespace

print(f'Timeout reached. {loop_count=}, {time.time()-start_time=}')
break
elif self.stop_thread.is_set():
print(f'{self.stop_thread.is_set()=}')
Contributor Author

fixme

Contributor Author

clean this whole file

selfdrive/ui/qt/widgets/assistant.cc (resolved)
@@ -33,15 +35,18 @@ MainWindow::MainWindow(QWidget *parent) : QWidget(parent) {
main_layout->setCurrentWidget(onboardingWindow);
}


Contributor Author

remove


class MainWindow : public QWidget {
Q_OBJECT

public:
explicit MainWindow(QWidget *parent = 0);

Contributor Author

white space

You can also run rev_speechd.py which will wait for the "WakeWordDetected" param to be set.
To set up the Rev.ai API you need to install rev_ai:

pip install rev_ai
Contributor Author

See my PR for rev_ai revdotcom/revai-python-sdk#111
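
For context, streaming with the rev_ai SDK looks roughly like the sketch below; the access token, media config values, and the audio generator are placeholders rather than code from this PR.

from rev_ai.models import MediaConfig
from rev_ai.streamingclient import RevAiStreamingClient

def audio_chunks():
    # Placeholder generator: yields ~80 ms of silence per chunk; in the demo this
    # would be raw PCM pulled from the microphone queue instead.
    for _ in range(100):
        yield b"\x00" * 2560  # 1280 samples * 2 bytes (16 kHz, S16LE, mono)

config = MediaConfig("audio/x-raw", "interleaved", 16000, "S16LE", 1)
client = RevAiStreamingClient("YOUR_REVAI_ACCESS_TOKEN", config)

response_gen = client.start(audio_chunks())
for response in response_gen:
    print(response)  # interim and final hypotheses as JSON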

"download_url": "https://github.com/dscripka/openWakeWord/releases/download/v0.5.1/weather_v0.1.tflite"
}
}

Contributor Author

remove all these paths

REFERENCE_SPL = 2e-5 # newtons/m^2
SAMPLE_RATE = 44100
SAMPLE_BUFFER = 4096 # (approx 100ms)
SAMPLE_RATE = 16000
Contributor Author

This reduces the frequency response, so soundPressure is lower now. I tested that soundd still adjusts the volume for normal road noise and music.
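
For context, the sound pressure level is derived from the RMS of the sample buffer against REFERENCE_SPL, roughly as in the sketch below (not necessarily the exact micd.py code), so capturing only the 0-8 kHz band at 16 kHz lowers the measured value.

import numpy as np

REFERENCE_SPL = 2e-5  # Pa, standard reference sound pressure

def calculate_spl(measurements: np.ndarray) -> tuple[float, float]:
    # measurements: float samples in [-1, 1]; the RMS approximates sound pressure
    sound_pressure = float(np.sqrt(np.mean(measurements ** 2)))
    spl_db = 20 * np.log10(sound_pressure / REFERENCE_SPL) if sound_pressure > 0 else 0.0
    return sound_pressure, spl_db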

msg.microphoneRaw.frameIndex = self.frame_index
if not (self.frame_index_last == self.frame_index or
self.frame_index - self.frame_index_last == SAMPLE_BUFFER):
cloudlog.info(f'skipped {(self.frame_index - self.frame_index_last)//SAMPLE_BUFFER-1} samples')
Contributor Author

figure out how to stop it from skipping sometimes

jnewb1 (Contributor) commented Jan 22, 2024

Tried it in a car, looks pretty good! openpilot seems to run fine with it running; I will check some CPU stats soon. A couple of things:

  • noticed the last word always takes a bit longer in the live view, is that just a bug in the display or something with the API?
  • have you experimented with making a custom wake word (like "hey comma" or just "comma"?)
  • final parsing to get the full phrase seems to take 5 seconds, can we reduce that?

https://www.youtube.com/watch?v=TGMpwqr3r08

either way, this certainly fulfills the bounty qualifications!

jakethesnake420 (Contributor Author) commented Jan 22, 2024

Thank you for testing it. I am glad you liked it!

  • have you experimented with making a custom wake word (like "hey comma" or just "comma"?)

That requires training a new model, which I am not familiar with doing, but it is well documented here.

  • noticed the last word always takes a bit longer in the live view, is that just a bug in the display or something with the API?
  • final parsing to get the full phrase seems to take 5 seconds, can we reduce that?

Yes, there are ways to decrease latency but there are some tradeoffs.
I am sending 80ms binary chunks instead of 250ms chunks. This should be fairly easy to change by concatenating chunks from the audio_queue. This could have a big effect on final transcript latency. You can also select different transcription models which will have an effect.

You can read the docs here, but these are the important parts: https://docs.rev.ai/api/streaming/requests/#raw-file-content-type

max_segment_duration_seconds: This parameter potentially changes the amount of context our engine has when creating final hypotheses and therefore has a minor effect on word error rate. Higher values correlate with fewer errors in transcription.

skip_postprocessing=true: Only available for English and Spanish languages. You can choose to skip post-processing operations, such as inverse text normalization (ITN), casing and punctuation, by adding skip_postprocessing=true to your request. Doing so will result in a small decrease in latency; however, your final hypotheses will no longer contain capitalization, punctuation, or inverse text normalization (for example, five hundred will not be normalized to 500).
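
To illustrate, here is a hedged sketch of attaching those two parameters to the streaming WebSocket request (endpoint and content_type per the linked docs; the token and the duration value are placeholders):

from urllib.parse import urlencode

params = {
    "access_token": "YOUR_REVAI_ACCESS_TOKEN",  # placeholder
    "content_type": "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
    "max_segment_duration_seconds": 3,  # shorter segments -> faster finals, slightly higher word error rate
    "skip_postprocessing": "true",      # English/Spanish only; drops casing, punctuation, and ITN
}
url = "wss://api.rev.ai/speechtotext/v1/stream?" + urlencode(params)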

I originally used Google's API and it was faster! You can see my Google API implementation here: f48cfbe. There may be optimizations if you select a different model from Rev.ai, but I am not sure. You could change the API provider fairly easily as long as it supports streaming.

The reason I moved away from the Google API is that the authentication process was cumbersome, the Python library they provide is not documented well enough for me, and I wasn't able to end transcription sessions cleanly. There are many other API providers that you can try, but Rev.ai was easy to get started with.

jakethesnake420 (Contributor Author)

@jnewb1 Can you send interac e transfers?

jnewb1 (Contributor) commented Jan 26, 2024

@jnewb1 Can you send interac e transfers?

What?

jakethesnake420 (Contributor Author)

@jnewb1 Can you send interac e transfers?

What?

Like for the bounty payout. Do you guys do PayPal or e-transfer or something else?

jnewb1 (Contributor) commented Jan 27, 2024

Like for the bounty payout. Do you guys do PayPal or e-transfer or something else?

https://github.com/commaai/openpilot/blob/master/docs/BOUNTIES.md#rules

jakethesnake420 (Contributor Author)

https://github.com/commaai/openpilot/blob/master/docs/BOUNTIES.md#rules

Ok, if that's the case I'd like to do a few more and save up for a comma body!

adeebshihadeh (Contributor)

Closing since I don't want to merge this in this state, but this was still very valuable and we're good to pay out whenever you want.

jakethesnake420 (Contributor Author)

Thank you, sounds good!

jakethesnake420 (Contributor Author)

  • noticed the last word always takes a bit longer in the live view, is that just a bug in the display or something with the API?
  • final parsing to get the full phrase seems to take 5 seconds, can we reduce that?

I emailed Rev.ai about the latency. They said they will fix the issue.

Here is the email exchange:

I am developing speech to text for openpilot, a self-driving add-on system. I have chosen to use Rev.ai because of its ease of authentication and simple API. One issue I am facing is the delay between the interim and final results. For example, if the user says "Navigate to home", the interim result will return "navigate to" immediately and it will take about 8 seconds for it to return the full final result "Navigate to home".

Using your streaming demo on your website it's very fast, and I would like the same performance. I am using the Python SDK. I am recording 80ms chunks of audio. I tried sending 250ms chunks but it doesn't make a difference.

Reply from support:

The team does have an idea of how they can improve your experience. The issue is it is going to be weeks before they can implement it. At minimum, 2 weeks. Possibly longer if something more urgent comes up. As I understand it, they need to code the ability to disable overlaps and apply it to your account specifically. Given your responses will all likely be short, the overlap is what is holding things up. This is not something we have built yet, but they think it can be done. I am keeping your ticket open so I can circle back to you once I have confirmation this has happened. 


Successfully merging this pull request may close these issues.

[$200 bounty] Voice entry demo