Implement and train an audio-only or audio-visual Source Separation system on the dataset provided by the teachers (see our official channel). You cannot use external datasets without the teachers' permission, except for noise/RIR augmentations.
You cannot use implementations available on the internet.
Each student is assigned to a team of 2-3 people that should work together on the project. To find your group, see our official channel. The submission consists of a code repository that reproduces the solution and a report that explains the team's system.
As always, your code must be based on the provided Project Template. Feel free to choose any of the code branches as a starting point (maybe you will find the `main` branch easier than the ASR one).
The code must follow these rules:
- The code should be stored in a public GitHub (or GitLab) repository (one per team) and based on the provided template. (Before the deadline, use a private repo. Make it public after the deadline.)
- All the necessary packages should be listed in `./requirements.txt`, `environment.yaml`, a `Dockerfile`, or in an installation guide section of the `README.md`.
- You must use W&B / Comet ML for logging losses, objects (like audio), and performance metrics.
- All necessary resources (such as model checkpoints) should be downloadable with a script. Mention the script (or the lines of code) in the `README.md`.
- Your solution should be reproducible with a script. Create a script file or mention the required lines in the `README.md`.
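For instance, a minimal sketch of such a download script, assuming the checkpoint is shared via Google Drive; the script path, file ID, and output location below are placeholders, not real resources:

```python
# scripts/download_checkpoint.py -- hypothetical path; the Google Drive ID is a placeholder
import os

import gdown  # add gdown to requirements.txt if you use this approach

CHECKPOINT_URL = "https://drive.google.com/uc?id=<YOUR_FILE_ID>"
OUTPUT_PATH = "checkpoints/model_best.pth"

os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)
gdown.download(CHECKPOINT_URL, OUTPUT_PATH, quiet=False)
print(f"Checkpoint saved to {OUTPUT_PATH}")
```

The `README.md` would then only need to mention a single line such as `python scripts/download_checkpoint.py`.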
Important
This is a group project, so the whole team must work in a single repository. Create branches, open pull requests, merge branches, split the responsibilities, etc. We will look at the commit history and investigate how you managed team code development.
Note
Due to restrictions of the free Comet ML subscription, we allow each student to have separate logs. If you use logs in your report, download them from W&B/Comet and create plots manually. Self-crafted figures will also improve the quality and design of your report.
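For illustration, a minimal plotting sketch, assuming each run was exported from W&B/Comet as a CSV with hypothetical `step` and `val_si_snri` columns:

```python
import os

import matplotlib.pyplot as plt
import pandas as pd

# hypothetical CSV exports, one per run, with "step" and "val_si_snri" columns
runs = {"audio-only": "logs/audio_only.csv", "audio-visual": "logs/audio_visual.csv"}

for label, path in runs.items():
    df = pd.read_csv(path)
    plt.plot(df["step"], df["val_si_snri"], label=label)

plt.xlabel("Training step")
plt.ylabel("Validation SI-SNRi (dB)")
plt.legend()
plt.tight_layout()
os.makedirs("figures", exist_ok=True)
plt.savefig("figures/val_si_snri.pdf")
```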
During your research, you will (and have to) try different models, different training schemes, etc. This is where Hydra-based configuration and purpose-based separation of configs should help a lot. Do not create litter commits just to update the config each time you want to try a new configuration; use command-line Hydra options instead (see the Seminar with Q&A and the project template discussion).
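For example, a hypothetical invocation (the entry point `train.py` and the config groups `model`, `trainer`, and `writer` are assumptions that depend on how your configs are organized):

```bash
# try another separator model and a longer schedule without editing any file in src/configs/
python train.py model=dprnn trainer.n_epochs=120 writer.run_name="dprnn-120ep"
```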
You should add an `inference.py` script and a `CustomDirDataset` dataset class in `src/datasets/` with a proper config in `src/configs/`.
The `CustomDirDataset` should be able to parse any directory with mixed speech of the following format:
NameOfTheDirectoryWithUtterances
├── audio
│   ├── mix
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   ├── s1 # ground truth for the speaker s1, may not be given
│   │   ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│   │   ├── FirstSpeakerID2_SecondSpeakerID2.wav
│   │   .
│   │   .
│   │   .
│   │   └── FirstSpeakerIDn_SecondSpeakerIDn.wav
│   └── s2 # ground truth for the speaker s2, may not be given
│       ├── FirstSpeakerID1_SecondSpeakerID1.wav # also may be flac or mp3
│       ├── FirstSpeakerID2_SecondSpeakerID2.wav
│       .
│       .
│       .
│       └── FirstSpeakerIDn_SecondSpeakerIDn.wav
└── mouths # contains video information for all speakers
    ├── FirstOrSecondSpeakerID1.npz # npz mouth-crop
    ├── FirstOrSecondSpeakerID2.npz
    .
    .
    .
    └── FirstOrSecondSpeakerIDn.npz
It should have an argument for the path to this custom directory that can be changed via Hydra options.
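A minimal standalone sketch of such a dataset class, assuming speaker IDs contain no underscores and that ground-truth files share the mix file's name and extension; adapt it to the dataset interface used by your project template:

```python
# src/datasets/custom_dir.py -- a minimal standalone sketch; adapt it to the
# dataset interface used by your project template (e.g. its base dataset class).
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset


class CustomDirDataset(Dataset):
    """Parses a directory with audio/{mix,s1,s2} and mouths/ subfolders."""

    AUDIO_EXTS = {".wav", ".flac", ".mp3"}

    def __init__(self, dataset_dir: str):
        self.root = Path(dataset_dir)
        mix_dir = self.root / "audio" / "mix"
        # every file in mix/ defines one example; s1/, s2/ and mouths/ may be absent
        self.mix_paths = sorted(
            p for p in mix_dir.iterdir() if p.suffix.lower() in self.AUDIO_EXTS
        )

    def __len__(self):
        return len(self.mix_paths)

    def _maybe_load(self, subdir: str, name: str):
        path = self.root / "audio" / subdir / name
        if not path.exists():
            return None  # ground truth may not be given
        waveform, _sr = torchaudio.load(str(path))
        return waveform

    def __getitem__(self, index: int):
        mix_path = self.mix_paths[index]
        mix, _sr = torchaudio.load(str(mix_path))
        # assumes speaker IDs do not contain underscores themselves
        first_id, second_id = mix_path.stem.split("_")
        item = {
            "mix": mix,
            "mix_name": mix_path.stem,  # used later to name the separated outputs
            "s1": self._maybe_load("s1", mix_path.name),
            "s2": self._maybe_load("s2", mix_path.name),
        }
        # paths to the npz mouth crops of both speakers, if provided
        for key, speaker_id in (("mouth1", first_id), ("mouth2", second_id)):
            npz_path = self.root / "mouths" / f"{speaker_id}.npz"
            item[key] = str(npz_path) if npz_path.exists() else None
        return item
```

The `dataset_dir` argument can then be exposed in the corresponding config and overridden from the command line, e.g. `python inference.py datasets.test.dataset_dir=/path/to/NameOfTheDirectoryWithUtterances` (the config path here is hypothetical).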
The `inference.py` script must apply the model to the given dataset (the custom one or any other supported in your `src`) and save the separated utterances in the requested directory. Each separated utterance must be saved under the same name as its mix (so they can be matched together; see the example above for how the ground-truth utterances are laid out).

Provide a separate script that calculates all required metrics given the paths to the ground-truth and predicted `s1` and `s2` utterances.

Mention in the `README` the lines needed to run inference with your final model. Include the lines for the metrics script too.
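A simplified sketch of such a metrics script, assuming both directories contain `s1/` and `s2/` subfolders and files are matched by name; it computes SI-SNR (and SI-SNRi when the mixtures are also given) using the better of the two speaker assignments, and SDRi/PESQ/STOI would be added analogously:

```python
# calc_metrics.py -- a simplified sketch; extend it with SDRi, PESQ, and STOI.
import argparse
from pathlib import Path

import torch
import torchaudio


def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for two 1-D signals of equal length."""
    est, ref = est - est.mean(), ref - ref.mean()
    s_target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))


def load_mono(path: Path) -> torch.Tensor:
    waveform, _sr = torchaudio.load(str(path))
    return waveform[0]


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--gt_dir", type=Path, required=True, help="dir with ground-truth s1/ and s2/")
    parser.add_argument("--pred_dir", type=Path, required=True, help="dir with predicted s1/ and s2/")
    parser.add_argument("--mix_dir", type=Path, default=None, help="dir with mixtures (enables SI-SNRi)")
    args = parser.parse_args()

    scores = []
    for gt1_path in sorted((args.gt_dir / "s1").iterdir()):
        name = gt1_path.name  # predictions are matched to ground truth by file name
        gt = [load_mono(gt1_path), load_mono(args.gt_dir / "s2" / name)]
        est = [load_mono(args.pred_dir / "s1" / name), load_mono(args.pred_dir / "s2" / name)]
        # separation is permutation-invariant: keep the better speaker assignment
        direct = (si_snr(est[0], gt[0]) + si_snr(est[1], gt[1])) / 2
        swapped = (si_snr(est[0], gt[1]) + si_snr(est[1], gt[0])) / 2
        score = torch.maximum(direct, swapped)
        if args.mix_dir is not None:  # SI-SNRi = improvement over the unprocessed mix
            mix = load_mono(args.mix_dir / name)
            score = score - (si_snr(mix, gt[0]) + si_snr(mix, gt[1])) / 2
        scores.append(score.item())

    metric = "SI-SNRi" if args.mix_dir is not None else "SI-SNR"
    print(f"Mean {metric} over {len(scores)} mixtures: {sum(scores) / len(scores):.2f} dB")


if __name__ == "__main__":
    main()
```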
Each team must provide an article-style report with all the required sections: Abstract, Introduction/Related Work, Methodology/Experimental Setup, Results/Discussion, Conclusion, and References. The page limit is 4 pages (4.5 if your report is in Russian), with no limit on the bibliography.
Look at these guidelines to understand what should be written in each section. Use the files in the guidelines repository as a LaTeX template.
Important
At the end of the report, include a Contributions Section (not counted in the page limit), explaining the contributions of each team member.
The provided dataset is audio-visual: each mix utterance `s1id_s2id.wav` has corresponding `npz` files `s1id.npz` and `s2id.npz`, each containing a mouth-region crop of one of the speakers in the mix. Therefore, your model's performance will increase significantly if you utilize the video information.
You are allowed to use any pre-trained video feature extractors from this Lip-reading repository. The given `npz` dataset files are compatible with these networks.
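Before building a loader around them, it is worth inspecting the files; a small sketch, where the array key `"data"` is an assumption to be replaced by whatever the inspection prints:

```python
import numpy as np

npz = np.load("mouths/FirstOrSecondSpeakerID1.npz")
print(npz.files)  # list the array names stored in the file
# the key "data" below is an assumption -- use whatever name the line above prints
mouth_crop = npz["data"]
print(mouth_crop.shape, mouth_crop.dtype)  # typically (num_frames, height, width)
```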
As the project starts before we discuss audio-visual models in the lectures, we advise you to start with audio-only models. This will also help you justify your design choices in the report.
To start your research, we suggest reading some surveys (in general, this is a good way to quickly start working in a field that is new for you):
List of surveys
Papers discussed in the speech separation lecture:
Papers discussed in the audio-visual lecture:
Important
You can try to implement any of the suggested models, search the literature for other systems, modify them, build something on top of them, or design your own new architectures. The goal of this project is to get practice in doing research, so explore the literature, conduct experiments, and justify your choices. Note that some of the models are extremely computationally expensive, so some adjustments to the architectures will be required. You can also use gradient accumulation or mixed-precision training techniques (see more info here).
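As a rough illustration of combining the two techniques in PyTorch, here is a self-contained sketch where a tiny convolution and random tensors stand in for your real separator and dataset:

```python
# Mixed-precision training with gradient accumulation; the Conv1d "separator",
# random data, and MSE loss are placeholders for your real model, dataset, and loss.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Conv1d(1, 2, kernel_size=3, padding=1).to(device)  # placeholder separator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

# random "mixtures" (1 channel) and "targets" (2 sources), 1 s at 16 kHz each
loader = DataLoader(
    TensorDataset(torch.randn(32, 1, 16000), torch.randn(32, 2, 16000)), batch_size=2
)

accum_steps = 4  # effective batch size = 2 * 4 = 8
for step, (mix, targets) in enumerate(loader):
    mix, targets = mix.to(device), targets.to(device)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        est = model(mix)
        loss = nn.functional.mse_loss(est, targets) / accum_steps  # stand-in loss

    scaler.scale(loss).backward()  # gradients accumulate over accum_steps mini-batches
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```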
For the project, you will get two grades: Code and Report.
Code grade
The model performance grade is based on your competition score relative to other teams and to the baselines.
| SI-SNRi | Grade for Passing Baseline |
| --- | --- |
- If you achieved SI-SNRi $< 5$, your performance grade is $0$.
- If you achieved SI-SNRi between $5$ and $9$, your grade is scaled between the baseline grades according to your place in the competition among teams in the same SI-SNRi range.
- If you achieved SI-SNRi $\ge 9$, your grade is scaled between the baseline grade and $10$ according to your place in the competition among teams in the same SI-SNRi range.
Also report other relevant metrics, such as SDRi, PESQ, and STOI.
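For reference, the competition metric follows the standard definition: for a zero-mean estimate $\hat{s}$, reference $s$, and input mixture $x$,

$$
s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^2}\, s,
\qquad
\text{SI-SNR}(\hat{s}, s) = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert \hat{s} - s_{\text{target}} \rVert^2},
\qquad
\text{SI-SNRi} = \text{SI-SNR}(\hat{s}, s) - \text{SI-SNR}(x, s),
$$

averaged over both speakers and all evaluation mixtures.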
Report grade
Bonus. Apart from the model performance competition, we will have a model speed (grade
Important
In this project, you can choose any architecture you like. This can be one of the models from the lectures or any model you found in the literature. We expect you to try different setups. Remember that using open-source code for these architectures is forbidden. Do not forget to justify your design choices by conducting ablation studies and citing the literature.