Commit

Update documentation

Unknown committed Nov 24, 2024
1 parent 4d89e14 commit 99d76e8
Showing 83 changed files with 3,320 additions and 3,279 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: db8c3136e21a7390d113749b043e83a3
config: 79bb7ee00347aa10773981b437044cde
tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file added _images/benefit_lm.png
Binary file added _images/contextualize_word.png
Binary file added _images/conversational_retrieval.png
Binary file removed _images/doh_enrich.png
Binary file added _images/eval_confusion.png
Binary file added _images/eval_prediction.png
Binary file removed _images/history.png
Binary file added _images/limit_caption.png
Binary file added _images/limit_doh.png
Binary file added _images/limit_jinha.png
Binary file removed _images/main.png
Binary file removed _images/overview.png
Binary file added _images/similarity_query.png
Binary file added _images/solution.png
Binary file added _images/tips.png
67 changes: 67 additions & 0 deletions _sources/retrieval/challenge.md
@@ -0,0 +1,67 @@
# Challenges

While audio-text joint embedding models have enabled significant advances in retrieval systems, several open challenges remain. In this chapter, we review two major issues:

1. Query-Caption Distribution Mismatch: There is often a significant gap between how users naturally formulate their queries and the captions used to train retrieval models. Training datasets typically contain formal descriptions or technical annotations, while real user queries tend to be more colloquial and diverse in their expression.

2. Single-Turn Retrieval Limitations: Current retrieval systems mostly operate in a single-turn fashion, where each query is processed independently. This fails to capture the natural back-and-forth nature of music discovery, where users often refine their searches based on previous results and may want to explore related but different musical directions.

## Query-Caption Distribution Mismatch

```{figure} ./img/limit_caption.png
---
name: Limitation of Current Caption Dataset
---
```

Current caption datasets focus almost exclusively on musical attributes and lack coverage of the broader cultural aspects of music. For example, the [Song Describer dataset](https://github.com/mulab-mir/song-describer-dataset) presented in {cite}`manco2023song` is limited to technical musical attributes such as genre, style, mood, texture, instruments, and structure. While these attributes are important, they miss many other dimensions that users care about - such as cultural or historical significance, artist background, or similarity to other artists and songs. The dataset also tends to use formal, analytical language rather than the natural, colloquial ways that people typically describe and search for music. This narrow focus leaves significant gaps in covering the full spectrum of users' musical needs and search behaviors. Without training data that captures these broader aspects of how people relate to and discover music, retrieval systems may struggle to serve real-world use cases where users want to find music based on more than its acoustic characteristics.
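
As a rough illustration of this gap (not part of the original chapter's code), the sketch below encodes an annotation-style caption and a colloquial query with the same off-the-shelf text encoder and compares them; the `sentence-transformers` package, the model name, and the example sentences are all illustrative assumptions.

```python
# A minimal sketch of the query-caption gap. Assumes the `sentence-transformers`
# package is installed; the model name and example texts are illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Annotation-style caption, typical of training data
caption = "A mid-tempo rock track with distorted guitar, driving drums, and a powerful female lead vocal."
# Colloquial query, typical of real users
query = "something to blast while driving at night with the windows down"

emb = encoder.encode([caption, query], convert_to_tensor=True)
print("caption-query cosine similarity:", util.cos_sim(emb[0], emb[1]).item())
```

Even when the caption and the query could plausibly point to the same track, the two texts occupy quite different regions of the embedding space, which is exactly the mismatch that a retrieval model trained only on annotation-style captions has to bridge.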


```{figure} ./img/limit_jinha.png
---
name: Query Understanding
---
```

It is worth revisiting earlier research on user query understanding. In {cite}`lee2010analysis`, an analysis of user information needs by topic revealed a much broader range of musical requirements than current systems typically handle. For example, information about lyrics, scores, publishers, and dates is notably absent from current music caption datasets. More recently, {cite}`doh2024music` conducted a qualitative analysis of 917 music-related conversations, finding that users search for music not only through queries about musical content, but also through queries about users, metadata, and other diverse aspects.


```{figure} ./img/limit_doh.png
---
name: User Query Analysis
---
```

These studies highlight the significant gap between how current retrieval systems are trained and actual user needs. While modern caption datasets focus primarily on musical content descriptions, real users seek music through a much richer variety of queries encompassing cultural, contextual, and metadata-based information. This mismatch limits the effectiveness of current retrieval systems in serving natural user queries.


## Single-Turn Retrieval Limitations

Current music retrieval systems are predominantly designed for single-turn interactions, where each query is treated as an independent event. However, this approach fails to capture the inherently iterative and conversational nature of music discovery. When users search for music, they often need multiple attempts to find what they're looking for, refining their queries based on previous results and system responses.

Consider a typical scenario: a user makes an initial query but isn't fully satisfied with the results. In a natural discovery process, they would build upon this first interaction - perhaps specifying additional criteria, requesting variations on the initial results, or steering the search in a slightly different direction. Current single-turn systems, however, treat each new query as a completely fresh start, discarding valuable context from previous interactions.

This limitation creates several key problems:

1. **Loss of Search Context**: Each new query starts from scratch, ignoring the valuable information contained in the user's search history and previous interactions. This forces users to repeatedly provide context that could have been inferred from their search trajectory.

2. **Inability to Learn from Feedback**: The system cannot effectively learn from the user's implicit or explicit feedback about previous results to improve subsequent recommendations. When a user modifies their query, the system doesn't understand which aspects of the previous results were unsatisfactory.

3. **Limited Refinement Capabilities**: Users cannot naturally refine their searches through iterative interactions. Instead of being able to say "like that, but more upbeat" or "similar but with different instruments," they must formulate entirely new queries.

4. **Missed Opportunities for Exploration**: The system cannot guide users through a natural exploration of the music space, suggesting related but different directions based on their evolving preferences and reactions to previous results.

To address these limitations, future retrieval systems need to incorporate multi-turn capabilities that can:
- Maintain conversation history and context across multiple interactions
- Understand when initial results don't fully satisfy the user's intent
- Use previous interactions to inform and improve subsequent recommendations
- Support natural query refinement and exploration patterns

This evolution towards conversational music retrieval would better align with how people naturally discover and explore music, leading to more satisfying and effective search experiences.
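
As a rough sketch of what maintaining context across turns could look like (an illustration under assumptions, not a prescribed design), the code below keeps a running query embedding and blends each refinement into it before searching a precomputed matrix of track embeddings; the `text_encoder` interface, the mixing weight `alpha`, and the variable names are all hypothetical.

```python
import torch
import torch.nn.functional as F

def multi_turn_search(text_encoder, track_embeddings, track_ids, turns,
                      alpha=0.5, top_k=5):
    """Toy multi-turn retrieval loop that blends each new turn with the
    accumulated query context instead of starting from scratch.

    Assumptions (illustrative only): `text_encoder(str)` returns a 1-D tensor,
    `track_embeddings` is an (N, D) tensor of L2-normalized audio embeddings,
    and `track_ids` is a list of N identifiers.
    """
    context = None
    for turn in turns:  # e.g. ["calm piano music", "like that, but more upbeat"]
        z = F.normalize(text_encoder(turn), dim=-1)
        # Carry over previous context rather than discarding it.
        context = z if context is None else F.normalize(
            alpha * context + (1 - alpha) * z, dim=-1)
        scores = track_embeddings @ context          # cosine similarity scores
        top = torch.topk(scores, k=top_k).indices.tolist()
        print(f"{turn!r} -> {[track_ids[i] for i in top]}")
    return context
```

A real conversational system would go further, using dialogue state, explicit feedback, and a language model to interpret refinements, but even this simple blending preserves context that a single-turn system discards.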


## References

```{bibliography}
:filter: docname in docnames
```
192 changes: 30 additions & 162 deletions _sources/retrieval/code.ipynb
@@ -42,7 +42,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -55,8 +55,8 @@
"name": "stdout",
"output_type": "stream",
"text": [
"GPU Available: True\n",
"GPU Device Name: Tesla T4\n"
"GPU Available: False\n",
"GPU Device Name: No GPU\n"
]
}
],
@@ -83,7 +83,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {
"id": "y7Q-QRhevB6N"
},
@@ -103,7 +103,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -140,7 +140,7 @@
"<IPython.lib.display.Audio object>"
]
},
"execution_count": 2,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
@@ -169,7 +169,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {
"id": "APdNnLlUuTXZ"
},
@@ -205,7 +205,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
@@ -326,158 +326,17 @@
"id": "-YpKMoQlGPWi",
"outputId": "1f28598c-98d5-41c1-c12f-824593586641"
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "430553c9d8c84b41b257cf4585ce4783",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"README.md: 0%| | 0.00/421 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "58b58203600a4b0f9c1eb93921eb7162",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"train-00000-of-00001.parquet: 0%| | 0.00/14.2M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "34718de62fe04b54b5c74789164156ba",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"test-00000-of-00001.parquet: 0%| | 0.00/1.86M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "2a549bf320f141f9b5c43e0a91b49733",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0%| | 0/3000 [00:00<?, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "393428f568a94ed2bb14e20e93b11e68",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating test split: 0%| | 0/300 [00:00<?, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "5695308b3e17476985f661ba4424637d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"README.md: 0%| | 0.00/419 [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "20794feae86e417eaec34e9116fe184d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"train-00000-of-00001.parquet: 0%| | 0.00/11.1M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "4a35f884a04d42589fbecf15134df051",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"test-00000-of-00001.parquet: 0%| | 0.00/1.44M [00:00<?, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "243a60f4e4bb4120a0317b150e39b8e1",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating train split: 0%| | 0/3000 [00:00<?, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "0b888a2099c34eb490b3fdbeb59e3b65",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Generating test split: 0%| | 0/300 [00:00<?, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"outputs": [],
"source": [
"train_dataset = MusicTextDataset(split=\"train\")\n",
"test_dataset = MusicTextDataset(split=\"test\")\n",
"tr_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=128, num_workers=2,shuffle=True, drop_last=True)\n",
"te_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=128, num_workers=2, shuffle=False, drop_last=True)"
"tr_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=128, num_workers=0,shuffle=True, drop_last=True)\n",
"te_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=128, num_workers=0, shuffle=False, drop_last=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 9,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
Expand All @@ -491,7 +350,7 @@
"output_type": "stream",
"text": [
"18754\n",
"A powerful Indian woman's rock vocals take center stage, creating a captivating and non-operatic experience.\n",
"A powerful female Indian vocalist captivates listeners with her mesmerizing rock singing, infusing her foreign roots into an electrifying blend of contemporary sounds, delivering a captivating performance that evades the realm of opera.\n",
"torch.Size([1024])\n",
"torch.Size([768])\n"
]
@@ -523,7 +523,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 10,
"metadata": {
"id": "5TCeHQKhH6gm"
},
@@ -572,7 +572,7 @@
" temperature = torch.clamp(self.logit_scale.exp(), max=100)\n",
" logits = torch.einsum('nc,mc->nm', [z1, z2]) * temperature.to(self.device)\n",
" N = logits.shape[0] # batch size per GPU\n",
" labels = torch.arange(N, dtype=torch.long, device=device)\n",
" labels = torch.arange(N, dtype=torch.long, device=self.device)\n",
" return torch.nn.functional.cross_entropy(logits, labels)\n",
"\n",
" def forward(self, batch):\n",
Expand All @@ -586,7 +445,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
@@ -614,13 +614,13 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 12,
"metadata": {
"id": "_qx8xe5TId-g"
},
"outputs": [],
"source": [
"def train(model, dataloader, optimizer):\n",
"def train(model, dataloader, optimizer, epoch):\n",
" model.train()\n",
" total_loss = 0\n",
" pbar = tqdm(dataloader, desc=f'TRAIN Epoch {epoch:02}')\n",
@@ -985,7 +985,7 @@
"# Define optimizer and loss function\n",
"optimizer = torch.optim.AdamW(model.parameters(), lr=lr)\n",
"for epoch in range(NUM_EPOCHS):\n",
" train_loss = train(model, tr_dataloader, optimizer)\n",
" train_loss = train(model, tr_dataloader, optimizer, epoch)\n",
" valid_loss = test(model, te_dataloader)\n",
" print(\"[Epoch %d/%d] [Train Loss: %.4f] [Valid Loss: %.4f]\" % (epoch + 1, NUM_EPOCHS, train_loss, valid_loss))"
]
@@ -1121,7 +1121,7 @@
" for item in tqdm(dataloader):\n",
" h_audio = item['h_audio']\n",
" with torch.no_grad():\n",
" z_audio = model.audio_forward(h_audio.to(\"cuda\"))\n",
" z_audio = model.audio_forward(h_audio.to(model.device))\n",
" item_joint_embedding.append(z_audio.detach().cpu())\n",
" track_ids.extend(item['track_id'])\n",
" item_vector_db = torch.cat(item_joint_embedding, dim=0)\n",
@@ -1336,7 +1336,16 @@
"name": "python3"
},
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.14"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {