Recent copyright infringement lawsuits against leading music generation companies have sent shockwaves through the AI-music community, highlighting the need for copyright-free training data. Meanwhile, MIDI, the most prevalent format for symbolic music processing, is well-suited to modeling sequences of notes but omits much of the additional musical information present in sheet music, which the MusicXML format captures. To address these gaps, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores. We also introduce MusicRender, an extension of the Python library MusPy's universal Music object, designed specifically to handle MusicXML. The dataset and further details can be downloaded from Zenodo.
To access the functionality that we introduce, please clone the latest version of this repository. Then, install the relevant dependencies into the Conda environment my_env with conda env update -n my_env --file environment.yml. Alternatively, you can create a virtual environment and install the dependencies into it with pip install -r requirements.txt.
With Conda:
git clone https://github.com/pnlong/PDMX.git
cd PDMX
conda create -n my_env python=3.10
conda env update -n my_env --file environment.yml
conda activate my_env
With pip:
git clone https://github.com/pnlong/PDMX.git
cd PDMX
python3 -m venv .my_env
source .my_env/bin/activate
pip install -r requirements.txt
We provide a few tools for working with both the PDMX dataset and MusicXML-like files.
We introduce MusicRender, an extension of MusPy's universal Music object that can hold musical performance directives through its annotations field.
from pdmx import MusicRender
Let's say music is a MusicRender object. We can save music to a JSON or YAML file at the location path:
music.save(path = path)
However, we could just as easily use write(), where path ends with .json or .yaml. The benefit of this method is that we can also write music to various other output formats; the output filetype is inferred from the file extension of path (.wav is audio, .midi is symbolic).
music.write(path = path)
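To make the idea of inferring the output type from path concrete, here is a minimal, standard-library sketch. The helper infer_output_kind and the EXTENSION_TO_KIND table are hypothetical names used only for illustration; they are not part of the library, and the extensions that write() actually accepts may differ from this list.

```python
from pathlib import Path

# Hypothetical mapping for illustration only. The set of extensions that
# write() really supports is defined by the library, not by this table.
EXTENSION_TO_KIND = {
    ".json": "json",
    ".yaml": "yaml",
    ".midi": "symbolic",
    ".wav": "audio",
}


def infer_output_kind(path: str) -> str:
    """Return the output kind implied by the file extension of `path`."""
    suffix = Path(path).suffix.lower()  # case-insensitive: ".WAV" matches ".wav"
    if suffix not in EXTENSION_TO_KIND:
        raise ValueError(f"cannot infer an output format from extension {suffix!r}")
    return EXTENSION_TO_KIND[suffix]


for example_path in ("score.json", "score.midi", "score.wav"):
    print(example_path, "->", infer_output_kind(example_path))
```

Dispatching on the extension keeps a single entry point for every output format, at the cost of requiring a recognizable suffix on path.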
When writing to audio or symbolic formats, performance directives (e.g., dynamics, tempo markings) are realized to their fullest extent. This functionality should not be confused with the music.realize_expressive_features() method, which realizes the directives inside a MusicRender object. Do not call this method explicitly before writing: it is called implicitly during that process, so any directives would be applied twice.
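The toy model below illustrates why realizing directives before writing applies them twice. ToyScore, its dynamic_scale field, and its methods are hypothetical stand-ins, not the MusicRender API; the sketch only assumes that realization modifies the object in place and that writing realizes directives again on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyScore:
    """Hypothetical stand-in for a score: note velocities plus one directive."""

    velocities: List[int]
    dynamic_scale: float = 1.0  # toy "dynamics" directive, e.g. a louder passage

    def realize(self) -> None:
        # Apply the directive in place, clamping to the MIDI velocity maximum.
        self.velocities = [
            min(127, round(v * self.dynamic_scale)) for v in self.velocities
        ]

    def write(self) -> List[int]:
        # Writing realizes the directive implicitly, on a copy of the score.
        rendered = ToyScore(list(self.velocities), self.dynamic_scale)
        rendered.realize()
        return rendered.velocities


# Writing directly applies the directive once.
once = ToyScore([40, 60], dynamic_scale=1.5).write()

# Realizing first and then writing applies it twice.
score = ToyScore([40, 60], dynamic_scale=1.5)
score.realize()
twice = score.write()

print(once)   # [60, 90]
print(twice)  # [90, 127]
```

Because the directive stays attached to the object after realization, the second pass scales already-scaled values, which is the failure mode the warning above describes.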
We store PDMX as JSONified MusicRender objects (see the write() or save() methods above). We can load these objects back into Python with the load() function, which returns a MusicRender object given the path to a JSON or YAML file.
from pdmx import load
music = load(path = path)
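Since the stored objects are JSON files, it can be handy to peek at one with only the standard library before loading it. The sketch below is an assumption-laden illustration, not part of the library: summarize_top_level and summarize_file are hypothetical helper names, and apart from annotations (mentioned above), the field names in the sample string are made up for the example.

```python
import json


def summarize_top_level(data: dict) -> dict:
    """Map each top-level key to the type name of its value."""
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}


def summarize_file(path: str) -> dict:
    """Read a JSON file from `path` and summarize its top-level fields."""
    with open(path, encoding="utf-8") as handle:
        return summarize_top_level(json.load(handle))


# Made-up sample content; a real PDMX file will contain different fields.
sample = '{"annotations": [], "resolution": 480, "metadata": {}}'
print(summarize_top_level(json.loads(sample)))
```

For real work, load() remains the intended entry point, since it returns a full MusicRender object rather than a plain dictionary.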
PDMX was created by scraping the public domain content of MuseScore, an online score-sharing platform on which users can upload their own sheet music arrangements in a MusicXML-like format. MusPy alone cannot fully parse these files. Our read_musescore() function can; it returns a MusicRender object given the path to a MuseScore file.
from pdmx import read_musescore
music = read_musescore(path = path)
If you find this repository helpful, feel free to cite our publication PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing:
@article{long2024pdmx,
title={{PDMX}: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing},
author={Long, Phillip and Novack, Zachary and Berg-Kirkpatrick, Taylor and McAuley, Julian},
journal={arXiv:2409.10831},
year={2024},
}