MAEC: A Multimodal Aligned Earnings Conference Call Dataset for Financial Risk Prediction

In the area of natural language processing, various financial datasets have informed recent research and analysis including financial news, financial reports, social media, and audio data from earnings calls. We introduce a new, large-scale multi-modal, text-audio paired, earnings-call dataset named MAEC, based on S&P 1500 companies. We describe the main features of MAEC, how it was collected and assembled, paying particular attention to the text-audio alignment process used. We present the approach used in this work as providing a suitable framework for processing similar forms of data in the future. The resulting dataset is more than six times larger than those currently available to the research community and we discuss its potential in terms of current and future research challenges and opportunities.

MAEC Dataset

Transcripts along with the low-level audio features (Total 147.7 MB)

Each folder is named in the format YearMonthDay_CompanyCode and contains two files:

Transcript text file, named text.txt.
Low-level audio features file, named features.csv.
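
The folder layout above can be read with a few lines of Python. This is a minimal sketch, not part of the released code: the helper names `parse_folder_name` and `load_call` are hypothetical, the date is assumed to be written as eight digits (for example 20150225), and the files are assumed to be UTF-8. The columns of features.csv are not described here, so rows are returned as raw strings.

```python
import csv
import os
from collections import namedtuple
from datetime import datetime

# Hypothetical container for the two parts of a folder name.
CallId = namedtuple("CallId", ["date", "company_code"])


def parse_folder_name(name):
    """Split 'YearMonthDay_CompanyCode' into a date and a company code."""
    date_part, _, company_code = name.partition("_")
    if not company_code:
        raise ValueError("expected YearMonthDay_CompanyCode, got {!r}".format(name))
    # Assumption: the date is eight digits, e.g. 20150225.
    return CallId(datetime.strptime(date_part, "%Y%m%d").date(), company_code)


def load_call(folder):
    """Return (call_id, transcript_lines, feature_rows) for one call folder."""
    call_id = parse_folder_name(os.path.basename(os.path.normpath(folder)))
    # Assumption: both files are UTF-8 encoded.
    with open(os.path.join(folder, "text.txt"), encoding="utf-8") as f:
        transcript = [line.rstrip("\n") for line in f]
    with open(os.path.join(folder, "features.csv"), newline="", encoding="utf-8") as f:
        # Rows are kept as lists of strings; no column layout is assumed.
        features = list(csv.reader(f))
    return call_id, transcript, features
```

Calling `load_call` on one folder of the dataset would return the parsed folder name, the transcript lines, and the feature rows, which can then be converted to numeric types once the column layout has been inspected.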

High-level features (Total 59 GB)

We produced and released high-level audio feature files (MFCC features), named CompanyCode_YearMonthDay-OrderNumber.npy. Click here to download.
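
Since the high-level features are stored as .npy files, they can be loaded with NumPy. The sketch below is illustrative only: `parse_feature_name` and `load_call_features` are hypothetical helpers, YearMonthDay is assumed to be eight digits, and OrderNumber is assumed to be an integer that gives the position of a segment within a call. The shape of each array is not assumed.

```python
import os
import re
from collections import namedtuple

# Hypothetical container for the three parts of a feature file name.
SegmentId = namedtuple("SegmentId", ["company_code", "date", "order"])

# Assumption: YearMonthDay is eight digits and OrderNumber is an integer.
_NAME = re.compile(r"^(?P<code>.+)_(?P<date>\d{8})-(?P<order>\d+)\.npy$")


def parse_feature_name(filename):
    """Split 'CompanyCode_YearMonthDay-OrderNumber.npy' into its parts."""
    match = _NAME.match(os.path.basename(filename))
    if match is None:
        raise ValueError("unexpected feature file name: {!r}".format(filename))
    return SegmentId(match.group("code"), match.group("date"), int(match.group("order")))


def load_call_features(directory, company_code, date):
    """Load all segments of one call, sorted by order number."""
    import numpy as np  # third-party; imported here so name parsing works without it

    segments = []
    for name in os.listdir(directory):
        if not name.endswith(".npy"):
            continue
        seg = parse_feature_name(name)
        if (seg.company_code, seg.date) == (company_code, date):
            segments.append((seg.order, np.load(os.path.join(directory, name))))
    # Sort on the order number only, so arrays are never compared.
    return [array for _, array in sorted(segments, key=lambda item: item[0])]
```

Given the 59 GB total size, loading one call at a time in this way is likely more practical than reading the whole directory into memory.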

Iterative Forced Alignment Core Code

In the Iterative Forced Alignment folder, we released our code for text-audio segmentation.

Prerequisites

The Python version and packages required to execute the code:

Python 3.5
Pydub
Aeneas
FFMPEG

How to execute the code

Run the code from a caller program that passes the parameters to alignmentCore.py. There are 8 parameters to be set in total.

Example use case:

python3.5 alignmentCore.py FolderPath(CompanyCode_YearMonthDay) TextPath(WorkDirectory/CompanyCode_YearMonthDay) AudioPath(WorkDirectory/CompanyCode_YearMonthDay/CompanyCode_YearMonthDay) AudioFormat(Eg."mp3") WorkDirectory LogFileName(Eg."log1.txt")
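
A caller program for the command above might look like the following sketch. It is an assumption-laden illustration, not the authors' caller: `build_alignment_command` and `run_alignment` are hypothetical names, only the six arguments shown in the example are passed (the remaining parameters are not listed in this README), and AudioPath is interpreted as the audio file path without its extension.

```python
import os
import subprocess


def build_alignment_command(work_dir, call_name, audio_format="mp3",
                            log_file="log1.txt", python="python3.5",
                            script="alignmentCore.py"):
    """Build the argument list for one call, in the order of the example."""
    call_dir = os.path.join(work_dir, call_name)
    return [
        python,
        script,
        call_name,                          # FolderPath (CompanyCode_YearMonthDay)
        call_dir,                           # TextPath
        os.path.join(call_dir, call_name),  # AudioPath (assumed: no extension)
        audio_format,                       # AudioFormat
        work_dir,                           # WorkDirectory
        log_file,                           # LogFileName
    ]


def run_alignment(work_dir, call_name, **options):
    """Run the alignment script for one call and return its exit code."""
    return subprocess.call(build_alignment_command(work_dir, call_name, **options))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces. To process many calls, `run_alignment` can be invoked once per folder in the work directory.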

Citation

@inproceedings{CIKM2020MAEC,
author = {Li, Jiazheng and Yang, Linyi and Smyth, Barry and Dong, Ruihai},
title = {MAEC: A Multimodal Aligned Earnings Conference Call Dataset for Financial Risk Prediction},
year = {2020},
isbn = {9781450368599},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3340531.3412879},
doi = {10.1145/3340531.3412879},
abstract = {In the area of natural language processing, various financial datasets have informed recent research and analysis including financial news, financial reports, social media, and audio data from earnings calls. We introduce a new, large-scale multi-modal, text-audio paired, earnings-call dataset named MAEC, based on S\&P 1500 companies. We describe the main features of MAEC, how it was collected and assembled, paying particular attention to the text-audio alignment process used. We present the approach used in this work as providing a suitable framework for processing similar forms of data in the future. The resulting dataset is more than six times larger than those currently available to the research community and we discuss its potential in terms of current and future research challenges and opportunities. All resources of this work are available at https://github.com/Earnings-Call-Dataset/},
booktitle = {Proceedings of the 29th ACM International Conference on Information \& Knowledge Management},
pages = {3063–3070},
numpages = {8},
keywords = {multimodal aligned datasets, earnings conference calls, financial risk prediction},
location = {Virtual Event, Ireland},
series = {CIKM '20}
}

Terms Of Use

This dataset and the iterative forced alignment code are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Acknowledgments

The research project was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_2.
