UCI datasets

Regression datasets from the UCI machine learning repository prepared for benchmarking studies with test-train splits.

Installation

Install using pip (the download size is about 312 MB):

python -m pip install git+https://github.com/treforevans/uci_datasets.git

Usage

The following code gets the first test-train split (i.e., split=0) of the challenger dataset:

from uci_datasets import Dataset
data = Dataset("challenger")
x_train, y_train, x_test, y_test = data.get_split(split=0)

Each dataset has 10 test-train splits (as in 10-fold cross-validation). In each split, 90% of the dataset is used for training and 10% for testing. The split parameter of the Dataset.get_split method accepts integers from 0 to 9 (inclusive).
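To benchmark a model across all 10 splits, loop over the split indices and average the test error. The sketch below is an illustration rather than part of the package: `mean_rmse_over_splits` is a hypothetical helper name, the model is a plain least-squares linear fit chosen only as a simple baseline, and it assumes `get_split` returns NumPy arrays with one row per observation.

```python
import numpy as np


def mean_rmse_over_splits(get_split, n_splits=10):
    """Average the test RMSE of a least-squares linear fit over all splits.

    `get_split` is any callable with the same interface as
    Dataset.get_split, i.e. get_split(split=i) returns
    (x_train, y_train, x_test, y_test).
    """
    rmses = []
    for split in range(n_splits):
        x_train, y_train, x_test, y_test = get_split(split=split)
        # Append a column of ones so the linear model has an intercept.
        a_train = np.hstack([x_train, np.ones((x_train.shape[0], 1))])
        a_test = np.hstack([x_test, np.ones((x_test.shape[0], 1))])
        # Fit on the training points only.
        weights, *_ = np.linalg.lstsq(a_train, y_train, rcond=None)
        # Evaluate on the held-out testing points.
        residual = a_test @ weights - y_test
        rmses.append(float(np.sqrt(np.mean(residual ** 2))))
    return float(np.mean(rmses)), rmses
```

With the package installed, it could be called as `mean_rmse_over_splits(Dataset("challenger").get_split)`.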

Datasets

The table below lists the size (number of observations) and the number of input dimensions of each dataset. All datasets have a single output dimension.

Dataset name	Number of observations	Input dimension
`3droad`	434874	3
`autompg`	392	7
`bike`	17379	17
`challenger`	23	4
`concreteslump`	103	7
`energy`	768	8
`forest`	517	12
`houseelectric`	2049280	11
`keggdirected`	48827	20
`kin40k`	40000	8
`parkinsons`	5875	20
`pol`	15000	26
`pumadyn32nm`	8192	32
`slice`	53500	385
`solar`	1066	10
`stock`	536	11
`yacht`	308	6
`airfoil`	1503	5
`autos`	159	25
`breastcancer`	194	33
`buzz`	583250	77
`concrete`	1030	8
`elevators`	16599	18
`fertility`	100	9
`gas`	2565	128
`housing`	506	13
`keggundirected`	63608	27
`machine`	209	7
`pendulum`	630	9
`protein`	45730	9
`servo`	167	4
`skillcraft`	3338	19
`sml`	4137	26
`song`	515345	90
`tamielectric`	45781	3
`wine`	1599	11

Dataset information can be obtained from the all_datasets dictionary. For example, to obtain a list of all datasets with fewer than 1000 observations, execute the following:

from uci_datasets import all_datasets
[name for name, (n_observations, n_dimensions) in all_datasets.items() if n_observations < 1000]
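The same pattern extends to filtering on both size and input dimension. The sketch below uses a hypothetical helper, `select_datasets`, which is not part of the package. It assumes only the dictionary layout shown above, where each name maps to `(n_observations, n_dimensions)`. The bounds are inclusive, and the example dictionary is a small excerpt copied from the table so that the block runs without the package installed.

```python
def select_datasets(datasets, max_observations=None, max_dimensions=None):
    """Return the sorted names of datasets within the given size limits.

    `datasets` maps each name to (n_observations, n_dimensions), the layout
    used by all_datasets. A limit of None means "no limit".
    """
    selected = []
    for name, (n_observations, n_dimensions) in datasets.items():
        if max_observations is not None and n_observations > max_observations:
            continue
        if max_dimensions is not None and n_dimensions > max_dimensions:
            continue
        selected.append(name)
    return sorted(selected)


# Excerpt of four entries taken from the table above.
example = {
    "challenger": (23, 4),
    "yacht": (308, 6),
    "autos": (159, 25),
    "slice": (53500, 385),
}

print(select_datasets(example, max_observations=1000, max_dimensions=10))
# prints ['challenger', 'yacht']
```

With the package installed, pass `all_datasets` in place of `example`.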

Papers using these datasets

The following papers use the same datasets and test-train splits present in this repository.
