Missing feature names in Wisconsin dataset #20

Open
trangdata opened this issue Apr 10, 2020 · 5 comments

@trangdata
Collaborator

trangdata commented Apr 10, 2020

Currently, the features in the Wisconsin Prognostic Breast Cancer dataset do not have names.

The corresponding datasets (I think) on OpenML and even Kaggle seem to have this information. It would be helpful for these feature names to be added.

@weixuanfu
Contributor

@lacava Any idea? Should we update this dataset based on OpenML?

@trangdata
Collaborator Author

Similar issue for the tic-tac-toe dataset. OpenML ref: https://www.openml.org/d/50

@lacava
Collaborator

lacava commented Apr 10, 2020

> @lacava Any idea? Should we update this dataset based on OpenML?

sure, we just need to make sure they match.

> It would be helpful for these feature names to be added.

agreed! if you have bandwidth to submit a PR please do
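
A minimal sketch of what "make sure they match" could look like before copying names over. It assumes both tables are already loaded as lists of rows and that values are spelled identically in both copies (so `1` and `1.0` would count as a mismatch). The function name `adopt_feature_names` and the feature names in the usage note are hypothetical, not part of this repository or of OpenML.

```python
from typing import List, Sequence, Tuple


def _normalize(rows: Sequence[Sequence[object]]) -> List[Tuple[str, ...]]:
    """Turn rows into sorted tuples of stripped strings so row order is ignored."""
    return sorted(tuple(str(value).strip() for value in row) for row in rows)


def adopt_feature_names(
    unnamed_rows: Sequence[Sequence[object]],
    reference_header: Sequence[str],
    reference_rows: Sequence[Sequence[object]],
) -> List[str]:
    """Return the reference header only if the two tables hold the same rows.

    Raises ValueError when the column count or the cell values differ, so
    names are never attached to data they do not describe.
    """
    if not unnamed_rows:
        raise ValueError("unnamed dataset is empty")
    if len(reference_header) != len(unnamed_rows[0]):
        raise ValueError("column counts differ")
    if _normalize(unnamed_rows) != _normalize(reference_rows):
        raise ValueError("cell values differ")
    return list(reference_header)
```

For example, `adopt_feature_names([["x", "o"], ["o", "x"]], ["f1", "f2"], [["o", "x"], ["x", "o"]])` returns `["f1", "f2"]`, while any differing cell raises `ValueError`. This check assumes the column order is the same in both copies; a reordered reference would need a separate mapping step.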

@trangdata
Collaborator Author

I think it's difficult for outsiders to help because we're not sure where the current datasets came from. I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

@lacava
Collaborator

lacava commented Apr 10, 2020

> I think it's difficult for outsiders to help because we're not sure where the current datasets came from.

Unfortunately we are all in that situation with this project. Fortunately, the source of most of these datasets is pretty obvious. If everyone tackled a few datasets and verified their origin (e.g. through a checksum as in here) we could quickly have origin information attached to most of the datasets. The only realistic way I see it happening is if everyone does a few and submits PRs.
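
One way the checksum check could be sketched, using only the Python standard library. The helper names `sha256_of_file` and `same_origin` are hypothetical, and the choice of SHA-256 is an assumption; the thread does not say which checksum was used.

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def sha256_of_file(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_origin(local_path: PathLike, reference_path: PathLike) -> bool:
    """True when the two files are byte-for-byte identical according to SHA-256."""
    return sha256_of_file(local_path) == sha256_of_file(reference_path)
```

A matching digest is strong evidence that the local file is the same as the candidate source file. A mismatch is weaker evidence: a copy that was re-encoded, reordered, or given a header row will hash differently even if the underlying data is the same, so a value-level comparison is still needed in that case.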

> I think in general it would also be helpful to add details/metadata for these datasets, e.g. source, meaning of features/classes, as asked here and wished here.

Agreed; that's discussed in issue #13. Since PR #11, metadata properties for the datasets have been extracted for the readme files.
