new method: explore: plot sample metadata categories/values #14

Open
nbokulich opened this issue Nov 8, 2017 · 3 comments

Comments

@nbokulich
Member

nbokulich commented Nov 8, 2017

Proposed Behavior
Example and idea provided by @elong0527 and issue moved from q2-longitudinal:

image

x-axis = time (or another continuous metadata column; possibly also support categorical columns?)

y-axis = subject ID (e.g., to support plotting individuals that are sampled repeatedly over time). This was originally planned for q2-longitudinal but should be generalized for non-longitudinal sampling designs. Perhaps the y-axis should be an optional parameter (if True, plot as a scatter plot; if False, plot a barplot?)

points colored by group category (should accept categorical or continuous metadata, infer the type, and color-code accordingly)

Questions
Could we also add a parameter to change the size or shape of points based on other optional metadata category inputs?
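The "infer type, and color-code accordingly" idea above could be sketched roughly as follows. This is only an illustration using pandas dtype checks; the function names `infer_column_kind` and `color_values` are hypothetical and are not part of q2-metadata or QIIME 2.

```python
import pandas as pd


def infer_column_kind(column: pd.Series) -> str:
    """Return 'continuous' for numeric columns, otherwise 'categorical'."""
    # Boolean columns count as numeric in pandas but behave like two groups.
    if pd.api.types.is_bool_dtype(column):
        return "categorical"
    if pd.api.types.is_numeric_dtype(column):
        return "continuous"
    return "categorical"


def color_values(column: pd.Series):
    """Return (kind, values); values are usable as `c=` in plt.scatter."""
    kind = infer_column_kind(column)
    if kind == "continuous":
        # Continuous data: pass the numbers through so a colormap applies.
        return kind, column.astype(float)
    # Categorical data: one integer code per distinct value.
    return kind, column.astype("category").cat.codes
```

A continuous column would then be drawn with a colorbar, while a categorical one would get a discrete legend with one entry per code.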

@elong0527

elong0527 commented Nov 9, 2017

Below is an implementation of the scatterplot. I am not sure which file I should save the function in, so I am keeping the code here :)

One thing I am not sure about is how QIIME 2 exports figures. @nbokulich could you help me with that? Thanks!

import qiime2


def design_plot(metadata: qiime2.Metadata,
                individual_id_column: str,
                individual_time_column: str,
                individual_group_column: str,
                fig_width: int,
                fig_height: int):

    # _load_metadata and _validate_input_columns are assumed to be the
    # q2-longitudinal helpers; they are not defined in this snippet.

    # load and prep metadata (no feature table is passed to this function,
    # so the metadata is not filtered against a table)
    sample_md = _load_metadata(metadata)

    # validate input columns  (# How could I ensure the time column is int/numeric?)
    _validate_input_columns(sample_md, individual_id_column, None, None, None)
    _validate_input_columns(sample_md, individual_time_column, None, None, None)
    _validate_input_columns(sample_md, individual_group_column, None, None, None)

    _design_plot(sample_md, individual_id_column, individual_time_column,
                 individual_group_column, fig_width, fig_height)


def _design_plot(sample_md,
                 individual_id_column,
                 individual_time_column,
                 individual_group_column,
                 fig_width,
                 fig_height):
    '''Function to create study design plot.
    sample_md: pd.DataFrame
        Sample metadata
    individual_id_column: str
        Metadata column containing IDs for individual subjects
    individual_time_column: str
        Metadata column containing sample collection time for individual subjects
    individual_group_column: str
        Metadata column containing group indicator of individual subjects
    fig_width: int
        Figure Width
    fig_height: int
        Figure Height
    '''

    sample_md = sample_md.rename(columns={individual_id_column: 'id',
                                          individual_time_column: 'time',
                                          individual_group_column: 'group'})

    sample_md["id_loc"] = sample_md["id"].astype('category').cat.codes
    # Keep for potential operation of the label
    sample_md["id_label"] = sample_md["id"]

    u_group = sample_md["group"].unique()
    n_group = len(u_group)
    sample_md_meta = sample_md[["id", "id_loc", "id_label"]]
    sample_md_meta = sample_md_meta.drop_duplicates().reset_index(drop=True)

    plt.figure(figsize=(fig_width, fig_height))

    # one scatter call per group so that each group gets its own legend entry
    for grp in u_group:
        _md = sample_md[sample_md["group"] == grp]
        plt.scatter(_md["time"], _md["id_loc"], label=grp)

    plt.xlabel(individual_time_column)
    plt.yticks(sample_md_meta["id_loc"], sample_md_meta["id_label"])
    plt.ylabel(individual_id_column)
    plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol=n_group)

# Test
from matplotlib import pyplot as plt
import pandas as pd

sample_md_fp = "ecam_map_maturity.txt"
# DataFrame.from_csv has been removed from pandas; read_csv replaces it
sample_md = pd.read_csv(sample_md_fp, sep='\t', index_col=0)
_design_plot(sample_md, "studyid", "month", "diet_3", 6, 8)
plt.show()
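On the question in the code comment above (how to ensure the time column is numeric), one possible approach is to coerce the column with pandas and fail loudly on anything that cannot be parsed. This is a minimal sketch, independent of any QIIME 2 API; the helper name `validate_numeric_column` is hypothetical.

```python
import pandas as pd


def validate_numeric_column(sample_md: pd.DataFrame, column: str) -> pd.Series:
    """Return the column as numeric values, or raise ValueError."""
    if column not in sample_md.columns:
        raise ValueError(f"{column!r} is not a column in the metadata")
    try:
        # errors="raise" makes pandas fail on values such as "week 2"
        return pd.to_numeric(sample_md[column], errors="raise")
    except (ValueError, TypeError) as err:
        raise ValueError(
            f"column {column!r} must contain only numeric values") from err
```

Calling this on the time column before plotting would turn a confusing matplotlib error into a clear message about which column is at fault.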

@nbokulich
Member Author

Thanks @elong0527! I think for now the best thing to do is to add these functions to my fork of q2-metadata. That way we can work together on this (e.g., I can review what you have put together and add a visualization template that displays the plots) before making a pull request into the main repository. @jairideout does this sound like a good plan?

@elong0527 could you please add these functions to a new file named _explore.py in this directory and make a pull request into my branch? Do not add the test that you wrote. We will work on tests later, after we figure out which test data, etc., we will use.

Also, now that this action is in q2-metadata instead of q2-longitudinal, we will probably want to make it usable on categorical data as well as numerical data. See the notes that I made in the first post in this thread. We should also test whether these scatter plots can still be made with categorical data on the x-axis.

If you have any questions on how to make a pull request into my fork, etc., please just email me directly.
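Regarding the earlier question about how figures get exported: as far as I understand it, a QIIME 2 visualizer receives an `output_dir` argument and writes an `index.html` plus any assets into that directory. The sketch below only illustrates that pattern with plain matplotlib and a hand-written HTML page. The function name `save_plot_with_index` and the file name `design_plot.png` are hypothetical, and a real plugin would more likely render a proper template than write HTML by hand.

```python
import os

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no display needed
from matplotlib import pyplot as plt


def save_plot_with_index(output_dir: str, fig,
                         filename: str = "design_plot.png") -> str:
    """Save `fig` into output_dir and write an index.html that shows it."""
    # bbox_inches="tight" keeps a legend placed below the axes in the image
    fig.savefig(os.path.join(output_dir, filename), bbox_inches="tight")
    index_fp = os.path.join(output_dir, "index.html")
    with open(index_fp, "w") as fh:
        fh.write(f'<html><body><img src="{filename}"></body></html>\n')
    return index_fp
```

With this shape, the plotting function would return or reuse the current figure instead of calling `plt.show()`, and the visualizer wrapper would pass it to the saving step.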

@jairideout
Member

> @jairideout does this sound like a good plan?

Sounds perfect!
