Commit

Merge pull request #1 from Creeki/patch-1
Update 6_filtering_variants.md
joanam authored Jan 18, 2024
2 parents f68d4a0 + 9902ab7 commit ba2485a
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions _pages/6_filtering_variants.md
@@ -78,7 +78,7 @@ Luckily, `vcftools` makes it possible to easily calculate these statistics. In t

#### Setting up

- Before we calculate our stats, lets make a little effort to make our commands simpler and also to ensure the output is written to the right place. First we need to make a directory for our results.
+ Before we calculate our stats, let's make a little effort to make our commands simpler and also to ensure the output is written to the right place. First we need to make a directory for our results.

```shell
mkdir ~/vcftools
@@ -260,7 +260,7 @@ summary(var_miss$fmiss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.01312 0.00000 0.93750

- Most sites have almost no issing data. Although clearly, there are sum (as the max value shows). This means we can be quite conservative when we set our missing data threshold. We will remove all sites where **over 10% of individuals are missing a genotype**. One thing to note here is that `vcftools` inverts the direction of missigness, so our 10% threshold means **we will tolerate 90% missingness** (yes this is confusing and counterintuitive... but that's the way it is!). Typically missingness of 75-95% is used.
+ Most sites have almost no missing data. Although clearly, there are some (as the max value shows). This means we can be quite conservative when we set our missing data threshold. We will remove all sites where **over 10% of individuals are missing a genotype**. One thing to note here is that `vcftools` inverts the direction of missingness, so our 10% threshold means **we will tolerate the minimum 90% call-rate** (yes this is confusing and counterintuitive... but that's the way it is!). Typically 75-95% is used.
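
To make the inverted threshold concrete, here is a minimal sketch that computes the call rate (the fraction of individuals with a genotype) per site and keeps only sites at or above 0.9. The toy genotype table, the site names and the `MIN_CALL_RATE` variable are all hypothetical and invented for illustration; this is plain `awk`, not `vcftools`, and it does not check how `vcftools` treats sites exactly at the boundary.

```shell
# Hypothetical toy data: one line per site, then genotypes for 10 individuals.
# "./." marks a missing genotype.
toy=$(mktemp)
cat > "$toy" <<'EOF'
site1 0/0 0/1 1/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0
site2 0/0 ./. 1/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0
site3 ./. ./. 1/1 0/0 0/0 0/1 0/0 0/0 0/0 0/0
EOF

# A value of 0.9 is a minimum call rate, i.e. at most 10% missing genotypes.
MIN_CALL_RATE=0.9

# Count non-missing genotypes per site and keep sites whose call rate
# (called / total individuals) reaches the threshold.
kept_sites=$(awk -v min="$MIN_CALL_RATE" '{
  called = 0
  for (i = 2; i <= NF; i++) if ($i != "./.") called++
  if (called / (NF - 1) >= min) print $1
}' "$toy")

echo "$kept_sites"
rm -f "$toy"
```

In this toy table, `site3` has 2 of 10 genotypes missing (call rate 0.8), so it is the only site dropped.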

### Minor allele frequency

@@ -404,7 +404,7 @@ What have we done here?
* `--gzvcf` - input path -- denotes a gzipped vcf file
* `--remove-indels` - remove all indels (SNPs only)
* `--maf` - set minor allele frequency - 0.1 here
- * `--max-missing` - set minimum missing data. A little counterintuitive - 0 is totally missing, 1 is none missing. Here 0.9 means we will tolerate 10% missing data.
+ * `--max-missing` - set minimum non-missing data. A little counterintuitive - 0 is totally missing, 1 is none missing. Here 0.9 means we will tolerate 10% missing data.
* `--minQ` - this is just the minimum quality score required for a site to pass our filtering threshold. Here we set it to 30.
* `--min-meanDP` - the minimum mean depth for a site.
* `--max-meanDP` - the maximum mean depth for a site.
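
Putting the options above together, a filtering call might look roughly like the sketch below. The input and output paths and the two depth thresholds are hypothetical placeholders, not values taken from this page; only the 0.1, 0.9 and 30 settings come from the list. The gzipped-input option is spelled `--gzvcf`, as in the `vcftools` manual, and `--recode --stdout` is assumed here as the way to write the filtered records out.

```shell
# Hypothetical paths: replace with your own files.
VCF_IN="input.vcf.gz"
VCF_OUT="filtered.vcf.gz"

# Thresholds named in the list above.
MAF=0.1     # minor allele frequency
MISS=0.9    # minimum call rate, i.e. tolerate 10% missing data
QUAL=30     # minimum site quality

# Placeholder depth limits, assumed for illustration only; choose them
# from your own depth distribution.
MIN_DEPTH=10
MAX_DEPTH=50

# Only run if vcftools is installed and the input exists.
if command -v vcftools >/dev/null 2>&1 && [ -f "$VCF_IN" ]; then
  vcftools --gzvcf "$VCF_IN" \
    --remove-indels --maf "$MAF" --max-missing "$MISS" --minQ "$QUAL" \
    --min-meanDP "$MIN_DEPTH" --max-meanDP "$MAX_DEPTH" \
    --recode --stdout | gzip -c > "$VCF_OUT"
else
  echo "vcftools or $VCF_IN not found; nothing was filtered" >&2
fi
```

Keeping the thresholds in variables makes it easy to rerun the filter with different settings and to record which values produced a given output file.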
