02-IdentifyRepeats.Rmd

# Materials and Methods{-}

## DNA Sequencing{-}

## Assembly{-}

## Identify repeats (RepeatModeler){-}

Make an annotation directory in /data/fish/

Copy the 3 genome assemblies into them and unzip them.

Run RepeatModeler:

+ Input
    + <YourSeq.fasta>
    
+ Output
    + YourSeq.Consensi.fa.classified
    + many other files


```
export PATH=/home/jm/sw/miniconda3/bin/:$PATH

/home/jm/saved/Perlscripts/annotation/run-repeat.bash <fasta> <my_db_name>

```

## Softmask genome (RepeatMasker){-}

Link to the classified file in the dated directory that RepeatModeler makes
(my links didn’t work)

Softmask each in a separate directory first.
I did softmask in the dated directory that Repeat Modeler made for each  

Put whole path to RepeatMasker because I have a different miniconda:

```
/home/jm/sw/miniconda3/bin/RepeatMasker --help	

/home/jm/saved/Perlscripts/annotation/run-repeat-masker.bash
```


NOTE: BRAKER software:
https://github.com/Gaius-Augustus/BRAKER

UPDATE:

Template:

```
/home/jm/saved/Perlscripts/annotation/run-repeat-masker.bash <fasta> <fa.classified>
```

+ Input
    + <YourSeq.fasta>
    + <YourSeq.consensi.fa.classified>
    
+ Output
    + <YourSeq.consensi.fa.classified.gff>
    + <YourSeq.consensi.fa.classified.masked>
    + <YourSeq.consensi.fa.classified.?>
    + <YourSeq.consensi.fa.classified.?>


Code that worked:

```
/home/jm/saved/Perlscripts/annotation/run-repeat-masker.bash /data/fish/Annotation/Cbrontotheroides.fasta CbronConsensi.fa.classified
```

```
/home/jm/saved/Perlscripts/annotation/run-repeat-masker.bash /data/fish/Annotation/Cdesquamator.fasta CdesConsensi.fa.classified
```

```
/home/jm/saved/Perlscripts/annotation/run-repeat-masker.bash /data/fish/Annotation/Cvariegatus.fasta  CvarConsensi.fa.classified
```

run-repeat-masker.bash:

```
#! /usr/bin/bash
# Link to the *classified file in the dated directory it makes.
# Feed in the fasta file and the classified file
# repeat environment
#/home/jm/miniconda3/envs/repeat/bin/RepeatMasker
/home/jm/miniconda3/envs/repeat/bin/RepeatMasker -pa 16 -gff -nolow -lib $2 $1
```

+ pa - parallel, to use multiple processors

+ gff - creates an additional General Feature Finding format output (http://www.sanger.ac.uk/Software/GFF)

+ nolow - does not mask low complexity DNA or simple repeats, does not display simple repeats or low_complexity DNA in the annotation

+ lib - use the -lib option to specify a custom library of sequences to be masked in the query

Notes:

You can combine the repeats available in the RepeatMasker library  with a custom set of consensus sequences.  To accomplish this use the famdb.py tool:

```
`./famdb.py -i Libraries/RepeatMaskerLib.h5 families --format fasta_name --ancestors --descendants 'species name' --include-class-in-name`
```

## Trim RNAseq{-}

### RNAseq data acquisition{-}

First get the accession list of RNAseq SRA files from NCBI/SRA:

Cvar:
https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=8&WebEnv=MCID_66031893c3876c0c0e7d41eb&o=acc_s%3Aa 

Cbron:
https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=4&WebEnv=MCID_65f47f2d1a0a6262163f5ef1&o=acc_s%3Aa 

Cdes:
https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=3&WebEnv=MCID_65f47f2d1a0a6262163f5ef1&o=acc_s%3Aa

+ This accession list will be in a .txt format. 

+ It includes all paired-end, RNA, SRA entries per species. 

+ In order to pull all of the SRRxxxxxxxx entries to inbre01 (the server) use the following fastq dump command saved under JM's home directory:

+ **Input**
    + <YourSRAaccessionList.txt>
    
+ **Output** (pair for every SRR on YourSRAaccessionList.txt)
    + <SRRxxxxxxxx_1.fastq>
    + <SRRxxxxxxxx_2.fastq>

```
for i in `cat CbronList.txt`; do
/home/jm/sw/sratoolkit.3.1.0-centos_linux64/bin/fasterq-dump $i -S
done

```

To run in background in parallel on multiple threads:

```
for i in `cat CbronList.txt`; do /home/jm/sw/sratoolkit.3.1.0-centos_linux64/bin/fasterq-dump $i -S -e40  done
```

Each SRR- entry downloaded from the sequence read archive (SRA) has **2 reads per run**. 
By running the command above on the list.txt accession file (downloaded from NCBI SRA) we will get an SRRxxxxxxxx_1 and an SRRxxxxxxxx_2 file per entry. 

### Trim reads{-}

+ **Input** (pair for every SRR on YourSRAaccessionList.txt)
    + <SRRxxxxxxxx_1.fastq>
    + <SRRxxxxxxxx_2.fastq>
    
+ **Output**
    + <trimmed.>

Now we need to trim reads _1 and _2. Because we have multiple sequences and 2 reads per sequence, we need to run a loop. Something like the following:

This failed:

```
for i in *_1.fastq; do
	bn=`basename $i _1.fastq`
	~/saved/Perlscripts/annotation/run-trim-galore.sh $i ${bn}_2.fastq &
done

```

What Joann ran:

```
for i in `cat list.txt`; do
 /home/jm/saved/Perlscripts/annotation/run-trim-galore.sh ${i}_1.fastq ${i}_2.fastq &
done
```

might need to run before:

```
export PATH=$PATH:/home/agomez/miniconda3/bin/
```

The scripts are saved here:

```
~/saved/Perlscripts/annotation/run-trim-galore.sh forloop ${i}_r1 <read_1> <read_2>
```

run-trim-galore.sh:

```
#! /usr/bin/bash
# read1 read2
# Installed in Adam's base environment
export PATH=$PATH:/home/agomez/miniconda3/bin/trim_galore
trim_galore --paired $1 $2

```
–paired - 


## Alignment of RNA-seq reads to the masked genome assembly (HISAT2){-}

```
/home/jm/saved/Perlscripts/run-hisat2-build.sh <genome_fasta> <my_hisat2_db_name>

```

```
/home/jm/saved/Perlscripts/run-hisat2-paired.sh <ht2-ix> <trimmed_read1> <trimmed_read2> <sam_output>
```

## Convert to BAM and sort bam files by reads (samtools sort -n){-}

```
/home/jm/saved/Perlscripts/annotation/run-bam-sort-n.sh <sam-file>

```

## Annotation (BRAKER3){-}

```
/home/jm/saved/Perlscripts/annotation/run-braker-3.bash output_directory_name genome_name "comma-sep list of bam files"
```

## Cyprinodon Species{-}

How captured
Grown


## Experimental design (wetlab){-}
Dna extraction


Provenance

+ OS
+ Software
+ Versions
+ Dependencies