Skip to content

Latest commit

 

History

History
170 lines (92 loc) · 4.24 KB

File metadata and controls

170 lines (92 loc) · 4.24 KB

1. Background

Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.




Source: https://en.wikipedia.org/wiki/Gene_prediction

We perform gene prediction with the GeneMark-EP + AUGUSTUS + BRAKER2 pipeline. This pipeline combines extrinsic and ab initio approaches by mapping protein and EST data to the genome to validate ab initio predictions. Augustus results can also incorporate hints in the form of EST alignments or protein profiles to increase the accuracy of the gene prediction.

See: http://opal.biology.gatech.edu/GeneMark/

Then we perform protein function prediction using InterProScan and blastp with the UniprotDB/swissprot.

See:

https://interpro-documentation.readthedocs.io/en/latest/interproscan.html

https://www.uniprot.org/

2. Dependencies

3. Gene Prediction

The script prothint_genemark_braker_slurm.sh runs the gene prediction pipeline.

3.1 Setup

  • Copy the key files to your home folder. GeneMark looks for them in the $PATH global variable

  • Download a curated database of close orthologs in fasta format. We chose https://v100.orthodb.org/download/odb10_arthropoda_fasta.tar.gz Must edit the database and replace ambiguous amino acids such as * with Ns. The gene predictions will be as good as this databasel so it is essential that you select a set of curated orthologs that is closest to your assembly in the taxonomy tree.

  • Sequence headers with long names should be avoided. Make sure that your database of orthologs and your genome have short headers.

3.2 Run the gene prediction script

Edit the script prothint_genemark_braker_slurm.sh with your information

To run the script

sbatch prothint_genemark_braker_slurm.sh

The execution could take from 4 hours to more than a day.

3.3 Outputs

braker/
cufflinks/
EB15_shortHeaders.fa.gz
genemark/
prothint/
  • Gene models are found in prothint and genemark folders

  • proteome in fasta and gff formats are in cufflinks folder

3.4 run QC

The script BUSCO_protein.sh generates proteome completeness report calculated with BUSCO.

4. Protein Function Prediction

We used InterProScan and blastp with the UniprotDB/swissprot databases

4.1. Setup

The script is called protein_annotation_slurm.sh and it requires other programs that are provided with this repo.

  • Copy the scripts to your path

  • Index the databases for blast.

  • Edit the script with your information.

  • The input file is called proteins.fa that the previous step produced. It is located inside the cufflinks folder.

4.2 Run the script

sbatch protein_annotation_slurm.sh

The execution could take a few hours.

4.3 Output

Example:

braker/
BUSCO_proteins_vs_insecta/
cufflinks/
EB15_output.blastp
EB15_output.iprscan
EB15_proteins.putative_function.domain_added.gff
EB15_proteins.putative_function.fasta.gz
EB15_proteins.putative_function.gff
EB15_proteins.putative_function_rename.gff
EB15_shortHeaders.fa.gz
genemark/
prothint/

The BUSCO results are inside the BUSCO_proteins_vs_insecta/ folder

The annotated proteome in fasta and gff format are the files:

  • EB15_proteins.putative_function.fasta.gz

  • EB15_proteins.putative_function_rename.gff