AnnoTEP is a platform for annotating Transposable Elements in Plants, built on the widely adopted EDTA pipeline (Ou et al., 2019; PMID: 31843001). Given an input file in FASTA format containing a plant genome assembled at chromosome or contig/scaffold level, it provides:

Annotation of all Class I and Class II elements.

Data visualization in graphic format and phylogenetic trees.

Executable via CLI or GUI with GitHub/Docker/Singularity


Obtaining AnnoTEP

AnnoTEP can be obtained in different ways: by downloading the compressed installation package available on this site, by accessing the official repository on GitHub, or by using the containerised (Docker/Singularity) versions available on this website. For detailed installation and configuration instructions, see "Help > Downloading and Configuring AnnoTEP".

Command Line - Docker

docker pull annotep/annotep-cli:v1

Graphical Interface - Docker

docker pull annotep/annotep-gui:v1

Mutation rate table

The table provides suggested values for use during the tool's annotation process, considering different contexts and scenarios. These values were calculated based on statistical analyses and literature reviews, aiming to offer a reliable reference for estimating mutation rates in various organisms or systems.

Use this table as an initial guide, but always validate the values with empirical data and adjust them as needed for your project. For more information, access "Help > General Recommendations for Using Mutation Rates to Calculate LTR Ages" or "General Mutation Rates by Ecological Category" via the side menu.

Preprocessed Genomes

AnnoTEP documentation - Annotation of Transposable Elements in Plants

Introduction

Welcome to the AnnoTEP documentation. AnnoTEP is a specialised tool for the annotation of transposable elements (TEs) in plant genomes. Built on the widely adopted EDTA pipeline (Ou et al., 2019; PMID: 31843001), it extends EDTA with additional features that enhance the analysis and interpretation of genomic elements.

AnnoTEP has been designed to meet the demands of researchers working with plant genomes, providing greater accuracy and detail in the identification and classification of TEs. Among its main capabilities, the following stand out:

  • Detection of non-autonomous LTR-RTs, such as TRIM, LARD, TR_GAG, and BARE-2.
  • Advanced classification of the Copia and Gypsy superfamilies at the lineage level, following the criteria established by Orozco et al. (2019; PMID: 31390781).
  • Greater accuracy in the classification of autonomous TIRs.
  • Annotation of Helitrons, distinguishing between autonomous and non-autonomous elements.
  • Application of appropriate soft masking for subsequent analyses.
  • Generation of detailed classification reports.
  • Creation of graphical outputs, including phylogenetic trees and age analyses.

How to Use the Tool

AnnoTEP is a tool capable of adapting to the needs and skills of the user, catering to both researchers with little technical experience and users specialised in annotations. It offers two distinct interfaces – a Graphical User Interface (GUI) and a Command-Line Interface (CLI).

AnnoTEP GUI (Graphical User Interface)

The AnnoTEP GUI has been developed for researchers who prefer a visual and interactive approach. With customised fields and menus, the GUI simplifies the input and manipulation of genomic data.

GUI Features:

  • Installation: Available via GitHub, as a Docker image on Docker Hub or through conversion of the Docker image to Singularity;
  • Execution: The tool runs locally via localhost after installation;
  • Notification System: Receive email alerts about the status of the annotation process.

AnnoTEP CLI (Command Line Interface)

For experienced users, the CLI offers a more technical approach. Predefined commands and specific parameters allow greater control over the annotation process, following the EDTA pipeline terminology.

CLI Features:

  • Installation: Available via GitHub, as a Docker image on Docker Hub or through conversion of the Docker image to Singularity;
  • Execution: Like traditional pipelines, the CLI runs directly in the terminal and displays the entire annotation process in real-time, without a notification system.

Recommendations

The time required to analyse a genome depends on its size and on the user's hardware. The more resources available, the faster the analysis will be.

System requirements

Software

Hardware

Minimum requirements for both versions, for genomes up to 1 GB

More resources are recommended for larger genomes.

General Recommendations for Using Mutation Rates to Calculate LTR Ages

  1. Understanding LTR Age Calculation:
    • The age of an LTR retrotransposon can be estimated by comparing the divergence between the 5' and 3' LTR sequences of the same retrotransposon. The assumption here is that these sequences were identical at the time of insertion and have diverged due to mutations over time.
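    • The formula commonly used is: Age = Divergence / (2 × Mutation Rate), where Divergence is the genetic distance between the two LTR sequences (substitutions per site) and Mutation Rate is expressed in substitutions per site per year, so that Age is obtained in years. The factor of 2 accounts for both LTRs accumulating mutations independently.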
  2. Accurate Divergence Estimation:
    • Use reliable bioinformatics tools to accurately measure the sequence divergence between the LTRs. Tools like LTR_retriever provide mechanisms to identify LTRs and calculate divergence.
    • Ensure that the alignment and comparison of LTR sequences are accurately performed to avoid underestimation or overestimation of divergence.
  3. Appropriate Mutation Rate:
    • Use species-specific mutation rates when available. The choice of rate is critical: the estimated age is inversely proportional to it, so halving the rate doubles the age.
    • If species-specific mutation rates are not available, use rates from closely related species or general rates for the plant family as a proxy, acknowledging the potential for error this introduces.
  4. Literature Review for Validation:
    • Review recent literature to validate the mutation rates and the methodologies used for similar studies in the same or related species. This can help confirm that your approach is aligned with current scientific standards.
    • Especially look for studies that have used LTR_retriever or similar tools in the same species for comparisons.
  5. Consideration of Evolutionary and Environmental Factors:
    • Remember that mutation rates can be influenced by various factors including environmental stress, life history traits, and population dynamics. These factors might cause the actual mutation rate in certain environments or periods to deviate from the average.
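
As a worked illustration of the age formula above, here is a minimal Python sketch. The function name `ltr_age_years` and the divergence value are hypothetical and chosen only for illustration; the rate of 7.0e-9 is the value passed to -u in the Arabidopsis thaliana test run described in this documentation.

```python
def ltr_age_years(divergence: float, mutation_rate: float) -> float:
    """Estimate insertion age in years as divergence / (2 * mutation_rate).

    divergence    -- substitutions per site between the 5' and 3' LTRs
    mutation_rate -- substitutions per site per year (must be positive)
    """
    if divergence < 0:
        raise ValueError("divergence must be non-negative")
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    return divergence / (2.0 * mutation_rate)


# Hypothetical element: 1.4% divergence between its two LTRs.
age = ltr_age_years(0.014, 7.0e-9)
print(f"Estimated age: {age / 1e6:.2f} million years")
```

With these inputs the estimate is about one million years. The same divergence with a faster rate would give a proportionally younger age.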

The mutation rate list provided below can be a valuable resource for calculating the ages of LTR retrotransposons. However, this list should be used with caution due to several important considerations:

  1. Species-Specific Variability:
    • Mutation rates can vary significantly even within a single species due to environmental factors, genetic background, and historical population dynamics. The rates provided are averages and may not capture this intra-species variability.
  2. Generalization Risks:
    • Using mutation rates from closely related species or generalized rates for an entire plant family can introduce errors. Such rates might not accurately reflect the specific evolutionary pressures and genetic history of the species of interest.
  3. Methodological Differences:
    • The methods used to estimate these mutation rates might differ, affecting their accuracy. Some rates might be derived from lab observations under controlled conditions, which may not perfectly mimic natural environments.
  4. Evolutionary and Environmental Influences:
    • Mutation rates are influenced by numerous factors including climate, soil conditions, and exposure to mutagens, which can fluctuate over time and across geographies. This context-dependent nature of mutation rates can lead to underestimations or overestimations of LTR ages.
  5. Technological and Analytical Limitations:
    • The precision of mutation rate calculations and the subsequent age estimations of LTR retrotransposons rely heavily on the technology and algorithms used in their determination. Advances in sequencing technology or bioinformatics tools may refine these rates, potentially altering previous calculations.
  6. Literature Support:
    • It is crucial to consult the latest peer-reviewed studies for the most recent and robust mutation rates and to understand the context in which they were measured. Research publications often provide more nuanced insights into the conditions and accuracy of reported mutation rates.

Recommendations

When using this list to calculate LTR ages, clearly state any assumptions made about mutation rates and the potential sources of error in your methods and results. Consider validating your findings with multiple approaches and seek peer feedback or additional data where possible. Always stay updated with the latest research and methodological advances that may impact the interpretation of these rates.
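
One way to make the assumptions explicit is to report an age range across several plausible mutation rates instead of a single value. The sketch below is only an illustration: `age_range_years` is a hypothetical helper, the divergence is invented, and the two rates are example values (7.0e-9 is the rate used in the test run in this documentation; 1.3e-8 is a second, purely illustrative rate).

```python
from typing import Iterable, Tuple


def age_range_years(divergence: float, rates: Iterable[float]) -> Tuple[float, float]:
    """Return (youngest, oldest) age estimates for one divergence value."""
    ages = []
    for rate in rates:
        if rate <= 0:
            raise ValueError("mutation rates must be positive")
        # Same formula as in the text: Age = Divergence / (2 x Mutation Rate)
        ages.append(divergence / (2.0 * rate))
    if not ages:
        raise ValueError("at least one mutation rate is required")
    return min(ages), max(ages)


candidate_rates = [7.0e-9, 1.3e-8]  # illustrative values, not recommendations
youngest, oldest = age_range_years(0.02, candidate_rates)
print(f"Age between {youngest:,.0f} and {oldest:,.0f} years")
```

For a divergence of 0.02 the two rates give roughly 0.77 and 1.43 million years, which shows how strongly the choice of rate drives the result.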

General Mutation Rates by Ecological Category

  1. Tropical Plants:
    • Tropical plants often have higher rates of growth and reproduction, which could lead to higher mutation rates. However, the rich biodiversity and complex interactions in tropical ecosystems might also promote genetic stability to some extent.
  2. Aquatic Plants:
    • Aquatic environments provide a relatively stable thermal environment but can expose plants to varying levels of UV radiation and other mutagenic factors depending on water clarity and depth. This rate assumes a moderate mutation rate reflecting these mixed conditions.
  3. Arid and Desert Plants:
    • Plants in arid or desert environments are exposed to extreme conditions that can increase oxidative stress and potential DNA damage, possibly leading to slightly higher mutation rates.
  4. Arctic and Alpine Plants:
    • The harsh, cold environments can slow metabolic processes and potentially reduce mutation rates. These plants also have longer life spans and slower growth rates, which might contribute to a lower rate of mutation accumulation.
  5. Temperate Forest Plants:
    • This rate is based on the assumption that temperate plants experience seasonal variations that might impact their metabolic rates and, consequently, their mutation rates. This is a mid-range estimate considering the moderate environmental stresses.

Notes on General Mutation Rates by Ecological Category Estimations

  • These estimates are highly speculative and should be used with caution in scientific contexts. They are based on ecological reasoning rather than direct experimental evidence, which is the ideal method to determine such rates.
  • Mutation rates can vary widely even within a single ecological category due to species-specific factors, including life cycle length, reproductive strategy, and exposure to environmental mutagens.

Suggested Use

These general rates can be useful for preliminary models or simulations in ecological genetics and evolutionary studies. They provide a starting point for discussions about how different environments might influence genetic variability in plants. However, for rigorous scientific research, specific studies and data are always recommended.

Preprocessed Genomes

Here, you will find a list of genomes that have been analysed using AnnoTEP. These genomes have been carefully tested and processed, and the results are available for consultation. To access these results, simply click on the plant image, and you will be redirected to a new page.

What will you find in the preprocessed genomes?

Each preprocessed genome contains the following results and analyses:

  1. Annotated data:
    • This section contains the annotated data, including FASTA (.fa) and GFF3 files with a detailed classification of transposable elements (TEs), as well as additional files with the graphical outputs and the data used to generate them.
  2. TE classification table and genomic distribution:
    • A table that categorises elements hierarchically by order, superfamily, and autonomy. It also calculates metrics such as size and the percentage of each identified element in the analysed genome.
  3. RepeatLandscape graphic:
    • A graph providing an overview of the distribution of TEs in the analysed genome.
  4. LTR Age Graph:
    • Graphs representing the age of the Gypsy and Copia superfamilies.
  5. Phylogenetic Tree and Density Graph:
    • Graphs representing the phylogeny of LTR lineages.

Downloading and Configuring AnnoTEP

GitHub

  1. In a terminal, download the repository:
     git clone https://github.com/Marcos-Fernando/AnnoTEP.git $HOME/AnnoTEP
  2. Enter the folder:
     cd $HOME/AnnoTEP

Installing with library and conda

Installing Miniconda

  1. Download the Miniconda installer for Linux (Miniconda3-latest-Linux-x86_64.sh).
  2. After downloading it, run the following command in your terminal:
     bash Miniconda3-latest-Linux-x86_64.sh

Once Miniconda is installed, move into the AnnoTEP directory and set up the environment as follows:

cd $HOME/AnnoTEP
conda env create -f environment.yml
conda activate EDTA-new

Still within the AnnoTEP directory, copy the break_fasta.pl script to /usr/local/bin to make it accessible system-wide:

sudo cp Scripts/break_fasta.pl /usr/local/bin

RepeatMasker Fixes for Long Names

During execution, you may encounter the following error: FastaDB::_cleanIndexAndCompact(): Fasta file contains a sequence identifier which is too long ( max id length = 50 ). To fix this issue, follow the steps below:


  1. Edit the RepeatMasker file
    • Open the RepeatMasker script installed in the Conda environment:
      /home/"user"/miniconda3/envs/EDTA-new/bin/RepeatMasker
    • Locate every FastaDB call in which the following snippet appears:
      my $db = FastaDB->new(
        fileName => $file,
        openMode => SeqDBI::ReadWrite,
        maxIDLength => 50
      );
    • Change the value of maxIDLength from 50 to a higher value, for example:
      my $db = FastaDB->new(
        fileName => $file,
        openMode => SeqDBI::ReadWrite,
        maxIDLength => 80
      );
  2. Edit the ProcessRepeats file
    • Open the ProcessRepeats script:
      /home/"user"/miniconda3/envs/EDTA-new/share/RepeatMasker/ProcessRepeats
    • Repeat the same procedure to change the value of maxIDLength to 80.
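
The manual edit above could also be scripted. The following Python sketch is an assumption about how one might automate it, not part of AnnoTEP: `raise_max_id_length` and `patch_file` are hypothetical helpers that replace `maxIDLength => 50` with a new value and keep a `.bak` copy of the original file.

```python
import re
from pathlib import Path
from typing import Tuple

# Matches "maxIDLength => 50" with flexible spacing; \b avoids touching 500, 501, ...
PATTERN = re.compile(r"(maxIDLength\s*=>\s*)50\b")


def raise_max_id_length(text: str, new_value: int = 80) -> Tuple[str, int]:
    """Return the patched text and the number of occurrences replaced."""
    return PATTERN.subn(lambda m: f"{m.group(1)}{new_value}", text)


def patch_file(path: Path, new_value: int = 80) -> int:
    """Patch a script in place, writing a .bak backup first. Returns the count."""
    # latin-1 round-trips every byte, so the rest of the file is left untouched.
    original = path.read_text(encoding="latin-1")
    patched, count = raise_max_id_length(original, new_value)
    if count:
        path.with_name(path.name + ".bak").write_text(original, encoding="latin-1")
        path.write_text(patched, encoding="latin-1")
    return count


# Demonstration on an in-memory snippet rather than a real installation.
snippet = "my $db = FastaDB->new(\n  fileName => $file,\n  maxIDLength => 50\n);"
patched, n = raise_max_id_length(snippet)
print(f"{n} occurrence(s) updated")
```

To apply it to a real installation, call `patch_file` once for the RepeatMasker script and once for ProcessRepeats, using the paths listed in the steps above.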

Testing

  1. Download the genome (Arabidopsis thaliana):
    • Download the TAIR10_chr_all.fas.gz file from the TAIR website into the AnnoTEP directory and extract its contents. The cut command keeps only the first space-delimited field of each line, which shortens the sequence identifiers. The resulting At.fasta must sit in the AnnoTEP directory, because the next step reads it as ../At.fasta:
      gzip -d TAIR10_chr_all.fas.gz
      cat TAIR10_chr_all.fas | cut -f 1 -d" " > At.fasta
      rm TAIR10_chr_all.fas
  2. Inside the AnnoTEP directory, run EDTA on the downloaded genome:
     cd $HOME/AnnoTEP
     mkdir Athaliana
     cd Athaliana
     nohup ../EDTA/EDTA.pl --genome ../At.fasta --species others --step all --sensitive 1 --anno 1 --threads 20 -u 7.0e-9 > EDTA.log 2>&1 &
  3. Monitor the progress:
     tail -f EDTA.log

Adjust the number of threads - Set --threads according to the capacity of your machine or server. For the shortest run time, use as many threads as are available. In the example above, it is set to 20.

Improving TE detection - Enable --sensitive 1 for more accurate TE detection and annotation. This option runs RepeatModeler to identify additional TEs and repeat sequences, and it also provides Superfamily and Lineage-level classifications.

Enhancing genome analysis with mutation rate - For a more refined analysis of TE insertion age, we recommend setting the mutation rate using the -u parameter. Suggested values and detailed explanations can be found in the Genome section.
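
Because long sequence identifiers trigger the RepeatMasker error described earlier (max id length = 50), it can help to check the headers before starting a long run. The sketch below is a hypothetical pre-check, not part of AnnoTEP: `simplify_headers` mirrors the effect of `cut -f 1 -d" "` on header lines, and `long_ids` lists identifiers above a chosen limit. The example header is invented for illustration.

```python
from typing import Iterable, Iterator, List


def simplify_headers(lines: Iterable[str]) -> Iterator[str]:
    """Keep only the first space-delimited field of each FASTA header line."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line.split(" ", 1)[0]
        else:
            yield line  # sequence lines are passed through unchanged


def long_ids(lines: Iterable[str], max_len: int = 50) -> List[str]:
    """Return sequence identifiers (without '>') longer than max_len characters."""
    return [l[1:] for l in lines if l.startswith(">") and len(l) - 1 > max_len]


# Invented two-line example; a real run would iterate over the genome file.
example = [">Chr1 CHROMOSOME dumped from ADB\n", "CCCTAAACCC\n"]
cleaned = list(simplify_headers(example))
print(cleaned)
print("identifiers that are too long:", long_ids(cleaned))
```

If `long_ids` returns anything for your genome, either shorten those identifiers or raise maxIDLength as described in the RepeatMasker section.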

Generating Graphs

  1. Run the processing script: With the Conda environment still activated, navigate to the folder where the annotated genome was stored (e.g., Athaliana) and run the script below to generate summary data and graphs from the input genome (e.g., At.fasta):
     cd $HOME/AnnoTEP/Athaliana
     bash -u ../Scripts/generate_PLOTs-for-TE-pipe.sh At.fasta

     Replace At.fasta with the name of your input genome file if it is different.

     At the end of the analysis, a directory named REPORT will be created. It contains all the outputs, including bubble and bar plots, phylogenetic trees, and summary reports.


Using AnnoTEP with Graphical User Interface

Recommendations

Before proceeding, make sure the Conda environment is properly set up and activated.

  1. Navigate to the graphic-interface folder within the AnnoTEP directory:
     cd $HOME/AnnoTEP/graphic-interface
  2. Configure .flaskenv: With the Conda environment still active, create and configure a .flaskenv file. This file defines essential Flask settings and, optionally, enables email functionality.
    • Create the .flaskenv file with the following content:
      FLASK_APP = "main.py"
      FLASK_DEBUG = True
      FLASK_ENV = development
    • If you plan to use the built-in email notification system, also include the following email configuration:
      MAIL_SERVER=server-email
      MAIL_PORT=number
      MAIL_USE_TLS=True or False
      MAIL_USE_SSL=True or False
      MAIL_USERNAME=your@email.com
      MAIL_PASSWORD=app*password*

    Email Server Settings:

    App Password for Gmail:

    To use Gmail securely, create an app-specific password:


    Security Recommendations:

  3. Run the Application: Within the graphic-interface folder, and with the Conda environment activated, start the application by running the following command:
     flask run
  4. Access the Platform: Open http://127.0.0.1:5000/ in your browser to access the platform and start testing it.
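
Before starting the application, it may be useful to confirm that the .flaskenv file contains every mail setting listed above. The following Python sketch is a simplified check written for illustration only. It is not how Flask loads the file, `parse_env` and `missing_mail_keys` are hypothetical helpers, and the example values are placeholders.

```python
from typing import Dict, List

MAIL_KEYS = [
    "MAIL_SERVER", "MAIL_PORT", "MAIL_USE_TLS",
    "MAIL_USE_SSL", "MAIL_USERNAME", "MAIL_PASSWORD",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Tolerate spaces around '=' and optional double quotes around the value.
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_mail_keys(settings: Dict[str, str]) -> List[str]:
    """Return the mail settings that are absent or empty."""
    return [key for key in MAIL_KEYS if not settings.get(key)]


# Placeholder content; a real check would read the .flaskenv file instead.
example = 'FLASK_APP = "main.py"\nMAIL_SERVER=smtp.example.org\nMAIL_PORT=587\n'
settings = parse_env(example)
print("Missing mail settings:", missing_mail_keys(settings))
```

An empty list means all six mail settings are present. Whether the values are accepted still depends on your email provider.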

Docker

Docker GUI

Recommendations

If you intend to use the email notification system, please note that your machine must have access to the internet for this feature to function properly.

Open the terminal and run the following commands:

  1. Download the AnnoTEP image:
     docker pull annotep/annotep-gui:v1
  2. Run the container, specifying a folder on your machine to store the annotation results:
     docker run -it -v <path-to-results-folder>:/usr/local/AnnoTEP/graphic-interface/results -dp 0.0.0.0:5000:5000 annotep/annotep-gui:v1

    Parameter descriptions:

    • -v <path-to-results-folder>:/usr/local/AnnoTEP/graphic-interface/results: Mounts a folder from your machine into the container to store results. Replace <path-to-results-folder> with the path to a folder on your machine. If the folder doesn't exist, Docker will create it. The path /usr/local/AnnoTEP/graphic-interface/results is the directory inside the container and should not be changed.
    • -dp 0.0.0.0:5000:5000: Runs the container in the background (-d) and publishes port 5000 of the container on port 5000 of the host (-p).
    • annotep/annotep-gui:v1: Specifies the Docker image to use.
  3. Access the AnnoTEP Interface: After running the container, open your browser and go to the following address to use the graphical interface: http://127.0.0.1:5000
  4. Submit Data for Analysis: In the graphical interface, input the required data, such as:
    • Email Address: To receive notifications about the process status (optional).
    • Genome: The genome file to be analysed.
    • Features: Choose the type of analysis to be performed.

    Once the process is complete, you will receive an email confirming whether it finished successfully or with errors. The email will include:

    • The name of the generated folder (available in the results directory specified via -v <path-to-results-folder>);
    • A detailed log of the annotation steps;

    Recommendations

    Avoid shutting down your machine during the process, as this may interrupt the analysis. Even when using the web interface, processing occurs locally on your machine.

    Annotation speed depends on your machine's performance. Ensure your system meets the recommended requirements for optimal results.

Docker CLI

Follow the steps below to download and configure the AnnoTEP CLI. This version is ideal for advanced users who prefer greater control and customization via commands.

  1. Download the AnnoTEP CLI image:
     docker pull annotep/annotep-cli:v1
  2. Display the User Guide: Use the -h parameter to display a detailed guide on how to use the script. The guide lists each command flag, its description, and whether it is required:
     docker run annotep/annotep-cli:v1 python run_annotep.py -h
  3. Run the Container: To simplify this step, we recommend creating a folder to store your genomic data in FASTA format. Once created, run the container using the command below as a guide. Provide the full path to the folder where you want to save the results, as well as the full path to the genomes folder:
     docker run -it -v <path-to-results-folder>:/usr/local/AnnoTEP/bash-interface/results -v <absolute-path-to-folder-genomes>:<absolute-path-to-folder-genomes> annotep/annotep-cli:v1 python run_annotep.py --genome <absolute-path-to-folder-genomes>/genome.fa --threads <number>

    Parameter descriptions:

    • -v <path-to-results-folder>:/usr/local/AnnoTEP/bash-interface/results: Mounts a folder from your machine into the container to store results. Replace <path-to-results-folder> with the path to a folder on your machine. If the folder doesn't exist, Docker will create it. The path /usr/local/AnnoTEP/bash-interface/results is the directory inside the container and should not be changed.
    • -v <absolute-path-to-folder-genomes>:<absolute-path-to-folder-genomes>: Mounts the genomes folder inside the container at the same path it has on your machine, so the path given to --genome is valid inside the container. Replace <absolute-path-to-folder-genomes> with the full path of the folder containing the genomes.
    • --genome <absolute-path-to-folder-genomes>/genome.fa: Specify the full path to the genome file you want to annotate.
    • --threads <number>: Define the number of threads to be used.
  4. Monitor the Annotation Process: Wait for the genome annotation to complete. You can monitor the progress directly in the terminal, where logs are displayed in real time.
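
Since both volume paths must be absolute, a small wrapper can assemble the command for you. This Python sketch is an illustration under stated assumptions and does not ship with AnnoTEP: `build_cli_command` is a hypothetical helper that resolves relative paths and returns the same docker run invocation shown above as an argument list. The folder and file names in the example are placeholders.

```python
import shlex
from pathlib import Path
from typing import List

IMAGE = "annotep/annotep-cli:v1"
RESULTS_IN_CONTAINER = "/usr/local/AnnoTEP/bash-interface/results"


def build_cli_command(results_dir: str, genome_file: str, threads: int) -> List[str]:
    """Assemble the docker run command from the documentation as a list."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    results = Path(results_dir).expanduser().resolve()
    genome = Path(genome_file).expanduser().resolve()
    genomes_dir = genome.parent  # mounted at the same path inside the container
    return [
        "docker", "run", "-it",
        "-v", f"{results}:{RESULTS_IN_CONTAINER}",
        "-v", f"{genomes_dir}:{genomes_dir}",
        IMAGE, "python", "run_annotep.py",
        "--genome", str(genome),
        "--threads", str(threads),
    ]


# Placeholder paths; the command is only printed here, not executed.
cmd = build_cli_command("./results", "./genomes/genome.fa", 8)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the container. Printing it first makes it easy to review the resolved paths.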

Resolving Memory Issues in Docker Containers

If Docker containers experience memory issues or unexpected terminations due to intensive resource usage, you can adjust the process limits (--pids-limit) and swap memory (--memory-swap). Example usage:

docker run -it -v <path-to-results-folder>:/usr/local/AnnoTEP/graphic-interface/results -dp 0.0.0.0:5000:5000 --pids-limit <threads x 10000> --memory-swap -1 annotep/annotep-gui:v1

Explanation

  • --pids-limit <threads x 10000>: Sets the maximum number of processes the container can create. For example, if you use 12 threads, set this value to 120000. This lets each thread create subprocesses without hitting the process limit.
  • --memory-swap -1: Disables the swap memory limit, allowing the container to use unlimited virtual memory. This helps avoid errors when physical RAM is insufficient.
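
The rule of thumb for --pids-limit is simple arithmetic, sketched below in Python. `resource_flags` is a hypothetical helper written for this example; it only formats the two flags described above.

```python
from typing import List


def resource_flags(threads: int, pids_per_thread: int = 10_000) -> List[str]:
    """Return the docker run flags for a given thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "--pids-limit", str(threads * pids_per_thread),
        "--memory-swap", "-1",
    ]


print(" ".join(resource_flags(12)))  # prints: --pids-limit 120000 --memory-swap -1
```

The resulting flags go before the image name in the docker run command.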

Singularity

You can use AnnoTEP with Singularity by converting the official Docker images. Below are the available methods to obtain and run .sif images.

  1. Obtaining the Singularity Image: There are two ways to obtain the image:
    • Method 1 – Direct Conversion from Docker Hub: Download and convert the image directly from Docker Hub using:
      singularity build <name-image>.sif docker://annotep/annotep-cli:v1
      #or
      singularity build <name-image>.sif docker://annotep/annotep-gui:v1

      Description:

      • <name-image>: you can name the image anything you like; the extension must be .sif.
      • docker://: specifies that the image will be pulled from a remote repository (e.g. Docker Hub).
    • Method 2 – Conversion from a Local Docker Image: This method involves saving the Docker image locally and then converting it:
      1. Save the Docker image to a .tar file:
         docker save annotep/annotep-cli:v1 -o annotep_cli1.tar
         #or
         docker save annotep/annotep-gui:v1 -o annotep_gui1.tar
      2. Convert the .tar file to a Singularity image:
         singularity build <name-image>.sif docker-archive://annotep_cli1.tar
         #or
         singularity build <name-image>.sif docker-archive://annotep_gui1.tar

        Description:

        • -o: specifies the name of the .tar file.
        • <name-image>: you can name the image anything you like; the extension must be .sif.
        • docker-archive://: indicates the image will be built from a local .tar archive.
  2. Running the Image: How you run the container depends on the interface you choose:

Singularity GUI

  1. To launch the graphical interface, use:
     singularity exec --bind <path-to-results-folder>:/usr/local/AnnoTEP/graphic-interface/results <name-image>.sif bash -c "cd /usr/local/AnnoTEP/graphic-interface && source /usr/local/miniconda3/etc/profile.d/conda.sh && conda activate EDTA-new && python main.py"
  2. After running the container, access the AnnoTEP interface by opening the following address in your web browser: http://127.0.0.1:5000

    Description:

    • --bind <path-to-results-folder>:/usr/local/AnnoTEP/graphic-interface/results: maps a directory from your local machine to a directory inside the container.
    • bash -c "...": executes a sequence of commands within the container.

Singularity CLI

  1. To run via the command line, use:
     singularity exec -B <path-to-results-folder>:/usr/local/AnnoTEP/bash-interface/results -B <absolute-path-to-folder-genomes>:/genomas <name-image>.sif python /usr/local/AnnoTEP/bash-interface/run_annotep.py --genome /genomas/genome.fasta --threads <threads>

    Description:

    • -B: equivalent to --bind, links local directories to container paths.
    • <path-to-results-folder>:/usr/local/AnnoTEP/bash-interface/results: folder where analysis results will be saved.
    • <absolute-path-to-folder-genomes>:/genomas: folder containing the input genome files.
    • python /usr/local/AnnoTEP/bash-interface/run_annotep.py: the main command that starts the analysis.
    • --genome /genomas/genome.fasta: path to the genome file to be annotated.

The AnnoTEP platform was developed to expand the range of tools available for annotating transposable elements (TEs) in plant genomes. Our main goal is to simplify and optimize the work of researchers, regardless of their level of experience, by facilitating the TE annotation process and providing as much information as possible about the genome.

Contributors

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Fundação de Amparo à Pesquisa do Estado de São Paulo
Fundação Amazônia de Amparo a Estudos e Pesquisas
Universidade Federal do Pará
Universidade Estadual Paulista
Laboratório de Bioinformatica e Computação de Alto desempenho
Contact us with any questions or to report a bug in the platform (GitHub, web, or container version):
marcosnandosc@gmail.com (Support)
alessandro.varani@unesp.br (Advisor)
vabreu@ufpa.br (Advisor)