GitHub repository for GenErode, a Snakemake pipeline for the analysis of whole-genome sequencing data from historical and modern samples to study patterns of genome erosion.

Overview

GenErode pipeline

Logo (C) Jonas Söderberg

Documentation

The full pipeline documentation can be found on the repository wiki.

Citation

If you've used GenErode to produce results, please cite our bioRxiv article:

Kutschera VE, Kierczak M, van der Valk T, von Seth J, Dussex N, Lord E, Dehasque M, Stanton DWG, Emami P, Nystedt B, Dalén L, Díez-del-Molino D. GenErode: a bioinformatics pipeline to investigate genome erosion in endangered and extinct species. bioRxiv 2022. https://doi.org/10.1101/2022.03.04.482637

Pipeline overview

Figure 1: Overview of the GenErode pipeline data processing tracks. Input and output file formats, dependencies between steps, and main software used are shown. Optional steps are highlighted in red.

Figure 2: Overview of the GenErode pipeline data analysis tracks and final reports. Input file formats and main software used are shown.

Licence information

GenErode pipeline

Copyright (C) 2022 Verena Kutschera

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Comments
  • snpEff prepare_db_build does not recognize file type

    ImproperOutputException in line 148 of /crex/proj/sllstore2017093/b2016342/b2016342_nobackup/lts/genome_erosion_pipeline/verena_testing/maintenance/issues/GenErode/workflow/rules/12_snpEff.smk:
    Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule prepare_db_build:
    /proj/sllstore2017093/b2016342/b2016342_nobackup/lts/genome_erosion_pipeline/verena_testing/development/testdata/gerp/outgroup_Sc9M7eS_2_HRSCAF_41/all_scaffolds/snpEff/data/GCF_000283155.1_CerSimSim1.0_genomic.Sc9M7eS_2_HRSCAF_41/genes.gtf
    
    bug 
    opened by verku 1
  • compressing vcf file (rule remove_CpG_vcf - 8.1_vcf_CpG_filtering.smk)

    Hi,

    In the '8.1_vcf_CpG_filtering.smk' file, the rule 'remove_CpG_vcf' uses bedtools intersect to generate a vcf file without CpG sites, such as:

    bedtools intersect -a {input.vcf} -b {input.bed} -header -sorted -g {input.genomefile} > {output.filtered} 2> {log}

    However, the {output.filtered} file is not compressed and thus very large, which means that a project directory can very quickly be full.

    Would it be possible to compress this vcf file to save space with something like below to generate a *vcf.gz file?

    bedtools intersect -a {input.vcf} -b {input.bed} -header -sorted -g {input.genomefile} | bgzip -c > {output.filtered} 2> {log}

    Much appreciated, Nic

    bug enhancement 
    opened by ndussex 1
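
One detail of the suggested pipe is worth noting: a `2> {log}` placed after `bgzip` captures only the stderr of `bgzip`, not that of `bedtools`. The following is a minimal Python sketch of a command builder that redirects stderr for both stages. The function name and the file names in the example call are hypothetical, and this is not necessarily how the GenErode rule is written.

```python
import shlex


def remove_cpg_command(vcf, bed, genomefile, filtered, log):
    """Build a shell pipeline that pipes bedtools intersect into bgzip.

    Hypothetical helper for illustration; each stage writes its own
    stderr to the same log file (the second stage appends).
    """
    q = shlex.quote
    return (
        f"bedtools intersect -a {q(vcf)} -b {q(bed)} -header -sorted "
        f"-g {q(genomefile)} 2> {q(log)} "
        f"| bgzip -c > {q(filtered)} 2>> {q(log)}"
    )


# Example call with made-up file names.
print(remove_cpg_command("in.vcf.gz", "noCpG.bed", "ref.genome", "out.vcf.gz", "cpg.log"))
```

The resulting string could be passed to a shell; in a Snakemake rule the same pattern would be written directly in the `shell:` directive with the `{input...}`, `{output...}` and `{log}` placeholders.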
  • Add docker images combining bedtools and bgzip and update code to compress filtered vcf files (issue #19)

    Docker images currently on private Dockerhub repository. Add them to NBIS Dockerhub and update links to bedtools image in the rules before merging this branch.

    opened by verku 0
  • snpEff build fails with "Out of memory"

    For a user, the rule build_snpEff_db failed due to out of memory.

    She'll test the following:

    java -jar -Xmx64g /usr/local/share/snpeff-4.3.1t-3/snpEff.jar build -gtf22 -c {params.abs_config} -dataDir {params.abs_data_dir} -treatAllAsProteinCoding -v {params.ref_name} 2> {log}

    If it works, update all snpEff rules with the java -jar -Xmx flags and automatic calculation of memory from the number of threads (like QualiMap and Picard)

    bug enhancement 
    opened by verku 0
  • Rerun issue with snakemake 7.8

    Many of us are now running GenErode with Snakemake 7 to avoid issues with Singularity version changes. However, Snakemake changed its rerun behaviour in version 7.8 (see https://github.com/snakemake/snakemake/issues/1694). This means that after changing metadata tables, for example, Snakemake will rerun everything from the beginning, stating "Set of input files has changed since last execution". To get around this, add "--rerun-triggers mtime" to the snakemake command. The same applies to any local changes in code or other parameters.

    opened by lored322 0
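
As an illustration of the workaround, here is a minimal Python sketch that assembles a snakemake command line with the rerun triggers restricted to file modification times. The helper name and the extra arguments in the example call are assumptions for illustration only; adapt them to your own profile and setup.

```python
def snakemake_command(cores, extra_args=()):
    """Return an argv list that limits rerun triggers to modification times.

    Hypothetical helper; only "--rerun-triggers mtime" comes from the
    issue text, the remaining arguments are illustrative.
    """
    return ["snakemake", "--cores", str(cores), "--rerun-triggers", "mtime", *extra_args]


# Example: a dry run with Singularity enabled.
print(" ".join(snakemake_command(4, ["--use-singularity", "-n"])))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems.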
  • Issues with cookiecutter in UPPMAX

    Hello,

    I'm trying to set up the slurm profile with cookiecutter (using the latest version of the pipeline v0.4.2), but the options look very different to what is described in GenErode's Wiki.

    This is what it looks like:

    [Screenshot: 2022-12-14]

    Apparently it is asking me to manually input all the configuration instead of retrieving it from the config/cluster.yaml.

    I used cookiecutter earlier this year both in UPPMAX and another slurm-based cluster and it was working exactly as it is described in the Wiki. I'm not sure if the issue I'm finding now is related to recent changes in UPPMAX or in the Snakemake git profile.

    Thanks in advance, I'd appreciate any suggestions on how to proceed in order to set up the Snakemake profile.

    opened by jcchacond 0
  • Singularity issue?

    Hi,

    Since the last maintenance window on uppmax, I am having issues running the pipeline and got the following error:

    WorkflowError: Minimum singularity version is 2.4.1.
    File "/home/nicd/.conda/envs/generode/lib/python3.7/site-packages/snakemake/deployment/singularity.py", line 48, in __init__

    I tried recreating the generode environment, as such:

    conda env create -n generode -f environment.yml
    conda activate generode

    But I still have the same issue when launching the snakemake command.

    Would you mind helping me with this please?

    Thanks!

    opened by ndussex 3
  • Upgrade to Snakemake version 7

    Snakemake version 6 is not compatible with the latest version of the slurm profile (see https://github.com/NBISweden/GenErode/wiki/8.-FAQ#im-trying-to-run-the-pipeline-with-the-snakemake-slurm-profile-but-im-getting-the-following-error-snakemake-error-unrecognized-arguments---cluster-cancelscancel-how-do-i-solve-this)

    bug 
    opened by verku 1
  • Rewrite memory allocation for java-based tools (e.g. qualimap)

    resources: mem_mb=lambda wildcards, input, threads, attempt: 6000 * threads - 2000

    unset DISPLAY
    qualimap bamqc -bam {input.bam} --java-mem-size={resources.mem_mb}M -nt {threads} -outdir {output}

    or:

    def qualimap_mem(wildcards, input, threads, attempt):
        return 6000 * threads - 2000

    resources: mem_mb=qualimap_mem

    enhancement 
    opened by verku 0
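
The memory rule proposed in the last issue can be tried outside of Snakemake as a plain Python function. This is a minimal sketch that assumes the figures given in the issue (6000 MB per thread, minus 2000 MB); the signature mirrors the callable form that Snakemake accepts under `resources`, and the loop at the end is only there for illustration.

```python
def qualimap_mem(wildcards, input, threads, attempt):
    """Return the memory request in MB: 6000 MB per thread minus 2000 MB."""
    return 6000 * threads - 2000


# Snakemake would call the function itself; here it is called by hand
# to show how the request scales with the number of threads.
for threads in (1, 2, 4, 8):
    print(threads, qualimap_mem(None, None, threads, attempt=1))
```

Because the value is derived from `threads`, changing the number of threads for a rule changes the memory request along with it.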
Releases(0.4.2)
  • 0.4.2(Sep 5, 2022)

    Updates related to large genome sizes and/or large sample sizes:

    • Run snpEff with option to specify -Xmx for large genomes and add the rules to cluster.yaml
    • Fix y-axis labels for mutational load plot so that there is no overlap for large sample sizes
    • Create new Docker images with bedtools and htslib (bgzip) so that VCF files filtered with bedtools can be compressed in a pipe to reduce intermediate file sizes

    Minor bug fixes:

    • Update conda in GitHub actions to reduce run time
    • Shorten run time and lower number of cores for mutational load calculations in cluster.yaml
    • Remove temp flag from bam index file of rescaled bam files
    • Embed pipeline logo into GenErode pipeline report via link to file on repository so that the pipeline report can be moved to a different location
    • Fix "rerun incomplete" warning for rule make_reference_bed by separating it from the group job reference_prep_group

    see https://github.com/NBISweden/GenErode/wiki/9.-Changelog#2022-09-05-version-042

  • 0.4.1(Mar 3, 2022)

    This is the first public version of GenErode, a Snakemake pipeline that analyzes whole genome re-sequencing data from ancient/historical and modern samples.

Owner
NBIS -- National Bioinformatics Infrastructure Sweden
NBIS is a distributed national bioinformatics infrastructure, supporting life sciences in Sweden.