Details for each talk can be expanded as a dropdown by clicking on its title.
Thursday, February 29
10:00 Registration desk opens
11:00 Welcome note
11:15 Invited Talk: Guessing, measuring and inferring metabolic fluxes
– by Nicola Zamboni
12:00 Dynamics of algae-bacteria interactions in artificial plankton communities
– by Marco Mauri
Marine phytoplankton is composed of unicellular algae and their associated bacteria. These algal
communities are highly variable in species composition, but their patterns tend to recur seasonally. This
diversity suggests the existence of ecological niches for the associated bacteria.
To clarify the effect of the bacterial community on the phytoplankton, we co-cultured several bacterial species with microalgae. We found that the inhibitory or stimulatory effect of a particular bacterium depends both on the co-cultured partner and on the growth phase of the algal culture. Using a generalized
Lotka-Volterra model, we were able to predict the behaviour of an artificial community composed of C.
radiatus diatoms and four bacterial species, and demonstrate the importance of pairwise bacteria-algae
interactions [1].
Our study supports the idea that interactions between bacteria and algae are highly species-specific and
depend on algal fitness, bacterial metabolism, and community composition. This species specificity may
underlie the high complexity of the multi-species plankton communities observed in nature.
[1] Deng, M. Mauri, M. Vallet, M. Staudinger, R.J. Allen, G. Pohnert, Appl Environ Microbiol. 88, 22, 2022.
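As a rough illustration of the modelling approach (not the fitted model from [1]), a minimal generalized Lotka-Volterra sketch in Python could look as follows; the species, growth rates, and interaction coefficients are purely hypothetical placeholders.

```python
# Minimal generalized Lotka-Volterra (gLV) sketch for one alga and two bacteria.
# All rates and interaction coefficients are illustrative, not fitted values.
import numpy as np
from scipy.integrate import solve_ivp

r = np.array([0.6, 0.9, 0.4])           # intrinsic growth rates (alga, bacterium 1, bacterium 2)
A = np.array([[-0.8,  0.1, -0.3],        # A[i, j]: effect of species j on species i
              [ 0.5, -1.0,  0.0],
              [-0.2,  0.0, -1.0]])

def glv(t, x):
    """dx_i/dt = x_i * (r_i + sum_j A_ij * x_j)"""
    return x * (r + A @ x)

sol = solve_ivp(glv, (0, 50), y0=[0.1, 0.05, 0.05])
print(sol.y[:, -1])                      # community composition at the final time point
```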
12:20 Predicting the environment of bacterial community composition
– by Rose Brouns
Abiotic factors, including salinity, temperature, and pH, play a crucial role in shaping microbial community structure. Given how strongly these abiotic factors shape microbial communities, we expect that, conversely, they can be predicted from community features. Being able to relate community composition to abiotic factors can have a huge impact on microbiome research, because it can help predict what will happen to a community once its environment changes.
We downloaded the following publicly available data from the MGnify database: taxonomic composition,
biome description, salinity, pH and temperature of study sites around the world. We used dimension
reduction techniques to visualize the data based on different abiotic factors and biome. While some
separation was observed, the samples did not form clear clusters, suggesting that there is more to learn
about the complex ecology of microbial communities.
Using Random Forest (RF) models, we could classify samples into high- or low-salinity groups based on their taxonomic composition at different taxonomic ranks. The models achieved a mean accuracy of 96-97% (standard deviation between 0.005 and 0.008) in predicting high or low salinity of the environment.
To cross-check the results of our models, we conducted differential abundance analysis (DAA), in which we identified taxa that are significantly differentially abundant between high- and low-salinity environments. These results are consistent with the most important taxa identified by the RF models.
Our success in predicting salinity levels suggests that we can also predict other abiotic factors such as pH and temperature; these predictors are currently under development. This opens up the possibility of using metagenomic data to predict a wide range of environmental conditions and to gain insight into the interplay between community composition dynamics and abiotic factors.
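A minimal sketch of the classification step, assuming a samples-by-taxa relative-abundance matrix with synthetic labels (the actual analysis uses MGnify profiles and recorded salinity values):

```python
# Random Forest classification of high vs. low salinity from taxonomic profiles.
# X and y are synthetic stand-ins for MGnify abundance tables and salinity labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(50), size=200)          # 200 samples x 50 taxa (relative abundances)
y = (X[:, 0] + rng.normal(0, 0.02, 200)) > 0.02    # placeholder "high salinity" labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)          # per-fold accuracy
print(scores.mean(), scores.std())

clf.fit(X, y)
top_taxa = np.argsort(clf.feature_importances_)[::-1][:10]  # taxa to cross-check against DAA
```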
12:40 Lunch (Recommendations are marked in purple on the map)
14:10 Flexible multi-omics data integration using autoencoders by a versatile PyTorch-based toolbox: AUTOENCODIX
– by Jan Ewald
Current omics techniques generate data characterized by varying types (gene counts, mutations) and huge
feature spaces (genome-wide). Hence, data integration and dimension reduction methods are critical not
only from a computational perspective but also to gain biological insights. Further, latent spaces and
embeddings have been successfully used to develop disease prognosis models with high predictive power.
Recently, several types of autoencoders for multi-omics data have proven successful in unsupervised learning and feature-engineering tasks such as cancer subtyping or single-cell atlases. However, no easy-to-use and flexible implementation of such autoencoders currently exists; here, we propose a versatile, PyTorch-based tool for autoencoders on multi-omics data.
Our toolbox will provide the following key features. Users can flexibly choose and specify different omics
types (transcriptome, proteome, mutome, clinical parameters etc.) as well as several autoencoder types and
topologies (variational, hierarchical etc.) including explainability by design (ontology-based autoencoder) or
translation of data modalities. Further, autoencoders can be optimized via feature filtering and scaling options
as well as fully integrated hyperparameter tuning by Optuna. Lastly, our implementation integrates, as a
pipeline, all necessary steps from data preprocessing and model tuning to model evaluation, including the
visualization and clustering of resulting latent spaces. To illustrate the power of our versatile implementation,
we showcase multi-omics data integration on the TCGA pan-cancer data set for cancer subtyping and
prognosis models. First results show that, in particular, variational autoencoders learn compact and dense latent spaces in comparison to methods like t-SNE or UMAP.
In summary, our toolbox offers an easy-to-use solution while being versatile for multi-omics data integration
by autoencoders. Further, it can be extended to additional types of autoencoders and applied to large-scale
data sets, especially single-cell sequencing data.
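A minimal sketch of one supported model type, a variational autoencoder on a concatenated multi-omics feature matrix; the layer sizes and the plain concatenation of omics blocks are simplifying assumptions, not the toolbox's actual defaults:

```python
# Minimal variational autoencoder for a concatenated multi-omics feature matrix.
# Dimensions and architecture are placeholders; AUTOENCODIX additionally supports
# hierarchical, ontology-based, and cross-modality (translation) autoencoders.
import torch
import torch.nn as nn

class MultiOmicsVAE(nn.Module):
    def __init__(self, n_features, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = nn.functional.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```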
14:30 Encoding and decoding the Microverse
– by Aristeidos Litos
Complex dynamics and co-occurrence patterns transpire throughout the Microverse and are reflected in the
composition of microbial communities. These dynamics, which could be stochastic, universal, niche-specific,
or even sample-specific, play a crucial role in determining the structure and function of microbial
communities. However, discerning the relative contribution of universal and context-specific processes in
shaping the composition of microbial communities remains a challenge. If microbial communities are largely
shaped by a set of conserved and universal processes, then the information needed to reconstruct their
composition is minimal, which signifies a large level of compressibility in community composition.
Conversely, if communities are shaped by context-specific rules, making each community essentially unique,
a comprehensive understanding of the entire community would be necessary to reconstruct its composition,
indicating that compositions are largely incompressible.
Intrinsic dynamics at the global scale remain understudied, even as the volume of available data keeps increasing. Here, we introduce a neural network-based framework to unravel intrinsic dynamics in microbial communities and the concepts underlying them. To achieve this, we utilize all the data in the MGnify database and develop a neural network with an autoencoder architecture for microbial compositions.
Autoencoders learn compressed representations of data by reducing dimensions and reconstructing them. In
our implementation we further introduce taxonomic information, by projecting the composition to a universal
taxonomy tree. Hyperparameters regarding the model’s architecture, training procedure and preprocessing
of the data are tuned based on the beta-diversity between predictions and samples along with other metrics.
Each sample is assigned a compressibility score that reflects the model's reconstruction performance.
High compressibility scores indicate wider-spread intrinsic patterns in the composition, while low scores
suggest uniqueness. Therefore, with our generative model, we can classify microbial taxonomic profiles in a
gradient from universal to context-specific dynamic patterns. Universal patterns throughout the Microverse
can be revealed with models trained and tested on datasets from different biomes.
Shared patterns of intrinsic dynamics encrypted in the composition between samples of different biomes can
emerge and further our understanding of microbial communities, their structures, functions and role in nature.
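As a sketch of how a per-sample compressibility score could be derived from reconstruction quality, the snippet below uses Bray-Curtis dissimilarity as a stand-in beta-diversity metric; the framework's exact score definition may differ.

```python
# Per-sample compressibility score from autoencoder reconstructions (illustrative only).
import numpy as np
from scipy.spatial.distance import braycurtis

def compressibility_scores(originals, reconstructions):
    """originals, reconstructions: samples x taxa relative-abundance arrays."""
    scores = []
    for x, x_hat in zip(originals, reconstructions):
        dissimilarity = braycurtis(x, x_hat)    # 0 = perfect reconstruction, 1 = no overlap
        scores.append(1.0 - dissimilarity)      # high score = well compressed ("universal")
    return np.array(scores)
```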
14:50 Genomic insights through antibiotic resistance of wastewater E. coli using positive selection, GWAS, and AI analyses
– by Negin Malekian
We have explored the link between antibiotic resistance and genomic variations in wastewater E. coli
samples. Wastewater is chosen as a representative source because it contains diverse genomic content from various bacteria.
Our research addresses two main aspects: whether to include antibiotic resistance data and the granularity
of representing genomic variations. We investigated positive selection in E. coli using only genomic data [1].
Our study revealed mutations and genes linked to antibiotic resistance, especially those situated on the rim
of membrane porins.
Examining both genomic and antibiotic resistance data at different resolution levels (single mutations, unitigs
(genome words), and machine learning models), our study [2] finds that even the simplest representation of
single mutations detects known resistance mutations and offers insights into potential novel mechanisms. By
using larger genomic units (unitigs), we confirm these findings and identify joint mutations in known and
novel genes [3]. Utilizing advanced machine learning models, including AI language models, the research
predicts antibiotic resistance based on genomic data, with an accuracy above 90%.
In summary, we uncovered new antibiotic resistance mechanisms, conducted a significant positive selection
study in E. coli, and pioneered the utilization of AI language models for predicting antibiotic resistance.
[1] Malekian, Negin, et al. "A genome-wide scan of wastewater E. coli for genes under positive selection: focusing on mechanisms of antibiotic resistance." Scientific Reports 12.1 (2022): 8037.
[2] Malekian, Negin, et al. "Mutations in bdcA and valS correlate with quinolone resistance in wastewater Escherichia coli." International Journal of Molecular Sciences 22.11 (2021): 6063.
[3] Malekian, Negin, et al. "Word-based GWAS harnesses the rich potential of genomic data for E. coli quinolone resistance." Frontiers in Microbiology, under revision.
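For illustration, a sketch of resistance prediction from a unitig presence/absence matrix with a sparse linear model; the data are synthetic, and the models evaluated in the study (including language models) are considerably more elaborate.

```python
# Resistance prediction from a (synthetic) unitig presence/absence matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 5000))    # 300 isolates x 5000 unitigs (0/1 presence)
y = X[:, 42] | (rng.random(300) < 0.05)     # placeholder resistance phenotype

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)  # sparse model
print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```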
15:10 JIPipe2AI – deep learning access for everyone
– by Jan-Phillip Praetorius
In recent years, deep learning has become an essential tool for the quantitative analysis of biomedical image
data.
Cell-segmentation-focused algorithms like Cellpose [1] and StarDist [2] already offer a powerful and easy-to-understand approach to utilizing deep learning by providing simple parameters, pre-trained models for
biological object extraction, and a user-friendly graphical interface. More general bioimage analysis tasks are
covered by tools like deepImageJ [3] and Ilastik [4] that support a wider range of image analysis model types
and can import existing pre-trained models.
Yet, the training of models and the design of entirely new architectures are currently only accessible via advanced deep learning frameworks that require extensive programming knowledge. Additionally, such frameworks also enable the creation of pipelines that combine models with image processing operations for pre- and postprocessing, as well as the application of explainable AI techniques that visualize image regions associated with the decision-making process.
To address these limitations and provide an easy-to-use deep learning toolbox for non-programmers, we
developed JIPipe2AI, which combines the powerful visual programming language JIPipe [5] with a
standardized and versatile interface to advanced deep learning frameworks. The functions are accessible via
fully documented and intuitive processing steps that seamlessly integrate with the existing ImageJ-based [6]
functions of JIPipe.
JIPipe2AI thus allows non-programmers to take advantage of state-of-the-art techniques, including transfer
learning, automated data augmentation, distributed learning across multiple devices, and multi-level
regularization. In addition, the design of the interface allows users to implement their own models and
conveniently compare their performance with models already included in JIPipe2AI. These features make
JIPipe2AI a powerful tool for beginners and advanced users to create (custom) model architectures, train
models on their own data from scratch or improve the performance of existing ones, generate and
postprocess predictions, and apply explainable AI techniques.
1. Pachitariu, M. & Stringer, C. Cellpose 2.0: how to train your own model. Nat. Methods 19, 1634–
1641 (2022).
2. Schmidt, U., Weigert, M., Broaddus, C. & Myers, G. Cell Detection with Star-Convex Polygons. in
Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (eds. Frangi, A. F., Schnabel,
J. A., Davatzikos, C., Alberola-López, C. & Fichtinger, G.) vol. 11071 265–273 (Springer International
Publishing, 2018).
3. Gómez-de-Mariscal, E. et al. DeepImageJ: A user-friendly environment to run deep learning models
in ImageJ. Nat. Methods 18, 1192–1195 (2021).
4. Berg, S. et al. ilastik: interactive machine learning for (bio)image analysis. Nat. Methods 16, 1226–
1232 (2019).
5. Gerst, R., Cseresnyés, Z. & Figge, M. T. JIPipe: visual batch processing for ImageJ. Nat. Methods
(2023) doi:10.1038/s41592-022-01744-4.
6. Rueden, C. T. et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC
Bioinformatics 18, 529 (2017).
15:30 Using DNA language models to derive per-nucleotide somatic mutation probabilities for individual patients
– by Anna R. Poetsch
Identical sequences in the human genome can carry multiple layers of information content, simultaneously encoding information for proteins, gene regulation, and the genome's own replication. At the same time, the genome needs to remain stable to prevent damage, mutation, and the resulting dysfunction and/or change or loss of information, with consequences for somatic evolution, which leads to somatic mosaicism, ageing, and cancer.
We use DNA language models to investigate how stability is encoded in the DNA. EAGLE-MUT (Efficient
Analysis with a Genome-wide LSTM to Evaluate per-nucleotide MUTation susceptibility) learns the sequence
context of somatic single base substitutions (SBS) in individual patient samples of somatic mutagenesis. The model is pre-trained on mutations pooled from multiple tumors and fine-tuned for individual samples, and retrospective probabilities are derived for every nucleotide in the genome. We evaluate the model via a ratio against a model of random mutation distribution over the genome and reach a median per-patient performance of 19-fold over random, with a peak performance of 93-fold.
The probability tracks can be used to quantify selective pressures, to correlate with other continuous data, or to derive treatment-related probabilities of mutagenesis. Most importantly, they allow the discovery of mechanisms that predispose to or protect from mutagenesis at specific sites. We found patterns of mutation distribution to follow sequence content, genome elements, and diverse epigenetic profiles, and we discovered novel, biologically meaningful motifs of up to 200 bp in length at distinct mutation cold- and hotspots. Together, this shows that the mutation distribution is not random but that the location of mutations depends on the genome sequence.
Models like EAGLE-MUT overcome the sparse nature of mutation data and provide novel insights with unprecedented detail into somatic human genome evolution and thus the development of somatic mosaicism, ageing, and cancer.
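As a schematic of the underlying idea (not the actual EAGLE-MUT architecture or training setup), a small bidirectional LSTM mapping one-hot sequence context to per-position mutation probabilities could be sketched as follows:

```python
# Toy per-nucleotide mutation-probability model: bidirectional LSTM over one-hot DNA.
# Layer sizes, window length, and the missing training loop are placeholders.
import torch
import torch.nn as nn

class PerNucleotideLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, length, 4) one-hot A/C/G/T
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # per-position probability

model = PerNucleotideLSTM()
window = torch.eye(4)[torch.randint(0, 4, (1, 200))]      # random 200-bp one-hot window
probs = model(window)                                      # shape: (1, 200)
```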
15:50 Coffee break
16:20 Bayesian Inference of Chromosome Structures at Multiple Scales from Single-Cell Hi-C Data
– by Vincent Messow
Recent research has shown that the 3D architecture of genomes impacts the regulation of essential cellular
processes.
Chromosome conformation capture techniques such as Hi-C have become instrumental in characterizing the
3D structure of chromatin.
Although the development of ever-improving protocols is advancing at a great pace, contacts between
genomic loci as reported by Hi-C methods remain sparse, noisy and low in resolution.
These challenges are even more severe in single-cell Hi-C variants.
Bayesian approaches to chromosome structure calculation have proven to be capable of addressing noise
and lack of resolution by modeling the uncertainty in the data and chromosome conformations.
Chromosomes are commonly represented coarsely as *beads on a string* with bead radii, force field
parameters and contact distances chosen *ad hoc* by trial and error.
The hierarchical organization of genome architecture as well as variations in the density of Hi-C contacts
suggest a 3D representation of chromosomes at multiple scales in which both the number of beads and their
physical sizes vary locally.
In contrast to single-scale models with fixed parameters, an *ad hoc* selection of model parameters is no
longer suitable for multi-scale settings.
A Bayesian perspective suggests that bead parameters should be inferred from the data along with bead
coordinates.
State-of-the-art multi-scale methods for large-scale genome structure determination refrain from taking a full
Bayesian approach due to putatively resource-intensive conformation sampling[^meng2021].
We aim to illustrate that fully Bayesian *Inferential Structure Determination* (ISD) from single-cell Hi-C data
is still feasible by means of an efficient custom-developed particle engine that strongly improves upon the
computation of pairwise interactions of previous work[^carstens2016].
To further improve sampling, we will explore *resolution exchange* Monte Carlo, a replica exchange inspired
group of techniques which swap sampled conformations between independent simulations at different
resolutions.
We also discuss potential extensions to our approach, paving the way towards integrating inference of bead
parameters with the ISD framework.
[^meng2021]: Meng et al. 2021 [https://doi.org/10.1038/s41467-021-24662-z](https://doi.org/10.1038/s41467-021-24662-z)
[^carstens2016]: Carstens et al. 2016
[https://doi.org/10.1371/journal.pcbi.1005292](https://doi.org/10.1371/journal.pcbi.1005292)
16:40 Finding plant protective sRNAs via deep degradome sequencing – Is counting still adequate?
– by Bernhard Timo Gadow-Werner
Small regulatory RNAs (sRNAs) regulate gene expression in nearly all higher organisms via RNA
interference (RNAi). RNAi causes, among other effects, the targeted degradation of messenger RNAs at
sites complementary to a respective sRNA. Ongoing research in this field has revealed not only a plethora of endogenous functions of RNAi, but also silencing between different organisms. In this process of cross-kingdom RNAi (ckRNAi), sRNAs are transferred from one organism (e.g. a pathogen) to another organism (e.g. a host). Since the discovery of ckRNAi a decade ago, a wide range of examples from all major eukaryotic clades has been discovered. Among them are two examples of plants silencing pathogen genes
(Borniego & Innes, 2023).
The project WheatInterfere aims to find anti-fungal sRNAs naturally occurring in the staple crop wheat, to aid
breeders and to find promising and safe candidates for RNA-based crop protection. For this purpose, we
generated a substantial dataset of more than a billion reads from plants challenged with two fungal
pathogens – Fusarium graminearum and Magnaporthe oryzae, a.k.a. cereal killers. Especially promising is the improved parallel analysis of RNA ends (PARE) protocol, which produces more reads and longer tags than former degradome sequencing approaches.
A closer look at these data reveals previously unknown biases that are overlooked by established bioinformatics pipelines, which rely solely on binomial distributions. In this talk, we will introduce a new pipeline for positional bias reduction using non-discrete statistics.
Borniego, M.L. and Innes, R.W., 2023: Extracellular RNA, mechanisms of secretion and potential functions.
Journal of experimental botany, erac512, DOI: 10.1093/jxb/erac512.
17:00 Nonlinear DNA methylation trajectories in aging mice
– by Alena van Bömmel
Although DNA methylation data yields highly accurate age predictors, little is known about the dynamics of
this quintessential epigenomic biomarker during lifespan. To narrow the gap, we investigated the methylation
trajectories of male mouse colon at five different time points of aging. Our study indicates the existence of
sudden hypermethylation events at specific stages of life. Specifically, we identify two epigenomic switches
during early-to-midlife (3-9 mo) and mid-to-late-life (15-24 mo) transitions, separating the rodents’ life into
three stages. These surprisingly nonlinear methylation dynamics predominantly affect genes associated with
the nervous system and enrich in bivalently marked chromatin regions. Based on groups of nonlinearly
modified CpGs, we construct a novel clock-like classifier STageR that accurately predicts murine epigenetic
stage. We demonstrate the universality of our clock in an independent mouse cohort and with publicly
available datasets.
17:20 Convergent promoters form cooperative domains
– by Martin Fischer
Transcription is one of the most fundamental processes in biology. Over the past decade, it has become
evident that most promoters and enhancers initiate bidirectional transcription, generating two RNAs that are
transcribed in divergent directions. More recently, antisense transcription downstream of promoters has been
found to be surprisingly prevalent, but its function is unknown. According to the standing model of
transcription interference, convergent transcription, i.e., the collision of sense and antisense transcription, is
expected to display a negative correlation. Here, we show that convergent transcription marks an
unexpectedly cooperative promoter constellation. By assessing transcriptional dynamic systems with deep
CAGE-seq, we identified cooperative constituent promoters that are connected by a distinct chromatin
structure. Visualization of nascent RNA molecules using single-molecule FISH revealed co-transcription from
convergent promoters. Within these cis-regulatory domains, transcription factors can regulate both
constituting promoters by binding to only one of them. Epigenetic data suggest that convergent promoters
form a joint regulatory domain spanning CpG islands enriched for active promoter marks. Convergent
promoters account for about a quarter of all active transcript starts and initiate 5’-overlapping antisense
RNAs, a class of RNA previously thought to be rare. Integrative analysis of CAGE-seq, RNA-seq, and
QuantSeq data provides precise annotations for downstream antisense RNAs generated from convergent
promoters that have missing or incomplete annotations in GENCODE. Taken together, our results
demonstrate that cooperative promoters substantially expand the cis-regulatory repertoire, reveal limitations
of the transcription interference model, and call for adjusting the promoter concept.
17:40 Poster session
18:00 Dinner, poster session continued
Friday, March 1
9:00 Invited Talk: Mapping the Microverse and modeling its drivers
– by Bas E. Dutilh
9:45 Computational prediction of pH changes during bacterial growth
– by Rosalind Allen
Changes in pH due to bacterial metabolism play a key role in diverse areas such as tooth decay and food
production, and are also important for detection of bacterial contamination and infection. Flux Balance
Analysis (FBA) is a well-established method for computational prediction of metabolism, but it has rarely
been used to predict pH changes. We aimed to develop a computational framework to predict pH changes
associated with bacterial growth, and to compare its results to measurements for diverse bacterial species
under a range of growth conditions. A mathematical modelling framework for predicting pH changes
associated with bacterial growth was developed by combining FBA with the pH calculation software CurTiPot.
Experimental measurements of changes in pH during bacterial growth were made for lab strains of
Escherichia coli and Staphylococcus epidermidis, on LB medium with and without glucose and under aerobic
and anaerobic conditions. Our computational predictions were in good agreement with our measurements for
E. coli under the various growth conditions, but agreement for S. epidermidis was poorer, pointing to possible deficiencies in the underlying metabolic model for this organism. Our computational approach is promising and may in the future provide a way to identify the key metabolites and metabolic pathways that
mediate the effect of microbial growth on pH for different microorganisms.
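As a back-of-the-envelope illustration of this coupling (the actual framework combines genome-scale FBA with the CurTiPot software), the sketch below estimates the pH shift from a predicted acid secretion flux using a simple buffer approximation; all numbers are placeholders.

```python
# Toy coupling of an FBA-predicted acid secretion flux to a pH estimate via the
# Henderson-Hasselbalch equation for a phosphate-like buffer. Flux, biomass, buffer
# concentration, and pKa are illustrative assumptions, not measured values.
import numpy as np

def buffered_ph(acid_mM, buffer_mM=50.0, ph0=7.0, pka=7.2):
    """pH after adding acid_mM of a strong monoprotic acid to a simple buffer."""
    base0 = buffer_mM / (1 + 10 ** (pka - ph0))    # initial conjugate-base concentration
    acid0 = buffer_mM - base0
    base = max(base0 - acid_mM, 1e-9)              # protons consume the conjugate base
    acid = acid0 + min(acid_mM, base0)
    return pka + np.log10(base / acid)

acetate_flux = 8.0     # mmol gDW^-1 h^-1, e.g. taken from an FBA solution (placeholder)
biomass = 0.3          # gDW per litre of culture (placeholder)
hours = 5.0
print(buffered_ph(acid_mM=acetate_flux * biomass * hours))   # ~6.4 for these numbers
```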
10:05 Nested metabolic defense strategies of antibiotic-resistant bacterium Pseudomonas aeruginosa
– by Wassili Dimitriew
P. aeruginosa, a versatile Gram-negative bacterium, has emerged as a formidable human pathogen, causing
a range of infections that pose significant challenges to medical treatment. This opportunistic pathogen is
particularly problematic in hospital settings, affecting individuals with compromised immune systems, such as
those undergoing chemotherapy or with cystic fibrosis [1]. One of the key features contributing to the
resilience of P. aeruginosa is its intrinsic resistance to many antibiotics. The bacterium possesses an array of
efflux pumps and produces biofilms, protective communities of bacteria embedded in a self-produced matrix,
both of which contribute to its ability to resist conventional treatments. P. aeruginosa falls within the "critical" category of the World Health Organization's priority list of bacterial pathogens, demanding urgent research and development of new antibiotics [2]. This bacterium can also adapt to various environments, making it a
persistent threat in diverse clinical settings [1, 3]. As an example of this adaptability, one can mention the
ability of P. aeruginosa to survive inside macrophages. The resulting interplay can be described in terms of
nested defense strategies, where the host and the pathogen develop mechanisms to counter the opponent’s
measures. For P. aeruginosa, the mechanism of itaconic acid degradation has been described [4], which is a rare example of counter-counter-counter- (or counter³-) defense. We have constructed a model of the nested defense strategy, using P. aeruginosa as an example organism with a counter³-defense mechanism. We have
explored the system parameters’ time courses, the optimal resource allocation from the pathogen’s side, and
parameter sensitivity for different infection scenarios, using the quasi-sequential approach to dynamic
optimization [5]. This allowed us not only to achieve a deeper understanding of the P. aeruginosa infection process, but also to hypothesize about new methods to fight this infection.
References
[1] Irene Jurado-Martín, Maite Sainz-Mejías, and Siobhán McClean. "Pseudomonas aeruginosa: An audacious pathogen with an adaptable arsenal of virulence factors". In: International journal of molecular sciences 22.6 (2021), p. 3128.
[2] João Botelho, Filipa Grosso, and Luísa Peixe. "Antibiotic resistance in Pseudomonas aeruginosa – Mechanisms, epidemiology and evolution". In: Drug resistance updates 44 (2019), p. 100640.
[3] Karolina Anna Mielko et al. “Metabolomic studies of Pseudomonas aeruginosa”. In: World Journal of
Microbiology and Biotechnology 35 (2019), pp. 1–11.
[4] Jahminy Sasikaran et al. “Bacterial itaconate degradation promotes pathogenicity”. In: Nature chemical
biology 10.5 (2014), pp. 371–377.
[5] Martin Bartl, Pu Li, and Lorenz T Biegler. “Improvement of state profile accuracy in nonlinear dynamic
optimization with the quasi-sequential approach". In: AIChE Journal 57.8 (2011), pp. 2185–2197.
10:25 Genome-scale metabolic modeling of Aspergillus fumigatus strains reveals growth dependencies on the lung microbiome
– by Sascha Schäuble
Aspergillus fumigatus, an opportunistic human pathogen, frequently infects the lungs of people with cystic
fibrosis and is one of the most common causes of infectious-disease death in immunocompromised patients.
Here, we construct 252 strain-specific, genome-scale metabolic models of this important fungal pathogen to
study and better understand the metabolic component of its pathogenic versatility. The models show that
23.1% of A. fumigatus metabolic reactions are not conserved across strains and are mainly associated with
amino acid, nucleotide, and nitrogen metabolism. Profiles of non-conserved reactions and growth-supporting
reaction fluxes are sufficient to differentiate strains, for example by environmental or clinical origin. In
addition, shotgun metagenomics analysis of sputum from 40 cystic fibrosis patients (15 females, 25 males)
before and after diagnosis with an A. fumigatus colonization suggests that the fungus shapes the lung
microbiome towards a more beneficial fungal growth environment associated with aromatic amino acid
availability and the shikimate pathway. Our findings are starting points for the development of drugs or
microbiome intervention strategies targeting fungal metabolic needs for survival and colonization in the non-native environment of the human lung.
10:45 Metabolic Quest: Unveiling metabolic intricacies through expression patterns
– by Shrutakirti Saha
Candida albicans, a significant fungal pathogen, represents a great health challenge globally. To understand and unravel its intricate mechanisms, we use metabolic modelling that integrates omics data. Proliferation and virulence are largely driven by gene expression, which is in turn regulated by metabolic pathways. Serine-arginine (SR) protein kinases are characterized by phosphorylating SR proteins and are highly conserved in
eukaryotes. To investigate the role of SR protein kinases in Candida albicans and their metabolic implications, we performed RNA sequencing of sky2 and sky2Δ C. albicans grown in three distinct media (glucose, succinic acid, and malic acid) at 6 hours. At each time point, we systematically gathered omics data, encompassing transcriptomics and metabolomics. This dataset forms a basis for understanding the metabolic fluxes and the dynamic behavior of C. albicans. We developed a topological
model of the central metabolism of C. albicans comprising the tricarboxylic acid (TCA) cycle, which plays an important role in converting diverse carbon sources into energy, and the nicotinate pathway, which is responsible for the biosynthesis of vitamin B3 (niacin). The nicotinate pathway allows us to understand cellular processes under limited nutrient availability. Knockout strains grew less well on the three different media, with lower fluxes than the wild-type variant. We also created a topological model to analyze
the arginine and sphingolipid metabolism in C. albicans; sphingolipid biosynthesis plays an important role in the filamentation of the fungus. Our study of metabolism using elementary modes that integrate omics data opens a new perspective on enzyme and pathway usage in C. albicans during growth on different carbon sources.
11:05 Coffee break
11:30 Implementation, Validation, and Application of a Bioinformatics Framework for Analyzing Oxford Nanopore Technologies Genome Sequencing Data from Zoonotic Bacterial Pathogens
– by Christine Thomas
Investigating bacterial genomes at the nucleotide level requires whole-genome sequencing and bioinformatics
analysis. While Illumina sequencing is established (ISO 23418:2022), validation is still missing for the
application of Oxford Nanopore Technologies within the German national surveillance program of zoonotic
bacteria. Automated bioinformatics pipelines enable researchers with little expertise in computer science to
investigate their sequencing data, and ensure reproducibility. Existing pipelines for ONT data analysis either miss important steps such as basecalling (especially with the latest basecaller, Dorado) or polishing (Deneke 2023, Petit 2023), are not tailored to bacteria, or use outdated tools (Murigneux 2021). Here, a
bioinformatics framework for creating closed genome assemblies based on bacterial ONT sequencing data is
implemented, validated, and applied. This framework incorporates basecalling, quality control, assembly, and polishing of assemblies. The implementation uses Python and the workflow management system Snakemake, allowing for parallelism, scalability, and reproducibility. As diverse tools exist for each
analysis step, the performance of different tool combinations is evaluated. To this end, whole-genome sequencing data of seven reference strains of highly pathogenic bacteria (Bacillus anthracis, Brucella spp., Burkholderia mallei, Francisella tularensis, Taylorella equigenitalis) are used to validate and optimize the composition of tools that ensures the highest-quality genomes. The correctness of the final polished assemblies is examined by comparison to publicly available Sanger sequences. The
resulting pipeline presents an intuitive tool to produce high-quality genome assemblies from whole-genome
sequencing data. The utility of ONT sequencing is demonstrated for different tasks in veterinary medicine
including serotyping of Salmonella (Thomas et al 2023) or genotyping of highly pathogenic zoonotic bacteria
(Linde et al 2023).
Deneke, Carlus. "MiLongA". Git: https://gitlab.com/bfr_bioinformatics/milonga. Accessed November 2023.
Petit, Robert A. "dragonflye". Git: https://github.com/rpetit3/dragonflye. Accessed November 2023.
Murigneux, Valentine, et al. "MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction." BMC Genomics 22.1 (2021): 474.
ISO 23418:2022. Microbiology of the food chain. Whole genome sequencing for typing and genomic characterization of bacteria. General requirements and guidance.
Thomas, Christine, et al. "Oxford Nanopore Technologies—a valuable tool to generate whole-genome sequencing data for in silico serotyping and the detection of genetic markers in Salmonella." Frontiers in Veterinary Science 10 (2023): 1178922.
Linde, Jörg, et al. "Comparison of Illumina and Oxford Nanopore Technology for genome analysis of Francisella tularensis, Bacillus anthracis, and Brucella suis." BMC Genomics 24.1 (2023): 1-15.
11:50 Fast and Accurate Barcode Calling Based on k-mers
– by Riko Uphoff
DNA barcodes, which are sets of short unique DNA strings, are regularly used as tags in pooled sequencing
experiments to enable the identification of reads that originate from the same cell. In spatial transcriptomics,
for example, tissue samples are placed on arrays with surface-attached barcode primers, which make it possible to find
the spatial coordinates of the sequenced mRNA molecules on the studied tissue. A crucial preprocessing
step in such experiments is barcode calling, where for each read the corresponding barcode has to be
determined. This task is challenging because the original barcode set is corrupted by errors during synthesis
and sequencing. Modern techniques such as photolithographic microarray synthesis produce barcodes with
a comparatively high nucleotide error rate. A straightforward but naive approach to barcode calling is to search, for each read, for the most similar barcode.
For typical applications with ~10^6 barcodes and ~10^8 reads, an all-to-all exact distance
calculation is prohibitively expensive. Hence, the main challenge is to develop a scalable and highly efficient,
but still precise algorithm with a high recall rate for barcode calling. In a breakthrough paper, Press (2022)
designed a filtering approach leading to high throughput rates on modern GPUs.
Here we propose a new filtering technique based on precomputed k-mer lists, improving the state-of-the-art in several ways. We reduce the computational work for the filtering step by orders of magnitude, resulting
in higher throughput even on standard CPUs. Our approach is able to call about 10^8 reads per day on a
CPU. Additionally, we achieve slightly improved trade-offs between recall and precision.
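A toy sketch of the k-mer filtering idea: only barcodes that share a k-mer with the read are compared exactly. The Hamming distance, the value of k, and the example barcodes are illustrative and not the parameters of the actual method.

```python
# k-mer prefilter for barcode calling: exact comparison is restricted to candidates
# that share at least one k-mer with the read.
from collections import defaultdict

def build_kmer_index(barcodes, k=6):
    index = defaultdict(set)
    for i, bc in enumerate(barcodes):
        for p in range(len(bc) - k + 1):
            index[bc[p:p + k]].add(i)
    return index

def call_barcode(read, barcodes, index, k=6):
    candidates = set()
    for p in range(len(read) - k + 1):
        candidates |= index.get(read[p:p + k], set())   # filtering step
    if not candidates:
        return None
    # exact (here: Hamming) comparison only against the surviving candidates
    return min(candidates,
               key=lambda i: sum(a != b for a, b in zip(read, barcodes[i])))

barcodes = ["ACGTACGTACGT", "TTGACCATGGCA", "GGCATTACGTTA"]
index = build_kmer_index(barcodes)
print(call_barcode("ACGTACGAACGT", barcodes, index))    # -> 0 (closest barcode)
```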
12:10 Barcode Layout Optimization in Spatial Transcriptomics: Theory and Experiments
– by Antonia Schmidt
In the field of spatial transcriptomics, the spatial distribution of gene expression in tissue can be examined.
For this purpose, the transcripts are applied to an array with 1024×768 features (14 µm pitch),
ligated with barcodes and then sequenced. The transcripts are then assigned to their original location via the
barcodes. Unfortunately, the barcode synthesis on the array by photolithographic methods is error-prone,
with a suspected error rate of up to 20% per base, making automated assignment to the barcode library
challenging. Therefore, we have been searching for methods to decrease the error rate. Prior work suggests
that insertion errors arise during barcode synthesis by neighbor effects due to scattered and diffracted light.
This motivates the combinatorial optimization problem studied in this work. For a given set of barcodes, we
attempt to optimize the layout of the barcodes on the array by minimizing the dissimilarity of neighboring
barcodes.
We show that the problem is hard to solve and even hard to approximate: It is maxSNP-hard. An ILP
formulation theoretically allows the computation of optimal results, but it is only applicable for tiny instances.
Therefore, we have developed heuristic algorithms with the aim of computing near-optimal solutions for
instances of realistic size. The algorithms developed include a sorting-based algorithm, a greedy algorithm,
and a genetic algorithm. To assess the quality of the results, we compare the generated solutions with the
expected quality of a random layout and with lower bounds. A GPU implementation of the greedy algorithm
generates the most promising results in terms of both quality and runtime. Solutions to large-scale instances
with 1024×768 barcodes show a 35-40% improvement in quality compared to a random
solution and can be computed within approximately 15 seconds.
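A toy sketch of the greedy heuristic on a small grid, using Hamming distance as the neighbour dissimilarity; the real implementation targets the full 1024×768 array, runs on a GPU, and its objective and similarity measure may differ in detail.

```python
# Greedy barcode placement: fill the grid row by row, always choosing the remaining
# barcode most similar to its already-placed left and top neighbours.
import numpy as np

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_layout(barcodes, rows, cols):
    pool = list(barcodes)
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [g for g in (grid[r][c - 1] if c else None,
                                      grid[r - 1][c] if r else None) if g]
            # cost of a candidate = summed dissimilarity to its placed neighbours
            best = min(range(len(pool)),
                       key=lambda i: sum(hamming(pool[i], n) for n in neighbours))
            grid[r][c] = pool.pop(best)
    return grid

rng = np.random.default_rng(0)
barcodes = ["".join(rng.choice(list("ACGT"), 8)) for _ in range(16)]
layout = greedy_layout(barcodes, 4, 4)
```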
12:30 Improved reconstruction of transcripts and coding sequences from RNA-seq data
– by Jan Grau
New genomes are being sequenced at an amazing pace. Having obtained an assembled genome
sequence, a pivotal requirement for many subsequent analyses is an accurate annotation of gene and
transcript models. Previously, we have developed GeMoMa, an algorithm for homology-based gene
prediction that uses protein sequence homology and the evolutionary conservation of exon-exon borders to
transfer gene models from well-annotated reference species to a newly sequenced target genome. Here, we
present a novel approach termed GeMoRNA for RNA-seq-based gene prediction that shall complement the
homology-based predictions of GeMoMa and, hence, focuses on the prediction of protein-coding genes.
In GeMoRNA, we start from genomic mappings of RNA-seq reads. We partition the genome into covered
regions, build a read graph with basepair resolution connecting positions (nodes) that are adjacent in
mapped reads, and identify connected components. For each connected component, we then merge
consecutive positions without alternative edges into a splicing graph. We use the splicing graph to
combinatorially enumerate candidate transcripts with a heuristic that shall prioritize high-coverage
transcripts. Candidate transcripts may still be chimeras of multiple transcripts. Hence, we combine coverage
information, simplistic CDS prediction and splice site orientation to split such chimeras into individual
transcripts. Resulting potential transcripts are quantified based on RNA-seq evidence in an EM-like
algorithm. Overlapping transcripts with sufficient coverage are finally merged to genes.
We benchmark GeMoRNA against the state-of-the-art tools StringTie, Scallop, and Cufflinks on a large
collection of 230 stranded and unstranded RNA-seq libraries of multiple species (A. thaliana, C. elegans, D.
melanogaster, M. musculus, O. sativa, S. cerevisiae, S. lycopersicum) using the respective reference
annotations as ground truth. For transcript F1 measure (gffcompare), we find that GeMoRNA yields higher
accuracy than Cufflinks and Scallop, while GeMoRNA performs better than StringTie for two thirds of the data
sets. We further compare performance on the level of CDS predictions, which have been in the focus of
GeMoRNA development. Here, we observe that GeMoRNA yields better predictions than all previous
approaches for almost all data sets, where the improvement is specifically pronounced for S. cerevisiae, C.
elegans and A. thaliana.
The combination of homology-based (GeMoMa) and RNA-seq-based (GeMoRNA) gene predictions may
further improve the completeness of (protein-coding) gene predictions.
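A toy sketch of the read-graph construction step, using networkx and hypothetical read coordinates; coverage weighting, the condensation into a splicing graph, and the later enumeration and quantification steps are omitted.

```python
# Read graph at basepair resolution: positions adjacent within a mapped read (including
# across splice junctions) are connected; connected components delimit candidate loci.
import networkx as nx

# each read is a list of aligned blocks (start, end), end exclusive; a gap between
# blocks corresponds to a splice junction
reads = [
    [(100, 150), (300, 360)],     # spliced read spanning an intron
    [(140, 200)],
    [(900, 980)],                 # read from a separate locus
]

G = nx.Graph()
for blocks in reads:
    positions = [p for start, end in blocks for p in range(start, end)]
    G.add_edges_from(zip(positions, positions[1:]))   # adjacency within the read

for component in nx.connected_components(G):
    locus = sorted(component)
    print(locus[0], locus[-1])    # rough extent of each candidate locus
```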
12:50 Lunch (Recommendations are marked in purple on the map)
14:15 Invited Talk: Genotyping structural variation: From simple to complex
– by Birte Kehr
15:00 EVEs in the Shadows: A Gap in Virus Research and Bioinformatics
– by Muriel Ritsch
Endogenous Viral Elements (EVEs) are remnants of viral genetic material integrated into the genome of
germline cells of the host. EVEs are pivotal in the ongoing 'arms race' between host organisms and viruses,
exhibiting transcription and translation capabilities. Some EVEs have antiviral functions or co-opted roles and
are associated with various diseases. EVEs can outnumber protein-coding genes and offer valuable insights
for paleovirology, enriching our comprehension of virus-host interactions throughout history. Despite their
significance, EVEs are often underestimated or ignored in the scientific community. We have explored the
data availability and accessibility of EVEs, identifying numerous challenges that currently make it impossible
to conduct meaningful analyses or perform meta-analyses with EVE data.
The availability of EVE data is hindered by a bias towards endogenous retroviruses (ERVs), which, due to
their disease associations and their high integration capability into host genomes, overshadow non-retroviral
EVEs (nrEVEs) and affect overall data accessibility. The intricate characteristics of EVEs, involving varied
integration patterns and sequence differences, further complicate the process of collecting data and may
hinder their integration into comprehensive databases. We discovered the absence of well-established EVE
identification tools, and in most virus-related tools, EVEs are often not considered or even mentioned.
Beyond data availability, data accessibility is equally critical, as inaccessible data is effectively as good as
nonexistent. We found that effective keyword searches for scientific papers or databases are lacking, and the
absence of standardized methods for presenting and storing EVE data presents significant hurdles.
Additionally, annotations for EVEs in host genomes are rare. Furthermore, we illustrated that the limited
availability of EVE sequences predominantly arises from the lack of dedicated EVE databases and the
challenge of ensuring easy retrieval of EVE sequences within general databases.
In this study, we explored the current state of EVE research, providing a concrete illustration of the
challenges that arise when data is unavailable and inaccessible. These challenges are not unique to the EVE
field and create opportunities for an interdisciplinary exchange of knowledge to find solutions collaboratively.
To address the issues we identified, we propose investing more in EVE identification tools and integrating
EVE data into existing databases, emphasizing the diversity and significance of EVEs in virus research. We
encourage researchers to ask themselves: "Do EVEs play a role in my research, and how can EVEs be integrated?"
15:20 Comprehensive Survey of Conserved RNA Secondary Structures in Full-Genome Alignment of Hepatitis C Virus
– by Sandra Triebel
Background
The Hepatitis C virus (HCV) genome is approximately 9.6 kilobases in length and encodes a polyprotein
precursor flanked by the 5′ and 3′ untranslated regions (UTRs). The construction of full-genome alignments
of viruses is challenging on the sequence level due to high genetic diversity, rapid mutation rates, and the
presence of insertions and deletions (indels) that complicate the alignment process. The HCV genome
harbors several RNA secondary structures essential for viral translation, replication, and assembly. One of
the most prominent RNA secondary structures in HCV is the internal ribosome entry site (IRES), located
within the 5′ UTR, mediating cap-independent translation initiation. Incorporating RNA secondary structures
in the alignment is even more challenging due to the additional complexity introduced by the folding and
variability of RNA structures, requiring specialized methods to incorporate both sequence and structural
information.
Methods
We clustered all complete HCV genomes from the BV-BRC database incorporating k-mer distributions,
followed by dimension reduction and clustering with HDBSCAN. We identified 57 representative genomes
and constructed a full-genome multiple sequence alignment (MSA), including computational prediction of
RNA secondary structures. The MSA process involved initial sequence-based alignment followed by
structure-guided refinement, based on conserved anchor regions. Manual curation was performed by HCV
experts to ensure accuracy and reliability of the alignment and structures.
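A condensed sketch of this clustering step with generic stand-ins (PCA for the dimension reduction, scikit-learn's HDBSCAN, random placeholder sequences); k and all parameters are illustrative rather than those used in the study.

```python
# k-mer profiles -> dimension reduction -> density-based clustering (illustrative).
from itertools import product
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # scikit-learn >= 1.3

def kmer_profile(seq, k=4):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = np.array([seq.count(km) for km in kmers], dtype=float)  # non-overlapping count
    return counts / max(counts.sum(), 1.0)

rng = np.random.default_rng(0)
genomes = ["".join(rng.choice(list("ACGT"), 9600)) for _ in range(20)]  # placeholder genomes

X = np.array([kmer_profile(g) for g in genomes])
X_red = PCA(n_components=5).fit_transform(X)
labels = HDBSCAN(min_cluster_size=3).fit_predict(X_red)   # -1 marks noise/outliers
```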
Results
We successfully tackled three key challenges for HCV: (i) the construction of a comprehensive full-genome
alignment, (ii) the integration of RNA secondary structure prediction (including alternative conformational
states of structures), and (iii) the in silico prediction of novel structures throughout the HCV genome. The
conserved RNA secondary structure motifs are included in the Rfam database (https://rfam.org). In addition,
a detailed analysis of certain structures revealed a relative shift in sequence and structure. This phenomenon
is referred to as incongruent evolution.
Conclusion
Our computational approach predicts RNA secondary structures across entire viral genomes, aiding in the
annotation of conserved RNA structures. The inclusion of HCV models in the Rfam database aligns with a
broader project creating a comprehensive collection of viral RNA structures and covariance models for
automated annotation. Further, we strive to automate the alignment construction and RNA secondary
structure prediction to analyze additional viral clades efficiently.
15:40 Identifying a rule for fluoroquinolone treatment of patients with community-acquired pneumonia (CAP) of moderate severity
– by Marcus Oswald
The role of fluoroquinolone therapy in community-acquired pneumonia (CAP) of moderate severity is a
matter of debate. Even though fluoroquinolones cover a broad spectrum of pathogens and, particularly, can
be very effective against atypical pathogens, they may come along with severe side effects, such as tendon
damage, arthralgia, or diverse neuropathies. They need to be prescribed with care and, so far, are only
recommended as a reserve treatment option.
We used machine learning based on aetiological and clinical parameters, which are available ex ante to
support a personalised decision for or against fluoroquinolones for the best clinical outcome of the individual
patient. We employed a cross-validation scheme based on a well-balanced selection of patients, using a propensity-matching procedure that we developed ourselves, applied to the admission data of more than 6,000
hospitalised patients with moderate severity (non-intensive care unit patients) from the observational,
prospective, multinational CAPNETZ study. We aimed to improve the primary outcome of 90-day survival.
We found that the efficacy of fluoroquinolones depends strongly on the time of year. When treated with fluoroquinolones within a time window starting at the seasonal peak of influenza infections and ending four weeks later, patients showed distinctly lower mortality compared to beta-lactam-treated patients. In turn, during the rest of the year, fluoroquinolones were inferior to beta-lactams.
Stratifying fluoroquinolone treatment in patients following this simple treatment rule may lead to considerably
reduced mortality in CAP. Long-term effects have not yet been assessed but also need to be taken into account before setting up a randomised controlled trial confirming our result.
Members of the CAPNETZ study group are: M. Dreher, C. Cornelissen (Aachen); W. Knüppel (Bad Arolsen);
D. Stolz (Basel); N. Suttorp, M. Witzenrath, P. Creutz, A. Mikolajewska (Berlin, Charité); T. Bauer, D. Krieger
(Berlin); W. Pankow, D. Thiemig (Berlin-Neukölln); B. Hauptmeier, S. Ewig, D. Wehde (Bochum); M. Prediger,
S. Schmager (Cottbus); M. Kolditz, B. Schulte-Hubbert, S. Langner (Dresden); W. Albrich (St Gallen); T.
Welte, J. Freise, G. Barten, O. Arenas Toro, M. Nawrocki, J. Naim, M. Witte, W. Kröner, T. Illig, N. Klopp
(Hannover); M. Kreuter, F. Herth, S. Hummler (Heidelberg); P. Ravn, A. Vestergaard-Jensen, G. Baunbaek-Knudsen (Hillerød); M. Pletz, C. Kroegel, J. Frosinski, J. Winning, B. Schleenvoigt (Jena); K. Dalhoff, J.
Rupp, R. Hörster, D. Drömann (Lübeck); G. Rohde, J. Drijkoningen, D. Braeken (Maastricht); H. Buschmann
(Paderborn); T. Schaberg, I. Hering (Rotenburg/Wümme); M. Panning (Freiburg); M. Wallner (Ulm)
16:00 VIRify: an integrated detection, annotation, and taxonomic classification pipeline using virus-specific protein profile hidden Markov models
– by Martin Hölzer
Introduction
Viruses are the most abundant biological entities on our planet. Some are relevant pathogens for public
health or agriculture; others play ecological roles critical for maintaining ecosystems. Most viruses are yet to
be cultured, so their identification and characterization depend solely on the analysis of DNA or RNA
obtained from the environment. Unlike cellular organisms, viruses also lack a universal genetic marker that
allows taxonomic profiling of an environmental viral community.
Objectives
To facilitate the robust detection of viruses from metagenomics or metatranscriptomics sequencing data, we
developed a computational pipeline called VIRify. Our pipeline is designed to provide a user-friendly and
accurate functional and taxonomic characterization of viral communities.
Materials & Methods
VIRify identifies viral contigs and prophages from metagenomic assemblies and annotates them using a
collection of viral profile hidden Markov models (HMMs). These include our manually curated profile HMMs
(ViPhOGs), which serve as specific taxonomic markers for many prokaryotic and eukaryotic viral taxa. We
tested VIRify on assemblies from two microbial mock communities, a large marine metagenomics study, and
human gut phages.
Results
VIRify efficiently detects and classifies viruses targeting bacteria or eukaryotic organisms, both in mock
microbial communities and in samples from the world’s oceans and the human gut microbiome, providing
taxonomic classifications from the genus to the family rank with high accuracy. Overall, we demonstrate that
VIRify is a novel and powerful resource that offers an enhanced capability to detect and taxonomically
classify a broad range of viral contigs. VIRify is user-friendly, requires minimal interaction with the command
line, and was developed with portability in mind. VIRify can enhance the exploration of viral diversity in
nature and support the detection of pathogenic viruses with pandemic potential. The pipeline is freely
available at https://github.com/EBI-Metagenomics/emg-viral-pipeline.
16:20 RNAswarm – A Modular Pipeline for Differential RRI Analysis in Influenza A Virus
– by Gabriel Lencioni Lovoate
The segmented RNA genome of Influenza A virus (IAV) is subject to intricate RNA-RNA interactions (RRIs),
which are crucial for understanding viral evolution, vaccine development, and pandemic response. Current
RNA proximity ligation methods like PARIS (psoralen analysis of RNA interactions and structures), SPLASH
(psoralen crosslinked, ligated, and selected hybrids), and 2CIMPL (dual crosslinking, immunoprecipitation,
and proximity ligation) can detect RRIs on a large scale [1-3], yet there is no bioinformatics pipeline for high-throughput comparison of RRI frequency across different strains or conditions. Addressing this gap, we
present RNAswarm, a versatile and reproducible Nextflow pipeline designed for differential RRI analysis,
which is already available on GitHub (https://github.com/rnajena/RNAswarm). This open-source tool
efficiently processes raw reads from proximity ligation experiments, identifies RRIs de novo, and employs DESeq2 for statistical analysis, allowing for strain-specific RRI site identification and the validation of known
interactions within IAV strains. RNAswarm implements de novo annotation of RRIs through the generation of
pairwise matrices of chimeric reads, followed by fitting Gaussian Mixture Models (GMMs) to pinpoint
interaction candidates. This approach solely uses the density count of cross-linked reads to find potential
interactions based on the fitted GMMs. The pipeline has demonstrated its utility in identifying evolutionarily
conserved interaction sites, as well as flexible sites that can potentially interact with multiple segments [4]. It provides a systematic approach to prioritize and compare RRIs, and has the potential to
uncover novel RRIs in viruses and other biological systems, since RNAswarm is agnostic to the origin of the
used references. The modularity and visualization capabilities of RNAswarm make it a useful resource for
researchers probing the viral RNA interactome.
1. Aw JGA et al. (2016) Molecular Cell. 62:603–17.
2. Lu Z et al. (2016) Cell. 165:1267–79.
3. Sage VL et al. (2020) Cell Reports. 31.
4. Jakob C et al. (2023) Nucleic Acids Research. gkad442.
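A minimal sketch of the GMM-based candidate detection on simulated chimeric-read coordinates; the component count, the coordinates, and the use of scikit-learn's GaussianMixture are illustrative stand-ins for the pipeline's implementation.

```python
# Chimeric read pairs as 2D points (position on segment A, position on segment B);
# Gaussian mixture components approximate candidate interaction sites.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
site1 = rng.normal([350, 1200], 15, size=(400, 2))   # simulated reads around one site
site2 = rng.normal([800, 150], 20, size=(300, 2))    # and around a second site
points = np.vstack([site1, site2])

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(points)
for mean, weight in zip(gmm.means_, gmm.weights_):
    print(f"candidate interaction at ~({mean[0]:.0f}, {mean[1]:.0f}), weight {weight:.2f}")
```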
16:50 Closing remarks
17:00 End of conference