Details for each talk can be expanded as a dropdown by clicking on its title.
Thursday, February 29
10:00 Registration desk opens
11:00 Welcome note
11:15 Invited Talk: Guessing, measuring and inferring metabolic fluxes
– by Nicola Zamboni
12:00 Dynamics of algae-bacteria interactions in artificial plankton communities
– by Marco Mauri
Marine phytoplankton is composed of unicellular algae and their associated bacteria. These algal
communities are highly variable in species composition, but their patterns tend to recur seasonally. This
diversity suggests the existence of ecological niches for the associated bacteria.
To clarify the effect of the bacterial community on the phytoplankton, we co-cultured several bacterial species with microalgae. We found that the inhibitory or stimulatory effect of a particular bacterium depends both on the co-cultured partner and on the growth phase of the algal culture. Using a generalized
Lotka-Volterra model, we were able to predict the behaviour of an artificial community composed of C.
radiatus diatoms and four bacterial species, and demonstrate the importance of pairwise bacteria-algae
interactions [1].
Our study supports the idea that interactions between bacteria and algae are highly species-specific and
depend on algal fitness, bacterial metabolism, and community composition. This species specificity may
underlie the high complexity of the multi-species plankton communities observed in nature.
[1] Deng, M. Mauri, M. Vallet, M. Staudinger, R.J. Allen, G. Pohnert, Appl Environ Microbiol. 88, 22, 2022.
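As a rough illustration of the modelling approach (not the fitted model from [1]), a minimal generalized Lotka-Volterra sketch in Python could look as follows; the species, growth rates, and interaction coefficients are purely hypothetical placeholders.

```python
# Minimal generalized Lotka-Volterra (gLV) sketch for one alga and two bacteria.
# All rates and interaction coefficients are illustrative, not fitted values.
import numpy as np
from scipy.integrate import solve_ivp

r = np.array([0.6, 0.9, 0.4])           # intrinsic growth rates (alga, bacterium 1, bacterium 2)
A = np.array([[-0.8,  0.1, -0.3],        # A[i, j]: effect of species j on species i
              [ 0.5, -1.0,  0.0],
              [-0.2,  0.0, -1.0]])

def glv(t, x):
    """dx_i/dt = x_i * (r_i + sum_j A_ij * x_j)"""
    return x * (r + A @ x)

sol = solve_ivp(glv, (0, 50), y0=[0.1, 0.05, 0.05])
print(sol.y[:, -1])                      # community composition at the final time point
```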
12:20 Predicting the environment of bacterial community composition
– by Rose Brouns
Abiotic factors, including salinity, temperature, and pH, play a crucial role in shaping microbial community structure. Given how strongly these abiotic factors shape microbial communities, we expect that, conversely, they can be predicted from community features. Being able to relate community composition to abiotic factors can have a huge impact on microbiome research, because it can help predict what will happen to a community once its environment changes.
We downloaded the following publicly available data from the MGnify database: taxonomic composition,
biome description, salinity, pH and temperature of study sites around the world. We used dimension
reduction techniques to visualize the data based on different abiotic factors and biome. While some
separation was observed, the samples did not form clear clusters, suggesting that there is more to learn
about the complex ecology of microbial communities.
Using Random Forest (RF) models, we could classify samples into high- or low-salinity groups based on their taxonomic composition at different taxonomic ranks. The models achieved a mean accuracy of 96-97% (standard deviation between 0.005 and 0.008) in predicting high or low salinity of the environment.
To cross-check the results of our models, we conducted differential abundance analysis (DAA), in which we identified taxa that are significantly differentially abundant between high- and low-salinity environments. These results are consistent with the most important taxa identified by the RF models.
Our success in predicting salinity levels suggests that we can also predict other abiotic factors such as pH and temperature; these predictors are currently under development. This opens up the possibility of using metagenomic data to predict a wide range of environmental conditions and to gain insight into the interplay between community composition dynamics and abiotic factors.
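A minimal sketch of the classification step, assuming a samples-by-taxa relative-abundance matrix with synthetic labels (the actual analysis uses MGnify profiles and recorded salinity values):

```python
# Random Forest classification of high vs. low salinity from taxonomic profiles.
# X and y are synthetic stand-ins for MGnify abundance tables and salinity labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(50), size=200)          # 200 samples x 50 taxa (relative abundances)
y = (X[:, 0] + rng.normal(0, 0.02, 200)) > 0.02    # placeholder "high salinity" labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)          # per-fold accuracy
print(scores.mean(), scores.std())

clf.fit(X, y)
top_taxa = np.argsort(clf.feature_importances_)[::-1][:10]  # taxa to cross-check against DAA
```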
12:40 Lunch (Recommendations are marked in purple on the map)
14:10 Flexible multi-omics data integration using autoencoders by a versatile PyTorch-based toolbox: AUTOENCODIX
– by Jan Ewald
Current omics techniques generate data characterized by varying types (gene counts, mutations) and huge
feature spaces (genome-wide). Hence, data integration and dimension reduction methods are critical not
only from a computational perspective but also to gain biological insights. Further, latent spaces and
embeddings have been successfully used to develop disease prognosis models with high predictive power.
Recently, several types of autoencoders for multi-omics data have proven successful in unsupervised learning and feature-engineering tasks such as cancer subtyping or single-cell atlases. However, no easy-to-use and flexible implementation of such autoencoders currently exists; here, we propose a versatile, PyTorch-based tool for autoencoders on multi-omics data.
Our toolbox will provide the following key features. Users can flexibly choose and specify different omics
types (transcriptome, proteome, mutome, clinical parameters etc.) as well as several autoencoder types and
topologies (variational, hierarchical etc.) including explainability by design (ontology-based autoencoder) or
translation of data modalities. Further, autoencoders can be optimized via feature filtering and scaling options
as well as fully integrated hyperparameter tuning by Optuna. Lastly, our implementation integrates, as a
pipeline, all necessary steps from data preprocessing and model tuning to model evaluation, including the
visualization and clustering of resulting latent spaces. To illustrate the power of our versatile implementation,
we showcase multi-omics data integration on the TCGA pan-cancer data set for cancer subtyping and
prognosis models. First results show that, in particular, variational autoencoders learn compact and dense latent spaces in comparison to methods like t-SNE or UMAP.
In summary, our toolbox offers an easy-to-use solution while being versatile for multi-omics data integration
by autoencoders. Further, it can be extended to additional types of autoencoders and applied to large-scale
data sets, especially single-cell sequencing data.
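A minimal sketch of one supported model type, a variational autoencoder on a concatenated multi-omics feature matrix; the layer sizes and the plain concatenation of omics blocks are simplifying assumptions, not the toolbox's actual defaults:

```python
# Minimal variational autoencoder for a concatenated multi-omics feature matrix.
# Dimensions and architecture are placeholders; AUTOENCODIX additionally supports
# hierarchical, ontology-based, and cross-modality (translation) autoencoders.
import torch
import torch.nn as nn

class MultiOmicsVAE(nn.Module):
    def __init__(self, n_features, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_features))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.decoder(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    """Reconstruction error plus KL divergence to the standard normal prior."""
    recon = nn.functional.mse_loss(x_hat, x)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```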
14:30 Encoding and decoding the Microverse
– by Aristeidos Litos
Complex dynamics and co-occurrence patterns transpire throughout the Microverse and are reflected in the
composition of microbial communities. These dynamics, which could be stochastic, universal, niche-specific,
or even sample-specific, play a crucial role in determining the structure and function of microbial
communities. However, discerning the relative contribution of universal and context-specific processes in
shaping the composition of microbial communities remains a challenge. If microbial communities are largely
shaped by a set of conserved and universal processes, then the information needed to reconstruct their
composition is minimal, which signifies a large level of compressibility in community composition.
Conversely, if communities are shaped by context-specific rules, making each community essentially unique,
a comprehensive understanding of the entire community would be necessary to reconstruct its composition,
indicating that compositions are largely incompressible.
Intrinsic dynamics at the global scale remain understudied, even as the volume of available data keeps increasing. Here, we introduce a neural network-based framework to unravel intrinsic dynamics in microbial communities and the concepts underlying them. To achieve this, we utilize all the data in the MGnify database and develop a neural network with an autoencoder architecture for microbial compositions.
Autoencoders learn compressed representations of data by reducing dimensions and reconstructing them. In
our implementation we further introduce taxonomic information, by projecting the composition to a universal
taxonomy tree. Hyperparameters regarding the model’s architecture, training procedure and preprocessing
of the data are tuned based on the beta-diversity between predictions and samples along with other metrics.
Each sample is assigned a compressibility score that reflects the model's reconstruction performance.
High compressibility scores indicate wider-spread intrinsic patterns in the composition, while low scores
suggest uniqueness. Therefore, with our generative model, we can classify microbial taxonomic profiles in a
gradient from universal to context-specific dynamic patterns. Universal patterns throughout the Microverse
can be revealed with models trained and tested on datasets from different biomes.
Shared patterns of intrinsic dynamics encrypted in the composition between samples of different biomes can
emerge and further our understanding of microbial communities, their structures, functions and role in nature.
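As a sketch of how a per-sample compressibility score could be derived from reconstruction quality, the snippet below uses Bray-Curtis dissimilarity as a stand-in beta-diversity metric; the framework's exact score definition may differ.

```python
# Per-sample compressibility score from autoencoder reconstructions (illustrative only).
import numpy as np
from scipy.spatial.distance import braycurtis

def compressibility_scores(originals, reconstructions):
    """originals, reconstructions: samples x taxa relative-abundance arrays."""
    scores = []
    for x, x_hat in zip(originals, reconstructions):
        dissimilarity = braycurtis(x, x_hat)    # 0 = perfect reconstruction, 1 = no overlap
        scores.append(1.0 - dissimilarity)      # high score = well compressed ("universal")
    return np.array(scores)
```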
14:50 Genomic insights through antibiotic resistance of wastewater E. coli using positive selection, GWAS, and AI analyses
– by Negin Malekian
We have explored the link between antibiotic resistance and genomic variations in wastewater E. coli
samples. Wastewater is chosen as a representative source because it contains diverse genomic content from various bacteria.
Our research addresses two main aspects: whether to include antibiotic resistance data and the granularity
of representing genomic variations. We investigated positive selection in E. coli using only genomic data [1].
Our study revealed mutations and genes linked to antibiotic resistance, especially those situated on the rim
of membrane porins.
Examining both genomic and antibiotic resistance data at different resolution levels (single mutations, unitigs
(genome words), and machine learning models), our study [2] finds that even the simplest representation of
single mutations detects known resistance mutations and offers insights into potential novel mechanisms. By
using larger genomic units (unitigs), we confirm these findings and identify joint mutations in known and
novel genes [3]. Utilizing advanced machine learning models, including AI language models, the research
predicts antibiotic resistance based on genomic data, with an accuracy above 90%.
In summary, we uncovered new antibiotic resistance mechanisms, conducted a significant positive selection
study in E. coli, and pioneered the utilization of AI language models for predicting antibiotic resistance.
[1] Malekian, Negin, et al. "A genome-wide scan of wastewater E. coli for genes under positive selection: focusing on mechanisms of antibiotic resistance." Scientific Reports 12.1 (2022): 8037.
[2] Malekian, Negin, et al. "Mutations in bdcA and valS correlate with quinolone resistance in wastewater Escherichia coli." International Journal of Molecular Sciences 22.11 (2021): 6063.
[3] Malekian, Negin, et al. "Word-based GWAS harnesses the rich potential of genomic data for E. coli quinolone resistance." Frontiers in Microbiology, under revision.
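For illustration, a sketch of resistance prediction from a unitig presence/absence matrix with a sparse linear model; the data are synthetic, and the models evaluated in the study (including language models) are considerably more elaborate.

```python
# Resistance prediction from a (synthetic) unitig presence/absence matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 5000))    # 300 isolates x 5000 unitigs (0/1 presence)
y = X[:, 42] | (rng.random(300) < 0.05)     # placeholder resistance phenotype

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)  # sparse model
print(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
```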
15:10 JIPipe2AI – deep learning access for everyone
– by Jan-Phillip Praetorius
In recent years, deep learning has become an essential tool for the quantitative analysis of biomedical image
data.
Cell-segmentation-focused algorithms like Cellpose [1] and StarDist [2] already offer a powerful and easy-to-understand approach to utilizing deep learning by providing simple parameters, pre-trained models for
biological object extraction, and a user-friendly graphical interface. More general bioimage analysis tasks are
covered by tools like deepImageJ [3] and Ilastik [4] that support a wider range of image analysis model types
and can import existing pre-trained models.
Yet, the training of models and the design of entirely new architectures are currently only accessible via advanced deep learning frameworks that require extensive programming knowledge. Additionally, such frameworks also enable the creation of pipelines that combine models with image processing operations for pre- and postprocessing, as well as the application of explainable AI techniques that visualize image regions associated with the decision-making process.
To address these limitations and provide an easy-to-use deep learning toolbox for non-programmers, we
developed JIPipe2AI, which combines the powerful visual programming language JIPipe [5] with a
standardized and versatile interface to advanced deep learning frameworks. The functions are accessible via
fully documented and intuitive processing steps that seamlessly integrate with the existing ImageJ-based [6]
functions of JIPipe.
JIPipe2AI thus allows non-programmers to take advantage of state-of-the-art techniques, including transfer
learning, automated data augmentation, distributed learning across multiple devices, and multi-level
regularization. In addition, the design of the interface allows users to implement their own models and
conveniently compare their performance with models already included in JIPipe2AI. These features make
JIPipe2AI a powerful tool for beginners and advanced users to create (custom) model architectures, train
models on their own data from scratch or improve the performance of existing ones, generate and
postprocess predictions, and apply explainable AI techniques.
1. Pachitariu, M. & Stringer, C. Cellpose 2.0: how to train your own model. Nat. Methods 19, 1634–
1641 (2022).
2. Schmidt, U., Weigert, M., Broaddus, C. & Myers, G. Cell Detection with Star-Convex Polygons. in
Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 (eds. Frangi, A. F., Schnabel,
J. A., Davatzikos, C., Alberola-López, C. & Fichtinger, G.) vol. 11071 265–273 (Springer International
Publishing, 2018).
3. Gómez-de-Mariscal, E. et al. DeepImageJ: A user-friendly environment to run deep learning models
in ImageJ. Nat. Methods 18, 1192–1195 (2021).
4. Berg, S. et al. ilastik: interactive machine learning for (bio)image analysis. Nat. Methods 16, 1226–
1232 (2019).
5. Gerst, R., Cseresnyés, Z. & Figge, M. T. JIPipe: visual batch processing for ImageJ. Nat. Methods
(2023) doi:10.1038/s41592-022-01744-4.
6. Rueden, C. T. et al. ImageJ2: ImageJ for the next generation of scientific image data. BMC
Bioinformatics 18, 529 (2017).
15:30 Using DNA language models to derive per-nucleotide somatic mutation probabilities for individual patients
– by Anna R. Poetsch
Identical sequences in the human genome can carry multiple layers of information content, simultaneously encoding information for proteins, gene regulation, and the genome's own replication. At the same time, the genome needs to remain stable to prevent damage, mutation, and the resulting dysfunction and/or change or loss of information, with consequences for somatic evolution, which leads to somatic mosaicism, ageing, and cancer.
We use DNA language models to investigate how stability is encoded in the DNA. EAGLE-MUT (Efficient
Analysis with a Genome-wide LSTM to Evaluate per-nucleotide MUTation susceptibility) learns the sequence
context of somatic single base substitutions (SBS) in individual patient samples of somatic mutagenesis. The model is pre-trained on mutations pooled from multiple tumors and fine-tuned for individual samples, and retrospective probabilities are derived for every nucleotide in the genome. We evaluate the model via a ratio against a model of random mutation distribution over the genome and reach a median per-patient performance of 19-fold over random, with a peak performance of 93-fold.
The probability tracks can be used to quantify selective pressures, to correlate with other continuous data, or to derive treatment-related probabilities of mutagenesis. Most importantly, they allow the discovery of mechanisms that predispose to or protect from mutagenesis at specific sites. We found patterns of mutation distribution to follow sequence content, genome elements, and diverse epigenetic profiles, and we discovered novel, biologically meaningful motifs of up to 200 bp in length at distinct mutation cold- and hotspots. Together, this shows that the mutation distribution is not random but that the location of mutations depends on the genome sequence.
Models like EAGLE-MUT overcome the sparse nature of mutation data and provide novel insights with unprecedented detail into somatic human genome evolution and thus the development of somatic mosaicism, ageing, and cancer.
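As a schematic of the underlying idea (not the actual EAGLE-MUT architecture or training setup), a small bidirectional LSTM mapping one-hot sequence context to per-position mutation probabilities could be sketched as follows:

```python
# Toy per-nucleotide mutation-probability model: bidirectional LSTM over one-hot DNA.
# Layer sizes, window length, and the missing training loop are placeholders.
import torch
import torch.nn as nn

class PerNucleotideLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=4, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                       # x: (batch, length, 4) one-hot A/C/G/T
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h)).squeeze(-1)   # per-position probability

model = PerNucleotideLSTM()
window = torch.eye(4)[torch.randint(0, 4, (1, 200))]      # random 200-bp one-hot window
probs = model(window)                                      # shape: (1, 200)
```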
15:50 Coffee break
16:20 Bayesian Inference of Chromosome Structures at Multiple Scales from Single-Cell Hi-C Data
– by Vincent Messow
Recent research has shown that the 3D architecture of genomes impacts the regulation of essential cellular
processes.
Chromosome conformation capture techniques such as Hi-C have become instrumental in characterizing the
3D structure of chromatin.
Although the development of ever-improving protocols is advancing at a great pace, contacts between
genomic loci as reported by Hi-C methods remain sparse, noisy and low in resolution.
These challenges are even more severe in single-cell Hi-C variants.
Bayesian approaches to chromosome structure calculation have proven to be capable of addressing noise
and lack of resolution by modeling the uncertainty in the data and chromosome conformations.
Chromosomes are commonly represented coarsely as *beads on a string* with bead radii, force field
parameters and contact distances chosen *ad hoc* by trial and error.
The hierarchical organization of genome architecture as well as variations in the density of Hi-C contacts
suggest a 3D representation of chromosomes at multiple scales in which both the number of beads and their
physical sizes vary locally.
In contrast to single-scale models with fixed parameters, an *ad hoc* selection of model parameters is no
longer suitable for multi-scale settings.
A Bayesian perspective suggests that bead parameters should be inferred from the data along with bead
coordinates.
State-of-the-art multi-scale methods for large-scale genome structure determination refrain from taking a full
Bayesian approach due to putatively resource-intensive conformation sampling[^meng2021].
We aim to illustrate that fully Bayesian *Inferential Structure Determination* (ISD) from single-cell Hi-C data
is still feasible by means of an efficient custom-developed particle engine that strongly improves upon the
computation of pairwise interactions of previous work[^carstens2016].
To further improve sampling, we will explore *resolution exchange* Monte Carlo, a replica exchange inspired
group of techniques which swap sampled conformations between independent simulations at different
resolutions.
We also discuss potential extensions to our approach, paving the way towards integrating inference of bead
parameters with the ISD framework.
[^meng2021]: Meng et al. 2021 [https://doi.org/10.1038/s41467-021-24662-z](https://doi.org/10.1038/s41467-021-24662-z)
[^carstens2016]: Carstens et al. 2016
[https://doi.org/10.1371/journal.pcbi.1005292](https://doi.org/10.1371/journal.pcbi.1005292)
16:40 Finding plant protective sRNAs via deep degradome sequencing – Is counting still adequate?
– by Bernhard Timo Gadow-Werner
Small regulatory RNAs (sRNAs) regulate gene expression in nearly all higher organisms via RNA
interference (RNAi). RNAi causes, among other effects, the targeted degradation of messenger RNAs at
sites complementary to a respective sRNA. Ongoing research in this field has revealed not only a plethora of endogenous functions of RNAi, but also silencing between different organisms. In this process of cross-kingdom RNAi (ckRNAi), sRNAs are transferred from one organism (e.g. a pathogen) to another organism (e.g. a host). Since the discovery of ckRNAi a decade ago, a wide range of examples from all major eukaryotic clades has been discovered. Among them are two examples of plants silencing pathogen genes
(Borniego & Innes, 2023).
The project WheatInterfere aims to find anti-fungal sRNAs naturally occurring in the staple crop wheat, to aid
breeders and to find promising and safe candidates for RNA-based crop protection. For this purpose, we
generated a substantial dataset of more than a billion reads from plants challenged with two fungal
pathogens – Fusarium graminearum and Magnaporthe oryzae, a.k.a. cereal killers. Especially promising is the improved parallel analysis of RNA ends (PARE) protocol, which produces more reads and longer tags than former degradome sequencing approaches.
A closer look at these data reveals previously unknown biases that are overlooked by established bioinformatics pipelines, which rely solely on binomial distributions. In this talk, we will introduce a new pipeline for positional bias reduction using non-discrete statistics.
Borniego, M.L. and Innes, R.W., 2023: Extracellular RNA, mechanisms of secretion and potential functions.
Journal of experimental botany, erac512, DOI: 10.1093/jxb/erac512.
17:00 Nonlinear DNA methylation trajectories in aging mice
– by Alena van Bömmel
Although DNA methylation data yields highly accurate age predictors, little is known about the dynamics of
this quintessential epigenomic biomarker during lifespan. To narrow the gap, we investigated the methylation
trajectories of male mouse colon at five different time points of aging. Our study indicates the existence of
sudden hypermethylation events at specific stages of life. Specifically, we identify two epigenomic switches
during early-to-midlife (3-9 mo) and mid-to-late-life (15-24 mo) transitions, separating the rodents’ life into
three stages. These surprisingly nonlinear methylation dynamics predominantly affect genes associated with
the nervous system and enrich in bivalently marked chromatin regions. Based on groups of nonlinearly
modified CpGs, we construct a novel clock-like classifier STageR that accurately predicts murine epigenetic
stage. We demonstrate the universality of our clock in an independent mouse cohort and with publicly
available datasets.
17:20 Convergent promoters form cooperative domains
– by Martin Fischer
Transcription is one of the most fundamental processes in biology. Over the past decade, it has become
evident that most promoters and enhancers initiate bidirectional transcription, generating two RNAs that are
transcribed in divergent directions. More recently, antisense transcription downstream of promoters has been
found to be surprisingly prevalent, but its function is unknown. According to the standing model of
transcription interference, convergent transcription, i.e., the collision of sense and antisense transcription, is
expected to display a negative correlation. Here, we show that convergent transcription marks an
unexpectedly cooperative promoter constellation. By assessing transcriptional dynamic systems with deep
CAGE-seq, we identified cooperative constituent promoters that are connected by a distinct chromatin
structure. Visualization of nascent RNA molecules using single-molecule FISH revealed co-transcription from
convergent promoters. Within these cis-regulatory domains, transcription factors can regulate both
constituting promoters by binding to only one of them. Epigenetic data suggest that convergent promoters
form a joint regulatory domain spanning CpG islands enriched for active promoter marks. Convergent
promoters account for about a quarter of all active transcript starts and initiate 5’-overlapping antisense
RNAs, a class of RNA previously thought to be rare. Integrative analysis of CAGE-seq, RNA-seq, and
QuantSeq data provides precise annotations for downstream antisense RNAs generated from convergent
promoters that have missing or incomplete annotations in GENCODE. Taken together, our results
demonstrate that cooperative promoters substantially expand the cis-regulatory repertoire, reveal limitations
of the transcription interference model, and call for adjusting the promoter concept.
17:40 Poster session
18:00 Dinner, poster session continued
Friday, March 1
9:00 Invited Talk: Mapping the Microverse and modeling its drivers
– by Bas E. Dutilh
9:45 Computational prediction of pH changes during bacterial growth
– by Rosalind Allen
Changes in pH due to bacterial metabolism play a key role in diverse areas such as tooth decay and food
production, and are also important for detection of bacterial contamination and infection. Flux Balance
Analysis (FBA) is a well-established method for computational prediction of metabolism, but it has rarely
been used to predict pH changes. We aimed to develop a computational framework to predict pH changes
associated with bacterial growth, and to compare its results to measurements for diverse bacterial species
under a range of growth conditions. A mathematical modelling framework for predicting pH changes
associated with bacterial growth was developed by combining FBA with the pH calculation software CurTiPot.
Experimental measurements of changes in pH during bacterial growth were made for lab strains of
Escherichia coli and Staphylococcus epidermidis, on LB medium with and without glucose and under aerobic
and anaerobic conditions. Our computational predictions were in good agreement with our measurements for
E. coli under the various growth conditions, but agreement for S. epidermidis was poorer, pointing to possible deficiencies in the underlying metabolic model for this organism. Our computational approach is promising and may in the future provide a way to identify the key metabolites and metabolic pathways that
mediate the effect of microbial growth on pH for different microorganisms.
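As a back-of-the-envelope illustration of this coupling (the actual framework combines genome-scale FBA with the CurTiPot software), the sketch below estimates the pH shift from a predicted acid secretion flux using a simple buffer approximation; all numbers are placeholders.

```python
# Toy coupling of an FBA-predicted acid secretion flux to a pH estimate via the
# Henderson-Hasselbalch equation for a phosphate-like buffer. Flux, biomass, buffer
# concentration, and pKa are illustrative assumptions, not measured values.
import numpy as np

def buffered_ph(acid_mM, buffer_mM=50.0, ph0=7.0, pka=7.2):
    """pH after adding acid_mM of a strong monoprotic acid to a simple buffer."""
    base0 = buffer_mM / (1 + 10 ** (pka - ph0))    # initial conjugate-base concentration
    acid0 = buffer_mM - base0
    base = max(base0 - acid_mM, 1e-9)              # protons consume the conjugate base
    acid = acid0 + min(acid_mM, base0)
    return pka + np.log10(base / acid)

acetate_flux = 8.0     # mmol gDW^-1 h^-1, e.g. taken from an FBA solution (placeholder)
biomass = 0.3          # gDW per litre of culture (placeholder)
hours = 5.0
print(buffered_ph(acid_mM=acetate_flux * biomass * hours))   # ~6.4 for these numbers
```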
10:05 Nested metabolic defense strategies of antibiotic-resistant bacterium Pseudomonas aeruginosa
– by Wassili Dimitriew
P. aeruginosa, a versatile Gram-negative bacterium, has emerged as a formidable human pathogen, causing
a range of infections that pose significant challenges to medical treatment. This opportunistic pathogen is
particularly problematic in hospital settings, affecting individuals with compromised immune systems, such as
those undergoing chemotherapy or with cystic fibrosis [1]. One of the key features contributing to the
resilience of P. aeruginosa is its intrinsic resistance to many antibiotics. The bacterium possesses an array of
efflux pumps and produces biofilms, protective communities of bacteria embedded in a self-produced matrix,
both of which contribute to its ability to resist conventional treatments. P. aeruginosa falls within the "critical" category of the World Health Organization's priority list of bacterial pathogens, demanding urgent research and development of new antibiotics [2]. This bacterium can also adapt to various environments, making it a
persistent threat in diverse clinical settings [1, 3]. As an example of this adaptability, one can mention the
ability of P. aeruginosa to survive inside macrophages. The resulting interplay can be described in terms of
nested defense strategies, where the host and the pathogen develop mechanisms to counter the opponent’s
measures. For P. aeruginosa, the mechanism of itaconic acid degradation has been described [4], which is a rare example of counter-counter-counter- (or counter³-) defense. We have constructed a model of the nested defense strategy, using P. aeruginosa as an example organism with a counter³-defense mechanism. We have
explored the system parameters’ time courses, the optimal resource allocation from the pathogen’s side, and
parameter sensitivity for different infection scenarios, using the quasi-sequential approach to dynamic
optimization [5]. This allowed us not only to achieve a deeper understanding of the P. aeruginosa infection process, but also to hypothesize about new methods to fight this infection.
References
[1] Irene Jurado-Martín, Maite Sainz-Mejías, and Siobhán McClean. "Pseudomonas aeruginosa: An audacious pathogen with an adaptable arsenal of virulence factors". In: International journal of molecular sciences 22.6 (2021), p. 3128.
[2] João Botelho, Filipa Grosso, and Luísa Peixe. "Antibiotic resistance in Pseudomonas aeruginosa – Mechanisms, epidemiology and evolution". In: Drug resistance updates 44 (2019), p. 100640.
[3] Karolina Anna Mielko et al. “Metabolomic studies of Pseudomonas aeruginosa”. In: World Journal of
Microbiology and Biotechnology 35 (2019), pp. 1–11.
[4] Jahminy Sasikaran et al. “Bacterial itaconate degradation promotes pathogenicity”. In: Nature chemical
biology 10.5 (2014), pp. 371–377.
[5] Martin Bartl, Pu Li, and Lorenz T Biegler. “Improvement of state profile accuracy in nonlinear dynamic
optimization with the quasi-sequential approach". In: AIChE Journal 57.8 (2011), pp. 2185–2197.
10:25 Genome-scale metabolic modeling of Aspergillus fumigatus strains reveals growth dependencies on the lung microbiome
– by Sascha Schäuble
Aspergillus fumigatus, an opportunistic human pathogen, frequently infects the lungs of people with cystic
fibrosis and is one of the most common causes of infectious-disease death in immunocompromised patients.
Here, we construct 252 strain-specific, genome-scale metabolic models of this important fungal pathogen to
study and better understand the metabolic component of its pathogenic versatility. The models show that
23.1% of A. fumigatus metabolic reactions are not conserved across strains and are mainly associated with
amino acid, nucleotide, and nitrogen metabolism. Profiles of non-conserved reactions and growth-supporting
reaction fluxes are sufficient to differentiate strains, for example by environmental or clinical origin. In
addition, shotgun metagenomics analysis of sputum from 40 cystic fibrosis patients (15 females, 25 males)
before and after diagnosis with an A. fumigatus colonization suggests that the fungus shapes the lung
microbiome towards a more beneficial fungal growth environment associated with aromatic amino acid
availability and the shikimate pathway. Our findings are starting points for the development of drugs or
microbiome intervention strategies targeting fungal metabolic needs for survival and colonization in the non-native environment of the human lung.
10:45 Metabolic Quest: Unveiling metabolic intricacies through expression patterns
– by Shrutakirti Saha
Candida albicans, a significant fungal pathogen, represents a great health challenge globally. To understand and unravel its intricate mechanisms, we use metabolic modelling that integrates omics data. Proliferation and virulence are largely driven by gene expression, which is in turn regulated by metabolic pathways. Serine-arginine (SR) protein kinases are characterized by phosphorylating SR proteins and are highly conserved in
eukaryotes. To investigate the role of SR protein kinases in Candida albicans and their metabolic implications, we performed RNA sequencing of sky2 and sky2Δ C. albicans grown in three distinct media (glucose, succinic acid, and malic acid) at 6 hours. At each time point, we systematically gathered omics data, encompassing transcriptomics and metabolomics. This dataset forms a basis for understanding the metabolic fluxes and the dynamic behavior of C. albicans. We developed a topological
model of the central metabolism of C. albicans comprising the tricarboxylic acid (TCA) cycle, which plays an important role in converting diverse carbon sources into energy, and the nicotinate pathway, which is responsible for the biosynthesis of vitamin B3 (niacin). The nicotinate pathway allows us to understand cellular processes under limited nutrient availability. Knockout strains grew less well on the three different media, with lower fluxes than the wild-type variant. We also created a topological model to analyze
the arginine and sphingolipid metabolism in C. albicans; sphingolipid biosynthesis plays an important role in the filamentation of the fungus. Our study of metabolism using elementary modes that integrate omics data opens a new perspective on enzyme and pathway usage in C. albicans during growth on different carbon sources.
11:05 Coffee break
11:30 Implementation, Validation, and Application of a Bioinformatics Framework for Analyzing Oxford Nanopore Technologies Genome Sequencing Data from Zoonotic Bacterial Pathogens
– by Christine Thomas
Investigating bacterial genomes at the nucleotide level requires whole-genome sequencing and bioinformatics
analysis. While Illumina sequencing is established (ISO 23418:2022), validation is still missing for the
application of Oxford Nanopore Technologies within the German national surveillance program of zoonotic
bacteria. Automated bioinformatics pipelines enable researchers with little expertise in computer science to
investigate their sequencing data, and ensure reproducibility. Existing pipelines for ONT data analysis either miss important steps such as basecalling (especially with the latest basecaller, Dorado) or polishing (Deneke 2023, Petit 2023), are not tailored to bacteria, or use outdated tools (Murigneux 2021). Here, a
bioinformatics framework for creating closed genome assemblies based on bacterial ONT sequencing data is
implemented, validated, and applied. This framework incorporates basecalling, quality control, assembly, and polishing of assemblies. The implementation uses Python and the workflow management system Snakemake, allowing for parallelism, scalability, and reproducibility. As diverse tools exist for each
analysis step, the performance of different tool combinations is evaluated. To this end, whole-genome sequencing data of seven reference strains of highly pathogenic bacteria (Bacillus anthracis, Brucella spp., Burkholderia mallei, Francisella tularensis, Taylorella equigenitalis) are used to validate and optimize the composition of tools that ensures the highest-quality genomes. The correctness of the final polished assemblies is examined by comparison to publicly available Sanger sequences. The
resulting pipeline presents an intuitive tool to produce high-quality genome assemblies from whole-genome
sequencing data. The utility of ONT sequencing is demonstrated for different tasks in veterinary medicine
including serotyping of Salmonella (Thomas et al 2023) or genotyping of highly pathogenic zoonotic bacteria
(Linde et al 2023).
Deneke, Carlus. "MiLongA". Git: https://gitlab.com/bfr_bioinformatics/milonga. Accessed November 2023.
Petit, Robert A. "dragonflye". Git: https://github.com/rpetit3/dragonflye. Accessed November 2023.
Murigneux, Valentine, et al. "MicroPIPE: validating an end-to-end workflow for high-quality complete bacterial genome construction." BMC Genomics 22.1 (2021): 474.
ISO 23418:2022. Microbiology of the food chain. Whole genome sequencing for typing and genomic characterization of bacteria. General requirements and guidance.
Thomas, Christine, et al. "Oxford Nanopore Technologies—a valuable tool to generate whole-genome sequencing data for in silico serotyping and the detection of genetic markers in Salmonella." Frontiers in Veterinary Science 10 (2023): 1178922.
Linde, Jörg, et al. "Comparison of Illumina and Oxford Nanopore Technology for genome analysis of Francisella tularensis, Bacillus anthracis, and Brucella suis." BMC Genomics 24.1 (2023): 1-15.
11:50 Fast and Accurate Barcode Calling Based on k-mers
– by Riko Uphoff
DNA barcodes, which are sets of short unique DNA strings, are regularly used as tags in pooled sequencing
experiments to enable the identification of reads that originate from the same cell. In spatial transcriptomics,
for example, tissue samples are placed on arrays with surface-attached barcode primers, which make it possible to find
the spatial coordinates of the sequenced mRNA molecules on the studied tissue. A crucial preprocessing
step in such experiments is barcode calling, where for each read the corresponding barcode has to be
determined. This task is challenging because the original barcode set is corrupted by errors during synthesis
and sequencing. Modern techniques such as photolithographic microarray synthesis produce barcodes with
a comparatively high nucleotide error rate. A straightforward but naive approach to barcode calling is to search, for each read, for the most similar barcode.
For typical applications with ~10^6 barcodes and ~10^8 reads, an all-to-all exact distance
calculation is prohibitively expensive. Hence, the main challenge is to develop a scalable and highly efficient,
but still precise algorithm with a high recall rate for barcode calling. In a breakthrough paper, Press (2022)
designed a filtering approach leading to high throughput rates on modern GPUs.
Here we propose a new filtering technique based on precomputed k-mer lists, improving the state-of-the-art in several ways. We reduce the computational work for the filtering step by orders of magnitude, resulting
in higher throughput even on standard CPUs. Our approach is able to call about 10^8 reads per day on a
CPU. Additionally, we achieve slightly improved trade-offs between recall and precision.
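A toy sketch of the k-mer filtering idea: only barcodes that share a k-mer with the read are compared exactly. The Hamming distance, the value of k, and the example barcodes are illustrative and not the parameters of the actual method.

```python
# k-mer prefilter for barcode calling: exact comparison is restricted to candidates
# that share at least one k-mer with the read.
from collections import defaultdict

def build_kmer_index(barcodes, k=6):
    index = defaultdict(set)
    for i, bc in enumerate(barcodes):
        for p in range(len(bc) - k + 1):
            index[bc[p:p + k]].add(i)
    return index

def call_barcode(read, barcodes, index, k=6):
    candidates = set()
    for p in range(len(read) - k + 1):
        candidates |= index.get(read[p:p + k], set())   # filtering step
    if not candidates:
        return None
    # exact (here: Hamming) comparison only against the surviving candidates
    return min(candidates,
               key=lambda i: sum(a != b for a, b in zip(read, barcodes[i])))

barcodes = ["ACGTACGTACGT", "TTGACCATGGCA", "GGCATTACGTTA"]
index = build_kmer_index(barcodes)
print(call_barcode("ACGTACGAACGT", barcodes, index))    # -> 0 (closest barcode)
```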
12:10 Barcode Layout Optimization in Spatial Transcriptomics: Theory and Experiments
– by Antonia Schmidt
In the field of spatial transcriptomics, the spatial distribution of gene expression in tissue can be examined.
For this purpose, the transcripts are applied to an array with 1024×768 features (14 µm pitch),
ligated with barcodes and then sequenced. The transcripts are then assigned to their original location via the
barcodes. Unfortunately, the barcode synthesis on the array by photolithographic methods is error-prone,
with a suspected error rate of up to 20% per base, making automated assignment to the barcode library
challenging. Therefore, we have been searching for methods to decrease the error rate. Prior work suggests
that insertion errors arise during barcode synthesis by neighbor effects due to scattered and diffracted light.
This motivates the combinatorial optimization problem studied in this work. For a given set of barcodes, we
attempt to optimize the layout of the barcodes on the array by minimizing the dissimilarity of neighboring
barcodes.
We show that the problem is hard to solve and even hard to approximate: It is maxSNP-hard. An ILP
formulation theoretically allows the computation of optimal results, but it is only applicable for tiny instances.
Therefore, we have developed heuristic algorithms with the aim of computing near-optimal solutions for
instances of realistic size. The algorithms developed include a sorting-based algorithm, a greedy algorithm,
and a genetic algorithm. To assess the quality of the results, we compare the generated solutions with the
expected quality of a random layout and with lower bounds. A GPU implementation of the greedy algorithm
generates the most promising results in terms of both quality and runtime. Solutions to large-scale instances
with 1024×768 barcodes show a 35-40% improvement in quality compared to a random
solution and can be computed within approximately 15 seconds.
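A toy sketch of the greedy heuristic on a small grid, using Hamming distance as the neighbour dissimilarity; the real implementation targets the full 1024×768 array, runs on a GPU, and its objective and similarity measure may differ in detail.

```python
# Greedy barcode placement: fill the grid row by row, always choosing the remaining
# barcode most similar to its already-placed left and top neighbours.
import numpy as np

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def greedy_layout(barcodes, rows, cols):
    pool = list(barcodes)
    grid = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbours = [g for g in (grid[r][c - 1] if c else None,
                                      grid[r - 1][c] if r else None) if g]
            # cost of a candidate = summed dissimilarity to its placed neighbours
            best = min(range(len(pool)),
                       key=lambda i: sum(hamming(pool[i], n) for n in neighbours))
            grid[r][c] = pool.pop(best)
    return grid

rng = np.random.default_rng(0)
barcodes = ["".join(rng.choice(list("ACGT"), 8)) for _ in range(16)]
layout = greedy_layout(barcodes, 4, 4)
```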
12:30 Improved reconstruction of transcripts and coding sequences from RNA-seq data
– by Jan Grau
New genomes are being sequenced at an amazing pace. Having obtained an assembled genome
sequence, a pivotal requirement for many subsequent analyses is an accurate annotation of gene and
transcript models. Previously, we have developed GeMoMa, an algorithm for homology-based gene
prediction that uses protein sequence homology and the evolutionary conservation of exon-exon borders to
transfer gene models from well-annotated reference species to a newly sequenced target genome. Here, we
present a novel approach termed GeMoRNA for RNA-seq-based gene prediction that shall complement the
homology-based predictions of GeMoMa and, hence, focuses on the prediction of protein-coding genes.
In GeMoRNA, we start from genomic mappings of RNA-seq reads. We partition the genome into covered
regions, build a read graph with basepair resolution connecting positions (nodes) that are adjacent in
mapped reads, and identify connected components. For each connected component, we then merge
consecutive positions without alternative edges into a splicing graph. We use the splicing graph to
combinatorially enumerate candidate transcripts with a heuristic that shall prioritize high-coverage
transcripts. Candidate transcripts may still be chimeras of multiple transcripts. Hence, we combine coverage
information, simplistic CDS prediction and splice site orientation to split such chimeras into individual
transcripts. Resulting potential transcripts are quantified based on RNA-seq evidence in an EM-like
algorithm. Overlapping transcripts with sufficient coverage are finally merged to genes.
We benchmark GeMoRNA against the state-of-the-art tools StringTie, Scallop, and Cufflinks on a large
collection of 230 stranded and unstranded RNA-seq libraries of multiple species (A. thaliana, C. elegans, D.
melanogaster, M. musculus, O. sativa, S. cerevisiae, S. lycopersicum) using the respective reference
annotations as ground truth. For transcript F1 measure (gffcompare), we find that GeMoRNA yields higher
accuracy than Cufflinks and Scallop, while GeMoRNA performs better than StringTie for two thirds of the data
sets. We further compare performance on the level of CDS predictions, which have been in the focus of
GeMoRNA development. Here, we observe that GeMoRNA yields better predictions than all previous
approaches for almost all data sets, where the improvement is specifically pronounced for S. cerevisiae, C.
elegans and A. thaliana.
The combination of homology-based (GeMoMa) and RNA-seq-based (GeMoRNA) gene predictions may
further improve the completeness of (protein-coding) gene predictions.
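A toy sketch of the read-graph construction step, using networkx and hypothetical read coordinates; coverage weighting, the condensation into a splicing graph, and the later enumeration and quantification steps are omitted.

```python
# Read graph at basepair resolution: positions adjacent within a mapped read (including
# across splice junctions) are connected; connected components delimit candidate loci.
import networkx as nx

# each read is a list of aligned blocks (start, end), end exclusive; a gap between
# blocks corresponds to a splice junction
reads = [
    [(100, 150), (300, 360)],     # spliced read spanning an intron
    [(140, 200)],
    [(900, 980)],                 # read from a separate locus
]

G = nx.Graph()
for blocks in reads:
    positions = [p for start, end in blocks for p in range(start, end)]
    G.add_edges_from(zip(positions, positions[1:]))   # adjacency within the read

for component in nx.connected_components(G):
    locus = sorted(component)
    print(locus[0], locus[-1])    # rough extent of each candidate locus
```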
12:50 Lunch (Recommendations are marked in purple on the map)
14:15 Invited Talk: Genotyping structural variation: From simple to complex
– by Birte Kehr
15:00 EVEs in the Shadows: A Gap in Virus Research and Bioinformatics
– by Muriel Ritsch
Endogenous Viral Elements (EVEs) are remnants of viral genetic material integrated into the genome of
germline cells of the host. EVEs are pivotal in the ongoing 'arms race' between host organisms and viruses,
exhibiting transcription and translation capabilities. Some EVEs have antiviral functions or co-opted roles and
are associated with various diseases. EVEs can outnumber protein-coding genes and offer valuable insights
for paleovirology, enriching our comprehension of virus-host interactions throughout history. Despite their
significance, EVEs are often underestimated or ignored in the scientific community. We have explored the
data availability and accessibility of EVEs, identifying numerous challenges that currently make it impossible
to conduct meaningful analyses or perform meta-analyses with EVE data.
The availability of EVE data is hindered by a bias towards endogenous retroviruses (ERVs), which, due to
their disease associations and their high integration capability into host genomes, overshadow non-retroviral
EVEs (nrEVEs) and affect overall data accessibility. The intricate characteristics of EVEs, involving varied
integration patterns and sequence differences, further complicate the process of collecting data and may
hinder their integration into comprehensive databases. We discovered the absence of well-established EVE
identification tools, and in most virus-related tools, EVEs are often not considered or even mentioned.
Beyond data availability, data accessibility is equally critical, as inaccessible data is effectively as good as
nonexistent. We found that effective keyword searches for scientific papers or databases are lacking, and the
absence of standardized methods for presenting and storing EVE data presents significant hurdles.
Additionally, annotations for EVEs in host genomes are rare. Furthermore, we illustrated that the limited
availability of EVE sequences predominantly arises from the lack of dedicated EVE databases and the
challenge of ensuring easy retrieval of EVE sequences within general databases.
In this study, we explored the current state of EVE research, providing a concrete illustration of the
challenges that arise when data is unavailable and inaccessible. These challenges are not unique to the EVE
field and create opportunities for an interdisciplinary exchange of knowledge to find solutions collaboratively.
To address the issues we identified, we propose investing more in EVE identification tools and integrating
EVE data into existing databases, emphasizing the diversity and significance of EVEs in virus research. We
encourage researchers to ask themselves: "Do EVEs play a role in my research, and how can EVEs be integrated?"
15:20 Comprehensive Survey of Conserved RNA Secondary Structures in Full-Genome Alignment of Hepatitis C Virus
– by Sandra Triebel
Background
The Hepatitis C virus (HCV) genome is approximately 9.6 kilobases in length and encodes a polyprotein
precursor flanked by the 5′ and 3′ untranslated regions (UTRs). The construction of full-genome alignments
of viruses is challenging on the sequence level due to high genetic diversity, rapid mutation rates, and the
presence of insertions and deletions (indels) that complicate the alignment process. The HCV genome
harbors several RNA secondary structures essential for viral translation, replication, and assembly. One of
the most prominent RNA secondary structures in HCV is the internal ribosome entry site (IRES), located
within the 5′ UTR, mediating cap-independent translation initiation. Incorporating RNA secondary structures
in the alignment is even more challenging due to the additional complexity introduced by the folding and
variability of RNA structures, requiring specialized methods to incorporate both sequence and structural
information.
Methods
We clustered all complete HCV genomes from the BV-BRC database incorporating k-mer distributions,
followed by dimension reduction and clustering with HDBSCAN. We identified 57 representative genomes
and constructed a full-genome multiple sequence alignment (MSA), including computational prediction of
RNA secondary structures. The MSA process involved initial sequence-based alignment followed by
structure-guided refinement, based on conserved anchor regions. Manual curation was performed by HCV
experts to ensure accuracy and reliability of the alignment and structures.
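A condensed sketch of this clustering step with generic stand-ins (PCA for the dimension reduction, scikit-learn's HDBSCAN, random placeholder sequences); k and all parameters are illustrative rather than those used in the study.

```python
# k-mer profiles -> dimension reduction -> density-based clustering (illustrative).
from itertools import product
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN   # scikit-learn >= 1.3

def kmer_profile(seq, k=4):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = np.array([seq.count(km) for km in kmers], dtype=float)  # non-overlapping count
    return counts / max(counts.sum(), 1.0)

rng = np.random.default_rng(0)
genomes = ["".join(rng.choice(list("ACGT"), 9600)) for _ in range(20)]  # placeholder genomes

X = np.array([kmer_profile(g) for g in genomes])
X_red = PCA(n_components=5).fit_transform(X)
labels = HDBSCAN(min_cluster_size=3).fit_predict(X_red)   # -1 marks noise/outliers
```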
Results
We successfully tackled three key challenges for HCV: (i) the construction of a comprehensive full-genome
alignment, (ii) the integration of RNA secondary structure prediction (including alternative conformational
states of structures), and (iii) the in silico prediction of novel structures throughout the HCV genome. The
conserved RNA secondary structure motifs are included in the Rfam database (https://rfam.org). In addition,
a detailed analysis of certain structures revealed a relative shift in sequence and structure. This phenomenon
is referred to as incongruent evolution.
Conclusion
Our computational approach predicts RNA secondary structures across entire viral genomes, aiding in the
annotation of conserved RNA structures. The inclusion of HCV models in the Rfam database aligns with a
broader project creating a comprehensive collection of viral RNA structures and covariance models for
automated annotation. Further, we strive to automate the alignment construction and RNA secondary
structure prediction to analyze additional viral clades efficiently.
15:40 Identifying a rule for fluoroquinolone treatment of patients with community-acquired pneumonia (CAP) of moderate severity
– by Marcus Oswald
The role of fluoroquinolone therapy in community-acquired pneumonia (CAP) of moderate severity is a
matter of debate. Even though fluoroquinolones cover a broad spectrum of pathogens and, particularly, can
be very effective against atypical pathogens, they may come along with severe side effects, such as tendon
damage, arthralgia, or diverse neuropathies. They need to be prescribed with care and, so far, are only
recommended as a reserve treatment option.
We used machine learning based on aetiological and clinical parameters, which are available ex ante to
support a personalised decision for or against fluoroquinolones for the best clinical outcome of the individual
patient. We employed a cross-validation scheme based on a well-balanced selection of patients, using a propensity-matching procedure that we developed ourselves, applied to the admission data of more than 6,000
hospitalised patients with moderate severity (non-intensive care unit patients) from the observational,
prospective, multinational CAPNETZ study. We aimed to improve the primary outcome of 90-day survival.
We found that the efficacy of fluoroquinolones depends strongly on the time of year. When treated with fluoroquinolones within a time window starting at the seasonal peak of influenza infections and ending four weeks later, patients showed distinctly lower mortality compared to beta-lactam-treated patients. In turn, during the rest of the year, fluoroquinolones were inferior to beta-lactams.
Stratifying fluoroquinolone treatment in patients following this simple treatment rule may lead to considerably
reduced mortality in CAP. Long-term effects have not yet been assessed but also need to be taken into account before setting up a randomised controlled trial confirming our result.
Members of the CAPNETZ study group are: M. Dreher, C. Cornelissen (Aachen); W. Knüppel (Bad Arolsen);
D. Stolz (Basel); N. Suttorp, M. Witzenrath, P. Creutz, A. Mikolajewska (Berlin, Charité); T. Bauer, D. Krieger
(Berlin); W. Pankow, D. Thiemig (Berlin-Neukölln); B. Hauptmeier, S. Ewig, D. Wehde (Bochum); M. Prediger,
S. Schmager (Cottbus); M. Kolditz, B. Schulte-Hubbert, S. Langner (Dresden); W. Albrich (St Gallen); T.
Welte, J. Freise, G. Barten, O. Arenas Toro, M. Nawrocki, J. Naim, M. Witte, W. Kröner, T. Illig, N. Klopp
(Hannover); M. Kreuter, F. Herth, S. Hummler (Heidelberg); P. Ravn, A. Vestergaard-Jensen, G. Baunbaek-Knudsen (Hillerød); M. Pletz, C. Kroegel, J. Frosinski, J. Winning, B. Schleenvoigt (Jena); K. Dalhoff, J.
Rupp, R. Hörster, D. Drömann (Lübeck); G. Rohde, J. Drijkoningen, D. Braeken (Maastricht); H. Buschmann
(Paderborn); T. Schaberg, I. Hering (Rotenburg/Wümme); M. Panning (Freiburg); M. Wallner (Ulm)
16:00 VIRify: an integrated detection, annotation, and taxonomic classification pipeline using virus-specific protein profile hidden Markov models
– by Martin Hölzer
Introduction
Viruses are the most abundant biological entities on our planet. Some are relevant pathogens for public
health or agriculture; others play ecological roles critical for maintaining ecosystems. Most viruses are yet to
be cultured, so their identification and characterization depend solely on the analysis of DNA or RNA
obtained from the environment. Unlike cellular organisms, viruses also lack a universal genetic marker that
allows taxonomic profiling of an environmental viral community.
Objectives
To facilitate the robust detection of viruses from metagenomics or metatranscriptomics sequencing data, we
developed a computational pipeline called VIRify. Our pipeline is designed to provide a user-friendly and
accurate functional and taxonomic characterization of viral communities.
Materials & Methods
VIRify identifies viral contigs and prophages from metagenomic assemblies and annotates them using a
collection of viral profile hidden Markov models (HMMs). These include our manually curated profile HMMs
(ViPhOGs), which serve as specific taxonomic markers for many prokaryotic and eukaryotic viral taxa. We
tested VIRify on assemblies from two microbial mock communities, a large marine metagenomics study, and
human gut phages.
Results
VIRify efficiently detects and classifies viruses targeting bacteria or eukaryotic organisms, both in mock
microbial communities and in samples from the world’s oceans and the human gut microbiome, providing
taxonomic classifications from the genus to the family rank with high accuracy. Overall, we demonstrate that
VIRify is a novel and powerful resource that offers an enhanced capability to detect and taxonomically
classify a broad range of viral contigs. VIRify is user-friendly, requires minimal interaction with the command
line, and was developed with portability in mind. VIRify can enhance the exploration of viral diversity in
nature and support the detection of pathogenic viruses with pandemic potential. The pipeline is freely
available at https://github.com/EBI-Metagenomics/emg-viral-pipeline.
16:20 RNAswarm – A Modular Pipeline for Differential RRI Analysis in Influenza A Virus
– by Gabriel Lencioni Lovoate
The segmented RNA genome of Influenza A virus (IAV) is subject to intricate RNA-RNA interactions (RRIs),
which are crucial for understanding viral evolution, vaccine development, and pandemic response. Current
RNA proximity ligation methods like PARIS (psoralen analysis of RNA interactions and structures), SPLASH
(psoralen crosslinked, ligated, and selected hybrids), and 2CIMPL (dual crosslinking, immunoprecipitation,
and proximity ligation) can detect RRIs on a large scale [1-3], yet there is no bioinformatics pipeline for high-throughput comparison of RRI frequency across different strains or conditions. Addressing this gap, we
present RNAswarm, a versatile and reproducible Nextflow pipeline designed for differential RRI analysis,
which is already available on GitHub (https://github.com/rnajena/RNAswarm). This open-source tool
efficiently processes raw reads from proximity ligation experiments, identifies RRIs de novo, and employs DESeq2 for statistical analysis, allowing for strain-specific RRI site identification and the validation of known
interactions within IAV strains. RNAswarm implements de novo annotation of RRIs through the generation of
pairwise matrices of chimeric reads, followed by fitting Gaussian Mixture Models (GMMs) to pinpoint
interaction candidates. This approach solely uses the density count of cross-linked reads to find potential
interactions based on the fitted GMMs. The pipeline has demonstrated its utility in identifying evolutionarily
conserved interaction sites, as well as flexible sites that can potentially interact with multiple segments [4]. It provides a systematic approach to prioritize and compare RRIs, and has the potential to
uncover novel RRIs in viruses and other biological systems, since RNAswarm is agnostic to the origin of the
used references. The modularity and visualization capabilities of RNAswarm make it a useful resource for
researchers probing the viral RNA interactome.
1. Aw JGA et al. (2016) Molecular Cell. 62:603–17.
2. Lu Z et al. (2016) Cell. 165:1267–79.
3. Sage VL et al. (2020) Cell Reports. 31.
4. Jakob C et al. (2023) Nucleic Acids Research. gkad442.
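A minimal sketch of the GMM-based candidate detection on simulated chimeric-read coordinates; the component count, the coordinates, and the use of scikit-learn's GaussianMixture are illustrative stand-ins for the pipeline's implementation.

```python
# Chimeric read pairs as 2D points (position on segment A, position on segment B);
# Gaussian mixture components approximate candidate interaction sites.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
site1 = rng.normal([350, 1200], 15, size=(400, 2))   # simulated reads around one site
site2 = rng.normal([800, 150], 20, size=(300, 2))    # and around a second site
points = np.vstack([site1, site2])

gmm = GaussianMixture(n_components=2, covariance_type="full").fit(points)
for mean, weight in zip(gmm.means_, gmm.weights_):
    print(f"candidate interaction at ~({mean[0]:.0f}, {mean[1]:.0f}), weight {weight:.2f}")
```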
16:50 Closing remarks
17:00 End of conference