Motivation: To predict the function of protein sequences in metagenomes, all common tools search reference databases for related sequences from which a functional annotation can be inferred. But many species found in metagenomic studies are not closely related to any organism with a well-annotated genome. As a result, the fraction of protein sequences in metagenomic data that cannot be annotated using this "vertical" information transfer is often as high as 65% to 90%. This is the major obstacle to progress.
Project: We want to develop a new paradigm for function prediction based on the transfer of contextual, "horizontal" information. Building on our MMseqs2 software for fast sequence and profile searches and sequence clustering, we will develop a very fast sequence search method that can find clusters of neighboring and co-transcribed genes. The basic idea of utilising genomic context is similar to that of well-known resources such as the STRING database. However, we are devising a novel statistical approach which, in combination with MMseqs2 and Linclust, will allow us to analyse huge numbers of genomes and metagenomes. Using iterative profile searches combined with horizontal information transfer, we will mine massive amounts of genomic and metagenomic data to learn functional modules of genes/proteins, which will subsequently be used for improved annotation. This novel approach promises to greatly accelerate the rate of biological and biotechnological discoveries by deep mining of metagenomic and genomic sequence data.