TeBactEn (text mining for bacterial enzymes)

TeBactEn is a tool designed to facilitate the retrieval, extraction and annotation of bacterial enzymatic reactions and pathways from the literature.

The system was developed in the context of the Microme project and contains three data collections: (a) the Microme set, a compilation of articles (abstracts and full-text articles) from the Microme database that had been used for manual annotation of bacterial pathways; (b) the PubMed set, which covers the abstracts in the entire PubMed database that are relevant to bacteria; and (c) the species set, a collection of abstracts and full-text articles relevant to a list of bacteria of special interest to the Microme project, which allows a more exhaustive extraction of enzymes for these particular bacteria.

For all three TeBactEn data collections, mentions of all species and taxonomic entities were recognized exhaustively.

Main features

TeBactEn covers all the main steps relevant for the automatic extraction and ranking of metabolism relations from the literature and allows enhanced access and annotation of related information:

  • Identification of metabolism-relevant articles.
  • Detection of the bio-entities involved in biochemical reactions:
    enzymes, compounds and organisms.
  • Extraction of weighted (ranked) relationships between these
    entities.
  • An interface to browse this information and to construct
    a manually curated database of metabolism reactions.
  • Hosting of user-entered annotations.
  • The option to normalize/ground bio-entity mentions
    to other knowledgebases like UniProt and ChEBI.

TeBactEn pipeline

Figure: TeBactEn literature mining system flow chart. The system consists of two main components: the information retrieval pipeline and the information/relation extraction modules. The first component works at the level of articles, while the second is concerned with the identification of semantic labels and their relations at the level of individual sentences.

The figure illustrates the general flow of the TeBactEn pipeline. For each of the bacteria of interest, the species identifier from the NCBI taxonomy was selected and expanded to include the child nodes of the species node, which correspond to strains and sub-strains. All names, aliases and synonyms were derived from this resource, and simple typographical variants were generated, together with abbreviated genus names for cases where the resulting shortened species name was not ambiguous. As an alternative to the Boolean query built from these names, we originally explored more sophisticated retrieval approaches. For instance, to ascertain whether extra keywords could be relevant for the retrieval step, we tested a supervised document classifier (MedlineRanker) and a system based on text similarity and clustering (PubClust). Moreover, we examined whether species-specific journals existed for some bacteria, so that they could be used as an additional component for query expansion within the article selection step.
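The name expansion step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the TeBactEn implementation: the function names, the hyphen-based variant rule, the use of a background list of species names to decide ambiguity, and the quoted-phrase OR query format are all hypothetical choices made for this example.

```python
def abbreviate_genus(name):
    """'Bacillus subtilis' -> 'B. subtilis'; returns None for single-word names."""
    parts = name.split()
    if len(parts) < 2:
        return None
    return f"{parts[0][0]}. {' '.join(parts[1:])}"


def typographic_variants(name):
    """Assumed variant rule: hyphens may also appear as spaces or be dropped."""
    variants = {name}
    if "-" in name:
        variants.add(name.replace("-", " "))
        variants.add(name.replace("-", ""))
    return variants


def expand_names(target_names, background_names):
    """Collect variants; add an abbreviated form only when it is unambiguous.

    background_names is a hypothetical list of other known species names
    used to detect abbreviations shared by more than one organism.
    """
    # Map each abbreviated form to the full names that would produce it.
    owners = {}
    for full in background_names:
        short = abbreviate_genus(full)
        if short:
            owners.setdefault(short, set()).add(full)

    variants = set()
    for name in target_names:
        variants |= typographic_variants(name)
        short = abbreviate_genus(name)
        # Keep the short form only if no other species maps to it.
        if short and owners.get(short, set()) <= {name}:
            variants |= typographic_variants(short)
    return sorted(variants)


def boolean_query(variants):
    """Join all variants into a single OR query of quoted phrases."""
    return " OR ".join(f'"{v}"' for v in variants)
```

With a background list that contains both "Escherichia coli" and "Entamoeba coli", the abbreviated form "E. coli" is skipped because it is ambiguous, whereas "B. subtilis" is kept for "Bacillus subtilis".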

Once the various document collections were assembled, we carried out document standardization and extraction of useful textual data. For PubMed records, this consisted of selecting the title and abstract sections, while for full-text articles, text conversion and preprocessing also involved extracting plain text from PDF and HTML documents. PDF files were converted to text using pdftotext and PDFlib. Plain text was extracted from HTML files using an in-house HTML parser optimized for handling online scientific articles. All documents were further processed with an in-house sentence boundary recognition script that worked reasonably well on both PubMed abstracts and full-text articles.
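The in-house HTML parser and sentence boundary script are not described in detail here, so the following is only a minimal stand-in built on the Python standard library. The class and function names, the list of block-level tags, and the splitting heuristic (split after sentence-final punctuation followed by a capital letter, but not after a single capital initial such as the "E." in "E. coli") are assumptions made for this sketch. The PDF conversion step is not reproduced.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of script and style tags."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "br", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")  # keep block elements separated

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return whitespace-normalized plain text from an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


# Split after ., ! or ? when followed by whitespace and a capital letter,
# unless the period belongs to a single capital initial (e.g. "E.").
_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Naive sentence splitter; a heuristic, not a trained model."""
    return [s for s in _BOUNDARY.split(text) if s]
```

A heuristic like this will miss some boundaries, for example a sentence that ends in a single capital letter, which is one reason a dedicated script tuned to abstracts and full-text articles is preferable in practice.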

The next step of the TeBactEn pipeline consisted of adding semantic labels to the text, namely detecting mentions of the bio-entities relevant for the extraction of bacterial metabolism and pathways. Within the TeBactEn article collections, we tried to recognize three types of entities, all of fundamental relevance for the annotation of bacterial metabolism: (1) species and taxonomic names, (2) protein and enzyme mentions, and (3) mentions of chemical compounds and drugs.
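To make the idea of semantic labelling concrete, the sketch below tags the three entity types in a sentence by simple dictionary lookup. This section does not state how the TeBactEn recognizers work, so dictionary matching is a stand-in technique assumed purely for illustration; the function name, the lexicon layout and the example terms are hypothetical.

```python
import re


def tag_entities(sentence, lexicon):
    """Return (start, end, entity_type, matched_text) tuples, sorted by offset.

    lexicon is an assumed structure: a dict mapping an entity type
    ("species", "enzyme", "compound") to a list of surface forms.
    """
    mentions = []
    for entity_type, terms in lexicon.items():
        for term in terms:
            # Match whole terms only, so "lactose" is not found inside
            # a longer token that merely contains it.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                mentions.append(
                    (match.start(), match.end(), entity_type, match.group(0))
                )
    return sorted(mentions)
```

Sentences in which a species, an enzyme and a compound are all tagged are natural candidates for the sentence-level relation extraction mentioned earlier.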