The figure illustrates the general workflow of the TeBactEn pipeline. For each bacterium of interest, the species taxonomy identifier was selected from the NCBI Taxonomy. This entry was then expanded by including the child nodes of the species node, which correspond to strains and sub-strains. All names, aliases and synonyms were derived from this resource, and simple typographical variants were generated together with abbreviated genus names for cases where the resulting shortened species name was not ambiguous. As an alternative to this Boolean query, we initially explored more sophisticated retrieval approaches. For instance, to ascertain whether some extra keywords could be relevant for the retrieval step, a supervised document classifier tool (
) and a system based on text similarity and clustering (
) were tested. Moreover, we examined whether species-specific journals existed for some bacteria, so that they could be used as an additional component for query expansion in the article selection step.
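The name expansion and Boolean query construction described above can be sketched as follows. This is a minimal illustration and not the TeBactEn implementation: the function names, the taxon labels (`species_a`, etc.) and the particular typographical variants are assumptions made for the example, and ambiguity is checked only against the toy name list supplied rather than the full NCBI Taxonomy.

```python
from collections import defaultdict


def abbreviate_genus(name):
    """Return e.g. 'P. putida' for 'Pseudomonas putida', or None if not applicable."""
    parts = name.split()
    if len(parts) < 2 or not parts[0][0].isupper():
        return None
    return f"{parts[0][0]}. {' '.join(parts[1:])}"


def typographical_variants(name):
    """Simple variants (assumed for illustration): hyphen/space swap, dot handling."""
    variants = {
        name,
        name.replace("-", " "),   # 'K-12'   -> 'K 12'
        name.replace(". ", "."),  # 'P. putida' -> 'P.putida'
        name.replace(".", ""),    # 'P. putida' -> 'P putida'
    }
    return {v for v in variants if v.strip()}


def build_name_lists(names_by_taxon):
    """Expand each taxon's names; keep an abbreviated form only if it is unambiguous."""
    # Record which taxa would share each abbreviated name.
    abbrev_owners = defaultdict(set)
    for taxon, names in names_by_taxon.items():
        for name in names:
            abbrev = abbreviate_genus(name)
            if abbrev:
                abbrev_owners[abbrev].add(taxon)

    expanded = {}
    for taxon, names in names_by_taxon.items():
        result = set()
        for name in names:
            result |= typographical_variants(name)
            abbrev = abbreviate_genus(name)
            # Skip abbreviations claimed by more than one taxon.
            if abbrev and len(abbrev_owners[abbrev]) == 1:
                result |= typographical_variants(abbrev)
        expanded[taxon] = result
    return expanded


def boolean_query(names):
    """Join quoted names with OR, in a deterministic order."""
    return " OR ".join(f'"{n}"' for n in sorted(names))


if __name__ == "__main__":
    # Hypothetical input: names gathered from a species node and its child nodes.
    names = {
        "species_a": ["Escherichia coli", "Escherichia coli K-12"],
        "species_b": ["Entamoeba coli"],
        "species_c": ["Pseudomonas putida"],
    }
    for taxon, variants in build_name_lists(names).items():
        print(taxon, "->", boolean_query(variants))
```

In this toy input, the shortened form "E. coli" is dropped because two taxa would share it, whereas "P. putida" is kept because only one taxon produces it.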
Once the various document collections were assembled, we carried out document standardization and extraction of useful textual data. For PubMed records, this consisted of selecting the title and abstract sections, while for full-text articles, text conversion and preprocessing also had to deal with extracting plain text from PDF and HTML documents. PDF files were converted to text using pdftotext and
. Plain text was extracted from HTML files using an in-house HTML parser optimized for handling online scientific articles. All documents were then processed with an in-house sentence boundary recognition script that worked reasonably well on both PubMed abstracts and full-text articles.
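To make the HTML text extraction and sentence boundary recognition steps concrete, the sketch below uses only the Python standard library. It is not the in-house parser or script referred to above, whose details are not given here; the class and function names, the list of block-level tags, the abbreviation list and the boundary heuristic are all assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "section"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # Separate block-level elements so adjacent paragraphs do not fuse.
            self._chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Assumed abbreviation list; a real script would need a much larger one.
ABBREVIATIONS = {"e.g.", "i.e.", "al.", "Fig.", "sp.", "spp.", "subsp."}


def split_sentences(text):
    """Split on ., ! or ? followed by whitespace and an uppercase letter or digit."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z0-9])", text):
        candidate = text[start:match.start() + 1]
        last_token = candidate.split()[-1].lstrip("([")
        # Do not split after known abbreviations or single-letter initials ("E.").
        if last_token in ABBREVIATIONS or re.fullmatch(r"[A-Z]\.", last_token):
            continue
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A heuristic of this kind avoids splitting inside abbreviated species names such as "E. coli" or after "Fig.", but it will still fail on cases it does not anticipate, which is why a hand-tuned script is usually needed for full-text articles.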
The next step of the TeBactEn pipeline consisted of adding semantic labels to the text, namely detecting mentions of bio-entities relevant for the extraction of bacterial metabolism and pathways. Within the TeBactEn article collections we tried to recognize three types of entities, all of fundamental relevance for the annotation of bacterial metabolism. These entities consisted of (1) species and taxonomic names, (2) protein and enzyme mentions, and (3) mentions of chemical compounds and drugs