HEMDAG programmatic call and evaluation¶

Here we explain how to apply the ensemble algorithms of the HEMDAG family in both cross-validated and hold-out experiments.

HEMDAG can in principle boost the predictions of any flat learning method by reconciling the flat predictions with the topology of the underlying ontology. Hence, to run HEMDAG we need the following ingredients:

the label matrix M representing the protein annotations to functional terms;
the graph g representing the hierarchy of the functional terms;
the flat score matrix S representing a score or a probability that a gene/protein belongs to a given functional term;

To build the graph, the label matrix and the protein-protein interaction network you can use this pipeline. Instead, to obtain the flat score matrix you can use the shogun library or the caret package or any other software returning a score or a probability that a protein belongs to a functional term.

Note

HEMDAG is built upon flat predictions. HEMDAG corrects all the violations of the hierarchical relationships between ontology terms.

Note

To run the experiments shown below, make sure you have installed the following requirements:

HEMDAG >= 2.7.4

R >= 4.0.4

Ubuntu >= 16.04

HEMDAG Call Script¶

To call any hierarchical ensemble algorithm of the HEMDAG family on either a time-lapse hold-out or a cross-validated dataset you can execute the following script:

#!/usr/bin/Rscript

## load library
library(HEMDAG);
suppressPackageStartupMessages(library(graph)); ## silence biocgenerics mask messages...
library(optparse);

## command line arguments
## for a detailed description, please see the manual: https://cran.r-project.org/web/packages/HEMDAG/HEMDAG.pdf
optionList <- list(
    make_option(c("-o", "--organism"), type="character", default="7227_drome",
        help="organism name in the form <taxon>_<name> (def. 7227_drome)"),
    make_option(c("-d", "--domain"), type="character", default="mf",
        help="go domain. It can be: bp, mf or cc (def. mf)"),
    make_option(c("-e", "--exptype"), type="character", default="ho",
        help="type of dataset on which run HEMDAG. It can be: ho (hold-out) or cv (cross-validated) -- def. ho"),
    make_option(c("-f", "--flat"), type="character", default="svmlinear",
        help="flat classifier"),
    make_option(c("-p", "--positive"), type="character", default="descendants",
        help="positive nodes selection. It can be: children or descendants.
                Skip this parameter if only topdown strategy is applied (def. descendants)"),
    make_option(c("-b", "--bottomup"), type="character", default="tau",
        help="bottomup strategy. It can be: none, threshold.free, threshold, weighted.threshold.free, weighted.threshold or tau.
                If none only topdown strategy is applied (def. tau)"),
    make_option(c("-t", "--topdown"), type="character", default="gpav",
        help="topdown strategy. It can be: htd or gpav (def. gpav)"),
    make_option(c("-c", "--threshold"), type="character", default="seq(from=0.1, to=0.9, by=0.1)",
        help="threshold for the choice of positive nodes.
                It can be a fixed value or an array of values (def. seq(from=0.1, to=0.9, by=0.1))"),
    make_option(c("-w", "--weight"), type="character", default="0",
        help="weight for the choice of positive nodes. It can be a fixed value or an array of values (def. 0)"),
    make_option(c("-m", "--metric"), type="character", default="auprc",
        help="performance metric on which maximize the parametric ensemble algorithms. It can be: auprc or fmax (def. auprc)"),
    make_option(c("-r", "--round"), type="integer", default="3",
        help="number of rounding digits to be applied for choosing the best Fmax. To be used only if metric is set to fmax (def. 3)"),
    make_option(c("-s", "--seed"), type="integer", default="23",
        help="seed for the random generator to create folds (def. 23)"),
    make_option(c("-k", "--fold"), type="integer", default="5",
        help="number of folds for the cross validation (def. 5)"),
    make_option(c("-l", "--parallel"), type="logical", default=FALSE, action="store_true",
        help="should the sequential or parallel version of gpav be run?
                If flag -p is 'on' the gpav parallel version is run. NB: only gpav can be run in parallel (def. FALSE)"),
    make_option(c("-n", "--cores"), type="integer", default="1",
        help="number of cores to use for the parallel execution of gpav (def. 1)"),
    make_option(c("-z", "--norm"), type="logical", default=FALSE, action="store_true",
        help="should the flat score matrix be normalized? If flag -p is 'on' the input flat scores is normalized (def. FALSE)"),
    make_option(c("-y", "--normtype"), type="character", default="none",
        help="type of normalization. It can be maxnorm or qnorm (def. none)")
);

optParser <- OptionParser(option_list=optionList);
opt <- parse_args(optParser);

prefix    <- opt$organism;
organism  <- strsplit(prefix,"_")[[1]][2];
exptype   <- opt$exptype;
flat      <- opt$flat;
positive  <- opt$positive;
bottomup  <- opt$bottomup;
topdown   <- opt$topdown;
domain    <- opt$domain;
threshold <- eval(parse(text=opt$threshold));
weight    <- eval(parse(text=opt$weight));
metric    <- opt$metric;
round     <- opt$round;
seed      <- opt$seed;
kk        <- opt$fold;
parallel  <- opt$parallel;
cores     <- opt$cores;
norm      <- opt$norm;
normtype  <- opt$normtype;
if(normtype == "none")
    normtype <- NULL;

## hemdag algorithm to be displayed in output file name -> 18 iso/tpr-dag ensemble combinations + gpav + htd (tot 20 hemdag family)
if(positive=="children" && bottomup=="threshold.free" && topdown=="htd")
    hemdag.name <- "tprTF";
if(positive=="children" && bottomup=="threshold" && topdown=="htd")
    hemdag.name <- "tprT";
if(positive=="children" && bottomup=="weighted.threshold.free" && topdown=="htd")
    hemdag.name <- "tprW";
if(positive=="children" && bottomup=="weighted.threshold" && topdown=="htd")
    hemdag.name <- "tprwt";
if(positive=="descendants" && bottomup=="threshold.free" && topdown=="htd")
    hemdag.name <- "descensTF";
if(positive=="descendants" && bottomup=="threshold" && topdown=="htd")
    hemdag.name <- "descensT";
if(positive=="descendants" && bottomup=="weighted.threshold.free" && topdown=="htd")
    hemdag.name <- "descensW";
if(positive=="descendants" && bottomup=="weighted.threshold" && topdown=="htd")
    hemdag.name <- "descensWT";
if(positive=="descendants" && bottomup=="tau" && topdown=="htd")
    hemdag.name <- "descensTAU";
if(positive=="children" && bottomup=="threshold.free" && topdown=="gpav")
    hemdag.name <- "isotprTF";
if(positive=="children" && bottomup=="threshold" && topdown=="gpav")
    hemdag.name <- "isotprT";
if(positive=="children" && bottomup=="weighted.threshold.free" && topdown=="gpav")
    hemdag.name <- "isotprW";
if(positive=="children" && bottomup=="weighted.threshold" && topdown=="gpav")
    hemdag.name <- "isotprWT";
if(positive=="descendants" && bottomup=="threshold.free" && topdown=="gpav")
    hemdag.name <- "isodescensTF";
if(positive=="descendants" && bottomup=="threshold" && topdown=="gpav")
    hemdag.name <- "isodescensT";
if(positive=="descendants" && bottomup=="weighted.threshold.free" && topdown=="gpav")
    hemdag.name <- "isodescensW";
if(positive=="descendants" && bottomup=="weighted.threshold" && topdown=="gpav")
    hemdag.name <- "isodescensWT";
if(positive=="descendants" && bottomup=="tau" && topdown=="gpav")
    hemdag.name <- "isodescensTAU";
if(bottomup=="none" && topdown=="gpav")
    hemdag.name <- "gpav";
if(bottomup=="none" && topdown=="htd")
    hemdag.name <- "htd";

## I/O directories
data.dir <- paste0("../data/", exptype, "/");
res.dir  <- paste0("../res/",  exptype, "/");
if(!dir.exists(res.dir)){dir.create(res.dir, recursive=TRUE);}

## flat/ann/dag/testIndex files
files <- list.files(data.dir);
flat.file <- files[grep(paste0(organism, ".*", domain, ".scores.*", flat), files)];
ann.file  <- files[grep(paste0(domain,".ann"), files)];
dag.file  <- files[grep(paste0(domain,".dag"), files)];
if(exptype == "ho"){
    idx.file  <- files[grep(paste0(domain,".testindex"), files)];
    if(length(idx.file)==0)
        stop("no index file found\n");
}

## check if flat|ann|dag exists
if(length(flat.file)==0 || length(ann.file)==0 || length(dag.file)==0)
    stop("no flat|ann|dag file found\n");

## load data
S <- get(load(paste0(data.dir, flat.file)));
g <- get(load(paste0(data.dir, dag.file)));
ann <- get(load(paste0(data.dir, ann.file)));
if(exptype == "ho")
    testIndex <- get(load(paste0(data.dir, idx.file)));

## shrink graph g to terms of matrix S -- if number of nodes between g and S mismatch
root <- root.node(g);
nd <- colnames(S);
class.check <- ncol(S) != graph::numNodes(g);
if(class.check){
    root.check <- root %in% colnames(S);
    if(!root.check)
        nd <- c(root, nd);
    g <- build.subgraph(nd, g, edgemode="directed");
    ann <- ann[, colnames(S)];
}

## address case when (iso)descensW is called with a fixed value of w to enter in the right branch
if(bottomup=="weighted.threshold.free" && length(weight)==1)
    threshold <- 0;

## elapsed time
start.elapsed <- proc.time();

## HEMDAG calling
if(exptype == "ho"){
    if(bottomup=="none"){
        if(topdown=="gpav"){
            S.hier <- gpav.holdout(S=S, g=g, testIndex=testIndex, W=NULL, parallel=parallel,
                ncores=cores, norm=norm, norm.type=normtype);
        }else{
            S.hier <- htd.holdout(S=S, g=g, testIndex=testIndex, norm=norm, norm.type=normtype);
        }
    }else{ ## branch to call HEMDAG by tuning the parameters
        if(length(threshold)>1 || length(weight)>1){
            S.hier <- tpr.dag.holdout(S, g, ann=ann, testIndex=testIndex, norm=norm, norm.type=normtype,
                positive=positive, bottomup=bottomup, topdown=topdown, W=NULL, parallel=parallel, ncores=cores,
                threshold=threshold, weight=weight, kk=kk, seed=seed, metric=metric, n.round=round);
        }else{ ## branch to call HEMDAG with fixed the parameters
            ## add root node if it does not exist
            if(!(root %in% colnames(S))){
                max.score <- max(S);
                z <- rep(max.score,nrow(S));
                S <- cbind(z,S);
                colnames(S)[1] <- root;
            }
            ## normalization
            if(norm){
                S <- scores.normalization(norm.type=normtype, S);
                cat(normtype, "normalization done\n");
            }
            ## shrink S to test indexes
            S.test <- S[testIndex,];
            ## degenerate case when test set has just one row/example
            if(!is.matrix(S.test)){
                test.sample <- rownames(S)[testIndex];
                S.test <- matrix(S.test, ncol=length(S.test), dimnames=list(test.sample, names(S.test)));
            }
            ## tpr-dag correction
            S.hier <- tpr.dag(S.test, g, root=root, positive=positive, bottomup=bottomup, topdown=topdown,
                t=threshold, w=weight, W=NULL, parallel=parallel, ncores=cores);
            ## print chosen parameters
            if(bottomup=="weighted.threshold.free"){
                cat("fixed weight:", weight, "\n");
            }else if(bottomup=="weighted.threshold"){
                cat("fixed weight:", weight, "fixed threshold:", threshold, "\n");
            }else{
                cat("fixed threshold:", threshold, "\n");
            }
            cat("tpr-dag correction done\n");
        }
    }
}else{
    if(bottomup=="none"){
        if(topdown=="gpav"){
            S.hier <- gpav.vanilla(S=S, g=g, W=NULL, parallel=parallel, ncores=cores, norm=norm, norm.type=normtype);
        }else{
            S.hier <- htd.vanilla(S=S, g=g, norm=norm, norm.type=normtype);
        }
    }else{  ## branch to call HEMDAG by tuning the parameters
        if(length(threshold)>1 || length(weight)>1){
            S.hier <- tpr.dag.cv(S, g, ann=ann, norm=norm, norm.type=normtype, positive=positive, bottomup=bottomup,
                topdown=topdown, W=NULL, parallel=parallel, ncores=cores, threshold=threshold, weight=weight,
                kk=kk, seed=seed, metric=metric, n.round=round);
        }else{  ## branch to call HEMDAG with fixed parameters
            ## add root node if it does not exist
            if(!(root %in% colnames(S))){
                max.score <- max(S);
                z <- rep(max.score,nrow(S));
                S <- cbind(z,S);
                colnames(S)[1] <- root;
            }
            ## normalization
            if(norm){
                S <- scores.normalization(norm.type=normtype, S);
                cat(normtype, "normalization done\n");
            }
            S.hier <- tpr.dag(S, g, root=root, positive=positive, bottomup=bottomup, topdown=topdown,
                t=threshold, w=weight, W=NULL, parallel=parallel, ncores=cores);
            ## print chosen parameters
            if(bottomup=="weighted.threshold.free"){
                cat("weight: ", weight, "\n");
            }else if(bottomup=="weighted.threshold"){
                cat("weight: ", weight, "threshold: ", threshold, "\n");
            }else{
                cat("threshold: ", threshold, "\n");
            }
            cat("tpr-dag correction done\n");
        }
    }
}

stop.elapsed <- proc.time() - start.elapsed;
timing.s <- stop.elapsed["elapsed"];
timing.m <- round(timing.s/(60),4);
timing.h <- round(timing.m/(60),4);
cat(hemdag.name, "running time:", timing.s["elapsed"], "(seconds)", "|", timing.m["elapsed"], "(minutes)", "|" , timing.h["elapsed"], "(hours)", "\n\n");

## store results
## outname
fname <- unlist(strsplit(flat.file, split="[.,_]"));
outname <- paste0(fname[-((length(fname)-1):length(fname))], collapse="_");
if(norm==TRUE && !(is.null(normtype)))
    outname <- paste0(outname,"_",normtype);
if(exptype == "ho"){
    save(S.hier, file=paste0(res.dir, outname, "_", hemdag.name, "_holdout.rda"), compress=TRUE);
}else{
    save(S.hier, file=paste0(res.dir, outname, "_", hemdag.name, ".rda"), compress=TRUE);
}

You can download the script as follow:

mkdir -p ~/hemdag/script/
cd ~/hemdag/script/
wget -nc https://raw.githubusercontent.com/marconotaro/hemdag/master/docs/playground/script/hemdag-call.R

Before executing the script be sure to have correctly installed the latest version of the HEMDAG package (and all its dependencies – see Installation) and the package optparse.

Note

The output hierarchical score matrix of the called HEMDAG algorithm (whose name is saved in the output .rda file name) is stored in the folder ~/hemdag/res/(ho|cv) depending on whether you chose to execute HEMDAG on either hold-out (ho) or cross-validated (cv) datasets. The HEMDAG elapsed time is printed on the shell.
By default, if no inputs parameters are specified in hemdag-call.R, the script executes the isodescensTAU algorithm on the hold-out dataset by tuning the parameter tau on the basis of AUPRC.
The tuning of the hyper-parameters can take from few minutes up to few hours depending on the size of the dataset and on the adopted evaluation metric (Fmax is slower than AUPRC).

Arguments Explanation¶

For the usage of the script, type in the shell under the ~/hemdag/script/ folder:

Rscript hemdag-call.R -h

For a detailed description of the input arguments positive, bottomup, topdown, threshold, weight, metric, round, seed, fold, parallel, cores, norm, normtype, please refer to the description of the input variables of the functions (gpav|htd|tpr.dag).(holdout|cv) in the HEMDAG reference manual.

Parametric-free arguments¶

To call a parametric-free HEMDAG algorithm the main required arguments are:

-b (--bottomup) → none
-t (--topdown) → gpav|htd

Note

GPAV can also be run in parallel simply by using the flag -l (--parallel) and by setting the number of cores -n (--cores).

Parametric arguments¶

To call a parametric HEMDAG algorithm the main required arguments are:

-b (--bottomup) → threshold.free|threshold|weighted.threshold.free|weighted.threshold|tau
-t (--topdown) → gpav|htd
-c (--cut-off) → 0.5|"seq(from=0.1, to=0.9, by=0.1)". Note the use of double quotes for the range of thresholds
-w (--weight) → 0.5|"seq(from=0.1, to=0.9, by=0.1)". Note the use of double quotes for the range of thresholds

If a range of thresholds (or weights) is selected, the hyper-parameters are tuned on the basis of imbalance-aware performance metrics estimated on the training data – -m (--metric) → auprc|fmax. By default the number of folds -k (--fold) is set to 5 and the seed -s (--seed) for the random generator is set to 23. Furthermore, if -m (--metric) → fmax the parameter -r (--round) can be used to select the number of rounding digits to be applied for choosing the best Fmax (by default is set to 3).

Additional arguments¶

The following arguments are dataset-specific:

-o (--organism) specifies the organism name (in the form <taxon>_<name>);
-d (--domain) specifies the GO domain: bp (biological process), mf (molecular function), cc(cellular component);
-e (--exptype) specifies the type of dataset where running HEMDAG: ho (holdout) or cv (cross-validated);
-f (--flat) specifies the name of the flat classifier. In case the flat learning method returns a score and not a probability, the flat score matrix must be normalized before running HEMDAG. On the contrary, if the flat classifier already returns a probability there is no needed to normalize the flat score matrix, since the flat scores can be directly compared with the hierarchical ones. To normalize the flat score matrix we must “activate” the flag -z (--norm) (by default the flag -z is deactivate) and we need to choose a type of normalization (-y (--normtype) → maxnorm|qnorm). The name of the chosen normalization is stored in the rda output file name.

Time-lapse hold-out experiments¶

Here, to show how to use HEMDAG in time-lapse hold-out experiments, we use a pre-built dataset of the organism Drosophila melanogaster (DROME) and, for simplicity, we consider the annotations of the GO domain molecular function (MF). To build the dataset we used the annotations of an old GO release (December 2017) as training set and the annotations of a more recent GO release (June 2020) as test set. The graph and the annotation matrix was built by adopting the pipeline. The flat score matrix was obtained by using the R interface of the machine learning library LiblineaR with the default parameter settings. For further details on the dataset, please refer to HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction (submitted to Bioinformatics).

Download Data¶

All the required .rda files can be downloaded by using the following commands, by exploiting the beauty and power of the non-greedy positive lookbehind regex 😉:

mkdir -p ~/hemdag/data/ho/
cd ~/hemdag/data/ho/
curl -Ss https://github.com/marconotaro/hemdag/tree/master/docs/playground/data/ho |  grep -oP '(?<=href=").*?(?=\">)' | grep '.rda$' | perl -pe 's/blob\///' | perl -pe 's/^/https\:\/\/raw.githubusercontent.com/' | wget -nc -i -

With the command above, we download the following datasets:

7227_drome_go_mf_ann_20dec17_16jun20.rda: the annotation matrix;
7227_drome_go_mf_dag_20dec17_16jun20.rda: the graph;
7227_drome_go_mf_testindex_20dec17_16jun20.rda: the indexes of the examples of the test set;
7227_drome_go_mf_scores_svmlinear_holdout.rda: the flat score matrix;

Programmatic Call¶

Below we show some examples of how to call HEMDAG in time-lapse hold-out experiments, but we leave the user the freedom to experiment with any another ensemble algorithm of the HEMDAG family.

Note

the hemdag-call.R script must be called in ~/hemdag/script/;
for the examples shown below, the tuning of hyper-parameters takes few minutes;
the output HEMDAG score matrices are stored in ~/hemdag/res/ho/.

GPAV call (parallel version):

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -b none -t gpav -l -n 12

isotprTF call:

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -p children -b threshold.free -t gpav -l -n 12

isotprW call:

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -p children -b weighted.threshold.free -t gpav -w "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

isodescensTF call:

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -p descendants -b threshold.free -t gpav -l -n 12

isodescensW call:

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -p descendants -b weighted.threshold.free -t gpav -w "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

isodescensTAU call:

Rscript hemdag-call.R -o 7227_drome -d mf -e ho -f svmlinear -p descendants -b tau -t gpav -c "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

Check Hierarchical Constraints¶

All the HEMDAG score matrices respect the hierarchical constraints imposed by the underlying ontology. The script below checks that all the 6 HEMDAG matrices obtained with the commands shown above, do not violate the between-term relationships in the GO MF hierarchy. For further details refer to Check Hierarchical Constraints.

#!/usr/bin/Rscript

suppressPackageStartupMessages(library(HEMDAG));
library(optparse);

optionList <- list(
    make_option(c("-o", "--organism"), type="character", default="7227_drome",
        help="organism name in the form <taxon>_<name> (def 7227_drome)"),
    make_option(c("-d", "--domain"),  type="character", default="mf",
        help="gene ontology domain -- bp, mf, cc (def. mf)"),
     make_option(c("-e", "--exptype"), type="character", default="ho",
        help="type of dataset on which run HEMDAG. It can be: ho (hold-out) or cv (cross-validated) -- def. ho"),
    make_option(c("-f", "--flat"), type="character", default="svmlinear",
        help="flat classifier (def. svmlinear)"),
    make_option(c("-a", "--algorithm"), type="character", default="isodescensTAU",
        help="hierarchical correction algorithm (def. isodescensTAU)")
);

optParser <- OptionParser(option_list=optionList);
opt <- parse_args(optParser);

## setting input argument
prefix    <- opt$organism;
taxon     <- strsplit(prefix,"_")[[1]][1];
organism  <- strsplit(prefix,"_")[[1]][2];
domain    <- opt$domain;
exptype   <- opt$exptype;
flat      <- opt$flat;
algorithm <- opt$algorithm;

## I/O
data.dir <- paste0("../data/", exptype, "/");
res.dir  <- paste0("../res/",  exptype, "/");
data.files <- list.files(data.dir);
res.files  <- list.files(res.dir);
if(!dir.exists(res.dir)){dir.create(res.dir, recursive=TRUE);}

## load data
dag.file  <- data.files[grep(paste0(domain,".dag"), data.files)];
hier.file <- res.files[grep(paste0(organism, ".*", domain, ".scores.*", flat, ".", algorithm), res.files)];

## check if flat|ann|dag exists
if(length(dag.file)==0 || length(hier.file)==0)
    stop("no dag|hier file found\n");

## check constraints violation
g <- get(load(paste0(data.dir, dag.file)));
S.hier <- get(load(paste0(res.dir, hier.file)));
root <- root.node(g);
g <- build.subgraph(colnames(S.hier), g);
check <- check.hierarchy(S.hier, g, root);
if(check$status=="OK"){
    cat(taxon, organism, domain, paste0(flat,'+',algorithm), "check passed :)","\n");
}else{
    cat(taxon, organism, domain, paste0(flat,'+',algorithm), "check failed :(","\n");
}

To download and use the hierarchical constraints script:

## download
mkdir -p ~/hemdag/script/
cd ~/hemdag/script/
wget -nc https://raw.githubusercontent.com/marconotaro/hemdag/master/docs/playground/script/hemdag-checker.R

## call
Rscript hemdag-checker.R -o 7227_drome -d mf -e cv -f ranger -a isodescensTAU

You can call hemdag-checker.R by looping through more hierarchical ensemble methods:

algotihms=("gpav" "isotprTF" "isotprW" "isodescensTF" "isodescensW" "isodescensTAU")

for ((i=0; i<${#algotihms[@]}; i++)); do
    Rscript hemdag-checker.R -o 7227_drome -d mf -e ho -f svmlinear -a ${algotihms[$i]}
done;

## example of stdout
7227 drome mf ranger+gpav check passed :)
7227 drome mf ranger+isotprTF check passed :)
7227 drome mf ranger+isotprW check passed :)
7227 drome mf ranger+isodescensTF check passed :)
7227 drome mf ranger+isodescensW check passed :)
7227 drome mf ranger+isodescensTAU check passed :)

Of course, you can loop hemdag-checker.R also through the arguments -o, -d, -e, -f according with the values of your interest.

Evaluation¶

To evaluate the generalization performance of HEDMAG in the time-lapse hold-out experiments performed above, you can use the term-centric and/or protein-centric evaluation metrics provided by the HEMDAG package itself. For further details on the implemented performance metric refer to section Performance Evaluation of the HEMDAG tutorial.

#!/usr/bin/Rscript

start.time <- proc.time();

library(HEMDAG);
library(optparse);

optionList <- list(
    make_option(c("-o", "--organism"), type="character", default="7227_drome",
        help="organism name in the form <taxon>_<name> (def 7227_drome)"),
    make_option(c("-d", "--domain"),  type="character", default="mf",
        help="gene ontology domain -- bp, mf, cc (def. mf)"),
     make_option(c("-e", "--exptype"), type="character", default="ho",
        help="type of dataset on which run HEMDAG. It can be: ho (hold-out) or cv (cross-validated) -- def. ho"),
    make_option(c("-f", "--flat"), type="character", default="svmlinear",
        help="flat classifier (def. svmlinear)"),
    make_option(c("-a", "--algorithm"), type="character", default="isodescensTAU",
        help="hierarchical correction algorithm (def. isodescensTAU)")
);

optParser <- OptionParser(option_list=optionList);
opt <- parse_args(optParser);

## setting input argument
prefix    <- opt$organism;
taxon     <- strsplit(prefix,"_")[[1]][1];
organism  <- strsplit(prefix,"_")[[1]][2];
domain    <- opt$domain;
exptype   <- opt$exptype;
flat      <- opt$flat;
algorithm <- opt$algorithm;

## I/O
data.dir <- paste0("../data/", exptype, "/");
res.dir  <- paste0("../res/",  exptype, "/");
if(!dir.exists(res.dir)){dir.create(res.dir, recursive=TRUE);}

## flat/ann/dag/testIndex file
files <- list.files(data.dir);
flat.file <- files[grep(paste0(organism, ".*", domain, ".scores.*", flat), files)];
ann.file  <- files[grep(paste0(domain,".ann"), files)];
dag.file  <- files[grep(paste0(domain,".dag"), files)];
## check if flat|ann|dag exists
if(length(flat.file)==0 || length(ann.file)==0 || length(dag.file)==0)
    stop("no flat|ann|dag file found\n");

if(exptype == "ho"){
    idx.file  <- files[grep(paste0(domain,".testindex"), files)];
    if(length(idx.file)==0)
        stop("no idx file found\n");
}

## hier file
files <- list.files(res.dir);
hier.file <- files[grep(paste0(organism, ".*", domain, ".scores.*", flat, ".", algorithm), files)];

## load data
S <- get(load(paste0(data.dir, flat.file)));
S.hier <- get(load(paste0(res.dir, hier.file)));
g <- get(load(paste0(data.dir, dag.file)));
ann <- get(load(paste0(data.dir, ann.file)));
if(exptype == "ho")
    testIndex <- get(load(paste0(data.dir, idx.file)));

## if number of nodes between g and S mismatch -> shrink graph g to terms of matrix S
## eg. during flat learning you removed from S all those terms having less than N annotations
root <- root.node(g);
nd <- colnames(S);
class.check <- ncol(S) != graph::numNodes(g);
if(class.check){
    root.check <- root %in% colnames(S);
    if(!root.check)
        nd <- c(root, nd);
    g <- build.subgraph(nd, g, edgemode="directed");
    ann <- ann[, colnames(S)];
}

## remove root node S and S.hier score matrix (if any)
root <- root.node(g);
if(root %in% colnames(S)){
    S <- S[,-which(colnames(S)==root)];
    cat("root node removed from flat score matrix\n");
}
if(root %in% colnames(S.hier)){
    S.hier <- S.hier[,-which(colnames(S.hier)==root)];
    cat("root node removed from hierarchical score matrix\n");
}
## remove root node from annotation matrix (if any)
if(root %in% colnames(ann)){
    ann <- ann[,-which(colnames(ann)==root)];
    cat("root node removed from annotation matrix\n");
}

## shrink S to testIndex
if(exptype == "ho"){
    S <- S[testIndex, ];
    ann <- ann[testIndex, colnames(S)];
}

## compute flat perf
auc.flat  <- auroc.single.over.classes(target=ann, predicted=S, folds=NULL, seed=NULL);
prc.flat  <- auprc.single.over.classes(target=ann, predicted=S, folds=NULL, seed=NULL);
fmax.flat <- compute.fmax(target=ann, predicted=S, n.round=3, b.per.example=TRUE, folds=NULL, seed=NULL, verbose=FALSE);
pxr.flat  <- precision.at.given.recall.levels.over.classes(target=ann, predicted=S, folds=NULL, seed=NULL, recall.levels=seq(from=0.1, to=1, by=0.1));
cat(taxon, organism, domain, flat, algorithm, "flat performance done\n");

## compute hier perf
auc.hier  <- auroc.single.over.classes(target=ann, predicted=S.hier, folds=NULL, seed=NULL);
prc.hier  <- auprc.single.over.classes(target=ann, predicted=S.hier, folds=NULL, seed=NULL);
fmax.hier <- compute.fmax(target=ann, predicted=S.hier, n.round=3, b.per.example=TRUE, folds=NULL, seed=NULL, verbose=FALSE);
pxr.hier  <- precision.at.given.recall.levels.over.classes(target=ann, predicted=S.hier, folds=NULL, seed=NULL, recall.levels=seq(from=0.1, to=1, by=0.1));
cat(taxon, organism, domain, flat, algorithm, "hierarchical performance done\n");

## storing
outname <- gsub("scores", "perfmeas", hier.file)
save(auc.flat, prc.flat, fmax.flat, pxr.flat, auc.hier, prc.hier, fmax.hier, pxr.hier, file=paste0(res.dir, outname), compress=TRUE);

## timing
timing.s <- proc.time() - start.time;
timing.m <- round(timing.s/(60),4);
timing.h <- round(timing.m/(60),4);
cat("\n");
cat("elapsed time:", timing.s["elapsed"], "(seconds)", "|", timing.m["elapsed"], "(minutes)", "|" , timing.h["elapsed"], "(hours)", "\n\n");

To download and use the HEMDAG evaluation script:

## download
mkdir -p ~/hemdag/script/
cd ~/hemdag/script/
wget -nc https://raw.githubusercontent.com/marconotaro/hemdag/master/docs/playground/script/hemdag-eval.R

## call
Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a gpav

Chunk evaluation (optional)¶

The above R call evaluates the performance of an HEMDAG algorithm just on a single dataset. Since HEMDAG can be virtually applied on top of any flat classifier, the Perl script below generates “chunks” of HEMDAG evaluation calls.

Note

The parameter regulating the number of chunk is $m, set by default to 12. Increase (resp. decrease) this value to rise (reduce) the number of HEMDAG evaluation calls to be executed in parallel.

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Long 'HelpMessage';

GetOptions(
    "org=s"  => \(my @orgs=""),
    "flat=s" => \(my @flats=""),
    "alg=s"  => \(my @algs=""),
    "dmn=s"  => \(my @onts=""),
    'exp=s'  => \(my $exp='ho'),
    'chk=i'  => \(my $chk='12'),
    'help'   => sub {HelpMessage(0)},
) or HelpMessage(1);

@orgs=  split(/,/,join(',',@orgs));
@flats= split(/,/,join(',',@flats));
@algs=  split(/,/,join(',',@algs));
@onts=  split(/,/,join(',',@onts));

print "#!/bin/sh\n\n";
print "tot_start=\$(date +%s)\n\n";

my $k=0;  ## cpu number on which binding a task
foreach my $org (@orgs){
    foreach my $flat (@flats){
        foreach my $alg (@algs){
            foreach my $ont (@onts){
                $k++;
                my $cpu= $k-1;
                print "taskset -c $cpu Rscript hemdag-eval.R -o $org -d $ont -e $exp -f $flat -a $alg > $org"."_go_"."$ont"."_"."$flat"."_"."$alg"."_perfmeas.out 2> /dev/null &\n";
                if($k % $chk ==0){
                    print "\n";
                    print "wait\n";
                    print "\n";
                    $k=0;
                } 
            }
        }
    }
}

print "\n";
print "tot_end=\$(date +%s)\n";
print "tot_elapsed_s=\$((tot_end-tot_start))\n";
print "tot_elapsed_m=\$((tot_elapsed_s/60))\n";
print "tot_elapsed_h=\$((tot_elapsed_m/60))\n";
print "printf \"grand total elapsed time:\t\$((tot_elapsed_s)) SECONDS\t\$((tot_elapsed_m)) MINUTES\t\$((tot_elapsed_h)) HOURS\"\n";
print "echo\n\n";

print "echo \"compute performance done\"\n\n";

__END__

=pod

=head1 NAME

call-hemdag-chunk - generate HEMDAG evaluation calls chunks

=head1 SYNOPSIS

    --org,-o    organism list. The list can have one or more elements. The elements must be comma separated (7227_drome,9031_chick,..)
    --flt,-f    flat classifier list. The list can have one or more elements. The elements must be comma separated (svmlinear,ranger,..)
    --alg,-a    hierarchical correction algorithm list. The list can have one or more elements. The elements must be comma separated (gpav,isodescensTF,isodescensTAU,..)
    --dmn,-d    gene ontology domain list. The list can have one or more elements. The elements must be comma separated (bp,mf,cc)
    --exp,-e    type of dataset on which evaluate HEMDAG. It can be: ho or cv (def. ho)
    --chk,-c    number of parallel evaluations to be computed before going to the next block (def 12)
    --help,-h   print this help

=cut

To download and use the Perl script that generates “chunks” of HEMDAG evaluation calls:

## download the perl script
mkdir -p ~/hemdag/script/
cd ~/hemdag/script/
wget -nc https://raw.githubusercontent.com/marconotaro/hemdag/master/docs/playground/script/hemdag-chunk.pl

## generate chunk evaluation calls
perl hemdag-chunk.pl -o 7227_drome -d mf -f svmlinear -a gpav,isotprTF,isotprW,isodescensTF,isodescensW,isodescensTAU -e ho -c 6 > hemdag-ho-eval.sh

## evaluate HEMDAG in chunks
bash hemdag-ho-eval.sh > out &

You can generate the calls that you wish, simply by extending the arguments of the arrays @orgs, @flats, @algs, @onts with the wanted values. For instance, by setting -d bp,mf,cc, the Perl script hemdag-chunk.pl returns 2 chunks of evaluation calls, the first made of 12 calls and the second one of 6 calls:

## call
perl hemdag-chunk.pl -o 7227_drome -d bp,mf,cc -f svmlinear -a gpav,isotprTF,isotprW,isodescensTF,isodescensW,isodescensTAU -e ho -c 12

## stdout

#!/bin/sh

tot_start=$(date +%s)

taskset -c 0 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a gpav > 7227_drome_go_bp_svmlinear_gpav_perfmeas.out 2> /dev/null &
taskset -c 1 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a gpav > 7227_drome_go_mf_svmlinear_gpav_perfmeas.out 2> /dev/null &
taskset -c 2 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a gpav > 7227_drome_go_cc_svmlinear_gpav_perfmeas.out 2> /dev/null &
taskset -c 3 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a isotprTF > 7227_drome_go_bp_svmlinear_isotprTF_perfmeas.out 2> /dev/null &
taskset -c 4 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a isotprTF > 7227_drome_go_mf_svmlinear_isotprTF_perfmeas.out 2> /dev/null &
taskset -c 5 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a isotprTF > 7227_drome_go_cc_svmlinear_isotprTF_perfmeas.out 2> /dev/null &
taskset -c 6 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a isotprW > 7227_drome_go_bp_svmlinear_isotprW_perfmeas.out 2> /dev/null &
taskset -c 7 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a isotprW > 7227_drome_go_mf_svmlinear_isotprW_perfmeas.out 2> /dev/null &
taskset -c 8 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a isotprW > 7227_drome_go_cc_svmlinear_isotprW_perfmeas.out 2> /dev/null &
taskset -c 9 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a isodescensTF > 7227_drome_go_bp_svmlinear_isodescensTF_perfmeas.out 2> /dev/null &
taskset -c 10 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a isodescensTF > 7227_drome_go_mf_svmlinear_isodescensTF_perfmeas.out 2> /dev/null &
taskset -c 11 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a isodescensTF > 7227_drome_go_cc_svmlinear_isodescensTF_perfmeas.out 2> /dev/null &

wait

taskset -c 0 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a isodescensW > 7227_drome_go_bp_svmlinear_isodescensW_perfmeas.out 2> /dev/null &
taskset -c 1 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a isodescensW > 7227_drome_go_mf_svmlinear_isodescensW_perfmeas.out 2> /dev/null &
taskset -c 2 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a isodescensW > 7227_drome_go_cc_svmlinear_isodescensW_perfmeas.out 2> /dev/null &
taskset -c 3 Rscript hemdag-eval.R -o 7227_drome -d bp -e ho -f svmlinear -a isodescensTAU > 7227_drome_go_bp_svmlinear_isodescensTAU_perfmeas.out 2> /dev/null &
taskset -c 4 Rscript hemdag-eval.R -o 7227_drome -d mf -e ho -f svmlinear -a isodescensTAU > 7227_drome_go_mf_svmlinear_isodescensTAU_perfmeas.out 2> /dev/null &
taskset -c 5 Rscript hemdag-eval.R -o 7227_drome -d cc -e ho -f svmlinear -a isodescensTAU > 7227_drome_go_cc_svmlinear_isodescensTAU_perfmeas.out 2> /dev/null &

tot_end=$(date +%s)
tot_elapsed_s=$((tot_end-tot_start))
tot_elapsed_m=$((tot_elapsed_s/60))
tot_elapsed_h=$((tot_elapsed_m/60))
printf "grand total elapsed time:   $((tot_elapsed_s)) SECONDS  $((tot_elapsed_m)) MINUTES  $((tot_elapsed_h)) HOURS"
echo

echo "compute performance done"

Cross-validated experiments¶

Here, to show how to use HEMDAG in cross-validated experiments, we use a pre-built dataset of the organism Drosophila melanogaster (DROME) that covers the annotations of the GO domain molecular function (MF). The graph and protein-GO term associations belong to the GO release of December 2017. The graph and the annotation matrix was built by adopting the following pipeline. The flat score matrix was obtained by using the random forest as flat learning method (model ranger in the R library caret package with the default parameter settings). For further details on the dataset, please refer to HEMDAG: a family of modular and scalable hierarchical ensemble methods to improve Gene Ontology term prediction (submitted to Bioinformatics).

Download Data¶

All the required .rda files can be downloaded by using the following commands:

mkdir -p ~/hemdag/data/cv/
cd ~/hemdag/data/cv/
curl -Ss https://github.com/marconotaro/hemdag/tree/master/docs/playground/data/cv |  grep -oP '(?<=href=").*?(?=\">)' | grep '.rda$' | perl -pe 's/blob\///' | perl -pe 's/^/https\:\/\/raw.githubusercontent.com/' | wget -nc -i -

Note

Note the change of the last directory from ho/ to cv/ compared to the data downloaded in the Time-lapse hold-out experiments.

With the command above, we download the following datasets:

7227_drome_go_mf_ann_20dec17.rda: the annotation matrix;
7227_drome_go_mf_dag_20dec17.rda: the graph;
7227_drome_go_mf_scores_pearson_100_feature_ranger_5fcv.rda: the flat score matrix;

Programmatic Call¶

To execute any HEMDAG algorithm on cross-validated datasets, basically you must replace -e ho with -e cv in the various calls shown in section Programmatic Call for the time-split hold-out experiments.

Note

the hemdag-call.R script must be called in ~/hemdag/script/;
for the examples shown below the tuning of hyper-parameters takes around one hour;
the flat classifier adopted for the cross-validated experiments performed here is not the svm (-f svmlinear), but the random forest (-f ranger);
the output HEMDAG score matrices are stored in ~/hemdag/res/cv/ (note the shift of the last directory);

For instance, to call the 6 HEMDAG on cross-validated datasets just type:

## GPAV
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -b none -t gpav -l -n 12

## isotprTF
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -p children -b threshold.free -t gpav -l -n 12

## isotprW
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -p children -b weighted.threshold.free -t gpav -w "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

## isodescensTF
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -p descendants -b threshold.free -t gpav -l -n 12

## isodescensW
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -p descendants -b weighted.threshold.free -t gpav -w "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

## isodescensTAU
Rscript hemdag-call.R -o 7227_drome -d mf -e cv -f ranger -p descendants -b tau -t gpav -c "seq(from=0.1, to=0.9, by=0.1)" -m auprc -s 23 -k 5 -l -n 12

Check Hierarchical Constraints¶

To check that HEMDAG does not violate hierarchical constraints imposed by the GO MF hierarchy in the cross-validated datasets obtained above, just replace replace -e ho with -e cv and -f svmlinear with -f ranger in the Check Hierarchical Constraints script:

algotihms=("gpav" "isotprTF" "isotprW" "isodescensTF" "isodescensW" "isodescensTAU")

for ((i=0; i<${#algotihms[@]}; i++)); do
    Rscript hemdag-checker.R -o 7227_drome -d mf -e cv -f ranger -a ${algotihms[$i]}
done;

Evaluation¶

To evaluate HEMDAG in the cross-validated experiments performed above, just replace replace -e ho with -e cv and -f svmlinear with -f ranger in the Evaluation script:

Note

the evaluation of the cross-validated dataset used here takes around 30 minutes – the slowest metric is PXR.

## single evaluation call
Rscript hemdag-eval.R -o 7227_drome -d mf -e cv -f ranger -a gpav

## generate chunk evaluation calls
perl hemdag-chunk.pl -o 7227_drome -d mf -f ranger -a gpav,isotprTF,isotprW,isodescensTF,isodescensW,isodescensTAU -e cv -c 6 > hemdag-cv-eval.sh

## evaluate HEMDAG in chunks
bash hemdag-cv-eval.sh > out &