
15.1 Motivation of Random Forest

The complexity and high dimensionality of genomic data require flexible and powerful statistical machine learning tools for effective analysis. Random forest (RF) has proven to be an effective tool for such settings and has already produced numerous successful applications (Chen and Ishwaran 2012). RF is a supervised machine learning algorithm that is flexible, easy to use, and, with little effort, produces very competitive predictions of continuous, binary, categorical, and count outcomes. RF also allows measuring the relative importance of each predictor (independent variable) for the prediction. For these reasons, RF is one of the most popular and powerful machine learning algorithms and has been successfully applied in fields such as banking, medicine, electronic commerce, the stock market, and finance, among others.

Because no universal model works in all circumstances, many statistical machine learning models have been adopted for genomic prediction. RF is one of them, with many successful applications (Sarkar et al. 2015; Stephan et al. 2015; Waldmann 2016; Naderi et al. 2016; Li et al. 2018). For example, García-Magariños et al. (2009) found that RF performs better than other methods for binary traits when the sample size is large and the percentage of missing data is low. Naderi et al. (2016) found that, for binary traits, RF outperformed the GBLUP method only in a scenario combining the highest heritability, the largest dense marker panel (50K SNP chip), and the largest number of QTL. González-Recio and Forni (2011) found that RF performed better than Bayesian regression when detecting resistant and susceptible animals based on genetic markers. They also reported that RF produced the most consistent results with very good predictive ability and outperformed other methods in terms of correct classification.

Some of the reasons for the increased popularity of RF models are: (a) they require very simple input preparation and can handle binary, categorical, count, and continuous dependent variables without any preprocessing of the independent variables, such as scaling; (b) they perform implicit variable selection and provide a ranking of predictor (feature) importance; (c) they are inexpensive in terms of the computational resources needed for training, since only a few hyperparameters commonly need to be tuned (number of trees, number of features sampled, and minimum number of samples in the terminal nodes) and since each split considers only a fraction of the independent variables rather than all of them simultaneously; (d) some algorithms can beat random forests, but rarely by much, and those algorithms often take much longer to build and tune than an RF model; (e) in contrast to deep neural networks, which are hard to build well, it is hard to build a bad random forest, because RF depends on very few hyperparameters and some of them are not very sensitive, so a decent RF model does not require a lot of tweaking and fiddling; (f) they have a very simple learning algorithm; (g) they are easy to implement since there are many free and open-source implementations; and (h) parallelization is possible because each decision tree is grown independently.

For this reason, this chapter provides the fundamentals for building RF models, as well as many illustrative examples for continuous, binary, categorical, and count response variables in the context of genomic prediction. All examples are provided in the context of genomic selection with the goal of facilitating the learning process for users who do not have a strong background in statistics or computer science.

15.2 Decision Trees

A decision tree is a prediction model used in various fields ranging from social science to astrophysics. Given a data set, a logical construction diagram, very similar to a rule-based prediction system, is built to represent and categorize a series of conditions that occur in succession in order to solve a problem. Nodes in a decision tree involve testing a particular independent variable (attribute); often, an attribute value is compared with a constant.

As can be seen in Fig. 15.1, decision trees use a divide-and-conquer approach for regression or classification. They work top-down, at each stage seeking the split on an attribute that best separates the classes in classification problems, and then recursively processing the subproblems that result from the split. Under this strategy, a decision tree is built that can be converted into decision rules. Leaf nodes give a classification that applies to all instances that reach the leaf, or a set of classifications, or a probability distribution over all possible classifications (Fig. 15.1b). In a decision tree, each internal node represents a "test" on an independent variable, each branch represents the outcome of the test, and each leaf node represents a class label in a classification problem. To classify a new individual, it is routed down the tree according to the values of the independent variables tested in successive nodes, and when a leaf is reached, the new individual is assigned the class associated with that leaf.

Fig. 15.1
figure 1

Illustration of a decision tree for a classification problem

Next, the following synthetic data set illustrates how to use the ctree() function of the party library.

> Data_GY
     GY   X1   X2   X3   X4
1  9.98 7.56 8.45 10.0 8.99
2  9.48 7.39 8.23 9.92 9.04
3  10.0 8.32 8.84 9.25 10.0
4  7.11 6.96 7.13 9.24 8.49
5  9.07 7.07 9.07 9.41 8.63
6  10.0 9.02 9.65 10.0 9.93
7  8.79 7.91 8.27 8.89 8.48
8  8.61 7.24 8.22 8.38 9.86
9  10.0 8.34 8.55 9.50 8.95
10 10.0 8.30 9.72 9.66 10.0
11 7.84 6.60 8.49 9.00 8.67
12 6.41 7.11 7.34 8.83 8.29

The basic R code to build decision trees in the library is given next:

Control_GY = ctree_control(teststat = c("quad", "max"),
                           testtype = c("Univariate"),
                           mincriterion = 0.05,
                           minsplit = 2,
                           minbucket = 1,
                           stump = FALSE,
                           nresample = 101,
                           maxsurrogate = 2,
                           mtry = 2,
                           savesplitstats = TRUE,
                           maxdepth = 30,
                           remove_weights = FALSE)
Grades_tree = ctree(GY ~ X1 + X2 + X3 + X4,
                    controls = Control_GY,
                    data = Data_GY)

The output of the decision tree built with the ctree() function is given next:

> Grades_tree

Conditional inference tree with five terminal nodes

Response: GY

Inputs: X1, X2, X3, X4

Number of observations: 12

1) X2 <= 7.34; criterion = 0.991, statistic = 6.812
  2)* weights = 2
1) X2 > 7.34
  3) X1 <= 7.24; criterion = 0.982, statistic = 5.561
    4)* weights = 3
  3) X1 > 7.24
    5) X3 <= 9.25; criterion = 0.85, statistic = 2.07
      6)* weights = 2
    5) X3 > 9.25
      7) X4 <= 9.04; criterion = 0.55, statistic = 0.57
        8)* weights = 3
      7) X4 > 9.04
        9)* weights = 2

And the corresponding plot of this decision tree is given next (Fig. 15.2):

Fig. 15.2
figure 2

Decision tree for a regression problem with five terminal nodes

The ctree() function of the party package allows you to fit conditional inference trees. The choice between a regression tree and a classification tree is made automatically depending on whether the response variable is continuous or a factor. It is important to note that only these two types of response variables are allowed; if a character variable is passed, an error is returned.
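Once the tree has been fitted, predictions are obtained with the generic predict() function. The following minimal sketch assumes the Grades_tree and Data_GY objects created above are still in memory and simply compares the observed and fitted values.

pred_GY <- as.numeric(predict(Grades_tree, newdata = Data_GY))
head(data.frame(observed = Data_GY$GY, predicted = pred_GY))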

The predictive capacity of models based on a single tree is considerably lower than that achieved with other models, due to their tendency to overfit and their high variance. One way to improve the generalizability of decision trees is to use regularization or to combine multiple trees; one of the latter approaches is the random forest. Decision trees are also sensitive to unbalanced training data (one class dominates over the others). When dealing with continuous predictors, decision trees lose some information by categorizing them at the time of node splitting. As described before, tree branches are created by the recursive binary splitting algorithm, which identifies and evaluates the possible divisions of each predictor according to a certain measure (RSS, Gini, entropy, etc.). Continuous predictors are more likely to contain, just by chance, some optimal cutoff point, which is why they tend to be favored in the creation of trees. Finally, decision trees are not able to extrapolate outside the range of the predictors observed in the training data.

15.3 Random Forest

Random forest (RF) is a decision tree-based supervised statistical machine learning technique. Its main advantage is that it achieves better generalization performance than a single decision tree for similar training performance. This improvement in generalization is achieved by compensating for the errors in the predictions of the different decision trees. To ensure that the trees are different, each one is trained with a random sample of the training data. This strategy is called bagging.

As mentioned before, RF is a set (ensemble) of decision trees combined with bootstrapping (bagging). With bootstrapping, different trees see different portions of the data: no tree sees all the training data. This means each tree of the forest is trained with a quite different data sample for the same problem; by combining their results, some errors are compensated for by others and the prediction generalizes better. RF adds additional randomness while growing the trees: instead of searching for the most important independent variable when splitting a node, it searches for the best independent variable among a random subset of independent variables. This generates a wide heterogeneity that generally improves model performance. A good binary split partitions the data from the parent tree node into two daughter nodes so that the homogeneity of the daughter nodes is improved relative to the parent node.

In Fig. 15.3 we can see that a collection of ntree >1 trees is grown in which each tree is grown independently using a bootstrap sample of the original data. The terminal nodes of the tree contain the predicted values which are tree-aggregated to obtain the forest predictions. For example, in classification, each tree casts a vote for the class and the majority vote determines the predicted class label, while in regression problems the predicted value is the average of the observations in the terminal nodes. Rather than splitting a tree node using all p independent variables (features), RF, as mentioned above, selects at each node of each tree, a random subset of 1 ≤ mtry ≤ p independent variables that is used to split the node where typically mtry is substantially smaller than p. The purpose of this two-step randomization is to decorrelate trees and reduce variance. RF trees are grown deeply (with many ramifications) which reduces bias. In general, a RF tree is grown as deeply as possible under the constraint that each terminal node must contain no fewer than nodesize ≥1 cases. The nodesize is the minimum size of terminal nodes.

Fig. 15.3
figure 3

Illustration of how the random forest model works and combines multiple trees

Figure 15.4 provides more details of how both randomizations take place. We can see that the bootstrap samples have the same number of observations as the original training data, but not all original observations are present, since some rows are repeated. In general, about two-thirds (63.2%) of the original observations appear in a bootstrap sample. Also, we can see in Fig. 15.4 that not all independent variables are considered in each bootstrapped sample; in this illustration, only three randomly selected independent variables are used. It is important to point out that the bootstrapping of observations is independent of the feature subsampling.

Fig. 15.4
figure 4

Illustration of how bootstrap samples and samples of predictors are selected under the random forest model
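The 63.2% figure mentioned above can be checked numerically. The following short simulation is only a sketch (n and the number of replicates are arbitrary choices) that approximates the expected proportion of unique observations retained in a bootstrap sample of size n.

set.seed(123)
n <- 1000
# Average proportion of unique rows over 500 simulated bootstrap samples
mean(replicate(500, length(unique(sample(n, replace = TRUE))) / n))
# close to 1 - exp(-1) = 0.632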

15.4 RF Algorithm for Continuous, Binary, and Categorical Response Variables

RF is an extension of bootstrap aggregating (bagging) that builds a large collection of trees and then averages their results. Each tree is built using a splitting criterion (loss function) appropriate for the type of response variable (continuous, binary, categorical, or count). From the training data (Breiman 2001), RF takes B bootstrap samples and randomly selects subsets of independent variables as candidate predictors for splitting tree nodes. Each decision tree minimizes the average loss function in the bootstrapped (resampled) data and is built using the following algorithm:

  • For b = 1, …, B bootstrap samples {yb, Xb}

Step 1. From the training data set, draw bootstrap samples of size Ntrain.

Step 2. With the bootstrapped data, grow a random forest tree Tb with the specific splitting criterion (appropriate for each response variable), by recursively repeating the following steps for each terminal node of the tree, until the minimum node size (minimum size of terminal nodes) is reached.

  1. (a)

    Randomly draw mtry out of the p independent variables (IVs); mtry is a user-specified parameter and should be less than or equal to p (the total number of IVs).

  2. (b)

    Pick the best independent variable among the mtry IVs.

  3. (c)

    Split the node into two child nodes. The split ends when a stopping criterion is reached, for instance, when a node has less than a predetermined number of observations. No pruning is performed.

Step 3. Output the ensemble of trees \( {\left\{{T}_b\right\}}_1^B \).

The predicted value of testing set (\( {\hat{y}}_i \)) individuals with input xi is calculated as \( {\hat{y}}_i=\frac{1}{B}\sum \limits_{b=1}^B{T}_b\left({\boldsymbol{x}}_i\right) \) when the response variable is continuous, but when the response variable is binary or categorical, the predicted values are calculated as \( {\hat{y}}_{ic}=\mathrm{majority}\ \mathrm{vote}\ {\left\{{\hat{C}}_b\left({\boldsymbol{x}}_i\right)\right\}}_1^B \), where \( {\hat{C}}_b\left({\boldsymbol{x}}_i\right) \) is the class prediction of the bth RF tree. Readers are referred to Breiman (2001) and Waldmann (2016) for details on the theory of RF. As explained above, the choice of splitting function is very important since we need to use different splitting rules for each type of response variable.
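The aggregation step for a continuous response can be illustrated with a few lines of code. The following is a minimal sketch that grows B trees on bootstrap samples of the toy Data_GY data from Sect. 15.2 using rpart and averages their predictions; it deliberately omits the mtry subsampling of predictors at each split, so it is bagging rather than a full random forest, and B and the rpart settings are illustrative choices only.

library(rpart)

B <- 100
trees <- vector("list", B)
set.seed(1)
for (b in 1:B) {
  boot_idx <- sample(nrow(Data_GY), replace = TRUE)   # bootstrap sample of rows
  trees[[b]] <- rpart(GY ~ X1 + X2 + X3 + X4,
                      data = Data_GY[boot_idx, ],
                      control = rpart.control(minsplit = 2, cp = 0))
}
# Forest prediction for a continuous response: average over the B trees
pred_matrix <- sapply(trees, predict, newdata = Data_GY)
y_hat <- rowMeans(pred_matrix)
head(y_hat)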

Also, the adequate selection of these hyperparameters can significantly improve the prediction performance of the models, but the choice of optimal values is case-study dependent. Next, we describe the importance and influence of these tuning parameters on prediction performance, assuming that input and response variables correspond to the same time point or individual.

  1. (a)

    Number of trees in the forest (ntree). The number of decision trees used to build the ensemble has an explicit influence on prediction performance: the higher the number of trees, the smaller the error. But this trend is asymptotic; once the number of trees is large enough, adding more trees does not significantly improve prediction accuracy, while it does require longer computing times. For this reason, the number of trees is set as a trade-off between computing time and predictive performance. Each tree uses around two-thirds (63.2%) of the observations to build the tree; the remaining observations are referred to as out-of-bag (OOB). One may predict the response for the ith observation using all of the trees in which that observation is OOB (see the sketch after Fig. 15.5).

  2. (b)

    Number of independent variables randomly selected to be considered at each split (mtry) in RF. To choose the value of the mtry parameter, it is necessary to consider the correlation between the input variables. With highly correlated input variables, it is preferable to use a small value. Generally, mtry = p/3 for regression forests (continuous outputs) and \( mtry=\sqrt{p} \) for classification (binary or categorical outputs) forests, where p is the total number of input variables. On the other hand, if there are many irrelevant input variables, a larger value of mtry is needed to obtain better predictions. In any problem, there are independent variables that may be of relatively minor importance (irrelevant). An input variable highly uncorrelated with the rest of the inputs could be very important due to its unique role in the analysis, or not important at all if it is not linked to the response.

  3. (c)

    Node size. A decision tree works by recursive partitioning of the training set. Every node t of a decision tree is associated with a set of nt observations from the training set.

By node size we understand the minimum node size; in Fig. 15.5 the minimum node size is 12. This parameter essentially controls the depth of the decision trees. The depth of each particular decision tree can either be fine-tuned manually or selected automatically by the algorithm. When decision trees grow too deep, there is a risk of overfitting; for this reason, this parameter also needs to be tuned. The terminal nodes of the tree contain the predicted values, which are aggregated across trees (by averaging for continuous variables and by majority vote for categorical variables) to obtain the forest predictions. This means that in classification, each tree casts a vote for the class and the majority vote determines the predicted class label, while for continuous response variables the average of the tree predictions is reported as the prediction.

Fig. 15.5
figure 5

Illustration of the minimum node size in a decision tree
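As mentioned in item (a) above, the OOB error stabilizes once enough trees have been grown. The following hedged sketch, which uses the randomForestSRC package introduced formally in the next subsection, plots the cumulative OOB error stored in err.rate; the object Data is assumed to be a data frame with a continuous response y and the predictor columns (it is constructed later in Example 15.1).

library(randomForestSRC)
fit <- rfsrc(y ~ ., data = Data, ntree = 500)
# err.rate[b] stores the OOB error of the forest built from the first b trees
plot(fit$err.rate, type = "l", xlab = "Number of trees", ylab = "OOB error")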

15.4.1 Splitting Rules

The splitting rule, that is, the rule used to decide where and whether to split a node, is a central component of RF models and crucial for the performance of each tree of an RF model. For continuous response variables, there are many criteria to decide where to split; the most common is the least squares criterion. Suppose the proposed split for the root node is of the form X ≤ c and X > c for a continuous variable X with split threshold c. The best split is the one that minimizes the weighted sum of square errors (SSE):

$$ \mathrm{SSE}=\mathrm{SS}{\mathrm{E}}_L{\varOmega}_L+\mathrm{SS}{\mathrm{E}}_R{\varOmega}_R, $$

where \( \mathrm{SS}{\mathrm{E}}_L={\sum}_{i=1}^L{\left({y}_i-\overline{y_L}\right)}^2 \) represents the sum of square errors for the left node, L denotes the number of elements contained in the left partition, \( \overline{y_L} \) is the mean of the response variable of the elements in the left partition, and \( {\varOmega}_L=\frac{n_L}{n} \) is the proportion of observations in the left node. \( \mathrm{SS}{\mathrm{E}}_R={\sum}_{i=1}^R{\left({y}_i-\overline{y_R}\right)}^2 \) denotes the sum of square errors for the right node, R is the number of elements contained in the right partition, and \( \overline{y_R} \) is the mean of the response variable of the elements in the right partition. \( {\varOmega}_R=\frac{n_R}{n} \) is the proportion of observations in the right node and n = nL + nR. When ΩL = ΩR = 1, the SSE is reduced to its unweighted version.
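As a worked illustration, the weighted SSE for one candidate split (X2 ≤ 7.34, the split chosen at the root of the tree in Fig. 15.2) can be computed on the toy Data_GY data as follows; tree software would scan all cutpoints of all predictors and keep the split with the smallest SSE.

left  <- Data_GY$GY[Data_GY$X2 <= 7.34]
right <- Data_GY$GY[Data_GY$X2 >  7.34]
n     <- nrow(Data_GY)
SSE_L <- sum((left  - mean(left))^2)
SSE_R <- sum((right - mean(right))^2)
SSE   <- SSE_L * (length(left) / n) + SSE_R * (length(right) / n)
SSE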

For binary and categorical response variables, there are several options for the splitting criterion. Next, we present the weighted Gini index criterion, which should be minimized:

$$ \mathrm{GI}=\left[{\sum}_{i=1}^C{p}_{Li}\left(1-{p}_{Li}\right)\right]{\varOmega}_L+\left[{\sum}_{i=1}^C{p}_{Ri}\left(1-{p}_{Ri}\right)\right]{\varOmega}_R, $$

where C is the number of classes in the response variable, \( {p}_{Li}=\frac{n_{Li}}{n_L} \) is the probability of occurrence of class i in the left node, \( {\varOmega}_L=\frac{n_L}{n} \) is the proportion of observations in the left node, \( {p}_{Ri}=\frac{n_{Ri}}{n_R} \) is the probability of occurrence of class i in the right node, \( {\varOmega}_R=\frac{n_R}{n} \) is the proportion of observations in the right node, and n = nL + nR. Setting ΩL = ΩR = 1 reduces GI to the unweighted Gini index.
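The arithmetic of the weighted Gini index can be illustrated for a two-class split as follows; the class counts n_L and n_R below are made-up numbers used only to show how the formula is evaluated.

n_L <- c(8, 2)                      # class counts in the left node
n_R <- c(1, 9)                      # class counts in the right node
p_L <- n_L / sum(n_L)
p_R <- n_R / sum(n_R)
n   <- sum(n_L) + sum(n_R)
GI  <- sum(p_L * (1 - p_L)) * (sum(n_L) / n) +
       sum(p_R * (1 - p_R)) * (sum(n_R) / n)
GI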

Another splitting criterion for binary and categorical response variables is the weighted binary (or categorical) cross-entropy, which is defined as

$$ \mathrm{CE}=-\left[{\sum}_{i=1}^C{p}_{Li}\log \left({p}_{\mathrm{Li}}\right)\right]{\varOmega}_L-\left[{\sum}_{i=1}^C{p}_{Ri}\log \left({p}_{\mathrm{Ri}}\right)\right]{\varOmega}_R $$

The weighted binary (or categorical) cross-entropy also reduces to its unweighted version by making ΩL = ΩR = 1.

For most of the examples, we will use the fast unified random forests for survival, regression, and classification (randomForestSRC) R package. This package performs parallel computing of Breiman's random forests (Breiman 2001) for a variety of data settings including regression, classification, and right-censored survival and competing risks (Ishwaran and Kogalur 2008). Other important applications cover multivariate classification/regression, quantile regression (see quantreg), unsupervised forests, and novel solutions for class-imbalanced data. However, in this chapter, we will only use the randomForestSRC library to illustrate the implementation of RF for univariate and multivariate outcomes with binary, categorical, and continuous response variables.

Different splitting rules can be invoked in RF applications, and variable importance measures (VIM) can also be computed for each predictor. There are many measures of variable importance; one common approach for regression trees is to calculate the decrease in prediction accuracy on held-out data. For each tree, the out-of-bag (testing) portion of the data is passed through the tree and the prediction error (PE) is recorded. Each predictor variable is then randomly permuted and a new PE is computed. The differences between the two are averaged over all the trees and normalized by the standard deviation of the differences. The variable showing the largest decrease in prediction accuracy is the most important variable. These VIM can be displayed in a variable importance plot of the top-ranked variables.

To implement RF models in the randomForestSRC package, we used the function rfsrc(), for which we show the main elements of its usage:

rfsrc(formula, data, ntree = 1000, mtry = NULL, nodesize = NULL,
      splitrule = NULL, importance = TRUE),

where formula is a symbolic description of the model to be fitted (if missing, unsupervised splitting is implemented), data is a data frame containing the y-outcome and x-variables, ntree is the number of trees, and mtry is the number of variables randomly selected as candidates for splitting a node. The default is mtry = p/3 for regression, where p equals the number of independent variables; for all other families (including unsupervised settings), the default is \( mtry=\sqrt{p} \). Values are always rounded up. nodesize denotes the forest average number of unique cases (data points) in a terminal node; the defaults are 1 for classification, 5 for regression, 15 for competing risks, 15 for survival, 3 for mixed outcomes, and 3 for unsupervised settings. It is good practice to experiment with different nodesize values. splitrule denotes the splitting rule: for regression analysis (continuous response variables), the mean squared error (mse), also known as the least squares criterion, can be used; the mse splitting rule implements the weighted mean squared error splitting criterion (Breiman et al. 1984, Chapter 8.4). For classification analysis (binary and categorical response variables), three splitting rules are available in this package: (a) gini, the default, which implements Gini index splitting (Breiman et al. 1984, Chapter 4.3); (b) auc, AUC (area under the ROC curve) splitting for both two-class and multi-class settings, which is appropriate for imbalanced data; and (c) entropy, entropy splitting (Breiman et al. 1984, Chapters 2.5 and 4.3). Finally, importance = TRUE computes VIM for each predictor; the default action is importance = "none".

Next, we provide some examples for implementing RF models for continuous, binary, and categorical response variables.

Example 15.1

For this example, we also used the Data_Toy_EYT.RData data set composed of 40 lines, four environments (Bed5IR, EHT, Flat5IR, and LHT), and four response variables: DTHD, DTMT, GY, and Height. G_Toy_EYT is the genomic relationship matrix of dimension 40 × 40. The first two variables are ordinal with three categories, the third is continuous (GY = Grain yield), and the last one (Height) is binary.

First, using the continuous response variable (GY), we illustrate how to implement RF for continuous outcomes. We build the design matrices that together will form the input for the RF model. The design matrices for this example were built as follows:

### Design matrix of genotypes ###
ZG <- model.matrix(~0 + GID, data = Pheno)
### Compute the Cholesky factorization of the genomic relationship matrix
ZL <- chol(Geno)
### Incorporating the information of the GRM into the design matrix of genotypes
ZGL <- ZG %*% ZL
### Design matrix of environments
ZE <- model.matrix(~0 + Env, data = Pheno)
### Design matrix of the interaction between genotype and environment (GE)
ZGE <- model.matrix(~0 + ZGL:Env, data = Pheno)

Joining the design matrices of environments and genotypes

X <- cbind(ZGL, ZE)

First, using the complete data set in Data_Toy_EYT.RData, we show how to train an RF model with a continuous outcome and how to extract and plot the variable importance. It is important to point out that as input we did not use the marker information directly, but rather the Cholesky factor (a square root) of the genomic relationship matrix.

# Data frame with the response variable and all predictors (environments + genotypes)
Data <- data.frame(y = Pheno$GY, X)
# Fit the model with importance=TRUE for computing the variable importance
model <- rfsrc(y ~ ., data = Data, importance = TRUE)

Get the variable importance

> head(model$importance)
   EnvBed5IR        EnvEHT    EnvFlat5IR        EnvLHT
0.1471930788  0.3345026471  0.2117008212  1.7597458640
  GID6569128    GID6688880
-0.0001849502 -0.0003675786

Plot the ten most important variables using our own plotting function (Appendix 1)

Plot <- plot_importances(model$importance, how_many = 10)
Plot

As mentioned above, with rfsrc() the RF model for a continuous response variable uses the mse splitting rule by default; then with model$importance we extract the variable importance values for each of the independent variables in matrix X (only ten of them are printed above). When the response variable is categorical, the VIM is reported per class; for this reason, extracting the VIM should be done for each category; for example, model$importance[, 2] and model$importance[, C] extract the VIM for categories 2 and C, respectively, of the response variable. Then with the plot_importances() function available in Appendix 1, a plot is generated for only the ten most important independent variables. This plot is given in Fig. 15.6 for the example above with the continuous response variable (GY). The complete code for generating Fig. 15.6 is given in Appendix 2.

Fig. 15.6
figure 6

Variable importance measures (VIM) for the ten most important predictors of the data set Data_Toy_EYT.RData

Figure 15.6 shows that the most important predictors are the dummy variables of environments and that each of the lines has a small effect. It is important to point out that in this plot, genetic and environmental effects are shown together, but what we see in Fig. 15.6 is not unusual, since in plant breeding trials the environmental main effect is always larger than the genetic effects.

The way of extracting and plotting the VIM is similar for continuous, binary, and categorical outcomes; for this reason, it is not necessary to give examples for other types of response variables.

Next, we illustrate how to make predictions for new data with RF, but now using not only the information of environments and genotypes as predictors but also taking into account the genotype–environment interaction. For this reason, we again first stack the information on environments, lines, and genotype–environment interaction.

Joining the design matrices of environments, genotype, and interaction GE term

X <- cbind(ZGL, ZE, ZGE)

It is important to point out that since the response variable is continuous (GY), the mse, also known as the least squares criterion, is used as the splitting rule. Since the RF model has hyperparameters to tune, in these examples the following grid of values was used for three hyperparameters:

ntrees = c(100, 200, 300)
mtry = c(80, 100, 120)
nodesize = c(3, 6, 9)

This means that the grid contains 3 × 3 × 3 = 27 combinations that need to be evaluated with part of the training set; the best combination is then selected, and the RF model is refitted with it using the whole (outer) training set. With this final fitted model, the predictions for the testing set are performed. It is important to point out that the larger the grid, the greater the possibility of improving the prediction performance in the testing set. For the evaluation of the prediction performance in this example, ten random partitions were used; in each partition, 80% of the data was used for outer training and 20% for testing, and the outer training set (80% of the original data) of each random partition was itself split into 80% for inner training and the remaining 20% as the tuning set. See Chap. 4 for a review of the outer and inner training concepts. The 27 hyperparameter combinations of the grid given above were evaluated with the inner training set. The basic code used for the tuning process is given next:

tuning_model <- rfsrc(y ~ ., data = DataInnerTraining,
                      ntree = flags$ntree, mtry = flags$mtry,
                      nodesize = flags$nodesize, splitrule = "mse")
predictions <- predict(tuning_model, newdata = DataInnerTesting)$predicted

This code is used for the tuning process; the RF with a continuous response variable (splitrule = "mse") is fitted for each of the 27 combinations of the grid, and the predicted values for each combination are computed with the predict() function, which requires as inputs the fitted model (tuning_model) and the newdata for which the predictions should be obtained, in this case DataInnerTesting (the tuning set). Finally, from this function, only the predicted values are extracted to compute the prediction performance of each combination in the grid. Then the best hyperparameter combination is selected, and with this best combination the RF is refitted using the whole outer training set (inner training + tuning set). The R code for this training process is the same, but using the whole outer training set:
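As a hedged sketch of how the snippet above can be wrapped in a loop over the 27 grid combinations, the following code enumerates the grid with expand.grid() and keeps the combination with the smallest inner-test MSE; the object names grid and inner_mse are illustrative, and DataInnerTraining/DataInnerTesting (with response column y) are assumed to be built as described in the text.

grid <- expand.grid(ntree = c(100, 200, 300),
                    mtry = c(80, 100, 120),
                    nodesize = c(3, 6, 9))
inner_mse <- numeric(nrow(grid))
for (k in seq_len(nrow(grid))) {
  flags <- grid[k, ]
  tuning_model <- rfsrc(y ~ ., data = DataInnerTraining,
                        ntree = flags$ntree, mtry = flags$mtry,
                        nodesize = flags$nodesize, splitrule = "mse")
  predictions  <- predict(tuning_model, newdata = DataInnerTesting)$predicted
  inner_mse[k] <- mean((DataInnerTesting$y - predictions)^2)
}
best_params <- grid[which.min(inner_mse), ]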

model <- rfsrc(y ~ ., data = DataTraining,
               ntree = best_params$ntree, mtry = best_params$mtry,
               nodesize = best_params$nodesize, splitrule = "mse")
predicted <- predict(model, newdata = DataTesting)$predicted

Finally, with this fitted model, the prediction performance was computed for each of the ten testing sets. The average over the ten partitions [MAAPE, average Pearson's correlation (PC), coefficient of determination (R2), and mean square error (MSE) of prediction] is reported as the prediction performance. The code for computing these metrics and the plots of variable importance measures is given in Appendix 1. The whole code for reproducing the results given in Table 15.1 is provided in Appendix 3. As pointed out before, if the splitrule argument is omitted, the mse splitrule is implemented by default.
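The following is a hedged sketch of the four evaluation metrics reported in Table 15.1; the Appendix 1 functions are not reproduced here, so this implementation (and the function name prediction_metrics) is illustrative only. MAAPE is the mean arctangent absolute percentage error.

prediction_metrics <- function(observed, predicted) {
  c(MAAPE = mean(atan(abs((observed - predicted) / observed))),
    PC    = cor(observed, predicted),
    R2    = cor(observed, predicted)^2,
    MSE   = mean((observed - predicted)^2))
}
# Example use (assuming the response column of DataTesting is named y):
# prediction_metrics(DataTesting$y, predicted)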

Table 15.1 Prediction performance for the continuous trait (GY) in terms of MAAPE, average Pearson’s correlation (PC), coefficient of determination (R2), and mean square error (MSE). Ten random partitions with 80% of data for training and 20% for testing were used

Table 15.1 indicates that the best predictions under the MAAPE were observed in environments Bed5IR and EHT, while under the PC and R2, the best predictions were observed in environments Bed5IR and Flat5IR. However, under the MSE, the best predictions were in environments Bed5IR and LHT. These results show that the selection is in part affected by the metric used for the evaluation of prediction performance; for this reason, using more than one metric is suggested.

Next, we show how to use RF to predict a binary response variable (Height); the independent variables (X) used for this example are exactly the same as in the previous example for the continuous response variable and contain information on environments, genotypes, and genotype×environment interaction. However, since we now want to train RF for a binary outcome, we first need to convert the response variable to a factor:

Pheno$y <- as.factor(Pheno$y)

Pheno$y is the Height trait, which is binary. The grid used for the tuning process was also exactly the same, with 27 combinations resulting from three values each of ntree, mtry, and nodesize. However, now instead of using random partitions to evaluate the prediction performance, we used five-fold cross-validation, each time using four folds for training (outer training) and one for testing. We used another cross-validation strategy only to illustrate that in some circumstances one strategy may be preferred over another; note that k-fold cross-validation guarantees orthogonal (non-overlapping) folds. For selecting the hyperparameters, we also used five-fold cross-validation with four folds for training (inner training) and one for tuning, but this five-fold cross-validation was performed within each of the outer training sets. The key code for implementing RF for tuning is given next:

tuning_model <- rfsrc(y ~ ., data = DataInnerTraining,
                      ntree = flags$ntree, mtry = flags$mtry,
                      nodesize = flags$nodesize, splitrule = "gini")
predictions <- predict(tuning_model, newdata = DataInnerTesting)$predicted

Now, instead of using mse as the splitrule, we used gini, which is appropriate for binary and categorical response variables. If the splitrule is not specified and the response variable is a factor, the Gini split rule is used by default; when the response variable is continuous, the mse split rule is used by default. Again, once the best hyperparameter combination was selected in each outer training set, the RF was refitted with the whole outer training set (inner training + tuning set), and with this final refitted model, the predictions for each testing set were obtained. The R code using rfsrc() for the final fit with the whole training set is given next, and it is exactly the same as for the tuning process.

model <- rfsrc(y ~ ., data = DataTraining,
               ntree = best_params$ntree, mtry = best_params$mtry,
               nodesize = best_params$nodesize, splitrule = "gini")
predicted <- predict(model, newdata = DataTesting)$predicted

The complete R code for implementing RF for the binary response variable is given in Appendix 4. Finally, the proportion of cases correctly classified (PCCC) and the Kappa coefficient (Kappa) are computed with all the predicted values of each of the five testing sets, and the average across the five folds is reported as the prediction performance for each environment.
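A hedged sketch of how PCCC and the Kappa coefficient can be computed for one testing set is given next; the Appendix functions may differ, and observed and predicted are assumed to be factors with the same levels.

classification_metrics <- function(observed, predicted) {
  tab   <- table(observed, predicted)
  n     <- sum(tab)
  pccc  <- sum(diag(tab)) / n                        # proportion correctly classified
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2    # chance agreement
  kappa <- (pccc - p_exp) / (1 - p_exp)
  c(PCCC = pccc, Kappa = kappa)
}
# Example (hedged): if pred_class holds the predicted class labels for the
# testing set (e.g., the $class element of the predict() output), then
# classification_metrics(DataTesting$y, pred_class)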

Table 15.2 indicates that the best prediction performance was observed in environments Bed5IR (PCCC = 0.7331 and Kappa = 0.3746) and EHT (PCCC = 0.6801 and Kappa = 0.2964).

Table 15.2 Prediction performance for the binary trait (Height) in terms of the proportion of cases correctly classified (PCCC) and the Kappa coefficient. Five-fold cross-validation was used

Next, we provide the results of implementing RF for the same data set, but now for the categorical response variable DTHD. The implementation was also done with five-fold cross-validation (outer and inner) and all inputs and the grid used for the tuning process were the same as in the previous binary example, except that now the response variable was categorical. The R code given in Appendix 4 can also be used for categorical response variables but by replacing the binary response variable (Height) with the categorical response variable (DTHD). The results of the prediction performance in terms of PCCC and Kappa coefficient are given in Table 15.3; the best prediction performance was observed in environments Bed5IR (PCCC = 0.7623 and Kappa = 0.5693) and EHT (PCCC = 0.8057 and Kappa = 0.678).

Table 15.3 Prediction performance for the categorical trait (DTHD) in terms of the proportion of cases correctly classified (PCCC) and the Kappa coefficient (Kappa). Five-fold cross-validation was used

15.5 RF Algorithm for Count Response Variables

The popular RF models presented before were originally developed for continuous, binary, and categorical data. There are also RF models for count data (Chaudhuri et al. 1995; Loh 2002) that can be implemented in R using the rpart package (Therneau and Atkinson 2019). However, these RF models for count data are not appropriate for counts with an excess of zeros. For this reason, Lee and Jin (2006) proposed an RF method for counts with an excess of zeros by building the splitting criterion with the zero-inflated Poisson distribution; however, the proposed method models the excess-zero part and the Poisson part jointly, unlike the basic hurdle and zero-inflated regression models that use two models, thus allowing different covariate effects for each part. The excess zeros are generated by a process disconnected from the count-generating process, so they can be modeled independently. For this reason, classic regression models for counts with an excess of zeros use a logistic model to predict the excess zeros and a truncated Poisson model for counts larger than zero.

For this reason, we next describe two algorithms for count data with an excess of zeros, proposed by Mathlouthi et al. (2019) and applied by Montesinos-López et al. (2021) for genomic prediction. The first algorithm is called zero-altered Poisson random forest (ZAP_RF) and the other is called zero-altered Poisson custom random forest (ZAPC_RF). Both algorithms [zero-altered Poisson (ZAP) models] assume that Y = 0 with probability θ (0 ≤ θ < 1) and that, given Y > 0, Y follows a zero-truncated Poisson distribution with parameter μ (μ > 0) (Mathlouthi et al. 2019). That is, they are based on the ZAP random variable

$$ P\left(Y=y\right)=\left\{\begin{array}{c}\theta \kern9.5em y=0\\ {}\frac{\left(1-\theta \right)\exp \left(-\mu \right){\mu}^y}{\left(1-\exp \left(-\mu \right)\right)y!}\kern1em y>0\end{array}\right. $$

The mean and variance for ZAP are

$$ E(Y)=\frac{\left(1-\theta \right)\mu }{1-\exp \left(-\mu \right)}\kern0.5em \mathrm{and}\kern0.5em \mathrm{Var}(Y)=\frac{\left(1-\theta \right)}{1-\exp \left(-\mu \right)}\left(\mu +{\mu}^2\right)-{\left(\frac{\left(1-\theta \right)\mu }{1-\exp \left(-\mu \right)}\right)}^2 $$

In general, zero-altered models are two-part models, where the first part is a logistic model and the second part is a truncated count model. However, under ZAP_RF and ZAPC_RF, instead of assuming a linear predictor (as in ZAP regression models), the covariates are linked to the response through μ and θ via nonparametric link functions (Mathlouthi et al. 2019):

$$ \log \left(\mu \right)={f}_{\mu}\left(\boldsymbol{x}\right)\ \mathrm{and}\ \log \left(\frac{\theta }{1-\theta}\right)={f}_{\theta}\left(\boldsymbol{x}\right), $$
(15.1)

where fμ and fθ are general unknown link functions. A general nonparametric and flexible procedure can be used to estimate fμ and fθ in (15.1). However, here we used a random forest in two steps instead of a parametric model:

Step 1. Zero model. Fit a binary RF to the response I(Y = 0), that is, the binary variable takes a value of 1 if Y = 0 and a value of 0 if Y > 0. This model produces estimates of \( \hat{\theta} \).

Step 2. Truncated model. Fit an RF using only the positive (Y > 0) observations. Assume there are N+ such observations denoted by \( {Y}_1^{+},\dots, {Y}_{N^{+}}^{+} \). This model produces estimates of \( \hat{\mu} \). However, to exploit the Poisson assumption, the splitting criterion used in the RF for the truncated part is derived from the zero-truncated Poisson likelihood, which is equal to:

$$ L{L}^{+}=-{N}^{+}\log \left(1-\exp \left(-\mu \right)\right)+\log \left(\mu \right)\sum \limits_i^{N^{+}}{Y}_i^{+}-{N}^{+}\mu -\sum \limits_i^{N^{+}}\log \left({Y}_i^{+}!\right), $$
(15.2)

where LL+ is the log-likelihood function of a sample of a zero-truncated Poisson distribution. The estimate of μ is obtained by solving \( \frac{\partial L{L}^{+}}{\partial \mu }=0 \), which reduces to

$$ \frac{\sum \limits_i^{N^{+}}{Y}_i^{+}\ }{N^{+}}=\frac{\mu }{1-\exp \left(-\mu \right)} $$

For a given candidate split, the log-likelihood function given in (15.2) is computed separately in the two children nodes and the best split is the one that maximizes

$$ \hat{L{L}^{+}}\ \left(\mathrm{left}\ \mathrm{node}\right)+\hat{L{L}^{+}}\ \left(\mathrm{right}\ \mathrm{node}\right), $$

where \( \hat{L{L}^{+}} \) (left node) and \( \hat{L{L}^{+}} \) (right node) are the log-likelihood for each node.

Once we have the estimates of μ and θ, the predicted values of Y under the ZAP_RF are obtained with

$$ \hat{Y}=\frac{\left(1-\hat{\theta}\right)\hat{\mu}}{1-\exp \left(-\hat{\mu}\right)} $$
(15.3)

It is important to point out that in the prediction formula given above (15.3), (\( \hat{Y} \)) is equal to the mean of the ZAP model, while under the ZAPC_RF, the predictions are obtained as

$$ \hat{Y}=\left\{\begin{array}{c}0,\kern3.5em \hat{\theta}>0.5\\ {}\hat{\mu},\kern3.25em \hat{\theta}\le 0.5\end{array}\right. $$
(15.4)

Under the ZAPC_RF, as in conventional logistic regression, the predicted values of the zero part are probabilities, and such probabilities are usually converted to a binary outcome by comparing them with a probability threshold (most of the time, 0.5). However, under the ZAPC_RF, instead of converting the probabilities to 0 and 1, the prediction is set to zero if \( \hat{\theta}>0.5 \) and to the estimated expected count value (\( \hat{\mu} \)) if \( \hat{\theta}\le 0.5 \) (15.4). One limitation that the ZAPC_RF shares with logistic regression is that the probability threshold is not unique, since many other values between zero and one can be used. However, the threshold value of 0.5 is used most of the time since it assumes no prior information, so both categories have the same probability of occurring (Montesinos-López et al. 2021).
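How the two sets of RF estimates are combined into predictions under Eqs. (15.3) and (15.4) can be sketched in a few lines of R; theta_hat and mu_hat are assumed to be vectors of estimated zero probabilities and truncated-Poisson means, and the function name zap_pred is illustrative only (it is not part of any package).

zap_pred <- function(theta_hat, mu_hat, type = c("original", "custom")) {
  type <- match.arg(type)
  if (type == "original") {
    # ZAP_RF: the predicted value is the ZAP mean (Eq. 15.3)
    (1 - theta_hat) * mu_hat / (1 - exp(-mu_hat))
  } else {
    # ZAPC_RF: zero when the estimated zero probability exceeds 0.5 (Eq. 15.4)
    ifelse(theta_hat > 0.5, 0, mu_hat)
  }
}
zap_pred(theta_hat = c(0.8, 0.2), mu_hat = c(2.1, 3.4), type = "custom")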

The implementation of the ZAP_RF and ZAPC_RF can be done using the following function:

tuning_model <- zap.rfsrc(X_training, y_training,
                          ntree_theta = ntree_1, mtry_theta = mtry_1,
                          nodesize_theta = nodesize_1,
                          ntree_lambda = ntree_2, mtry_lambda = mtry_2,
                          nodesize_lambda = nodesize_2)
predictions <- predict(tuning_model, X_testing, type = "original")$predicted

It is important to point out that for the zap.rfsrc() function we need to provide a training set with predictors (X_training) and the count response variable (y_training). Also, as with the rfsrc() function, we need to provide ntree, mtry, and nodesize; however, these parameters must be specified for the two parts (zero model and truncated model) of the ZAP_RF and ZAPC_RF models. The parameters that end with _theta are for the zero model and those that end with _lambda are for the truncated model. The specification given above for zap.rfsrc() is valid for both ZAP_RF and ZAPC_RF, since the only difference between the two models is in how the predictions are computed. For this reason, changing the parameter type = "original" (which implements ZAP_RF) to type = "custom" in the predict() function is all that is needed to implement the ZAPC_RF model.

Next, we show how to implement the ZAP_RF model using the same data set as in the previous examples. All the input information used for this example is the same as before (the same design matrices, the same grid for the tuning process, etc.), except that now, for illustration purposes, the response variable is treated as a count. Since we are using the same data set (Data_Toy_EYT.RData) as in the previous example, we treat trait DTHD as a count. The basic R code for the tuning process is given next:

tuning_model = zap.rfsrc(X_inner_training, y_inner_training,
                         ntree_theta = flags$ntree, mtry_theta = flags$mtry,
                         nodesize_theta = flags$nodesize,
                         ntree_lambda = flags$ntree, mtry_lambda = flags$mtry,
                         nodesize_lambda = flags$nodesize)
predictions <- predict(tuning_model, X_inner_testing, type = "original")$predicted

The R code for refitting the model with the best hyperparameter combination is the same as above, with the only difference that instead of using the inner information, the whole training set (inner_training+tuning set) is used. The whole code for implementing the ZAP_RF is provided in Appendix 5.

Since the categorical trait (DTHD) was treated as a count (only for illustration purposes), the prediction performance in Table 15.4 is reported in terms of the metrics used for continuous traits (MAAPE, PC, R2, MSE). Table 15.4 shows that in terms of MAAPE, PC, and R2, the best predictions were observed in environments Bed5IR and EHT, while in terms of MSE, the best predictions were observed in environments EHT and LHT.

Table 15.4 Prediction performance for the categorical trait (DTHD) assumed as a count response variable under the ZAP_RF model with five-fold cross-validation

Next, we obtain the predictions under the ZAPC_RF, which differs from the ZAP_RF model only in specifying type = "custom" inside the predict() function, as shown below.

predictions <- predict(tuning_model, X_inner_testing, type = "custom")$predicted

For this reason, the code given in Appendix 5 can also be used to implement the ZAPC_RF by only changing type = "original" to type = "custom". The prediction performance of the ZAPC_RF is given in Table 15.5. Under MAAPE, the best predictions were observed in the Bed5IR and LHT environments, while under PC, R2, and MSE, the best predictions were observed in environments EHT and LHT.

Table 15.5 Prediction performance for the categorical trait (DTHD) assumed as a count response variable under the ZAPC_RF model with five-fold cross-validation

15.6 RF Algorithm for Multivariate Response Variables

Multivariate RF models are a generalization of univariate RF models. A typical univariate RF model involves a data set where each instance has a single response value (continuous, binary, categorical, or count), whereas the instances in a multivariate (multi-trait) RF problem have two or more response values, i.e., the output is a vector rather than a scalar. There are two general approaches for solving multivariate RF problems: transforming the problem into multiple univariate problems, or adapting the RF algorithm so that it directly handles multivariate response variables simultaneously. The naive approach of transforming a multivariate problem into several univariate problems is applicable for regression (continuous response variables) just as for classification (binary and categorical), and the same caveats apply.

Training any statistical machine learning model for predicting a continuous, binary, categorical, and count response variable is a time-consuming task—particularly so when training data sets are very large. When multiple models need to be trained using the same predictors in the data, but with different response variables, time consumption can quickly get out of hand. Thus, for very large problems, transforming a multivariate approach to univariate analysis may prove to be unsuitable. Using the algorithm adaptation approach, it is possible to directly create a model that simultaneously predicts a set of two or more continuous, binary, categorical, or count responses, or even mixed response variables (continuous, binary, and categorical) from a single training iteration. More importantly, when the prediction tasks are related (i.e., there is a correlation or covariance between response values), training a coherent multivariate model can potentially bring benefits in the form of increased predictive performance compared to training multiple disjoint models (Evgeniou and Pontil 2004).

RFs have already been adapted to handle multivariate response variables for continuous, binary, categorical, and mixed (continuous, binary, and categorical) outcomes (Segal 1992; Larsen and Speckman 2004; Zhang 1998; De’Ath 2002; Faddoul et al. 2012; Segal and Xiao 2011; Glocker et al. 2012). For multivariate regression analysis, Tang and Ishwaran (2017) suggest using an averaged standardized variance splitting rule. Assuming that there are measures of q traits in each observation, that is, yi = (yi,1, …, yi,q)T, the goal is to minimize the multivariate sums of squares (MSS),

$$ \mathrm{MSS}=\sum \limits_{j=1}^q\left({\sum}_{i=1}^L{\left({y}_{ij}-{\overline{y}}_{Lj}\right)}^2+{\sum}_{i=1}^R{\left({y}_{ij}-{\overline{y}}_{Rj}\right)}^2\kern0.5em \right), $$
(15.5)

where \( {\overline{y}}_{Lj} \) and \( {\overline{y}}_{Rj} \) are the sample means of the jth response variable in the left and right daughter nodes. Note that such a splitting rule (15.5) can only be effective if each of the response variables is measured on the same scale, otherwise we could have a response variable j with, say, very large values, and its contribution would dominate MSS. We therefore calibrate MSS by assuming that each response variable has been standardized (with mean zero and variance equal to one). The standardization is applied prior to splitting a node. To make this standardization clear, we denote the standardized responses by \( {y}_{ij}^{\ast } \) (Tang and Ishwaran 2017). With some elementary manipulations, it can be verified that minimizing MSS is equivalent to minimizing

$$ \mathrm{MSS}=\sum \limits_{j=1}^q\left(\frac{1}{n_L}{\left({\sum}_{i=1}^L{y}_{ij}^{\ast}\right)}^2+\frac{1}{n_R}{\left({\sum}_{i=1}^R{y}_{ij}^{\ast}\right)}^2\right) $$
(15.6)

For multivariate classification, an averaged standardized Gini splitting rule is used. According to Tang and Ishwaran (2017), the best split s for X is obtained by maximizing

$$ \mathrm{MGI}=\sum \limits_{j=1}^r\left(\frac{1}{C_j}\sum \limits_{k=1}^{C_j}\left(\frac{1}{n_L}{\left({\sum}_{i=1}^L{z}_{i(k)j}\right)}^2+\frac{1}{n_R}{\left({\sum}_{i=1}^R{z}_{i(k)j}\right)}^2\right)\right), $$
(15.7)

where r denotes the number of categorical traits, the response variable (yij) is a class label from {1, …, Cj} for Cj ≥ 2, and \( {z}_{i(k)j}={1}_{\left\{{y}_{ij}=k\right\}} \). Note that the normalization 1/Cj employed in (15.7) for response variable j is required to standardize the contribution of the Gini split from that response variable. Observe that (15.6) and (15.7) are equivalent optimization problems, with optimization over yij for regression and zi(k)j for classification. This leads to similar theoretical splitting properties in regression and classification settings. Given this similarity, it is feasible to combine the two splitting rules ((15.6) and (15.7)) to form a composite splitting rule. The mixed outcome splitting rule MCI is a composite standardized split rule of mean squared error (15.6) and Gini index splitting (15.7), i.e.,

$$ \mathrm{MCI}=\mathrm{MSS}+\mathrm{MGI}, $$
(15.8)

where p = q + r. The best split for X is the value of s maximizing MCI (15.8). This multivariate normalized composite splitting function of mean squared error and Gini index splitting can be invoked with splitrule = "mv.mix" inside the rfsrc() function to implement multivariate RF for mixed outcomes (binary, categorical, and continuous).

There are other splitting functions for continuous and categorical response variables. For continuous multivariate responses, Segal (1992) proposed the multivariate Mahalanobis splitting rule:

$$ \mathrm{SSE}={\sum}_{i=1}^L{\left({\boldsymbol{y}}_i-\overline{{\boldsymbol{y}}_L}\right)}^{T}{\hat{\boldsymbol{\Sigma}}}^{-1}\left(\theta, t\right)\left({\boldsymbol{y}}_i-\overline{{\boldsymbol{y}}_L}\right)+{\sum}_{i=1}^R{\left({\boldsymbol{y}}_i-\overline{{\boldsymbol{y}}_R}\right)}^{T}{\hat{\boldsymbol{\Sigma}}}^{-1}\left(\theta, t\right)\left({\boldsymbol{y}}_i-\overline{{\boldsymbol{y}}_R}\right), $$

where \( \hat{\boldsymbol{\Sigma}}\left(\theta, t\right) \) is an estimate of the covariance matrix obtained from the parent node t before the split, which may be modeled through a vector of parameters θ in order to impose a specific structure, for example, exchangeable or autoregressive. Even when no structure is imposed (that is, when the sample covariance matrix is used), \( \hat{\boldsymbol{\Sigma}}\left(\theta, t\right) \) is still computed on the parent node before the split. The Mahalanobis splitting rule allows incorporating the correlation between response variables; however, it is not available in the rfsrc() function.

When the response variable contains only binary or categorical variables, the splitting criterion is

$$ \mathrm{CE}={\sum}_{k=1}^S\left({n}_k^L\ln \left({\hat{\pi}}_k^L\right)+{n}_k^R\ln \left({\hat{\pi}}_k^R\right)\right), $$

under the assumption that there are r categorical responses, and letting sj be the number of different values of the categorical outcome fj, for j = 1, …, r. The whole vector (f1, …, fr) can be described through a single variable S, taking \( S={\prod}_{j=1}^r{s}_j \) possible values or states, that is, each state corresponds to one unique combination of the original variables. This way, the original vector of categorical responses is cast into a single multinomial outcome with S possible values that we assume to be 1, 2, …, S without loss of generality. This criterion amounts to casting all individual responses into a single categorical outcome with S values and then using the usual entropy criterion. This is in accordance with our goal to remain as free as possible from assumptions, but obviously, this solution becomes impractical when S is too large. In such a case, dimension reduction is required.

Also, to implement multivariate RF, we will use the rfsrc() function, but instead of specifying only one response variable (before the ~), we need to specify at least two response variables inside the Multivar(GY, DTHD, DTMT, Height) function or the cbind(GY, DTHD, DTMT, Height) specification, where GY, DTHD, DTMT, and Height denote the four response variables. The specification of the remaining parameters is the same. For example, below we show the basic code for performing the tuning process:

tuning_model <- rfsrc(Multivar(GY, DTHD, DTMT, Height) ~ .,
                      data = DataInnerTraining,
                      ntree = flags$ntree, mtry = flags$mtry,
                      nodesize = flags$nodesize)
predictions <- predict(tuning_model, DataInnerTesting)
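The object returned by predict() for a multivariate fit stores the predictions separately for each response. The following is a hedged sketch of how they could be extracted, assuming the get.mv.predicted() utility shipped with recent versions of randomForestSRC; the column layout (one column per continuous trait) is an assumption.

# Matrix of predicted values, one column per response variable
pred_mat <- get.mv.predicted(predictions)
head(pred_mat)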

The complete R code for implementing the multivariate RF for continuous traits is given in Appendix 6. The mv.mse splitting rule is used by default when all response variables are continuous under the multivariate framework. Table 15.6 shows that the best prediction performance in terms of MAAPE was observed for the GY trait; however, in terms of PC and R2, the best predictions were observed for trait DTMT. The predictions under the MSE are not comparable between traits because the traits are on different scales.

Table 15.6 Prediction performance under multivariate RF assuming all traits are continuous with five-fold cross-validation

Next we provide the basic R code for implementing multivariate RF for categorical response variables.

model <- rfsrc(cbind(DTHD, DTMT, Height) ~ ., data = DataTraining,
               ntree = best_params$ntree, mtry = best_params$mtry,
               nodesize = best_params$nodesize)
predicted <- predict(model, DataTesting)

We can see that there is no difference between the specification of a multivariate RF model for continuous and categorical response variables; however, when implementing the multivariate RF for categorical outcomes, all the response variables should be converted into factors. This is important because if this conversion is ignored, the RF will be implemented assuming that all response variables are continuous, using the mv.mse splitting function. The multivariate Gini splitting rule (splitrule = "mv.gini") is used by default when all response variables are categorical; it is the only splitting function available in rfsrc() for multivariate categorical outcomes.
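A minimal sketch of this conversion, with trait names following the Data_Toy_EYT example, is given next; converting the categorical traits to factors before fitting ensures the mv.gini splitting rule is used.

DataTraining$DTHD   <- as.factor(DataTraining$DTHD)
DataTraining$DTMT   <- as.factor(DataTraining$DTMT)
DataTraining$Height <- as.factor(DataTraining$Height)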

Next, we give the results in terms of PCCC and the Kappa coefficient for one binary trait (Height) and two categorical traits (DTHD and DTMT) (Table 15.7). Across environments, the best predictions were observed for trait DTHD (PCCC = 0.7348 and Kappa = 0.5799) and the worst for the Height trait (PCCC = 0.6988 and Kappa = 0.3959). For trait DTHD, the best predictions were observed in the EHT environment, while for trait DTMT, the best predictions were observed in environment Bed5IR; finally, for trait Height, the best predictions were observed in environment EHT.

Table 15.7 Prediction performance under multivariate RF assuming all traits are categorical with five-fold cross-validation

Finally, the specification in the rfsrc() function of the multivariate RF model with mixed outcomes (continuous, binary, and categorical) is the same as the specification given above for all-continuous or all-categorical outcomes, but the categorical response variables must be coded as factors and the continuous outcomes as numeric; in this way, the splitting function used during training will be splitrule = "mv.mix" and the analysis will be performed correctly. In Table 15.8, we can see that for the continuous trait GY, the best predictions were observed in the Bed5IR environment. For the binary and categorical traits, the best predictions across environments were observed for trait Height under both metrics.

Table 15.8 Prediction performance under multivariate RF with mixed traits (Height = binary, DTHD and DTMT = categorical, and GY = continuous) with five-fold cross-validation

15.7 Final Comments

Part of the power of RF is due to the fact that RF introduces two kinds of randomization. First, a bootstrap sample randomly drawn from the training data is used to grow each tree. Second, at each node of the tree, a randomly selected subset of variables (covariates) is chosen as candidate variables for splitting. This means that rather than splitting a tree node using all p variables (features), RF selects, at each node of each tree, a random subset of 1 ≤ mtry ≤ p variables that is used to split the node, where typically mtry is substantially smaller than p. The purpose of this two-step randomization is to decorrelate trees and reduce variance. RF trees are grown deeply, which reduces bias. Averaging across trees, in combination with the subsampling process used in growing a tree, enables RF to approximate rich classes of functions while maintaining low generalization error. Considerable empirical evidence has shown RF to be highly accurate, comparable to state-of-the-art methods such as bagging (Breiman 1996), boosting (Schapire et al. 1998), and support vector machines (Cortes and Vapnik 1995).

As a tree-based ensemble statistical machine learning tool that is highly data adaptive, RF can be applied successfully to “large p, small n” problems, and it is also able to capture correlation as well as interactions among independent variables (Chen and Ishwaran 2012). RF is becoming increasingly used in genomic selection because, unlike traditional methods, it can efficiently analyze thousands of loci simultaneously and account for nonadditive interactions. For these reasons, RF is very appealing for high-dimensional genomic data analysis. In this chapter, we provide the motivation, fundamentals, and some applications of RF for genomic prediction with genomic data with continuous, binary, ordinal, and count response variables that are very often found in genomic selection.

Another advantage of RF over alternative machine learning methods is variable importance measures, which can be used to identify relevant independent variables (input) or perform variable selection. In general, RF provides a very powerful algorithm that often has great predictive accuracy and has become one of the benchmarks in the predictive field due to the good results it generates in very diverse problems. RF comes with all the benefits of decision trees (with the exception of surrogate splits) and bagging, but greatly reduces instability and between-tree correlation. And due to the added split variable selection attribute, RF models are also faster than bagging (not explained in this book) as they have a smaller feature search space at each tree split. However, RF will still suffer from slow computational speed as the data sets get larger but, similar to bagging, the algorithm is built upon independent steps, and most modern implementations allow for parallelization to improve training time.