This past week I wrote an R function that randomly withholds data from a training set (known trait values from the tree database). The function call specifies the proportion of values to omit and how to weight rows and columns, where rows are individual species and columns are traits. I then tested phylogenetic imputation (using the Rphylopars package in R) at different proportions of missing values, and plotted the imputed values with their 95% confidence intervals alongside the known values to see whether the known values fell within the intervals. Oddly, when I omitted more than 50% of the values, the uncertainty of the imputed values was lower. One would expect a data set with more known values to give more certainty about the imputed values, not less. I will continue to test this phenomenon at other proportions of missing values; so far I have only tested the training data set at 5%, 10%, 15%, 20%, 50%, and 90% missing values.
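To make the withholding step concrete, here is a minimal sketch of what such a function might look like. It is not the actual function described above: the names `withhold_values` and `ci_coverage`, their arguments, and the way row and column weights are combined (multiplied into a per-cell sampling weight) are all assumptions for illustration. The coverage check also assumes a normal-approximation interval built from an estimate and its variance, which may differ from how the intervals were actually computed.

```r
# Hypothetical helper: set a proportion of the known cells in a trait matrix to NA.
#   traits      numeric matrix, rows = species, columns = traits
#   prop        proportion of the currently known cells to withhold (0 to 1)
#   row_weights relative chance of withholding from each row (assumed scheme)
#   col_weights relative chance of withholding from each column (assumed scheme)
withhold_values <- function(traits, prop,
                            row_weights = rep(1, nrow(traits)),
                            col_weights = rep(1, ncol(traits))) {
  stopifnot(is.matrix(traits), prop >= 0, prop <= 1,
            length(row_weights) == nrow(traits),
            length(col_weights) == ncol(traits))
  known <- which(!is.na(traits))              # linear indices of known cells
  n_drop <- round(prop * length(known))
  if (n_drop == 0) {
    return(list(masked = traits, withheld = integer(0)))
  }
  # Per-cell weight = row weight * column weight (an illustrative choice).
  cell_w <- outer(row_weights, col_weights)[known]
  drop <- known[sample.int(length(known), size = n_drop, prob = cell_w)]
  masked <- traits
  masked[drop] <- NA
  list(masked = masked, withheld = drop)      # keep indices to score imputations later
}

# Hypothetical helper: fraction of withheld true values that fall inside the
# interval estimate +/- z * sqrt(variance), assuming approximate normality.
ci_coverage <- function(truth, estimate, variance, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  half_width <- z * sqrt(variance)
  mean(truth >= estimate - half_width & truth <= estimate + half_width)
}

# Toy usage with simulated data (not the tree database).
set.seed(1)
traits <- matrix(rnorm(40), nrow = 10,
                 dimnames = list(paste0("sp", 1:10), paste0("trait", 1:4)))
out <- withhold_values(traits, prop = 0.2)
sum(is.na(out$masked))   # 8 of the 40 cells are withheld
```

Returning the withheld indices alongside the masked matrix makes it easy to compare imputed values against the known ones at exactly the cells that were hidden, and to compute coverage at each proportion of missing values.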
In the next ~2.5 weeks I hope to finalize the multivariate and phylogenetic imputation techniques. I have yet to start on the spatial imputation technique, but I will begin tomorrow and finish before July 21st (the day the poster is due). I think this is doable, as I already have the data in the correct formats. The most challenging part will be putting everything into one model.
The final step will be to combine all three methods (or at least the phylogenetic and spatial methods) into one unified model that ecologists can use.
July 5, 2017