A New Method to Missing Value Imputation for Immunosignature Data
Immunosignature technology uses microarray chips carrying peptides with random amino acid sequences to detect diseases from changes in the profile of circulating antibodies. Diseases are detected by classification algorithms trained on a reduced sample of immunosignature patterns from patients with known diagnoses.
The aim of the study was to develop a new method of missing value imputation for immunosignature data that maintains sufficient classification accuracy.
Materials and Methods. The study used immunosignature data obtained with a high-resolution peptide microarray chip containing nearly ten thousand peptide cells.
The applicability of several missing value imputation methods was evaluated: simple imputation, weighted k-nearest neighbors, and machine learning techniques (linear regression, random forest, gradient boosting).
Results. A missing value imputation method based on gradient boosting was developed in the course of the study. The method iterates through all features (attributes); for each feature, a model is trained on the examples (samples) in which that feature is observed and is then used to refine the feature's missing values. This process is repeated until the total training error over all features stops decreasing or the maximum number of iterations is reached. The root mean squared error is used as the training error metric.
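The iterative procedure described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a component-wise linear base learner as the "linear gradient boosting" model, initialization by column means, and at least a few observed values in every feature. The function names `fit_linear_boosting` and `impute_boosting` and all parameter defaults are hypothetical.

```python
import numpy as np

def fit_linear_boosting(X, y, n_rounds=50, learning_rate=0.3):
    """Component-wise linear gradient boosting for squared loss (assumed base model).

    Returns (intercept, coef) such that prediction = intercept + X @ coef.
    """
    mu = X.mean(axis=0)
    Xc = X - mu                           # centered predictors
    norms = (Xc ** 2).sum(axis=0)
    norms[norms == 0] = np.inf            # constant columns are never useful
    coef = np.zeros(X.shape[1])
    f0 = y.mean()
    resid = y - f0                        # negative gradient of squared loss
    for _ in range(n_rounds):
        proj = Xc.T @ resid
        k = np.argmax(proj ** 2 / norms)  # predictor that best fits the residual
        step = learning_rate * proj[k] / norms[k]
        coef[k] += step
        resid -= step * Xc[:, k]
    return f0 - mu @ coef, coef

def impute_boosting(X, max_iter=10, tol=1e-6):
    """Iteratively refine missing values (NaN) feature by feature."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Assumed starting point: simple imputation with column means.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    prev_error = np.inf
    for _ in range(max_iter):
        candidate = filled.copy()
        total_error = 0.0
        for j in range(X.shape[1]):
            obs = ~missing[:, j]          # samples where feature j is present
            others = np.delete(np.arange(X.shape[1]), j)
            intercept, coef = fit_linear_boosting(
                candidate[obs][:, others], candidate[obs, j]
            )
            pred = intercept + candidate[:, others] @ coef
            # Training RMSE on the observed part of feature j.
            total_error += np.sqrt(np.mean((pred[obs] - candidate[obs, j]) ** 2))
            candidate[~obs, j] = pred[~obs]   # refine only the missing entries
        if total_error >= prev_error - tol:   # total error stopped decreasing
            break
        filled, prev_error = candidate, total_error
    return filled
```

Calling `impute_boosting(X)` on a samples-by-features array with `NaN` marking missing cells returns an array of the same shape in which observed values are left unchanged. A sweep is accepted only if it lowers the summed training RMSE, which mirrors the stopping rule stated above.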
The quality of missing value imputation was assessed by the results of classification performed on the data obtained after imputation.
The proposed missing value imputation algorithm based on linear gradient boosting proved effective when the proportion of missing values was high, compared with the other methods considered. The results demonstrate the viability of machine learning techniques for missing value imputation in immunosignature data.
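One way to picture this evaluation is to compare cross-validated classification accuracy on data with and without imputed values. The sketch below is hypothetical: the abstract names neither the classifier nor the validation scheme, so a nearest-centroid classifier with k-fold cross-validation and a column-mean imputer are used here purely as stand-ins, and both function names are assumptions.

```python
import numpy as np

def nearest_centroid_cv_accuracy(X, y, n_folds=5, seed=0):
    """Mean k-fold cross-validated accuracy of a nearest-centroid classifier."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        classes = np.unique(y[train])
        # One centroid per class, computed from training samples only.
        centroids = np.stack([X[train & (y == c)].mean(axis=0) for c in classes])
        diffs = X[test_idx][:, None, :] - centroids[None, :, :]
        pred = classes[(diffs ** 2).sum(axis=2).argmin(axis=1)]
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))

def mean_impute(X):
    """Stand-in imputer: replace NaN with the column mean of observed values."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

# Synthetic two-class data with 30% of the cells removed at random.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 10)) + 3.0 * y[:, None]
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.3] = np.nan

acc_complete = nearest_centroid_cv_accuracy(X, y)
acc_imputed = nearest_centroid_cv_accuracy(mean_impute(X_missing), y)
print(f"complete data: {acc_complete:.3f}, after imputation: {acc_imputed:.3f}")
```

Any imputation function that returns a complete array can be substituted for `mean_impute`; the smaller the accuracy drop relative to the complete data, the better the imputation serves the downstream classifier. For simplicity the imputer here sees all samples before cross-validation; a stricter protocol would fit it within each training fold.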