Challenges
Machine learning in cancer detection is inherently complex from both a technological and a systemic perspective, and it presents multiple challenges. One of the main challenges is the need to redesign the research pipeline so that cancer growth phenomena can be understood well enough to develop preclinical models. These models have to recognize and handle complex cancers precisely, using innovative and accurate design methods, so that they can serve physicians as a second opinion or an early recognition tool (Saba, 2020). The current status quo lags in practicality and is not yet ready for widespread clinical use. Kourou et al. (2015) argue that, despite suggestions that machine learning classification techniques have advanced enough to support effective decision-making, few have penetrated clinical practice. Omics technologies have advanced the recognition of gene expression signatures, allowing the disease to be better understood and appropriately identified, but machine learning algorithms require more rigorous validation before they are useful in practice.
One of the most fundamental tools for machine learning in cancer detection is imaging, on the premise that prognostic data is embedded in pathology images and that digital pathology can provide big data for clinical research and decision-making. However, this approach faces multiple difficulties, including that diagnostic tools and image analysis methods developed for radiographic images are inadequate for the data density of high-resolution digitized slide images (Madabhushi & Lee, 2016). A study by Blanes-Vidal et al. (2019) examining colorectal capsule endoscopy (CCE) as a machine learning diagnostic tool found that turning the manual identification of polyps from CCE into an automatic operation is inherently difficult and tedious because of conditions such as lighting, infrequent occurrence, and uncommon morphology, size, texture, and color features. This is further supported by Oliveira et al. (2018), who argue that benign and malignant tumors present a wide range of sizes and shapes, and that organ shape (for example, in breast cancer) and tissue distribution differ among people. This diversity makes the design of a single diagnostic platform highly complex.
Technical challenges often lie at the core of machine learning algorithms, which are trained on a dataset that in theory should generalize to new, unseen datasets. Oliveira et al. (2018) describe the process of training machines on subsets of the data but found that performance evaluation in the literature varies in approach and carries some degree of error. This may be due to poor model validation resulting from overfitting the model during the training phase, from model selection, or from contamination of the dataset altogether. Peng et al. (2018) list other technical challenges, such as the scale of deep neural networks. Applications such as biological image processing analyze millions of pixels, and the neurons in the higher layers, especially across multiple connected layers, require enormous computing power. The researchers also note that data access is another critical technical challenge in deep learning. Clinical data is often not appropriately tagged for use in training subsets. In some instances extensive clinical data is unavailable for analysis, and training on data spread across various databases remains a problem (Peng et al., 2018).
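The overfitting concern can be illustrated with a minimal sketch of k-fold cross-validation. It is not taken from any of the cited studies: the one-feature synthetic data, the 1-nearest-neighbour classifier, and all function names are assumptions chosen only because such a model memorizes its training set, so the gap between training accuracy and held-out accuracy becomes visible.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def predict_1nn(train_x, train_y, x):
    """Return the label of the training sample closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def make_toy_data(n=200, seed=1):
    """Hypothetical data: two overlapping classes described by one noisy feature."""
    rng = random.Random(seed)
    ys = [rng.randint(0, 1) for _ in range(n)]
    xs = [rng.gauss(float(y), 1.0) for y in ys]
    return xs, ys

def cross_validate(xs, ys, k=5):
    """Average training and held-out accuracy over k folds."""
    train_scores, test_scores = [], []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        # Scoring on the training samples themselves rewards memorization.
        train_scores.append(accuracy(train_x, train_y, train_x, train_y))
        # Scoring on the held-out fold estimates performance on unseen data.
        test_scores.append(accuracy(train_x, train_y, test_x, test_y))
    return mean(train_scores), mean(test_scores)

xs, ys = make_toy_data()
train_acc, test_acc = cross_validate(xs, ys)
print(f"training accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

Reporting only the training figure would overstate how well the model generalizes, which is one way the evaluation errors described by Oliveira et al. (2018) can arise.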
Finally, it is worth noting some systemic challenges identified by Smiti (2020). The first is ever-changing evidence in the medical field, which can shift rapidly as data analysis leads to new observations that may supersede previous knowledge. This creates a constant need to update databases and adjust algorithms. The second is mixed data: data comes from different sources in different formats, and analytic approaches may be more or less effective depending on the data type. The third is misleading or wrong data, such as noise and outliers. Outliers in medical data are a major challenge because they can be caused by anything from malfunctioning equipment to human error and data variability, and they can mislead automated analysis (Smiti, 2020). All of these present inherent data compatibility and applicability challenges, both statistical and practical, for machine learning algorithms that depend heavily on previously collected data subsets.
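How a single outlier can mislead automated analysis can be shown with a short sketch. Smiti (2020) does not prescribe this technique; Tukey's interquartile-range rule is used here as one common screening heuristic, and the readings are hypothetical values in which 41.0 stands in for a mis-keyed 4.1.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements; 41.0 imitates a data-entry error for 4.1.
readings = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 41.0, 4.0, 3.9]

flagged = iqr_outliers(readings)
cleaned = [v for v in readings if v not in flagged]

print("flagged:", flagged)
print("mean with outlier:", round(statistics.mean(readings), 2))
print("mean without outlier:", round(statistics.mean(cleaned), 2))
print("median with outlier:", statistics.median(readings))
```

The mean moves substantially when the erroneous value is included, while the median barely changes, which is why robust statistics or an explicit screening step are often applied before training. Whether a flagged value is an error or a genuine extreme case still requires clinical judgment.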
Table 1. Papers discussing various challenges to machine learning in cancer detection.
Types of Algorithms Used
Each machine learning algorithm used in cancer detection relies on a well-defined learning technique suited to its purpose. Most machine learning algorithms are developed with one of four primary learning methods, each presenting a distinct learning strategy: supervised, semi-supervised, unsupervised, and reinforcement learning (Smiti, 2020). In supervised learning, the machine is trained on data sets that are labeled with the correct result. Unsupervised learning, by contrast, has no labeled data sets or expected results, which makes the objective more difficult to achieve. Classification is a common supervised method in which historical labeled data is used to develop a model for future predictions. Clinical data is held in large databases of patient records, which can be used to develop classification models so that the algorithm makes inferences based on previous cases (Bazazeh & Shubair, 2016).
A study by Yu et al. (2020) investigated machine learning evaluation of chest computed tomography (CT), which is commonly used in lung cancer detection. They found that the majority of solutions combined image preprocessing, segmentation, and classification modules. The algorithms took advantage of U-Net, VGGNet, and residual networks, which are common in nodule segmentation and transfer learning (Yu et al., 2020). The study examined several top approaches to algorithm development. While the majority used convolutional neural networks (CNNs) that leveraged existing image architectures and transfer learning, some used a tree-based method for cancer versus non-cancer classification (Yu et al., 2020). Tree-based approaches require a predefined set of image features and are effective with small sample sizes. CNNs, meanwhile, “allow data to refine the definition of features” and perform better on larger samples (Yu et al., 2020).
An imaging and matching algorithm was part of a study by Blanes-Vidal et al. (2019) that sought to implement colorectal capsule endoscopy (CCE) as a mass-population, patient-friendly screening tool. To be effective, CCE must support one-to-one polyp matching with the accuracy of a colonoscopy, so that a predictive model can learn the relationship between CCE findings and the histopathologic examination used to exclude malignancy. The researchers used a CNN approach to develop a deep learning algorithm for autonomous detection and localization of colorectal polyps. Using Gower’s similarity coefficient (GSC) to quantify similarity, the algorithm was able to detect polyps based on size, morphology, and location, providing a one-to-one match to colonoscopy imaging of polyps (Blanes-Vidal et al., 2019).
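Gower's similarity coefficient suits this matching task because it averages per-feature similarities across mixed data types. The sketch below shows the general formula only and does not reproduce the study's implementation: the feature names, the assumed 20 mm size range, the equal feature weights, and every polyp record are hypothetical.

```python
def gower_similarity(a, b, numeric_ranges):
    """Gower's similarity between two records with mixed feature types.

    A numeric feature contributes 1 - |a - b| / range, and a categorical
    feature contributes 1 on an exact match and 0 otherwise. All features
    are weighted equally in this sketch.
    """
    scores = []
    for key in a:
        if key in numeric_ranges:
            scores.append(1.0 - abs(a[key] - b[key]) / numeric_ranges[key])
        else:
            scores.append(1.0 if a[key] == b[key] else 0.0)
    return sum(scores) / len(scores)

# Assumed span of polyp sizes in millimetres, used to scale the numeric feature.
ranges = {"size_mm": 20.0}

# Hypothetical polyp seen in capsule endoscopy.
cce_polyp = {"size_mm": 8.0, "morphology": "sessile", "location": "sigmoid"}

# Hypothetical polyps recorded during colonoscopy.
colonoscopy_polyps = {
    "A": {"size_mm": 10.0, "morphology": "sessile", "location": "sigmoid"},
    "B": {"size_mm": 4.0, "morphology": "pedunculated", "location": "sigmoid"},
    "C": {"size_mm": 9.0, "morphology": "sessile", "location": "ascending"},
}

similarities = {
    name: gower_similarity(cce_polyp, polyp, ranges)
    for name, polyp in colonoscopy_polyps.items()
}
best_match = max(similarities, key=similarities.get)

for name, score in similarities.items():
    print(name, round(score, 3))
print("best match:", best_match)
```

Polyp A agrees on morphology and location and differs by only 2 mm, so it receives the highest score. A true one-to-one matching would additionally have to prevent two CCE polyps from claiming the same colonoscopy polyp, which this sketch does not attempt.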
The tree-based approach underlies an algorithm discussed by Bazazeh & Shubair (2016) known as Random Forest (RF). RF compiles multiple decision trees into a ‘forest’, on the premise that a single decision tree yields either an overly simple model or a highly specific one. RF is more stable than a single tree and is insensitive to noise in the input data set, while retaining the ability to handle minority classes. A tumor can therefore be classified as benign or malignant even though malignant tumors represent only 10% of the data set (Bazazeh & Shubair, 2016).
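The ensemble idea can be sketched in a deliberately simplified form. This is not the implementation evaluated by Bazazeh & Shubair (2016): each "tree" here is a single-split decision stump trained on a bootstrap sample with one randomly chosen feature, whereas a full Random Forest grows deeper trees and samples features at every split. The two-feature synthetic data, with 10% of samples labeled malignant, and all function names are assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def majority(labels):
    """Most common label; defaults to 0 when the partition is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else 0

def fit_stump(rows, labels, feature):
    """Pick the threshold on one feature that minimizes weighted Gini impurity."""
    best = None
    for t in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[0]:
            best = (score, t, majority(left), majority(right))
    _, t, left_label, right_label = best
    return feature, t, left_label, right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Classify a row by majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical data: 90 benign samples (label 0) and 10 malignant samples (label 1).
data_rng = random.Random(42)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(90)]
rows += [(data_rng.gauss(4, 1), data_rng.gauss(4, 1)) for _ in range(10)]
labels = [0] * 90 + [1] * 10

forest = fit_forest(rows, labels)
print("typical benign point:", predict_forest(forest, (0.0, 0.0)))
print("typical malignant point:", predict_forest(forest, (5.0, 5.0)))
```

Because each stump sees a different resample of the data, an unlucky split in one tree is outvoted by the others, which is the source of the stability described above. In this toy data the minority class is well separated; real imbalanced data usually calls for further measures such as class weighting.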
A growing trend in scholarly research in recent years has been the application of semi-supervised machine learning techniques. These algorithms use both labeled and unlabeled data for predictions, and reported results show improved estimated performance compared with supervised techniques (Kourou et al., 2015). One of the most common algorithms is the Support Vector Machine (SVM), which in its standard form is a supervised method. According to Bazazeh & Shubair (2016), SVM is the most widely used algorithm in cancer diagnosis and prognosis. It functions by selecting critical samples from each class (the support vectors) and generating a linear function that divides the classes as broadly as possible. The review describes it as follows: “a mapping between an input vector to a high dimensionality space is made using SVM that aims to find the most suitable hyperplane that divides the data set into classes” (Bazazeh & Shubair, 2016, p. 1). Nevertheless, the choice of the most relevant and appropriate algorithm depends on multiple parameters, including but not limited to the type of data, size of data, sample, time limitations, and type of prediction outcome (Kourou et al., 2015).
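The role of the separating hyperplane can be made concrete with a small sketch of the linear case. The weights and bias below are hypothetical and hand-picked rather than learned: training an SVM means finding the weights that maximize the margin, and that optimization step, along with the mapping to a higher-dimensional space mentioned in the quotation, is omitted here.

```python
import math

def decision_value(w, b, x):
    """Value of the linear function w . x + b for a sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def signed_distance(w, b, x):
    """Signed Euclidean distance from x to the hyperplane w . x + b = 0."""
    return decision_value(w, b, x) / math.hypot(*w)

def margin_width(w):
    """Distance between the hyperplanes w . x + b = +1 and w . x + b = -1."""
    return 2 / math.hypot(*w)

# Hypothetical hyperplane parameters, not the result of any training run.
w, b = (3.0, 4.0), -10.0

print(classify(w, b, (4.0, 2.0)))         # lies on the positive side
print(classify(w, b, (1.0, 1.0)))         # lies on the negative side
print(signed_distance(w, b, (4.0, 2.0)))  # distance from the hyperplane
print(margin_width(w))                    # width of the margin band
```

Samples that lie exactly on the two margin hyperplanes are the support vectors; they alone determine where the learned boundary sits, which is why the method is described as selecting critical samples from each class.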
Table 2. Papers discussing types of algorithms or approaches to algorithm development for machine learning in cancer detection.
References
Bazazeh, D., & Shubair, R. (2016). Comparative study of machine learning algorithms for breast cancer detection and diagnosis. 2016 5th International Conference on Electronic Devices, Systems and Applications (ICEDSA), Ras Al Khaimah, United Arab Emirates. Web.
Blanes-Vidal, V., Baatrup, G., & Nadimi, E. S. (2019). Addressing priority challenges in the detection and assessment of colorectal polyps from capsule endoscopy and colonoscopy in colorectal cancer screening using machine learning. Acta Oncologica, 58(sup1), S29–S36. Web.
Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V., & Fotiadis, D. I. (2015). Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, 13, 8–17. Web.
Madabhushi, A., & Lee, G. (2016). Image analysis and machine learning in digital pathology: Challenges and opportunities. Medical Image Analysis, 33, 170–175. Web.
Oliveira, B., Godinho, D., O’Halloran, M., Glavin, M., Jones, E., & Conceição, R. (2018). Diagnosing breast cancer with microwave technology: Remaining challenges and potential solutions with machine learning. Diagnostics, 8(2), 36. Web.
Peng, L., Peng, M., Liao, B., Huang, G., Li, W., & Xie, D. (2018). The advances and challenges of deep learning application in biological big data processing. Current Bioinformatics, 13(4), 352–359. Web.
Saba, T. (2020). Recent advancement in cancer detection using machine learning: Systematic survey of decades, comparisons and challenges. Journal of Infection and Public Health, 13(9), 1274–1289. Web.
Smiti, A. (2020). When machine learning meets medical world: Current status and future challenges. Computer Science Review, 37, 100280. Web.
Yu, K.-H., Lee, T.-L. M., Yen, M.-H., Kou, S. C., Rosen, B., Chiang, J.-H., & Kohane, I. S. (2020). Reproducible machine learning methods for lung cancer detection using computed tomography images: Algorithm development and validation. Journal of Medical Internet Research, 22(8), e16709. Web.