
To provide a medical data visualization analysis tool, machine learning methods are introduced to classify malignant neoplasm of the lung in the medical database MIMIC-III (Medical Information Mart for Intensive Care III, USA). K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest (RF) are selected as the predictive tools. Based on the experimental results, these predictive tools are integrated into a medical data visualization analysis platform. The platform gives doctors a flexible tool for the visual analysis of medical data. Practice indicates that doctors can generate visualization results in a few simple steps and use them to study the data accumulated in a hospital, even if they have had no special training in data analysis.

Medical data mainly include clinical trial data, biomedical data, electronic medical records and diagnosis books, and individual health information [

Sometimes medical data of a particular type need to be grouped into clusters, so that the relationship between each cluster and a disease can be examined. Cluster analysis is an important method in data visualization analysis. Accordingly, three typical machine learning methods, KNN, Support Vector Machine, and Random Forest, are selected as prediction tools for data classification.

K-Nearest Neighbor (KNN) [

$$d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \tag{1}$$

$$d(x, y) = \sum_{k=1}^{n} |x_k - y_k| \tag{2}$$

KNN makes decisions based on the dominant category among k objects, rather than the category of a single object. The KNN algorithm can be described as:

Step 1: Calculate the distance between the test data and each training record;

Step 2: Sort the distances in ascending order;

Step 3: Select the K points with the smallest distances;

Step 4: Determine how often each category occurs among these K points;

Step 5: Return the category with the highest frequency among the K points as the predicted classification of the test data.
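The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiment: the function names and the toy data are hypothetical, and taking the square root in the Euclidean distance is an assumption that does not change the ranking of neighbors.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(train_X, train_y, query, k=3, distance=euclidean):
    # Step 1: distance between the query and every training record.
    dists = [(distance(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 2: sort by increasing distance.
    dists.sort(key=lambda pair: pair[0])
    # Step 3: keep the k nearest records.
    nearest = dists[:k]
    # Step 4: count how often each category occurs among them.
    votes = Counter(label for _, label in nearest)
    # Step 5: return the most frequent category.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups labeled "A" and "B".
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
```

Passing `distance=manhattan` switches to the second distance without changing the rest of the procedure.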

Support Vector Machines (SVM) [

$$\vec{w} \cdot \vec{x} + b = 0 \tag{3}$$

$$\begin{cases} \vec{w} \cdot \vec{x} + b \ge +1 \\ \vec{w} \cdot \vec{x} + b \le -1 \end{cases} \tag{4}$$

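The distance between the two planes is $2 / \|\vec{w}\|$, where $\|\vec{w}\|$ denotes the Euclidean norm. Maximizing this margin for linearly separable data is expressed as the constrained problem (5). When the data are not linearly separable, slack variables $\xi_i$ are introduced and the problem becomes (6).

$$\min \frac{1}{2}\|\vec{w}\|^2 \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 \;\; \forall \vec{x}_i \tag{5}$$

$$\min \frac{1}{2}\|\vec{w}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1 - \xi_i \;\; \forall \vec{x}_i,\; \xi_i \ge 0 \tag{6}$$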
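To make the soft-margin objective concrete, the sketch below evaluates it for a fixed hyperplane. It assumes the sign convention of (3), in which the decision value is the dot product of w and x plus b (writing the offset with a minus sign only flips the sign of b). The function names, the toy points, and the chosen w and b are hypothetical; no optimization is performed.

```python
def decision_value(w, b, x):
    """Value of w . x + b for one sample."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C=1.0):
    # For a fixed (w, b), the smallest slack satisfying the constraint in (6)
    # is xi_i = max(0, 1 - y_i * (w . x_i + b)).
    slacks = [max(0.0, 1.0 - yi * decision_value(w, b, xi)) for xi, yi in zip(X, y)]
    # Objective of (6): half the squared norm of w plus C times the total slack.
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical toy data: the third point lies on the wrong side of the plane.
X = [(2.0, 0.0), (0.0, 0.0), (1.5, 0.0)]
y = [1, -1, -1]
w, b = (2.0, 0.0), -2.0

objective, slacks = soft_margin_objective(w, b, X, y, C=1.0)
```

Here the first two points satisfy the margin constraint with zero slack, while the misclassified third point needs a slack of 2, which C then penalizes in the objective.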

A vector $\vec{x}_i$ can be mapped into a high-dimensional space through a kernel function. Commonly used kernel functions include the linear kernel (7), the polynomial kernel (8), the sigmoid kernel (9), and the Gaussian RBF kernel (10).

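$$K(\vec{x}_i, \vec{x}_j) = \vec{x}_i \cdot \vec{x}_j \tag{7}$$

$$K(\vec{x}_i, \vec{x}_j) = (\vec{x}_i \cdot \vec{x}_j + 1)^d \tag{8}$$

$$K(\vec{x}_i, \vec{x}_j) = \tanh(k\, \vec{x}_i \cdot \vec{x}_j - \delta) \tag{9}$$

$$K(\vec{x}_i, \vec{x}_j) = e^{-\frac{\|\vec{x}_i - \vec{x}_j\|^2}{2\sigma^2}} \tag{10}$$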
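The four kernels in (7) to (10) translate directly into code. The sketch below is illustrative only; the default parameter values (d, k, delta, sigma) are assumptions chosen for the example, not values reported in the paper.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # Kernel (7): plain dot product.
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    # Kernel (8): (x . y + 1) raised to the degree d.
    return (dot(x, y) + 1) ** d


def sigmoid_kernel(x, y, k=1.0, delta=0.0):
    # Kernel (9): tanh(k * x . y - delta).
    return math.tanh(k * dot(x, y) - delta)


def rbf_kernel(x, y, sigma=1.0):
    # Kernel (10): exp(-||x - y||^2 / (2 * sigma^2)).
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2 * sigma ** 2))


# Hypothetical two-dimensional inputs.
u, v = (1.0, 2.0), (3.0, 4.0)
values = {
    "linear": linear_kernel(u, v),
    "polynomial": polynomial_kernel(u, v, d=2),
    "sigmoid": sigmoid_kernel(u, v),
    "rbf": rbf_kernel(u, v, sigma=1.0),
}
```

The RBF kernel equals 1 when both inputs coincide and decays toward 0 as they move apart, with sigma controlling how quickly.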

Random Forest (RF) [

Because medical data range from numeric values to images, the original data may need to be pre-processed before visualization analysis.

MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The public MIMIC-III Critical Care Database can be downloaded from https://mimic.physionet.org/about/mimic/. The database includes medical records such as demographics, vital sign measurements made at the bedside, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). MIMIC-III supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development.

Before the experiment, the relevant laboratory test data were extracted, covering both patients with pulmonary malignant tumor and patients without the disease. The main test items were Anion Gap, Base Excess, Bicarbonate, Calcium (Total), Calculated Total CO_{2}, Chloride, Creatinine, Glucose, Hematocrit, Hemoglobin, Magnesium, MCH, MCHC, MCV, pCO_{2}, pH, Phosphate, Platelet Count, pO_{2}, Potassium, RDW, Red Blood Cells, Sodium, Urea Nitrogen, and White Blood Cells.

In a medical database, some values may be unavailable, so in most cases the database is incomplete. The methods for dealing with incomplete data sets mainly include [

For missing values in the medical database, a box plot can be used as a missing-data analysis tool. For instance, by calling the summary function in R, as shown in

$$f(x_i) = w^T x_i + b, \quad \text{such that } f(x_i) \simeq y_i \tag{11}$$

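This is multivariate linear regression. The least squares method is used to estimate $w$ and $b$, which are absorbed into a single vector $\hat{w} = (w; b)$. The data set $D$ is represented as a matrix $X$ of size $m \times (d + 1)$, as in (12).

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{md} & 1 \end{pmatrix} = \begin{pmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_m^T & 1 \end{pmatrix} \tag{12}$$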

Given $y = (y_1; y_2; \cdots; y_m)$, the squared error is $E_{\hat{w}} = (y - X\hat{w})^T (y - X\hat{w})$. Taking the derivative with respect to $\hat{w}$ gives (13).

$$\frac{\partial E_{\hat{w}}}{\partial \hat{w}} = 2 X^T (X\hat{w} - y) \tag{13}$$

If $X^T X$ is a positive definite matrix, setting the derivative to zero gives $\hat{w}^* = (X^T X)^{-1} X^T y$. If $X^T X$ is not positive definite, a regularization term is introduced.
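The closed-form solution can be illustrated with a small standard-library sketch. It assumes that the regression is used to predict a missing laboratory value from the observed ones; the helper names and the toy numbers are hypothetical, and the small elimination routine stands in for a proper linear algebra library.

```python
def solve_linear_system(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; a regularization term is needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_least_squares(features, targets):
    # Append the constant 1 to every row so that b is absorbed into w_hat.
    X = [list(row) + [1.0] for row in features]
    cols = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(cols)] for i in range(cols)]
    Xty = [sum(r[i] * t for r, t in zip(X, targets)) for i in range(cols)]
    # Solve (X^T X) w_hat = X^T y instead of forming the inverse explicitly.
    return solve_linear_system(XtX, Xty)


def predict(w_hat, row):
    return sum(w * v for w, v in zip(w_hat, list(row) + [1.0]))


# Hypothetical complete records: target = 2 * x1 - x2 + 0.5.
features = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
targets = [0.5, 3.5, 2.5, 5.5]

w_hat = fit_least_squares(features, targets)
imputed = predict(w_hat, (5.0, 5.0))  # fill a record whose target is missing
```

Solving the normal equations directly is adequate for a handful of features; with many correlated laboratory items, the regularized variant mentioned above is the safer choice.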

From MIMIC-III, 5000 records relating to malignant neoplasm of the lung are taken as sample data: 4250 records (85%) form the training set and 750 records (15%) form the test set. In the training set, 1275 patients (30%) have been diagnosed with a lung tumor and 2975 patients (70%) have not. In the test set, there are 225 lung tumor patients (30%) and 525 non-disease patients (70%). KNN, SVM, and RF are each used as the classification tool, so that the influencing factors can be classified and a prediction model extracted.
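The identical 30/70 class ratio in both sets suggests a stratified split, although the paper does not say how the records were divided. Under that assumption, the sketch below reproduces the reported counts; the function name, the label coding (1 for lung tumor, 0 for non-disease), and the seed are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Split record indices so each class keeps the same share in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# 5000 records in total: 1500 lung tumor (1) and 3500 non-disease (0).
labels = [1] * 1500 + [0] * 3500
train_idx, test_idx = stratified_split(labels)

train_positive = sum(labels[i] for i in train_idx)
test_positive = sum(labels[i] for i in test_idx)
```

With these inputs the split yields 4250 training and 750 test records, containing 1275 and 225 lung tumor patients respectively.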

First, KNN is used as the classifier. Calculating distances between features requires converting the nominal features in the data set into a numerical format, so dummy variable encoding (14) is used.

$$\text{gender} = \begin{cases} 1 & \text{if } x = \text{M} \\ 0 & \text{if } x = \text{F} \end{cases} \tag{14}$$
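A minimal sketch of this encoding follows, together with the min-max scaling that the KNN experiment applies to the numeric features. The function names and sample values are hypothetical, and mapping a constant column to zeros is an assumption made only to avoid division by zero.

```python
def encode_gender(value):
    """Dummy variable encoding (14): 1 for "M", 0 for "F"."""
    mapping = {"M": 1, "F": 0}
    return mapping[value]  # raises KeyError for any other value


def min_max_scale(column):
    """Rescale a numeric column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical sample values.
genders = ["M", "F", "M"]
glucose = [80.0, 100.0, 120.0]

encoded = [encode_gender(g) for g in genders]
scaled = min_max_scale(glucose)
```

After scaling, every feature contributes to the distance on the same 0 to 1 range, so a laboratory item with large raw values no longer dominates the neighbor search.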

To reduce the dependence of KNN on the measurement scale of the input features, min-max normalization is applied to the data. The classification results are shown in

For RF, choosing an appropriate value of mtry by testing can improve the accuracy. As shown in

Sensitivity is the proportion of diseased samples that are correctly detected among all diseased samples. Specificity is the proportion of non-diseased samples that are correctly identified among all non-diseased samples. Comparing classification accuracy from

In the variable importance chart, the horizontal axis lists the input indices: Anion Gap, Base Excess, Bicarbonate, Calcium (Total), Calculated Total CO_{2}, Chloride, Creatinine, Glucose, Hematocrit, Hemoglobin, Magnesium, MCH, MCHC, MCV, pCO_{2}, pH, Phosphate, Platelet Count, pO_{2}, Potassium, RDW, Red Blood Cells, Sodium, Urea Nitrogen, and White Blood Cells. The vertical axis represents the importance measure of each input index. The chart shows that pH, Platelet Count, Creatinine, Calculated Total CO_{2}, and pO_{2} are the most important indicators for determining whether a patient suffers from malignant neoplasm of the lung, which can help doctors focus on these indicators when examining patients.

Machine learning methods are introduced to help medical personnel diagnose and treat diseases (the disease selected in our experiment is malignant neoplasm of the lung) and to reveal how different characteristics influence the disease during analysis, as shown in

As shown in Figures 7-9, we provide a visual platform on which doctors can run the algorithms and obtain disease classification results. Doctors can also carry out statistical analysis of the data through the platform and manually control how the data are visualized. This supports intuitive, human-in-the-loop analysis for finding relationships between potential influencing factor(s) and disease or recovery.

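KNN classification results on the test set (by the row totals of 525 and 225, category A is non-disease and category B is lung tumor).

| Amount | Category | Classification rate | Number of A | Number of B |
|---|---|---|---|---|
| 750 | A | 0.9905 | 520 | 5 |
| | B | 0.5156 | 109 | 116 |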

SVM classification results on the test set.

| Amount | Category | Classification rate | Number of A | Number of B |
|---|---|---|---|---|
| 750 | A | 0.9695 | 509 | 16 |
| | B | 0.8000 | 45 | 180 |

RF classification results on the test set.

| Amount | Category | Classification rate | Number of A | Number of B |
|---|---|---|---|---|
| 750 | A | 0.9961 | 523 | 2 |
| | B | 0.9911 | 2 | 223 |

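Comparison of the three classifiers on the test set.

| Metric | KNN | SVM | RF |
|---|---|---|---|
| Sensitivity (category B, lung tumor) | 0.5156 | 0.8000 | 0.9911 |
| Specificity (category A, non-disease) | 0.9905 | 0.9695 | 0.9961 |
| Accuracy | 84.80% | 91.87% | 99.47% |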
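The per-category rates and overall accuracies can be recomputed from the counts in the classification tables. The sketch below reads each table row as the actual category and each "Number of" column as the predicted category; that reading is an assumption, though it agrees with the row totals of 525 and 225. The function and variable names are hypothetical.

```python
def summarize(confusion):
    """confusion[actual][predicted] holds the number of test records."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in confusion)
    # Classification rate of a category: correctly labeled / all records in it.
    rates = {c: confusion[c][c] / sum(confusion[c].values()) for c in confusion}
    return rates, correct / total


# Counts copied from the three classification tables.
knn = {"A": {"A": 520, "B": 5}, "B": {"A": 109, "B": 116}}
svm = {"A": {"A": 509, "B": 16}, "B": {"A": 45, "B": 180}}
rf = {"A": {"A": 523, "B": 2}, "B": {"A": 2, "B": 223}}

results = {name: summarize(m) for name, m in [("KNN", knn), ("SVM", svm), ("RF", rf)]}
for name, (rates, accuracy) in results.items():
    print(f"{name}: rate A={rates['A']:.4f}, rate B={rates['B']:.4f}, "
          f"accuracy={accuracy:.2%}")
```

Recomputing from raw counts is a quick consistency check: the accuracies come out at 84.80%, 91.87%, and 99.47%, matching the reported values.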

For medical data visualization analysis, machine learning methods can serve as both prediction and classification tools. We select three typical methods, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest (RF), as classifiers to predict whether patients suffer from malignant neoplasm of the lung. Considering sensitivity, specificity, and detection accuracy, RF performs best. Using the medical data visualization analysis platform built on these machine learning tools, we further identify pH, Platelet Count, and Creatinine as the most influential factors for the classification results. The platform also provides various chart generators driven by the doctor's query operations, which give doctors an intuitive analysis and help find relationships between potential influencing factor(s) and disease or recovery. The experiment and practice on the MIMIC-III database indicate that the platform offers doctors a flexible medical data visualization analysis tool that requires no data analysis training.

This work was supported by SDUT & Zibo City Integration Development Project (NO. 2016ZBXC049); A Project of Shandong Province Higher Education Science and Technology Program (NO. J16LN20); Natural Science Foundation of Shandong Province (NO. ZR2016FM18).

The authors declare no conflicts of interest regarding the publication of this paper.

Wang, T., Zhao, L., Cao, Y.F., Qu, Z.J. and Li, P.J. (2018) Medical Data Visualization Analysis and Processing Based on Machine Learning. Journal of Computer and Communications, 6, 299-310. https://doi.org/10.4236/jcc.2018.611027