Using Random Forest and Decision Tree to predict the presence of Breast Cancer

Fátima Sánchez
4 min read · Nov 23, 2020

According to the World Cancer Research Fund, breast cancer is the most commonly occurring cancer in women and the second most common cancer overall. In 2018 alone, there were over 2 million new cases. [1]

Breast cancer screening is an important strategy for early detection, which greatly increases the probability of a good treatment outcome. Screening models based on data that can be collected during routine consultations and blood analysis can become an important contribution by offering additional tools to detect this disease. [2]

This article explains how two machine learning models (a Decision Tree and a Random Forest) can be trained on data from Marshall Data Solution’s dataset in order to predict the presence of breast cancer. [3]

Libraries

All the libraries used can be seen below.

# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from joblib import dump, load

Dataset

The dataset contains a Classification column, with ‘0’ indicating a healthy patient and ‘1’ indicating a patient with breast cancer. It also includes 9 variables, all obtained from physical measurements and blood analysis. The dataset can be found here: https://data.world/marshalldatasolution/breast-cancer

# Load the dataset
data = pd.read_csv('B_Cancer.csv')
data.head()
First five rows of the dataset
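
Before modeling, it is worth checking how balanced the two classes are, since overall accuracy is easier to interpret on a roughly balanced dataset. A quick sketch (not in the original code):

# Count patients per class (0 = healthy, 1 = breast cancer)
print(data['Classification'].value_counts())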

In order to see how the variables were related, a correlation matrix was calculated.

corr = data[['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']].corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr, square=True, annot=True, cmap='viridis');
Correlation matrix

The matrix shows that some pairs of variables, such as HOMA and Insulin or HOMA and Glucose, are highly positively correlated. To visualize these relationships, some scatter plots were made.

# Plot HOMA vs Insulin
data.plot(kind="scatter", x="HOMA", y="Insulin", c="Classification", cmap=plt.get_cmap("jet"), colorbar=True)
plt.show()
# Plot HOMA vs Glucose
data.plot(kind="scatter", x="HOMA", y="Glucose", c="Classification", cmap=plt.get_cmap("jet"), colorbar=True)
plt.show()

In the images above, the red dots represent the patients who were positively diagnosed. Notice that the highest values tend to be those in red.

Data Preprocessing

After this quick analysis of the dataset, it was time to train the models. But first, some data preprocessing techniques were applied.

# Select data
X = data.drop(['Classification'], axis=1)
y = data['Classification']
# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The data was scaled using Scikit-learn’s StandardScaler().

# Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
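
Note that the scaler is fit only on the training set and then applied to the test set, so no test-set statistics leak into training. A quick sanity check (not in the original code) confirms the result:

# After scaling, each training feature should have mean ~0 and std ~1
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))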

Afterwards, PCA was applied, keeping 91% of the explained variance. This reduced the initial 9 variables to 6 principal components.

# PCA keeping 91% of the explained variance (6 components)
pca = PCA(.91)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
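
To verify how much variance each retained component carries, the fitted PCA object can be inspected; a minimal sketch:

# Number of components kept and their cumulative explained variance
print('Components kept:', pca.n_components_)
print('Cumulative explained variance:', np.cumsum(pca.explained_variance_ratio_))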

Training a Decision Tree

After the data was cleaned and preprocessed, the first model was trained. This was implemented using Scikit-learn’s DecisionTreeClassifier() with a max_depth of 4 levels.

# Training a Decision Tree
tree_clf = DecisionTreeClassifier(max_depth=4)
tree_clf.fit(X_train, y_train)
# Predicting with the test set
y_pred = tree_clf.predict(X_test)
# Accuracy
print('Accuracy:', accuracy_score(y_test, y_pred)*100)
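
For interpretability, the fitted tree can also be drawn with Scikit-learn’s plot_tree; a sketch (not part of the original code, and note that after preprocessing the features are PCA components rather than the raw variables):

# Visualize the fitted decision tree
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 6))
plot_tree(tree_clf, filled=True, class_names=['Negative', 'Positive'])
plt.show()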

The resulting accuracy on the test fragment of the dataset was 87.5%. To better visualize the true positives, true negatives, false positives, and false negatives, a confusion matrix was displayed.

# Confusion matrix decision tree
LABELS = ['Negative', 'Positive']
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(7, 5))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Confusion matrix Decision Tree
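
For a screening task, false negatives (missed cancers) are the costliest errors, so it can be useful to derive sensitivity and specificity from the confusion matrix instead of relying on accuracy alone. A minimal sketch, assuming the layout above (negative class first):

# Unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = conf_matrix.ravel()
print('Sensitivity (true positive rate):', tp / (tp + fn))
print('Specificity (true negative rate):', tn / (tn + fp))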

Training a Random Forest

The results obtained with the Decision Tree weren’t bad. However, to see whether another model could surpass them, a Random Forest was also implemented, using Scikit-learn’s RandomForestClassifier() with n_estimators set to 10, max_leaf_nodes set to 16, and a max_depth of 5.

# Training a random forest
rnd_clf = RandomForestClassifier(n_estimators=10, max_leaf_nodes=16, n_jobs=-1, max_depth=5)
rnd_clf.fit(X_train, y_train)
# Predicting with the test set
y_pred_rf = rnd_clf.predict(X_test)
# Accuracy
print('Accuracy:', accuracy_score(y_test, y_pred_rf)*100)

The accuracy obtained on the dataset’s test fragment was 91.6%, slightly better than that obtained with the Decision Tree.
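
Since the test fragment of such a small dataset holds only around two dozen samples, a single split’s accuracy can swing by several points. Cross-validation on the training data gives a steadier estimate; a sketch (not part of the original pipeline):

# 5-fold cross-validation on the scaled, PCA-reduced training data
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rnd_clf, X_train, y_train, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
# Ideally the scaler and PCA would be refit inside each fold
# (e.g. with sklearn.pipeline.Pipeline) to avoid any leakage.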

As with the other model, the confusion matrix was displayed to analyze the results in more detail.

# Confusion matrix random forest
LABELS = ['Negative', 'Positive']
conf_matrix = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(7, 5))
sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d")
plt.title("Confusion matrix")
plt.ylabel('True class')
plt.xlabel('Predicted class')
plt.show()
Confusion matrix Random Forest

Conclusions

After training both models, it was observed that the Random Forest predicted with higher accuracy than the Decision Tree. However, both achieved an accuracy of over 85%, suggesting that useful predictions of the presence of breast cancer can be made from data obtained through physical measurements and blood analysis.
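
Finally, the dump and load functions imported from joblib at the beginning can persist the trained model for later use; a minimal sketch (the file names are illustrative). The scaler and PCA objects should be saved too, so that new patient data can be preprocessed identically:

# Persist the trained model and preprocessing objects (example file names)
dump(rnd_clf, 'rnd_clf.joblib')
dump(scaler, 'scaler.joblib')
dump(pca, 'pca.joblib')
# Later: reload the model and predict on identically preprocessed data
model = load('rnd_clf.joblib')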
