How to Predict Breast Cancer Tumor with K-Nearest Neighbors? [Part 2]

Welcome to part 2 of this article. This time we will go through the technical details of KNeighborsClassifier from the Scikit-Learn library.

Remember, last time we cleaned the data and did some exploration. We discovered that the “mitosis” attribute has a strong correlation with the “class” attribute.
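
If you want to reproduce that check quickly, something along these lines should do it. This is just a sketch, assuming df_train is the cleaned pandas DataFrame from part 1:

# Correlation of every numeric attribute with the 'class' column
print(df_train.corr()['class'].sort_values(ascending=False))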

First of all, we define X as the data without the class column and y as the attribute to predict (i.e. the class column).

X = df_train.drop('class', axis=1)
y = df_train['class']

Now, we separate the data into two sets: one for training the model and another one for testing it. To do this, we use the train_test_split function from Scikit-Learn.

Let’s import it:

from sklearn.model_selection import train_test_split

Then, use it:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=53)

This function splits X and y into training and test sets. The “test_size” argument defines the proportion of the data reserved for testing, here 30%. The random_state makes the split reproducible: you can use another random_state, but you will not get exactly the same results.
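
A quick optional sanity check, not part of the original code, is to print the shapes of the four resulting sets; with a 30% test size, roughly 70% of the rows should land in the training set:

# Optional sanity check on the 70/30 split
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)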

Let’s import the model now:

from sklearn.neighbors import KNeighborsClassifier

We define the model as a variable for more readability:

knn = KNeighborsClassifier(n_neighbors=2)

Here we use n_neighbors=2 as a starting point. Note that n_neighbors is the number of neighbouring samples the classifier consults for each prediction, not the number of classes (malignant and benign cells), so other values of k are worth trying.
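
If you are curious how the choice of k affects the results, here is a minimal sketch, not from the original article, that reuses the split above and scores a few values of n_neighbors on the test set:

# Hypothetical search over k: compare test accuracy for several values of n_neighbors
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(k, model.score(X_test, y_test))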

Let’s apply our model to the training data:

knn.fit(X_train,y_train)

And then predict the testing data:

pred = knn.predict(X_test)

Nice, the work is almost done.

We would like to display the results. To do this, we import confusion_matrix and classification_report from Scikit-Learn:

from sklearn.metrics import confusion_matrix, classification_report

Then we print the result:

print('WITH K=2')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

Great! As you can see we have pretty good results: the model makes correct predictions about 92% of the time.
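
If you only want that single accuracy number, accuracy_score from Scikit-Learn returns it directly (an optional extra, not part of the original code):

from sklearn.metrics import accuracy_score

# Overall fraction of correct predictions on the test set
print(accuracy_score(y_test, pred))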

However, is 92% sufficient? What happens if we predict cancer for someone in good health? Worse, what happens if we fail to predict cancer for someone who is sick?
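
These two kinds of mistakes are exactly what the confusion matrix separates. Assuming benign is the negative label and malignant the positive one, its four cells can be unpacked like this:

# tn: healthy correctly identified, fp: healthy predicted as sick,
# fn: sick predicted as healthy (the most dangerous error), tp: sick correctly identified
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print('false positives:', fp, '- false negatives:', fn)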

Data science can be a great help to medicine, but here we still need physicians to confirm the results.

We can get better results using other models, but that will be covered in another post.

I hope you enjoyed this article, and if you have questions, do not hesitate to ask them in a DM.

See you soon!