How to Predict Breast Cancer Tumors with K-Nearest Neighbors [Part 1]

For Breast Cancer Awareness Month, I decided to document a Python script that I wrote in August. It predicts breast tumors from a cell’s characteristics. You can do it yourself by following this article. Do not hesitate to DM me your questions. This article is about data cleaning and exploration. The data are available on Kaggle here:


First, we need to import some libraries:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Data exploration

Then we load the training data into a pandas DataFrame.

df_train = pd.read_csv("../dataset/train.csv")

We can see the head of the data:


As you can see, there are several attributes, and the Id column is not set as the index. Let’s fix that.

df_train.set_index('Id', inplace=True)
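To try these first steps without downloading anything, here is a minimal sketch on a small made-up frame standing in for train.csv. The values, the clump_thickness column, and the 0/1 class encoding are assumptions for illustration; only Id, bare_nuclei, and class are named in this article.

```python
import pandas as pd

# Made-up rows standing in for train.csv; values are illustrative only.
toy = pd.DataFrame({
    "Id": [101, 102, 103, 104, 105, 106],
    "clump_thickness": [5, 3, 8, 1, 4, 7],  # hypothetical column name
    "bare_nuclei": [1, 2, 10, -3, 1, 9],
    "class": [0, 0, 1, 0, 0, 1],            # assumed encoding: 0 benign, 1 malignant
})

# head() returns the first five rows by default.
first_rows = toy.head()
print(first_rows)

# Without inplace=True, set_index returns a new frame and leaves 'toy' untouched.
indexed = toy.set_index("Id")
print(indexed.head())
```

Returning a new frame instead of using inplace=True keeps the raw data around, which is handy if you want to compare before and after.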

Let’s display some summary information about the dataset, as usual:
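As a minimal sketch of such a summary, assuming it comes from describe() (the article does not say which call was used), the block below runs it on a small made-up frame. The values and the 0/1 class encoding are invented for illustration.

```python
import pandas as pd

# Made-up data with one deliberately invalid (negative) bare_nuclei value.
toy = pd.DataFrame(
    {"bare_nuclei": [1, 2, 10, -3, 1, 9], "class": [0, 0, 1, 0, 0, 1]},
    index=pd.Index([101, 102, 103, 104, 105, 106], name="Id"),
)

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = toy.describe()
print(summary)

# The 'min' row is where an out-of-range value shows up.
print(summary.loc["min"])

# info() complements this with column dtypes and non-null counts.
toy.info()
```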


You should notice a strange value in the bare_nuclei column: its minimum is negative, but this attribute describes a length, so it cannot be negative. To fix this, we redefine our dataset, keeping only the rows where bare_nuclei is at least 1. The other attributes look good, so we will leave them as they are.

df_train = df_train[df_train['bare_nuclei']>=1]
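Before dropping rows, it can be worth checking how many are affected. Here is a minimal sketch of the same filter on made-up data (the values and the 0/1 class encoding are assumptions), with the boolean mask kept in its own variable so that the dropped rows can be counted.

```python
import pandas as pd

# Made-up data with one invalid bare_nuclei value.
toy = pd.DataFrame({
    "bare_nuclei": [1, 2, 10, -3, 1, 9],
    "class": [0, 0, 1, 0, 0, 1],  # assumed encoding: 0 benign, 1 malignant
})

# Boolean mask: True for the rows we keep.
valid = toy["bare_nuclei"] >= 1
cleaned = toy[valid]

# Report how much data the filter removes.
print(f"dropped {(~valid).sum()} of {len(toy)} rows")
print("new minimum:", cleaned["bare_nuclei"].min())
```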

Nice, the data no longer contain invalid values.

Dealing with unbalanced data

In machine learning, we generally want balanced data to get good results. This means we do not want a dataset with, say, 10% malignant cells and 90% benign ones (as could be the case in reality). If our model learns from unbalanced data, it will tend to give unbalanced results, favoring the majority class. So now we will check that this is not the case by plotting the distribution of the class.

sns.countplot(x='class', data=df_train)

As you can see, our data look pretty balanced. If that were not the case, we would need to rebalance the dataset, for example by extracting a balanced subset.
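The balance can also be checked numerically. The sketch below uses made-up, deliberately unbalanced labels (the 0/1 encoding is an assumption) and shows one possible way to rebalance, by downsampling each class to the size of the smallest one. This is only one option among several, and groupby(...).sample requires pandas 1.1 or later.

```python
import pandas as pd

# Made-up, deliberately unbalanced data (assumed encoding: 0 benign, 1 malignant).
toy = pd.DataFrame({
    "bare_nuclei": [1, 1, 2, 1, 3, 1, 2, 1, 10, 9],
    "class":       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Share of each class; for two classes, values near 0.5 mean balanced data.
shares = toy["class"].value_counts(normalize=True)
print(shares)

# Downsample every class to the size of the smallest one.
smallest = toy["class"].value_counts().min()
balanced = toy.groupby("class").sample(n=smallest, random_state=0)
print(balanced["class"].value_counts())
```

Downsampling throws data away, so on a small dataset it is worth weighing against alternatives such as oversampling the minority class.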

Correlation matrix

Now we will continue our exploration by plotting the correlation matrix. This is very useful for seeing which attributes are most strongly correlated with the class attribute.

correlation = df_train.corr()
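The line above only computes the matrix. Here is a minimal sketch of how it could be drawn and read, assuming seaborn's heatmap with its default colormap (the exact plotting call used for the figure is not shown in this article). The data are made up: bare_nuclei is built to follow the class, while noise is a hypothetical unrelated attribute.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: 'bare_nuclei' depends on the class, 'noise' does not.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)        # assumed encoding: 0 benign, 1 malignant
toy = pd.DataFrame({
    "bare_nuclei": 1 + 8 * labels + rng.integers(0, 2, size=200),
    "noise": rng.integers(1, 11, size=200),  # hypothetical unrelated attribute
    "class": labels,
})

# Pairwise Pearson correlation between all numeric columns.
correlation = toy.corr()

# Absolute correlation of each attribute with the class, strongest first.
ranking = correlation["class"].drop("class").abs().sort_values(ascending=False)
print(ranking)

# Draw the matrix; the color bar on the right maps colors to values.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(correlation, annot=True, fmt=".2f", ax=ax)
fig.tight_layout()
fig.savefig("correlation_matrix.png")
```

Sorting the class column of the matrix gives the same information as the plot in a form that is easier to scan when there are many attributes.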

Pretty nice, no? Now, how do we interpret it?

The legend on the right shows the correlation scale: white means a perfect correlation, and dark purple means there is no correlation. We are not interested in the diagonal because it only shows the correlation of each attribute with itself. As you can see, the matrix is symmetric, which means we can focus on the part below the diagonal.

In the next part of this series, we will see how to create a model using all the attributes, but in the end, to get better results, we could focus our efforts on specific attributes.

That’s it for this one. I hope you enjoyed it. DM me your questions and remarks. Next time, we will implement the K-Nearest Neighbors algorithm with Scikit-Learn. See you soon!