K-nearest neighbours algorithm in R

The k-nearest neighbours algorithm (k-NN) is a method used for classification and regression. This post shows how to implement a simple classification solution in the R programming language.

Figure: K-nearest neighbors algorithm. Image source: Wikipedia

How k-NN works

k-NN is a lazy learning algorithm: rather than building a model up front, it stores the training examples and defers all computation until a prediction is requested. In this post we will analyze a rain forecast dataset; 5 of its 100 examples are shown below:

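| Temperature | Humidity | Wind Speed | Rain |
|-------------|----------|------------|------|
| 6           | 85       | 30         | no   |
| 14          | 90       | 35         | no   |
| 15          | 86       | 8          | yes  |
| 21          | 56       | 15         | yes  |
| 17          | 67       | 9          | yes  |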

k-NN makes predictions by comparing the features of a new query with the features of the examples stored in the dataset. Each example is a point in a feature space with one dimension per feature (here Temperature, Humidity and Wind Speed), and closeness between points is what drives the prediction. To picture this in 2 dimensions, the graphic below collapses the three features into a single value: it shows 5 points where the "y" axis is the weighted sum of the features (Temperature, Humidity, Wind Speed) and the "x" axis is the outcome, "yes" or "no":

Figure: Weighted sum of the features (Temperature, Humidity, Wind Speed)

Similarity measure and Majority Voting

k-NN finds the k closest neighbours of a given input; a common way to measure closeness is the Euclidean distance. Whether a new forecast query is classified as "yes" or "no" is decided by a vote among the k nearest neighbours: if the majority of them are labelled "yes" (1), the outcome is "yes"; otherwise it is "no". Choosing an odd k is not required but highly recommended, because with two classes it rules out tied votes.
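
The distance and voting steps described above can be sketched from scratch as follows. This is only an illustration: the helper names `euclidean_distance` and `knn_predict` are hypothetical, and the training rows are made-up, already normalized values rather than rows from the rain dataset.

```r
# Hypothetical helpers for illustration; they are not part of the post's code.
euclidean_distance <- function(a, b) {
    # square root of the summed squared differences between two points
    sqrt(sum((a - b)^2))
}

knn_predict <- function(train_features, train_labels, query, k = 3) {
    # distance from the query to every training row
    distances <- apply(train_features, 1, euclidean_distance, b = query)
    # indices of the k smallest distances
    nearest <- order(distances)[1:k]
    # majority vote among the labels of those neighbours
    votes <- table(train_labels[nearest])
    names(votes)[which.max(votes)]
}

# Made-up, normalized examples (Temperature, Humidity, Wind Speed)
train_features <- matrix(c(0.1, 0.9, 0.8,
                           0.2, 0.8, 0.9,
                           0.8, 0.3, 0.1,
                           0.9, 0.2, 0.2,
                           0.7, 0.4, 0.3),
                         ncol = 3, byrow = TRUE)
train_labels <- c("no", "no", "yes", "yes", "yes")

prediction <- knn_predict(train_features, train_labels, c(0.75, 0.35, 0.2), k = 3)
print(prediction)
```

With k = 3 the three closest rows to this query all carry the label "yes", so the vote is unanimous. A query placed near the first two rows would instead collect two "no" votes against one "yes".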

The image below shows how a red point A is closer to "no" or 0 and a blue point B is closer to 1 or "yes":

Figure: Red point A is closer to "no" (0) and blue point B is closer to "yes" (1)

k-NN implementation

Loading input file

The functions below will load and process the dataset file:

    load_dataset <- function (file_name) {
        # reads a csv file that has no header row; columns get the names V1, V2, ...
        dataset <- read.csv(file_name, header=FALSE)
        return (dataset)
    }
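
To see what `read.csv` returns when `header=FALSE`, here is a small sketch that writes two made-up rows to a temporary file and reads them back. The file layout (three numeric columns followed by yes/no) is an assumption based on the sample rows shown earlier, and `csv_path` and `sample_rows` are hypothetical names.

```r
# Write two made-up rows to a temporary file in the assumed layout
csv_path <- tempfile(fileext = ".csv")
writeLines(c("10,80,30,no", "20,60,10,yes"), csv_path)

# Without a header row, R names the columns V1, V2, V3, V4
sample_rows <- read.csv(csv_path, header = FALSE)
print(names(sample_rows))
str(sample_rows)
```

Because the columns have no meaningful names, the rest of the code refers to them by position (1 to 4).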

Normalizing data and Reordering Randomly

The values in the dataset lie in different ranges, so a feature with a large range would dominate the distance calculation. The next function applies min-max normalization to rescale every feature to a value between 0 and 1, and turns "yes" and "no" into 1 and 0. After normalizing, it returns a randomly reordered version of the dataset:

    normalize_dataset <- function (dataset) {

        normalize_number <- function (column) {
            # applies min-max normalization using the formula
            # z_i = (x_i - min(x)) / (max(x) - min(x))
            col_max <- max(column)
            col_min <- min(column)
            return ((column - col_min) / (col_max - col_min))
        }

        normalize_yes_no <- function (column) {
            # turns "yes" into 1 and anything else into 0
            return (as.numeric(column == "yes"))
        }

        # [[i]] extracts column i as a plain vector
        dataset[[1]] <- normalize_number(dataset[[1]])
        dataset[[2]] <- normalize_number(dataset[[2]])
        dataset[[3]] <- normalize_number(dataset[[3]])
        dataset[[4]] <- normalize_yes_no(dataset[[4]])

        randomize_dataset <- function (dataset) {
            # reorders the rows randomly
            rnumbers <- runif(nrow(dataset))
            return (dataset[order(rnumbers), ])
        }

        return (randomize_dataset(dataset))
    }
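
To make the min-max formula concrete, here is a small worked sketch. The `temperature` and `rain` vectors are made-up values used only for illustration; they are not taken from the dataset.

```r
# Made-up values, used only for illustration
temperature <- c(10, 20, 15, 30, 25)
rain <- c("no", "yes", "no", "yes", "yes")

# min-max normalization: z_i = (x_i - min(x)) / (max(x) - min(x))
scaled_temperature <- (temperature - min(temperature)) /
    (max(temperature) - min(temperature))
print(scaled_temperature)   # 0.00 0.50 0.25 1.00 0.75

# "yes" becomes 1 and "no" becomes 0
rain_numeric <- as.numeric(rain == "yes")
print(rain_numeric)         # 0 1 0 1 1
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between, so every feature contributes on the same scale when distances are computed.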

Training Example VS Test Example

We will split the dataset into 2 datasets: one for training and another for testing. It is considered good practice to keep at least 10% of the dataset for testing purposes; in this example we will use 15%. Because we have 100 examples, the test dataset will contain 15 of them and the training dataset the remaining 85:

    # dataset is the normalized, shuffled data frame returned by normalize_dataset
    dataset_train <- dataset[1:85,]
    dataset_test <- dataset[86:100,]

Now we need to create target vectors for the train and test datasets; each target holds the forecast ("yes" -> 1, "no" -> 0), taken from column 4:

    dataset_train_target <- dataset[1:85, 4]
    dataset_test_target <- dataset[86:100, 4]
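
As a variation, the split sizes can be derived from the test fraction instead of being hard-coded. The sketch below is an assumption-laden illustration: `dataset` is a random stand-in for the normalized data frame, and `test_fraction`, `train_features` and the other names are hypothetical. It also keeps only columns 1 to 3 as features, so the label stays out of the inputs.

```r
# Random stand-in for the normalized, shuffled data frame (not the real data)
set.seed(42)
dataset <- data.frame(V1 = runif(100), V2 = runif(100), V3 = runif(100),
                      V4 = sample(c(0, 1), 100, replace = TRUE))

test_fraction <- 0.15
n <- nrow(dataset)
n_test <- round(n * test_fraction)   # 15 rows for testing
n_train <- n - n_test                # 85 rows for training

train_rows <- 1:n_train
test_rows <- (n_train + 1):n

# features (columns 1 to 3) and targets (column 4) kept separate
train_features <- dataset[train_rows, 1:3]
test_features <- dataset[test_rows, 1:3]
train_target <- dataset[train_rows, 4]
test_target <- dataset[test_rows, 4]
```

Taking the first rows for training is only safe because the dataset was shuffled beforehand; on an ordered file the two parts could end up with very different class proportions.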

We can call the algorithm with the knn function from the class package, passing the datasets we created previously and a value for k:

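    # loading the package that provides knn
    library(class)
    # note: dataset_train and dataset_test still include column 4 (the Rain label);
    # pass dataset_train[, 1:3] and dataset_test[, 1:3] to use the features only
    model1 <- knn(train=dataset_train, test=dataset_test, cl=dataset_train_target, k=3)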
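
Since k is a free parameter, it can help to try a few odd values and compare them. The following sketch does that on synthetic, well-separated data that merely stands in for the rain dataset; the variable names and the generated values are assumptions, and only the feature columns are passed to `knn`.

```r
library(class)

set.seed(1)
# Synthetic data: two well-separated groups standing in for "no" (0) and "yes" (1)
n <- 100
label <- rep(c(0, 1), each = n / 2)
features <- data.frame(V1 = rnorm(n, mean = label * 5),
                       V2 = rnorm(n, mean = label * 5),
                       V3 = rnorm(n, mean = label * 5))

# shuffle rows so that both groups appear in the train and test parts
shuffled <- sample(n)
features <- features[shuffled, ]
label <- label[shuffled]

train_rows <- 1:85
test_rows <- 86:100

for (k in c(1, 3, 5)) {
    predicted <- knn(train = features[train_rows, ], test = features[test_rows, ],
                     cl = factor(label[train_rows]), k = k)
    # fraction of test examples whose predicted label matches the true one
    accuracy <- mean(as.character(predicted) == as.character(label[test_rows]))
    cat("k =", k, "accuracy =", accuracy, "\n")
}
```

On real data the accuracies usually differ between values of k, and the comparison is more trustworthy when repeated over several random splits.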

The model1 variable will contain the predictions for the examples of the test dataset. Now we can check how well knn predicted the test dataset by comparing its predictions with the original targets:

    result <- table(dataset_test_target, model1)

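The result is the table below, with the actual targets as rows and the predictions as columns. As we can see, there is a perfect match between the dataset and the predicted data: all 12 "no" examples and all 3 "yes" examples fall on the diagonal. Bear in mind that dataset_train and dataset_test as built above still contain the Rain column, so the label itself is among the features and this perfect score is optimistic: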

                       model1
    dataset_test_target  0  1
                      0 12  0
                      1  0  3
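
A single accuracy figure can be read off such a table by dividing the diagonal (correct predictions) by the total. The sketch below rebuilds the table above by hand so that it runs on its own; in the post's workflow `result` would come directly from `table(...)`.

```r
# Counts copied from the table above (rows: actual target, columns: prediction)
result <- matrix(c(12, 0,
                    0, 3),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(dataset_test_target = c("0", "1"),
                                 model1 = c("0", "1")))

# correct predictions sit on the diagonal
accuracy <- sum(diag(result)) / sum(result)
print(accuracy)   # 1
```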

Source code of this example can be found at this link

References