CSCI 374: Homework Assignment #1
k-Nearest Neighbor
Due: 11:59 PM on Wednesday, September 21
You can download the assignment instructions by clicking on this link.
Instructions for using GitHub for our assignments are available on the Resources page of the class website and through this link.
Random Seeds and Randomizing the Data Set
The purpose of using a random seed is to be able to replicate the "randomness" created by the (pseudo-)random number generator used by the computer. That is, when you specify a random seed, the generator produces exactly the same sequence of numbers as on every other run that used that same seed. Different random seeds will generate different sequences of "random" numbers.
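As a quick illustration of this behavior, the following Python sketch draws a few numbers from generators created with the same seed and with a different seed. The helper name `first_draws` and the seed values are placeholders chosen for this example only; they are not part of the assignment.

```python
import random

def first_draws(seed, count=5):
    """Return the first `count` pseudo-random floats produced for `seed`."""
    rng = random.Random(seed)  # independent generator; the global one is untouched
    return [rng.random() for _ in range(count)]

run_a = first_draws(12345)
run_b = first_draws(12345)   # same seed as run_a
run_c = first_draws(54321)   # a different (arbitrary) seed

print(run_a == run_b)  # same seed, so the sequences match
print(run_a == run_c)  # different seed, so the sequences are expected to differ
```

Using a separate `random.Random` object here is a design choice for the example: it keeps each sequence isolated from any other code that also draws random numbers.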
Why are we concerned about this? In science, all experiments should be reproducible. By specifying a random seed, our program can still use randomness to split the data into training and test sets (so that each instance is just as likely as any other to be used for training or testing), and yet we can also verify the results of the experiment by running it again to duplicate its results.
This means that if you run your program a second time with a particular random seed (for a given algorithm and data set), the output should be exactly the same as the first time you ran it with the same random seed. Whether your program does indeed produce the same results each time the same random seed is used might be considered when grading your assignment.
The following code snippets show one way to randomly shuffle a list based on a random seed:
Python
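import random
random.seed(yourRandomSeed)       # fix the seed so the shuffle is repeatable
shuffled = list(yourInstances)    # copy, so the original list keeps its order
random.shuffle(shuffled)          # shuffles the copy in place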
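Building on the shuffle above, one possible way to turn the shuffled list into training and test sets is to cut it at a fixed position. This is only a sketch under stated assumptions: the function name `split_instances`, the `train_fraction` parameter and its default value, and the use of a private `random.Random` object instead of the module-level generator are all choices made for illustration, not requirements of the assignment.

```python
import random

def split_instances(instances, seed, train_fraction=0.75):
    """Shuffle a copy of `instances` with `seed`, then cut it into two lists."""
    rng = random.Random(seed)      # private generator seeded for this split
    shuffled = list(instances)     # copy, so the caller's list is unchanged
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data: the integers stand in for real instances
data = list(range(20))
train, test = split_instances(data, seed=12345)
train_again, test_again = split_instances(data, seed=12345)

print(len(train), len(test))                        # 15 5
print(train == train_again and test == test_again)  # True
```

Calling the function twice with the same seed returns the same two lists, which is the property needed to rerun an experiment and reproduce its results.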
Java
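import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
// Seed the generator so the shuffle is repeatable
Random rng = new Random(yourRandomSeed);
// Copy the list first; "Instance" is a placeholder for your own instance class
List<Instance> shuffled = new ArrayList<>(yourInstances);
Collections.shuffle(shuffled, rng);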