CSCI 374: Homework Assignment #5

Predict the Author (Text Classification with Naive Bayes)
Due: 11:59 PM on Wednesday, November 1

You can download the assignment instructions by clicking on this link.

Instructions for using GitHub for our assignments can be found on the Resources page of the class website, or by following this link.

Preprocessing the Text

I have included code in both Python and Java that handles some of the preprocessing you need to do for this assignment: parsing a long string of text into a list of words, and converting a list of words into a list of their stems. Please note that you will still need to find your own list of stop words (cite where you found the list in both your code and your README) and remove those stop words from the list of words before stemming.

Python


	import StemmingUtil
	
	# parse the text into a list of words
	words = StemmingUtil.parseTokens(lowerCaseText)

	# remove the stop words
	''' Your code goes here '''

	# convert the words to their stems
	stems = StemmingUtil.createStems(words) 	
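
The stop word removal step that the snippet above leaves open could look roughly like the following. This is only a minimal sketch, not the required solution: the function name `remove_stop_words` and the tiny `EXAMPLE_STOP_WORDS` list are hypothetical and exist purely for illustration, and it assumes the words have already been lowercased and tokenized. You still need to find (and cite) a real stop word list.

```python
def remove_stop_words(words, stop_words):
    """Return the words that do not appear in stop_words, preserving order."""
    # A set makes each membership check constant time on average,
    # which matters when filtering the words of an entire book.
    stop_set = set(stop_words)
    return [word for word in words if word not in stop_set]


# Hypothetical, deliberately tiny list for illustration only;
# a real stop word list is much longer and must be cited.
EXAMPLE_STOP_WORDS = ["the", "a", "an", "of", "and", "to", "in"]

words = ["the", "call", "of", "the", "wild"]
filtered = remove_stop_words(words, EXAMPLE_STOP_WORDS)
print(filtered)  # ['call', 'wild']
```

Filtering before stemming, as the instructions require, means the stop word list can contain ordinary unstemmed words.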
								

Java


	import java.util.List;

	import edu.oberlin.csci374.StemmingUtil;

	// parse the text into a list of words
	List<String> words = StemmingUtil.parseTokens(lowerCaseText);

	// remove the stop words
	/* Your code goes here */

	// convert the words into their stems
	List<String> stems = StemmingUtil.createStems(words);