In class, we have been discussing how graphs can be
used to
represent relationships between objects. For this assignment,
you will be writing a program that allows you to play the Kevin
Bacon Game.
A person's Bacon
Number is computed based on the number of
movies of separation between that person and the actor Kevin
Bacon. For
example, if you are Kevin Bacon, then your Bacon Number is 0. If
you
were
in a movie with Kevin Bacon, your number would be 1. If you
weren't in
a
movie with Kevin Bacon, but were in a movie with someone who was, your
Bacon Number would be 2. In short, your Bacon Number is one
greater
than
the smallest Bacon Number of any of your co-stars.
Bacon numbers can be generalized to the notion of bacon distance. The bacon
distance between any two actors is the minimum number of movies of
separation between those two actors.
Note that this is a takeoff on Erdös numbers, and the two can be combined to form the more elusive Erdös-Bacon number.
For fun and some additional background, you can try out the Oracle of Bacon at the University
of Virginia.
You will be writing a class called BaconNumber (or just Bacon) that will read
a data
file and allow you to interactively query the system for the Bacon
Number
and path for any actor in the database. The program should
require a
single
argument which is the filename containing the information on people and
the
movies they appeared in. An optional second argument can be used
to
specify the initial center.
(The center is the actor from which your program will be computing
bacon distances.) After reading in the data, the
program
should then prompt the user for commands until an end-of-file (ctrl-D)
is
reached (hasNextLine() will return false).
You can open a Scanner for a file with a statement such as:
Scanner s = new Scanner(new File(<filename>));
% java -Xmx2g BaconNumber ~jdonalds/151/imdb.full.txt
# plays the game with the full data set centered at "Kevin Bacon"
% java -Xmx2g BaconNumber ~jdonalds/151/imdb.pre1950.txt "Bela Lugosi"
# plays the game using only movies produced prior to 1950, with the center set to "Bela Lugosi"
% java -Xmx2g BaconNumber ~jdonalds/151/imdb.no-tv-v.txt
# plays the game with the no TV/V data set centered at "Kevin Bacon"
The movie data file (downloaded from www.imdb.com) contains information on what movies an actor appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:
<performer name>|<movie title>
The vertical bar character '|' can be used to determine where the name
ends and the title begins. There will only be one '|' on a line
and
there
are no empty names or titles. java.lang.String
has a number of methods that can
be used to divide up the line. (e.g., split("\\|"))
I have supplied several data files of varying sizes for you to work with.
You can set up a symbolic link to any of these files in your own
directory with the "ln" command. For example, to create a
symbolic link to the small database, just give the command
ln
-s ~jdonalds/151/imdb.small.txt
Then you can use the simple name "imdb.small.txt" to refer to the
file.
With other than the small database, you'll almost certainly need to
increase the amount of memory allowed via the -Xmx argument
to the "java" command (as I did in the sample command lines above).
Your program should read in the specified file and in the default case, choose "Kevin Bacon" as the initial center. There are a number of commands you are to support in order to query the database and change the center.
find <name>
Find the shortest path from the current center to <name>. The output should be of the format
Kevin Bacon -> <movie1> -> <name1> -> <movie2> -> ... -> <name> (n)
where <name> is the person specified by the user and the movies and actors in between show the path from the current center to that actor. The '(n)' should indicate the Bacon Number. For example, "find James Earl Jones" in the "full" database yields
Kevin Bacon -> Magic 7, The (2008) (TV) -> James Earl Jones (1)
and in the "no-tv-v" set:
Kevin Bacon -> Air I Breathe, The (2007) -> Clark Gregg -> Clear and Present Danger (1994) -> James Earl Jones (2)
Note that your links may differ, but the path length should be the same.
If the target name is not found in the database, write a message such as
<name> does not appear in the database
If the target name is disconnected from the center simply print
<name> is unreachable
recenter <name>
Change the center to the given name if it exists in the database. If the name is not found, print an appropriate message and do not change the center.
avgdistance
Calculate the average bacon distance to the current center,
averaging over all nodes reachable from the center. Your output
should be
in the following format:
<avg><tab><name><space>(<number reachable>,<number unreachable>)
In the top250 database, I get the following
3.7097815064073014 Kevin Bacon (10847,692)
and in the "no-tv-v" set I get
2.8042165574501663 Kevin Bacon (1282468,48668)
topcenter <n>
For each actor in the current connected component (i.e., the
one containing the current center), calculate the average bacon
distance to all actors in that component. (NOTE: this can take a
very long time on larger data
sets.) Then print a table of the n best centers (i.e., the ones
whose average bacon distance is the smallest).
Try adding some additional commands to your application. Here are a few suggestions. (If you implement any extra commands, document them in your README file. Be sure to explain what each command does and how to use it.)
The longest Bacon Number I found in the 'imdb.no-tv-v.txt' dataset for
Kevin
Bacon was 8 ("Jessica Drizd" and others). "Kevin Bacon" has a
center
value of
~2.804 while "Sean Connery" has ~2.645 indicating that he is a better
center
than Kevin Bacon. The Oracle of Bacon has a top 1000 list of
centers
which
could be used to search for better values.
A good
way
to represent the relationship between actors and movies is with a
graph.
There
are a number of ways in which this can be done; however, if you want to
maintain a simple graph you may want to use vertices to represent
both actors and movies and use edges to represent the "appeared in"
relationship. In this case, the bacon distance between two actors
is half the number of edges on the shortest path between them.
Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to write a main method with some test cases.
Considering the size of some of the data files you will be working with, it is important to choose algorithms that are efficient. If you decide to use algorithms or data structures from an outside source (such as our textbook), be sure to give proper credit in the comments of your Java files.
Actors with duplicate names are listed in imdb with the name followed by an index (expressed as a Roman numeral) in parentheses, such as
Doris Day (I)
You can improve your results by appending (I) to a name and retrying
it, if the name is not found in the database.
Use handin to submit the following files:
I affirm that I have adhered to the Honor Code in this assignment.
If you work with a partner, just submit one solution per team.
Information courtesy of The Internet Movie Database
(http://www.imdb.com/).
Used with permission. The data should only be used for personal
and
non-commercial purposes.
Thanks to Ben Kuperman for helping to reformat the imdb
files.