In this lab, you will use graphs to model the (very large) network of actors and their movies, from which you will extract a lot of interesting information.
The purpose of this lab is to:
In this lab, you will write a program that plays the "Kevin Bacon Game". A person's "Bacon Number" is computed based on the number of movies of separation between that person and the actor Kevin Bacon. For example, if you are Kevin Bacon, then your Bacon Number is 0. If you were in a movie with Kevin Bacon, your number would be 1. If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2. In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.
For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.
As you know, graphs can be used to model a set of objects and relationships between those objects. For this lab, the objects in our graph are of two types: actors and movies; the relationships are whether an actor was in a given movie. In particular, we'll have a bipartite graph, that is, a graph where the vertices can be partitioned into two sets X and Y, and each edge connects an element of X (say, an actor) to an element of Y (say, a movie). For example, here is a very small subgraph of imdb's actor-movie bipartite graph:
This graph represents the three actors Kevin Bacon, John Malkovich, and Christian Bale, and the three movies Queens Logic, Empire of the Sun, and Batman Returns. The edges keep track of which actors were in which movies.
Here we have used an undirected graph, so the path length between Kevin Bacon and some other actor X will be double X's Bacon Number: each step from an actor to a co-star crosses two edges (actor to movie, then movie to actor). Thus, if you decide to represent the information in this way, you would need to divide the path length by 2 or use weights of 0.5 for the edges in order to make the correct computations.
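To make the halving step concrete, here is a minimal sketch of a breadth-first search over an undirected actor-movie graph. The class name BipartiteBfsSketch and its methods are hypothetical (this is not the required design), and the edges added in main are assumed for illustration only.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: BFS on an undirected actor-movie graph. */
final class BipartiteBfsSketch {
    // One adjacency map for both actors and movies (assumes names never collide).
    private final Map<String, List<String>> adj = new HashMap<>();

    /** Adds an undirected edge between an actor and a movie. */
    void addEdge(String actor, String movie) {
        adj.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        adj.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    /** Number of edges from center to every vertex reachable from it. */
    Map<String, Integer> edgeDistances(String center) {
        Map<String, Integer> dist = new HashMap<>();
        if (!adj.containsKey(center)) {
            return dist;
        }
        Queue<String> queue = new ArrayDeque<>();
        dist.put(center, 0);
        queue.add(center);
        while (!queue.isEmpty()) {
            String v = queue.remove();
            for (String w : adj.get(v)) {
                if (!dist.containsKey(w)) {
                    dist.put(w, dist.get(v) + 1);
                    queue.add(w);
                }
            }
        }
        return dist;
    }

    /** Bacon Number of actor relative to center, or -1 if unreachable. */
    int baconNumber(String center, String actor) {
        Integer edges = edgeDistances(center).get(actor);
        // Actor-to-actor paths alternate actor, movie, actor, ... so halve the edge count.
        return edges == null ? -1 : edges / 2;
    }

    public static void main(String[] args) {
        BipartiteBfsSketch g = new BipartiteBfsSketch();
        // Assumed edges, for illustration only.
        g.addEdge("Kevin Bacon", "Queens Logic");
        g.addEdge("John Malkovich", "Queens Logic");
        g.addEdge("John Malkovich", "Empire of the Sun");
        g.addEdge("Christian Bale", "Empire of the Sun");
        System.out.println(g.baconNumber("Kevin Bacon", "Christian Bale")); // prints 2
    }
}
```

The path to Christian Bale has four edges, which halves to a Bacon Number of 2.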
Another representation could create a directed graph and weight the edges from actors to movies as 0 and from movies to actors as 1. Then, using Dijkstra's algorithm, you could find the shortest paths from Kevin Bacon, and without modification each path length would be that actor's Bacon Number.
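The directed, weighted alternative might look roughly like the following sketch, which runs Dijkstra's algorithm with a java.util.PriorityQueue. The names WeightedBaconSketch, addAppearance, and distancesFrom are hypothetical, and the edges in main are assumed for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch: Dijkstra on a directed graph with 0/1 edge weights. */
final class WeightedBaconSketch {
    private static final class Edge {
        final String to;
        final int weight;
        Edge(String to, int weight) { this.to = to; this.weight = weight; }
    }

    private static final class Entry {
        final String vertex;
        final int dist;
        Entry(String vertex, int dist) { this.vertex = vertex; this.dist = dist; }
    }

    private final Map<String, List<Edge>> adj = new HashMap<>();

    /** Adds actor -> movie with weight 0 and movie -> actor with weight 1. */
    void addAppearance(String actor, String movie) {
        adj.computeIfAbsent(actor, k -> new ArrayList<>()).add(new Edge(movie, 0));
        adj.computeIfAbsent(movie, k -> new ArrayList<>()).add(new Edge(actor, 1));
    }

    /** Shortest weighted distance from center to every reachable vertex. */
    Map<String, Integer> distancesFrom(String center) {
        Map<String, Integer> dist = new HashMap<>();
        PriorityQueue<Entry> pq =
            new PriorityQueue<>((a, b) -> Integer.compare(a.dist, b.dist));
        dist.put(center, 0);
        pq.add(new Entry(center, 0));
        while (!pq.isEmpty()) {
            Entry cur = pq.poll();
            if (cur.dist > dist.get(cur.vertex)) {
                continue; // stale queue entry; a shorter path was already found
            }
            for (Edge e : adj.getOrDefault(cur.vertex, Collections.emptyList())) {
                int candidate = cur.dist + e.weight;
                Integer known = dist.get(e.to);
                if (known == null || candidate < known) {
                    dist.put(e.to, candidate);
                    pq.add(new Entry(e.to, candidate));
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        WeightedBaconSketch g = new WeightedBaconSketch();
        // Assumed edges, for illustration only.
        g.addAppearance("Kevin Bacon", "Queens Logic");
        g.addAppearance("John Malkovich", "Queens Logic");
        g.addAppearance("John Malkovich", "Empire of the Sun");
        g.addAppearance("Christian Bale", "Empire of the Sun");
        System.out.println(g.distancesFrom("Kevin Bacon").get("Christian Bale")); // prints 2
    }
}
```

With these weights a movie vertex ends up with the same distance as the actor it was reached from, so only the entries for actors are Bacon Numbers.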
In any case, you will need to construct Vertex, Edge, and Graph classes. Unlike previous labs, I will not list the required methods; at this point in the course you can probably figure out what methods and class variables you need. (We've already discussed many of these issues in class, and the textbook is always a decent resource if you're stuck.) If you decide to either use or model part of your implementation off of what is in the book, be sure to give proper credit in the methods or comments at the start of the file.
Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to include test cases in the main method. It will be a lot easier to test your methods now, on small graphs, than later with the large imdb data sets.
Now that you have a basic Graph, create a program / class called Bacon. This program will read a data file of movie and actor listings, and will allow you to interactively query the system for various statistics, such as the Bacon number and path for any actor in the database.
The program requires a single argument, which is the name of the file containing the information on actors and the roles they played in movies. One optional second argument can be used to specify the initial "center" (in case you don't want it to always be Kevin Bacon). For example, here are three sample command line usages:
% java -Xmx2g Bacon imdb.full.txt   # plays the game with the full data set centered at "Kevin Bacon (I)"
% java -Xmx2g Bacon imdb.top250.txt "Christian Bale"   # plays the game with the center set to "Christian Bale"
% java -Xmx2g Bacon http://www.cs.oberlin.edu/~asharp/cs151/labs/imdb.no-tv-v.txt   # plays the game with the no TV/V data set centered at "Kevin Bacon (I)"
After reading in the data, the program should then prompt the user for commands until an end-of-file (CTRL-D) is reached (that is, until hasNextLine() returns false). The commands to be supported are described in more detail later in the lab, but basically the user will be able to query the Bacon number of any given actor, as well as get common graph statistics such as average or maximum degree.
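One possible shape for the prompt loop is sketched below. CommandLoopSketch and run are hypothetical names, and the "dispatch" here only echoes the parsed command word and its argument; your real program would call into your graph instead.

```java
import java.io.PrintStream;
import java.util.Scanner;

/** Hypothetical sketch: read commands until end-of-file. */
final class CommandLoopSketch {
    /** Prompts and reads lines until hasNextLine() is false; returns the number of commands seen. */
    static int run(Scanner in, PrintStream out) {
        int handled = 0;
        out.print("> ");
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (!line.isEmpty()) {
                // Split into the command word and the rest of the line (e.g. an actor's name).
                String[] parts = line.split("\\s+", 2);
                String command = parts[0];
                String argument = parts.length > 1 ? parts[1] : "";
                // Placeholder dispatch: a real program would act on the command here.
                out.println("command=" + command + " argument=" + argument);
                handled++;
            }
            out.print("> ");
        }
        out.println();
        return handled;
    }

    public static void main(String[] args) {
        // Demo input; in the real program you would pass new Scanner(System.in).
        run(new Scanner("find James Earl Jones\n"), System.out);
    }
}
```

Splitting with a limit of 2 keeps multi-word names such as "James Earl Jones" together as a single argument.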
Similar to what you did in past labs, if the filename argument begins with "http:" you should treat it as a URL and read the file from the network. This will enable you to play the game without having to download the entire file. To open a Scanner from a URL, you just need to do something similar to the following:
Scanner s = new Scanner(new URL("http://www.cs.oberlin.edu/").openStream());
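Putting the two cases together, a small helper along these lines could pick between a local file and a URL. DataSourceSketch, isUrl, and open are hypothetical names; only the "http:" prefix test and the Scanner-from-URL call come from this handout.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

/** Hypothetical sketch: open the data source named on the command line. */
final class DataSourceSketch {
    /** True if the argument should be treated as a URL rather than a file name. */
    static boolean isUrl(String source) {
        return source.startsWith("http:");
    }

    /** Returns a Scanner over the URL's contents or over the local file. */
    static Scanner open(String source) throws IOException {
        if (isUrl(source)) {
            return new Scanner(new URL(source).openStream());
        }
        return new Scanner(new File(source));
    }

    public static void main(String[] args) {
        System.out.println(isUrl("imdb.full.txt")); // prints false
        System.out.println(isUrl("http://www.cs.oberlin.edu/")); // prints true
    }
}
```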
The movie data file contains information on what movies an actor appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:
<performer name>|<movie title>
The vertical pipe character '|' can be used to determine where the name ends and the title begins. There will only be one '|' on a line and there are no empty names or titles. java.lang.String has a number of methods that can be used to divide up the line (e.g., s.split("\\|") returns the array of substrings of s that are delimited by the |).
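For instance, a line could be taken apart as in the sketch below. LineParserSketch and parse are hypothetical names, and rejecting malformed lines with an exception is just one option.

```java
/** Hypothetical sketch: split one data line into performer and title. */
final class LineParserSketch {
    /** Returns {performer, title}; throws if the line does not have exactly two fields. */
    static String[] parse(String line) {
        // The pipe is a regex metacharacter, so it has to be escaped.
        String[] fields = line.split("\\|");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = parse("Kevin Bacon (I)|A Few Good Men (1992)");
        System.out.println(fields[0]); // prints Kevin Bacon (I)
        System.out.println(fields[1]); // prints A Few Good Men (1992)
    }
}
```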
I have supplied several data files of varying sizes for you to work with.
Rather than cluttering up your account with these files, you can use the links above as URLs. Alternatively, once you have your lab folder created, you can run 151lab10setup from a lab machine and you'll get symbolic links to the files in the current directory.
Other than the small database, you'll almost certainly need to increase the amount of memory allowed via the -Xmx argument.
Before continuing to the actual commands, I highly recommend getting your program working up to this point. That is, are you correctly reading in the input files? Does your graph have the correct number of vertices and edges? Having these things solid before continuing may save you a lot of trouble down the road.
Now that your program has its data loaded into your graph, you can implement the required commands. Your program should repeatedly prompt the user for one of the commands below, until they choose to quit the program (with CTRL-D).
I am supplying my class files (the Graph.class, Edge.class, Vertex.class, and Bacon.class files) so that, if you have any questions about desired behaviour, you can try my program to see how it behaves. It's not completely debugged, but it should be able to answer many of your questions.
Find the shortest path from the current center to <name>. The output should be of the format
<name1> -> <movie1> -> <name2> -> <movie2> -> ... -> Kevin Bacon (length n)
where <name1> is the person specified by the user and the movies and actors in between show the path from that actor to the current center. The n in '(length n)' should be the Bacon Number. E.g., "find James Earl Jones" in the "full" database yields
James Earl Jones -> The Magic 7 (2009) (TV) -> Kevin Bacon (I) (length 1)
and in the "no-tv-v" set:
James Earl Jones -> A Family Thing (1996) -> Xander Berkeley -> A Few Good Men (1992) -> Kevin Bacon (I) (length 2)
Note that your links may differ from mine, but the path length should be the same.
If someone is disconnected from the center simply print
<name> is unreachable
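As a rough sketch of how the output line could be produced, the code below runs a breadth-first search from the center, records each vertex's parent, and then walks the parent links from the requested name back to the center. PathFinderSketch and find are hypothetical names; the edges in main are taken from the "no-tv-v" example above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: BFS with parent links, then format the path. */
final class PathFinderSketch {
    private final Map<String, List<String>> adj = new HashMap<>();

    void addEdge(String actor, String movie) {
        adj.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        adj.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    /** Returns the formatted path from name to center, or the unreachable message. */
    String find(String center, String name) {
        // parent.get(v) is the vertex from which v was discovered; the center maps to null.
        Map<String, String> parent = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        parent.put(center, null);
        queue.add(center);
        while (!queue.isEmpty() && !parent.containsKey(name)) {
            String v = queue.remove();
            for (String w : adj.getOrDefault(v, Collections.emptyList())) {
                if (!parent.containsKey(w)) {
                    parent.put(w, v);
                    queue.add(w);
                }
            }
        }
        if (!parent.containsKey(name)) {
            return name + " is unreachable";
        }
        StringBuilder sb = new StringBuilder();
        int vertices = 0;
        for (String v = name; v != null; v = parent.get(v)) {
            if (vertices > 0) {
                sb.append(" -> ");
            }
            sb.append(v);
            vertices++;
        }
        // A path with k vertices has k - 1 edges, and two edges per Bacon step.
        return sb + " (length " + (vertices - 1) / 2 + ")";
    }

    public static void main(String[] args) {
        PathFinderSketch g = new PathFinderSketch();
        g.addEdge("James Earl Jones", "A Family Thing (1996)");
        g.addEdge("Xander Berkeley", "A Family Thing (1996)");
        g.addEdge("Xander Berkeley", "A Few Good Men (1992)");
        g.addEdge("Kevin Bacon (I)", "A Few Good Men (1992)");
        System.out.println(g.find("Kevin Bacon (I)", "James Earl Jones"));
    }
}
```

Searching from the center (rather than from the queried actor) means the parent links already point toward the center, so the path comes out in the required order without reversing it.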
Change the center to the given name if it exists in the database (otherwise, leave the center unchanged).
Calculate the average Bacon Number for the given center among all connected actor nodes. Your output should be the following
The average should only be for the nodes reachable from the center. In the small database, I get the following
2.42 Kevin Bacon 0
in the pre1950 database I get
3.11 John Aasen 4725
and in the "no-tv-v" set I get
2.98 Kevin Bacon (I) 98708
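If you keep a count of reachable actors at each distance, the average is a weighted mean, as in this small sketch. AverageDistanceSketch and average are hypothetical names; the counts in main are the small-database distance counts for Kevin Bacon reported later in this handout, and the center itself (distance 0) is included in the average.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: average Bacon Number over reachable actors. */
final class AverageDistanceSketch {
    /** countsByDistance maps a Bacon Number to how many reachable actors have it. */
    static double average(Map<Integer, Integer> countsByDistance) {
        long total = 0;
        long actors = 0;
        for (Map.Entry<Integer, Integer> e : countsByDistance.entrySet()) {
            total += (long) e.getKey() * e.getValue();
            actors += e.getValue();
        }
        return actors == 0 ? 0.0 : (double) total / actors;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> small = new TreeMap<>();
        small.put(0, 1);
        small.put(1, 3);
        small.put(2, 96);
        small.put(3, 51);
        small.put(4, 9);
        small.put(5, 1);
        // 389 total steps over 161 actors
        System.out.println(String.format(Locale.ROOT, "%.2f", average(small))); // prints 2.42
    }
}
```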
Calculate structural statistics for the current graph. You should compute the average degree of all actor nodes, a table listing the number of actors with each degree that is non-zero, and the number of components of the graph.
In the small database, I get the following:
Number of Actors: 161
Average Degree: 11.29
Table of degrees
Degree 1: 130
Degree 2: 1
Degree 5: 1
Degree 10: 1
Degree 22: 1
Degree 25: 1
Degree 28: 1
Degree 30: 1
Degree 31: 1
Degree 32: 1
Degree 33: 1
Degree 36: 1
Degree 39: 2
Degree 42: 1
Degree 49: 1
Degree 50: 2
Degree 51: 1
Degree 54: 1
Degree 55: 1
Degree 56: 2
Degree 57: 1
Degree 60: 1
Degree 63: 1
Degree 66: 1
Degree 101: 2
Degree 112: 1
Degree 114: 1
Degree 218: 1
Number of Components: 1
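These statistics could be computed along the lines of the sketch below. GraphStatsSketch and its methods are hypothetical names, and the sketch assumes an actor's degree is the number of distinct movies adjacent to that actor in the bipartite graph; the tiny graph in main is made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch: degree table, average degree, and component count. */
final class GraphStatsSketch {
    // Sets ignore duplicate actor|movie lines; assumes actor and movie names never collide.
    private final Map<String, Set<String>> adj = new HashMap<>();
    private final Set<String> actors = new HashSet<>();

    void addEdge(String actor, String movie) {
        actors.add(actor);
        adj.computeIfAbsent(actor, k -> new HashSet<>()).add(movie);
        adj.computeIfAbsent(movie, k -> new HashSet<>()).add(actor);
    }

    /** Maps each degree that occurs to the number of actors with that degree. */
    SortedMap<Integer, Integer> degreeTable() {
        SortedMap<Integer, Integer> table = new TreeMap<>();
        for (String actor : actors) {
            table.merge(adj.get(actor).size(), 1, Integer::sum);
        }
        return table;
    }

    double averageDegree() {
        if (actors.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (String actor : actors) {
            sum += adj.get(actor).size();
        }
        return (double) sum / actors.size();
    }

    /** Counts connected components with one BFS per not-yet-seen vertex. */
    int componentCount() {
        Set<String> seen = new HashSet<>();
        int components = 0;
        for (String start : adj.keySet()) {
            if (!seen.add(start)) {
                continue; // already part of an earlier component
            }
            components++;
            Queue<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                for (String w : adj.get(queue.remove())) {
                    if (seen.add(w)) {
                        queue.add(w);
                    }
                }
            }
        }
        return components;
    }

    public static void main(String[] args) {
        GraphStatsSketch g = new GraphStatsSketch();
        // Made-up data, for illustration only.
        g.addEdge("Actor A", "Movie 1");
        g.addEdge("Actor B", "Movie 1");
        g.addEdge("Actor B", "Movie 2");
        g.addEdge("Actor C", "Movie 3");
        for (Map.Entry<Integer, Integer> e : g.degreeTable().entrySet()) {
            System.out.println("Degree " + e.getKey() + ": " + e.getValue());
        }
        System.out.println("Number of Components: " + g.componentCount());
    }
}
```

A TreeMap keeps the table sorted by degree, and only degrees that actually occur get an entry, which matches the "non-zero" requirement.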
Calculate the average Bacon Number for all entries in the database. That is, recenter on every single actor and compute their Bacon number. NOTE: this can take a very (very) long time on larger data sets.
Note that a low score does not necessarily make a good center. For example, one actor's score of 0.5 means that he has only two actor nodes in his component, and a distance of 1 between them. A more appropriate distance scoring would take into account the number of elements in your component, as well as your average distance to all the nodes in that component.
In the small database, I get the following:
Table of distances for Kevin Bacon
Distance 0: 1
Distance 1: 3
Distance 2: 96
Distance 3: 51
Distance 4: 9
Distance 5: 1
Unreachable : 0
You may opt to include additional commands for consideration towards extra credit. For any additional commands you implement, you should document them in the README file. Be sure to explain what each one does and how someone could use it.
Here are some suggestions
Use handin to submit the following files:
If you adhered to the honor code in this assignment, add the following statement to your README file:
I have adhered to the Honor Code in this assignment.
If you work with a partner, just submit one solution per team.
Information courtesy of The Internet Movie Database (http://www.imdb.com/). Used with permission. The data should only be used for personal and non-commercial purposes.
A huuuge thanks to Ben Kuperman for helping to acquire the data sets and parse through them.