Lab 10  The Kevin Bacon Game

Due:  December 14 (midnight)

You may work with one partner for this assignment.

Introduction

In class, we have been discussing how graphs can be used to represent relationships between objects.  For this assignment, you will be writing a program that allows you to play the Kevin Bacon Game.  A person's Bacon Number is computed based on the number of movies of separation between that person and the actor Kevin Bacon.  For example, if you are Kevin Bacon, then your Bacon Number is 0.  If you were in a movie with Kevin Bacon, your number would be 1.  If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2.  In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.

Bacon numbers can be generalized to the notion of bacon distance.  The bacon distance between any two actors is the minimum number of movies of separation between those two actors.

Note that this is a takeoff on Erdös numbers, and the two can be combined to form the more elusive Erdös-Bacon number.

For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.

Program Details

You will be writing a class called BaconNumber (or just Bacon) that will read a data file and allow you to interactively query the system for the Bacon Number and path for any actor in the database.  The program should require a single argument which is the filename containing the information on people and the movies they appeared in.  An optional second argument can be used to specify the initial center.  (The center is the actor from which your program will be computing bacon distances.)  After reading in the data, the program should then prompt the user for commands until an end-of-file (ctrl-D) is reached (hasNextLine() will return false).

You can open a Scanner for a file with a statement such as:

Scanner s = new Scanner(new File(<filename>));

Sample command line usage

% java -Xmx2g BaconNumber ~jdonalds/151/imdb.full.txt 

# plays the game with the full data set centered at "Kevin Bacon"

% java -Xmx2g BaconNumber ~jdonalds/151/imdb.pre1950.txt "Bela Lugosi"

# plays the game using only movies produced prior to 1950, with the center set to "Bela Lugosi"

% java -Xmx2g BaconNumber ~jdonalds/151/imdb.no-tv-v.txt

# plays the game with the no TV/V data set centered at "Kevin Bacon"

File Format

The movie data file (downloaded from www.imdb.com) contains information on what movies an actor appears in.  Every line contains information on one person appearing in one movie.  The lines are formatted as follows:

    <performer name>|<movie title>

The vertical bar character '|' can be used to determine where the name ends and the title begins.  There will only be one '|' on a line and there are no empty names or titles.  java.lang.String has a number of methods that can be used to divide up the line. (e.g., split("\\|"))

I have supplied several data files of varying sizes for you to work with.

You can set up a symbolic link to any of these files in your own directory with the "ln" command.  For example, to create a symbolic link to the small database, just give the command

ln -s ~jdonalds/151/imdb.small.txt

Then you can use the simple name "imdb.small.txt" to refer to the file.

With other than the small database, you'll almost certainly need to increase the amount of memory allowed via the -Xmx argument to the "java" command (as I did in the sample command lines above).

Commands to be supported

Your program should read in the specified file and in the default case, choose "Kevin Bacon" as the initial center.  There are a number of commands you are to support in order to query the database and change the center.

  1. find <name>

    Find the shortest path from the current center to <name>.  The output should be of the format

        Kevin Bacon -> <movie1> -> <name1> -> <movie2> -> ... -> <name> (n)

    where <name> is the person specified by the user and the movies and actors in between show the path from the current center to that actor.  The '(n)' should indicate the Bacon Number.  For example, "find James Earl Jones" in the "full" database yields

        Kevin Bacon -> Magic 7, The (2008) (TV) -> James Earl Jones (1)

    and in the "no-tv-v" set:

        Kevin Bacon -> Air I Breathe, The (2007) -> Clark Gregg -> Clear and Present Danger (1994) -> James Earl Jones (2)

    Note that your links may differ, but the path length should be the same.

    If the target name is not found in the database, write a message such as

        <name> does not appear in the database

    If the target name is disconnected from the center simply print

        <name> is unreachable    
  2. recenter <name>

    Change the center to the given name if it exists in the database.  If the name is not found, print an appropriate message and do not change the center.

  3. avgdistance

    Calculate the average bacon distance to the current center, averaging over all nodes reachable from the center.  Your output should be in the following format:

        <avg><tab><name><space>(<number reachable>,<number unreachable>)

    In the top250 database, I get the following

        3.7097815064073014      Kevin Bacon (10847,692)   

    and in the "no-tv-v" set I get

        2.8042165574501663      Kevin Bacon (1282468,48668)
  4. topcenter <n>

    For each actor in the current connected component (i.e., the one containing the current center), calculate the average bacon distance to all actors in that component.  (NOTE: this can take a very long time on larger data sets.)  Then print a table of the n best centers (i.e., the ones whose average bacon distance is the smallest).

Optional commands

Try adding some additional commands to your application.  Here are a few suggestions.  (If you implement any extra commands, document them in your README file.  Be sure to explain what each command does and how to use it.)

  1. circles - All the actors at a given bacon distance from the center form a circle.  Print out a table showing, for each distance (0, 1, 2, ...), the number of actors in the circle.
  2. connected - Separate the database into its connected components.  Print a table showing, for each component, the number of movies and the number of actors it contains.
  3. longest - print out one of the longest paths to the center
  4. movies <name> - list all outbound edges from a given name

Notes

The longest Bacon Number I found in the 'imdb.no-tv-v.txt' dataset for Kevin Bacon was 8 ("Jessica Drizd" and others).  "Kevin Bacon" has a center value of ~2.804 while "Sean Connery" has ~2.645 indicating that he is a better center than Kevin Bacon.  The Oracle of Bacon has a top 1000 list of centers which could be used to search for better values.

Programming Tips

A good way to represent the relationship between actors and movies is with a graph.  There are a number of ways in which this can be done; however, if you want to maintain a simple graph you may want to use vertices to represent both actors and movies and use edges to represent the "appeared in" relationship.  In this case, the bacon distance between two actors is half the number of edges on the shortest path between them.

Remember that it is best to build and test your program incrementally.  Construct your Graph class and be sure to write a main method with some test cases.

Considering the size of some of the data files you will be working with, it is important to choose algorithms that are efficient.  If you decide to use algorithms or data structures from an outside source (such as our textbook), be sure to give proper credit in the comments of your Java files.

Actors with duplicate names are listed in imdb with the name followed by an index (expressed as a Roman numeral) in parentheses, such as

    Doris Day (I)

You can improve your results by appending (I) to a name and retrying it, if the name is not found in the database.

What to Hand In

Use handin to submit the following files:

  1. All .java files necessary for compiling your code
  2. A README file with:
    1. Your name (and your partner's name if you have one)
    2. Any known problems or interesting design decisions that you made
    3. Any extra commands implemented
    4. If you adhered to the honor code in this assignment, add the following statement to your README file:

      I affirm that I have adhered to the Honor Code in this assignment.

If you work with a partner, just submit one solution per team.

Acknowledgments

Information courtesy of The Internet Movie Database (http://www.imdb.com/).  Used with permission.  The data should only be used for personal and non-commercial purposes.

Thanks to Ben Kuperman for helping to reformat the imdb files.