# Lab 10

Everything's Better with Bacon
Due by 8pm on Friday 14 Dec 2012

In this lab, you will use graphs to model the (very large) network of actors and their movies, from which you will extract a lot of interesting information.

The purpose of this lab is to:

• Have you implement a graph, and
• Use your graph to represent a very large social network.

As usual, you may work with one partner on this assignment, if you choose.

## Introduction

In this lab, you will write a program that plays the "Kevin Bacon Game". A person's "Bacon Number" is computed based on the number of movies of separation between that person and the actor Kevin Bacon. For example, if you are Kevin Bacon, then your Bacon Number is 0. If you were in a movie with Kevin Bacon, your number would be 1. If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2. In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.

Note that this is a takeoff on Erdős numbers (mine's 3, because my advisor's is 2), and the two can be combined to form the more elusive Erdős-Bacon number.

For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.

## Part 1 - Graphs

As you know, graphs can be used to model a set of objects and relationships between those objects. For this lab, the objects in our graph are of two types: actors and movies; the relationships are whether an actor was in a given movie. In particular, we'll have a bipartite graph, that is, a graph where the vertices can be partitioned into two sets X and Y, and each edge connects an element of X (say, an actor) to an element of Y (say, a movie). For example, here is a very small subgraph of imdb's actor-movie bipartite graph:

This graph represents the three actors Kevin Bacon, John Malkovich, and Christian Bale, and the three movies Queens Logic, Empire of the Sun, and Batman Returns. The edges keep track of which actors were in which movies.

Here we have used an undirected graph such that the resulting path length between Kevin Bacon and some other actor X will be double X's Bacon Number. Thus, if you decide to represent the information in this way, you would need to divide the path length by 2 or use weights of 0.5 for the edges in order to make the correct computations.
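To make the "divide the path length by 2" idea concrete, here is a minimal sketch of a breadth-first search over an undirected actor-movie graph. It is not the required design: the class name `BipartiteBfsSketch` and the methods `addEdge` and `baconNumber` are hypothetical, vertices are plain strings (which assumes no actor shares a name with a movie title), and the edges in `main` are an assumption about which actor appeared in which movie.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class BipartiteBfsSketch {
    // Adjacency lists: each actor maps to its movies, each movie to its actors.
    private final Map<String, List<String>> adj = new HashMap<>();

    // Adds one undirected actor-movie edge.
    void addEdge(String actor, String movie) {
        adj.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        adj.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    // Returns the Bacon number of target relative to center,
    // or -1 if either name is unknown or target is unreachable.
    int baconNumber(String center, String target) {
        if (!adj.containsKey(center) || !adj.containsKey(target)) {
            return -1;
        }
        Map<String, Integer> dist = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        dist.put(center, 0);
        queue.add(center);
        while (!queue.isEmpty()) {
            String v = queue.remove();
            if (v.equals(target)) {
                // Every actor-to-actor step crosses two edges (actor-movie-actor).
                return dist.get(v) / 2;
            }
            for (String w : adj.get(v)) {
                if (!dist.containsKey(w)) {
                    dist.put(w, dist.get(v) + 1);
                    queue.add(w);
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BipartiteBfsSketch g = new BipartiteBfsSketch();
        // Assumed edges, for illustration only.
        g.addEdge("Kevin Bacon", "Queens Logic");
        g.addEdge("John Malkovich", "Queens Logic");
        g.addEdge("John Malkovich", "Empire of the Sun");
        g.addEdge("Christian Bale", "Empire of the Sun");
        System.out.println(g.baconNumber("Kevin Bacon", "John Malkovich")); // prints 1
        System.out.println(g.baconNumber("Kevin Bacon", "Christian Bale")); // prints 2
    }
}
```

Because all edges have the same weight here, plain breadth-first search is enough to find shortest paths. Your own Vertex, Edge, and Graph classes will likely look quite different.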

Another representation could create a directed graph and weight the edges from actors to movies as 0 and from movies to actors as 1. Then, using Dijkstra's algorithm, you could find the shortest path from Kevin Bacon, and without modification its length would be an actor's Bacon number.

In any case, you will need to construct Vertex, Edge, and Graph classes. Unlike previous labs, I will not list the required methods; at this point in the course you can probably figure out what methods and class variables you need. (We've already discussed many of these issues in class, and the textbook is always a decent resource if you're stuck.) If you decide to either use or model part of your implementation off of what is in the book, be sure to give proper credit in the methods or comments at the start of the file.

Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to include test cases in the main method. It will be a lot easier to test your methods now, on small graphs, than later with the large imdb data sets.

## Part 2 - Everything's Better with Bacon!

Now that you have a basic Graph, create a program / class called Bacon. This program will read a data file of movie and actor listings, and will allow you to interactively query the system for various statistics, such as the Bacon number and path for any actor in the database.

The program requires a single argument, which is the name of the file containing the information on actors and the roles they played in movies. One optional second argument can be used to specify the initial "center" (in case you don't want it to always be Kevin Bacon). For example, here are three sample command line usages:

```
% java -Xmx2g Bacon imdb.full.txt
# plays the game with the full data set centered at "Kevin Bacon (I)"

% java -Xmx2g Bacon imdb.top250.txt "Christian Bale"
# plays the game with the center set to "Christian Bale"

% java -Xmx2g Bacon http://www.cs.oberlin.edu/~asharp/cs151/labs/imdb.no-tv-v.txt
# plays the game with the no TV/V data set centered at "Kevin Bacon (I)"
```

After reading in the data, the program should then prompt the user for commands until an end-of-file (CTRL-D) is reached (`hasNextLine()` will return false). The commands to be supported are described in more detail later in the lab, but basically the user will be able to query the Bacon number of any given actor, as well as get common graph statistics such as average or maximum degree.

### Input Files

Similar to what you did in past labs, if the filename argument begins with "http:" you should treat it as a URL and read the file from the network. This will enable you to play the game without having to download the entire file. To open a Scanner from a URL, you just need to do something similar to the following:

```
Scanner s = new Scanner(new URL("http://www.cs.oberlin.edu/").openStream());
```
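One way to fold the "http:" check into a single helper is sketched below. The class name `InputSketch` and the method `open` are hypothetical; the only rule taken from this lab is that an argument beginning with "http:" is read from the network and anything else is treated as a local file.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

class InputSketch {
    // Opens a Scanner on a URL if source starts with "http:",
    // otherwise on a local file with that name.
    static Scanner open(String source) throws IOException {
        if (source.startsWith("http:")) {
            return new Scanner(new URL(source).openStream());
        }
        return new Scanner(new File(source));
    }
}
```

The rest of the program can then read lines from the returned Scanner without caring where the data came from.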

The movie data file contains information on what movies an actor appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:

```
<performer name>|<movie title>
```

The vertical pipe character '|' can be used to determine where the name ends and the title begins. There will only be one '|' on a line and there are no empty names or titles. java.lang.String has a number of methods that can be used to divide up the line (e.g., `s.split("\\|")` returns the array of substrings of `s` that are delimited by `|`).
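As a small illustration of the split call, the sketch below breaks one line into its name and title. The class name `LineParseSketch` and the method `parse` are hypothetical, and rejecting malformed lines with an exception is just one possible choice.

```java
class LineParseSketch {
    // Splits "<performer name>|<movie title>" into { name, title }.
    static String[] parse(String line) {
        // The pipe is a regex metacharacter, so it must be escaped.
        String[] parts = line.split("\\|");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] parts = parse("Kevin Bacon (I)|A Few Good Men (1992)");
        System.out.println(parts[0]); // prints Kevin Bacon (I)
        System.out.println(parts[1]); // prints A Few Good Men (1992)
    }
}
```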

I have supplied several data files of varying sizes for you to work with.

• imdb.tiny.txt - a tiny test set with two components, 4 actors, and 4 movies.
• imdb.small.txt - a 1817 line file with just a handful of actors (161), one component.
• imdb.post1950.txt - a 6206077 line, 1979980 actor file listing just the movies made after 1950.
• imdb.pre1950.txt - a 948557 line, 125258 actor file listing just the movies made before 1950.
• imdb.no-tv-v.txt - a 5295816 line file without the made for TV and direct to video movies (best for the canonical Kevin Bacon game)
• imdb.full.txt - all 7162109 lines of IMDB for you to search through

Rather than cluttering up your account with these files, you can use the links above as URLs. Alternatively, once you have your lab folder created, you can run 151lab10setup from a lab machine and you'll get symbolic links to the files in the current directory.

Other than the small database, you'll almost certainly need to increase the amount of memory allowed via the -Xmx argument.

Before continuing to the actual commands, I highly recommend getting your program working up to this point. That is, are you correctly reading in the input files? Does your graph have the correct number of vertices and edges? Having these things solid before continuing may save you a lot of trouble down the road.

### Commands to be supported

Now that your program has its data loaded into your graph, you can implement the required commands. Your program should repeatedly prompt the user for one of the commands below, until they choose to quit the program (with CTRL-D).
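The prompt loop itself might be structured roughly as follows. This is only a sketch: the class name `CommandLoopSketch`, the helper `splitCommand`, the "> " prompt, and the placeholder messages are all hypothetical, and the real work for each command is left out.

```java
import java.util.Scanner;

class CommandLoopSketch {
    // Splits an input line into { command, argument }; argument may be empty.
    static String[] splitCommand(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return new String[] { trimmed, "" };
        }
        return new String[] { trimmed.substring(0, space),
                              trimmed.substring(space + 1).trim() };
    }

    // Reads commands until end-of-file (hasNextLine() returns false).
    static void run(Scanner in) {
        System.out.print("> ");
        while (in.hasNextLine()) {
            String[] cmd = splitCommand(in.nextLine());
            switch (cmd[0]) {
                case "":
                    break; // blank line: just prompt again
                case "find":
                case "recenter":
                    System.out.println(cmd[0] + " requested for: " + cmd[1]);
                    break;
                case "avgdist":
                case "stats":
                case "allcenter":
                case "table":
                    System.out.println(cmd[0] + " requested");
                    break;
                default:
                    System.out.println("Unknown command: " + cmd[0]);
            }
            System.out.print("> ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        run(new Scanner(System.in));
    }
}
```

Splitting at the first space keeps multi-word names such as "James Earl Jones" intact as a single argument.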

I am supplying my class files (the Graph.class, Edge.class, Vertex.class, and Bacon.class files) so that, if you have any questions about desired behaviour, you can try my program to see how it behaves. It's not completely debugged, but it should be able to answer many of your questions.

1. find <name>

Find the shortest path from the current center to <name>. The output should be of the format

`    <name1> -> <movie1> -> <name2> -> <movie2> -> ... -> Kevin Bacon (length n)`

where <name1> is the person specified by the user and the movies and actors in between show the path from that actor to the current center. The '(length n)' should indicate the Bacon Number. E.g., "find James Earl Jones" in the "full" database yields

```
James Earl Jones -> The Magic 7 (2009) (TV) -> Kevin Bacon (I)
(length 1)
```

and in the "no-tv-v" set:

```
James Earl Jones -> A Family Thing (1996) -> Xander Berkeley ->
A Few Good Men (1992) -> Kevin Bacon (I)
(length 2)
```

Note that your links may differ from mine, but the path length should be the same.

If someone is disconnected from the center simply print

```
<name> is unreachable
```

2. recenter <name>

Change the center to the given name if it exists in the database (otherwise, leave the center unchanged.)

3. avgdist

Calculate the average Bacon Number for the given center among all connected actor nodes. Your output should be the following

```
<avg><tab><name><space>(<number unreachable>)
```

The average should only be for the nodes reachable from the center. In the small database, I get the following

```
2.42    Kevin Bacon 0
```

in the pre1950 database I get

```
3.11    John Aasen 4725
```

and in the "no-tv-v" set I get

```
2.98    Kevin Bacon (I)	98708
```

4. stats

Calculate structural statistics for the current graph. You should compute the average degree of all actor nodes, a table listing the number of actors with each degree that is non-zero, and the number of components of the graph.

In the small database, I get the following:

```
Number of Actors:   161
Average Degree:	11.29
Table of degrees
Degree    1:	130
Degree    2:	1
Degree    5:	1
Degree   10:	1
Degree   22:	1
Degree   25:	1
Degree   28:	1
Degree   30:	1
Degree   31:	1
Degree   32:	1
Degree   33:	1
Degree   36:	1
Degree   39:	2
Degree   42:	1
Degree   49:	1
Degree   50:	2
Degree   51:	1
Degree   54:	1
Degree   55:	1
Degree   56:	2
Degree   57:	1
Degree   60:	1
Degree   63:	1
Degree   66:	1
Degree  101:	2
Degree  112:	1
Degree  114:	1
Degree  218:	1
Number of Components:	1
```

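The degree portion of stats could be computed along the following lines. This is a sketch under assumptions: the class name `DegreeStatsSketch` and its methods are hypothetical, and the graph is reduced to a map from each actor to the list of movies they appear in, which is probably not how your Graph class stores it. Counting components is not shown.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class DegreeStatsSketch {
    // Counts how many actors have each degree; TreeMap keeps degrees sorted.
    static TreeMap<Integer, Integer> degreeTable(Map<String, List<String>> moviesByActor) {
        TreeMap<Integer, Integer> table = new TreeMap<>();
        for (List<String> movies : moviesByActor.values()) {
            table.merge(movies.size(), 1, Integer::sum);
        }
        return table;
    }

    // Average number of movies per actor (0.0 for an empty graph).
    static double averageDegree(Map<String, List<String>> moviesByActor) {
        if (moviesByActor.isEmpty()) {
            return 0.0;
        }
        long total = 0;
        for (List<String> movies : moviesByActor.values()) {
            total += movies.size();
        }
        return (double) total / moviesByActor.size();
    }

    // Prints the actor count, average degree, and degree table.
    static void print(Map<String, List<String>> moviesByActor) {
        System.out.println("Number of Actors:\t" + moviesByActor.size());
        System.out.println(String.format(Locale.ROOT, "Average Degree:\t%.2f",
                averageDegree(moviesByActor)));
        System.out.println("Table of degrees");
        for (Map.Entry<Integer, Integer> e : degreeTable(moviesByActor).entrySet()) {
            System.out.println(String.format(Locale.ROOT, "Degree %4d:\t%d",
                    e.getKey(), e.getValue()));
        }
    }
}
```

As a sanity check on the sample output above, the small file has 1817 lines and 161 actors, and 1817 / 161 is about 11.29, which matches the reported average degree.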
5. allcenter

Calculate the average Bacon Number for every entry in the database. That is, recenter on every single actor in turn and compute the average distance to that center. NOTE: this can take a very (very) long time on larger data sets.

Note that a low score does not necessarily make a good center. For example, one actor's score of 0.5 means that he has only two actor nodes in his component, and a distance of 1 between them. A more appropriate distance scoring would take into account the number of elements in your component, as well as your average distance to all the nodes in that component.

6. table

Print out a table of the counts of Bacon numbers for the given center, from 0 up to the longest.

In the small database, I get the following:

```
Table of distances for Kevin Bacon
Distance    0:	       1
Distance    1:	       3
Distance    2:	      96
Distance    3:	      51
Distance    4:	       9
Distance    5:	       1
Unreachable :	       0
```
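This table doubles as a cross-check for avgdist: weighting each distance by its count gives (0·1 + 1·3 + 2·96 + 3·51 + 4·9 + 5·1) / 161 = 389 / 161 ≈ 2.42, which agrees with the avgdist value reported for the small database. The sketch below does that arithmetic; the class name `AverageFromTableSketch` and the method `average` are hypothetical, and it assumes the center itself (distance 0) is counted, as the numbers above suggest.

```java
import java.util.Locale;

class AverageFromTableSketch {
    // counts[d] is the number of reachable actors at distance d from the center.
    static double average(int[] counts) {
        long actors = 0;
        long total = 0;
        for (int d = 0; d < counts.length; d++) {
            actors += counts[d];
            total += (long) d * counts[d];
        }
        return actors == 0 ? 0.0 : (double) total / actors;
    }

    public static void main(String[] args) {
        int[] small = { 1, 3, 96, 51, 9, 1 }; // counts from the table above
        System.out.println(String.format(Locale.ROOT, "%.2f", average(small))); // prints 2.42
    }
}
```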

You may opt to include additional commands for consideration towards extra credit. You should document any additional commands you implement in the README file. Be sure to explain what each one does and how someone could use it.

Here are some suggestions:

1. findall - iterate through all actors and actresses and perform a find operation on them.
2. longest - print out one of the longest paths to the center
3. movies <name> - list all outbound edges from a given name
4. most - list the actor with the most film credits (i.e. the actor vertex with the highest degree)

## Hand In

Use handin to submit the following files:

1. All .java files necessary for compiling your code