# CSCI 151 - Lab 10 Everything is better with Bacon

10:00pm, Wednesday, May 4

You may work with a partner on this assignment.

## Introduction

In class, we have been discussing how graph structures can be used to represent relationships between groups of objects. For this assignment, you will be writing a program that allows you to play the "Kevin Bacon Game". A person's "Bacon Number" is computed based on the number of movies of separation between that person and the actor Kevin Bacon. For example, if you are Kevin Bacon, then your Bacon Number is 0. If you were in a movie with Kevin Bacon, your number would be 1. If you weren't in a movie with Kevin Bacon, but were in a movie with someone who was, your Bacon Number would be 2. In short, your Bacon Number is one greater than the smallest Bacon Number of any of your co-stars.

Note that this is a takeoff on Erdős numbers, and the two can be combined to form the more elusive Erdős-Bacon number.

For fun and some additional background, you can try out the Oracle of Bacon at the University of Virginia.

## Program Details

You will be writing a class called BaconNumber that will read a data file and allow you to interactively query the system for the Bacon Number and path for any actor in the database. The program should require a single argument which is the filename containing the information on people and the roles they played in a movie. An optional second argument can be used to specify the initial center. After reading in the data, the program should then prompt the user for commands until an end-of-file (CTRL-D) is reached (`hasNextLine()` will return false).
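The argument handling and prompt loop described above might be sketched as follows. This is only a minimal illustration, not a required design: the class name `CommandLoop` and the helpers `initialCenter` and `splitCommand` are hypothetical, and the loop just echoes each command instead of running it.

```java
import java.util.Scanner;

class CommandLoop {
    static final String DEFAULT_CENTER = "Kevin Bacon (I)";

    // The optional second command-line argument overrides the default center.
    static String initialCenter(String[] args) {
        return args.length > 1 ? args[1] : DEFAULT_CENTER;
    }

    // Splits "find James Earl Jones" into {"find", "James Earl Jones"}.
    // Commands without an argument get an empty string as the second element.
    static String[] splitCommand(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            return new String[] { trimmed, "" };
        }
        return new String[] { trimmed.substring(0, space), trimmed.substring(space + 1).trim() };
    }

    // Reads commands until hasNextLine() returns false (end-of-file).
    static void run(Scanner in) {
        while (in.hasNextLine()) {
            String[] parts = splitCommand(in.nextLine());
            if (parts[0].isEmpty()) {
                continue; // skip blank lines
            }
            // A real program would dispatch to find, recenter, avgdist, ... here.
            System.out.println("command=" + parts[0] + " argument=" + parts[1]);
        }
    }

    public static void main(String[] args) {
        System.out.println("center=" + initialCenter(args));
        // Canned input for demonstration; the real program would pass new Scanner(System.in).
        run(new Scanner("find James Earl Jones\navgdist\n"));
    }
}
```

Splitting at the first space only keeps multi-word names such as "James Earl Jones" intact as a single argument.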

Similar to what you did in past labs, if the filename argument begins with "http:" you should treat it as a URL and read the file from the network. This will enable you to play the game without having to download the entire file. To open a Scanner from a URL, you just need to do something similar to the following:

```
Scanner s = new Scanner( new URL("http://www.cs.oberlin.edu/").openStream() );
```
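One way to fold the "http:" check into a single helper is sketched below. The class name `SourceOpener` and the method `open` are hypothetical names chosen for illustration; the sketch assumes that anything not starting with "http:" is a local file path.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;

class SourceOpener {
    // Returns a Scanner over the network stream when the argument starts with
    // "http:", and over a local file otherwise.
    static Scanner open(String source) throws IOException {
        if (source.startsWith("http:")) {
            return new Scanner(new URL(source).openStream());
        }
        return new Scanner(new File(source));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java SourceOpener <file-or-url>");
            return;
        }
        int lines = 0;
        try (Scanner in = open(args[0])) {
            while (in.hasNextLine()) {
                in.nextLine();
                lines++;
            }
        }
        System.out.println(lines + " lines read");
    }
}
```

Because both branches return a `Scanner`, the rest of the program does not need to know where the data came from.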

### Sample command line usage

```
% java -Xmx2g BaconNumber imdb.full.txt
# plays the game with the full data set centered at "Kevin Bacon (I)"

% java -Xmx2g BaconNumber imdb.pre1950.txt "Bela Lugosi"
# plays the game with the center set to "Bela Lugosi"

% java -Xmx2g BaconNumber http://www.cs.oberlin.edu/~gr151/imdb/imdb.no-tv-v.txt
# plays the game with the no TV/V data set centered at "Kevin Bacon (I)"
```

### File Format

The movie data file contains information on what movies a performer appears in. Every line contains information on one person appearing in one movie. The lines are formatted as follows:

```
<performer name>|<movie title>
```

The vertical pipe character '|' can be used to determine where the name ends and the title begins. There will only be one '|' on a line and there are no empty names or titles. java.lang.String has a number of methods that can be used to divide up the line. (e.g., `split("\\|")`)
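As a small illustration of the `split("\\|")` approach, the sketch below breaks one line into its two fields. The class name `CreditParser` and the method `parseCredit` are hypothetical; the example line reuses a credit that appears in the sample output later in this handout.

```java
class CreditParser {
    // Splits "<performer name>|<movie title>" into {name, title}.
    // The limit of 2 guarantees at most two pieces; lines without a '|'
    // are rejected rather than silently accepted.
    static String[] parseCredit(String line) {
        String[] parts = line.split("\\|", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("no '|' found in: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        String[] credit = parseCredit("James Earl Jones|Blood Tide (1982)");
        System.out.println("performer: " + credit[0]);
        System.out.println("movie:     " + credit[1]);
    }
}
```

The pipe must be escaped because `split` takes a regular expression, in which a bare `|` means alternation.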

I have supplied several data files of varying sizes for you to work with. (Don't download them to your CS account, see below.)

• imdb.cslam.txt - an 11 line file with the example from the prelab
• imdb.small.txt - a 1817 line file with just a handful of performers (161), fully connected
• imdb.top250.txt - a 14339 line file listing just the top 250 movies on IMDB. (Disconnected groups of foreign films.)
• imdb.pre1950.txt - a 966338 line file with movies made before 1950
• imdb.post1950.txt - a 6848516 line file with the movies made after 1950
• imdb.only-tv-v.txt - a 2021636 line file with only made for TV and direct to video movies
• imdb.no-tv-v.txt - a 5793218 line file without the made for TV and direct to video movies (best for the canonical Kevin Bacon game)
• imdb.full.txt - all 7814854 lines of IMDB for you to search through

Rather than cluttering up your account with these files, you can either use the links above as URLs or, once you have your lab folder created, run 151lab10setup from a lab machine to get symbolic links to the files in the current directory. Don't submit the imdb files when you hand in the assignment.

Other than the small database, you'll almost certainly need to increase the amount of memory allowed via the -Xmx argument.

## Commands to be supported

Your program should read in the specified file and, in the default case, choose "Kevin Bacon (I)" as the initial center. There are a number of commands you are to support in order to query the database and change the center.

1. find <name>

Find the shortest path from the current center to <name>. The output should be of the format

`    <name1> -> <movie1> -> <name2> -> <movie2> -> ... -> Kevin Bacon (I) (n)`

where <name1> is the person specified by the user and the movies and actors in between show the path from that actor to the current center. The '(n)' should indicate the Bacon Number. E.g., "find James Earl Jones" in the "full" database yields

```
James Earl Jones -> Magic 7, The (2008) (TV) -> Kevin Bacon (I) (1)
```

and in the "no-tv-v" set:

```
James Earl Jones -> Blood Tide (1982) -> Mary Louise Weller -> Animal House (1978) -> Kevin Bacon (I) (2)
```

Note that your links may differ, but the path length should be the same.

If someone is disconnected from the center simply print

```
<name> is unreachable
```
2. recenter <name>

Change the center to the given name if it exists in the database. If the name is not found, print an appropriate message and do not change the center.

3. avgdist

Calculate the average Bacon Number for the given center among all connected nodes. Your output should be the following

```
<avg><tab><name><space>(<number reachable>,<number unreachable>)
```

The average should only be for the nodes reachable from the center. In the top250 database, I get the following

```
3.5942556977039737  Kevin Bacon (I) (11803,663)
```

and in the "no-tv-v" set I get

```
2.99402433463726    Kevin Bacon (I) (1833436,118796)
```
4. topcenter <n>

For each actor in the current connected component (i.e., the one containing the current center), calculate the average Bacon distance to all actors in that component. Then print a table of the n best centers (i.e., the ones whose average Bacon distance is the smallest). NOTE: this can take a very long time on larger data sets.

In the top 250 set, my program finds "Robert Duvall (11803,663)" is the best center (~2.699) and the worst center is "Kumeko Otowa (11803,663)" (~6.378).

Here's the output from my running topcenter 5 on the top250 dataset:

```
Enter a command: topcenter 5
2.6989748369058715  robert duvall
2.7369312886554265  harrison ford (i)
2.741930017792087   robert de niro
2.776666949080742   john ratzenberger
2.798017453189867   alec guinness
```
5. table - print a table of the counts of bacon numbers for the given center from 0 up to the longest.

In the top250 database I get:

```
Table of distances for Kevin Bacon (I)
Number    0:           1
Number    1:          87
Number    2:         539
Number    3:        4462
Number    4:        5786
Number    5:         840
Number    6:          88
Unreachable:         663
```

in the no-tv-v database I get:

```
Table of distances for Kevin Bacon (I)
Number	0:	1
Number	1:	3344
Number	2:	408925
Number	3:	1425751
Number	4:	349704
Number	5:	30061
Number	6:	3482
Number	7:	380
Number	8:	92
Number	9:	12
Unreachable: 	164815

```

and for the full database I get:

```
Table of distances for Kevin Bacon (I)
Number	0:	1
Number	1:	5920
Number	2:	646684
Number	3:	1653925
Number	4:	289613
Number	5:	24138
Number	6:	2738
Number	7:	361
Number	8:	64
Number	9:	6
Unreachable:	176859
```
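The `table` and `avgdist` outputs are closely related: once the Bacon Numbers of all reachable actors are known, the table is a histogram of them and the average follows directly from the histogram. The sketch below shows one way this could be done; `DistanceStats`, `histogram`, `average`, and `avgdistLine` are hypothetical names. Using the top250 table above, the weighted sum is 42423 over 11803 reachable actors (the center itself counts, at distance 0), which comes to about 3.594 and agrees with the top250 `avgdist` figure.

```java
import java.util.Arrays;
import java.util.Collection;

class DistanceStats {
    // counts[k] = number of reachable actors whose Bacon Number is k.
    static long[] histogram(Collection<Integer> distances) {
        if (distances.isEmpty()) {
            return new long[0];
        }
        int max = 0;
        for (int d : distances) {
            max = Math.max(max, d);
        }
        long[] counts = new long[max + 1];
        for (int d : distances) {
            counts[d]++;
        }
        return counts;
    }

    // Average Bacon Number over reachable actors only (0.0 if there are none).
    static double average(long[] counts) {
        long total = 0;
        long weighted = 0;
        for (int k = 0; k < counts.length; k++) {
            total += counts[k];
            weighted += (long) k * counts[k];
        }
        return total == 0 ? 0.0 : (double) weighted / total;
    }

    // Builds "<avg><tab><name><space>(<reachable>,<unreachable>)".
    static String avgdistLine(String center, long[] counts, long unreachable) {
        long reachable = Arrays.stream(counts).sum();
        return average(counts) + "\t" + center + " (" + reachable + "," + unreachable + ")";
    }

    public static void main(String[] args) {
        // Counts copied from the top250 table in this handout.
        long[] top250 = { 1, 87, 539, 4462, 5786, 840, 88 };
        System.out.println(avgdistLine("Kevin Bacon (I)", top250, 663));
    }
}
```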

You may opt to include additional commands for consideration towards extra credit. For any additional commands you implement, you should document them in the README file. Be sure to explain what each one does and how someone could use it.

Here are some suggestions:

1. findall - Iterate through all actors and actresses and perform a find operation on them.
2. most - list the actor with the most film credits (i.e., the actor vertex with the highest degree)
3. longest - print out one of the longest paths to the center
4. movies <name> - list all outbound edges from a given name
5. Present the user with a menu to pick from if the IMDB file cannot be opened. Just give the user text descriptions of the data sets and have the URLs stored in your program.

### Notes

The longest Bacon Number I found in the 'imdb.no-tv-v.txt' dataset for Kevin Bacon was 9 ("Andrea Parlato" and others). "Kevin Bacon (I)" has an average distance value of ~2.994 while "Sean Connery" has ~2.955, indicating that he is a better center than Kevin Bacon. The Oracle of Bacon has a top 1000 list of centers which could be used to search for better values.

### Programming Tips

As we have been discussing graphs, it should be no surprise that a good way to represent these acting relationships is with a graph. There are a number of ways to do this; however, if you want to maintain a simple graph, you might make both movies and actors vertices, with each edge simply representing an appearance of an actor in a movie.

While an undirected graph could be used, the resulting path length will be double the Bacon Number. You would need to divide the path length by 2 or use weights of 0.5 for the edges. Another technique would be to create a directed graph and weight the edges from actors to movies as 0 and from movies to actors as 1. Then, using Dijkstra's algorithm, you can find the shortest path, with all actors and actresses listed for a movie considered equally.
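To make the actor/movie graph idea concrete, here is a minimal sketch under several assumptions. It swaps in a plain breadth-first search for Dijkstra's algorithm: the search walks actor to movie to co-star and counts one step per co-star reached, which has the same effect as halving the path length in the undirected graph. The class `BaconBfs` and its methods are hypothetical names, and the credits in `main` are taken from the sample `find` output in this handout plus one invented, disconnected entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BaconBfs {
    private final Map<String, List<String>> moviesOf = new HashMap<>();
    private final Map<String, List<String>> castOf = new HashMap<>();
    // Filled in by search(): Bacon Number, plus the movie and co-star one step closer to the center.
    final Map<String, Integer> dist = new HashMap<>();
    private final Map<String, String> viaMovie = new HashMap<>();
    private final Map<String, String> viaActor = new HashMap<>();

    void addCredit(String actor, String movie) {
        moviesOf.computeIfAbsent(actor, k -> new ArrayList<>()).add(movie);
        castOf.computeIfAbsent(movie, k -> new ArrayList<>()).add(actor);
    }

    // Breadth-first search outward from the center; each movie is expanded once.
    void search(String center) {
        dist.clear();
        viaMovie.clear();
        viaActor.clear();
        if (!moviesOf.containsKey(center)) {
            return;
        }
        Set<String> seenMovies = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(center, 0);
        queue.add(center);
        while (!queue.isEmpty()) {
            String actor = queue.remove();
            for (String movie : moviesOf.get(actor)) {
                if (!seenMovies.add(movie)) {
                    continue; // this movie's cast has already been labeled
                }
                for (String costar : castOf.get(movie)) {
                    if (dist.containsKey(costar)) {
                        continue;
                    }
                    dist.put(costar, dist.get(actor) + 1);
                    viaMovie.put(costar, movie);
                    viaActor.put(costar, actor);
                    queue.add(costar);
                }
            }
        }
    }

    // Formats the path from name back to the center; names never reached
    // (including names absent from the data) are reported as unreachable.
    String pathTo(String name) {
        if (!dist.containsKey(name)) {
            return name + " is unreachable";
        }
        StringBuilder sb = new StringBuilder(name);
        String cur = name;
        while (viaActor.containsKey(cur)) {
            sb.append(" -> ").append(viaMovie.get(cur)).append(" -> ").append(viaActor.get(cur));
            cur = viaActor.get(cur);
        }
        return sb.append(" (").append(dist.get(name)).append(")").toString();
    }

    public static void main(String[] args) {
        BaconBfs g = new BaconBfs();
        g.addCredit("Kevin Bacon (I)", "Animal House (1978)");
        g.addCredit("Mary Louise Weller", "Animal House (1978)");
        g.addCredit("Mary Louise Weller", "Blood Tide (1982)");
        g.addCredit("James Earl Jones", "Blood Tide (1982)");
        g.addCredit("Hypothetical Actor", "Hypothetical Movie"); // invented, disconnected entry
        g.search("Kevin Bacon (I)");
        System.out.println(g.pathTo("James Earl Jones"));
        System.out.println(g.pathTo("Hypothetical Actor"));
    }
}
```

Keeping separate maps for actors and movies avoids any clash between a performer name and a movie title that happen to be the same string. A single graph class with both kinds of vertices, as suggested above, would work just as well.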

Remember that it is best to build and test your program incrementally. Construct your Graph class and be sure to include test cases in the main method.

If you decide to either use or model part of your implementation off of what is in the book, be sure to give proper credit in the methods or comments at the start of the file.

You can improve your results by appending a "(I)" to a name and retrying the operation if it isn't found in the database before giving up. (IMDB has been adding that to the end of a number of entries.)
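The retry described above might look like the following sketch, which assumes the known performer names are available as a `Set<String>`. The class `NameLookup` and the method `resolve` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class NameLookup {
    // Returns the name as stored in the database, trying "<name> (I)" as a
    // fallback, or null if neither form is present.
    static String resolve(Set<String> actors, String name) {
        if (actors.contains(name)) {
            return name;
        }
        String withSuffix = name + " (I)";
        return actors.contains(withSuffix) ? withSuffix : null;
    }

    public static void main(String[] args) {
        Set<String> actors = new HashSet<>(Arrays.asList("Kevin Bacon (I)", "Bela Lugosi"));
        System.out.println(resolve(actors, "Kevin Bacon"));  // found via the "(I)" retry
        System.out.println(resolve(actors, "Bela Lugosi"));  // found as typed
    }
}
```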

## What to Hand In

Use handin to submit the following files:

1. All .java files necessary for compiling your code
2. Any known problems or interesting design decisions that you made
3. Anything implemented for extra credit

I have adhered to the Honor Code in this assignment.

If you work with a partner, just submit one solution per team.

### Acknowledgments

Information courtesy of The Internet Movie Database (http://www.imdb.com/). Used with permission. The data should only be used for personal and non-commercial purposes.

```
Command line argument support: [/5]
Required methods: [/35]
Large file support: [/5]

Extra Methods: [0/10]

TOTAL: [/50]
```

Last Modified: April 8, 2016 by Roberto Hoyle - Original by Benjamin A. Kuperman