Assignment 5: One degree of Kevin Bacon

Due: 5pm, Thursday, March 20. Value: 35 pts. You may work with one other student on this assignment if you wish. Submit to Moodle in ASCII, OpenOffice.org, or PDF format (no Word files!).

The Internet Movie Database is a large compendium of data about film and television production, and they kindly share their database for non-commercial use. I have built a MySQL database of a subset of their information (those relating to productions that are popular enough to receive at least 1,000 user ratings, averaging at least 5.0). This database includes four tables.

films(film_id, title, num_raters, avg_rating) 11,353 rows
people(person_id, gender, name) 282,155 rows
actors(person_id, film_id) 672,479 rows
directors(person_id, film_id) 68,231 rows

For this assignment you will use an HTML interface to this MySQL database accessible through “http://imdb.cs.hendrix.edu/”. It allows you to enter SELECT queries that will be executed on the database.

As an example, here are two queries based on these tables:

  1. Show all productions in which Kevin Bacon appeared.

    SELECT title
    FROM   people JOIN actors ON people.person_id = actors.person_id
             JOIN films ON actors.film_id = films.film_id
    WHERE  name = 'Bacon, Kevin (I)'
  2. Show all actors who appeared in the same film as Kevin Bacon.

    SELECT people.name
    FROM   people AS bacon_person JOIN actors AS bacon_films
               ON bacon_person.person_id = bacon_films.person_id
           JOIN actors ON bacon_films.film_id = actors.film_id
           JOIN people ON actors.person_id = people.person_id
    WHERE  bacon_person.name = 'Bacon, Kevin (I)'

Note: Please notice the “Reset server” link. You may want to use it in either of two situations:

Following are the problems to be answered for this assignment.

  1. Write an SQL query that lists each production's title and the number of actors it has, sorted so the number of actors is decreasing.

  2. Write an SQL query that lists each actor's name and the number of productions whose rating is 9 or more in which the actor has appeared, sorted so that this number is decreasing.

  3. Following is an SQL query that lists each person who has both directed and acted in a film whose rating is 9 or more.

    SELECT DISTINCT name
    FROM   people JOIN actors ON people.person_id = actors.person_id
           JOIN films AS acted_film ON actors.film_id = acted_film.film_id
           JOIN directors ON people.person_id = directors.person_id
           JOIN films AS directed_film ON directors.film_id = directed_film.film_id
    WHERE  acted_film.avg_rating >= 9.0
       AND directed_film.avg_rating >= 9.0
    ORDER BY name

    Find the execution time for this query. To measure execution time for this and following questions, you should execute the query three times, reporting each of the three time measurements and the median of the three trials.

  4. Think of at least two very different queries that each accomplish the same task as the previous query, and time each of them. For example, one way might use subqueries. Time each of them. (You might hope to use INTERSECT, but MySQL doesn't support this. This page discusses some alternative approaches: The UNION ALL approach might be one you try.)

  5. MySQL works hard to find the best order in which to execute your query, but it allows you to designate the order in which tables should be joined together by including the STRAIGHT_JOIN keyword: If you start the query with SELECT STRAIGHT_JOIN, MySQL will load tables in the order they are listed in the FROM clause.

    Look back to the query listed above for listing the productions in which Kevin Bacon has appeared. Use STRAIGHT_JOIN to measure the time it takes for each of the six different ways of rearranging the tables.

    Although no ordering should take more than a minute and a half, you can feel free to cancel a query when it takes more than 20 seconds, just reporting it as 20 seconds in your answer. As noted above, whenever you cancel a query, be sure to click the “Reset server” so that MySQL doesn't continue working on it.

  6. Also measure the time for the query as originally written, without STRAIGHT_JOIN. Based on the time measurements, which order do you think MySQL uses if left to its own devices to find the best way? Can you provide evidence about whether it matters what order you list the tables in the FROM clause?

  7. The tables you have been using so far have several indexes on them: films has indexes on film_id and title, people has indexes on person_id and name, and both actors and directors have indexes on person_id and film_id. I have also created tables that are identical except they have fewer indexes and their name has had an underscore appended: films_ has an index only on film_id, people_ has an index only on person_id, and actors_ and directors_ do not have indexes at all.

    Modify the query listed at the top of the assignment (for displaying the productions in which Kevin Bacon has appeared) so it uses these tables instead. How much time does it take?

  8. Using STRAIGHT_JOIN and possibly rearranging the tables in the FROM clause, can you improve on the time that MySQL takes when left to its own devices to work with these tables with fewer indexes? How much of an improvement could you achieve?