Assignment 5: One degree of Kevin Bacon
Due: 5pm, Thursday, March 20. Value: 35 pts. You may work with one other student on this assignment if you wish. Submit to Moodle in ASCII, OpenOffice.org, or PDF format (no Word files!).
The Internet Movie Database is a large compendium of data about film and television production, and they kindly share their database for non-commercial use. I have built a MySQL database of a subset of their information (those relating to productions that are popular enough to receive at least 1,000 user ratings, averaging at least 5.0). This database includes four tables.
films(film_id, title, num_raters, avg_rating) 11,353 rows people(person_id, gender, name) 282,155 rows actors(person_id, film_id) 672,479 rows directors(person_id, film_id) 68,231 rows
For this assignment you will use
an HTML interface
to this MySQL database
accessible through
“http://imdb.cs.hendrix.edu/”.
It allows you to enter SELECT
queries that will be executed on
the database.
As an example, here are two queries based on these tables:
Show all productions in which Kevin Bacon appeared.
SELECT title
FROM people JOIN actors ON people.person_id = actors.person_id
JOIN films ON actors.film_id = films.film_id
WHERE name = 'Bacon, Kevin (I)'Show all actors who appeared in the same film as Kevin Bacon.
SELECT people.name
FROM people AS bacon_person JOIN actors AS bacon_films
ON bacon_person.person_id = bacon_films.person_id
JOIN actors ON bacon_films.film_id = actors.film_id
JOIN people ON actors.person_id = people.person_id
WHERE bacon_person.name = 'Bacon, Kevin (I)'
Note: Please notice the “Reset server” link. You may want to use it in either of two situations:
In doing this assignment, you may well find that you issue a query that takes very long — longer than you wish to wait. Your first reaction will likely be to reload the page, but this does not cancel a query: Reloading a page sends no useful message to the server, so the server will continue executing the query, which may, indeed, take hours. Instead, to cancel a query, click the “Reset server” link.
You may also use this if you find that “MySQL server status” indicates that MySQL is using a large fraction of the CPU or memory, which could indicate that there is still a query being executed (perhaps by somebody else) that needs to be terminated. Maybe another student is executing a query at the same time, though, so try to be courteous and wait for a moment.
Following are the problems to be answered for this assignment.
Write an SQL query that lists each production's title and the number of actors it has, sorted so the number of actors is decreasing.
Write an SQL query that lists each actor's name and the number of productions whose rating is 9 or more in which the actor has appeared, sorted so that this number is decreasing.
Following is an SQL query that lists each person who has both directed and acted in a film whose rating is 9 or more.
SELECT DISTINCT name
FROM people JOIN actors ON people.person_id = actors.person_id
JOIN films AS acted_film ON actors.film_id = acted_film.film_id
JOIN directors ON people.person_id = directors.person_id
JOIN films AS directed_film ON directors.film_id = directed_film.film_id
WHERE acted_film.avg_rating >= 9.0
AND directed_film.avg_rating >= 9.0
ORDER BY nameFind the execution time for this query. To measure execution time for this and following questions, you should execute the query three times, reporting each of the three time measurements and the median of the three trials.
Think of at least two very different queries that each accomplish the same task as the previous query, and time each of them. For example, one way might use subqueries. Time each of them. (You might hope to use
INTERSECT
, but MySQL doesn't support this. This page discusses some alternative approaches: TheUNION ALL
approach might be one you try.)MySQL works hard to find the best order in which to execute your query, but it allows you to designate the order in which tables should be joined together by including the
STRAIGHT_JOIN
keyword: If you start the query withSELECT STRAIGHT_JOIN
, MySQL will load tables in the order they are listed in theFROM
clause.Look back to the query listed above for listing the productions in which Kevin Bacon has appeared. Use
STRAIGHT_JOIN
to measure the time it takes for each of the six different ways of rearranging the tables.Although no ordering should take more than a minute and a half, you can feel free to cancel a query when it takes more than 20 seconds, just reporting it as 20 seconds in your answer. As noted above, whenever you cancel a query, be sure to click the “Reset server” so that MySQL doesn't continue working on it.
Also measure the time for the query as originally written, without
STRAIGHT_JOIN
. Based on the time measurements, which order do you think MySQL uses if left to its own devices to find the best way? Can you provide evidence about whether it matters what order you list the tables in theFROM
clause?The tables you have been using so far have several indexes on them:
films
has indexes onfilm_id
andtitle
,people
has indexes onperson_id
andname
, and bothactors
anddirectors
have indexes onperson_id
andfilm_id
. I have also created tables that are identical except they have fewer indexes and their name has had an underscore appended:films_
has an index only onfilm_id
,people_
has an index only onperson_id
, andactors_
anddirectors_
do not have indexes at all.Modify the query listed at the top of the assignment (for displaying the productions in which Kevin Bacon has appeared) so it uses these tables instead. How much time does it take?
Using
STRAIGHT_JOIN
and possibly rearranging the tables in theFROM
clause, can you improve on the time that MySQL takes when left to its own devices to work with these tables with fewer indexes? How much of an improvement could you achieve?