BCNF & 3NF

Contents:
1. Relations
2. Functional dependencies
3. Keys
4. BCNF
5. Decomposing
6. 3NF
7. Why decomposing into 3NF doesn't work
8. FD basis
9. Synthesis

1. Relations

Relational databases are so called because they are based on the mathematical notion of a relation, which is simply a set of tuples. One of the more prominent occurrences of relations is in the equivalence relation.

There are two terminologies used for relational databases.

Programming Mathematical
tablerelation
metadataschema
rowtuple
columnattribute

Mathematically, we describe relations using a schema. For example, you can imagine a relation named orders described as follows.

orders(customer_nameshipping_addrproduct_namequantitytime_orderedtime_shipped)

We'll soon need to make some assumptions about how this relation works.

  1. Each customer has a distinct name, and each product has a distinct name.
  2. An order may consist of multiple products; each product within the order is listed as an additional tuple within the relation. An order does not list the same product multiple times — instead, the quantity attribute is used to represent this.
  3. No customer places two orders at the same time.
  4. All items in an order are to be shipped to the same address.

2. Functional dependencies

Definition: A1 A2 … An → B1 B2 … Bm is an FD (a functional dependency) for a relation R if whenever two tuples in R agree on A1A2, … An, they must necessarily agree on B1B2, … Bm.

Example: For the orders relation with assumptions as listed above, two FDs are:

customer_name time_ordered → shipping_addr
customer_name product_name time_ordered → quantity time_shipped

Since the left side is allowed more information than is necessary, another FD is

customer_name time_ordered quantity → shipping_addr

R is said to satisfy an FD, and A1A2, … An are said to determine B1B2, … Bm.

An FD is trivial if { B1B2, … Bm } is a subset of { A1A2, … An } For example, product_name quantity → quantity is a trivial FD.

3. Keys

The set { A1A2, … An } is a superkey for a relation R if it functionally determines all attributes for a relation. In other words, any two tuples of R must necessarily disagree on on at least one of A1A2, … An.

A1A2, … An } is a key for a relation R if it is a “minimal superkey”: That is, the set is a superkey but no proper subset is a superkey.

Example: For the orders relation, { customer_nameproduct_nametime_ordered } is a key.

4. BCNF

A relation R is in Boyce-Codd Normal Form (BCNF) if for every nontrivial FD A1 A2 … An → B1 B2 … Bm satisfied by R, the set { A1A2, … An } is a superkey for R.

Our definition of orders is not in BCNF, since its FD customer_name time_ordered → shipping_addr is nontrivial and the left side { customer_nametime_ordered } is not a superkey for orders.

5. Decomposing

It's desirable for each database to have all relations in BCNF. So, given a relation not in BCNF, we would normally attempt to decompose it into two or more BCNF relations.

This is simple: We find a violating FD (i.e., a nontrivial FD for which the left side is not a superkey), expand the right side to include as many attributes as possible (called computing the closure of the left side), and then split the relation into:

  1. one relation containing all attributes of this expanded FD, and
  2. another relation containing all the attributes of the FD's left side and all attributes that were missing from the FD.

If any of the resulting relations aren't in BCNF, we recurse on it (them).

Take orders as an example. We already decided that customer_name time_ordered → shipping_addr is an FD that causes orders to violate BCNF. We observe that the right side can be expanded by adding customer_name and time_ordered to it; but the remaining attributes (product_name, quantity, time_shipped) don't follow — assuming that an order can contain multiple products, and that these products might be shipped out at varying times. Thus, we'll split into two relations:

  1. orders(customer_nametime_orderedshipping_addr)
  2. order_contents(customer_nametime_orderedproduct_namequantitytime_shipped)

Both are in BCNF given our presumptions, so there's no more work to do; sometimes, though, one or both will need to be further decomposed.

6. Third normal form

Decomposing leads to a potential problem: In some situations, we can “lose” FDs through splitting them apart. Let us suppose our orders relation also had a restriction that no product can be shipped out to two locations simultaneously; this would lead to the new FD

product_name time_shipped → shipping_addr

The FD customer_name time_ordered → shipping_addr still holds, so we might attempt to decompose the relation as we did. But after splitting the relation as in our example, we would have no way to express our new FD in the resulting relation. Splitting closely related information apart is problematic; this situation leads to a slightly looser normal form called third normal form.

To define third normal form, we first define an attribute as prime if it is part of some key for its relation. Given this definition, a relation is in 3NF (third normal form) if every FD satisfies one of these three conditions:

  1. It is trivial;
  2. The left side is a superkey; or
  3. Every attribute on the right side either appears on that FD's left side or is prime.

Note that BCNF has stricter restrictions on what FDs it allows, so any relation that is in BCNF is also in 3NF. In practice, well-designed relations are almost always in BCNF; but occasionally a non-BCNF relation is still well-designed (and is in 3NF).

7. Why decomposing into 3NF doesn't work

Decomposing a relation into 3NF leads to potential problems. Suppose the following schema.

students(first_namehallsex)

Suppose that all first names are either distinctively male or female, and that our campus had only single-sex halls. This leads to two FDs:

first_name → sex
hall → sex

Based on this, we'd conclude that the only key for this relation is { first_name,  hall }. Note that both FDs violate 3NF.

Decomposition would propose that we would divide this relation into two relations based on one of the violating FDs. Suppose we chose to decompose based on hall → sex.

halls(hallsex)
students(first_namehall)

Notice that we have again split up an existing FD (first_name → sex). This is exactly what was wrong with BCNF. What have we gained?

8. FD Basis

There is another algorithm for breaking a non-3NF relation into 3NF relations. We'll get to that, but first we need to cover a new topic: the basis of a set of FDs. (For any who've studied Linear Algebra, this usage of the word basis is very similar to its usage in that field.)

A basis for a set F of FDs is a set G of FDs in which every FD of F is implied by the FDs in G. That is, for each F FD A → B, if we take the closure of A using G, we'd get a set of attributes that includes all attributes of B.

For example: In our students example above, suppose we have the following FDs.

a. first_name → sex
b. hall → sex
c. first_name → hall sex

(FD c. is a bit odd: It suggests that people of the same name must necessarily live in the same hall. When I was an undergrad there was a rumor that Housing found it fun to assign students in that way.)

Then the following constitutes a basis.

b. hall → sex
c. first_name → hall sex

We don't need a. in our basis, since it's implied by c..

Given a set F of FDs, a minimal basis is a set G of FDs for which:

Our basis example from above is not minimal, since one of the FDs has multiple attributes on its right side. We can split it into two pieces.

b. hall → sex
d. first_name → sex
e. first_name → hall

However, this is still not minimal because we can remove d. and still have a basis for our original set of FDs. The set { hall → sexfirst_name → hall } constitutes a minimal basis.

9. Synthesis

Here is our algorithm for splitting a non-3NF relation R into 3NF relations while not losing any FDs.

  1. Find a minimal basis for the set of R's FDs.
  2. For each FD in the minimal basis, create a relation consisting of the attributes in that FD.
  3. If none of the relations have a set of attributes that forms a superkey for R, select a key for R, and create a relation with the attributes from that key.
  4. If any relation includes only a subset of attributes found in another relation, delete that smaller relation.

For our students example with FDs a., b., and c., we'd get two relations from Step 2:

halls(hallsex)
names(first_namehall)

Step 3 would delete no relations. Step 4 would add no relations, since { first_namehall } is a superkey of the original relation, and those attributes appear in the names relation.

What would happen if students had first_name → sex and hall → sex as its FD basis? We'd get three relations.

halls(hallsex)
names(first_namesex)
students(first_namehall)

I leave it for you to figure out how this came out of the synthesis algorithm.