You can’t complain about the sea if you suffer shipwreck for the second time.
—IJslands gezegde

Combining tables with OUTER JOIN

Again, we start with a movie: the second part of the movie from the previous chapter Inner Join.

The need for an OUTER JOIN

Make the following exercise as an introduction.

Our program manager would like an overview of all lecturers with their lecturer number, first name, surname and the name of the Course of which they are coordinator. Sort alphabetically by last name. You should obtain the figure below.

Not so difficult:

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM lecturer L INNER JOIN course c ON L.lecturer_id = c.coordinator
ORDER BY 3;

These are the steps the RDBMS takes when executing this query:

  1. First, the FROM is executed. This is an INNER JOINof two tables and so the cartesian product is created: each row of the lecturer table is linked to each row of the course table.
  2. Not all combinations in this cartesian product make sense: only those where the lecturer number (in one table called ‘lecturer_id’, in the other ‘coordinator’) matches are retained, because those satisfy the join condition ON L.lecturer_id = c.coordinator.
  3. There is no WHERE component, so no more rows are dropped from the result.
  4. The GROUP BY (and HAVING) is also missing.
  5. Finally, the result rows are arranged alphabetically by surname via the ORDER BY component.

You show the requested list to the manager, who unfortunately is not entirely satisfied with the result. “I wanted actually that all lecturers be in the list, including those who are not coordinating any course” you hear him say. The INNER JOIN in your SQL query, however, shows only the lecturers that appear in both tables, so only the coordinators of courses.

If you still want to include all teachers in the list, the missing lecturers must be added to the result of the INNER JOIN. This is exactly what an OUTER JOIN does. You just need to replace the word INNER with LEFT OUTER or RIGHT OUTER (or possibly FULL OUTER):

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM lecturer L LEFT OUTER JOIN course c ON L.lecturer_id = c.coordinator
ORDER BY 3;

Why LEFT? This join contains two tables: one left of the word JOIN (i.e. ‘lecturer’) and one right of the word JOIN (i.e. ‘course’). After the ordinary INNER JOIN some lecturers are still missing from the lecture table. Because that table is to the left of the word JOIN, this thus becomes a LEFT OUTER JOIN.

This code is completely equivalent to this variant:

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM course c RIGHT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
ORDER BY 3;

After all, we simply swapped the order in which the two tables are written. This does not affect the INNER JOINwhich happens first, but you must therefore replace the word LEFT with RIGHT. The following figure shows the result. In comparison with the previous figure, a number of rows are added. For example, there is lecturer Elke Crabbé who is not coordinating any course. Since she does not appear in the table ‘courses’, there is also no information about the name of the course. This is why there is null (no value) in the last column (orange line).

Explain why the following two queries return the exact same result:

-- query 1
SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM course c LEFT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
ORDER BY 3;

-- query 2
SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM course c INNER JOIN lecturer L ON L.lecturer_id = c.coordinator
ORDER BY 3;

From query 2 with the INNER JOIN you already know the result. Query 1 is a LEFT OUTER JOIN. . Each OUTER JOIN first starts as a regular INNER JOIN . Once that is done, it looks at which rows from the left table are missing. Those are then added. Here, however, the left table, is the ‘courses’ table. Every row of this table is already represented in the INNER JOIN (why?), so no more rows are added. Both queries therefore return the same result.

JOIN and GROUP BY

JOIN can be combined with other components without any problem, so also with GROUP BY and HAVING.

We will work out the following example. Please try all the steps yourself. You will make mistakes – often provoked by us – but that is the best way to learn.

List all teachers with the number of courses for which they are coordinator. Even lecturers who are not coordinators of any course, should be included in the list. Sort the list alphabetically by name.

Analyze the task:

To predict what we should obtain, we start from the known code that generated the figure in the section above:

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM course c RIGHT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
ORDER BY 3;

See the list in that figure (above). Lecturer ‘Gerben Adriaens’ is coordinator of one course, ‘Goedele Bogers’ of two, ‘Elke Crabbé’ of none etc.

Customize this code with a GROUP BY:

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM course c RIGHT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
GROUP BY lecturer_id
ORDER BY 3;

You get an error message (think for a moment why):

ERROR: column "c.name" must appear in the GROUP BY clause or be used
in an aggregate function
LINE 2: SELECT lecturer_id, L.first_name, L.last_name, c.name;

Why is there an error message about ‘c.name’? If you studied the chapter on GROUP BY, you should be able to provide the answer to this question. By grouping the information into boxes per lecturer, the detailed information about the courses is lost. The only thing you can still retrieve is information from the so-called ‘aggregation functions’ MIN(), MAX(), AVG(), COUNT() and SUM().

Something strange is going on, however. Why is there no error message about L.first_name and L.last_name? We had explained that in the SELECT can only list the columns that are in the GROUP BY (and aggregation functions as well)? Then why do you get no error message for L.first_name and L.last_name?

The reason is that here you group on the primary key of the lecturer table. So you are sure that one box will be created per lecturer. In that box there is only one lecturer name, first name, etc., but multiple courses. Therefore, you can retrieve the columns L.name and L.first name, but not c.name.

So you have to use an aggregation function in the code. In this case we need to count the number of times something occurs, so a COUNT() funtion is required. Modify the code to:

SELECT lecturer_id, L.first_name, L.last_name, COUNT(*)
FROM course c RIGHT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
GROUP BY lecturer_id
ORDER BY 3;

You obtain this (wrong!) result:

There is a problem with this figure (and thus with our code). Compare for a moment with the previous figure. Gerben does have 1 course, Goedele is coordinator of two courses, but Elke Crabbé is not the coordinator of any course. However, the result of our code is that Elke Crabbé is coordinator of one course (see orange arrow in the figure). What goes wrong?

The error is in the aggregation function COUNT(*). This function (with *) counts all rows, including those with one or more NULL values. So the row containing Elke Crabbé's info is effectively counted. This problem is fortunately easy to solve. Instead of the * enter the column of interest for what you want to count. Specifically, here we are talking about ‘c.name’. A value NULL in this field will not be counted. So you end the SELECT with COUNT(c.name):

SELECT lecturer_id, L.first_name, L.last_name, COUNT(c.name)
FROM course c RIGHT OUTER JOIN lecturer L ON L.lecturer_id = c.coordinator
GROUP BY lecturer_id
ORDER BY 3;

Select rows that do NOT meet a certain condition

Note: This is a quite difficult topic. Exercises that rely on this principle are quite tough!

You get a new request from your manager: “I would like a list of all lecturers who are NOT coordinators of one or multiple courses.”.

Piece of cake, of course! With the OUTER JOIN of previous section you got a list of all coordinators, supplemented by the lecturers who do not appear in the course table and thus are not coordinators. Those “added lecturers” are easily recognisable by the absence of a value for the column with the course name. Filtering rows after running the FROM is done with the WHERE component. The query thus becomes:

SELECT lecturer_id, L.first_name, L.last_name, c.name
FROM lecturer L LEFT OUTER JOIN course c ON L.lecturer_id = c.coordinator
WHERE name = NULL -- idea is OK, code not! See assignment below
ORDER BY 3;

We intentionally put an error in the WHERE component of the above query. Correct it so that the query is executed correctly and you obtain a list of four lecturers: Elke, Maarten, Patrick and Lut.

Comparing with NULL should never be done with ‘=’ but must be done with the word ‘IS’. The correct WHERE component is: WHERE c.name IS NULL.

Exercise

List all teachers who are not coordinators of an course that has 6 SP. Order alphabetically by name and then by first name. Your should obtain the result from this figure:

Tip: start this exercise with more columns than just first name and last name (possibly even with all columns!). First, create an INNER JOIN. Look at the result. Now transform your query into a OUTER JOIN and see what was added.

SELECT L.last_name, l.first_name
FROM course o RIGHT OUTER JOIN lecturer l ON
  o.coordinator = l.lecturer_id AND o.credits = 6
WHERE o.credits IS NULL
ORDER BY 1, 2;

FULL OUTER JOIN

An OUTER JOIN is an INNER JOIN where all missing elements from one table (those attached to the LEFT or the RIGHT side) are added. There is also a FULL OUTER JOIN which essentially means adding the missing elements from both the left and the right table . In our example with only two tables (lecturers and courses), we cannot create a meaningful FULL OUTER JOIN however. This will be covered later in the exercises. The video at the beginning of this chapter does explain it clearly.

Summary overview JOIN

Below is another overview of all the different types of JOIN that we have seen. For each type we list how the set of rows that the rest of the query is run with is constructed.