Understanding SQL Join
Understanding SQL Join
Basic set theory
Set
Relational databases are based on set theory
. A set is a collection of elements, with no ordering and no repeats. For example, {1, 2, 3, 4}
is a set, which is same as {2, 1, 3, 4}
, because sets are unordered. And {1, 1, 2, 3}
is not a set, because there is a repeated element 1
.
Tuple
A tuple is a list of elements, e.g., (a, b, c, d, …). An n-tuple is a tuple with n elements, meanwhile the order of elements is maintained. For example, (a, b)
is a 2-tuple, while (b, a)
is a different 2-tuple.
Cartesian product
For example, there are two sets A = {1, 2} and B = {a, b}, so the Cartesian product of A and B is:
A x B = {1,2} x {a,b} = {(1,a), (1,b), (2,a), (2,b)}
Note that the result of Cartesian product is a set of tuple, which means the elements (1, a)
, (1, b)
, (2, a)
and (2, b)
are ordered pairs(2-tuples), therefore we can easily know that A x B
is not same as B x A
.
SQL relations
In SQL, a relation is a set of tuples:
As shown in the above picture, in this so-called relation set, each row is an element, and each element is a tuple. A relation has no duplicates, so no tuple can appear twice. As for the attribute
, the values in each attribute(column) have a same type. In addition, a key
is an attribute of a relation the value of which is unique for each tuple.
In SQL, we use tables to “implement” or realise relations.
Understanding Join in SQL
Inner Join
In this article, we use PostgreSQL to explian Join
operation. Let’s firstly create two tables course
and student
:
1 | CREATE TABLE COURSE |
So, now we have two tables in PostgreSQL:
And if we query by SELECT * FROM STUDENT, COURSE;
and SELECT * FROM COURSE, STUDENT;
, SQL will return the Cartesian product(cross product) of student
and course
:
You might have noticed that because of the different order, STUDENT, COURSE
and COURSE, STUDENT
are not the same. If we want to realize a Join
using this cross product, we should add a WHERE clause, i.e., SELECT * FROM STUDENT, COURSE WHERE STUDENT.COURSEID = COURSE.ID;
, which is equivalent to SELECT * FROM STUDENT JOIN COURSE ON (STUDENT.COURSEID = COURSE.ID);
.
Actually, a Join
(inner join
, in SQL, joins are by default inner join, therefore we can omit inner
) provides an implicit way of simultaneously selecting the cross product AND applying the where condition to filter out the rows we are not interested in.
Natural Join
Now we create two new tables:
1 | CREATE TABLE LANGUAGE |
Now we have two new tables in PostgreSQL:
We can use natural join
to join these two tables:
SELECT TITLE, DESCRIPTION FROM FILM NATURAL JOIN LANGUAGE;
which is equivalent to the following inner join
statement:
SELECT TITLE, DESCRIPTION FROM FILM JOIN LANGUAGE ON FILM.LANGUAGEID = LANGUAGE.LANGUAGEID;
Note that natural join eliminates duplicate columns, while inner join cannot do this for us.For example:
One interesting thing is that if we use a table to natural join itself:
SELECT * FROM FILM A NATURAL JOIN FILM B;
It returns the same table FILM
, that is, the above statement is equivalent to:
SELECT * FROM FILM;
How do we confirm this? We can use the following statement:
1 | (SELECT * FROM FILM A NATURAL JOIN FILM B) EXCEPT (SELECT * FROM FILM) |
Why we need UNION
? Because we have to check in two directions to make sure that:
1 | A - B = ∅ |
Then we can know that A = B
.
Outer Join
Now let’s first insert some new values into the table FILM
and LANGUAGE
:
1 | INSERT INTO LANGUAGE |
Now these two tables are like this:
Here we only talk about left join. We can use left outer join
, or left join
, or natural left outer join
, or natural left join
:
1 | -- 1. left outer join |
In this case, we can interpret the outer join like this:
- We firstly select out all the titles in the
FILM
table. - Then we select out the matching decription in the LANGUAGE table. For a certain title, if there is no equivalent description, we just show
NULL
.
For outer join, a tricky thing is like this:
As shown, if we use SELECT TITLE, DESCRIPTION FROM FILM NATURAL LEFT JOIN LANGUAGE;
, we can get all the titles. Based on that, if we add WHERE LANGUAGEID != 2
, it would eliminate the titles whose mathching language id is 2 or matching language is NULL
.