Understanding SQL Join

Understanding SQL Join

Basic set theory

Set

Relational databases are based on set theory. A set is a collection of elements, with no ordering and no repeats. For example, {1, 2, 3, 4} is a set, which is same as {2, 1, 3, 4}, because sets are unordered. And {1, 1, 2, 3} is not a set, because there is a repeated element 1.

Tuple

A tuple is a list of elements, e.g., (a, b, c, d, …). An n-tuple is a tuple with n elements, meanwhile the order of elements is maintained. For example, (a, b) is a 2-tuple, while (b, a) is a different 2-tuple.

Cartesian product

For example, there are two sets A = {1, 2} and B = {a, b}, so the Cartesian product of A and B is:

A x B = {1,2} x {a,b} = {(1,a), (1,b), (2,a), (2,b)}

Note that the result of Cartesian product is a set of tuple, which means the elements (1, a), (1, b), (2, a) and (2, b) are ordered pairs(2-tuples), therefore we can easily know that A x B is not same as B x A.

cartesian_product

SQL relations

In SQL, a relation is a set of tuples:

sql_relation

As shown in the above picture, in this so-called relation set, each row is an element, and each element is a tuple. A relation has no duplicates, so no tuple can appear twice. As for the attribute, the values in each attribute(column) have a same type. In addition, a key is an attribute of a relation the value of which is unique for each tuple.

In SQL, we use tables to “implement” or realise relations.

Understanding Join in SQL

Inner Join

In this article, we use PostgreSQL to explian Join operation. Let’s firstly create two tables course and student:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
CREATE TABLE COURSE
(
ID SERIAL PRIMARY KEY,
COURSENAME VARCHAR( 20 ),
DETAILS VARCHAR( 100 )
) ;

CREATE TABLE STUDENT
(
STUDENTID SERIAL PRIMARY KEY,
NAME VARCHAR( 30 ),
COURSEID INT REFERENCES COURSE( ID )
) ;

INSERT INTO COURSE
( COURSENAME, DETAILS )
VALUES
( 'MSc', 'The MSc CS Course' ),
( 'BSc', 'The BSc CS Course' ) ;

INSERT INTO STUDENT
( NAME, COURSEID )
VALUES
( 'S1', 1 ),
( 'S2', 2 ),
( 'S3', 1 ) ;

So, now we have two tables in PostgreSQL:

student_and_course_table

And if we query by SELECT * FROM STUDENT, COURSE; and SELECT * FROM COURSE, STUDENT;, SQL will return the Cartesian product(cross product) of student and course:

select_result

You might have noticed that because of the different order, STUDENT, COURSE and COURSE, STUDENT are not the same. If we want to realize a Join using this cross product, we should add a WHERE clause, i.e., SELECT * FROM STUDENT, COURSE WHERE STUDENT.COURSEID = COURSE.ID;, which is equivalent to SELECT * FROM STUDENT JOIN COURSE ON (STUDENT.COURSEID = COURSE.ID);.

join_and_cross_product

Actually, a Join(inner join, in SQL, joins are by default inner join, therefore we can omit inner) provides an implicit way of simultaneously selecting the cross product AND applying the where condition to filter out the rows we are not interested in.

Natural Join

Now we create two new tables:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
CREATE TABLE LANGUAGE
(
LANGUAGEID SERIAL PRIMARY KEY,
DESCRIPTION VARCHAR( 20 )
) ;

CREATE TABLE FILM
(
FILMID SERIAL PRIMARY KEY,
TITLE VARCHAR( 20 ),
CATEGORY VARCHAR( 100 ),
LANGUAGEID INT REFERENCES LANGUAGE( LANGUAGEID )
) ;

INSERT INTO LANGUAGE
( DESCRIPTION )
VALUES
( 'English' ),
( 'Chinese' ),
( 'French' );

INSERT INTO FILM
( TITLE, CATEGORY, LANGUAGEID )
VALUES
( 'Braveheart', 'HISTORY', 1 ),
( 'The Ghouls', 'ACTION', 2 ),
( 'Jeux d enfants', 'LOVE', 3 );

Now we have two new tables in PostgreSQL:

natural_join_1

We can use natural join to join these two tables:

SELECT TITLE, DESCRIPTION FROM FILM NATURAL JOIN LANGUAGE;

which is equivalent to the following inner join statement:

SELECT TITLE, DESCRIPTION FROM FILM JOIN LANGUAGE ON FILM.LANGUAGEID = LANGUAGE.LANGUAGEID;

natural_join_2

Note that natural join eliminates duplicate columns, while inner join cannot do this for us.For example:

natural_join_3

One interesting thing is that if we use a table to natural join itself:

SELECT * FROM FILM A NATURAL JOIN FILM B;

It returns the same table FILM, that is, the above statement is equivalent to:

SELECT * FROM FILM;

natural_join_4

How do we confirm this? We can use the following statement:

1
2
3
(SELECT * FROM FILM A NATURAL JOIN FILM B) EXCEPT (SELECT * FROM FILM)
UNION ALL
(SELECT * FROM FILM) EXCEPT (SELECT * FROM FILM A NATURAL JOIN FILM B);

natural_join_5

Why we need UNION? Because we have to check in two directions to make sure that:

1
2
A - B = ∅
B - A = ∅

Then we can know that A = B.

Outer Join

Now let’s first insert some new values into the table FILM and LANGUAGE:

1
2
3
4
5
6
7
8
9
10
11
INSERT INTO LANGUAGE 
( DESCRIPTION )
VALUES
( 'Spanish' ),
( 'Korean' );

INSERT INTO FILM
( TITLE, CATEGORY, LANGUAGEID )
VALUES
( 'Seven Samurai', 'HISTORY', NULL ),
( 'Rashomon', NULL, NULL );

Now these two tables are like this:

outer_join_1

Here we only talk about left join. We can use left outer join, or left join, or natural left outer join, or natural left join:

1
2
3
4
5
6
7
8
9
10
11
-- 1. left outer join
SELECT TITLE, DESCRIPTION FROM FILM LEFT OUTER JOIN LANGUAGE USING(LANGUAGEID);

-- 2. left join
SELECT TITLE, DESCRIPTION FROM FILM LEFT JOIN LANGUAGE USING(LANGUAGEID);

-- 3. natural left outer join
SELECT TITLE, DESCRIPTION FROM FILM NATURAL LEFT OUTER JOIN LANGUAGE;

-- 4. natural left join
SELECT TITLE, DESCRIPTION FROM FILM NATURAL LEFT JOIN LANGUAGE;

outer_join_2

In this case, we can interpret the outer join like this:

outer_join_3

  1. We firstly select out all the titles in the FILM table.
  2. Then we select out the matching decription in the LANGUAGE table. For a certain title, if there is no equivalent description, we just show NULL.

For outer join, a tricky thing is like this:

outer_join_4

As shown, if we use SELECT TITLE, DESCRIPTION FROM FILM NATURAL LEFT JOIN LANGUAGE;, we can get all the titles. Based on that, if we add WHERE LANGUAGEID != 2, it would eliminate the titles whose mathching language id is 2 or matching language is NULL.