What is a JOIN ?
Queries against a single table are certainly not rare, but you will find that most of your queries will require two, three, or even more tables. To illustrate, let’s look at the definitions for the customer and address tables and then define a query that retrieves data from both tables. The database used in this post – sakila
Let’s say you want to retrieve the first and last names of each customer, along with their street address. Your query will therefore need to retrieve the customer.first_name, customer.last_name, and address.address columns. But how can you retrieve data from both tables in the same query? The answer lies in the customer.address_id column, which holds the ID of the customer’s record in the address table (in more formal terms, the customer.address_id column is the foreign key to the address table). The query, which you will see shortly, instructs the server to use the customer.address_id column as the transportation between the customer and address tables, thereby allowing columns from both tables to be included in the query’s result set. This type of operation is known as a join.
Cartesian Product –
The easiest way to start is to put the customer and address tables into the from clause of a query and see what happens. Here’s a query that retrieves the customer’s first and last names along with the street address, with a from clause naming both tables separated by the join keyword:
SELECT c.first_name, c.last_name, a.address FROM customer c JOIN address a;
there are only 599 customers and 603 rows in the address table, so how did the result set end up with 361,197 rows? Looking more closely, you can see that many of the customers seem to have the same street address. Because the query didn’t specify how the two tables should be joined, the database server generated the Cartesian product, which is every permutation of the two tables (599 customers x 603 addresses = 361,197 permutations). This type of join is known as a cross join, and it is rarely used (on purpose, at least).
INNER JOIN –
To modify the previous query so that only a single row is returned for each customer, you need to describe how the two tables are related. Earlier, I showed that the customer.address_id column serves as the link between the two tables, so this information needs to be added to the on subclause of the from clause:
SELECT c.first_name, c.last_name, a.address FROM customer c JOIN address a ON c.address_id = a.address_id;
Instead of 361,197 rows, you now have the expected 599 rows due to the addition of the on subclause, which instructs the server to join the customer and address tables by using the address_id column to traverse from one table to the other. For example, Mary Smith’s row in the customer table contains a value of 5 in the address_id column (not shown in the example). The server uses this value to look up the row in the address table having a value of 5 in its address_id column and then retrieves the value ‘1913 Hanoi Way’ from the address column in that row.
If a value exists for the address_id column in one table but not the other, then the join fails for the rows containing that value, and those rows are excluded from the result set. This type of join is known as an inner join, and it is the most commonly used type of join. To clarify, if a row in the customer table has the value 999 in the address_id column and there’s no row in the address table with a value of 999 in the address_id column, then that customer row would not be included in the result set. If you want to include all rows from one table or the other regardless of whether a match exists, you need to specify an outer join, but this will be explained in later posts.
In the previous example, I did not specify in the from clause which type of join to use. However, when you wish to join two tables using an inner join, you should explicitly specify this in your from clause; here’s the same example, with the addition of the join type (note the keyword inner):
SELECT c.first_name, c.last_name, a.address FROM customer c INNER JOIN address a ON c.address_id = a.address_id;
If you do not specify the type of join, then the server will do an inner join by default.
The ANSI Join Syntax –
The notations used previously to join tables was introduced in the SQL92 version of the ANSI SQL standard. All the major databases (Oracle Database, Microsoft SQL Server, MySQL, IBM DB2 Universal Database, and Sybase Adaptive Server) have adopted the SQL92 join syntax. Because most of these servers have been around since before the release of the SQL92 specification, they all include an older join syntax as well. For example, all these servers would understand the following variation of the previous query:
SELECT c.first_name, c.last_name, a.address FROM customer c INNER JOIN address a WHERE c.address_id = a.address_id;
This older method of specifying joins does not include the on subclause; instead, tables are named in the from clause separated by commas, and join conditions are included in the where clause. While you may decide to ignore the SQL92 syntax in favor of the older join syntax, the ANSI join syntax has the following advantages:
Join conditions and filter conditions are separated into two different clauses (the on subclause and the where clause, respectively), making a query easier to understand.
The join conditions for each pair of tables are contained in their own on clause, making it less likely that part of a join will be mistakenly omitted.
Queries that use the SQL92 join syntax are portable across database servers, whereas the older syntax is slightly different across the different servers.
The benefits of the SQL92 join syntax are easier to identify for complex queries that include both join and filter conditions. Consider the following query, which returns only those customers whose postal code is 52137:
SELECT c.first_name, c.last_name, a.address FROM customer c INNER JOIN address a WHERE c.address_id = a.address_id AND a.postal_code = 52137;
At first glance, it is not so easy to determine which conditions in the where clause are join conditions and which are filter conditions. It is also not readily apparent which type of join is being employed (to identify the type of join, you would need to look closely at the join conditions in the where clause to see whether any special characters are employed), nor is it easy to determine whether any join conditions have been mistakenly left out. Here’s the same query using the SQL92 join syntax:
SELECT c.first_name, c.last_name, a.address FROM customer c INNER JOIN address a ON c.address_id = a.address_id WHERE a.postal_code = 52137;
With this version, it is clear which condition is used for the join and which condition is used for filtering.
Joining Three or More Tables –
Joining three tables is similar to joining two tables, but with one slight wrinkle. With a two-table join, there are two tables and one join type in the from clause, and a single on subclause to define how the tables are joined. With a three-table join, there are three tables and two join types in the from clause, and two on subclauses.
To illustrate, let’s change the previous query to return the customer’s city rather than their street address. The city name, however, is not stored in the address table but is accessed via a foreign key to the city table. Here are the table definitions:
To show each customer’s city, you will need to traverse from the customer table to the address table using the address_id column and then from the address table to the city table using the city_id column. The query would look like the following:
SELECT c.first_name, c.last_name, ct.city FROM customer c INNER JOIN address a ON c.address_id = a.address_id INNER JOIN city ct ON a.city_id = ct.city_id;
For this query, there are three tables, two join types, and two on subclauses in the from clause, so things have gotten quite a bit busier. At first glance, it might seem like the order in which the tables appear in the from clause is important, but if you switch the table order, you will get the exact same results.
Using subqueries as Tables –
The following query joins the customer table to a subquery against the address and city tables:
SELECT c.first_name, c.last_name, addr.address, addr.city FROM customer c INNER JOIN (SELECT a.address_id, a.address, ct.city FROM address a INNER JOIN city ct ON a.city_id = ct.city_id WHERE a.district = 'California' ) addr ON c.address_id = addr.address_id;
The subquery, which starts on line 4 and is given the alias addr, finds all addresses that are in California. The outer query joins the subquery results to the customer table to return the first name, last name, street address, and city of all customers who live in California. While this query could have been written without the use of a subquery by simply joining the three tables, it can sometimes be advantageous from a performance and/or readability aspect to use one or more subqueries.
One way to visualize what is going on is to run the subquery by itself and look at the results. Here are the results of the subquery from the prior example:
SELECT a.address_id, a.address, ct.city FROM address a INNER JOIN city ct ON a.city_id = ct.city_id WHERE a.district = 'California'
This result set consists of all nine California addresses. When joined to the customer table via the address_id column, your result set will contain information about the customers assigned to these addresses.