What is a Subquery ?
A subquery is a query contained within another SQL statement. A subquery is always enclosed within parentheses, and it is usually executed prior to the containing statement. Like any query, a subquery returns a result set that may consist of:
A single row with a single column
Multiple rows with a single column
Multiple rows having multiple columns
The type of result set returned by the subquery determines how it may be used and which operators the containing statement may use to interact with the data the subquery returns. When the containing statement has finished executing, the data returned by any subqueries is discarded, making a subquery act like a temporary table with statement scope (meaning that the server frees up any memory allocated to the subquery results after the SQL statement has finished execution).
An example of a subquery is shown below.
SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = (SELECT MAX(customer_id) FROM customer);
In this example, the subquery returns the maximum value found in the customer_id column in the customer table, and the containing statement then returns data about that customer. If you are ever confused about what a subquery is doing, you can run the subquery by itself (without the parentheses) to see what it returns. Here’s the subquery from the previous example:
SELECT MAX(customer_id) FROM customer;
The subquery returns a single row with a single column, which allows it to be used as one of the expressions in an equality condition (if the subquery returned two or more rows, it could be compared to something but could not be equal to anything, but more on this later). In this case, you can take the value the subquery returned and substitute it into the right hand expression of the filter condition in the containing query, as in the following:
SELECT customer_id, first_name, last_name FROM customer WHERE customer_id = 599;
Subquery Types –
Along with the differences noted previously regarding the type of result set returned by a subquery (single row/column, single row/multi column, or multiple columns), you can use another feature to differentiate subqueries; some subqueries are completely self-contained (called non correlated subqueries), while others reference columns from the containing statement (called correlated subqueries). The next several sections explore these two subquery types and show the different operators that you can employ to interact with them.
Noncorrelated Subqueries –
The example from earlier in the post is a noncorrelated subquery; it may be executed alone and does not reference anything from the containing statement. Most subqueries that you encounter will be of this type unless you are writing update or delete statements, which frequently make use of correlated subqueries (more on this later). Along with being noncorrelated, the example from earlier in the post also returns a result set containing a single row and column. This type of subquery is known as a scalar subquery and can appear on either side of a condition using the usual operators (=, <>, <, >, <=, >=). The next example shows how you can use a scalar subquery in an inequality condition:
SELECT city_id, city FROM city WHERE country_id <> (SELECT country_id FROM country WHERE country = 'India');
This query returns all cities that are not in India. The subquery, which is found on the last line of the statement, returns the country ID for India, and the containing query returns all cities that do not have that country ID. While the subquery in this example is quite simple, subqueries may be as complex as you need them to be, and they may utilize any and all the available query clauses (select, from, where, group by, having, and order by).
If you use a subquery in an equality condition but the subquery returns more than one row, you will receive an error. For example, if you modify the previous query such that the subquery returns all countries except for India, you will receive the following error:
SELECT city_id, city FROM city WHERE country_id <> (SELECT country_id FROM country WHERE country <> 'India');
If you run the subquery by itself, you will see the following results:
SELECT country_id FROM country WHERE country <> 'India';
The containing query fails because an expression (country_id) cannot be equated to a set of expressions (country_ids 1, 2, 3, …, 109). In other words, a single thing cannot be equated to a set of things. In the next section, you will see how to fix the problem by using a different operator.
Multiple-Row, Single-Column Subqueries –
If your subquery returns more than one row, you will not be able to use it on one side of an equality condition, as the previous example demonstrated. However, there are four additional operators that you can use to build conditions with these types of subqueries.
The in and not in operators –
While you can’t equate a single value to a set of values, you can check to see whether a single value can be found within a set of values. The next example, while it doesn’t use a subquery, demonstrates how to build a condition that uses the in operator to search for a value within a set of values:
SELECT country_id FROM country WHERE country IN ('Canada','Mexico');
The expression on the left hand side of the condition is the country column, while the right hand side of the condition is a set of strings. The in operator checks to see whether either of the strings can be found in the country column; if so, the condition is met, and the row is added to the result set. You could achieve the same results using two equality conditions, as in:
SELECT country_id FROM country WHERE country = 'Canada' OR country = 'Mexico';
While this approach seems reasonable when the set contains only two expressions, it is easy to see why a single condition using the in operator would be preferable if the set contained dozens (or hundreds, thousands, etc.) of values.
Although you will occasionally create a set of strings, dates, or numbers to use on one side of a condition, you are more likely to generate the set using a subquery that returns one or more rows. The following query uses the in operator with a subquery on the right hand side of the filter condition to return all cities that are in Canada or Mexico:
SELECT city_id, city FROM city WHERE country_id IN (SELECT country_id FROM country WHERE country IN ('Canada', 'Mexico'));
Along with seeing whether a value exists within a set of values, you can check the converse using the not in operator. Here’s another version of the previous query using not in instead of in:
SELECT city_id, city FROM city WHERE country_id NOT IN (SELECT country_id FROM country WHERE country IN ('Canada', 'Mexico'));
This query finds all cities that are not in Canada or Mexico.
The all operator –
While the in operator is used to see whether an expression can be found within a set of expressions, the all operator allows you to make comparisons between a single value and every value in a set. To build such a condition, you will need to use one of the comparison operators (=, <>, <, >, etc.) in conjunction with the all operator. For example, the next query finds all customers who have never gotten a free film rental:
SELECT first_name, last_name FROM customer WHERE customer_id <> ALL (SELECT customer_id FROM payment WHERE amount = 0);
The subquery returns the set of IDs for customers who have paid $0 for a film rental, and the containing query returns the names of all customers whose ID is not in the set returned by the subquery. If this approach seems a bit clumsy to you, you are in good company; most people would prefer to phrase the query differently and avoid using the all operator. To illustrate, the previous query generates the same results as the next example, which uses the not in operator:
SELECT first_name, last_name FROM customer WHERE customer_id NOT IN (SELECT customer_id FROM payment WHERE amount = 0);
It’s a matter of preference, but I think that most people would find the version that uses not in to be easier to understand.
The any operator –
Like the all operator, the any operator allows a value to be compared to the members of a set of values; unlike all, however, a condition using the any operator evaluates to true as soon as a single comparison is favorable. This example will find all customers whose total film rental payments exceed the total payments for all customers in Bolivia, Paraguay, or Chile:
SELECT customer_id, sum(amount) FROM payment GROUP BY customer_id HAVING sum(amount) > ANY (SELECT sum(p.amount) FROM payment p INNER JOIN customer c ON p.customer_id = c.customer_id INNER JOIN address a ON c.address_id = a.address_id INNER JOIN city ct ON a.city_id = ct.city_id INNER JOIN country co on ct.country_id = co.country_id WHERE co.country IN ('Bolivia','Paraguay','Chile') GROUP BY co.country);
The subquery returns the total film rental fees for all customers in Bolivia, Paraguay, and Chile, and the containing query returns all customers who outspent at least one of these three countries.
Although most people prefer to use in, using = any is equivalent to using the in operator.
Multi column Subqueries –
So far, the subquery examples in this post have returned a single column and one or more rows. In certain situations, however, you can use subqueries that return two or more columns. To show the utility of multi column subqueries, it might help to look first at an example that uses multiple, single-column subqueries:
SELECT fa.actor_id, fa.film_id FROM film_actor fa WHERE fa.actor_id IN (SELECT actor_id FROM actor WHERE last_name = 'MONROE') AND fa.film_id IN (SELECT film_id FROM film WHERE rating = 'PG');
This query uses two subqueries to identify all actors with the last name Monroe and all films rated PG, and the containing query then uses this information to retrieve all cases where an actor named Monroe appeared in a PG film. However, you could merge the two single-column subqueries into one multi column subquery and compare the results to two columns in the film_actor table. To do so, your filter condition must name both columns from the film_actor table surrounded by parentheses and in the same order as returned by the subquery, as in:
SELECT actor_id, film_id FROM film_actor WHERE (actor_id, film_id) IN (SELECT a.actor_id, f.film_id FROM actor a CROSS JOIN film f WHERE a.last_name = 'MONROE' AND f.rating = 'PG');
This version of the query performs the same function as the previous example, but with a single subquery that returns two columns instead of two subqueries that each return a single column. The subquery in this version uses a type of join called a cross join, which will be explored in our later posts. The basic idea is to return all combinations of actors named Monroe (2) and all films rated PG (194) for a total of 388 rows, 11 of which can be found in the film_actor table.
Correlated Subqueries –
All of the subqueries shown thus far have been independent of their containing statements, meaning that you can execute them by themselves and inspect the results. A correlated subquery, on the other hand, is dependent on its containing statement from which it references one or more columns. Unlike a noncorrelated subquery, a correlated subquery is not executed once prior to execution of the containing statement; instead, the correlated subquery is executed once for each candidate row (rows that might be included in the final results). For example, the following query uses a correlated subquery to count the number of film rentals for each customer, and the containing query then retrieves those customers who have rented exactly 20 films:
SELECT c.first_name, c.last_name FROM customer c WHERE 20 = (SELECT count(*) FROM rental r WHERE r.customer_id = c.customer_id);
The reference to c.customer_id at the very end of the subquery is what makes the subquery correlated; the containing query must supply values for c.customer_id for the subquery to execute. In this case, the containing query retrieves all 599 rows from the customer table and executes the subquery once for each customer, passing in the appropriate customer ID for each execution. If the subquery returns the value 20, then the filter condition is met, and the row is added to the result set.
Along with equality conditions, you can use correlated subqueries in other types of conditions, such as the range condition illustrated here.
SELECT c.first_name, c.last_name FROM customer c WHERE (SELECT sum(p.amount) FROM payment p WHERE p.customer_id = c.customer_id) BETWEEN 180 AND 240;
This variation on the previous query finds all customers whose total payments for all film rentals are between $180 and $240. Once again, the correlated subquery is executed 599 times (once for each customer row), and each execution of the subquery returns the total account balance for the given customer.