SQL GROUP BY

Spread the love

Sometimes you will want to find trends in your data that will require the database server to cook the data a bit before you can generate the results you are looking for. For example, let’s say that you are in charge of sending coupons for free rentals to your best customers. You could issue a simple query to look at the raw data. In this post we will use the – sakila database .

SELECT customer_id 
FROM rental;

With 599 customers spanning more than 16,000 rental records, it isn’t feasible to determine which customers have rented the most films by looking at the raw data. Instead, you can ask the database server to group the data for you by using the group by clause. Here’s the same query but employing a group by clause to group the rental data by customer ID:

SELECT customer_id 
FROM rental
GROUP BY customer_id;

The result set contains one row for each distinct value in the customer_id column, resulting in 599 rows instead of the full 16,044 rows. The reason for the smaller result set is that some of the customers rented more than one film. To see how many films each customer rented, you can use an aggregate function in the select clause to count the number of rows in each group:

SELECT 
    customer_id,
    COUNT(*)
FROM rental
GROUP BY customer_id;

To get the customer who rented the most amount of time we can order the result using ORDER BY clause in descending order.

SELECT 
    customer_id,
    COUNT(*)
FROM rental
GROUP BY customer_id
ORDER BY 2 DESC;

The customer with id 148 rented highest number of times.

When grouping data, you may need to filter out undesired data from your result set based on groups of data rather than based on the raw data. Since the group by clause runs after the where clause has been evaluated, you cannot add filter conditions to your where clause for this purpose. For example, here’s an attempt to filter out any customers who have rented fewer than 40 films:

SELECT 
    customer_id,
    COUNT(*)
FROM rental
WHERE count(*) >= 40
GROUP BY customer_id
ORDER BY 2 DESC;

You cannot refer to the aggregate function count(*) in your where clause, because the groups have not yet been generated at the time the where clause is evaluated. Instead, you must put your group filter conditions in the having clause. Here’s what the query would look like using having:

SELECT 
    customer_id,
    COUNT(*)
FROM rental
GROUP BY customer_id
HAVING COUNT(*) >= 40
ORDER BY 2 DESC;

Because those groups containing fewer than 40 members have been filtered out via the having clause, the result set now contains only those customers who have rented 40 or more films.

Aggregate Functions –

Aggregate functions perform a specific operation over all rows in a group. Although every database server has its own set of specialty aggregate functions, the common aggregate functions implemented by all major servers include:

max() – Returns the maximum value within a set

min() – Returns the minimum value within a set

avg() – Returns the average value across a set

sum() – Returns the sum of the values across a set

count() – Returns the number of values in a set

Here’s a query that uses all of the common aggregate functions to analyze the data on film rental payments:

SELECT
    MAX(amount) max_amt,
    MIN(amount) min_amt,
    AVG(amount) avg_amt,
    SUM(amount) tot_amt,
    COUNT(*) num_payments
FROM payment;

The results from this query tell you that, across the 16,049 rows in the payment table, the maximum amount paid to rent a film was $11.99, the minimum amount was $0, the average payment was $4.20, and the total of all rental payments was $67,416.51. Hopefully, this gives you an appreciation for the role of these aggregate functions; the next subsections further clarify how you can utilize these functions.

Implicit Versus Explicit Groups –

In the previous example, every value returned by the query is generated by an aggregate function. Since there is no group by clause, there is a single, implicit group (all rows in the payment table). In most cases, however, you will want to retrieve additional columns along with columns generated by aggregate functions. What if, for example, you wanted to extend the previous query to execute the same five aggregate functions for each customer, instead of across all customers? For this query, you would want to retrieve the customer_id column along with the five aggregate functions, as in:

SELECT
    customer_id,
    MAX(amount) max_amt,
    MIN(amount) min_amt,
    AVG(amount) avg_amt,
    SUM(amount) tot_amt,
    COUNT(*) num_payments
FROM payment;

However, if you try to execute the above query either you will get an error or get very unexpected result as shown below.

While it may be obvious to you that you want the aggregate functions applied to each customer found in the payment table, this query fails because you have not explicitly specified how the data should be grouped. Therefore, you will need to add a group by clause to specify over which group of rows the aggregate functions should be applied:

SELECT
    customer_id,
    MAX(amount) max_amt,
    MIN(amount) min_amt,
    AVG(amount) avg_amt,
    SUM(amount) tot_amt,
    COUNT(*) num_payments
FROM payment
GROUP BY customer_id;

With the inclusion of the group by clause, the server knows to group together rows having the same value in the customer_id column first and then to apply the five aggregate functions to each of the 599 groups.

Counting Distinct Values –

When using the count() function to determine the number of members in each group, you have your choice of counting all members in the group or counting only the distinct values for a column across all members of the group.

For example, consider the following query, which uses the count() function with the customer_id column in two different ways:

SELECT 
    COUNT(customer_id) num_rows,
    COUNT(DISTINCT customer_id) num_customers
FROM payment;

The first column in the query simply counts the number of rows in the payment table, whereas the second column examines the values in the customer_id column and counts only the number of unique values. By specifying distinct, therefore, the count() function examines the values of a column for each member of the group in order to find and remove duplicates, rather than simply counting the number of values in the group.

Using Expressions –

Along with using columns as arguments to aggregate functions, you can use expressions as well. For example, you may want to find the maximum number of days between when a film was rented and subsequently returned. You can achieve this via the following query:

SELECT MAX(datediff(return_date, rental_date))
FROM rental;

The datediff function is used to compute the number of days between the return date and the rental date for every rental, and the max function returns the highest value, which in this case is 10 days. While this example uses a fairly simple expression, expressions used as arguments to aggregate functions can be as complex as needed, as long as they return a number, string, or date.

How Nulls Are Handled –

When performing aggregations, or, indeed, any type of numeric calculation, you should always consider how null values might affect the outcome of your calculation. To illustrate, I will build a simple table to hold numeric data and populate it with the set {1, 3, 5}:

CREATE TABLE number_tbl
(val SMALLINT);

INSERT INTO number_tbl
VALUES
(1),
(3),
(5);

Consider the following query, which performs five aggregate functions on the set of numbers:

SELECT 
    COUNT(*) num_rows,
    COUNT(val) num_vals,
    SUM(val) total,
    MAX(val) max_val,
    AVG(val) avg_val
FROM number_tbl;

The results are as you would expect: both count(*) and count(val) return the value 3, sum(val) returns the value 9, max(val) returns 5, and avg(val) returns 3. Next, I will add a null value to the number_tbl table and run the query again:

INSERT INTO number_tbl VALUES (NULL);

SELECT 
    COUNT(*) num_rows,
    COUNT(val) num_vals,
    SUM(val) total,
    MAX(val) max_val,
    AVG(val) avg_val
FROM number_tbl;

Even with the addition of the null value to the table, the sum(), max(), and avg() functions all return the same values, indicating that they ignore any null values encountered. The count() function now returns the value 4, which is valid since the number_tbl table contains four rows, while the count(val) function still returns the value 3. The difference is that count() counts the number of rows, whereas count(val) counts the number of values contained in the val column and ignores any null values encountered.

Generating Groups –

People are rarely interested in looking at raw data; instead, people engaging in data analysis will want to manipulate the raw data to better suit their needs. Examples of common data manipulations include:

Generating totals for a geographic region, such as total European sales

Finding outliers, such as the top salesperson for 2020

Determining frequencies, such as the number of films rented in each month

To answer these types of queries, you will need to ask the database server to group rows together by one or more columns or expressions. As you have seen already in several examples, the group by clause is the mechanism for grouping data within a query. In this section, you will see how to group data by one or more columns, how to group data using expressions, and how to generate rollups within groups.

Single-Column Grouping –

Single-column groups are the simplest and most-often-used type of grouping. If you want to find the number of films associated with each actor, for example, you need only group on the film_actor.actor_id column, as in:

SELECT 
    actor_id,
    COUNT(*)
FROM film_actor
GROUP BY actor_id;

This query generates 200 groups, one for each actor, and then sums the number of films for each member of the group.

Multiple column Grouping –

In some cases, you may want to generate groups that span more than one column. Expanding on the previous example, imagine that you want to find the total number of films for each film rating (G, PG, …) for each actor. The following example shows how you can accomplish this:

SELECT 
    fa.actor_id, 
    f.rating,
    COUNT(*)
FROM film_actor fa
INNER JOIN film f 
ON fa.film_id = f.film_id
GROUP BY fa.actor_id, f.rating
ORDER BY 1, 2;

This version of the query generates 996 groups, one for each combination of actor and film rating found by joining the film_actor table with the film table. Along with adding the rating column to the select clause, I also added it to the group by clause, since rating is retrieved from a table and is not generated via an aggregate function such as max or count.

Grouping via Expressions –

Along with using columns to group data, you can build groups based on the values generated by expressions. Consider the following query, which groups rentals by year:

SELECT 
    extract(YEAR FROM rental_date) year,
    COUNT(*) how_many
FROM rental
GROUP BY extract(YEAR FROM rental_date);

This query employs a fairly simple expression that uses the extract() function to return only the year portion of a date to group the rows in the rental table.

Rating: 1 out of 5.

Leave a Reply