SQL is the most popular language for turning raw data stored in a database into actionable insights. Intermediate SQL course – DatacampUsing a database of films made around the world, this course covers:
- How to filter and compare data
- How to use aggregate functions to summarize data
- How to sort and group your data
- How to present your data cleanly using tools such as rounding and aliasing
Accompanied at every step with hands-on practice queries, this course teaches you everything you need to know to analyze data using your own SQL code today!
1. Selecting Data
In this first chapter, you’ll learn how to query a films database and select the data needed to answer questions about the movies and actors. You’ll also understand how SQL code is executed and formatted.
1.1. Querying a database
Learning to COUNT()
You saw how to use COUNT() in the video. Do you remember what it returns?
Here is a query counting film_id. Select the answer below that correctly describes what the query will return.
SELECT COUNT(film_id) AS count_film_id
Run the query in the console to test your theory!
- e number of unique films in the reviews table.
- The number of records containing a film_id.
- The total number of records in the reviews table.
- The sum of the film_id field.
Correct! COUNT(field_name) returns the number of records containing a value in a field. In this example, that field is film_id.
Practice with COUNT()
As you’ve seen, COUNT(*) tells you how many records are in a table. However, if you want to count the number of non-missing values in a particular field, you can call COUNT() on just that field.
Let’s get some practice with COUNT()! You can look at the data in the tables throughout these exercises by clicking on the table name in the console.
- Count the number of records in the people table, aliasing the result as count_records.
-- Count the number of records in the people table SELECT COUNT(people) AS count_records FROM people;
Count the number of records with a birthdate in the people table, aliasing the result as count_birthdate.
-- Count the number of birthdates in the people table SELECT COUNT(birthdate) AS count_birthdate FROM people;
Count the languages and countries in the films table; alias as count_languages and count_countries.
-- Count the languages and countries represented in the films table SELECT COUNT(language) AS count_languages, COUNT(country) AS count_countries FROM films;
Tres Bien! Looking at the differences between the count of unique values, total values, and all records can provide useful insights into your data.
Often query results will include many duplicate values. You can use the DISTINCT keyword to select the unique values from a field.
This might be useful if, for example, you’re interested in knowing which languages are represented in the films table. See if you can find out what countries are represented in this table with the following exercises.Return the unique countries represented in the films table using DISTINCT.
-- Return the unique countries from the films table SELECT DISTINCT country FROM films;
Return the number of unique countries represented in the films table, aliased as count_distinct_countries.
-- Count the distinct countries from the films table SELECT COUNT(DISTINCT country) AS count_distinct_countries FROM films;
Congratulations! Using DISTINCT is a great tool to see the unique values of a dataset. This table has 64 unique countries.
1.2. Query execution
Order of execution
SQL code is processed differently than other programming languages in that you need to let the processor know where to pull the data from before making selections.
It’s essential to know your code’s order of execution to understand what results you’ll get from your query and how to fix any errors that may come up.
- Drag the SQL keywords into the order that they will be executed in from first (top) to last (bottom).
Congratulations! This is the correct order of execution. It makes sense that SQL needs to SELECT data FROM a table before it can LIMIT the results.
Debugging is an essential skill for all coders, and it comes from making many mistakes and learning from them.
In this exercise, you’ll be given some buggy code that you’ll need to fix.
- Debug and fix the SQL query provided.
-- Debug this code SELECT certification FROM films LIMIT 5;
Find the two errors in this code; the same error has been repeated twice.
-- Debug this code SELECT film_id, imdb_score, num_votes FROM reviews;
Find the two bugs in this final query.
-- Debug this code SELECT COUNT(birthdate) AS count_birthdays FROM people;
Excellent extermination of those bugs! This is an important skill that will come in very handy.
1.3. SQL style
SQL best practices
SQL style guides outline standard best practices for writing code.
This exercise will present several SQL style tips. Your job will be to decide whether they are considered best practices.
We’ll be following Holywell’s style guide.
- Drag and drop the items into the correct zone.
Well done! You’ll soon become everyone’s favorite SQL programmer.
Readable code is highly valued in the coding community and professional settings. Without proper formatting, code and results can be difficult to interpret. You’ll often be working with other people that need to understand your code or be able to explain your results, so having a solid formatting habit is essential.
In this exercise, you’ll correct poorly written code to better adhere to SQL style standards.
- Adjust the sample code so that it is in line with standard practices.
-- Rewrite this query select person_id, role from roles limit 10
-- Rewrite this query SELECT person_id, role FROM roles LIMIT 10;
Great work formatting the code! Clean code allows for clean communication.
You may occasionally receive a dataset with poorly named fields. Ideally, you would fix these, but you can work around it with some added punctuation in this instance.
A sample query and schema have been provided; imagine you need to be able to run it with a non-standard field name. Select the multiple-choice answer that would correctly fill in the blank to return both a film’s id and its number of Facebook likes for all reviews:
SELECT film_id, ___
- facebook likes
- “facebook likes”
- facebook, likes
Correct! Using double quotes around a non-standard name allows us to run the SQL query.
2. Filtering Records
Learn about how you can filter numerical and textual data with SQL. Filtering is an important use for this language. You’ll learn how to use new keywords and operators to help you narrow down your query to get results that meet your desired criteria and gain a better understanding of NULL values and how to handle them.
2.1. Filtering numbers
The WHERE clause allows you to filter based on text and numeric values in a table using comparison operators.
What does the following query return?
WHERE release_year > 2000;
- Films released before the year 2000
- Films released after the year 2000
- Films released after the year 2001
- Films released in 2000
Correct! Our query will return films that were released after the year 2000.
Using WHERE with numbers
Filtering with WHERE allows you to analyze your data better. You may have a dataset that includes a range of different movies, and you need to do a case study on the most notable films with the biggest budgets. In this case, you’ll want to filter your data to a specific budget range.
Now it’s your turn to use the WHERE clause to filter numeric values!
- Select the film_id and imdb_score from the reviews table and filter on scores higher than 7.0.
-- Select film_ids and imdb_score with an imdb_score over 7.0 SELECT film_id, imdb_score FROM reviews WHERE imdb_score > 7.0;
Select the film_id and facebook_likes of the first ten records with less than 1000 likes from the reviews table.
-- Select film_ids and facebook_likes for ten records with less than 1000 likes SELECT film_id, facebook_likes FROM reviews WHERE facebook_likes < 1000 LIMIT 10;
Count how many records have a num_votes of at least 100,000; use the alias films_over_100K_votes.
-- Count the records with at least 100,000 votes SELECT COUNT(num_votes) AS films_over_100K_votes FROM reviews WHERE num_votes >= 100000;
Well done! Applying a WHERE filter with SQL is much easier and faster than scrolling through a spreadsheet or using a highlighter!
Using WHERE with text
WHERE can also filter string values.
Imagine you are part of an organization that gives cinematography awards, and you have several international categories. Before you confirm an award for every language listed in your dataset, it may be worth seeing if there are enough films of a specific language to make it a fair competition. If there is only one movie or a significant skew, it may be worth considering a different way of giving international awards.
Let’s try this out!
- Select and count the language field using the alias count_spanish.
- Apply a filter to select only Spanish from the language field.
-- Count the Spanish-language films SELECT COUNT(language) AS count_spanish FROM films WHERE language = 'Spanish';
Bien hecho! Well done! There are 40 Spanish-language films in this table.
2.2. Multiple criteria
The following exercises combine AND and OR with the WHERE clause. Using these operators together strengthens your queries and analyses of data.
You will apply these new skills now on the films database.
- Select the title and release_year for all German-language films released before 2000.
-- Select the title and release_year for all German-language films released before 2000 SELECT title, release_year FROM films WHERE language = 'German' AND release_year < 2000;
Update the query from the previous step to show German-language films released after 2000 rather than before.
-- Update the query to see all German-language films released after 2000 SELECT title, release_year FROM films WHERE release_year > 2000 AND language = 'German';
Select all details for German-language films released after 2000 but before 2010 using only WHERE and AND.
-- Select all records for German-language films released after 2000 and before 2010 SELECT * FROM films WHERE (release_year > 2000 AND release_year < 2010) AND language ='German';
Great work! Combining conditions with AND will prove to be very useful when we want our query to return a specific subset of records.
This time you’ll write a query to get the title and release_year of films released in 1990 or 1999, which were in English or Spanish and took in more than $2,000,000 gross.
It looks like a lot, but you can build the query up one step at a time to get comfortable with the underlying concept in each step. Let’s go!
- Select the title and release_year for films released in 1990 or 1999 using only WHERE and OR.
-- Find the title and year of films from the 1990 or 1999 SELECT title, release_year FROM films WHERE release_year = 1990 OR release_year = 1999;
Filter the records to only include English or Spanish-language films.
SELECT title, release_year FROM films WHERE (release_year = 1990 OR release_year = 1999) -- Add a filter to see only English or Spanish-language films AND((language='English' OR language='Spanish'));
Finally, restrict the query to only return films worth more than $2,000,000 gross.
SELECT title, release_year FROM films WHERE (release_year = 1990 OR release_year = 1999) AND (language = 'English' OR language = 'Spanish') -- Filter films with more than $2,000,000 gross AND gross > 2000000;
Amazing work! Some of these queries were pretty complex. Any Die Hard fans out there?
Let’s use BETWEEN with AND on the films database to get the title and release_year of all Spanish-language films released between 1990 and 2000 (inclusive) with budgets over $100 million.
We have broken the problem into smaller steps so that you can build the query as you go along!Select the title and release_year of all films released between 1990 and 2000 (inclusive) using BETWEEN.
-- Select the title and release_year for films released between 1990 and 2000 SELECT title, release_year FROM films WHERE release_year BETWEEN 1990 AND 2000;
Build on your previous query to select only films with a budget over $100 million.
SELECT title, release_year FROM films WHERE release_year BETWEEN 1990 AND 2000 -- Narrow down your query to films with budgets > $100 million AND budget > 100000000;
Now, restrict the query to only return Spanish-language films.
SELECT title, release_year FROM films WHERE release_year BETWEEN 1990 AND 2000 AND budget > 100000000 -- Restrict the query to only Spanish-language films AND(language='Spanish');
Finally, amend the query to include all Spanish-language or French-language films with the same criteria.
SELECT title, release_year FROM films WHERE release_year BETWEEN 1990 AND 2000 AND budget > 100000000 -- Amend the query to include Spanish or French-language films AND(language ='Spanish' OR language='French');
Superb! Using WHERE with a combination of AND, OR, and BETWEEN is an efficient way to query a desired range of values.
2.3. Filtering text
LIKE and NOT LIKE
The LIKE and NOT LIKE operators can be used to find records that either match or do not match a specified pattern, respectively. They can be coupled with the wildcards % and _. The % will match zero or many characters, and _ will match a single character.
This is useful when you want to filter text, but not to an exact word.
Do the following exercises to gain some practice with these keywords.
- Select the names of all people whose names begin with ‘B’.
-- Select the names that start with B SELECT name FROM people WHERE name LIKE 'B%';
Select the names of people whose names have ‘r’ as the second letter.
SELECT name FROM people -- Select the names that have r as the second letter WHERE name LIKE '_r%';
Select the names of people whose names don’t start with ‘A’.
SELECT name FROM people -- Select names that don't start with A WHERE name NOT LIKE 'A%';
I LIKE to see the progress we’re making! Filtering your data to find specified patterns is vital to your skillset. Our results still had names that started with Á with an accent, showing that we need to be specific with our filtering criteria.
You now know you can query multiple conditions using the IN operator and a set of parentheses. It is a valuable piece of code that helps us keep our queries clean and concise.
Try using the IN operator yourself!
- Select the title and release_year of all films released in 1990 or 2000 that were longer than two hours.
-- Find the title and release_year for all films over two hours in length released in 1990 and 2000 SELECT title, release_year FROM films WHERE release_year IN (1990, 2000) AND duration > 120;
Select the title and language of all films in English, Spanish, or French using IN.
-- Find the title and language of all films in English, Spanish, and French SELECT title, language FROM films WHERE language IN ('English','Spanish','French');
Select the title, certification and language of all films certified NC-17 or R that are in English, Italian, or Greek.
-- Find the title, certification, and language all films certified NC-17 or R that are in English, Italian, or Greek SELECT title, certification, language FROM films WHERE certification IN ('NC-17','R') AND language IN ('English','Italian','Greek');
Your SQL vocabulary is growing by the minute! Interestingly, A Fistful of Dollars starring Clint Eastwood is listed as Italian.
Combining filtering and selecting
Time for a little challenge. So far, your SQL vocabulary from this course includes COUNT(), DISTINCT, LIMIT, WHERE, OR, AND, BETWEEN, LIKE, NOT LIKE, and IN. In this exercise, you will try to use some of these together. Writing more complex queries will be standard for you as you become a qualified SQL programmer.
As this query will be a little more complicated than what you’ve seen so far, we’ve included a bit of code to get you started. You will be using DISTINCT here too because, surprise, there are two movies named ‘Hamlet’ in this dataset!
Follow the instructions to find out what 90’s films we have in our dataset that would be suitable for English-speaking teens.
- Count the unique titles from the films database and use the alias provided.
- Filter to include only movies with a release_year from 1990 to 1999, inclusive.
- Add another filter narrowing your query down to English-language films.
- Add a final filter to select only films with ‘G’, ‘PG’, ‘PG-13’ certifications.
-- Count the unique titles SELECT COUNT(DISTINCT(title)) AS nineties_english_films_for_teens FROM films -- Filter to release_years to between 1990 and 1999 WHERE release_year BETWEEN 1990 AND 1999 -- Filter to English-language films AND language ='English' -- Narrow it down to G, PG, and PG-13 certifications AND certification IN ( 'G', 'PG', 'PG-13');
You’ve got a natural flair for filtering! Nice work, this filter tells us we have 310 films that the 90’s obsessed teenagers can enjoy.
2.4. NULL values
What does NULL mean?
I hope you were paying attention in the video. Pop quiz: What does NULL represent?
- A corrupt entry
- A missing value
- An empty string
- An invalid value
Correct! NULL is used to represent unknown values.
Practice with NULLs
Well done. Now that you know what NULL means and what it’s used for, it’s time for some more practice!
Let’s explore the films table again to better understand what data you have.
- Select the title of every film that doesn’t have a budget associated with it and use the alias no_budget_info.
-- List all film titles with missing budgets SELECT title AS no_budget_info FROM films WHERE budget IS NULL;
Count the number of films with a language associated with them and use the alias count_language_known.
-- Count the number of films we have language data for SELECT COUNT(language) AS count_language_known FROM films
Alright! That’s 4957 films with language data. We’ve mastered selecting and filtering data which means you’re halfway through the course!
3. Aggregate Functions
SQL allows you to zoom in and out to better understand an entire dataset, its subsets, and its individual records. You’ll learn to summarize data using aggregate functions and perform basic arithmetic calculations inside queries to gain insights into what makes a successful film.
3.1. Summarizing data
Aggregate functions and data types
Aggregate functions are another valuable tool for the SQL programmer. They are used extensively across businesses to calculate important metrics, such as the average cost of making a film.
You know five different aggregate functions:
Test your knowledge of what data types they are compatible with.
- Classify the function based on what data type it is compatible with.
Congratulations! You’ve got aggregate functions under control.
Practice with aggregate functions
Now let’s try extracting summary information from a table using these new aggregate functions. Summarizing is helpful in real life when extracting top-line details from your dataset. Perhaps you’d like to know how old the oldest film in the films table is, what the most expensive film is, or how many films you have listed.
Now it’s your turn to get more insights about the films table!
- Use the SUM() function to calculate the total duration of all films and alias with total_duration.
-- Query the sum of film durations SELECT SUM(duration) AS total_duration FROM films;
Calculate the average duration of all films and alias with average_duration.
-- Calculate the average duration of all films SELECT AVG(duration) AS average_duration FROM films;
Find the most recent release_year in the films table, aliasing as latest_year
-- Find the latest release_year SELECT MAX(release_year) AS latest_year FROM films;
Find the duration of the shortest film and use the alias shortest_film.
-- Find the duration of the shortest film SELECT MIN(duration) AS shortest_film FROM films;
Well done! You’ll find yourself using aggregate functions over and over again to get a quick grasp of the data in a SQL database.
3.2. Summarizing subsets
Combining aggregate functions with WHERE
When combining aggregate functions with WHERE, you get a powerful tool that allows you to get more granular with your insights, for example, to get the total budget of movies made from the year 2010 onwards.
This combination is useful when you only want to summarize a subset of your data. In your film-industry role, as an example, you may like to summarize each certification category to compare how they each perform or if one certification has a higher average budget than another.
Let’s see what insights you can gain about the financials in the dataset.
- Use SUM() to calculate the total gross for all films made in the year 2000 or later, and use the alias total_gross.
-- Calculate the sum of gross from the year 2000 or later SELECT SUM(gross) AS total_gross FROM films WHERE release_year >= 2000;
Calculate the average amount grossed by all films whose titles start with the letter ‘A’ and alias with avg_gross_A.
-- Calculate the average gross of films that start with A SELECT AVG(gross) AS avg_gross_A FROM films WHERE title LIKE 'A%';
Calculate the lowest gross film in 1994 and use the alias lowest_gross.
-- Calculate the lowest gross film in 1994 SELECT MIN(gross) AS lowest_gross FROM films WHERE release_year = 1994;
Calculate the highest gross film between 2000 and 2012, inclusive, and use the alias highest_gross.
-- Calculate the highest gross film released between 2000-2012 SELECT MAX(gross) AS highest_gross FROM films WHERE release_year BETWEEN 2000 AND 2012;
Nice. SQL provides us with several building blocks that we can combine in all kinds of ways, hence the name: Structured Query Language.
Aggregate functions work great with numerical values; however, these results can sometimes get unwieldy when dealing with long decimal values. Luckily, SQL provides you with the ROUND() function to tame these long decimals.
If asked to give the average budget of your films, ten decimal places is not necessary. Instead, you can round to two decimal places to create results that make more sense for currency.
Now you try!
- Calculate the average facebook_likes to one decimal place and assign to the alias, avg_facebook_likes.
-- Round the average number of facebook_likes to one decimal place SELECT ROUND(AVG(facebook_likes),1) AS avg_facebook_likes FROM reviews
Well done! The average, rounded facebook_likes is 7802.9. This insight can be used as a benchmark to measure film reviews; any film with over 7802.9 likes can be considered popular.
ROUND() with a negative parameter
A useful thing you can do with ROUND() is have a negative number as the decimal place parameter. This can come in handy if your manager only needs to know the average number of facebook_likes to the hundreds since granularity below one hundred likes won’t impact decision making.
Social media plays a significant role in determining success. If a movie trailer is posted and barely gets any likes, the movie itself may not be successful. Remember how 2020’s “Sonic the Hedgehog” movie got a revamp after the public saw the trailer?
Let’s apply this to other parts of the dataset and see what the benchmark is for movie budgets so, in the future, it’s clear whether the film is above or below budget.
- Calculate the average budget from the films table, aliased as avg_budget_thousands, and round to the nearest thousand.
-- Calculate the average budget rounded to the thousands SELECT ROUND(AVG(budget),-3) AS avg_budget_thousands FROM films
ROUND() of applause! The ROUND() function is very handy when making financial calculations to get a top-level view or specify to the penny or cent.
3.3. Aliasing and arithmetic
SQL arithmetic comes in handy when your table is missing a metric you want to review. Suppose you have some data on movie ticket sales, but the table only has fields for ticket price and discount. In that case, you could combine these by subtracting the discount from the ticket price to get the amount the film-goer paid.
You have seen that SQL can act strangely when dividing integers. What is the result if you divide a discount of two dollars by the paid_price of ten dollars to get the discount percentage?
Correct! SQL thinks we want the answer to be an integer since we are dividing two integers. 0 is the closest integer to 0.2.
Aliasing with functions
Aliasing can be a lifesaver, especially as we start to do more complex SQL queries with multiple criteria. Aliases help you keep your code clean and readable. For example, if you want to find the MAX() value of several fields without aliasing, you’ll end up with the result with several columns called max and no idea which is which. You can fix this with aliasing.
Now, it’s over to you to clean up the following queries.
- Select the title and duration in hours for all films and alias as duration_hours; since the current durations are in minutes, you’ll need to divide duration by 60.0.
-- Calculate the title and duration_hours from films SELECT title, (duration /60.0) AS duration_hours FROM films;
Calculate the percentage of people who are no longer alive and alias the result as percentage_dead.
-- Calculate the percentage of people who are no longer alive SELECT COUNT(deathdate) * 100.0 / COUNT(name) AS percentage_dead FROM people;
Find how many decades the films table covers by using MIN() and MAX() and alias as number_of_decades.
-- Find the number of decades in the films table SELECT (MAX(release_year)-MIN(release_year)) / 10.0 AS number_of_decades FROM films;
Amazing work mastering arithmetic, aggregate functions, and aliasing! Now you know that our films table covers films released over ten decades, or a hundred years!
You found some valuable insights in the previous exercise, but many of the results were inconveniently long. We forgot to round! We won’t make you redo them all; however, you’ll update the worst offender in this exercise.
- Update the query by adding ROUND() around the calculation and round to two decimal places.
-- Round duration_hours to two decimal places SELECT title, ROUND(duration / 60.0,2) AS duration_hours FROM films;
That’s better! Now you can clearly see how long a movie is.
4. Sorting and Grouping
This final chapter teaches you how to sort and group data. These skills will take your analyses to a new level by helping you uncover critical business insights and identify trends and performance. You’ll get hands-on experience to determine which films performed the best and how movie durations and budgets changed over time.
4.1. Sorting results
SQL provides you with the ORDER BY keyword to sort one or more fields from your data. It can do this multi-directionally and helps make results easy to interpret.
How does ORDER BY sort a column of text values by default?
- Alphabetically (A-Z)
- Reverse alphabetically (Z-A)
- There’s no natural ordering to text data
- By number of characters (fewest to most)
Correct! Because ascending is the default, indicating ASC is helpful for clarity but not required!.
Sorting single fields
Now that you understand how ORDER BY works, you’ll put it into practice. In this exercise, you’ll work on sorting single fields only. This can be helpful to extract quick insights such as the top-grossing or top-scoring film.
The following exercises will help you gain further insights into the film database.Select the name of each person in the people table, sorted alphabetically.
-- Select name from people and sort alphabetically SELECT name FROM people ORDER BY name;
Select the title and duration for every film, from longest duration to shortest.
-- Select the title and duration from longest to shortest film SELECT title, duration FROM films ORDER BY duration DESC
Superb sorting! ORDER BY is another simple yet effective way to gain intelligence about your business and data. You now know how to extract your best and worst-performing assets with only a few lines of code
Sorting multiple fields
ORDER BY can also be used to sort on multiple fields. It will sort by the first field specified, then sort by the next, and so on. As an example, you may want to sort the people data by age and keep the names in alphabetical order.
Try using ORDER BY to sort multiple columns.
- Select the release_year, duration, and title of films ordered by their release year and duration, in that order.
-- Select the release year, duration, and title sorted by release year and duration SELECT release_year, duration, title FROM films ORDER BY release_year, duration;
Select the certification, release_year, and title from films ordered first by certification (alphabetically) and second by release year, starting with the most recent year.
-- Select the certification, release year, and title sorted by certification and release year SELECT certification, release_year, title FROM films ORDER BY certification ASC, release_year DESC;
Nicely done! The second column you order on only steps in when the first column has been ordered.
4.2. Grouping data
GROUP BY single fields
GROUP BY is a SQL keyword that allows you to group and summarize results with the additional use of aggregate functions. For example, films can be grouped by the certification and language before counting the film titles in each group. This allows you to see how many films had a particular certification and language grouping.
In the following steps, you’ll summarize other groups of films to learn more about the films in your database.
- Select the release_year and count of films released in each year aliased as film_count.
-- Find the release_year and film_count of each year SELECT release_year, COUNT(*) AS film_count FROM films GROUP BY release_year;
Select the release_year and average duration aliased as avg_duration of all films, grouped by release_year.
-- Find the release_year and average duration of films for each year SELECT release_year, AVG(duration) AS avg_duration FROM films GROUP BY release_year;
Great job grouping! Using GROUP BY with a time or date field such as release_year can help us identify trends such as a period of time where movies were really short!
GROUP BY multiple fields
GROUP BY becomes more powerful when used across multiple fields or combined with ORDER BY and LIMIT.
Perhaps you’re interested in learning about budget changes throughout the years in individual countries. You’ll use grouping in this exercise to look at the maximum budget for each country in each year there is data available.
- Select the release_year, country, and the maximum budget aliased as max_budget for each year and each country; sort your results by release_year and country.
-- Find the release_year, country, and max_budget, then group and order by release_year and country SELECT release_year, country, MAX(budget) AS max_budget FROM films GROUP BY release_year, country ORDER BY release_year, country;
Well done! You can see how building on your SQL queries helps you gain more insights and detect trends in the data, such as how film budgets may change throughout the years.
Answering business questions
In the real world, every SQL query starts with a business question. Then it is up to you to decide how to write the query that answers the question. Let’s try this out.
Which release_year had the most language diversity?
Take your time to translate this question into code. We’ll get you started then it’s up to you to test your queries in the console.
“Most language diversity” can be interpreted as COUNT(DISTINCT ___). Now over to you.
Well done! The year 2006 had 16 distinct languages, that’s more than any other year.
4.3. Filtering grouped data
Filter with HAVING
Your final keyword is HAVING. It works similarly to WHERE in that it is a filtering clause, with the difference that HAVING filters grouped data.
Filtering grouped data can be especially handy when working with a large dataset. When working with thousands or even millions of rows, HAVING will allow you to filter for just the group of data you want, such as films over two hours in length!
Practice using HAVING to find out which countries (or country) have the most varied film certifications.
- Select country from the films table, and get the distinct count of certification aliased as certification_count.
- Group the results by country.
- Filter the unique count of certifications to only results greater than 10.
-- Select the country and distinct count of certification as certification_count SELECT country,COUNT(DISTINCT certification) AS certification_count FROM films -- Group by country GROUP BY country -- Filter results to countries with more than 10 different certifications HAVING COUNT(DISTINCT certification) > 10;
Great job! The answer is USA with 12 different certifications.
HAVING and sorting
Filtering and sorting go hand in hand and gives you greater interpretability by ordering our results.
Let’s see this magic at work by writing a query showing what countries have the highest average film budgets.
- Select the country and the average budget as average_budget, rounded to two decimal, from films.
- Group the results by country.
- Filter the results to countries with an average budget of more than one billion (1000000000).
- Sort by descending order of the average_budget.
-- Select the country and average_budget from films SELECT country, ROUND(AVG(budget),2) AS average_budget FROM films -- Group by country GROUP BY country -- Filter to countries with an average_budget of more than one billion HAVING ROUND(AVG(budget),2) > 1000000000 -- Order by descending order of the aggregated budget ORDER BY average_budget DESC;
ou did it! South Korea and Hungary seem to have pricey films… or do they? Actually, these budgets are pretty standard for their local currency.
All together now
It’s time to use much of what you’ve learned in one query! This is good preparation for using SQL in the real world where you’ll often be asked to write more complex queries since some of the basic queries can be answered by playing around in spreadsheet applications.
In this exercise, you’ll write a query that returns the average budget and gross earnings for films each year after 1990 if the average budget is greater than 60 million.
This will be a big query, but you can handle it!
- Select the release_year for each film in the films table, filter for records released after 1990, and group by release_year.
-- Select the budget for films released after 1990 grouped by year SELECT release_year FROM films WHERE release_year >1990 GROUP BY release_year;
Modify the query to include the average budget aliased as avg_budget and average gross aliased as avg_gross for the results we have so far.
-- Modify the query to also list the average budget and average gross SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross FROM films WHERE release_year > 1990 GROUP BY release_year;
Modify the query once more so that only years with an average budget of greater than 60 million are included.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross FROM films WHERE release_year > 1990 GROUP BY release_year -- Modify the query to see only years with an avg_budget of more than 60 million HAVING AVG(budget) > 60000000;
Finally, order the results from the highest average gross and limit to one.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross FROM films WHERE release_year > 1990 GROUP BY release_year HAVING AVG(budget) > 60000000 -- Order the results from highest to lowest average gross and limit to one ORDER BY avg_gross DESC LIMIT 1;
Superb work! SQL queries can get rather long, but breaking them down into individual clauses makes them easier to write.