Exploratory Data Analysis in SQL – Datacamp. Learn how to explore what’s available in a database: the tables, relationships between them, and data stored in them.
You have access to a database. Now what do you do? Building on your existing skills joining tables, using basic functions, grouping data, and using subqueries, the next step in your SQL journey is learning how to explore a database and the data in it. Using data from Stack Overflow, Fortune 500 companies, and 311 help requests from Evanston, IL, you’ll get familiar with numeric, character, and date/time data types. You’ll use functions to aggregate, summarize, and analyze data without leaving the database. Errors and inconsistencies in the data won’t stop you! You’ll learn common problems to look for and strategies to clean up messy data. By the end of this course, you’ll be ready to start exploring your own PostgreSQL databases and analyzing the data in them.
Course Url: Exploratory Data Analysis in SQL Course | DataCamp
1. What’s in the database?
Start exploring a database by identifying the tables and the foreign keys that link them. Look for missing values, count the number of observations, and join tables to understand how they’re related. Learn about coalescing and casting data along the way.
1.1. What’s in the database?

Explore table sizes
Let’s start by exploring five related tables:
- stackoverflow: questions asked on Stack Overflow with certain tags
- company: information on companies related to tags in stackoverflow
- tag_company: links stackoverflow to company
- tag_type: type categories applied to tags in stackoverflow
- fortune500: information on top US companies
Count the number of rows in a table with
SELECT count(*)
FROM tablename;
Count the number of columns in a table by selecting a few rows and manually counting the columns in the result.
Which table has the most rows? Which table has the most columns?
- stackoverflow has the most rows; company has the most columns
- tag_company has the most rows; company has the most columns
- stackoverflow has the most rows; fortune500 has the most columns
- tag_type has the most rows; fortune500 has the most columns
Great start! Knowing how much data you have is a first step in exploratory data analysis.
Count missing values
Which column of fortune500 has the most missing values? To find out, you’ll need to check each column individually, although here we’ll check just three.
First, figure out how many rows are in fortune500 by counting them.
-- Select the count of the number of rows
SELECT COUNT(*)
FROM fortune500;
Output
count |
500 |
Subtract the count of the non-NULL ticker values from the total number of rows, alias the difference as missing.
-- Select the count of ticker,
-- subtract from the total number of rows,
-- and alias as missing
SELECT count(*) - count(ticker) AS missing
FROM fortune500;
OUTPUT
missing |
32 |
Repeat for the profits_change column.
-- Select the count of profits_change,
-- subtract from total number of rows, and alias as missing
SELECT count(*) - count(profits_change) AS missing
FROM fortune500;
OUTPUT
missing |
63 |
Repeat for the industry column.
-- Select the count of industry,
-- subtract from total number of rows, and alias as missing
SELECT count(*) - count(industry) AS missing
FROM fortune500;
OUTPUT
missing |
13 |
Good work! Note that the result of a call to count(*) is labelled count in the result. You can refer to this column with count in ORDER BY clauses without explicitly aliasing count(*) AS count.
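For example, a quick sketch against the fortune500 table described above (counting companies per industry and sorting by that implicit count label):
-- Count companies per industry; order by the automatic count label
SELECT industry, count(*)
FROM fortune500
GROUP BY industry
ORDER BY count DESC;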
Join tables
Part of exploring a database is figuring out how tables relate to each other. The company and fortune500 tables don’t have a formal relationship between them in the database, but this doesn’t prevent you from joining them.
To join the tables, you need to find a column that they have in common where the values are consistent across the tables. Remember: just because two tables have a column with the same name, it doesn’t mean those columns necessarily contain compatible data. If you find more than one pair of columns with similar data, you may need to try joining with each in turn to see if you get the same number of results.
Reference the entity relationship diagram if needed.
- Look at the contents of the company and fortune500 tables. Find a column that they have in common where the values for each company are the same in both tables.
- Join the company and fortune500 tables with an INNER JOIN.
Select only company.name for companies that appear in both tables.
SELECT company.name
-- Table(s) to select from
FROM company
INNER JOIN fortune500
ON company.ticker = fortune500.ticker;
OUTPUT
name |
Apple Incorporated |
Amazon.com Inc |
Alphabet |
You got it! You can join tables when they share a column with consistent data values.
1.2. The keys to the database
Foreign keys
Recall that foreign keys reference another row in the database via a unique ID. Values in a foreign key column are restricted to values in the referenced column OR NULL.
Using what you know about foreign keys, why can’t the tag column in the tag_type table be a foreign key that references the tag column in the stackoverflow table?
- stackoverflow.tag is not a primary key
- tag_type.tag contains NULL values
- stackoverflow.tag contains duplicate values
- tag_type.tag does not contain all the values in stackoverflow.tag
Brilliant! Foreign keys must reference a column with unique values for each row so the referenced row can be identified.
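As a minimal illustration of that requirement (hypothetical parent and child tables, not the course schema), a foreign key must point at a column whose values uniquely identify a row:
-- The referenced column must uniquely identify a row (here, a primary key)
CREATE TABLE parent (
    id integer PRIMARY KEY
);
-- child.parent_id may only contain values found in parent.id, or NULL
CREATE TABLE child (
    id integer PRIMARY KEY,
    parent_id integer REFERENCES parent(id)
);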
Read an entity relationship diagram
The information you need is sometimes split across multiple tables in the database.
What is the most common stackoverflow tag_type? What companies have a tag of that type?
To generate a list of such companies, you’ll need to join three tables together.
Reference the entity relationship diagram as needed when determining which columns to use when joining tables.
- First, using the tag_type table, count the number of tags with each type.
- Order the results to find the most common tag type.
-- Count the number of tags with each type
SELECT type, Count(*) AS count
FROM tag_type
-- To get the count for each type, what do you need to do?
GROUP BY type
-- Order the results with the most common
-- tag types listed first
ORDER BY count DESC;
OUTPUT
type | count |
storage | 2 |
spreadsheet | 2 |
payment | 5 |
- Join the tag_company, company, and tag_type tables, keeping only mutually occurring records.
- Select company.name, tag_type.tag, and tag_type.type for tags with the most common type from the previous step.
-- Select the 3 columns desired
SELECT company.name, tag_type.tag, tag_type.type
FROM company
-- Join to the tag_company table
INNER JOIN tag_company
ON company.Id = tag_company.company_id
-- Join to the tag_type table
INNER JOIN tag_type
ON tag_company.tag = tag_type.tag
-- Filter to most common type
WHERE type='cloud';
OUTPUT
name | tag | type |
Amazon Web Services | amazon-cloudformation | cloud |
Amazon Web Services | amazon-cloudfront | cloud |
Amazon Web Services | amazon-cloudsearch | cloud |
Superb! You could combine these steps in a single query by using a subquery in the WHERE clause instead of the value ‘cloud’.
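A sketch of that combined form, assuming the same tables and join columns used above, swaps the literal 'cloud' for a subquery that finds the most common type:
SELECT company.name, tag_type.tag, tag_type.type
FROM company
INNER JOIN tag_company
    ON company.id = tag_company.company_id
INNER JOIN tag_type
    ON tag_company.tag = tag_type.tag
-- The subquery returns the single most common tag type
WHERE tag_type.type = (SELECT type
                       FROM tag_type
                       GROUP BY type
                       ORDER BY count(*) DESC
                       LIMIT 1);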
Coalesce
The coalesce() function can be useful for specifying a default or backup value when a column contains NULL values.
coalesce() checks arguments in order and returns the first non-NULL value, if one exists.
- coalesce(NULL, 1, 2) = 1
- coalesce(NULL, NULL) = NULL
- coalesce(2, 3, NULL) = 2
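These behaviors are easy to confirm with a one-off query that needs no table:
-- Each column returns the first non-NULL argument (or NULL if there is none)
SELECT coalesce(NULL, 1, 2) AS a,
       coalesce(NULL, NULL) AS b,
       coalesce(2, 3, NULL) AS c;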
In the fortune500 data, industry contains some missing values. Use coalesce() to use the value of sector as the industry when industry is NULL. Then find the most common industry.
- Use coalesce() to select the first non-NULL value from industry, sector, or ‘Unknown’ as a fallback value.
- Alias the result of the call to coalesce() as industry2.
- Count the number of rows with each industry2 value.
- Find the most common value of industry2.
-- Use coalesce
SELECT coalesce(industry, sector, 'Unknown') AS industry2,
-- Don't forget to count!
count(*)
FROM fortune500
-- Group by what? (What are you counting by?)
GROUP BY industry2
-- Order results to see most common first
ORDER BY count DESC
-- Limit results to get just the one value you want
LIMIT 1;
OUTPUT
industry2 | count |
Utilities: Gas and Electric | 22 |
Terrific! coalesce is essential when the value you need could be in more than one column. In the next exercise, you’ll use coalesce as part of a self join.
Coalesce with a self-join
You previously joined the company and fortune500 tables to find out which companies are in both tables. Now, also include companies from company that are subsidiaries of Fortune 500 companies.
To include subsidiaries, you will need to join company to itself to associate a subsidiary with its parent company’s information. To do this self-join, use two different aliases for company.
coalesce will help you combine the two ticker columns in the result of the self-join to join to fortune500.
- Join company to itself to add information about a company’s parent to the original company’s information.
- Use coalesce to get the parent company ticker if available and the original company ticker otherwise.
- INNER JOIN to fortune500 using the ticker.
- Select original company name, fortune500 title and rank.
SELECT company_original.name, title, rank
-- Start with original company information
FROM company AS company_original
-- Join to another copy of company with parent
-- company information
LEFT JOIN company AS company_parent
ON company_original.parent_id = company_parent.id
-- Join to fortune500, only keep rows that match
INNER JOIN fortune500
-- Use parent ticker if there is one,
-- otherwise original ticker
ON coalesce(company_original.ticker,
company_parent.ticker) =
fortune500.ticker
-- For clarity, order by rank
ORDER BY rank;
output
name | title | rank |
Apple Incorporated | Apple | 3 |
Amazon.com Inc | Amazon.com | 12 |
Amazon Web Services | Amazon.com | 12 |
Awesome! Self-joins can get confusing. Use meaningful aliases to help keep everything straight.
1.3. Column types and constraints
Effects of casting
When you cast data from one type to another, information can be lost or changed. See how the casting changes values and practice casting data using the CAST() function and the :: syntax.
SELECT CAST(value AS new_type);
SELECT value::new_type;
- Select profits_change and profits_change cast as integer from fortune500.
- Look at how the values were converted.
-- Select the original value
SELECT profits_change,
-- Cast profits_change
CAST(profits_change AS integer) AS profits_change_int
FROM fortune500;
OUTPUT
profits_change | profits_change_int |
-7.2 | -7 |
0 | 0 |
-14.4 | -14 |
- Compare the result of dividing the integer value 10 by 3 to the result of dividing the numeric value 10 by 3.
-- Divide 10 by 3
SELECT 10/3,
-- Cast 10 as numeric and divide by 3
10::numeric/3;
output
?column? | ?column? |
3 | 3.3333333333333333 |
- Now cast numbers that appear as text as numeric.
- Note: 1e3 is scientific notation.
SELECT '3.2'::numeric,
'-123'::numeric,
'1e3'::numeric,
'1e-3'::numeric,
'02314'::numeric,
'0002'::numeric;
output
numeric | numeric | numeric | numeric | numeric | numeric |
3.2 | -123 | 1000 | 0.001 | 2314 | 2 |
Good job! Note that numbers cast as integer are rounded to the nearest whole number and division produces different results for integer values than for numeric values.
Summarize the distribution of numeric values
Was 2017 a good or bad year for revenue of Fortune 500 companies? Examine how revenue changed from 2016 to 2017 by first looking at the distribution of revenues_change and then counting companies whose revenue increased.
- Use GROUP BY and count() to examine the values of revenues_change.
- Order the results by revenues_change to see the distribution.
-- Select the count of each value of revenues_change
SELECT revenues_change, count(*)
FROM fortune500
GROUP BY revenues_change
-- order by the values of revenues_change
ORDER BY revenues_change;
output
revenues_change | count |
-57.5 | 1 |
-53.3 | 1 |
- Repeat step 1, but this time, cast revenues_change as an integer to reduce the number of different values.
-- Select the count of each revenues_change integer value
SELECT revenues_change::integer, COUNT(*)
FROM fortune500
GROUP BY revenues_change::integer
-- order by the values of revenues_change
ORDER BY revenues_change;
output
revenues_change | count |
-58 | 1 |
-53 | 1 |
-51 | 2 |
- How many of the Fortune 500 companies had revenues increase in 2017 compared to 2016? To find out, count the rows of fortune500 where revenues_change indicates an increase.
-- Count rows
SELECT Count(*)
FROM fortune500
-- Where...
WHERE revenues_change > 0;
OUTPUT
count |
298 |
You got it. Examining distributions and counting observations of interest are two first steps in exploring data. In the next chapter, we’ll learn other functions and approaches for summarizing numeric data.
2. Summarizing and aggregating numeric data
You’ll build on functions like min and max to summarize numeric data in new ways. Add average, variance, correlation, and percentile functions to your toolkit, and learn how to truncate and round numeric values too. Build complex queries and save your results by creating temporary tables.
2.1. Numeric data types and summary functions
Division
Compute the average revenue per employee for Fortune 500 companies by sector.
- Compute revenue per employee by dividing revenues by employees; use casting to produce a numeric result.
- Take the average of revenue per employee with avg(); alias this as avg_rev_employee.
- Group by sector.
- Order by the average revenue per employee.
-- Select average revenue per employee by sector
SELECT sector,
AVG(revenues/employees::numeric) AS avg_rev_employee
FROM fortune500
GROUP BY sector
-- Use the column alias to order the results
ORDER BY avg_rev_employee;
output
sector | avg_rev_employee |
Hotels, Restaurants & Leisure | 0.09498718151056814829 |
Apparel | 0.27865942976680063809 |
Food & Drug Stores | 0.30799950410060207070 |
Sensational! You know to watch out for integer division problems, and that ordering your query results by the value of interest will help you make sense of the results.
Explore with division
In exploring a new database, it can be unclear what the data means and how columns are related to each other.
What information does the unanswered_pct column in the stackoverflow table contain? Is it the percent of questions with the tag that are unanswered (unanswered ?s with tag/all ?s with tag)? Or is it something else, such as the percent of all unanswered questions on the site with the tag (unanswered ?s with tag/all unanswered ?s)?
Divide unanswered_count (unanswered ?s with tag) by question_count (all ?s with tag) to see if the value matches that of unanswered_pct to determine the answer.
- Exclude rows where question_count is 0 to avoid a divide by zero error.
- Limit the result to 10 rows.
-- Divide unanswered_count by question_count
SELECT unanswered_count/question_count::numeric AS computed_pct,
-- What are you comparing the above quantity to?
unanswered_pct
FROM stackoverflow
-- Select rows where question_count is not 0
WHERE question_count != 0
LIMIT 10;
output
computed_pct | unanswered_pct |
0.46548476454293628809 | 0.001751857 |
0.38636363636363636364 | 0.000116972 |
0.39376770538243626062 | 0.000058 |
Super! The values don’t match. unanswered_pct is the percent of unanswered questions on Stack Overflow with the tag, not the percent of questions with the tag that are unanswered.
Summarize numeric columns
Summarize the profit column in the fortune500 table using the functions you’ve learned.
- Compute the min(), avg(), max(), and stddev() of profits.
-- Select min, avg, max, and stddev of fortune500 profits
SELECT min(profits),
avg(profits),
max(profits),
stddev(profits)
FROM fortune500;
output
min | avg | max | stddev |
-6177 | 1783.4753507014028056 | 45687 | 3940.495363490788 |
- Now repeat step 1, but summarize profits by sector.
- Order the results by the average profits for each sector.
-- Select sector and summary measures of fortune500 profits
SELECT sector,
min(profits),
avg(profits) AS avg,
max(profits),
stddev(profits)
FROM fortune500
-- What to group by?
GROUP BY sector
-- Order by the average profits
ORDER BY avg;
output
sector | min | avg | max | stddev |
Energy | -6177 | 10.4446428571428571 | 7840 | 2264.572142925951 |
Materials | -440 | 272.4684210526315789 | 1027 | 406.632781447055 |
Engineering & Construction | 15 | 390.1692307692307692 | 911.8 | 277.665120197620 |
Fantastic work! You now know how to get basic summary measures of numeric variables.
Summarize group statistics
Sometimes you want to understand how a value varies across groups. For example, how does the maximum value per group vary across groups?
To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.
For this exercise, what is the standard deviation across tags in the maximum number of Stack Overflow questions per day? What about the mean, min, and max of the maximums as well?
- Start by writing a subquery to compute the max() of question_count per tag; alias the subquery result as maxval.
- Then compute the standard deviation of maxval with stddev().
- Compute the min(), max(), and avg() of maxval too.
-- Compute standard deviation of maximum values
SELECT min(maxval),
-- min
max(maxval),
-- max
avg(maxval),
-- avg
stddev(maxval)
-- Subquery to compute max of question_count by tag
FROM (SELECT max(question_count) AS maxval
FROM stackoverflow
-- Compute max by...
GROUP BY tag) AS max_results; -- alias for subquery
output
min | max | avg | stddev |
30 | 1138658 | 52652.433962264151 | 176458.37952720 |
Great job summarizing! A subquery was necessary here because the tag maximums must be computed before you can summarize them.
2.2. Exploring distributions
Truncate
Use trunc() to examine the distributions of attributes of the Fortune 500 companies.
Remember that trunc() truncates numbers by replacing lower place value digits with zeros:
trunc(value_to_truncate, places_to_truncate)
Negative values for places_to_truncate indicate digits to the left of the decimal to replace, while positive values indicate digits to the right of the decimal to keep.
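A quick, table-free illustration of both directions:
-- Keep 2 digits to the right of the decimal; zero out the last 3 digits to the left
SELECT trunc(42.1256, 2) AS right_of_decimal,  -- 42.12
       trunc(12345, -3) AS left_of_decimal;    -- 12000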
- Use trunc() to truncate employees to the 100,000s (5 zeros).
- Count the number of observations with each truncated value.
-- Truncate employees
SELECT trunc(employees, -5) AS employee_bin,
-- Count number of companies with each truncated value
Count(*)
FROM fortune500
-- Use alias to group
GROUP BY employee_bin
-- Use alias to order
ORDER BY employee_bin;
output
employee_bin | count |
0 | 433 |
100000 | 35 |
200000 | 20 |
- Repeat step 1 for companies with < 100,000 employees (most common).
- This time, truncate employees to the 10,000s place.
-- Truncate employees
SELECT trunc(employees, -4) AS employee_bin,
-- Count number of companies with each truncated value
Count(*)
FROM fortune500
-- Limit to which companies?
WHERE employees < 100000
-- Use alias to group
GROUP BY employee_bin
-- Use alias to order
ORDER BY employee_bin;
output
employee_bin | count |
0 | 102 |
10000 | 108 |
20000 | 63 |
Awesome! Start exploring a distribution by grouping values into large bins, then refine as needed for ranges of values where there are a lot of observations.
Generate series
Summarize the distribution of the number of questions with the tag “dropbox” on Stack Overflow per day by binning the data.
Recall:
generate_series(from, to, step)
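As a quick standalone reminder of how the series is built (integers here, counting by 3):
SELECT generate_series(1, 10, 3);
-- Returns 1, 4, 7, 10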
- Start by selecting the minimum and maximum of the question_count column for the tag ‘dropbox’ so you know the range of values to cover with the bins.
-- Select the min and max of question_count
SELECT min(question_count),
max(question_count)
-- From what table?
FROM stackoverflow
-- For tag dropbox
WHERE tag = 'dropbox';
output
min | max |
2315 | 3072 |
- Next, use generate_series() to create bins of size 50 from 2200 to 3100.
- To do this, you need an upper and lower bound to define a bin.
- This will require you to modify the stopping value of the lower bound and the starting value of the upper bound by the bin width.
-- Create lower and upper bounds of bins
SELECT generate_series(2200, 3050, 50) AS lower,
generate_series(2250, 3100, 50) AS upper;
output
lower | upper |
2200 | 2250 |
2250 | 2300 |
2300 | 2350 |
- Select lower and upper from bins, along with the count of values within each bin bounds.
- To do this, you’ll need to join ‘dropbox’, which contains the question_count for tag “dropbox”, to the bins created by generate_series().
- The join should occur where the count is greater than or equal to the lower bound, and strictly less than the upper bound.
-- Bins created in Step 2
WITH bins AS (
SELECT generate_series(2200, 3050, 50) AS lower,
generate_series(2250, 3100, 50) AS upper),
-- Subset stackoverflow to just tag dropbox (Step 1)
dropbox AS (
SELECT question_count
FROM stackoverflow
WHERE tag='dropbox')
-- Select columns for result
-- What column are you counting to summarize?
SELECT lower, upper, count(question_count)
FROM bins -- Created above
-- Join to dropbox (created above),
-- keeping all rows from the bins table in the join
LEFT JOIN dropbox
-- Compare question_count to lower and upper
ON question_count >= lower
AND question_count < upper
-- Group by lower and upper to count values in each bin
GROUP BY lower, upper
-- Order by lower to put bins in order
ORDER BY lower;
OUTPUT
lower | upper | count |
2200 | 2250 | 0 |
2250 | 2300 | 0 |
2300 | 2350 | 22 |
Awesome! generate_series allows you to group values into any size interval and also include ranges with zero values.
2.3. More summary functions
Correlation
What’s the relationship between a company’s revenue and its other financial attributes? Compute the correlation between revenues and other financial variables with the corr() function.
- Compute the correlation between revenues and profits.
- Compute the correlation between revenues and assets.
- Compute the correlation between revenues and equity.
-- Correlation between revenues and profit
SELECT corr(revenues, profits) AS rev_profits,
-- Correlation between revenues and assets
corr(revenues, assets) AS rev_assets,
-- Correlation between revenues and equity
corr(revenues, equity) AS rev_equity
FROM fortune500;
output
rev_profits | rev_assets | rev_equity |
0.5999935815724783 | 0.3294995213185059 | 0.5465709997184311 |
Well done. Profits, assets, and equity are all positively correlated with revenue for Fortune 500 companies.
Mean and Median
Compute the mean (avg()) and median assets of Fortune 500 companies by sector.
Use the percentile_disc() function to compute the median:
percentile_disc(0.5)
WITHIN GROUP (ORDER BY column_name)
- Select the mean and median of assets.
- Group by sector.
- Order the results by the mean.
-- What groups are you computing statistics by?
SELECT sector,
-- Select the mean of assets with the avg function
avg(assets) AS mean,
-- Select the median
percentile_disc(0.5) WITHIN GROUP (ORDER BY assets) AS median
FROM fortune500
-- Computing statistics for each what?
GROUP BY sector
-- Order results by a value of interest
ORDER BY mean;
output
sector | mean | median |
Engineering & Construction | 8199.2307692307692308 | 8709 |
Wholesalers | 9362.5862068965517241 | 5390 |
Materials | 10833.263157894737 | 7741 |
Phenomenal! The mean and median can differ significantly for skewed distributions that have a few extreme values.
2.4. Creating temporary tables
Create a temp table
Find the Fortune 500 companies that have profits in the top 20% for their sector (compared to other Fortune 500 companies).
To do this, first, find the 80th percentile of profit for each sector with
percentile_disc(fraction)
WITHIN GROUP (ORDER BY sort_expression)
and save the results in a temporary table.
Then join fortune500 to the temporary table to select companies with profits greater than the 80th percentile cut-off.
- Create a temporary table called profit80 containing the sector and 80th percentile of profits for each sector.
- Alias the percentile column as pct80.
-- To clear table if it already exists;
-- fill in name of temp table
DROP TABLE IF EXISTS profit80;
-- Create the temporary table
CREATE TEMP TABLE profit80 AS
-- Select the two columns you need; alias as needed
SELECT sector,
percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
-- What table are you getting the data from?
FROM fortune500
-- What do you need to group by?
GROUP BY sector;
-- See what you created: select all columns and rows
-- from the table you created
SELECT *
FROM profit80;
output
sector | pct80 |
Aerospace & Defense | 4895 |
Apparel | 1074.1 |
Business Services | 1401 |
- Using the profit80 table you created in step 1, select companies that have profits greater than pct80.
- Select the title, sector, profits from fortune500, as well as the ratio of the company’s profits to the 80th percentile profit.
-- Code from previous step
DROP TABLE IF EXISTS profit80;
CREATE TEMP TABLE profit80 AS
SELECT sector,
percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
FROM fortune500
GROUP BY sector;
-- Select columns, aliasing as needed
SELECT title, fortune500.sector,
profits, profits/pct80 AS ratio
-- What tables do you need to join?
FROM fortune500
LEFT JOIN profit80
-- How are the tables joined?
ON fortune500.sector=profit80.sector
-- What rows do you want to select?
WHERE profits > pct80;
output
title | sector | profits | ratio |
Walmart | Retailing | 13643 | 11.1099348534201954 |
Berkshire Hathaway | Financials | 24074 | 7.9873921698739217 |
Apple | Technology | 45687 | 6.2877786952931462 |
Thumbs up! Instead of creating a temporary table, you could do this in one step with a subquery. But if you’ll use the same subquery multiple times, a temporary table can be a good option.
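A sketch of that one-step version, assuming the same columns as above, simply moves the percentile query into the join as a subquery:
SELECT title, fortune500.sector,
       profits, profits/pct80 AS ratio
FROM fortune500
-- The subquery plays the role of the profit80 temp table
LEFT JOIN (SELECT sector,
                  percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
           FROM fortune500
           GROUP BY sector) AS profit80
    ON fortune500.sector = profit80.sector
WHERE profits > pct80;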
Create a temp table to simplify a query
The Stack Overflow data contains daily question counts through 2018-09-25 for all tags, but each tag has a different starting date in the data.
Find out how many questions had each tag on the first date for which data for the tag is available, as well as how many questions had the tag on the last day. Also, compute the difference between these two values.
To do this, first compute the minimum date for each tag.
Then use the minimum dates to select the question_count on both the first and last day. To do this, join the temp table startdates to two different copies of the stackoverflow table: one for each column (first day and last day), aliased with different names.
First, create a temporary table called startdates with each tag and the min() date for the tag in stackoverflow.
-- To clear table if it already exists
DROP TABLE IF EXISTS startdates;
-- Create temp table syntax
CREATE TEMP TABLE startdates AS
-- Compute the minimum date for each what?
SELECT tag,
min(date) AS mindate
FROM stackoverflow
-- What do you need to compute the min date for each tag?
GROUP BY tag;
-- Look at the table you created
SELECT *
FROM startdates;
output
tag | mindate |
amazon-route53 | 2016-01-01 |
google-spreadsheet | 2016-01-01 |
dropbox | 2016-01-01 |
- Join startdates to stackoverflow twice using different table aliases.
- For each tag, select mindate, question_count on the mindate, and question_count on 2018-09-25 (the max date).
- Compute the change in question_count over time.
-- To clear table if it already exists
DROP TABLE IF EXISTS startdates;
CREATE TEMP TABLE startdates AS
SELECT tag, min(date) AS mindate
FROM stackoverflow
GROUP BY tag;
-- Select tag (Remember the table name!) and mindate
SELECT startdates.tag,
mindate,
-- Select question count on the min and max days
so_min.question_count AS min_date_question_count,
so_max.question_count AS max_date_question_count,
-- Compute the change in question_count (max- min)
so_max.question_count - so_min.question_count AS change
FROM startdates
-- Join startdates to stackoverflow with alias so_min
INNER JOIN stackoverflow AS so_min
-- What needs to match between tables?
ON startdates.tag = so_min.tag
AND startdates.mindate = so_min.date
-- Join to stackoverflow again with alias so_max
INNER JOIN stackoverflow AS so_max
-- Again, what needs to match between tables?
ON startdates.tag = so_max.tag
AND so_max.date = '2018-09-25';
output
tag | mindate | min_date_question_count | max_date_question_count | change |
paypal | 2016-01-01 | 13296 | 18050 | 4754 |
amazon-elb | 2016-09-01 | 576 | 1452 | 876 |
amazon-mws | 2016-09-01 | 367 | 706 | 339 |
Sensational! The main query here was already complicated, so creating the temporary table first helped simplify the analysis.
Insert into a temp table
While you can join the results of multiple similar queries together with UNION, sometimes it’s easier to break a query down into steps. You can do this by creating a temporary table and inserting rows into it.
Compute the correlations between each pair of profits, profits_change, and revenues_change from the Fortune 500 data.
The resulting temporary table should have the following structure:
measure | profits | profits_change | revenues_change |
profits | 1.00 | # | # |
profits_change | # | 1.00 | # |
revenues_change | # | # | 1.00 |
Recall the round() function to make the results more readable:
round(column_name::numeric, decimal_places)
Note that Steps 1 and 2 do not produce output. It is normal for the query result pane to say “Your query did not generate any results.”
Create a temp table correlations.
- Compute the correlation between profits and each of the three variables (i.e. correlate profits with profits, profits with profits_change, etc).
- Alias columns by the name of the variable for which the correlation with profits is being computed.
DROP TABLE IF EXISTS correlations;
-- Create temp table
CREATE TEMP TABLE correlations AS
-- Select each correlation
SELECT 'profits'::varchar AS measure,
-- Compute correlations
corr(profits, profits) AS profits,
corr(profits, profits_change) AS profits_change,
corr(profits, revenues_change) AS revenues_change
FROM fortune500;
Insert rows into the correlations table for profits_change and revenues_change.
DROP TABLE IF EXISTS correlations;
CREATE TEMP TABLE correlations AS
SELECT 'profits'::varchar AS measure,
corr(profits, profits) AS profits,
corr(profits, profits_change) AS profits_change,
corr(profits, revenues_change) AS revenues_change
FROM fortune500;
-- Add a row for profits_change
-- Insert into what table?
INSERT INTO correlations
-- Follow the pattern of the select statement above
-- Using profits_change instead of profits
SELECT 'profits_change'::varchar AS measure,
corr(profits_change, profits) AS profits,
corr(profits_change, profits_change) AS profits_change,
corr(profits_change, revenues_change) AS revenues_change
FROM fortune500;
-- Repeat the above, but for revenues_change
INSERT INTO correlations
SELECT 'revenues_change'::varchar AS measure,
corr(revenues_change, profits) AS profits,
corr(revenues_change, profits_change) AS profits_change,
corr(revenues_change, revenues_change) AS revenues_change
FROM fortune500;
- Select all rows and columns from the correlations table to view the correlation matrix.
- First, you will need to round each correlation to 2 decimal places.
- The output of corr() is of type double precision, so you will also need to cast the columns to numeric.
DROP TABLE IF EXISTS correlations;
CREATE TEMP TABLE correlations AS
SELECT 'profits'::varchar AS measure,
corr(profits, profits) AS profits,
corr(profits, profits_change) AS profits_change,
corr(profits, revenues_change) AS revenues_change
FROM fortune500;
INSERT INTO correlations
SELECT 'profits_change'::varchar AS measure,
corr(profits_change, profits) AS profits,
corr(profits_change, profits_change) AS profits_change,
corr(profits_change, revenues_change) AS revenues_change
FROM fortune500;
INSERT INTO correlations
SELECT 'revenues_change'::varchar AS measure,
corr(revenues_change, profits) AS profits,
corr(revenues_change, profits_change) AS profits_change,
corr(revenues_change, revenues_change) AS revenues_change
FROM fortune500;
-- Select each column, rounding the correlations
SELECT measure,
round(profits::numeric, 2) AS profits,
round(profits_change::numeric, 2) AS profits_change,
round(revenues_change::numeric, 2) AS revenues_change
FROM correlations;
output
measure | profits | profits_change | revenues_change |
profits | 1.00 | 0.02 | 0.02 |
profits_change | 0.02 | 1.00 | -0.09 |
revenues_change | 0.02 | -0.09 | 1.00 |
Fantastic work! When specifying the number of decimal places with the round or trunc functions, the first value must be of type numeric. The correlations were double precision before being cast to numeric.
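For comparison, the UNION approach mentioned at the start of this exercise could be sketched as one statement that stacks the three correlation queries (UNION ALL, since the rows are already distinct):
SELECT 'profits'::varchar AS measure,
       corr(profits, profits) AS profits,
       corr(profits, profits_change) AS profits_change,
       corr(profits, revenues_change) AS revenues_change
FROM fortune500
UNION ALL
SELECT 'profits_change'::varchar,
       corr(profits_change, profits),
       corr(profits_change, profits_change),
       corr(profits_change, revenues_change)
FROM fortune500
UNION ALL
SELECT 'revenues_change'::varchar,
       corr(revenues_change, profits),
       corr(revenues_change, profits_change),
       corr(revenues_change, revenues_change)
FROM fortune500;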
3. Exploring categorical data and unstructured text
Text, or character, data can get messy, but you’ll learn how to deal with inconsistencies in case, spacing, and delimiters. Learn how to use a temporary table to recode messy categorical data to standardized values you can count and aggregate. Extract new variables from unstructured text as you explore help requests submitted to the city of Evanston, IL.
3.1. Character data types and common issues
Count the categories
In this chapter, we’ll be working mostly with the Evanston 311 data in table evanston311. This is data on help requests submitted to the city of Evanston, IL.
This data has several character columns. Start by examining the most frequent values in some of these columns to get familiar with the common categories.
- How many rows does each priority level have?
-- Select the count of each level of priority
SELECT priority, count(*)
FROM evanston311
GROUP BY priority;
- How many distinct values of zip appear in at least 100 rows?
-- Find values of zip that appear in at least 100 rows
-- Also get the count of each value
SELECT zip, COUNT(*)
FROM evanston311
GROUP BY zip
HAVING COUNT(*) >= 100;
- How many distinct values of source appear in at least 100 rows?
-- Find values of source that appear in at least 100 rows
-- Also get the count of each value
SELECT source, COUNT(*)
FROM evanston311
GROUP BY source
HAVING COUNT(*)>=100;
- Select the five most common values of street and the count of each.
-- Find the 5 most common values of street and the count of each
SELECT street, count(*)
FROM evanston311
GROUP BY street
ORDER BY count(*) DESC
LIMIT 5;
Fabulous! Becoming familiar with common categorical values helps you learn what to expect from your data.
Spotting character data problems
Explore the distinct values of the street column. Select each street value and the count of the number of rows with that value. Sort the results by street to see similar values near each other.
Look at the results.
Which of the following is NOT an issue you see with the values of street?
Possible Answers
- The street suffix (e.g. Street, Avenue) is sometimes abbreviated
- There are sometimes extra spaces at the beginning and end of values
- House/street numbers sometimes appear in the column
- Capitalization is not consistent across values
- All of the above are potential problems
Correct! street values do not have extra spaces. You could verify this with a LIKE query.
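To check that claim yourself, a LIKE query along these lines should return 0 if no street values start or end with a space:
-- Count street values with a leading or trailing space
SELECT count(*)
FROM evanston311
WHERE street LIKE ' %'
   OR street LIKE '% ';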
3.2. Cases and spaces
Trimming
Some of the street values in evanston311 include house numbers with # or / in them. In addition, some street values end in a period (.).
Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the street values as a first attempt at cleaning up the values.
- Trim digits 0-9, #, /, ., and spaces from the beginning and end of street.
- Select distinct original street value and the corrected street value.
- Order the results by the original street value.
SELECT distinct street,
-- Trim off unwanted characters from street
trim(street, '0123456789 #/.') AS cleaned_street
FROM evanston311
ORDER BY street;
Good job! Note that the “cleaned” values still include letters from house numbers, and trim() stripped off some numbers that belong as part of road names. It can take several tries to find the right combination of functions to clean up messy values.
Exploring unstructured text
The description column of evanston311 has the details of the inquiry, while the category column groups inquiries into different types. How well does the category capture what’s in the description?
LIKE and ILIKE queries will help you find relevant descriptions and categories. Remember that with LIKE queries, you can include a % on each side of a word to find values that contain the word. For example:
SELECT category
FROM evanston311
WHERE category LIKE '%Taxi%';
% matches 0 or more characters.
Building up the query through the steps below, find inquiries that mention trash or garbage in the description without trash or garbage being in the category. What are the most frequent categories for such inquiries?
- Use ILIKE to count rows in evanston311 where the description contains ‘trash’ or ‘garbage’ regardless of case.
-- Count rows
SELECT Count(*)
FROM evanston311
-- Where description includes trash or garbage
WHERE description ILIKE '%trash%'
OR description ILIKE '%garbage%';
- category values are in title case. Use LIKE to find category values with 'Trash' or 'Garbage' in them.
-- Select categories containing Trash or Garbage
SELECT *
FROM evanston311
-- Use LIKE
WHERE category LIKE '%Trash%'
OR category LIKE '%Garbage%';
- Count rows where the description includes 'trash' or 'garbage' but the category does not.
-- Count rows
SELECT COUNT(*)
FROM evanston311
-- description contains trash or garbage (any case)
WHERE (description ILIKE '%trash%'
OR description ILIKE '%garbage%')
-- category does not contain Trash or Garbage
AND category NOT LIKE '%Trash%'
AND category NOT LIKE '%Garbage%';
- Find the most common categories for rows with a description about trash that don't have a trash-related category.
-- Count rows with each category
SELECT category, COUNT(*)
FROM evanston311
WHERE (description ILIKE '%trash%'
OR description ILIKE '%garbage%')
AND category NOT LIKE '%Trash%'
AND category NOT LIKE '%Garbage%'
-- What are you counting?
GROUP BY category
-- Order by most frequent values
ORDER BY COUNT(*) DESC
LIMIT 10;
Nice going! The results include some categories that appear to be related to trash, while others are more general.
3.3. Splitting and concatenating text
Concatenate strings
House number (house_num) and street are in two separate columns in evanston311. Concatenate them together with concat() with a space in between the values.
- Concatenate house_num, a space (' '), and street into a single value using the concat() function.
- Use a trim function to remove any spaces from the start of the concatenated value.
-- Concatenate house_num, a space, and street
-- and trim spaces from the start of the result
SELECT trim(concat(house_num,' ',street)) AS address
FROM evanston311;
output
address |
606-612 Sheridan Road |
930 Washington St |
1183-1223 Lincoln St |
Great! When joining values that might be NULL with a separator between them, consider using the concat_ws() function, which you can read about in the PostgreSQL documentation, to avoid duplicate or unnecessary separators in the result.
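A minimal sketch of that alternative on the same columns; the separator comes first, and NULL arguments are skipped, so a missing house_num does not leave a stray leading space:
-- concat_ws: first argument is the separator; NULLs are ignored
SELECT concat_ws(' ', house_num, street) AS address
FROM evanston311;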
Split strings on a delimiter
The street suffix is the part of the street name that gives the type of street, such as Avenue, Road, or Street. In the Evanston 311 data, sometimes the street suffix is the full word, while other times it is the abbreviation.
Extract just the first word of each street value to find the most common streets regardless of the suffix.
To do this, use
split_part(string_to_split, delimiter, part_number)
- Use split_part() to select the first word in street; alias the result as street_name.
- Also select the count of each value of street_name.
-- Select the first word of the street value
SELECT split_part(street,' ',1) AS street_name,
count(*)
FROM evanston311
GROUP BY street_name
ORDER BY count DESC
LIMIT 20;
output
street_name | count |
null | 1699 |
Chicago | 1569 |
Central | 1529 |
You got it!
Shorten long strings
The description column of evanston311 can be very long. You can get the length of a string with the length() function.
For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left() function to get a specified number of characters at the start of each value.
To indicate that more data is available, concatenate '...' to the end of any shortened description. To do this, you can use a CASE WHEN statement to add '...' only when the string length is greater than 50.
Select the first 50 characters of description when description starts with the word "I".
- Select the first 50 characters of description with '...' concatenated on the end where the length() of the description is greater than 50 characters. Otherwise just select the description as is.
- Select only descriptions that begin with the word 'I' and not the letter 'I'.
- For example, you would want to select “I like using SQL!”, but would not want to select “In this course we use SQL!”.
-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description) > 50
THEN left(description, 50) || '...'
-- otherwise just select description
ELSE description
END
FROM evanston311
-- limit to descriptions that start with the word I
WHERE description LIKE 'I %'
ORDER BY description;
output
description |
I work for Schermerhorn & Co. and manage this con… |
I Live in a townhouse with garbage cans in back, i… |
I Put In For Reserve Disabled Parking, A Week Ago … |
Well done! Shortening long fields can help you scan the values quickly and spot patterns you might not otherwise see.
3.4. Strategies for multiple transformations
Create an “other” category
If we want to summarize Evanston 311 requests by zip code, it would be useful to group all of the low frequency zip codes together in an “other” category.
Which of the following values, when substituted for ??? in the query, would give the result below?
Query:
SELECT CASE WHEN zipcount < ??? THEN 'other'
ELSE zip
END AS zip_recoded,
sum(zipcount) AS zipsum
FROM (SELECT zip, count(*) AS zipcount
FROM evanston311
GROUP BY zip) AS fullcounts
GROUP BY zip_recoded
ORDER BY zipsum DESC;
Result:
zip_recoded | zipsum |
60201 | 19054 |
60202 | 11165 |
null | 5528 |
other | 429 |
60208 | 255 |
- 255
- 1000
- 100
- 60201
Correct! All of the zip codes with fewer observations than 60208 have less than 100 rows in the table.
Group and recode values
There are almost 150 distinct values of evanston311.category. But some of these categories are similar, with the form “Main Category – Details”. We can get a better sense of what requests are common if we aggregate by the main category.
To do this, create a temporary table called recode that maps distinct category values to new, standardized values. Make the standardized values the part of the category before a dash ('-'). Extract this value with the split_part() function:
split_part(string text, delimiter text, field int)
You’ll also need to do some additional cleanup of a few cases that don’t fit this pattern.
Then the evanston311 table can be joined to recode to group requests by the new standardized category values.
- Create recode with a standardized column; use split_part() and then rtrim() to remove any remaining whitespace on the result of split_part().
-- Fill in the command below with the name of the temp table
DROP TABLE IF EXISTS recode;
-- Create and name the temporary table
CREATE TEMP TABLE recode AS
-- Write the select query to generate the table
-- with distinct values of category and standardized values
SELECT DISTINCT category,
rtrim(split_part(category, '-', 1)) AS standardized
-- What table are you selecting the above values from?
FROM evanston311;
-- Look at a few values before the next step
SELECT DISTINCT standardized
FROM recode
WHERE standardized LIKE 'Trash%Cart'
OR standardized LIKE 'Snow%Removal%';
output
standardized |
Snow Removal |
Snow Removal/Concerns |
Snow/Ice/Hazard Removal |
- UPDATE standardized values LIKE 'Trash%Cart' to 'Trash Cart'.
- UPDATE standardized values of 'Snow Removal/Concerns' and 'Snow/Ice/Hazard Removal' to 'Snow Removal'.
-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
SELECT DISTINCT category,
rtrim(split_part(category, '-', 1)) AS standardized
FROM evanston311;
-- Update to group trash cart values
UPDATE recode
SET standardized='Trash Cart'
WHERE standardized LIKE 'Trash%Cart';
-- Update to group snow removal values
UPDATE recode
SET standardized='Snow Removal'
WHERE standardized LIKE 'Snow%Removal%';
-- Examine effect of updates
SELECT DISTINCT standardized
FROM recode
WHERE standardized LIKE 'Trash%Cart'
OR standardized LIKE 'Snow%Removal%';
OUTPUT
standardized |
Snow Removal |
Trash Cart |
- UPDATE recode by setting standardized values of 'THIS REQUEST IS INACTIVE...Trash Cart', '(DO NOT USE) Water Bill', 'DO NOT USE Trash', and 'NO LONGER IN USE' to 'UNUSED'.
-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
SELECT DISTINCT category,
rtrim(split_part(category, '-', 1)) AS standardized
FROM evanston311;
UPDATE recode SET standardized='Trash Cart'
WHERE standardized LIKE 'Trash%Cart';
UPDATE recode SET standardized='Snow Removal'
WHERE standardized LIKE 'Snow%Removal%';
-- Update to group unused/inactive values
UPDATE recode
SET standardized='UNUSED'
WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart',
'(DO NOT USE) Water Bill',
'DO NOT USE Trash',
'NO LONGER IN USE');
-- Examine effect of updates
SELECT DISTINCT standardized
FROM recode
ORDER BY standardized;
OUTPUT
standardized |
ADA/Inclusion Aids |
Abandoned Bicycle on City Property |
Abandoned Vehicle |
- Now, join the evanston311 and recode tables to count the number of requests with each of the standardized values.
- List the most common standardized values first.
-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
SELECT DISTINCT category,
rtrim(split_part(category, '-', 1)) AS standardized
FROM evanston311;
UPDATE recode SET standardized='Trash Cart'
WHERE standardized LIKE 'Trash%Cart';
UPDATE recode SET standardized='Snow Removal'
WHERE standardized LIKE 'Snow%Removal%';
UPDATE recode SET standardized='UNUSED'
WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart',
'(DO NOT USE) Water Bill',
'DO NOT USE Trash', 'NO LONGER IN USE');
-- Select the recoded categories and the count of each
SELECT standardized, count(*)
-- From the original table and table with recoded values
FROM evanston311
LEFT JOIN recode
-- What column do they have in common?
ON evanston311.category = recode.category
-- What do you need to group by to count?
GROUP BY standardized
-- Display the most common values first
ORDER BY count(*) DESC;
output
standardized | count |
Broken Parking Meter | 6092 |
Trash | 3699 |
Ask A Question / Send A Message | 2595 |
Awesome! You now have a strategy for recoding the values of a variable. Next, you’ll use a similar approach to create some new variables.
Create a table with indicator variables
Determine whether medium and high priority requests in the evanston311 data are more likely to contain requesters’ contact information: an email address or phone number.
- Emails contain an @.
- Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: 555-555-1212.
Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and _ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows you to locate it within other text.
For example, '%___.com%' would allow you to search for a reference to a website with the top-level domain '.com' and at least three characters preceding it.
Create and store indicator variables for email and phone in a temporary table. LIKE produces True or False as a result, but casting a boolean (True or False) as an integer converts True to 1 and False to 0. This makes the values easier to summarize later.
- Create a temp table indicators from evanston311 with three columns: id, email, and phone.
- Use LIKE comparisons to detect the email and phone patterns that are in the description, and cast the result as an integer with CAST().
- Your phone indicator should use a combination of underscores (_) and dashes (-) to represent a standard 10-digit phone number format.
- Remember to start and end your patterns with % so that you can locate the pattern within other text!
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;
-- Create the temp table
CREATE TEMP TABLE indicators AS
SELECT id,
CAST (description LIKE '%@%' AS integer) AS email,
CAST (description LIKE '%___-___-____%' AS integer) AS phone
FROM evanston311;
-- Look at the first few rows of the table you created
SELECT *
FROM indicators;
OUTPUT
id | email | phone |
1340563 | 0 | 0 |
1826017 | 0 | 0 |
1849204 | 0 | 0 |
- Join the indicators table to evanston311, selecting the proportion of reports including an email or phone grouped by priority.
- Include adjustments to account for issues arising from integer division.
-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;
-- Create the temp table
CREATE TEMP TABLE indicators AS
SELECT id,
CAST (description LIKE '%@%' AS integer) AS email,
CAST (description LIKE '%___-___-____%' AS integer) AS phone
FROM evanston311;
-- Select the column you'll group by
SELECT priority,
-- Compute the proportion of rows with each indicator
sum(email)/count(*)::numeric AS email_prop,
sum(phone)/count(*)::numeric AS phone_prop
-- Tables to select from
FROM evanston311
LEFT JOIN indicators
-- Joining condition
ON evanston311.id=indicators.id
-- What are you grouping by?
GROUP BY priority;
output
priority | email_prop | phone_prop |
MEDIUM | 0.01966927763272410792 | 0.01845082680591818973 |
NONE | 0.00412220338419600412 | 0.00568465144110900568 |
HIGH | 0.01136363636363636364 | 0.02272727272727272727 |
You made it through! You not only created new variables but also used them successfully to learn about the data. Medium and high priority requests do contain contact information more frequently.
4. Working with dates and timestamps
What time is it? In this chapter, you’ll learn how to find out. You’ll aggregate date/time data by hour, day, month, or year and practice both constructing time series and finding gaps in them.
4.1. Date/time types and formats
ISO 8601
Which date format below conforms to the ISO 8601 standard?
- June 15, 2018 3:30pm
- 2018-06-15 15:30:00
- 6/15/18 13:00
- 2018-6-15 3:30:00
Correct! The units are ordered from largest to smallest, and single digit values have a leading zero.
Date comparisons
When working with timestamps, sometimes you want to find all observations on a given day. However, if you specify only a date in a comparison, you may get unexpected results. This query:
SELECT count(*)
FROM evanston311
WHERE date_created = '2018-01-02';
returns 0, even though there were 49 requests on January 2, 2018.
This is because dates are automatically converted to timestamps when compared to a timestamp. The time fields are all set to zero:
SELECT '2018-01-02'::timestamp;
2018-01-02 00:00:00
When working with both timestamps and dates, you’ll need to keep this in mind.
- Count the number of Evanston 311 requests created on January 31, 2017 by casting date_created to a date.
-- Count requests created on January 31, 2017
SELECT count(*)
FROM evanston311
WHERE date(date_created)='2017-01-31';
output
count |
45 |
- Count the number of Evanston 311 requests created on February 29, 2016 by using >= and < operators.
-- Count requests created on February 29, 2016
SELECT count(*)
FROM evanston311
WHERE date_created >= '2016-02-29'
AND date_created < '2016-03-01';
output
count |
58 |
- Count the number of requests created on March 13, 2017.
- Specify the upper bound by adding 1 to the lower bound.
-- Count requests created on March 13, 2017
SELECT count(*)
FROM evanston311
WHERE date_created >= '2017-03-13'
AND date_created < '2017-03-13'::date + 1;
output
count |
33 |
Good job! The strategy used in the first step is the simplest and will often work, but the other approaches may be useful in different circumstances.
Date arithmetic
You can subtract dates or timestamps from each other.
You can add time to dates or timestamps using intervals. An interval is specified with a number of units and the name of a datetime field. For example:
- '3 days'::interval
- '6 months'::interval
- '1 month 2 years'::interval
- '1 hour 30 minutes'::interval
Practice date arithmetic with the Evanston 311 data and now().
- Subtract the minimum date_created from the maximum date_created.
-- Subtract the min date_created from the max
SELECT max(date_created)-min(date_created)
FROM evanston311;
output
?column? |
911 days, 16:33:39 |
- Using now(), find out how long ago the most recent evanston311 request was created.
-- How old is the most recent request?
SELECT now()-max(date_created)
FROM evanston311;
output
?column? |
1678 days, 12:08:20.938001 |
- Add 100 days to the current timestamp.
-- Add 100 days to the current timestamp
SELECT now() +'100 days'::interval;
output
?column? |
2023-05-14 05:48:14.220367+02:00 |
- Select the current timestamp and the current timestamp plus 5 minutes.
-- Select the current timestamp,
-- and the current timestamp + 5 minutes
SELECT now(), now()+'5 minutes'::interval;
output
now | ?column? |
2023-02-03 05:49:18.505114+01:00 | 2023-02-03 05:54:18.505114+01:00 |
Completion time by category
The evanston311 data includes a date_created timestamp from when each request was created and a date_completed timestamp for when it was completed. The difference between these tells us how long a request was open.
Which category of Evanston 311 requests takes the longest to complete?
- Compute the average difference between the completion timestamp and the creation timestamp by category.
- Order the results with the largest average time to complete the request first.
-- Select the category and the average completion time by category
SELECT category,
AVG(date_completed - date_created) AS completion_time
FROM evanston311
GROUP BY category
-- Order the results
ORDER BY completion_time DESC;
OUTPUT
category | completion_time |
Rodents- Rats | 64 days, 10:58:23.000766 |
Fire Prevention – Public Education | 34 days, 16:48:10 |
Key Request – All City Employees | 32 days, 0:52:11 |
Nice going! The results of one query can help you generate further questions to answer: Why do rat issues take so long to resolve? We’ll investigate in an upcoming exercise.
4.2. Date/time components and aggregation
Date parts
The date_part() function is useful when you want to aggregate data by a unit of time across multiple larger units of time. For example, aggregating data by month across different years, or aggregating by hour across different days.
Recall that you use date_part() as:
SELECT date_part('field', timestamp);
In this exercise, you’ll use date_part() to gain insights about when Evanston 311 requests are submitted and completed.
- How many requests are created in each of the 12 months during 2016-2017?
-- Extract the month from date_created and count requests
SELECT date_part('month', date_created) AS month,
COUNT(*)
FROM evanston311
-- Limit the date range
WHERE date_created >= '2016-01-01'
AND date_created < '2018-01-01'
-- Group by what to get monthly counts?
GROUP BY month;
output
month | count |
6 | 3403 |
8 | 3110 |
4 | 2385 |
- What is the most common hour of the day for requests to be created?
-- Get the hour and count requests
SELECT date_part('hour',date_created) AS hour,
count(*)
FROM evanston311
GROUP BY hour
-- Order results to select most common
ORDER BY COUNT DESC
LIMIT 1;
OUTPUT
hour | count |
11 | 3960 |
- During what hours are requests usually completed? Count requests completed by hour.
- Order the results by hour.
-- Count requests completed by hour
SELECT date_part('hour',date_completed) AS hour,
COUNT(*)
FROM evanston311
GROUP BY hour
ORDER BY hour;
output
hour | count |
0 | 59 |
1 | 49 |
2 | 10 |
Brilliant! You can also aggregate by day or week of the year, but make sure you read the documentation about how they are computed before using these units.
Variation by day of week
Does the time required to complete a request vary by the day of the week on which the request was created?
We can get the name of the day of the week by converting a timestamp to character data:
to_char(date_created, 'day')
But character names for the days of the week sort in alphabetical, not chronological, order. To get the chronological order of days of the week with an integer value for each day, we can use:
EXTRACT(DOW FROM date_created)
DOW stands for “day of week.”
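A quick illustration on a literal date (2018-01-07 was a Sunday):
-- to_char() gives a blank-padded lowercase name; EXTRACT(DOW ...) gives 0-6 with Sunday = 0
SELECT to_char('2018-01-07'::date, 'day') AS day_name,        -- 'sunday   '
       EXTRACT(DOW FROM '2018-01-07'::date) AS day_number;    -- 0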
-- Select name of the day of the week the request was created
SELECT to_char(date_created, 'day') AS day,
-- Select avg time between request creation and completion
AVG(date_completed - date_created) AS duration
FROM evanston311
-- Group by the name of the day of the week and
-- integer value of day of week the request was created
GROUP BY day, EXTRACT(DOW FROM date_created)
-- Order by integer value of the day of the week
-- the request was created
ORDER BY EXTRACT(DOW FROM date_created);
output
day | duration |
sunday | 8 days, 23:02:01.677165 |
monday | 7 days, 0:47:18.215322 |
tuesday | 7 days, 2:36:24.417655 |
Terrific! Requests created at the beginning of the work week are closed sooner on average than those created at the end of the week or on the weekend.
Date truncation
Unlike date_part() or EXTRACT(), date_trunc() keeps date/time units larger than the field you specify as part of the date. So instead of just extracting one component of a timestamp, date_trunc() returns the specified unit and all larger ones as well.
Recall the syntax:
date_trunc('field', timestamp)
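To see the difference on a literal timestamp (illustration only, not exercise code):
-- date_part() returns a single number; date_trunc() returns a timestamp
-- with everything smaller than the field reset to its minimum
SELECT date_part('month', '2017-08-22 15:04:05'::timestamp) AS part,    -- 8
       date_trunc('month', '2017-08-22 15:04:05'::timestamp) AS trunc;  -- 2017-08-01 00:00:00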
Using date_trunc(), find the average number of Evanston 311 requests created per day for each month of the data. Ignore days with no requests when taking the average.
- Write a subquery to count the number of requests created per day.
- Select the month and average count per month from the daily_count subquery.
-- Aggregate daily counts by month
SELECT date_trunc('month', day) AS month,
AVG(count)
-- Subquery to compute daily counts
FROM (SELECT date_trunc('day',date_created) AS day,
count(*) AS count
FROM evanston311
GROUP BY day) AS daily_count
GROUP BY month
ORDER BY month;
output
month | avg |
2016-01-01 00:00:00+01:00 | 23.4838709677419355 |
2016-02-01 00:00:00+01:00 | 30.7586206896551724 |
2016-03-01 00:00:00+01:00 | 35.5483870967741935 |
You got it! This query ignores dates with no requests. You’ll learn how to account for missing dates in an upcoming exercise.
4.3. Aggregating with date/time series
Find missing dates
The generate_series() function can be useful for identifying missing dates.
Recall:
generate_series(from, to, interval)
where from and to are dates or timestamps, and interval can be specified as a string with a number and a unit of time, such as '1 month'.
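For example, a monthly series between two literal dates (illustration only):
-- generate_series() produces one row per step, from the start value up to the end value
SELECT generate_series('2018-01-01'::timestamp,
                       '2018-04-01'::timestamp,
                       '1 month'::interval) AS month_start;
-- 2018-01-01, 2018-02-01, 2018-03-01, 2018-04-01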
Are there any days in the Evanston 311 data where no requests were created?
- Write a subquery using generate_series() to get all dates between the min() and max() date_created in evanston311.
- Write another subquery to select all values of date_created as dates from evanston311.
- Both subqueries should produce values of type date (look for the ::).
- Select dates (day) from the first subquery that are NOT IN the results of the second subquery. This gives you days that are not in date_created.
SELECT day
-- 1) Subquery to generate all dates
-- from min to max date_created
FROM (SELECT generate_series(min(date_created),
max(date_created),
'1 day')::date AS day
-- What table is date_created in?
FROM evanston311) AS all_dates
-- 4) Select dates (day from above) that are NOT IN the subquery
WHERE day NOT IN
-- 2) Subquery to select all date_created values as dates
(SELECT date_created::date
FROM evanston311);
output
day |
2016-05-08 |
2016-11-06 |
2017-02-05 |
Fantastic! This approach works for finding missing values of other units of time, such as hours or months, as well.
Custom aggregation periods
Find the median number of Evanston 311 requests per day in each six month period from 2016-01-01 to 2018-06-30. Build the query following the three steps below.
Recall that to aggregate data by non-standard date/time intervals, such as six months, you can use generate_series() to create bins with lower and upper bounds of time, and then summarize observations that fall in each bin.
Remember: you can access the slides with an example of this type of query using the PDF icon link in the upper right corner of the screen.
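Here is a generic sketch of that bin-and-join pattern on small hypothetical integer values (1 through 10, bins of width 5), just to show the shape of the query before applying it to the real data:
-- Hypothetical illustration of binning with generate_series()
WITH bins AS (
     SELECT generate_series(1, 6, 5) AS lower,
            generate_series(6, 11, 5) AS upper),
     vals AS (
     SELECT generate_series(1, 10) AS val)
SELECT lower, upper, count(val)
  FROM bins
       LEFT JOIN vals
         ON val >= lower
        AND val < upper
 GROUP BY lower, upper
 ORDER BY lower;
-- bin (1, 6) contains 5 values; bin (6, 11) contains 5 values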
- Use generate_series() to create bins of 6 month intervals. Recall that the upper bin values are exclusive, so the values need to be one day greater than the last day to be included in the bin.
- Notice how in the sample code, the first bin value of the upper bound is July 1st, and not June 30th.
- Use the same approach when creating the last bin values of the lower and upper bounds (i.e. for 2018).
-- Generate 6 month bins covering 2016-01-01 to 2018-06-30
-- Create lower bounds of bins
SELECT generate_series('2016-01-01', -- First bin lower value
'2018-01-01', -- Last bin lower value
'6 months'::interval) AS lower,
-- Create upper bounds of bins
generate_series('2016-07-01', -- First bin upper value
'2018-07-01', -- Last bin upper value
'6 months'::interval) AS upper;
output
lower | upper |
2016-01-01 00:00:00+01:00 | 2016-07-01 00:00:00+02:00 |
2016-07-01 00:00:00+02:00 | 2017-01-01 00:00:00+01:00 |
2017-01-01 00:00:00+01:00 | 2017-07-01 00:00:00+02:00 |
- Count the number of requests created per day. Count date_created rather than *, so that days with no requests (where the join produces NULL) are not counted as 1.
- Include days with no requests by joining evanston311 to a daily series from 2016-01-01 to 2018-06-30.
- Note that because we are not generating bins here, you can use June 30th as your series end date.
-- Count number of requests made per day
SELECT day, COUNT(date_created) AS count
-- Use a daily series from 2016-01-01 to 2018-06-30
-- to include days with no requests
FROM (SELECT generate_series('2016-01-01', -- series start date
'2018-06-30', -- series end date
'1 day'::interval)::date AS day) AS daily_series
LEFT JOIN evanston311
-- match day from above (which is a date) to date_created
ON day = date_created::date
GROUP BY day;
- Assign each daily count to a single 6 month bin by joining bins to daily_counts.
- Compute the median value per bin using percentile_disc().
-- Bins from Step 1
WITH bins AS (
SELECT generate_series('2016-01-01',
'2018-01-01',
'6 months'::interval) AS lower,
generate_series('2016-07-01',
'2018-07-01',
'6 months'::interval) AS upper),
-- Daily counts from Step 2
daily_counts AS (
SELECT day, count(date_created) AS count
FROM (SELECT generate_series('2016-01-01',
'2018-06-30',
'1 day'::interval)::date AS day) AS daily_series
LEFT JOIN evanston311
ON day = date_created::date
GROUP BY day)
-- Select bin bounds
SELECT lower,
upper,
-- Compute median of count for each bin
percentile_disc(0.5) WITHIN GROUP (ORDER BY count) AS median
-- Join bins and daily_counts
FROM bins
LEFT JOIN daily_counts
-- Where the day is between the bin bounds
ON day >= lower
AND day < upper
-- Group by bin bounds
GROUP BY lower,upper
ORDER BY lower;
OUTPUT
lower | upper | median |
2016-01-01 00:00:00+01:00 | 2016-07-01 00:00:00+02:00 | 37 |
2016-07-01 00:00:00+02:00 | 2017-01-01 00:00:00+01:00 | 41 |
2017-01-01 00:00:00+01:00 | 2017-07-01 00:00:00+02:00 | 44 |
Well done! You might need to create custom bins to correspond to fiscal years, academic years, 2-week periods, or other reporting periods for your organization.
Monthly average with missing dates
Find the average number of Evanston 311 requests created per day for each month of the data.
This time, do not ignore dates with no requests.
- Generate a series of dates from 2016-01-01 to 2018-06-30.
- Join the series to a subquery to count the number of requests created per day.
- Use date_trunc() to get months from date (the column from the series, which includes all dates), NOT from day.
- Use coalesce() to replace NULL count values with 0. Compute the average of this value.
-- generate series with all days from 2016-01-01 to 2018-06-30
WITH all_days AS
(SELECT generate_series('2016-01-01',
'2018-06-30',
'1 day'::interval) AS date),
-- Subquery to compute daily counts
daily_count AS
(SELECT date_trunc('day', date_created) AS day,
count(*) AS count
FROM evanston311
GROUP BY day)
-- Aggregate daily counts by month using date_trunc
SELECT date_trunc('month', date) AS month,
-- Use coalesce to replace NULL count values with 0
avg(coalesce(count, 0)) AS average
FROM all_days
LEFT JOIN daily_count
-- Joining condition
ON all_days.date=daily_count.day
GROUP BY month
ORDER BY month;
output
month | average |
2016-01-01 00:00:00+01:00 | 23.4838709677419355 |
2016-02-01 00:00:00+01:00 | 30.7586206896551724 |
2016-03-01 00:00:00+01:00 | 35.5483870967741935 |
Well done. Because there are few days with no requests, including them doesn't change the averages much. But including them is always the right way to compute accurate averages!
4.4. Time between events
Longest gap
What is the longest time between Evanston 311 requests being submitted?
Recall the syntax for lead() and lag():
lag(column_to_adjust) OVER (ORDER BY ordering_column)
lead(column_to_adjust) OVER (ORDER BY ordering_column)
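A small illustration of lag() on a toy series (hypothetical values, not the exercise data):
-- lag() pulls the value from the previous row in the specified ordering
SELECT num,
       lag(num) OVER (ORDER BY num) AS previous,
       num - lag(num) OVER (ORDER BY num) AS gap
  FROM generate_series(2, 8, 2) AS t(num);
-- num = 2: previous NULL; num = 4: previous 2, gap 2; and so on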
- Select date_created and the date_created of the previous request using lead() or lag() as appropriate.
- Compute the gap between each request and the previous request.
- Select the row with the maximum gap.
-- Compute the gaps
WITH request_gaps AS (
SELECT date_created,
-- lead or lag
lag(date_created) OVER (ORDER BY date_created) AS previous,
-- compute gap as date_created minus lead or lag
date_created - lag(date_created) OVER (ORDER BY date_created) AS gap
FROM evanston311)
-- Select the row with the maximum gap
SELECT *
FROM request_gaps
-- Subquery to select maximum gap from request_gaps
WHERE gap = (SELECT max(gap)
FROM request_gaps);
output
date_created | previous | gap |
2018-01-07 19:41:34+01:00 | 2018-01-05 19:04:09+01:00 | 2 days, 0:37:25 |
Hooray! This query uses a WITH clause because we need to refer to request_gaps twice to select the row with the maximum value.
Rats!
Requests in category “Rodents- Rats” average over 64 days to resolve. Why?
Investigate in 4 steps:
- Why is the average so high? Check the distribution of completion times. Hint: date_trunc() can be used on intervals (see the short illustration after this list).
- See how excluding outliers influences average completion times.
- Do requests made in busy months take longer to complete? Check the correlation between the average completion time and requests per month.
- Compare the number of requests created per month to the number completed.
Remember: the time to resolve, or completion time, is date_completed - date_created.
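Here is the hint about date_trunc() on intervals in action (a small illustration, not exercise code):
-- date_trunc() on an interval drops the units smaller than the named field
SELECT date_trunc('day', '2 days 13:30:00'::interval) AS truncated;
-- truncated: 2 days 00:00:00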
- Use date_trunc() to examine the distribution of rat request completion times by number of days.
-- Truncate the time to complete requests to the day
SELECT date_trunc('day',date_completed - date_created) AS completion_time,
-- Count requests with each truncated time
count(*)
FROM evanston311
-- Where category is rats
WHERE category = 'Rodents- Rats'
-- Group and order by the variable of interest
GROUP BY completion_time
ORDER BY completion_time;
OUTPUT
completion_time | count |
0:00:00 | 73 |
1 day, 0:00:00 | 17 |
2 days, 0:00:00 | 23 |
- Compute average completion time per category excluding the longest 5% of requests (outliers).
SELECT category,
-- Compute average completion time per category
AVG(date_completed - date_created) AS avg_completion_time
FROM evanston311
-- Where completion time is less than the 95th percentile value
WHERE date_completed - date_created <
-- Compute the 95th percentile of completion time in a subquery
(SELECT percentile_disc(0.95) WITHIN GROUP (ORDER BY date_completed - date_created)
FROM evanston311)
GROUP BY category
-- Order the results
ORDER BY avg_completion_time DESC;
output
category | avg_completion_time |
Trash Cart – Downsize, Upsize or Remove | 12 days, 17:47:50.586912 |
Sanitation Billing Questions | 12 days, 11:13:25.888889 |
THIS REQUEST IS INACTIVE…Trash Cart – Compost Bin | 12 days, 6:32:42.024390 |
- Get the correlation (corr()) between the average completion time and the number of requests per month. Note that EXTRACT(epoch FROM interval) returns the number of seconds in an interval, giving a numeric value that corr() and avg() can work with.
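For instance (illustration only), converting an interval to seconds:
-- epoch converts an interval to a total number of seconds
SELECT EXTRACT(epoch FROM '2 days 3 hours'::interval) AS seconds;
-- 183600 (2 * 86400 + 3 * 3600)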
-- Compute correlation (corr) between
-- avg_completion time and count from the subquery
SELECT corr(avg_completion, count)
-- Convert date_created to its month with date_trunc
FROM (SELECT date_trunc('month', date_created) AS month,
-- Compute average completion time in number of seconds
AVG(EXTRACT(epoch FROM date_completed - date_created)) AS avg_completion,
-- Count requests per month
count(*) AS count
FROM evanston311
-- Limit to rodents
WHERE category='Rodents- Rats'
-- Group by month, created above
GROUP BY month)
-- Required alias for subquery
AS monthly_avgs;
output
corr |
0.23199855213424334 |
- Select the number of requests created and number of requests completed per month.
-- Compute monthly counts of requests created
WITH created AS (
SELECT date_trunc('month',date_created) AS month,
count(*) AS created_count
FROM evanston311
WHERE category='Rodents- Rats'
GROUP BY month),
-- Compute monthly counts of requests completed
completed AS (
SELECT date_trunc('month',date_completed) AS month,
count(*) AS completed_count
FROM evanston311
WHERE category='Rodents- Rats'
GROUP BY month)
-- Join monthly created and completed counts
SELECT created.month,
created_count,
completed_count
FROM created
INNER JOIN completed
ON created.month=completed.month
ORDER BY created.month;
output
month | created_count | completed_count |
2016-01-01 00:00:00+01:00 | 10 | 1 |
2016-02-01 00:00:00+01:00 | 22 | 11 |
2016-03-01 00:00:00+01:00 | 31 | 14 |
Outstanding! There is a slight correlation between completion times and the number of requests per month. But the bigger issue is the disproportionately large number of requests completed in November 2017.
Wrap-up