Statistics are all around us, from marketing to sales to healthcare. The ability to collect, analyze, and draw conclusions from data is extremely valuable, and people in roles that are not traditionally analytical are increasingly expected to understand the fundamental concepts of statistics. This course will equip you with the skills to feel confident analyzing data to draw insights. You’ll be introduced to common methods used for summarizing and describing data, learn how probability can be applied to commercial scenarios, and discover how experiments are conducted to understand relationships and patterns. You’ll work with real-world datasets including crime data in London, England, and sales data from an online retail company!
Course URL: https://www.datacamp.com/courses/introduction-to-statistics-in-python
2.1. Summary Statistics
Summary statistics gives you the tools you need to describe your data. In this chapter, you’ll explore summary statistics including mean, median, and standard deviation, and learn how to accurately interpret them. You’ll also develop your critical thinking skills, allowing you to choose the best summary statistics for your data.
2.1.1. What is statistics?
Using statistics in the real world
Recall that statistics can help to answer specific, measurable questions.
In this exercise, you have been provided with several real-world scenarios and need to select which one can be solved through the application of statistics.
- Why do some people prefer dogs to cats?
- Testing whether a new model of car is safer than the current model?
- What factors make one TV show more popular than another?
- What will tomorrow’s winning lottery numbers be?
Yes! Using inferential statistics to check whether a new product improves on a current version is common practice in many industries.
Identifying data types
You saw that there are two main types of data: numeric and categorical.
Numeric data can be classified as either continuous or count/interval, and categorical data can be classified as either nominal or ordinal. The data type determines which approaches are suitable when summarizing your data.
You’ve been provided with several examples to classify as continuous, count, nominal, or ordinal data.
- Map each example to its data type by dragging each item and dropping it into the correct data type.
Well done! Now let’s look at how statistics are used in the real world!
Descriptive vs. Inferential statistics
Recall that there are two main branches of statistics—descriptive statistics and inferential statistics.
Understanding what type of statistics is required for a given situation is an essential skill in order to draw accurate conclusions.
Now you have the opportunity to check your knowledge of how these two branches of statistics are used in practice. You’ll classify each scenario as a task for descriptive or inferential statistics.
- Choose which type of statistics to use for each scenario.
Exactly! Recognizing the type of statistics required to answer a given question allows you to choose the appropriate method and, by extension, produce the most accurate answer!
2.1.2. Measures of center
Typical number of robberies per London Borough
In the video, you saw that the mean and median can both provide information about the typical value of a variable.
Here are three definitions representing the mode, median, and mean, along with their respective values for robberies in the London crimes dataset:
Add all values and divide by the number of observations | The London Borough where Robbery occurs most frequently | Sort all the data and take the middle value |
1496.16 | Westminster | 1354.5 |
Looking at the table, what are the values of the mean and median?
- Mean: 1496.16 ; Median: Westminster
- Mean: 1354.50 ; Median: 1496.16
- Mean: 1496.16 ; Median: 1354.50
- Mean: Westminster ; Median: 1354.50
Well done! The mean number of robberies is higher than the median, suggesting one or more London boroughs have a particularly high volume of robberies.
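The effect described above, where a few very high values pull the mean above the median, can be sketched in a few lines. The borough counts below are made-up illustrative values, not the London crimes dataset, and the variable names are hypothetical.

```python
from statistics import mean, median

# Hypothetical robbery counts for seven boroughs (NOT the real London data).
# The last value is deliberately extreme to mimic one very busy borough.
robberies = [900, 1100, 1200, 1350, 1400, 1500, 4000]

typical_mean = mean(robberies)      # add all values, divide by the count
typical_median = median(robberies)  # sort the values, take the middle one

print(f"mean:   {typical_mean:.2f}")
print(f"median: {typical_median:.2f}")
```

Removing the extreme value from the list brings the two measures much closer together, which is one quick way to check whether a handful of observations is driving the mean.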
Choosing a measure
Selecting the correct measure of center is essential when describing a typical value of the data being observed.
An app has been displayed, which shows a histogram of robberies in London. You can use the app to change which type of crime is displayed, and whether to include a dotted line for the mean or median.
Your task is to decide which type of crime has a symmetrical histogram and can be accurately described by both the mean and median.
- Robbery
- Drug Offenses
- Arson and Criminal Damage
- Public Order Offenses
- None of these crimes
Correct! None of these crimes have a symmetrical distribution and there are different values for the mean and median, so we cannot accurately describe a typical value using both measures!
London Boroughs with most frequent crimes
The mean and median are great for summarizing numeric data, but if you want to understand the typical value of a categorical variable then these measures can’t be applied.
An app containing the count of various crimes for each London Borough has been provided. You can use the arrows next to the column names to sort the data from smallest to largest and vice versa.
Using this, your task is to find out which London Boroughs are the mode for Vehicle Offenses and Burglary.
- Vehicle Offenses: Kingston upon Thames ; Burglary: Tower Hamlets
- Vehicle Offenses: Hackney ; Burglary: Kingston upon Thames
- Vehicle Offenses: Enfield ; Burglary: Greenwich
- Vehicle Offenses: Enfield ; Burglary: Tower Hamlets
Magnificent mode detection skills!
2.1.3. Measures of spread
Defining measures of spread
In the video you learned about several measures of spread—the range, the variance, and the standard deviation.
So, how do you calculate each measure and what can they tell us about the data?
Three buckets have been included, one for each measure. You’ll need to match the definition and use for each of these measures to the appropriate bucket.
- Place the definitions and use for each measure into their respective bucket.

Well done! Now that you have a sense of the different ways to measure spread and their uses, let’s see how visualization can help you understand spread.
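As a rough sketch of how the three measures are calculated, the snippet below computes the range, variance, and standard deviation for two small made-up datasets. It assumes the population versions of variance and standard deviation (`pvariance` and `pstdev`); the sample versions in the same module are `variance` and `stdev`. The data and the helper name `describe_spread` are hypothetical.

```python
from statistics import pstdev, pvariance

# Hypothetical monthly crime counts (illustrative values only).
narrow = [48, 50, 51, 49, 52]   # values bunched close to the mean
wide = [10, 30, 50, 70, 90]     # same mean, values far from it

def describe_spread(values):
    """Return (range, population variance, population standard deviation)."""
    value_range = max(values) - min(values)
    return value_range, pvariance(values), pstdev(values)

narrow_spread = describe_spread(narrow)
wide_spread = describe_spread(wide)

print("narrow:", narrow_spread)
print("wide:  ", wide_spread)
```

The variance is expressed in squared units, while the standard deviation is its square root and so shares the units of the data, which is why it is usually the easier of the two to interpret.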
Box plots for measuring spread
Data visualization can be useful in highlighting measures of spread, such as the interquartile range (IQR).
Below is a box plot displaying the number of crimes across all London Boroughs in February 2021, grouped by the type of crime.
Your task is to use the plot to determine which type of crime had the largest interquartile range for this month.
- Possession of Weapons
- Vehicle offenses
- Drug offenses
- Theft
Correct! Sometimes it might not be clear if two categories have similar interquartile ranges, in which case the measure needs to be calculated.
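When two boxes look alike, the interquartile range can be calculated directly. Here is a minimal sketch using `statistics.quantiles` from the Python standard library (Python 3.8 or later) with its default method. Other tools may use slightly different quartile conventions, so results can differ a little between libraries. The two datasets and the helper name `iqr` are hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _q2, q3 = quantiles(values, n=4)  # three cut points split data in four
    return q3 - q1

# Hypothetical counts for two crime types across eleven boroughs.
crime_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
crime_b = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]

iqr_a = iqr(crime_a)
iqr_b = iqr(crime_b)
print("IQR of crime A:", iqr_a)
print("IQR of crime B:", iqr_b)
```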
Which crime has the larger standard deviation
For the final exercise of the chapter, it’s time to see how you interpret spread using the standard deviation.
Two histograms are displayed—one for Public Order Offenses and another for Miscellaneous Crimes Against Society.
Your task is to choose which type of crime has a larger standard deviation.
- It’s not possible to determine which has the larger standard deviation based on the plots.
- Miscellaneous Crimes Against Society
- Public Order Offenses
Perfect! Public Order Offenses has a much larger x-value range than Miscellaneous Crimes Against Society, and since the standard deviation is measured in units of the data, Public Order Offenses has a much larger standard deviation.
2.2. Probability and distributions
Probability underpins a large part of statistics, where it is used to calculate the chance of events occurring. You’ll work with real-world sales data and learn how data with different values can be interpreted as a probability distribution. You’ll find out about discrete and continuous probability distributions, including the discovery of the normal distribution and how it occurs frequently in natural events!
2.2.1. What are the chances?
What is more likely?
You saw how to calculate the probability of a single event in the video. Now let’s check your understanding of this process.
Four scenarios have been provided; your task is to choose which one is most likely to occur.
- Scoring 90% or more in an exam, when only 5% of students score this highly.
- A restaurant receiving an order for a steak, where steak orders have made up 100 out of 1000 total orders to date.
- Picking a red card out of a standard pack of 52 playing cards, where half of the cards are red and the other half are black.
- Drawing a person’s name out of a hat that also contains 19 other names.
Correct! There are 26 red cards in a pack of playing cards, and 52 cards in total, so the probability of selecting one is one in two, or fifty percent!
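The comparison can be checked by writing each scenario as favorable outcomes divided by total outcomes. This sketch uses only the figures stated in the four options; the dictionary keys are shorthand labels chosen for illustration.

```python
from fractions import Fraction

# Probability of a single event = favorable outcomes / total outcomes.
scenarios = {
    "exam score of 90% or more": Fraction(5, 100),   # 5% of students
    "steak order": Fraction(100, 1000),              # 100 of 1000 orders
    "red card": Fraction(26, 52),                    # 26 red cards of 52
    "one name out of a hat": Fraction(1, 20),        # 1 name plus 19 others
}

most_likely = max(scenarios, key=scenarios.get)

for name, probability in scenarios.items():
    print(f"{name}: {float(probability):.0%}")
print("most likely:", most_likely)
```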
Chances of the next sale being more than the mean
In the video, you saw how to calculate the probability of the next order in the online retail sales dataset being for a specific product type.
In this exercise, you will determine the chances of the next order being worth more than the mean order value, which is $188.50.
You will need to identify the number of orders generating Total Net Sales more than or equal to the mean, and divide this value by a count of all orders.
The app has a table containing definitions and values for the measures you require to calculate this probability. Use the input fields to enter the correct values, which will produce the probability as a percentage. Use this output to select an appropriate answer from the options provided.
- 50.37%
- 22.98%
- 198.54%
Well done! There is around a 23% chance of an order being worth more than $188.50.
2.2.2. Conditional probability
Dependent vs. Independent events
It can be tricky to decide whether events are dependent or independent, but understanding this is crucial when correctly calculating the probability of event outcomes!
In this exercise, you have been provided with some scenarios and need to classify them as dependent or independent.
- Match the scenarios to the appropriate bucket based on whether they are dependent or independent events.
Well done! When trying to determine if two events are dependent, ask yourself whether the outcome of one changes the probability of another.
Orders of more than 10 basket products
Recall that the order in which dependent events occur affects conditional probability.
In this exercise, you will need to use the image to calculate the conditional probability that an order from the online retail dataset will be for more than 10 items, given the order is for Basket products.
- 19/125
- 19/551
- 125/1767
- 19/1767
Perfect! This works out as a chance of approximately 3.4%, so fairly unlikely!
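The arithmetic behind this answer can be sketched as follows. The image is not reproduced in these notes, so the meaning attached to each count is an assumption inferred from the answer options: 551 is taken to be the number of Basket orders and 19 the number of those with more than 10 items.

```python
from fractions import Fraction

# ASSUMPTION: these interpretations are inferred from the answer options,
# since the original image is not available here.
basket_orders = 551          # assumed: all orders for Basket products
basket_orders_over_10 = 19   # assumed: Basket orders with more than 10 items

# P(more than 10 items | Basket) = count of both / count of the condition
p_over_10_given_basket = Fraction(basket_orders_over_10, basket_orders)

print(p_over_10_given_basket)                  # Fraction reduces automatically
print(f"{float(p_over_10_given_basket):.1%}")
```

Note that the denominator is the number of orders that satisfy the condition (Basket products), not the total number of orders.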
2.2.3. Discrete distributions
Identifying distributions
Recognizing the type of distribution is very important when analyzing data, as you will learn later on in the course.
You have been provided with an image containing three probability distributions drawn from samples of different sets of data. Your task is to choose which sample is most likely to have been taken from a uniform distribution.
- A
- B
- C
Great work! Since the histogram depicts a sample and not the actual probability distribution, each outcome won’t occur the exact same number of times due to randomness, but this looks pretty close.
Sample mean vs. Theoretical mean
The app will take a sample from a discrete uniform distribution, which includes the numbers one through nine, and calculate the sample’s mean. You can adjust the size of the sample using the slider. Note that the expected value of this distribution is five.
A sample is taken, and you win twenty dollars if the sample’s mean is less than four. There’s a catch: you get to pick the sample’s size.
Which sample size is most likely to win you the twenty dollars?
- 10
- 100
- 1000
- 5000
- 10000
Nice work! Since the sample mean will likely be closer to 5 (the expected value) with larger sample sizes, you have a better chance of getting a sample mean further away from 5 with a smaller sample.
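A small simulation can illustrate this reasoning. The sketch below is not the course app. It is a hypothetical stand-in that repeatedly draws samples from the integers one through nine and estimates how often the sample mean falls below four. The function name, the number of trials, and the seed are all arbitrary choices.

```python
import random

def chance_mean_below(threshold, sample_size, trials, rng):
    """Estimate P(sample mean < threshold) for draws from the integers 1 to 9."""
    wins = 0
    for _ in range(trials):
        total = sum(rng.randint(1, 9) for _ in range(sample_size))
        if total / sample_size < threshold:
            wins += 1
    return wins / trials

rng = random.Random(42)  # fixed seed so the run is repeatable

# Only three of the sample sizes are simulated to keep the run short.
estimates = {
    n: chance_mean_below(4, n, trials=500, rng=rng) for n in (10, 100, 1000)
}

for n, estimate in estimates.items():
    print(f"sample size {n}: estimated chance of winning = {estimate:.3f}")
```

The estimates depend on the seed and the number of trials, but the ordering should be stable: the smallest sample size gives by far the best chance of a mean below four.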
2.2.4. Continuous distributions
Discrete vs. Continuous distributions
You’ve learned about some distributions for discrete and continuous data. These distributions can be used to visualize the probability of outcomes for everyday situations.
Now you will need to classify the scenarios provided based on which type of distribution they follow.
- Match the scenarios to the appropriate type of distribution.
Congratulations. Understanding the distribution of your data is often a necessary step before you can begin to analyze it!
Finding the normal distribution
In the video, you were introduced to the normal distribution, which is prevalent in many natural phenomena.
Your task is to identify which of the four plots most closely represents a normal distribution.
Which of the following represents a normal distribution?
- a
- b
- c
- d
Correct! While not perfect, this plot does appear to most closely resemble a normal distribution based on its shape looking like a bell curve, and that it is symmetrical. In the next chapter, you will find out about even more distributions!
Calculating probability with a uniform distribution
In the video, you saw that the area underneath the line of a continuous uniform distribution can be used to calculate the probability of an event outcome.
Here is a distribution displaying the probability of waiting anywhere from zero to 20 minutes for a train to arrive.
Your task is to calculate the probability of the next train arriving between five and 13 minutes.
- 5%
- 40%
- 80%
- 30%
Correct! Subtracting five from 13 gives eight out of a possible 20 minutes, which can be reduced to two out of five, or a 40% probability of the train arriving between five and 13 minutes! Now let’s learn some more distributions!
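The same area calculation can be written as a short function. This is a minimal sketch; the function name `uniform_probability` and its parameters are hypothetical.

```python
def uniform_probability(low, high, start, end):
    """P(start <= X <= end) for X uniformly distributed on [low, high]."""
    # Clip the interval to the range the distribution actually covers.
    start, end = max(start, low), min(end, high)
    # Area of a rectangle: width of the interval times height 1 / (high - low).
    return max(end - start, 0) / (high - low)

# Waiting between 5 and 13 minutes when the wait is uniform on 0 to 20 minutes.
p_wait = uniform_probability(low=0, high=20, start=5, end=13)
print(f"{p_wait:.0%}")
```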
2.3. More Distributions and the Central Limit Theorem
It’s time to explore more probability distributions. You’ll learn about the binomial distribution for visualizing the probability of binary outcomes, and one of the most important distributions in statistics, the normal distribution. You’ll see how distributions can be described by their shape, along with discovering the Poisson distribution and its role in calculating the probabilities of events occurring over time. You’ll also gain an understanding of the central limit theorem!
2.3.1. The binomial distribution
Recognizing a binomial distribution
Recall that a binomial distribution counts the number of successes in independent events.
Four scenarios are provided below; your task is to choose which one describes a binomial distribution based on the type of data described.
- The probability of a train arriving in under 10 minutes.
- The probability of a man’s height being 175 centimeters or less.
- The probability of matching four numbers in the lottery.
- The probability of the temperature being more than 80 degrees Celsius.
Brilliant binomial distribution deduction skills! As the lottery ticket has a target of four numbers, it can be classed as a success or failure depending on whether that many numbers were matched or not.
How probability affects the binomial distribution
Recall that the binomial distribution can be described by two parameters, n and p.
To examine how these parameters affect the distribution, three plots have been provided representing the probability of closing between one and 12 sales per week for three sales people. The probability, p, is different for each individual.
Your task is to select which sales person has the highest probability of closing nine or more sales per week.
- George
- Izzy
- James
Correct! See how James’ distribution has higher probabilities towards the right of the plot, representing an increased probability of closing more sales per week.
Identifying n and p
Great work recognizing which sales person had the highest probability of closing nine or more sales.
Recall that the shape of a binomial distribution is determined by the parameters n and p and the expected value is represented by the bar with the highest peak.
But can you spot what n and p are for James’ sales distribution?
- n = 8, p = 12
- n = 12, p = 0.23
- n = 8, p = 0.23
- n = 12, p = 0.67
Perfect parameter identification! The distribution shows that n equals 12, as there is a maximum of 12 possible sales per week, and eight is the most likely outcome, meaning p can be found by dividing eight by 12, resulting in 0.67.
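To see how n and p shape the distribution, the probabilities can be computed directly from the binomial formula. This is a sketch that takes n = 12 and p = 8/12 from the answer above; the function name `binomial_pmf` is hypothetical, and the plot from the exercise is not reproduced.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 12, 8 / 12  # parameters taken from the exercise answer

# Probability of each possible number of sales, from 0 to n.
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

most_likely_sales = max(range(n + 1), key=lambda k: pmf[k])
p_nine_or_more = sum(pmf[9:])

print("most likely number of sales:", most_likely_sales)
print(f"expected value n * p: {n * p:.2f}")
print(f"P(nine or more sales): {p_nine_or_more:.3f}")
```

Here the tallest bar and the expected value n × p coincide. In general the expected value need not be a whole number, so the tallest bar is best read as an approximation of it.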
2.3.2. The normal distribution
Recognizing the normal distribution
The normal distribution is commonly observed in the real world, such as shoe size, adult height, birth weight of babies, and IQ scores; therefore, it is important to be able to recognize this distribution when visualizing data.
Here is a grid of four plots displaying the distribution of schools achieving various percentages of pass grades for secondary school exams in the United Kingdom, where each visualization represents students from a different ethnicity. Your task is to identify which of these plots does not represent a normal distribution.
- Plot A
- Plot B
- Plot C
- Plot D
Perfect! This distribution shows that more schools achieve a higher proportion of pass grades among this group of students.
What makes the normal distribution special?
In the video, you learned several special facts about the normal distribution.
Your task in this exercise is to classify which of these facts are unique to the normal distribution and which apply to all probability distributions.
- Classify each fact as either being specific to the normal distribution or applicable to any probability distribution.
Congratulations, you clearly recognize how the normal distribution is unique! Now let’s look at skewed distributions.
Identifying skewness
While the normal distribution is commonly observed in the real world, it is also quite likely that you will encounter data that is skewed.
Recognizing the direction of the skew and what this means for the data you are interpreting is an extremely valuable skill!
Two distributions have been provided. Each statement describes either the direction of skewness for one of the two distributions, or an interpretation of the data based on its distribution.
Your task is to classify whether the following statements are true or false.
- Classify each statement as true or false.
Great work! It can be tricky to identify the direction of a skewed distribution, but building this skill gives a huge advantage in interpreting data!
Describing distributions using kurtosis
Another means of describing the shape of a distribution is by its kurtosis, which can represent the size of its central peak and how spread out the tails are.
Kurtosis allows you to summarize whether values are bunched up close to the mean, and how far out any extreme values may lie.
Three definitions have been provided. Your task is to select which one accurately represents a distribution with negative kurtosis.
- A distribution with the same kurtosis as a normal distribution.
- A distribution with a larger central peak and smaller tails compared to a typical normal distribution.
- A distribution with a smaller peak and wider tails compared to a typical normal distribution.
Excellent! Now let’s learn about sampling and a concept known as the central limit theorem!
2.3.3. The central limit theorem
Visualizing sampling distributions
Now let’s check your understanding of how the central limit theorem applies to different distributions!
An app has been displayed allowing you to create sampling distributions of different summary statistics from samples of different distributions.
Which distribution does the central limit theorem apply to?
- Discrete uniform distribution
- Continuous uniform distribution
- Binomial distribution
- All of the above
Victorious visualizing! Regardless of the shape of the distribution you’re taking sample means from, the central limit theorem will apply if the sampling distribution contains enough sample means.
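The behavior of a sampling distribution of the mean can be sketched with a short simulation. The example below is a hypothetical stand-in for the app: it takes many samples of fair six-sided die rolls (a discrete uniform distribution) and compares the spread of the sample means with the standard error predicted by theory. The sample size, number of samples, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 30        # rolls per sample
num_samples = 2000      # number of sample means collected

# Each entry is the mean of one sample of fair six-sided die rolls.
sample_means = [
    mean([rng.randint(1, 6) for _ in range(sample_size)])
    for _ in range(num_samples)
]

center = mean(sample_means)
spread = pstdev(sample_means)

# For a fair die the mean is 3.5 and the variance is 35 / 12, so the
# standard error of the mean is sqrt(35 / 12) / sqrt(sample_size).
theoretical_spread = sqrt(35 / 12) / sqrt(sample_size)

print(f"mean of sample means: {center:.3f} (theory: 3.5)")
print(f"std of sample means:  {spread:.3f} (theory: {theoretical_spread:.3f})")
```

Plotting `sample_means` as a histogram should give a roughly bell-shaped curve, even though a single die roll is uniformly distributed.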
The CLT vs. The law of large numbers
Recall that earlier in the course you learned about the law of large numbers. As a reminder, here is a plot showing the distribution of 100 samples and 1000 samples of a die roll.
This concept can sometimes be confused with the central limit theorem, so let’s check your understanding of the definitions and differences between the two.
You will need to classify each statement as being applicable to either the central limit theorem or the law of large numbers.
- Match the statement to the correct concept.
Nicely done! Understanding the difference between these two concepts is important as we begin to learn about hypothesis testing in the next chapter! Now let’s examine when to use the central limit theorem!
When to use the central limit theorem
In the video, you saw a workflow on whether the central limit theorem should be applied or not, depending on various factors.
In this exercise, four scenarios have been provided. Your task is to choose which scenario would require the use of the central limit theorem to produce summary statistics that are normally distributed.
- Finding the mean IQ of all 100 students in a high school.
- Measuring the standard deviation of household income in a road with 20 houses.
- Determining the percentage of adults in the USA that have been diagnosed with Type 2 Diabetes.
- Counting the number of babies born with blue eyes in a hospital within one day.
Nicely done! It would be impossible to find out the Type 2 Diabetes status of every adult in the USA, so an appropriate approach is to take lots of small samples across several locations and use the sampling distribution to calculate the percentage diagnosed.
2.3.4. The Poisson distribution
Identifying Poisson processes
The Poisson distribution is very important because it occurs in a variety of real-life circumstances. Recognizing when the Poisson distribution applies can be helpful for business planning or in studying the occurrence of events in nature!
Your task in this exercise is to match the scenarios provided to either the Poisson distribution or another distribution.
- Match the scenarios to the appropriate distribution.
Perfect Poisson identification skills! Now let’s see if you can recognize the value of lambda when presented with a Poisson distribution.
Recognizing lambda in the Poisson distribution
Now that you’ve learned about the Poisson distribution, you know that its shape is described by a value called lambda (λ). In this exercise, you’ll select which of these plots represents a Poisson distribution with lambda equal to two.
- A
- B
- C
Perfect! Identifying lambda when presented with a Poisson distribution can help in understanding the probability of a rate of events occurring, which has applications in a range of industries.
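To get a feel for what a Poisson distribution with lambda equal to two looks like, its probabilities can be computed from the formula P(k) = e^(-λ) λ^k / k!. This sketch prints a crude text histogram; the function name `poisson_pmf` is hypothetical, and the plots from the exercise are not reproduced.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2
# Probabilities for 0 to 10 events; larger counts are vanishingly unlikely.
pmf = [poisson_pmf(k, lam) for k in range(11)]

for k, probability in enumerate(pmf):
    bar = "#" * round(probability * 100)
    print(f"{k:2d}  {probability:.3f}  {bar}")
```

With lambda equal to two, the outcomes of one and two events are equally likely and form the peak, with probabilities tailing off to the right.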
2.4. Correlation and Hypothesis Testing
In the final chapter, you’ll be introduced to hypothesis testing and how it can be used to accurately draw conclusions about a population. You’ll discover correlation and how it can be used to quantify a linear relationship between two variables. You’ll find out about experimental design techniques such as randomization and blinding. You’ll also learn about concepts used to minimize the risk of drawing the wrong conclusion about the results of hypothesis tests!
2.4.1. Hypothesis testing
Sunshine and sleep
Say you are planning to perform two hypothesis tests:
- One to establish whether a relationship exists between the number of hours of sunshine per day in London and how many hours per night the residents sleep, and
- Another to check if exercise is related to blood pressure in elderly men.
Some statements have been provided. Your task is to correctly classify each statement as describing a null hypothesis or an alternative hypothesis for these two tests.
- Classify each statement as either a null hypothesis or an alternative hypothesis.
Well done! The alternative hypothesis can state that a relationship exists or propose how changes in the values of one variable may align with the values of another variable. Now let’s look at the hypothesis testing workflow!
The hypothesis testing workflow
Hypothesis testing requires a sequence of tasks to be completed in a particular order.
You are preparing to perform a hypothesis test on whether a difference exists between the frequency of colds in 18-30 year olds who eat meat and vegetarians.
Your task is to reorder the actions to accurately represent a typical hypothesis testing workflow.
- Order the sequence of events to accurately represent a general hypothesis testing workflow.
Congratulations! Remember, there are many ways to perform hypothesis testing, as we will see shortly, but this is a great workflow to ensure you do not perform an action earlier than you should!
Independent and dependent variables
You are planning a hypothesis test to see if there is a relationship between the amount of sunshine a city receives per year and how many hours are worked annually. Your alternative hypothesis is that more sunshine is associated with fewer hours worked.
You monitor the amount of sunshine in three cities for one hundred days, and collect data on the number of hours worked from 30 people in each city.
Your task is to decide what would be considered the dependent variable in this hypothesis test.
- City
- Hours worked annually
- Amount of sunshine
Impressive! The alternative hypothesis means that you expect the number of hours worked annually to change based on how much sunshine is received.
2.4.2. Experiments
Recognizing controlled trials
The video highlighted the benefits of controlled trials, particularly using randomization and double-blinding. However, there are lots of ways to perform experiments and sometimes a controlled trial might not be appropriate or even feasible.
In this exercise, you’ll need to classify each scenario based on whether it describes a controlled trial or another type of experiment.
- Place each scenario into the bucket describing the correct type of experiment.
Fantastic! The other scenarios also describe experiments, just not controlled trials.
Why use randomization?
Controlled trials allow participants to be split into either a control group or a treatment group. Randomization is incredibly beneficial as part of this process, but can you recall why? Some reasons are listed below; your task is to select the one that best describes why randomization is useful when running a controlled trial.
- To reduce bias caused by assigning to groups based on specific characteristics.
- To ensure groups are comparable.
- To maximize the chance of the results being normally distributed.
- To reduce bias, ensure groups are comparable, and increase the chances that results will be representative of the target population.
- To prevent participants from knowing whether they are in the treatment or control group.
Great work! Randomization is a powerful tool as these benefits highlight! Now let’s learn one way of quantifying the relationship between variables—correlation.
2.4.3. Correlation
Identifying correlation between variables
You saw that adding a trendline to a scatter plot is a great way to get a sense of the strength and direction of a relationship between variables.
An app has been provided for you to display scatter plots with a trendline for data about cities. The data point represents a city, and its location in the plot is determined by the cost of a bottle of water and a monthly gym membership in that city.
Your task is to describe the Pearson correlation coefficient for the cost of a bottle of water (£) as the dependent variable and annual hours worked as the independent variable.
- Weak-to-moderate positive relationship
- Strong negative relationship
- Weak-to-moderate negative relationship
- Weak positive relationship
Perfect Pearson correlation coefficient spotting!
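The Pearson correlation coefficient behind a trendline can be computed by hand. The sketch below implements the standard formula (covariance divided by the product of the standard deviations) on made-up values. The city data is hypothetical and does not come from the app, so the resulting coefficient says nothing about the real dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances up to the same constant factor).
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / sqrt(sq_dev_x * sq_dev_y)

# Hypothetical values for six cities (NOT the data shown in the app).
hours_worked = [1500, 1600, 1700, 1800, 1900, 2000]
water_cost = [1.20, 0.90, 1.10, 0.70, 1.00, 0.60]

r = pearson_r(hours_worked, water_cost)
print(f"r = {r:.2f}")
```

The coefficient always lies between -1 and 1: the sign gives the direction of the relationship and the magnitude gives its strength.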
What can correlation tell you?
In the last exercise, you saw a weak-to-moderate negative relationship between the cost of a bottle of water and the number of hours worked annually in different cities. Here is the plot as a reminder:
Your task is to choose the statement that accurately describes what you can reasonably conclude from this visualization.
- An increase in the number of hours worked annually causes a decrease in the cost of a bottle of water.
- An increase in the number of hours worked annually is related to an increase in the cost of a bottle of water.
- A decrease in the cost of a bottle of water causes a decrease in the number of hours worked annually.
- An increase in the number of hours worked annually is related to a decrease in the cost of a bottle of water.
Perfect! It is reasonable to conclude that a relationship exists between the two variables. However, other factors might be influencing their values, so it is difficult to identify why this relationship occurs.
Confounding variables
You have been asked to perform an experiment to investigate the relationship between neighborhood residence and lung capacity. You will measure the lung capacity of thirty people from neighborhood A, which is located near a highway, and thirty people from neighborhood B, which is not near a highway. Both groups have similar smoking habits and a similar gender breakdown.
Which of the following could be a confounding variable in this experiment?
- Lung capacity
- Neighborhood
- Air pollution
- Smoking status
- Gender
Correct! You would expect there to be more air pollution in the neighborhood situated near the highway, which may cause lower lung capacity.
2.4.4. Interpreting hypothesis test results
Significance levels vs. p-values
You learned about two very important elements of hypothesis testing: significance levels (α) and p-values.
It is very common to get these two confused, especially when you are new to statistics!
In this exercise, you have been provided with statements and need to match the definitions provided based on whether they accurately describe α or p-values.
- Match the statements to the correct bucket.
Well done, it can be tricky to recognize the difference between these two terms!
Type I and type II errors
You are interested in whether there is a difference in average weight between adult men in Madrid, Spain, and Berlin, Germany.
Your null hypothesis is that there is no difference in weight between men in Madrid and Berlin, and your alternative hypothesis is that men in Berlin are heavier than those in Madrid. α has been set to 0.05, and the weights of 100 adult male residents from each city have been collected. Sampling with replacement has been performed to produce 10000 sample means of both cities. What conclusion can you draw based on the sample mean distributions?
- Reject the null hypothesis, men in Berlin are heavier than men in Madrid
- Accept the null hypothesis, there is no difference between men’s weight in Berlin and Madrid
- It is impossible to determine this from the plot
- Reject the null hypothesis, men in Madrid are heavier than men in Berlin
Correct! There is a substantial overlap between distributions, so you cannot reasonably conclude that a difference exists in average weight between men in each city. Great work on completing the course; let’s recap on how far you’ve come!
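The resampling step mentioned in this exercise (sampling with replacement to produce 10000 sample means per city) can be sketched as follows. The weights are simulated, not the collected data, and the helper names `bootstrap_means` and `middle_95` are hypothetical, so the printed intervals only illustrate the mechanics.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# HYPOTHETICAL weights in kilograms for 100 men per city (simulated data).
madrid = [rng.gauss(80, 10) for _ in range(100)]
berlin = [rng.gauss(81, 10) for _ in range(100)]

def bootstrap_means(sample, num_resamples, rng):
    """Means of resamples drawn with replacement, each as large as the sample."""
    size = len(sample)
    return [sum(rng.choices(sample, k=size)) / size for _ in range(num_resamples)]

def middle_95(values):
    """Approximate 2.5th and 97.5th percentiles of the values."""
    ordered = sorted(values)
    return ordered[int(0.025 * len(ordered))], ordered[int(0.975 * len(ordered))]

madrid_means = bootstrap_means(madrid, 10_000, rng)
berlin_means = bootstrap_means(berlin, 10_000, rng)

print("Madrid middle 95% of sample means:", middle_95(madrid_means))
print("Berlin middle 95% of sample means:", middle_95(berlin_means))
```

If the two ranges overlap substantially, the samples do not give strong evidence of a difference in average weight, which mirrors the visual judgment made from the plot.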