Standard deviation is a statistical measure that shows how far, on average, the values of a studied feature deviate from the mean. We use it to determine whether the units in a sample or population are similar with respect to that feature, or whether they differ significantly from each other. If you want to learn how to find standard deviation in R, or just learn what standard deviation is, read on.
This guide will offer a detailed explanation of calculating standard deviation in R, covering various methods and practical examples to assist users in analyzing data efficiently.
Standard deviation measures the average variation of individual values of a statistical feature around the arithmetic mean. Because it is expressed in the same units as the data, it has a more intuitive interpretation than the variance as a measure of the variability of a distribution. The deviations are squared before averaging because the raw distances from the mean always sum to 0, which would say nothing about the spread.
The mathematical formula of standard deviation is:
$$\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Where $\sum$ represents the sum, $x_i$ is each observation, $\mu$ is the mean of the data, and $N$ is the total number of observations. The standard deviation is usually abbreviated as SD.
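As a rough illustration of the formula above, the sketch below applies it term by term to a small made-up vector (the values and the name `x` are assumptions, chosen so the arithmetic comes out round). It also shows why the deviations are squared: left as they are, they cancel out.

```r
# Hypothetical data, chosen so that the mean is a whole number
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mu <- mean(x)              # arithmetic mean: 5
deviations <- x - mu       # distance of each observation from the mean

sum(deviations)            # 0: positive and negative deviations cancel
sum(deviations^2)          # 32: squaring keeps every term non-negative

# Population SD: square root of the mean squared deviation
pop_sd <- sqrt(sum(deviations^2) / length(x))
pop_sd                     # 2
```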
The smaller the standard deviation, the closer the values are to the average, which shows that the data is more consistent. To properly judge whether the SD is small or large, it's important to know the range of the scale being used.
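One way to see why the scale matters: the SD is expressed in the units of the data, so rescaling the data rescales the SD by the same factor. The scores below are made up for illustration.

```r
# Hypothetical test scores on a 0-100 scale
scores_100 <- c(62, 70, 75, 81, 88)

# The same scores expressed on a 0-1 scale
scores_1 <- scores_100 / 100

sd_100 <- sd(scores_100)
sd_1 <- sd(scores_1)

# Both SDs describe the same spread; they differ only by the factor of 100
all.equal(sd_100, sd_1 * 100)
# [1] TRUE
```

An SD of a few points is small on a 100-point scale, but the same number would be enormous on a 0-1 scale.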
The standard deviation is very helpful when comparing the variability of two data sets of similar size and average. The simple average alone often does not support deeper analysis. What good is knowing the average salary in a company if we do not know how much the salaries vary? Do all employees get exactly the same? Or maybe the manager is overstating the average salary? To dig deeper and get to the underlying truth, we have to calculate the standard deviation.
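To make the salary question concrete, here is a small sketch with invented figures for two hypothetical companies. Both pay the same on average, but the spread tells a very different story.

```r
# Invented monthly salaries; both vectors have the same mean
company_a <- c(3900, 4000, 4000, 4100, 4000)   # everyone earns about the same
company_b <- c(2500, 2500, 2500, 2500, 10000)  # one high earner lifts the mean

mean(company_a)  # 4000
mean(company_b)  # 4000

sd_a <- sd(company_a)
sd_b <- sd(company_b)

# A much larger SD signals that the mean hides large pay differences
sd_b > sd_a
# [1] TRUE
```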
Similarly, standard deviation helps assess risk when making investment decisions. If one company on the stock exchange brought an average annual profit of 4% and another an average annual profit of 5%, it does not follow that the second company is automatically the better choice.
Setting aside both fundamental and technical analysis of a specific company, as well as the broader macroeconomic conditions, it's valuable to focus on the fluctuations in the quotations themselves.
If the stock value of the first company fluctuated by only a few percent during the year while the other fluctuated by several dozen percent, then the investment in the first company was clearly much less risky. To compare different rates of return and check their riskiness, you can use the standard deviation.
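The comparison could be sketched as follows. The annual returns are invented and the two series are placeholders for the two companies; the point is only that a slightly higher average return can come with far larger swings.

```r
# Invented annual returns (in percent) over five years
returns_first  <- c(3, 5, 4, 2, 6)          # fluctuates by a few percent
returns_second <- c(-25, 40, -10, 35, -15)  # fluctuates by dozens of percent

mean(returns_first)   # 4
mean(returns_second)  # 5

risk_first  <- sd(returns_first)
risk_second <- sd(returns_second)

# Higher SD = larger swings around the average return = more risk
c(first = risk_first, second = risk_second)
```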
To perform any kind of analysis, we first need some data. In R, you can enter data manually by defining a vector or import it from an external source, such as an Excel or CSV file. Let’s create a vector with six values:
data <- c(4, 8, 6, 5, 3, 7)
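Alternatively, datasets can be imported using the read.csv() function, which loads data from a CSV file into R as a data frame. Here's an example of importing data from CSV and Excel files:
# Read a CSV file into a data frame
# (stored under its own name so the vector 'data' above is not overwritten)
data_csv <- read.csv("datafile.csv")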
# Install the 'readxl' package
install.packages("readxl")
# Load the library
library(readxl)
# Read an Excel file into a data frame
data_excel <- read_excel("datafile.xlsx", sheet = 1)
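Since `datafile.csv` above is only a placeholder, the sketch below writes a small temporary CSV file first and then reads it back, so it can be run as is. The column name `value` and the file contents are assumptions made for this example.

```r
# Write a small example file to a temporary location
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = c(4, 8, 6, 5, 3, 7)), csv_path, row.names = FALSE)

# Read it back into a data frame
imported <- read.csv(csv_path)
str(imported)

# read.csv() returns a data frame, so pick the column before calling sd()
imported_sd <- sd(imported$value)
imported_sd

# Clean up the temporary file
unlink(csv_path)
```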
A quick and easy way to find the standard deviation of a sample is the sd() function, which is one of the built-in functions in R. It takes a data sample, usually in the form of a numeric vector, as input and returns the sample standard deviation. For example, to measure the SD of the vector created earlier:
sd(data)
Output:
[1] 1.870829
If your sample has missing values (NA), sd() returns NA. To exclude them, set the parameter na.rm = TRUE in the sd() function and the missing values will not be included in the calculation:
standard_deviation <- sd(data, na.rm = TRUE)
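A short sketch of what `na.rm` changes, using a made-up vector with one missing value:

```r
# Hypothetical sample with one missing value
data_with_na <- c(4, 8, 6, NA, 3, 7)

# Without na.rm, a single NA makes the result NA
sd(data_with_na)
# [1] NA

# With na.rm = TRUE, the NA is dropped before the calculation
sd_clean <- sd(data_with_na, na.rm = TRUE)

# Same result as computing the SD on the non-missing values only
all.equal(sd_clean, sd(c(4, 8, 6, 3, 7)))
# [1] TRUE
```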
Note that sd() computes the sample standard deviation, which divides by n − 1. To calculate the population standard deviation, which divides by N as in the formula above, we first find the mean, subtract it from each observation in the dataset, and square the results. Once we have the squared differences, we take their average to get the variance. Finally, taking the square root of the variance gives us the population SD.
Here is the R code to manually compute population standard deviation:
mean_data <- mean(data)
squared_differences <- (data - mean_data)^2
mean_squared_diff <- mean(squared_differences)
standard_deviation_manual <- sqrt(mean_squared_diff)
print(standard_deviation_manual)
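The relationship between this manual population value and the built-in `sd()` can be checked directly. Below is a minimal sketch that wraps the steps in a helper; the name `population_sd` is made up for this example.

```r
data <- c(4, 8, 6, 5, 3, 7)

# Hypothetical helper: population SD (divides by N instead of n - 1)
population_sd <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

population_sd(data)
# [1] 1.707825

# The same value can be obtained by rescaling the sample SD from sd()
n <- length(data)
sd(data) * sqrt((n - 1) / n)
# [1] 1.707825
```

The gap between the two versions shrinks as the number of observations grows, since (n − 1) / n approaches 1.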
Let's say you are analyzing the grades of students across different subjects in a school. The categorical variable here is “subject,” and you want to know not only the average grade for each subject but also the variation in grades. This will help you understand whether grades in certain subjects are widely spread or fairly uniform.
To determine the standard deviation for each category in a dataset containing a categorical variable, you can use the dplyr package. The group_by() function splits the data by the categorical variable, and summarise() then calculates the SD for each distinct group.
Before moving on to the calculation, install the dplyr package if it is not already installed:
install.packages("dplyr")
Following our earlier example, let’s take a dataset which contains grades of students across different subjects:
library(dplyr)
# Example data frame with subjects and grades
data <- data.frame(
  Subject = c('Math', 'Math', 'Math', 'History', 'History', 'History'),
  grade = c(85, 90, 78, 88, 92, 85)
)
# Calculate standard deviation for each subject
grouped_sd <- data %>%
  group_by(Subject) %>%
  summarise(Standard_Deviation = sd(grade))
print(grouped_sd)
Output:
# A tibble: 2 × 2
  Subject Standard_Deviation
  <chr>                <dbl>
1 History           3.511885
2 Math              6.027714
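If installing dplyr is not an option, a similar per-group result can be obtained with base R. This is an alternative technique, not the approach used above; the sketch rebuilds the same example data under the assumed name `grades_df`.

```r
grades_df <- data.frame(
  Subject = c("Math", "Math", "Math", "History", "History", "History"),
  grade = c(85, 90, 78, 88, 92, 85)
)

# tapply() splits grade by Subject and applies sd() to each piece
sd_by_subject <- tapply(grades_df$grade, grades_df$Subject, sd)
sd_by_subject

# aggregate() gives the same numbers as a data frame
sd_table <- aggregate(grade ~ Subject, data = grades_df, FUN = sd)
sd_table
```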
In R, there are several ways to find the column-wise standard deviation. For a data frame whose columns are all numeric, you can use apply() with the sd() function. A more readable alternative is summarise() combined with across() from the dplyr package; the older summarise_all() still works but has been superseded by across().
Example using apply():
data_frame <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6))
apply(data_frame, 2, sd)
Example using dplyr:
library(dplyr)
data_frame %>%
summarise(across(everything(), sd))
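One caveat worth sketching: `apply()` converts the data frame to a matrix first, so a single text column turns every value into a character string before `sd()` sees it. A possible workaround, assuming a made-up data frame with an extra `label` column, is to select the numeric columns first:

```r
# Hypothetical data frame mixing numeric and text columns
mixed_df <- data.frame(
  A = c(1, 2, 3),
  B = c(4, 5, 6),
  label = c("x", "y", "z")
)

# Keep only the numeric columns, then apply sd() to each one
numeric_cols <- mixed_df[sapply(mixed_df, is.numeric)]
col_sds <- sapply(numeric_cols, sd)
col_sds
# A B
# 1 1
```

With dplyr, `summarise(across(where(is.numeric), sd))` expresses the same idea.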
Now imagine that you manage a sports league in which some teams have 5 players while others have 50. If you calculate the SD of scores across the entire league and treat all teams equally, the 5-player teams contribute just as much to the calculation as the 50-player teams, even though they have far fewer players.
Such an analysis would be misleading. We therefore need a measure like the weighted standard deviation, which weights each team by its size so that teams with more players contribute proportionally more to the overall variability.
The formula for calculating the weighted standard deviation is as follows:
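$$\mathrm{SD}_w = \sqrt{\frac{\sum_i w_i (x_i - \mu_w)^2}{\sum_i w_i}}$$

Where:
$w_i$ represents the weight for each data point,
$x_i$ denotes each data point,
$\mu_w$ is the weighted mean, calculated as:

$$\mu_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$$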
Though base R does not have a built-in function for measuring weighted standard deviation, it can be computed manually.
Let's say we have test grades data with corresponding weights, and we want to measure the weighted standard deviation:
# Example data with grades and weights
grades <- c(85, 90, 78, 88, 92, 85)
weights <- c(0.2, 0.3, 0.1, 0.15, 0.1, 0.15)
# Calculate the weighted mean
weighted_mean <- sum(grades * weights) / sum(weights)
# Calculate the squared differences from the weighted mean
squared_differences <- (grades - weighted_mean)^2
# Calculate the weighted variance
weighted_variance <- sum(weights * squared_differences) / sum(weights)
# Calculate the weighted standard deviation
weighted_sd <- sqrt(weighted_variance)
print(weighted_sd)
Output:
[1] 3.853245
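The same steps can be wrapped in a small reusable function. This is only a sketch: the function name `weighted_sd` and the input checks are assumptions for illustration, and it implements the formula shown above (dividing by the sum of the weights), not any bias-corrected variant.

```r
# Hypothetical helper implementing the weighted SD formula above
weighted_sd <- function(x, w) {
  stopifnot(length(x) == length(w), all(w >= 0), sum(w) > 0)
  mu_w <- weighted.mean(x, w)           # weighted mean: sum(w * x) / sum(w)
  sqrt(sum(w * (x - mu_w)^2) / sum(w))  # weighted variance, then square root
}

grades <- c(85, 90, 78, 88, 92, 85)
weights <- c(0.2, 0.3, 0.1, 0.15, 0.1, 0.15)

weighted_sd(grades, weights)
# [1] 3.853245

# Sanity check: with equal weights the result is the population SD
equal_w <- rep(1, length(grades))
all.equal(weighted_sd(grades, equal_w), sqrt(mean((grades - mean(grades))^2)))
# [1] TRUE
```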
Standard deviation is quite easy to calculate, despite those cruel sums and roots in the formula, and even easier to interpret. If you want to make friends with statistics or data science, then like it or not, you also have to make friends with standard deviation and learn how to measure it in R.