I wanted to see if the change of carbon dioxide emissions per capita, measured in metric tons, is significant in the long run. We expect it to be positive and distinguishable from zero. The dataset is from the World Bank, sourced from Climate Watch Historical GHG Emissions (1990-2020), and available here
First, we might calculate the change rate of the emissions for each country over the period 1960-2016.
#load the libraries we need.library(readxl)library(dplyr)library(data.table)library(EnvStats)library(DT)#import the dataset. I renamed the file as you can see.emission<-read_xls("carbondioxideemmision.xls")#now let's see the first small part of the dataemission[1:7,1:7] #alternatively, you can use head() function to review the first parts of the data.
# A tibble: 7 × 7
`Data Source` `World Development Indicators` ...3 ...4 ...5 ...6 ...7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Last Updated Date 44274 <NA> <NA> <NA> <NA> <NA>
2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 Country Name Country Code Indi… Indi… 1960 1961 1962
4 Aruba ABW CO2 … EN.A… 204.… 208.… 226.…
5 Afghanistan AFG CO2 … EN.A… 0.04… 0.05… 0.07…
6 Angola AGO CO2 … EN.A… 0.10… 0.08… 0.21…
7 Albania ALB CO2 … EN.A… 1.25… 1.37… 1.43…
#remove the first two rows, since they contain nothing.emission<-emission[-c(1:2),]#the first row contains the names of columns, yet the columns are not named properly. Change the columns names to be as the first rowcolnames(emission)<-emission[1,]#after we renamed the columns by the first row, we don't need the first row anymore, so let's delete it.emission<-emission[-c(1),]#Now let's see again:emission[1:5,]
#since we have "years" as variables (columns), we can transpose the data so that countries will be variables and years as rows. I'll tell you later why we did that.emission<-as.data.frame(t(emission))#again column names are not properly named, so we change them as before.colnames(emission)<-emission[1,]#the first four rows are no use for us, we can remove them. The first row contains country names, which we had already used to name our columns, the remaining rows contain country code, indicator name, and indicator code, respectively, and we'll not use any of them.emission<-emission[-c(1:4),]#now let's have a lookemission[1:7,1:7]
#now before playing with these numbers let's see if R recognize them as numbers. Let's take the first country as an example.class(emission$Aruba)
[1] "character"
#R doesn't recognize the values as numbers. So we can use lapply function to declare all of them as numbers to R. Now here comes why we transposed the data before. The idea is to not have a country column, so I'd declare all columns as numbers to R without being concerned about a character column (the country column) emission[]<-lapply(emission,function(x) as.numeric(as.character(x)))#then let's remove the countries that we don't have any observation about it.emission<-emission[, colSums(is.na(emission))<nrow(emission)]#the "year" is not recognized as a column by R yet. Let's declare it.setDT(emission,keep.rownames ="year")#let's have a quick lookemission[1:5,1:5]
#let's transform our data to long format so that we'd have the countries as a column altogether, resulting in a panel data.emission<-melt(emission, id.vars="year",variable.name ="country")head(emission)
year country value
<char> <fctr> <num>
1: 1960 Aruba 204.6204
2: 1961 Aruba 208.8228
3: 1962 Aruba 226.1181
4: 1963 Aruba 214.8004
5: 1964 Aruba 207.6158
6: 1965 Aruba 185.2040
#filter our data so that we would have only the observations for 1960 and 2016.emission_6016<-dplyr::filter(emission,year %in%c(1960,2016))#let's calculate the change rate.emission_6016<- emission_6016 %>%group_by(country) %>%mutate(change=(((value/lag(value))^(1/(2016-1960)))-1)*100)#let's have a look.head(emission_6016)
# A tibble: 6 × 4
# Groups: country [3]
year country value change
<chr> <fct> <dbl> <dbl>
1 1960 Aruba 205. NA
2 2016 Aruba 8.43 -5.54
3 1960 Afghanistan 0.0461 NA
4 2016 Afghanistan 0.245 3.03
5 1960 Angola 0.101 NA
6 2016 Angola 1.20 4.53
#we only need the "country" and its change rate, so we can remove the other two columns.emission_6016<-emission_6016[,-c(1,3)]#remove the missing values, since it's only one value for each country.emission_6016<-na.omit(emission_6016)#let's have another look.head(emission_6016)
# A tibble: 6 × 2
# Groups: country [6]
country change
<fct> <dbl>
1 Aruba -5.54
2 Afghanistan 3.03
3 Angola 4.53
4 Albania 0.404
5 Arab World 3.73
6 United Arab Emirates 9.77
In addition, let’s consider the average emissions per capita for each country over the same period, and the coefficient of variation as well.
#turning back to our original data, calculate the mean and the coefficient of variation by country and then merge it with the subset with the change rate that we created it before.emission<-inner_join(emission_6016,emission %>%group_by(country) %>%summarise(mean=mean(value,na.rm = T), cv=cv(value,na.rm=T)))
Joining with `by = join_by(country)`
#round our variables to the fourth digitemission$change<-round(emission$change,4)emission$mean<-round(emission$mean,4)emission$cv<-round(emission$cv,4)head(emission)
# A tibble: 6 × 4
# Groups: country [6]
country change mean cv
<fct> <dbl> <dbl> <dbl>
1 Aruba -5.54 105. 0.996
2 Afghanistan 3.03 0.148 0.603
3 Angola 4.53 0.652 0.558
4 Albania 0.404 1.65 0.393
5 Arab World 3.73 3.03 0.396
6 United Arab Emirates 9.77 31.0 0.681
#since the world bank includes regions among countries, I created a vector manually to contain all regions that are not individual countries, since including them might bias our calculations.regions<-c("Arab World","Central Europe and the Baltics","Caribbean small states","East Asia & Pacific (excluding high income)","Early-demographic dividend","East Asia & Pacific","Europe & Central Asia (excluding high income)","Europe & Central Asia","European Union","Fragile and conflict affected situations","High income","Heavily indebted poor countries (HIPC)","IBRD only", "IDA & IBRD total","IDA total","IDA blend","IDA only","Latin America & Caribbean (excluding high income)","Latin America & Caribbean","Least developed countries: UN classification","Low income","Lower middle income","Low & middle income","Late-demographic dividend","Middle East & North Africa","Middle income","Middle East & North Africa (excluding high income)","North America","OECD members","Pre-demographic dividend","Pacific island small states","Other small states","Post-demographic dividend","South Asia","Sub-Saharan Africa (excluding high income)","Sub-Saharan Africa","Small states","East Asia & Pacific (IDA & IBRD countries)","Europe & Central Asia (IDA & IBRD countries)","Latin America & the Caribbean (IDA & IBRD countries)","Middle East & North Africa (IDA & IBRD countries)","South Asia (IDA & IBRD)","Sub-Saharan Africa (IDA & IBRD countries)","Upper middle income","World")#quickly create a new subset to include these regions only, and then filter our data to include only the places that are not in the subset created. there are many ways to do that, I just chose what came to my mind. emission2<-filter(emission,country%in%regions)emission<-emission[!(emission$country %in% emission2$country),]#let's create a table to show our data.datatable(emission, colnames =c("country","percentage change of emissions per capita \n (1960-2016)","Average of emissions per capita (in metric tons) \n (1960-2016)","coefficent of variation"))
We can see which country has the highest and the lowest emission change in the long run.
>#country with the minimum emission change > emission[which.min(emission$change),]
# A tibble: 1 × 4
# Groups: country [1]
country change mean cv
<fct> <dbl> <dbl> <dbl>
1 Aruba -5.54 105. 0.996
>#country with the maximum emission change> emission[which.max(emission$change),]
# A tibble: 1 × 4
# Groups: country [1]
country change mean cv
<fct> <dbl> <dbl> <dbl>
1 United Arab Emirates 9.77 31.0 0.681
Probability distribution of the change rate of CO2 emissions per capita in the long run:
Let’s have a closer look to our variable distribution:
#plot the empirical probability density function.epdfPlot(emission$change,xlab=expression('Change of CO'[2] ~'emissions per capita in the long run'),main=expression('Empirical PDF of the percentage change of CO'[2]~'emissions per capita in the long run'),cex.main=0.95)abline(v=mean(emission$change),col="darkgray")
The dark gray vertical line represents the mean in our data, as you might expected from the code.
#plot of the empirical cumulative distribution function.ecdfPlot(emission$change, xlab =expression('Order Statistics for the change rate of CO'[2]~'emissions per capita'),main=expression('Empirical CDF of the percentage change of CO'[2]~'emissions per capita in the long run'),cex.main=0.95)
We can see that the probability that the emission change rate in the long run is less than or equal to zero is almost \(0\), supporting our hypothesis that the change rate is positive.
Simply, let \(X\) be the change of CO2 emissions per capita, then \(P(X \le 0) \approx 0\), but we didn’t check yet for its significance from zero, which comes later.
Now, let’s “boxplot” it!
boxplot(emission$change,horizontal=T,xlab=expression('Change of CO'[2] ~'emissions per capita in the long run'),varwidth=T)abline(v=0)
It doesn’t seem close to zero but it might not be significant, so let’s test if the change in the long run across countries is significant from zero.
#using t-test.t.test(emission$change)
One Sample t-test
data: emission$change
t = 14.803, df = 152, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
2.149774 2.811979
sample estimates:
mean of x
2.480876
The p-value is extremely small almost equal to \(0\), thus the change in the long run is significant from zero at all levels.
In other words, the carbon dioxide emissions per capita, in metric tons, across countries over the period \(1960-2016\) increased significantly by more than \(2\%\), on average.