I use the Cars dataset to apply multinomial logistic regression.
```r
# load the libraries that we need.
library(data.table)
library(ggplot2)
library(ggridges)
library(nnet)
library(stargazer)
library(rpart)
library(rpart.plot)
library(parttree)

# read and assign the data. Please adjust the directory where the data lives.
dt_cars <- fread(file = "cars.csv", sep = ";")

# have a quick look at the dataset.
head(dt_cars)
```
#create a "Drive" variable, to indicate the type of the drive. The command first creates the Drive variable and assign FWD to all the observations, since the data doesn't have a variable for FWD, unlike AWD and RWD. Then change the Drive variable for cars that have AWD or RWD, then the command changes the Drive variable to factor variable.dt_cars[, "Drive":="FWD"][AWD==1, "Drive":="AWD"][RWD==1, "Drive":="RWD"][, "Drive":=as.factor(Drive)]
\(\bullet\) The weight distribution of cars by drive type:
```r
# the weight distribution by drive type. Calling plot() with a factor on the
# x-axis generates a boxplot for each drive type.
plot(dt_cars$Drive, dt_cars$Weight,
     main = "Weight distribution of cars by drive type",
     xlab = "Drive", ylab = "Weight")
```
```r
# the same as before, but with colors added.
plot(dt_cars$Drive, dt_cars$Weight, col = factor(dt_cars$Drive),
     main = "Weight distribution of cars by drive type",
     xlab = "Drive", ylab = "Weight")
```
```r
# use ggplot to create a histogram of the weight distribution, with a unique
# outline color for each drive type.
ggplot(dt_cars, aes(x = Weight, color = Drive)) +
  geom_histogram(bins = 40)
```
```r
# the previous plot, split into a separate facet for each drive type.
ggplot(dt_cars, aes(x = Weight, color = Drive)) +
  geom_histogram(bins = 40) +
  facet_grid(~ Drive)
```
```r
# the previous plot as a ridgeline plot. Note that the Brewer scale has to
# be a fill scale here, to match the fill aesthetic ("qual" stands for
# qualitative).
ggplot(data = dt_cars, aes(x = Weight, y = Drive, fill = Drive)) +
  geom_density_ridges(bandwidth = 30) +
  scale_fill_brewer(type = "qual", palette = 1)
```
```r
# the previous ridgeline (density) plot, with the median weight marked for
# each drive type. quantile_lines = TRUE draws the quantile lines, and
# quantiles = 2 breaks the data into two quantiles, so the single line drawn
# is the median (the 50th percentile). Mapping vline_color to Drive in the
# aesthetics colors each vertical line by drive type.
ggplot(data = dt_cars, aes(x = Weight, y = Drive, vline_color = Drive)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2)
```
Picking joint bandwidth of 236
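To verify the medians drawn by the quantile lines, we can compute them directly (my addition, not in the original):

```r
# median weight by drive type; these should match the vertical lines above.
dt_cars[, .(median_weight = median(Weight, na.rm = TRUE)), by = Drive]
```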
\(\bullet\) Multinomial Logistic (MNL) Regression:
#make "RWD" the reference class. The "relevel" will reorder the Drive variable by the reference, RWD.dt_cars$type<-relevel(dt_cars$Drive,ref="RWD") #run multinomial logistic regression, using the new variable created before.model<-multinom(type~HP,data=dt_cars)
# weights: 9 (4 variable)
initial value 468.008835
iter 10 value 378.849870
final value 378.849859
converged
```r
# results
stargazer(model, digits = 4, type = "html")
```
Dependent variable: type (reference class: RWD)

|                   | AWD (1)                        | FWD (2)                        |
|-------------------|--------------------------------|--------------------------------|
| HP                | \(-0.0054^{***}\) \((0.0021)\) | \(-0.0194^{***}\) \((0.0023)\) |
| Constant          | \(1.1774^{**}\) \((0.5261)\)   | \(4.9953^{***}\) \((0.5279)\)  |
| Akaike Inf. Crit. | \(765.6997\)                   | \(765.6997\)                   |

Note: \(^{*}p<0.1\); \(^{**}p<0.05\); \(^{***}p<0.01\). Standard errors in parentheses.
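The significance stars come from stargazer; multinom() itself does not report p-values. If you want them directly, a common manual computation (my addition, not part of the original analysis) uses the Wald z-statistics:

```r
# z-statistics and two-sided p-values from the coefficient and
# standard-error matrices returned by summary().
s <- summary(model)
z <- s$coefficients / s$standard.errors
p_values <- 2 * pnorm(abs(z), lower.tail = FALSE)
p_values
```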
The coefficients are interpreted relative to the reference class. That is, for AWD, a one-unit increase in horsepower decreases the log-odds of AWD relative to RWD by \(0.0054\); in other words, as horsepower increases, the chances of the vehicle being RWD rise compared to AWD.
The denominator is the reference we specified before; that is, the coefficients are interpreted relative to the reference, RWD. The model is linear in the log-odds: \[\begin{equation}
\log\frac{\mathrm{P}(Drive=AWD \mid x)}{\mathrm{P}(Drive=RWD \mid x)} = 1.1774 - 0.0054x = x^{\prime}\beta^{AWD}
\end{equation}\]
Then, if we exponentiate each side, we have: \[\begin{equation}
\frac{\mathrm{P}(Drive=AWD \mid x)}{\mathrm{P}(Drive=RWD \mid x)} = e^{1.1774-0.0054x} = e^{x^{\prime}\beta^{AWD}}
\end{equation}\]
In R, we can exponentiate the intercept and the coefficients:

```r
exp(coef(model))
```
(Intercept) HP
AWD 3.24581 0.9946333
FWD 147.71311 0.9807867
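As a quick interpretation check (my addition), each exponentiated slope is an odds ratio per unit of horsepower; for AWD, \[e^{-0.0054} \approx 0.9946,\] so one extra unit of horsepower multiplies the odds of AWD relative to RWD by roughly \(0.995\), matching the first row above.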
Applying the same identity to FWD and adding the two ratios, we can write: \[\begin{equation}
\frac{\mathrm{P}(Drive=AWD \mid x)+\mathrm{P}(Drive=FWD \mid x)}{\mathrm{P}(Drive=RWD \mid x)} = e^{x^{\prime}\beta^{AWD}}+e^{x^{\prime}\beta^{FWD}}
\end{equation}\]
Since the class probabilities have to sum up to \(1\), that means: \[\begin{equation}
\mathrm{P}(Drive=AWD \mid x) + \mathrm{P}(Drive=FWD \mid x) + \mathrm{P}(Drive=RWD \mid x) = 1.
\end{equation}\]
Then, if we add \(1\) to both sides of the earlier equation and use this fact, we have: \[\begin{equation}
\frac{1}{\mathrm{P}(Drive=RWD \mid x)} = 1+e^{x^{\prime}\beta^{AWD}}+e^{x^{\prime}\beta^{FWD}}
\end{equation}\]
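For completeness (a standard rearrangement, spelled out here), solving for each class probability gives the familiar softmax form: \[\begin{equation}
\mathrm{P}(Drive=RWD \mid x) = \frac{1}{1+e^{x^{\prime}\beta^{AWD}}+e^{x^{\prime}\beta^{FWD}}}, \qquad
\mathrm{P}(Drive=AWD \mid x) = \frac{e^{x^{\prime}\beta^{AWD}}}{1+e^{x^{\prime}\beta^{AWD}}+e^{x^{\prime}\beta^{FWD}}},
\end{equation}\] and analogously for FWD. These are exactly the expressions used in the worked example below.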
The probabilities must sum to \(1\) (i.e., each row of fitted probabilities has to sum up to \(1\)). For logistic regression, the fitted values are the predicted class probabilities rather than expected values of a numeric outcome, so, alternatively, we can use the fitted values of the model to get the same probabilities for all observations.
Assume we want to predict the probabilities for each drive type when the horsepower equals \(200\).
The probability that the vehicle has the AWD drive type, conditional on horsepower equal to \(200\), is: \[\mathrm{P}(Drive=AWD \mid HP=200) = \frac{e^{1.1774-0.0054 \cdot 200}}{1+e^{1.1774-0.0054 \cdot 200}+e^{4.9953-0.0194 \cdot 200}} \approx 0.2139\]
We can calculate the probabilities for the other two classes in the same way: \[\mathrm{P}(Drive=FWD \mid HP=200) = \frac{e^{4.9953-0.0194 \cdot 200}}{1+e^{1.1774-0.0054 \cdot 200}+e^{4.9953-0.0194 \cdot 200}} \approx 0.592\] \[\mathrm{P}(Drive=RWD \mid HP=200) = \frac{1}{1+e^{1.1774-0.0054 \cdot 200}+e^{4.9953-0.0194 \cdot 200}} \approx 0.1941\]
The sum of the probabilities is: \(0.2139+0.592+0.1941=1\).
That means, given a vehicle with a horsepower of \(200\), the probabilities that it has AWD, FWD, or RWD are approximately \(21\%\), \(59\%\), and \(19\%\), respectively.
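We can verify the hand calculation with predict() and attach the fitted probabilities to the data for the next plot. This is a minimal sketch: the probability.AWD, probability.FWD, and probability.RWD columns used below are my assumption about how the original created them, and the sketch assumes HP has no missing values so that fitted() aligns row by row with dt_cars.

```r
# predicted class probabilities at HP = 200; these should match the hand
# calculation above: roughly 0.194 (RWD), 0.214 (AWD), and 0.592 (FWD).
predict(model, newdata = data.frame(HP = 200), type = "probs")

# attach one fitted-probability column per class for plotting.
probs <- fitted(model) # columns RWD, AWD, FWD; one row per observation.
dt_cars[, probability.RWD := probs[, "RWD"]]
dt_cars[, probability.AWD := probs[, "AWD"]]
dt_cars[, probability.FWD := probs[, "FWD"]]
```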
```r
# plot horsepower against the three class probabilities.
ggplot(dt_cars, aes(x = HP)) +
  geom_line(aes(y = probability.AWD, color = "AWD")) +
  geom_line(aes(y = probability.FWD, color = "FWD")) +
  geom_line(aes(y = probability.RWD, color = "RWD")) +
  labs(y = "Probability") +
  scale_color_manual(name = "Drive",
                     values = c(AWD = "red", FWD = "blue", RWD = "black")) +
  # placing the variable before the "~" facets by rows, so each panel spans
  # the full width of the grid.
  facet_grid(type ~ .)
```
\(\bullet\) Classification Tree:
```r
# classification tree for Drive on HP.
model_tree <- rpart(Drive ~ HP, data = dt_cars)

# plot the tree.
prp(model_tree)
```
```r
# plot HP on the x-axis and its density on the y-axis, with vertical lines
# marking the tree's splits; the regions between the splits are colored by
# the predicted drive type.
ggplot(dt_cars, aes(x = HP)) +
  geom_density() +
  geom_parttree(data = model_tree, aes(fill = Drive), alpha = 0.5) +
  labs(y = "Density")
```
\(\bullet\) Comparing the classification tree chart and the MNL chart:
In the classification tree chart, the first region (lower levels of horsepower) is green, representing FWD. That is consistent with the predicted probability in the MNL chart: at low levels of horsepower, the blue line in the MNL chart (FWD) is higher than those of the other drive types (i.e., a higher probability).
In the classification tree chart, the last region (higher levels of horsepower) is blue, representing RWD. That is consistent with the MNL chart as well: at higher levels of horsepower, the black line in the MNL chart (RWD) is higher than those of the other drive types (i.e., a higher probability).
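To make the comparison concrete, here is a small check (my addition; the horsepower grid is arbitrary) that asks both models for a predicted class:

```r
# predicted drive type from the MNL model and from the classification tree
# at a few horsepower values.
hp_grid <- data.frame(HP = c(100, 150, 200, 300, 400))
data.frame(hp_grid,
           MNL  = predict(model, newdata = hp_grid, type = "class"),
           Tree = predict(model_tree, newdata = hp_grid, type = "class"))
```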
\(\bullet\) Regression Tree:
```r
# regression tree for CityMPG on HP.
reg_tree <- rpart(CityMPG ~ HP, data = dt_cars)

# plot the tree.
rpart.plot(reg_tree)
```
The average CityMPG for the whole dataset is \(20\).
For horsepower greater than or equal to \(141\), the average CityMPG is \(19\), which represents \(85\%\) of the data, and so on.
As we go down the tree, the conditions become stricter, until it ends in five leaves, which together represent \(100\%\) of the data:
If \(HP \ge 273\), the average CityMPG is \(16\), and this represents \(21\%\) of the data.
If \(183 \le HP < 273\), the average CityMPG is \(19\), and this represents \(46\%\) of the data.
If \(141 \le HP < 183\), the average CityMPG is \(21\), and this represents \(18\%\) of the data.
If \(113 \le HP < 141\), the average CityMPG is \(27\), and this represents \(11\%\) of the data.
If \(HP < 113\), the average CityMPG is \(34\), and this represents \(4\%\) of the data.
We can see the split conditions using rpart.rules:

```r
rpart.rules(reg_tree)
```
CityMPG
16 when HP >= 273
19 when HP is 183 to 273
21 when HP is 141 to 183
27 when HP is 113 to 141
34 when HP < 113
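As a quick check (my addition), predicting from the tree returns the leaf average; from the rules above, a car with \(HP = 200\) falls in the \(183\) to \(273\) leaf:

```r
# should return the average CityMPG of the 183-273 leaf, i.e. about 19.
predict(reg_tree, newdata = data.frame(HP = 200))
```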
```r
# extract the average CityMPG of the leaves from the frame of the fitted
# "rpart" tree.
reg_tree$frame
```
```r
# From the frame, we can create another data frame that contains the "yval"
# of the leaves. "yval" is for y-values, i.e. the fitted values of CityMPG
# (the average CityMPG in each leaf); the numbers in the square brackets are
# the row and column indices, respectively.
City <- data.frame(CityMPG = c(reg_tree$frame[4, 5], reg_tree$frame[5, 5],
                               reg_tree$frame[6, 5], reg_tree$frame[8, 5],
                               reg_tree$frame[9, 5]))

# plot the data, the linear regression line, and horizontal lines that
# display the average CityMPG per leaf.
ggplot(dt_cars, aes(x = HP, y = CityMPG)) +
  geom_point() +
  geom_smooth(formula = y ~ x, method = "lm") + # "lm" is for linear model.
  # "alpha" is the intensity of the color: the higher, the more intense
  # (less transparent) the color will be.
  geom_hline(yintercept = City$CityMPG, color = "red", alpha = 0.3) +
  labs(y = "City MPG") +
  # display the leaf averages on the y-axis using "breaks": round the values
  # to the nearest integer, then sort a sequence running from the minimum to
  # the maximum of CityMPG ("by" is the step between breaks) together with
  # the average CityMPG per leaf.
  scale_y_continuous(breaks = round(sort(c(
    seq(min(dt_cars$CityMPG, na.rm = TRUE),
        max(dt_cars$CityMPG, na.rm = TRUE), by = 15),
    City$CityMPG))))
```