Linear Regression

We use a linear regression model to analyze the relationship between the rate of violence against women and two input features: education rate and GDP per capita.

Our features:

Education Rate

GDP per capita

Our target data:

Women Violence Rate

Modelling

To reduce the risk of overfitting, we split our data into two groups. The Training Data group holds about 66% of the total data, and the Test Data group holds the remaining third (about 34%).

Our sample data contains 79 countries, so the Train Data set is formed with 52 countries and the Test Data set with 27 countries. The 52 countries in the Train Data set are used to fit the linear regression model.
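A 52/27 split of 79 countries can be sketched as follows. This is only an illustration using the standard library: the country names are placeholders, and the function name, seed and shuffling strategy are assumptions rather than the code used for this analysis.

```python
import random


def split_train_test(items, train_fraction=0.66, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder names standing in for the 79 countries in the sample.
countries = [f"country_{i}" for i in range(79)]
train, test = split_train_test(countries)
print(len(train), len(test))  # 52 27
```

Only the training part is used to fit the model; the test part is held back to check how the model behaves on data it has not seen.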

The visualization of the training data can be seen below:

 

[Figure a: plot of the training data]

First, we look at the relationship between education level and violence against women. For this we use the least squares line equation.

 

Least squares line equation formula:        y = β0 + β1x

  • x is the feature (here, the education rate)
  • y is the response (here, the rate of violence against women)
  • β0 is the intercept
  • β1 is the coefficient for x
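To make the intercept and the coefficient concrete, here is a minimal plain-Python sketch of the closed-form least squares solution for one feature. It is not the statsmodels code used in this analysis, and the toy numbers are made up so that the result is easy to check by hand.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 2 + 3x.
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [5.0, 8.0, 11.0, 14.0]
toy_intercept, toy_slope = fit_least_squares(toy_x, toy_y)
print(toy_intercept, toy_slope)  # 2.0 3.0
```

On real data the points do not lie on a line, and the same formulas return the line that minimizes the sum of squared vertical distances to the points.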

 

To determine the model coefficients (β0 and β1) for the violence and education rate data, we use the statsmodels module in Python.

Output:

Intercept        48.997949

EducationRate    -0.445282

dtype: float64

This means:

β0 = 48.997949
β1 = -0.445282

When we put these values into the equation we get:

y = 48.997949 + (-0.445282) * x

After that we visualize the least squares line on the training data: 

[Figure b: least squares line on the training data]

From the graph above we can see that several points deviate noticeably from the line, but there still appears to be a negative correlation between EducationRate and ViolenceRate. Looking at the least squares line alone is not enough to reach a strong conclusion, but it gives us some clues about the relationship between education rate and violence rate. So let's predict the violence rate of a country from its education rate.

Prediction: 

Country  Education Rate  GDPpercapita  Violence Rate
Italy    73.4            4855.800481   7

 

Here we take one country from our sample together with its values. We plug these values into the least squares line equation by hand to make a prediction.

 

y = 48.997949 + (-0.445282) * 73.4

y = 16.3142502
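The same arithmetic can be written as a small function. The coefficients and the Italy values are the ones reported above; the function and variable names are only illustrative.

```python
BETA0 = 48.997949   # intercept from the fitted model
BETA1 = -0.445282   # coefficient for EducationRate


def predict_violence_rate(education_rate):
    """Evaluate the fitted line y = BETA0 + BETA1 * x."""
    return BETA0 + BETA1 * education_rate


italy_education_rate = 73.4
italy_actual = 7
italy_predicted = predict_violence_rate(italy_education_rate)
italy_residual = italy_actual - italy_predicted  # actual minus predicted

print(round(italy_predicted, 4))  # 16.3143
print(round(italy_residual, 4))   # -9.3143
```

The residual shows how far off the prediction is for this one country and in which direction.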

 

The actual violence rate for Italy is 7, but the prediction is 16.3142502, which is off by about 9.3 points. For this country, our prediction is not accurate.

r-squared value of the model:

0.5468778921824813

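The score is about 0.55, which means the model explains roughly 55% of the variance in violence rate on the training data. It does not mean that half of the predictions are correct. Education rate alone leaves a large part of the variation unexplained, so we cannot draw a firm conclusion about the relationship between education rate and violence rate from this value alone.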
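For reference, r-squared is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares, so it measures the share of variance in the response that the model accounts for. The sketch below computes it by hand on made-up numbers; it is not the statsmodels call that produced the value above.

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for paired actual and predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total variation
    return 1 - ss_res / ss_tot


# Made-up values, chosen only to keep the arithmetic simple.
toy_actual = [10.0, 20.0, 30.0, 40.0]
toy_predicted = [12.0, 18.0, 33.0, 37.0]
toy_score = r_squared(toy_actual, toy_predicted)
print(round(toy_score, 3))  # 0.948
```

A model that always predicts the mean of the response scores 0, and a model that predicts every value exactly scores 1.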

Now we test the hypothesis.

Hypothesis Testing:

Null hypothesis: There is no relationship between violence against women, welfare level and their education level.

Alternative hypothesis: There is a relationship between women’s education level, welfare level and violence against them.

Now, we calculate the confidence intervals and p-values:

 

Confidence Intervals:

 

               0          1
Intercept      41.462175  56.533724
EducationRate  -0.560415  -0.330150

 

p-values:

Intercept        9.770729e-18

EducationRate    3.802296e-10

dtype: float64

At the 95% confidence level, p-values are compared with 0.05.

Our p-value for EducationRate is 3.802296e-10, so it is nearly 0.

Since this is less than 0.05, we reject the null hypothesis, which means that education level and the rate of violence against women are related.
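The decision rule can be spelled out in a few lines. The p-values and confidence intervals are copied from the outputs above; the helper names are illustrative, and checking whether the interval excludes zero is an extra consistency check rather than a step taken from the original analysis.

```python
ALPHA = 0.05  # significance level that corresponds to a 95% confidence level

p_values = {"Intercept": 9.770729e-18, "EducationRate": 3.802296e-10}
conf_intervals = {
    "Intercept": (41.462175, 56.533724),
    "EducationRate": (-0.560415, -0.330150),
}


def is_significant(p_value, alpha=ALPHA):
    """Reject the null hypothesis when the p-value is below alpha."""
    return p_value < alpha


def interval_excludes_zero(interval):
    """True when the whole interval lies on one side of zero."""
    lower, upper = interval
    return lower > 0 or upper < 0


for name in p_values:
    print(name, is_significant(p_values[name]), interval_excludes_zero(conf_intervals[name]))
```

Both checks agree for EducationRate: the p-value is far below 0.05 and the whole interval is negative.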

After this step, we analyze all features together via Multiple Linear Regression.

 

The Multiple Linear Regression equation is:

Y = β0 + (β1 * Education Rate) + (β2 * GDPpercapita)

Again, to determine the model coefficients we use the statsmodels module in Python:

Output:

Intercept        48.803631

EducationRate    -0.459610

GDPpercapita      0.000072

dtype: float64

 

β0 = 48.803631
β1 = -0.459610
β2 = 0.000072
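As an illustration of how the two-feature equation is evaluated, the sketch below plugs in the Italy row that was used earlier for the single-feature prediction. The function name is hypothetical, and the resulting number is only arithmetic on the coefficients above, not a result reported in the original analysis.

```python
BETA0 = 48.803631   # intercept
BETA1 = -0.459610   # coefficient for EducationRate
BETA2 = 0.000072    # coefficient for GDPpercapita


def predict_multi(education_rate, gdp_per_capita):
    """Evaluate Y = BETA0 + BETA1 * EducationRate + BETA2 * GDPpercapita."""
    return BETA0 + BETA1 * education_rate + BETA2 * gdp_per_capita


italy_multi = predict_multi(73.4, 4855.800481)
print(round(italy_multi, 4))  # 15.4179
```

With these coefficients the GDPpercapita term contributes only about 0.35 to the prediction for Italy, so the result stays close to the single-feature prediction.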

Our Regression Results:

[Figure c: regression results summary]

In conclusion:

From the regression results we see that the p-value of GDPpercapita is higher than 0.05, while the p-value of EducationRate is lower than 0.05. We therefore cannot reject the null hypothesis for GDPpercapita: the data give no evidence of a relationship between GDPpercapita and ViolenceRate. Education rate is negatively related to violence rate.

The r-squared value for Education and Women Violence is 0.5468778921824813, and the r-squared value with Multiple Features and Women Violence is 0.5523799400521094. There is a small increase, but it is not enough to draw a firm conclusion.

 

Our Python Code:

Github Link
