Conclusion

In conclusion, we find a negative correlation between educational level and the rate of violence against women: as the educational level increases, the violence rate decreases. However, this is not a strong correlation. We also cannot find a relationship between violence rate and GDP per capita.

We then constructed two models, a decision tree and a k-nearest neighbors model. The decision tree does not support a precise conclusion: according to our calculations, only about half of its predictions on the test data are correct. The regression model gives a similar proportion (an r-squared of about 0.55), but it is not a machine learning technique. The k-nearest neighbors model is a useful way to analyze our data set, but it does not give a clear conclusion either.

In this project, we tried to examine different data sources and different years, but we could only find violence data for 2014, so we cannot draw a definitive conclusion.

Decision Tree

We created a decision tree model to predict the rate of violence against women in a country from its education rate and GDP.

The input features for the decision tree are Education Rate and GDP.

The target is Women Violence Rate.

First of all, we formed 4 classes for the violence rates:

  • 70 or higher: High (value = 3)
  • Between 40 and 70: Upper-Intermediate (value = 2)
  • Between 20 and 40: Lower-Intermediate (value = 1)
  • Below 20: Low (value = 0)

 

The output of the decision tree is the predicted violence class of the country, given the input features.
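The class boundaries above can be written as a small Python function. This is only an illustrative sketch, not the project code: the function name `violence_class` is hypothetical, and it assumes that a rate of exactly 40 falls into Upper-Intermediate, which the list above does not state explicitly.

```python
def violence_class(rate):
    """Map a violence rate (in percent) to the class values 0-3 listed above."""
    if rate >= 70:
        return 3  # High
    if rate >= 40:
        return 2  # Upper-Intermediate (assumption: exactly 40 belongs here)
    if rate >= 20:
        return 1  # Lower-Intermediate
    return 0      # Low


# A few sample rates, including the boundary values
print([violence_class(r) for r in (7, 20, 40, 69.9, 70)])
```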

 

Modelling

We split the sample data for this analysis: the training data consists of about 66% of the sample and the test data of the remaining 33%. This split is very important for detecting over-fitting, to which decision trees are particularly susceptible.


First, we computed the score of the decision tree on our training data. As expected, the output was 1.0.

Because we evaluated the tree on the same data that we used to build it, an accuracy of 100% on the training data is expected.

Next, we tested our model on the test data (the last 33% of our main data).

The actual classes of the test data are:

[2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1, 0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]

The predicted classes of the test data are:

[2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 1, 0, 1, 1, 3, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]

Note that 0, 1, 2 and 3 represent the Low, Lower-Intermediate, Upper-Intermediate and High classes, respectively.

As can be seen, some predictions are wrong. When we calculate the score of the decision tree on the test data, the output is 0.48148148148148145.

This shows that the accuracy on the test data is about 48% (13 correct predictions out of 27), which is relatively low for a decision tree model. The model therefore does not work very well.
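The score reported above is the share of test countries whose predicted class equals the actual class. The short sketch below recomputes it from the two arrays listed above; the variable names are illustrative and this is not the project code.

```python
# Actual and predicted classes of the 27 test countries, copied from the text
actual = [2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1,
          0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]
predicted = [2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 1, 0,
             1, 1, 3, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]

# Accuracy = number of matching positions / number of test samples
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

print(f"{correct} of {len(actual)} correct, accuracy = {accuracy:.4f}")
```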

 

However, in terms of results, this method is similar to the regression model, whose r-squared value (about 0.55) is of a similar size, although accuracy and r-squared are not directly comparable measures.

 

Why does our model not work well?

  • Firstly, the model may be too specific: it is built from the training data, and its details may fit only that data well. In other words, it may be over-fitting. We therefore tried reducing the number of layers in the tree.

 

Let’s set the maximum depth of the tree to 5.

This decreases the score on the training data, because the tree can no longer capture every detail. However, the important thing is to obtain a model that generalizes, so we want a good score on both the training data and the test data.
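One way to compare an unrestricted tree with a depth-limited one is sketched below, assuming scikit-learn is used (the text above does not name the library). The data is randomly generated as a stand-in for the real dataset, so the printed scores are not the project's results; only the 79-row size, the ordered 66/33 split and `max_depth=5` follow the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the project's dataset): 79 rows, two features
rng = np.random.default_rng(0)
education = rng.uniform(10, 100, size=79)
gdp = rng.uniform(500, 50000, size=79)
X = np.column_stack([education, gdp])

# Hypothetical violence rates loosely tied to education, binned into classes 0..3
violence = np.clip(75 - 0.6 * education + rng.normal(0, 15, size=79), 0, 100)
y = np.digitize(violence, [20, 40, 70])

# Ordered split: the last 33% of the rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, shuffle=False
)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

for name, tree in [("unrestricted", full_tree), ("max_depth=5", shallow_tree)]:
    print(name,
          "depth:", tree.get_depth(),
          "train:", round(tree.score(X_train, y_train), 3),
          "test:", round(tree.score(X_test, y_test), 3))
```

Limiting the depth trades training accuracy for a simpler tree; whether the test score improves depends on the data.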

 

First, we computed the score of the decision tree on our training data. The output was 0.8653846153846154.

The actual classes of the test data are:

[2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1, 0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]

The predicted classes of the test data are:

[2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]

Note that 0, 1, 2 and 3 represent the Low, Lower-Intermediate, Upper-Intermediate and High classes, respectively.

As can be seen, some predictions are still wrong. When we calculate the score of our new decision tree on the test data, the output is 0.5185185185185185.

This shows that the accuracy on the test data is about 52%. This is slightly better than the previous score (one more correct prediction out of 27), but it is still relatively low for a decision tree model, so this tree does not work well either.
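A confusion matrix shows where the depth-limited tree goes wrong, not only how often. The sketch below builds one from the actual and predicted classes listed above; it is an illustration added for this write-up (the names are hypothetical), not part of the project code.

```python
class_names = ["Low", "Lower-Intermediate", "Upper-Intermediate", "High"]

# Actual classes and depth-5 predictions, copied from the text
actual = [2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1,
          0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]
predicted_depth5 = [2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 0, 0,
                    1, 1, 1, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]

# matrix[a][p] counts countries with actual class a that were predicted as p
matrix = [[0] * 4 for _ in range(4)]
for a, p in zip(actual, predicted_depth5):
    matrix[a][p] += 1

for name, row in zip(class_names, matrix):
    print(f"{name:<20}", row)

# The diagonal holds the correct predictions
print("correct:", sum(matrix[i][i] for i in range(4)))
```

In this matrix, 7 of the 13 mistakes are confusions between the Low and Lower-Intermediate classes.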

  • Secondly, the input features may not determine the violence classes very well: there may be no close relationship between these features and the violence rates.

 

Conclusion:

This ML model is not really suitable for our dataset. A 52 percent rate of correct predictions is not enough for an effective technique, although the model does predict correctly some of the time. One reason might be GDP, since we found that it is not correlated with the women violence rate. The decision tree therefore also suggests that the women violence rate and the features we chose (Education Rate and GDP) do not have a strong relationship.

 

Our Python Code:

Github Link

K-Nearest Neighbors Classification

This algorithm assigns a new data point to a class by looking at the k closest points whose classes are already known and taking the most common class among them. Here we split our dataset into 2 groups according to education rate and violence rate, and we set k to 3.
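The idea can be sketched in a few lines of plain Python. The points below are made up for illustration (they are not the project's data), the function name `knn_predict` is hypothetical, and Euclidean distance is an assumption, since the text does not say which distance measure was used.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, new_point, k=3):
    """Return the majority label among the k training points nearest to new_point."""
    distances = sorted(
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical (education rate, violence rate) pairs
points = [(80, 10), (75, 15), (90, 5), (30, 60), (25, 70), (40, 55)]
labels = [0, 0, 0, 1, 1, 1]  # 0 = red class, 1 = blue class

print(knn_predict(points, labels, (70, 20)))  # new point near the red group
print(knn_predict(points, labels, (30, 65)))  # new point near the blue group
```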

The visualization of the classification:

X coordinate: Education Rate

Y coordinate: Violence Rate

Red Triangles: First Class

Blue Squares: Second Class

Green Circle: New data point (to be classified)

[Figure: classification plot with the new data point]

 

Predicted class is:  [0.] Red

As we can see above, the K-Nearest Neighbors algorithm assigns the new data point to the first class, which is red.

The model also works when there is more than one new data point.

[Figure: classification plot with five new data points]

Predicted 5 new classes are (0 for Red, 1 for Blue): [0 0 0 1 1]

It can be seen that if we know the education rate and violence rate of a data point, this method can determine which group it belongs to.

The characteristics of the red class are:

High or intermediate educational levels and low violence rate

Low educational levels and high or intermediate violence rate

 

Our Python Code:

Github Link

Linear Regression

We use a linear regression model to analyze the relationship between the rate of violence against women and the input features, Education Rate and GDP per capita.

Our features:

Education Rate

GDP per capita

Our target data:

Women Violence Rate

Modelling

To avoid over-fitting, we split our data into 2 groups: the Training Data group, which is 66% of the total data, and the Test Data group, which is 33% of the total data.

Our sample data contains 79 countries, so the Train Data set consists of 52 countries and the Test Data set of 27 countries. The 52 countries in the Train Data set are used to fit the regression model.
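The set sizes follow from the split fraction if the test size is rounded up, as sketched below. Rounding up is an assumption made for this illustration (it is how scikit-learn's `train_test_split` treats a fractional test size); the text does not state how the rounding was done.

```python
import math

n_countries = 79
test_fraction = 0.33

# Assumption: the test size is rounded up and the rest goes to training
n_test = math.ceil(test_fraction * n_countries)
n_train = n_countries - n_test

print(n_train, n_test)
print(round(n_train / n_countries, 3), round(n_test / n_countries, 3))
```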

The visualization of the training data can be seen below:

 

[Figure: visualization of the training data]

First of all, we look at the relationship between education level and violence against women. For this we use the least squares line equation.

 

Least squares line equation:        y = β0 + β1 * x

  • x is the feature (here this is education rate)
  • y is the response (here this is rate of woman violence)
  • β0 is the intercept
  • β1 is the coefficient for x

 

To determine the model coefficients (β0 and β1) for the violence rate and education rate data, we use the statsmodels module in Python.

Output:

Intercept        48.997949

EducationRate    -0.445282

dtype: float64

That is:

β0 = 48.997949
β1 = -0.445282

When we put these values into the equation, we get:

y = 48.997949 + (-0.445282) * x
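For a single feature, the least squares coefficients have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) minus the slope times mean(x). The sketch below applies these formulas to a few made-up points. The function name `fit_line` and the data are hypothetical, and this is not the statsmodels call that produced the output above.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (education rate, violence rate) values for illustration only
education = [20, 40, 60, 80, 100]
violence = [45, 28, 25, 12, 5]

b0, b1 = fit_line(education, violence)
print(f"y = {b0:.3f} + ({b1:.3f}) * x")
```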

After that we visualize the least squares line on the training data: 

[Figure: least squares line on the training data]

From the graph above, we can see that our data points do not lie very close to the line, since several of them deviate from it, but there does appear to be a correlation between EducationRate and ViolenceRate. Looking at the least squares line alone, we cannot come to a strong conclusion, but it gives us some clues about the relationship between education rate and violence rate. So let’s predict the violence rate of a country from its education rate.

Prediction: 

Country   Education Rate   GDPpercapita   Violence Rate
Italy     73.4             4855.800481    7

 

Here we take one country from our sample together with its values. We put its education rate into the least squares line equation manually to make a prediction.

 

y = 48.997949 + (-0.445282)* 73.4

y = 16.3142502

 

Italy’s actual violence rate is 7, but the predicted value is 16.3142502, more than twice the actual rate. For this country, the prediction is therefore not accurate.
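The same calculation can be written as a short script that also reports the residual, that is, the observed value minus the predicted value. The coefficients and the Italy values are the ones given above; the variable names are illustrative.

```python
intercept = 48.997949   # β0 from the fitted model
slope = -0.445282       # β1 from the fitted model

education_rate_italy = 73.4
observed_violence_rate = 7

predicted = intercept + slope * education_rate_italy
residual = observed_violence_rate - predicted

print(f"predicted = {predicted:.4f}, residual = {residual:.4f}")
```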

r-squared value of the model:

0.5468778921824813

The score is nearly 55%, which means that the model explains a little over half of the variation in violence rate in the training data. This is only a moderate fit, so we cannot draw a clear conclusion about the relationship between education rate and violence rate.
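For reference, r-squared is commonly defined as follows. The symbols are introduced here for illustration: y_i is the observed violence rate of country i, \hat{y}_i is the value predicted by the least squares line, and \bar{y} is the mean of the observed rates.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

A value of about 0.55 therefore means that the residual variation is about 45% of the total variation in the observed rates.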

Now we test the hypothesis.

Hypothesis Testing:

Null hypothesis: There is no relationship between violence against women, welfare level and their education level.

Alternative hypothesis: There is a relationship between women’s education level, welfare level and violence against them.

Now, we calculate the confidence intervals and p-values:

 

Confidence Intervals:

 

                       0          1
Intercept      41.462175  56.533724
EducationRate  -0.560415  -0.330150

 

p-values:

Intercept        9.770729e-18

EducationRate    3.802296e-10

dtype: float64

At the 95% confidence level, the p-values should be compared with 0.05.

Our p-value for EducationRate is 3.802296e-10, so it is nearly 0.

Since this is below 0.05, we reject the null hypothesis, which means that educational level and the women violence rate are related.
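The confidence intervals and the coefficients reported above can be cross-checked with a few lines of arithmetic: each interval should be centered on its coefficient, and a 95% interval that excludes zero corresponds to a p-value below 0.05. The sketch below only reuses the numbers printed above; the variable names are illustrative.

```python
# Coefficients and 95% confidence intervals copied from the outputs above
coefficients = {"Intercept": 48.997949, "EducationRate": -0.445282}
intervals = {
    "Intercept": (41.462175, 56.533724),
    "EducationRate": (-0.560415, -0.330150),
}

for name, (low, high) in intervals.items():
    midpoint = (low + high) / 2
    excludes_zero = low > 0 or high < 0
    print(f"{name}: midpoint = {midpoint:.6f}, "
          f"coefficient = {coefficients[name]}, excludes zero = {excludes_zero}")
```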

After this step, we analyze all features together via Multiple Linear Regression.

 

The Multiple Linear Regression equation is:

Y = β0 + ( β1 * Education Rate) + (β2 * GDPpercapita)

Again, to determine the model coefficients, we use the statsmodels module in Python:

Output:

Intercept        48.803631

EducationRate    -0.459610

GDPpercapita      0.000072

dtype: float64

 

β0 = 48.803631
β1 = -0.459610
β2 = 0.000072
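As an illustration of how these coefficients are used, the sketch below plugs the Italy row from the earlier prediction example (Education Rate 73.4, GDPpercapita 4855.800481) into the multiple regression equation. The function name `predict_violence_rate` is hypothetical.

```python
def predict_violence_rate(education_rate, gdp_per_capita):
    """Evaluate the fitted multiple regression equation for one country."""
    b0 = 48.803631   # intercept
    b1 = -0.459610   # coefficient of Education Rate
    b2 = 0.000072    # coefficient of GDPpercapita
    return b0 + b1 * education_rate + b2 * gdp_per_capita


italy = predict_violence_rate(73.4, 4855.800481)
gdp_term = 0.000072 * 4855.800481

print(f"prediction = {italy:.2f}, GDP contribution = {gdp_term:.2f}")
```

With these inputs the prediction is roughly 15.4, still well above Italy's observed rate of 7, and the GDP term contributes only about 0.35 of it.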

Our Regression Results:

[Figure: regression results summary]

In conclusion:

From the regression results, we see that the p-value of GDPpercapita is higher than 0.05, while the p-value of EducationRate is lower than 0.05. We therefore cannot reject the null hypothesis with respect to GDPpercapita: we find no significant relationship between GDPpercapita and ViolenceRate. Education rate, in contrast, is negatively related to violence rate. The r-squared value for Education and Women Violence alone is 0.5468778921824813, while the r-squared value with multiple features is 0.5523799400521094. This is a small increase, and since r-squared on the training data cannot decrease when a feature is added, it is not enough to say anything certain.

 

Our Python Code:

Github Link

Project Step2: Hypothesis Testing

As mentioned in the proposal, we believe that the level of education given to women affects the rate of violence against women. To investigate this relationship, we need hypothesis testing.

Null hypothesis: There is no relationship between violence against women, welfare level and their education level.

Alternative hypothesis: There is a relationship between women’s education level, welfare level and violence against them.

 

               EducationRate  GDPpercapita  ViolenceRate
EducationRate       1.000000      0.204038     -0.737920
GDPpercapita        0.204038      1.000000     -0.009995
ViolenceRate       -0.737920     -0.009995      1.000000

 

 

[Figure: color-coded correlation graph of EducationRate, GDPpercapita and ViolenceRate]

 

According to this graph, in which the colors show the strength of each correlation, there is a strong negative correlation between violence rate and education rate (about -0.74): when the education level increases, the rate of violent acts decreases. When we look at GDP per capita and violence rate, however, there is no clear correlation between them (about -0.01).
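A correlation coefficient of this kind can be computed as in the sketch below, assuming the Pearson coefficient (the usual default; the text does not name the method). The data points are made up for illustration and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, not the project's data
education = [20, 40, 60, 80, 100]
violence = [45, 28, 25, 12, 5]
gdp = [3000, 900, 5000, 1200, 3100]

print("education vs violence:", round(pearson(education, violence), 3))
print("gdp vs violence:", round(pearson(gdp, violence), 3))
```

Values close to -1 or 1 indicate a strong linear relationship, and values close to 0 indicate little or none.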

In conclusion, this result supports our hypothesis that violence against women is correlated with women’s level of education. However, we cannot say anything about the relationship between the welfare level of people and the tendency toward violence against women. As a result, we reject the null hypothesis with respect to education.

Our Python Code:

Github Link

CS 210 Project Proposal

Our motivation in this project is to find the relationship between the education level of individuals, economic conditions, and violence against women. To do so, we examine data sources in the UN database, which include the level of education for each gender separately for most countries, and data on the percentage of women who report suffering violence from a husband or partner. We also have data from the World Bank that shows the economic situation of the countries we focus on in terms of gross national product per capita. We believe that less educated men tend to use violence against their partners more than well-educated men do. Also, in our opinion, more educated women tend to be in a stronger position relative to their partners than less educated women, which discourages men from using violence against them. On the other hand, we believe that the stress created by financial difficulties and low wages, the fear of unemployment, and harsh living conditions increase the rate of violence. To support the hypothesis described above, we try to find correlations in our data.

Data Sources:

Education levels and expected years of schooling for both genders:

Income levels of countries:

Rate of violence:

Data about women:

Group Members:

Buse Çarık – 20691

Mücahit Umut Onat – 20452

Yavuz Selim Karavelioğlu – 19442