We created a decision tree model to predict countries' woman violence rate from their education rate and GDP.
The input features for the decision tree are Education Rate and GDP.
The target is Woman Violence Rate.
First of all, we formed four classes for the violence rates:
- 70 or higher: High (value = 3)
- Between 40 and 70: Upper-Intermediate (value = 2)
- Between 20 and 40: Lower-Intermediate (value = 1)
- Below 20: Low (value = 0)

The output of the decision tree is the predicted violence class of a country given its input features.
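A minimal sketch of how such binning could be done, assuming a pandas DataFrame with a hypothetical violence_rate column (the real column names in our code may differ):

```python
import pandas as pd

def to_violence_class(rate):
    """Map a raw violence rate (0-100) to one of the four classes."""
    if rate >= 70:
        return 3  # High
    elif rate >= 40:
        return 2  # Upper-Intermediate
    elif rate >= 20:
        return 1  # Lower-Intermediate
    return 0      # Low

# Hypothetical usage with a small example DataFrame.
df = pd.DataFrame({"violence_rate": [85.0, 55.0, 30.0, 10.0]})
df["violence_class"] = df["violence_rate"].apply(to_violence_class)
print(df)
```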
Modelling
We use a train/test split in this analysis: the training set consists of 66% of our sample data and the test set of the remaining 33%. Holding out test data is important for detecting over-fitting, to which decision trees are especially susceptible.
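A minimal sketch of this split using scikit-learn, with toy stand-ins for our real feature matrix and labels (the actual variable names and data loading in our code differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real data: two features (Education Rate, GDP)
# and a class label in {0, 1, 2, 3} for each country.
rng = np.random.default_rng(0)
X = rng.uniform(low=[40.0, 1_000.0], high=[100.0, 60_000.0], size=(80, 2))
y = rng.integers(low=0, high=4, size=80)

# Hold out roughly a third of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)  # (53, 2) (27, 2)
```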

First, we scored the decision tree on our training data; as expected, the output was 1.0.
Since we evaluated on the same data the decision tree was built from, the accuracy on the training data is necessarily 100%.
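Continuing the sketch above, fitting an unconstrained tree and scoring it on its own training data would look roughly like this (variable names are ours, not necessarily those in the original code):

```python
from sklearn.tree import DecisionTreeClassifier

# Fit a fully grown (unconstrained) decision tree on the training split.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluating on the same data the tree was grown from gives (near) perfect accuracy.
print(clf.score(X_train, y_train))  # ~1.0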
Then we can test our model using the held-out test data (the last 33% of our main data).
The true classes of the test data, using the labels defined above, are:
[2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1, 0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]
The predicted classes of the test data are:
[2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 1, 0, 1, 1, 3, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]
Note that 0, 1, 2 and 3 represent the Low, Lower-Intermediate, Upper-Intermediate and High classes, respectively.
As can be seen, some predictions are wrong. When we calculate the score of the decision tree on the test data, the output is 0.48148148148148145.
This means the accuracy on the test data is around 48%, which is relatively low for a decision tree model. Therefore, it does not seem to work very well.
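Continuing the same sketch, the test-set predictions and accuracy can be obtained like this:

```python
from sklearn.metrics import accuracy_score

# Predicted classes for the held-out countries.
y_pred = clf.predict(X_test)
print(y_pred)

# Fraction of test points whose class was predicted correctly.
print(accuracy_score(y_test, y_pred))  # equivalent to clf.score(X_test, y_test)
```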
However, in terms of results, this method performs similarly to the regression model, which had a comparably low r-squared value.
Why does our model not work well?
- Firstly, the model may be too specific: it is built from the training data, so its fine-grained splits may fit only that particular data very well, meaning over-fitting may have emerged. Therefore, we tried to reduce the number of layers (the tree depth).
Let’s set the maximum depth of the tree to 5:
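A minimal sketch of this shallower tree, continuing the example above (max_depth=5 limits how many levels of splits the tree can make):

```python
from sklearn.tree import DecisionTreeClassifier

# A shallower tree: limiting the depth prunes away fine-grained splits,
# trading some training accuracy for better generalization.
pruned_clf = DecisionTreeClassifier(max_depth=5, random_state=42)
pruned_clf.fit(X_train, y_train)

print(pruned_clf.score(X_train, y_train))  # typically below 1.0 now
print(pruned_clf.score(X_test, y_test))
```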

This lowers the score on the training data because we have now removed some detail. However, the important thing is to capture a generalizable model, so we should aim for a good score on both the training and the test data.
First, we scored the new decision tree on our training data; the output was 0.8653846153846154.
The true classes of the test data are:
[2, 0, 2, 3, 2, 3, 2, 0, 1, 0, 3, 0, 3, 1, 0, 1, 1, 0, 1, 2, 2, 2, 2, 2, 1, 2, 2]
The predicted classes of the test data are:
[2, 1, 2, 3, 2, 3, 3, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 2, 1, 1, 2, 0, 3, 2]
As can be seen, some predictions are still wrong. When we calculate the score of our new decision tree on the test data, the output is 0.5185185185185185.
This means the accuracy on the test data is around 52%. This is slightly better than the previous score, yet still relatively low for a decision tree model. Therefore, it still does not work very well.
- Secondly, the input features may simply not determine the violence classes very well; there may not be a close relationship between these features and the violence rates.
Conclusion:
This ML model is not really suitable for our dataset, because 52% correct predictions is not enough for an effective technique, even though it is right up to a point. The main reason might be GDP, since we found that it is not correlated with the woman violence rate. Therefore, the decision tree also shows that the woman violence rate and the features we chose (Education Rate and GDP) do not have a strong relationship.
Our Python Code: