Data Science is reaching new levels, and so are the models. But reaching a whopping 99.24% accuracy using simple feature engineering and a simple Decision Tree Classifier?
That's new!
Hello everyone, today I am going to present to you my model, which can predict the value range of a tender in Seattle Trade Permits with a whopping accuracy of 99.24% (with some obvious caveats, which I will discuss at the end).
My Kernel: Yet Another Value Prediction
The Prediction Kernel
Basic EDA
This time, I am going to use the plotly library in Python. It is one of the best options for interactive plots, and if you actually visit the kernel, you will see why.
First of all, we will look at the top grossing contractors in the Seattle area, the ones who have earned the most from their tender acquisitions.
This produces the following interactive graph:
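A minimal sketch of how such a plot could be built with plotly; the CSV path and column names (`contractorcompanyname`, `value`) are assumptions, so adjust them to the actual dataset:

```python
import pandas as pd
import plotly.express as px

# Hypothetical file and column names; adjust to the actual permits CSV.
df = pd.read_csv("seattle-trade-permits.csv")

# Total permit value per contractor, keeping the top 10 earners
top = (df.groupby("contractorcompanyname")["value"]
         .sum()
         .nlargest(10)
         .reset_index())

fig = px.bar(top, x="contractorcompanyname", y="value",
             title="Top Grossing Contractors in Seattle")
fig.show()
```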
Similarly, one could plot another graph for the amount earned per project. But another thing that caught my eye was the really high share of the SINGLE FAMILY/DUPLEX category (about 75% of the data), which you can check with a quick composition plot like the sketch below.
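A hedged sketch of that composition check; the `category` column name is an assumption:

```python
# Composition of permit categories; 'category' is an assumed column name.
counts = df["category"].value_counts().reset_index()
counts.columns = ["category", "count"]

fig = px.pie(counts, names="category", values="count",
             title="Permit Category Composition")
fig.show()
```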
Well, with this basic EDA done, let's move on to feature engineering!
Feature Engineering
First of all, we will encode the value of a tender with the encoder listed below, which is simple enough to understand: it simply buckets the value into 5 categories.
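A minimal sketch of such an encoder; the exact bin edges here are assumptions and may differ from the ones used in the kernel:

```python
# A possible 5-bucket value encoder; the bin edges are assumptions.
def encode_value(value):
    """Bucket a permit's dollar value into one of 5 ordinal labels."""
    if value < 1_000:
        return 0
    elif value < 10_000:
        return 1
    elif value < 100_000:
        return 2
    elif value < 1_000_000:
        return 3
    return 4

df["value_label"] = df["value"].apply(encode_value)
```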
After this, we move on to one-hot encoding our categorical variables.
One Hot Matrix
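A sketch of the one-hot step with pandas; the list of categorical columns is an assumption, so substitute whatever features the kernel keeps:

```python
# One-hot encode the categorical features; the column list is an assumption.
categorical_cols = ["category", "permittypemapped"]
X = pd.get_dummies(df[categorical_cols], columns=categorical_cols)
y = df["value_label"]
```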
Now, since all our features are ready, it's time for model development!
Model Programming
We will build a simple Decision Tree Classifier with max_depth = 5 to limit overfitting!
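A minimal sketch of this setup with scikit-learn, assuming the X and y built in the steps above; the split ratio and random seed are my own assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A shallow tree, as described, to limit overfitting
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```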
As you can see, it worked out pretty well for me. A whopping 99.24% accuracy!
Caveats!
Now now, in a Telegram discussion, many people were astounded by the accuracy score. Recall and precision also worked out fine for this model. But where could it go wrong?
I would like to draw your attention to the fact that this dataset was largely dominated by a single category, SINGLE FAMILY/DUPLEX, and after a little digging into the dataset, I found that the value label distribution was as follows:
This means that about 99.15% of the dataset consists of tenders with a single value label.
So, truth be told: even if my model had been shooting arrows in the dark, it would still have had about a 98% chance of stating the correct label, given that the input is a subset of this dataset!
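A quick sanity check makes this concrete: a classifier that guesses labels at random, in proportion to their frequency, is correct with probability equal to the sum of squared class proportions, which is already around 0.98 here:

```python
# Label distribution and the random-guess baseline: guessing in proportion
# to class frequencies is correct with probability sum(p_i^2).
p = df["value_label"].value_counts(normalize=True)
print(p)                                          # majority label ~99.15%
print("Random-guess baseline:", (p ** 2).sum())   # ~0.98
```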
Conclusion
The only conclusion left to draw is that it is really easy to get a high-performance classifier on a biased dataset like this one, given that your choice of algorithm is right (you may need to visit the kernel to understand why I say this).