
Predicting Cost of Tender with 99.24% Accuracy : Miracle!

Data Science is reaching new levels, and so are the models. But reaching a whopping 99.24% accuracy using simple feature engineering and a simple Decision Tree Classifier?

That's new!


Hello everyone, today I am going to present my model, which predicts the value range of a tender in the Seattle Trade Permits dataset with a whopping accuracy of 99.24% (with some obvious caveats, which I will discuss at the end).


My Kernel : Yet Another Value Prediction



The Prediction Kernel

BASIC EDA

This time, I am going to use the plotly library in Python. In my experience it is the best option for interactive plots, and if you visit the kernel, you will understand why.

First of all, we will look at the top-grossing contractors in the Seattle area, i.e. those who have earned the most from the tenders they acquired.
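A minimal sketch of this aggregation, assuming the permits sit in a pandas DataFrame. The column names `Contractor` and `Value`, the sample rows, and the helper names `top_contractors` and `plot_top_contractors` are all hypothetical stand-ins, not the kernel's actual code.

```python
import pandas as pd

# Hypothetical stand-in for the permits data; real column names may differ.
permits = pd.DataFrame({
    "Contractor": ["A Corp", "B LLC", "A Corp", "C Inc", "B LLC", "A Corp"],
    "Value": [1000.0, 2500.0, 4000.0, 500.0, 1500.0, 250.0],
})


def top_contractors(df, n=10):
    """Total permit value per contractor, largest first."""
    return (
        df.groupby("Contractor")["Value"]
        .sum()
        .sort_values(ascending=False)
        .head(n)
    )


def plot_top_contractors(totals):
    """Build an interactive plotly bar chart (plotly is imported lazily)."""
    import plotly.graph_objects as go

    fig = go.Figure(go.Bar(x=totals.index.tolist(), y=totals.values.tolist()))
    fig.update_layout(
        title="Top grossing contractors",
        xaxis_title="Contractor",
        yaxis_title="Total value",
    )
    return fig


top = top_contractors(permits, n=2)
print(top)
```

Calling `plot_top_contractors(top).show()` would then render the chart, provided plotly is installed.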



This will lead to this interactive graph:

Similarly, one could plot another graph for the amount earned per project. But another thing that caught my eye was the really high share of the SINGLE FAMILY/DUPLEX category (about 75% of the data).







Well, with this basic EDA done, let's move on to feature engineering!


Feature Engineering

First of all, we will encode the value of a tender with a simple encoder that sorts the value into 5 categories.
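Such an encoder could look roughly like the sketch below. The bin edges in `VALUE_EDGES` are hypothetical, since the post does not state the actual cut-offs; only the idea of 5 ordinal labels comes from the text.

```python
from bisect import bisect_right

# Hypothetical bin edges; the kernel's real cut-offs are not given in the post.
VALUE_EDGES = [1_000, 10_000, 100_000, 1_000_000]


def encode_value(value):
    """Map a tender value to one of 5 ordinal labels (0 to 4)."""
    # bisect_right counts how many edges are <= value, which is the bin index.
    return bisect_right(VALUE_EDGES, value)


labels = [encode_value(v) for v in (0, 999, 1_000, 50_000, 500_000, 5_000_000)]
print(labels)
```

With a DataFrame, the same function could be applied column-wise, for example `df["Value"].apply(encode_value)`, assuming the column is named `Value`.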




After this, we move on to one-hot encoding the Category variable.
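As a sketch of this step, `pandas.get_dummies` can produce the one-hot columns. The column name `Category` and every category value other than SINGLE FAMILY/DUPLEX are assumptions made for illustration.

```python
import pandas as pd

# Toy frame; only SINGLE FAMILY/DUPLEX is a category named in the post.
df = pd.DataFrame({
    "Category": [
        "SINGLE FAMILY/DUPLEX",
        "COMMERCIAL",
        "SINGLE FAMILY/DUPLEX",
        "MULTIFAMILY",
    ]
})

# One 0/1 column per distinct category value.
one_hot = pd.get_dummies(df["Category"], prefix="Category", dtype=int)

# Swap the original text column for its one-hot columns.
features = pd.concat([df.drop(columns="Category"), one_hot], axis=1)
print(features)
```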






One Hot Matrix
The next step was simply to binary encode the Status column. After that, if we look at the Action column, it contains 21 unique entries. Adding 21 more one-hot features isn't practical. How about using feature hashing?
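One way to sketch both steps is with a boolean comparison for the status flag and scikit-learn's `FeatureHasher` for the action values. The column names, the sample values, and the choice of 8 hashed columns are assumptions; the post does not say which settings the kernel used.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Toy data; the status and action values here are made up for illustration.
df = pd.DataFrame({
    "Status": ["Completed", "Issued", "Completed", "Issued"],
    "Action": ["NEW", "ADD/ALT", "NEW", "DEMOLITION"],
})

# Binary encode the status: 1 for one chosen value, 0 for everything else.
status_flag = (df["Status"] == "Completed").astype(int)

# Hash each action string into a fixed number of columns (8 is an assumption).
# With input_type="string", every sample must be an iterable of strings,
# so each action is wrapped in a one-element list.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[action] for action in df["Action"]]).toarray()

action_cols = pd.DataFrame(
    hashed,
    columns=[f"action_hash_{i}" for i in range(hashed.shape[1])],
    index=df.index,
)
print(action_cols)
```

The trade-off is that different actions can collide in the same hashed column, so fewer columns means a smaller matrix but a higher chance of two actions looking identical to the model.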






Now that all our features are ready, it's time for model development!


Model Programming

We will build a simple Decision Tree Classifier with max_depth = 5 to limit overfitting!
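A minimal sketch of such a model with scikit-learn follows. The data here is synthetic and the split settings are assumptions; only `max_depth = 5` comes from the post, so the accuracy this prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the engineered feature matrix and the 5 value labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-1.5, -0.5, 0.5, 1.5])

# Hold out a quarter of the rows, keeping label proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Capping the depth limits how closely the tree can memorise the training set.
clf = DecisionTreeClassifier(max_depth=5, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"depth={clf.get_depth()}, test accuracy={accuracy:.4f}")
```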







As you can see, it worked out pretty well for me. A whopping 99.24% accuracy!


Caveats!

Now, in the Telegram discussion, many people were astounded by the accuracy score. The recall and precision worked out fine for this model. But where could it go wrong?

Now I would like to draw your attention to the fact that this dataset is largely dominated by a single category, SINGLE FAMILY/DUPLEX, and after a little digging into the dataset, I found that the value label distribution was as follows:


This means that about 99.15% of the dataset consists of tenders with a single value label.

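So, truth be told, even if my model had been shooting arrows in the dark, it would still have had about a 98% chance of stating the label correctly, given that the input is a subset of this dataset! In fact, always predicting the dominant label would already score about 99.15%, so the 99.24% above is only a small step over that baseline.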
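The arithmetic behind this can be sketched as follows. Only the 99.15% share comes from the post; the counts for the four minority labels are invented so that the example runs. One reading of "shooting in the dark" is guessing labels at random in proportion to their frequency, in which case the hit rate is at least the dominant share squared.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 9915 of 10000 rows carry label 0 (the 99.15% from the post);
# the split of the remaining 85 rows is made up for illustration.
y = np.array([0] * 9915 + [1] * 40 + [2] * 25 + [3] * 15 + [4] * 5)
X = np.zeros((len(y), 1))  # features are irrelevant to a dummy baseline

# Baseline that ignores the input and always predicts the most common label.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

majority_acc = accuracy_score(y, pred)           # share of the dominant label
balanced_acc = balanced_accuracy_score(y, pred)  # mean recall over the 5 labels

# Random guessing in proportion to label frequency hits with probability
# sum(p_i ** 2), which is at least the dominant share squared.
share = 0.9915
random_guess_floor = share ** 2

print(f"always-majority accuracy: {majority_acc:.4f}")
print(f"balanced accuracy:        {balanced_acc:.4f}")
print(f"random-guess lower bound: {random_guess_floor:.4f}")
```

The gap between plain accuracy and balanced accuracy is the useful signal here: a model that never predicts a minority label still looks excellent on accuracy alone.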


Conclusion

The only thing left to conclude is that it is really easy to get a high-performance classifier on a biased dataset like this, given that your choice of algorithm is right (you may need to visit the kernel to understand why I say this).
