
Kaggle Dataset Analysis: Is your Avocado organic or not?

Hey readers! Today, allow me to present yet another dataset analysis, this time on a rather indulgent topic: avocado prices.

This dataset contains historical data on avocado prices and sales volumes across multiple US markets.



Our main objectives are to visualize the dataset, pre-process it, and ultimately test multiple sklearn classifiers to see which one predicts whether an avocado is organic with the best confidence and accuracy.

Note: Credit for the kernel goes to Shivam Negi; all the code below is his.
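The kernel never shows the loading step, so here is a hedged sketch. The filename "avocado.csv" and the sample values are assumptions, not taken from the kernel, but the column names match those used in the code below.

```python
import pandas as pd

# The kernel assumes the Kaggle avocado CSV has already been loaded,
# e.g. avocado = pd.read_csv("avocado.csv") -- the filename is an
# assumption; adjust the path to your copy of the dataset.
# For illustration, a tiny frame with the columns used below:
avocado = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Date": ["2015-12-27", "2015-12-20"],
    "AveragePrice": [1.33, 1.35],
    "Small Bags": [8603.62, 9408.07],
    "Large Bags": [93.25, 97.49],
    "XLarge Bags": [0.0, 0.0],
    "type": ["conventional", "organic"],
    "region": ["Albany", "Albany"],
})
print(avocado.shape)
```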

Data Visualization

The following script produces a jointplot of large-bag vs. small-bag sales volumes:

import seaborn as sns

sns.jointplot(x='Large Bags',y='Small Bags',data=avocado)



A similar joint plot can be drawn to examine the relationship between small and extra-large bags:

sns.jointplot(x='Small Bags',y='XLarge Bags',data=avocado)


Pre-processing

The following script was used to pre-process the input data.


import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Drop the index column and the date
avocado = avocado.drop(['Unnamed: 0','Date'],axis=1)

# Hash the 'region' strings into 5 numeric features.
# With input_type='string', each sample must be a sequence of strings,
# so wrap each region name in a one-element list (otherwise the string
# itself is iterated and individual characters get hashed).
fh = FeatureHasher(n_features=5,input_type='string')
hashed_features = fh.fit_transform([[r] for r in avocado['region']]).toarray()
avocado = pd.concat([avocado,pd.DataFrame(hashed_features)],axis=1)
avocado = avocado.drop('region',axis=1)

# Simple train/test split
X = avocado.drop('type',axis=1)
y = avocado['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


Model Definition and Comparisons

We will look mostly at three different models: the random forest classifier, KNN, and our good old SVM.

Grid-search optimization was used to tune the hyper-parameters for the best results.
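The kernel's grid-search code isn't reproduced in this post, but a minimal GridSearchCV sketch for an SVC looks like the following. The parameter grid and the synthetic data are assumptions for illustration, not the kernel's actual settings:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; in the kernel this would be X_train / y_train
rng = np.random.RandomState(42)
X = rng.rand(100, 5)
y = (X[:, 0] > 0.5).astype(int)

# Search over C and gamma with 3-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The fitted `grid` object then behaves like a classifier trained with the best parameter combination found.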

Random Forest Classifier


rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
pred1 = rfc.predict(X_test)
#Classification reports
print(classification_report(y_test,pred1))
#Confusion Matrix
print(confusion_matrix(y_test,pred1))



You can see that our model makes only (false positives) + (false negatives) = 3 inaccurate predictions in total, which is pretty awesome.
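To make the arithmetic concrete, accuracy can be read off a confusion matrix as the diagonal sum over the total. The counts below are made up for illustration, with 3 off-diagonal errors as in the report above:

```python
import numpy as np

# Illustrative 2x2 confusion matrix (rows: true, cols: predicted);
# the 3 off-diagonal entries are the misclassifications.
cm = np.array([[3000, 1],
               [2, 3014]])
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.9995
```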



KNN Classifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,y_train)
pred2 = knn.predict(X_test)
#Classification Report
print(classification_report(y_test,pred2))
#Confusion Matrix
print(confusion_matrix(y_test,pred2))



So, the performance drops by a total of 0.02, not bad at all!
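A common follow-up for KNN is to sweep n_neighbors and pick the k with the lowest test error (the classic "elbow" check). This is a self-contained sketch on synthetic data, since the kernel's own split isn't reproduced here:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the avocado features
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Record test error for each k from 1 to 10
errors = []
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(np.mean(knn.predict(X_test) != y_test))
best_k = int(np.argmin(errors)) + 1
print(best_k)
```

The default k=5 used above is a reasonable starting point, but this sweep often finds a better one.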
For the SVM part, I'd recommend that you visit the kernel itself (link is at the top).

Conclusion

In the end, we tried a random forest classifier, KNN, and an SVM. The most efficient among them is the random forest classifier, at 100% accuracy; KNN is not far behind at 98%. The support vector classifier (SVC), however, is much weaker at predicting the labels, reaching only 54% accuracy on the conventional type (though 100% on the organic type). The recommended model for predicting the type column (conventional vs. organic) is therefore the random forest classifier.
