Hey readers! Today I'd like to present yet another dataset analysis on a rather gluttonous topic, namely avocado price analysis.
This dataset contains historical data on avocado prices and sales volumes across multiple US markets.
Our main objectives are to visualize the dataset, pre-process it, and finally test several sklearn classifiers to see which one gives us the best confidence and accuracy in telling whether an avocado is organic.
Note: credit for this kernel goes to Shivam Negi; all of the code here is his.
Data Visualization
The following script produces this jointplot:
```python
import pandas as pd
import seaborn as sns

avocado = pd.read_csv('avocado.csv')  # Kaggle avocado prices dataset
sns.jointplot(x='Large Bags', y='Small Bags', data=avocado)
```
A similar joint plot can be drawn to examine the relation between the small and extra-large bags.
```python
sns.jointplot(x='Small Bags', y='XLarge Bags', data=avocado)
```
Pre Processing
The following script pre-processes the input data.
```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

avocado = avocado.drop(['Unnamed: 0', 'Date'], axis=1)

# Using feature hashing to encode the 'region' column into 5 numeric features.
# FeatureHasher with input_type='string' expects each sample to be an iterable
# of strings, so wrap each region name in a one-element list (a bare string
# would be hashed character by character).
fh = FeatureHasher(n_features=5, input_type='string')
hashed_features = fh.fit_transform(avocado['region'].apply(lambda r: [r])).toarray()
avocado = pd.concat([avocado, pd.DataFrame(hashed_features)], axis=1)
avocado = avocado.drop('region', axis=1)

# Simple train/test split
X = avocado.drop('type', axis=1)
y = avocado['type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```

Note that `GridSearchCV` now lives in `sklearn.model_selection`; the old `sklearn.grid_search` module has been removed.
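To see what the hashing step actually does, here is a tiny self-contained example (the region names are made up) showing that `FeatureHasher` maps each name to a fixed 5-column vector, with identical names hashing to identical rows:

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up region names; each sample is a one-element list of raw string features
regions = [['Albany'], ['Boston'], ['Albany']]
fh = FeatureHasher(n_features=5, input_type='string')
hashed = fh.fit_transform(regions).toarray()
print(hashed.shape)  # (3, 5)
```

Each row holds a single non-zero entry, and rows 0 and 2 are identical because they encode the same region.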
Model Definition and Comparisons
We will look mostly at three different models, namely the random forest classifier, KNN, and our good old SVM. Grid-search optimization has of course been used for best results.
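The grid-search code itself isn't shown in this post; below is a minimal sketch of how `GridSearchCV` might wrap one of the models. The synthetic data and the grid values here are illustrative assumptions, not the kernel's actual settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in data; with the avocado frame you would pass its X_train / y_train instead
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Illustrative hyper-parameter grid
param_grid = {'n_estimators': [50, 100], 'max_depth': [None, 10]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

`grid.best_estimator_` then gives the refitted model with the winning parameter combination.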
Random Forest Classifier
```python
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
pred1 = rfc.predict(X_test)

# Classification report
print(classification_report(y_test, pred1))
# Confusion matrix
print(confusion_matrix(y_test, pred1))
```
You can see that our model makes only (false positives) + (false negatives) = 3 incorrect predictions in total, which is pretty awesome.
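To make the arithmetic concrete, overall accuracy can be read straight off a confusion matrix. The numbers below are hypothetical, chosen only so that the off-diagonal errors sum to 3 as described above:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix with 3 off-diagonal mistakes
cm = np.array([[3000,    1],
               [   2, 3020]])
errors = cm.sum() - np.trace(cm)     # false positives + false negatives
accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
print(errors, round(accuracy, 4))
```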
KNN Classifier
```python
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred2 = knn.predict(X_test)

# Classification report
print(classification_report(y_test, pred2))
# Confusion matrix
print(confusion_matrix(y_test, pred2))
```
So the performance drops by a total of about 0.02, not bad at all!
For the SVM part, I'd recommend visiting the kernel itself (link at the top).
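For readers who want a self-contained starting point without leaving the page, the SVC step most likely follows the same pattern as the other two models. This is a sketch on synthetic stand-in data; the kernel's exact parameters are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in data; with the avocado frame you would reuse its X_train / y_train
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

svc = SVC()  # default RBF kernel; C and gamma could be tuned via grid search
svc.fit(X_train, y_train)
pred3 = svc.predict(X_test)
print(classification_report(y_test, pred3))
print(confusion_matrix(y_test, pred3))
```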
Conclusion
In the end, we tried the Random Forest classifier, KNN, and SVM. The most efficient among them is the Random Forest classifier, at 100% accuracy; KNN is not far behind at 98%; but the Support Vector Classifier (SVC) is much weaker at predicting the type, reaching only 54% accuracy on the conventional class (though 100% on organic). The recommended model for predicting the type column (conventional vs. organic) is therefore the Random Forest classifier.