Which model would you prefer if the wine-tasting experts would like to gain some insights into the model?

Wine-Tasting Machine

In this assignment, we will practice building supervised machine learning models with Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF) classifiers, and compare them with simple baseline methods such as OneR and ZeroR. The data for this exercise comes from the wine industry.

Each record represents a sample of a specific wine product. The input attributes include its organoleptic characteristics, and the output denotes the quality class of the wine: {high, low}. The labels were assigned by human wine-tasting experts, and we treat them as “ground truth” in this exercise.

Your job is to build the best model to predict wine quality from its characteristics, so that the winery could replace the costly services of professional sommeliers with your automated alternative, to enable quick and effective quality tracking of their wines at production facilities.

They need to know whether such a change is feasible, and how much inaccuracy may be involved in using your tool.

You will be asked to run experiments in both WEKA and Python.

You are given two datasets, red-wine.csv and white-wine.csv, in the Dataset folder.

Deliverables:

A Word document with answers (including screenshots), submitted to Blackboard

A Python notebook uploaded to GitHub, with the link shared in the above Google doc

WEKA Tasks (50 points)

Load red-wine.csv into WEKA (15 points)

Create a conditional distribution for each of the input variables with respect to the output (click the “Visualize All” button, making sure the output attribute is set correctly)

Comparing the plots for Sulphates and Alcohol, which one do you think is more predictive of the wine quality, and why?

Verify your answer using a logistic regression model. Is it consistent with your speculation in (b)? (Hint: you may use univariate logistic regression here; the better its performance, the more predictive the feature is. AUC is a good score for this purpose.)
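As a rough cross-check outside WEKA, the idea behind the hint can be sketched in plain Python. A univariate logistic regression score is a monotone function of its single input, so its in-sample AUC equals the rank-based AUC of the raw feature when the fitted slope is positive, and one minus that value when the slope is negative. The sketch below takes the larger of the two as a simplification, and it will not match a 10-fold cross-validated AUC exactly. The numbers are made-up toy values, not taken from red-wine.csv, and the function names are hypothetical.

```python
def rank_auc(scores, labels):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def univariate_auc(feature, labels):
    """Direction-free AUC of one feature: the larger of auc and 1 - auc."""
    auc = rank_auc(feature, labels)
    return max(auc, 1.0 - auc)


# Made-up toy values, NOT from red-wine.csv; 1 = high quality, 0 = low quality.
quality = [1, 1, 1, 0, 0, 0]
alcohol = [12.1, 11.4, 10.9, 10.2, 9.8, 9.5]
sulphates = [0.70, 0.55, 0.62, 0.60, 0.58, 0.50]

alcohol_auc = univariate_auc(alcohol, quality)
sulphates_auc = univariate_auc(sulphates, quality)
print(f"alcohol AUC = {alcohol_auc:.3f}, sulphates AUC = {sulphates_auc:.3f}")
```

On real data, the feature with the higher value would be the more predictive one under this criterion.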

Fit a model using each of the following methods and report the performance metrics of 10-fold cross-validation using red-wine.csv as the training set (25 points)

Model      ZeroR   OneR   LR   NB   DT   SVM   RF
AUC        N/A     N/A
Accuracy

Obtain the ROC curve for the model with the best AUC score in the experiment above. Paste a screenshot here and comment on its performance (5 points)

Using the best model obtained above in WEKA, run the model on white-wine.csv, report the AUC score, and comment on the performance. (Hint: see the WEKA reference section on how to get performance on an external independent test set.) (5 points)

Python Tasks (50 points)

Submission: Upload the Python notebook to GitHub and submit the link as homework 1

Read red-wine.csv into Python as a data frame, use a pandas profiling tool (https://github.com/pandas-profiling/pandas-profiling) to create an HTML file, and paste a screenshot of the HTML file here (10 points)

Repeat the experiments from WEKA Question 2 and report the same metrics. To receive full credit, you will need to write a script that assembles the results, in the same layout as the table above, as a pandas data frame. Paste a screenshot of your result from your Python notebook here. Please make sure that your output is reported with a reasonable number of significant digits. (20 points)
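One way such a script could be organized is sketched below with scikit-learn. Several choices here are assumptions rather than requirements of the assignment: the data is a synthetic stand-in from `make_classification` so that the sketch runs without the CSV file, ZeroR is approximated by a most-frequent-class `DummyClassifier`, OneR (which scikit-learn does not provide) is approximated by a depth-1 decision tree, and the AUC cells for both baselines are left blank to mirror the N/A entries in the WEKA table.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs on its own.
# For the assignment, X and y would be loaded from red-wine.csv instead.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=0)

models = {
    "ZeroR": DummyClassifier(strategy="most_frequent"),
    "OneR": DecisionTreeClassifier(max_depth=1, random_state=0),  # stump as a stand-in
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=0)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
NO_AUC = {"ZeroR", "OneR"}  # reported as N/A in the assignment's table

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rows = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "accuracy"])
    auc = float("nan") if name in NO_AUC else scores["test_roc_auc"].mean()
    rows[name] = {"AUC": auc, "Accuracy": scores["test_accuracy"].mean()}

# Models as columns, metrics as rows, rounded to three decimals.
results = pd.DataFrame(rows).loc[["AUC", "Accuracy"]].round(3)
print(results)
```

Scaling is placed inside the pipelines so that it is refit on each training fold, which keeps information from the held-out fold out of the preprocessing step.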

Plot the ROC curve of the Random Forest classifier from the Python package, and paste a screenshot of your ROC curve here (10 points)
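A minimal sketch of such a plot, assuming scikit-learn and matplotlib, is shown here. It again uses synthetic stand-in data rather than red-wine.csv, and it builds the curve from pooled 10-fold out-of-fold probabilities, which is one reasonable choice among several; the pooled AUC can differ slightly from the mean of the per-fold AUCs. The output file name `rf_roc.png` is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # file-only backend so the sketch runs headless; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in data; replace with the features and labels from red-wine.csv.
X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probability of the positive class for every sample.
proba = cross_val_predict(rf, X, y, cv=cv, method="predict_proba")[:, 1]

fpr, tpr, _ = roc_curve(y, proba)
auc = roc_auc_score(y, proba)

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, label=f"Random Forest (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="grey", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve, 10-fold cross-validated predictions")
ax.legend(loc="lower right")
fig.savefig("rf_roc.png", dpi=150)
```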

Using the best model obtained above in Q2 (Python), run the model on white-wine.csv, report the AUC score, and comment on the performance. (5 points)
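The external-test step could look roughly like the following sketch: fit once on the whole training set, then score an independent set that played no part in fitting. The two halves of one synthetic dataset stand in for the red and white files here, which is an assumption made only so the code runs; the real files may follow different distributions, so the AUC on white-wine.csv may be noticeably lower than the cross-validated one. Random Forest is used purely as a placeholder for whichever model scored best.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins: the first half plays the role of red-wine.csv (training),
# the second half the role of white-wine.csv (external test).
X, y = make_classification(n_samples=600, n_features=11, n_informative=5, random_state=0)
X_red, y_red = X[:300], y[:300]
X_white, y_white = X[300:], y[300:]

# Fit on the training set only; the external set is never seen during fitting.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_red, y_red)

# Score the external set with the probability of the positive class.
white_scores = model.predict_proba(X_white)[:, 1]
white_auc = roc_auc_score(y_white, white_scores)
print(f"AUC on the external set: {white_auc:.3f}")
```

With the real files, the columns of the test frame need to be in the same order as those used for training.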

Suppose all the models have comparable performance. Which model would you prefer if the wine-tasting experts would like to gain some insights into the model? Note: there could be multiple model types fitting this criterion. (5 points)
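To make "insight into the model" concrete, the sketch below shows two kinds of output that experts can read directly: logistic regression coefficients on standardized inputs, and the if/then rules of a shallow decision tree. It is an illustration under assumptions, not a model answer: the data is synthetic, and the feature names are hypothetical placeholders rather than the columns of the wine files.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

# Logistic regression: one signed weight per standardized feature.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
coefs = pd.Series(lr[-1].coef_[0], index=feature_names)
coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(coefs.round(3))

# Shallow decision tree: a handful of threshold rules in plain text.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Standardizing first puts the coefficients on a common scale, so their magnitudes can be compared across features.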

 
