AWS Certified Machine Learning Specialty Exam Questions

Amazon

AWS Certified Machine Learning Specialty

160 / 258

Question 160:

You are a machine learning specialist working for a large insurance company. You are building a machine learning model to predict the likelihood that an insured customer commits insurance fraud. Your training dataset has many attributes about the insured, the insurance policy, and the insurance claims. For any given customer claim, your model needs to produce a continuous value: the probability of fraud. Your training data includes labeled outcomes for 100,000 insurance claim observations. When you visualize the training dataset, you see that 24,350 of the 100,000 claim records show a policy term length of 0 years. The remaining features for these observations show no anomalies. Which feature engineering option will give you the best dataset for your model training?

Answer options:

A. Use k-means clustering to impute the missing policy length features.
B. Use KNN to impute the missing policy length features.
C. Populate the 0 policy length feature value with the mean or median value of the feature.
D. Drop the records from the dataset where policy length is 0.

Correct answer: B

A policy term length of 0 years is not a plausible value, so the 24,350 affected records are treated as having a missing policy length.

Option A is incorrect. The k-means algorithm is an unsupervised learning algorithm used for clustering data that has no labels. It is not the best choice here, nor is it a technique that practicing machine learning specialists commonly use for feature imputation. Unsupervised learning on unlabeled data will generally give inferior results compared to supervised learning on labeled data.

Option B is correct. The K-Nearest Neighbors (KNN) algorithm, when used for classification, is a supervised learning algorithm that works with labeled data. With KNN, you impute each missing value using feature similarity: the missing value is estimated from the observations that are most similar on the remaining, non-missing features. This is a very common approach used by machine learning specialists to impute missing values.

Option C is incorrect. While it is common to replace missing feature values with the simple mean or median of the feature, this method is far less accurate than using the KNN approach, because it assigns the same value to every affected record regardless of its other features.

Option D is incorrect. Dropping the records with missing values is another common approach for dealing with missing feature values. However, in this scenario it reduces your training set significantly: approximately 24% of your training records (24,350 of 100,000) have a missing policy length. Dropping that many records will reduce the accuracy of your predictions.

References:

Amazon SageMaker Developer Guide, K-Nearest Neighbors (k-NN) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/k-nearest-neighbors.html)
Amazon SageMaker Developer Guide, K-Means Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/k-means.html)
Towards Data Science, 6 Different Ways to Compensate for Missing Values In a Dataset (Data Imputation with examples) (https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779)
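
To make the difference between Option B and Option C concrete, the following is a minimal sketch of KNN imputation in plain Python. It is not the exam's or SageMaker's implementation. The function name knn_impute, the toy claims rows, and the column meanings (customer age, annual premium in thousands, policy term in years) are all hypothetical and chosen only for illustration. The sketch treats a policy term of 0 as missing and uses unscaled Euclidean distance; on real data the features would normally be scaled first.

```python
import math
from statistics import mean


def knn_impute(rows, target, k=3, missing=0):
    """Return a copy of rows in which every rows[i][target] equal to
    `missing` is replaced by the mean target value of the k nearest
    complete rows, measured on all other columns."""
    complete = [r for r in rows if r[target] != missing]
    if not complete:
        raise ValueError("no complete rows to impute from")
    other = [j for j in range(len(rows[0])) if j != target]

    def distance(a, b):
        # Euclidean distance over the non-target columns only.
        # Assumption: features are on comparable scales (not true in general).
        return math.sqrt(sum((a[j] - b[j]) ** 2 for j in other))

    imputed = []
    for r in rows:
        filled = list(r)  # copy so the input is left unchanged
        if r[target] == missing:
            neighbors = sorted(complete, key=lambda c: distance(r, c))[:k]
            filled[target] = mean(n[target] for n in neighbors)
        imputed.append(filled)
    return imputed


# Hypothetical toy data: [customer_age, annual_premium_thousands, policy_term_years]
claims = [
    [25, 1.0, 1],
    [27, 1.2, 1],
    [30, 1.1, 2],
    [55, 3.0, 10],
    [58, 3.2, 12],
    [60, 3.1, 11],
    [57, 3.0, 0],  # term recorded as 0, treated as missing
    [26, 1.1, 0],  # term recorded as 0, treated as missing
]

filled = knn_impute(claims, target=2, k=3)

# Option C for comparison: one global mean for every missing value.
column_mean = mean(r[2] for r in claims if r[2] != 0)

print(filled[6][2], filled[7][2], column_mean)
```

On this toy data the older customer's term is filled with 11 and the younger customer's with about 1.33, while mean imputation would assign about 6.17 to both, a value that matches neither group. In practice a library routine such as scikit-learn's KNNImputer would typically be used instead of hand-written code.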
