Correct Answer: C
Data skew, an asymmetry in how a job's data is distributed, is the most common reason for slow join or shuffle jobs.
Spark is a distributed system: data is divided into pieces called partitions, which are spread across the cluster nodes and processed in parallel. If one partition grows much larger than the others, the node processing it is likely to run short of resources and slow down the whole job. This kind of imbalance is known as data skew.
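The effect described above can be illustrated with a small, self-contained simulation. This is only a sketch in plain Python, not Spark code: the partition count, the key names, and the 90% "hot key" dataset are all hypothetical, and `zlib.crc32` merely stands in for a hash partitioner. It also shows key salting, one commonly used mitigation that is not taken from the explanation above, to make the contrast visible.

```python
import random
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count, chosen for illustration


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Assign a key to a partition with a deterministic hash (stand-in for a hash partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions=NUM_PARTITIONS):
    """Count how many records land in each partition."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[partition_of(key, num_partitions)] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


def salt(key, rng, num_salts=NUM_PARTITIONS):
    """Append a random suffix so records sharing one key spread over several partitions."""
    return f"{key}#{rng.randrange(num_salts)}"


# Hypothetical dataset: 90% of the rows share a single "hot" join key.
keys = ["hot_customer"] * 9000 + [f"customer_{i}" for i in range(1000)]

rng = random.Random(0)  # fixed seed so the run is repeatable
skewed = partition_sizes(keys)
salted = partition_sizes([salt(k, rng) for k in keys])

print("partition sizes without salting:", skewed)
print("skew ratio without salting:", round(skew_ratio(skewed), 2))
print("partition sizes with salting:", salted)
print("skew ratio with salting:", round(skew_ratio(salted), 2))
```

Without salting, every row with the hot key hashes to the same partition, so one worker would receive at least 9,000 of the 10,000 rows while the others stay nearly idle. In a real join, salting one side also requires replicating the other side once per salt value, which this sketch leaves out.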
Option A is incorrect. Bucketing does not slow down join or shuffle jobs; it pre-partitions data by key and can reduce the amount of shuffling a join needs.
Option B is incorrect. Caching is likely to improve performance, not degrade it.
Option C is correct. Data skew is the most common reason for slow join or shuffle jobs.
Option D is incorrect. Enabling autoscaling is not a cause of slow join or shuffle jobs.
Option E is incorrect because option C (data skew) is the correct choice.
Reference:
To learn more about data skew and how to resolve it, see the following link:
https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-data-lake-tools-data-skew-solutions