AWS Certified Machine Learning Specialty Exam Questions

Amazon

AWS Certified Machine Learning Specialty

159 / 258

Question 159:

You work as a machine learning specialist for a financial services firm. Your firm contracts with market data generation services that deliver 5 TB of market activity record data every minute. To prepare this data for your machine learning models, your team queries the data using Athena. However, the queries perform poorly because they are operating on such a large data stream. You need to find a more performant option. Which file format for your market data records on S3 will give you the best performance?

Answer options:

A. TSV files
B. Compressed LZO files
C. Parquet files
D. CSV files

Correct answer:

C. Parquet files

Option A is incorrect. TSV is a row-based text format that uses tabs as the attribute separator. When Athena reads such files, it must scan every row in full, even when the query needs only one attribute from each row. A columnar format lets Athena read just the columns the query references, which is much more efficient for queries over large datasets. Partitioning is a property of how the data is laid out in S3 rather than of the file format, so it does not make TSV competitive with a columnar format.

Option B is incorrect. LZO is a compression codec, not a columnar format. Compressing row-based files does not give Athena columnar access, so every row must still be processed in full. Compressed LZO files therefore perform poorly compared with columnar formats such as Parquet.

Option C is correct. Parquet is a columnar format, so Athena reads only the columns a query needs, and it works well with partitioned data. The other columnar format supported by Athena is ORC. These columnar formats outperform row-based text formats such as CSV and TSV when Athena works with very large datasets.

Option D is incorrect. CSV is a row-based text format that uses commas as the attribute separator. As with TSV, Athena must scan every row in full rather than reading only the column that the query needs. Columnar processing is much more efficient for queries over large datasets, and partitioning CSV data does not close that gap.

References:

Amazon Athena FAQs, see the question "How do I improve the performance of my query?" (https://aws.amazon.com/athena/faqs/#:~:text=Amazon%20Athena%20supports%20a%20wide,%2C%20LZO%2C%20and%20GZIP%20formats.)
AWS Big Data blog, "Top 10 Performance Tuning Tips for Amazon Athena" (https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/)
Amazon Athena user guide, "Compression Formats" (https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html)
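The row-based versus columnar argument in the explanation can be illustrated with a small toy model. The sketch below is only an illustration under stated assumptions: the sample records, the column names, and the helper functions are all hypothetical, and the "columnar" layout is a plain dictionary of text chunks, not Parquet's actual encoding or anything Athena does internally. It sums one column under both layouts and counts how many bytes each approach has to read.

```python
import csv
import io

# Hypothetical market records: (symbol, price, volume, venue).
records = [
    ("AAPL", "189.50", "1200", "NASDAQ"),
    ("MSFT", "411.20", "800", "NASDAQ"),
    ("JPM", "198.75", "1500", "NYSE"),
    ("GS", "455.10", "300", "NYSE"),
]
columns = ["symbol", "price", "volume", "venue"]


def row_layout(rows):
    """Serialize rows as CSV text: all fields of a record sit on one line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


def column_layout(rows, names):
    """Toy columnar store: each column's values are kept contiguously."""
    return {
        name: "\n".join(row[i] for row in rows) + "\n"
        for i, name in enumerate(names)
    }


def sum_volume_from_rows(text):
    """Scan the row-based text; the whole text is parsed to reach 'volume'."""
    total = sum(int(row[2]) for row in csv.reader(io.StringIO(text)))
    return total, len(text.encode("utf-8"))


def sum_volume_from_columns(store):
    """Read only the 'volume' chunk; the other columns are never touched."""
    chunk = store["volume"]
    total = sum(int(value) for value in chunk.splitlines())
    return total, len(chunk.encode("utf-8"))


row_total, row_bytes = sum_volume_from_rows(row_layout(records))
col_total, col_bytes = sum_volume_from_columns(column_layout(records, columns))
print(f"row-based: total={row_total}, bytes scanned={row_bytes}")
print(f"columnar:  total={col_total}, bytes scanned={col_bytes}")
```

Both layouts give the same total, but the columnar read touches only the bytes of the one column it needs. With many more columns and rows, the difference in bytes scanned grows accordingly, which is the effect the explanation attributes to Parquet and ORC.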

