Top 10 performance tips (the same best practices can be applied to EMR too)
Storage optimisations:
Partition your data (partitions act as virtual columns defined at table creation; they keep related data together and reduce the amount of data scanned per query)
Bucket your data within a single partition
Use compression (formats like Apache Parquet or ORC are recommended because they compress data by default and are splittable, meaning they can be read in parallel by the execution engine)
Optimise file sizes (files smaller than 128 MB can make queries take longer because of the overhead of opening S3 files, listing directories, getting object metadata and so on)
Optimise columnar data store generation
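As a rough illustration of the partitioning and file-size tips above, the Python sketch below builds a Hive-style partition prefix and groups small files into batches of roughly 128 MB that could then be compacted. The function names, the example bucket path and the greedy grouping strategy are assumptions made for illustration only; they do not come from any particular tool.

```python
from typing import Dict, List, Tuple

TARGET_BYTES = 128 * 1024 * 1024  # the 128 MB threshold mentioned in the tips


def partition_prefix(base: str, partitions: Dict[str, str]) -> str:
    """Build a Hive-style key=value prefix under `base` (hypothetical helper)."""
    parts = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{base.rstrip('/')}/{parts}/"


def plan_compaction(
    files: List[Tuple[str, int]], target: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches of about `target` bytes.

    Files that are already at least `target` bytes are skipped, since they
    do not need to be merged. Any leftover small files form a final batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if size >= target:
            continue  # already large enough, leave it alone
        current.append(name)
        current_size += size
        if current_size >= target:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    mb = 1024 * 1024
    # Hypothetical bucket and partition keys, for illustration only.
    print(partition_prefix("s3://example-bucket/logs", {"year": "2024", "month": "01"}))
    print(plan_compaction([("a", 64 * mb), ("b", 200 * mb), ("c", 64 * mb), ("d", 10 * mb)]))
```

Each batch returned by `plan_compaction` would be rewritten as a single larger file by whatever job produces the data; that step is left out here because it depends on the engine in use.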
Query optimisations:
Optimise ORDER BY (by using LIMIT)
Optimise joins (specify the larger table on the left side of the join)
Optimise GROUP BY (reduce the number of columns in the SELECT to lower memory use, and order the GROUP BY columns from highest cardinality, i.e. most unique values, to lowest)
Use approximate functions
Only include the columns that you need
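The query tips above can be sketched with a few small Python helpers that assemble SQL strings. This is only a sketch: the table and column names are hypothetical, and the `approx_distinct` example assumes a Presto-compatible engine, which the notes do not state explicitly.

```python
from typing import Dict, List


def group_by_clause(cardinality: Dict[str, int]) -> str:
    """Order GROUP BY columns from highest to lowest cardinality."""
    ordered = sorted(cardinality, key=cardinality.get, reverse=True)
    return "GROUP BY " + ", ".join(ordered)


def top_n_query(table: str, columns: List[str], order_by: str, limit: int) -> str:
    """Select only the needed columns and pair ORDER BY with LIMIT."""
    cols = ", ".join(columns)
    return f"SELECT {cols} FROM {table} ORDER BY {order_by} DESC LIMIT {limit}"


# Larger table on the left side of the join, smaller one on the right
# (both table names are hypothetical).
JOIN_QUERY = (
    "SELECT l.order_id, s.region_name "
    "FROM large_orders l JOIN small_regions s ON l.region_id = s.region_id"
)

# Approximate distinct count instead of COUNT(DISTINCT ...);
# assumes the engine provides approx_distinct, as Presto-based engines do.
APPROX_QUERY = "SELECT approx_distinct(user_id) FROM requests"


if __name__ == "__main__":
    print(group_by_clause({"country": 200, "user_id": 1_000_000, "device": 5}))
    print(top_n_query("requests", ["url", "latency_ms"], "latency_ms", 100))
```

The cardinality figures passed to `group_by_clause` would normally come from table statistics or a quick profiling query; here they are made-up numbers.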