- Create new features (eg: averages of existing columns, BMI from height and weight)
- Visualize distributions with a boxplot or pairplot of the dataset to see whether a transformation is necessary (eg: log transformation for skewed features)
- Normalize/Standardize/Scale features
- Encoding : Convert categories into numeric data
  - One-hot encoding : Explainable features, create N columns for N categories
  - Dummy encoding : Same information without the redundant column, create N-1 columns for N categories (the dropped category is implied when all other columns are 0)
- Merge low-frequency (uncommon) categorical values into a single category (eg: `other`)
- Binarise numeric values (eg: from `num_violations` to `violation_boolean`)
- Deal with missing values:
  - drop rows or columns whose missing values exceed a threshold (eg: >30% of the dataset)
  - fill values that are missing completely at random (with the mean, median, mode, an `Other` category, or the next present value in sorted order)
- Deal with outliers
- Validate numeric columns
  - remove non-numeric characters from numeric data (eg: `$` or `,` in currency values)
  - make sure the column has the proper datatype (eg: `float`, `int`)
- For text processing : Generate numeric features
  1. Remove unwanted/non-letter characters
  2. Standardize text : convert to lowercase / uppercase
  3. Generate Feature, Mean word length : average length of words in text = character_count / word_count
  4. Generate Feature, Bag of words : Word Count Vector = number of times each word appears in a text
  5. Generate Feature, Normalized significance of words : Calculate TF-IDF = word counts weighted by how rare the word is across documents (significance of a word in one document compared to the whole collection)
  6. Generate Feature, Contextual significance of word sequences : Calculate TF-IDF over n-grams (sequences of n consecutive words) instead of single words, so that word order within a phrase is kept
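The encoding, rare-category, binarisation, missing-value and numeric-validation bullets above can be sketched in plain Python. This is a minimal illustration under assumptions, not a reference implementation: the helper names (`to_float`, `fill_missing`, `merge_rare`, `encode`), the `min_count` cut-off and the sample values are all hypothetical, and the median is just one of the fill strategies listed.

```python
from collections import Counter
from statistics import median


def to_float(raw):
    """Strip currency symbols and thousands separators; None if missing."""
    if raw is None or raw.strip() == "":
        return None
    return float(raw.replace("$", "").replace(",", ""))


def fill_missing(values):
    """Replace None with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def merge_rare(categories, min_count=2, label="other"):
    """Collapse categories seen fewer than min_count times into one label."""
    freq = Counter(categories)
    return [c if freq[c] >= min_count else label for c in categories]


def encode(categories, drop_first=False):
    """One-hot encode (N columns) or dummy encode (N-1 columns)."""
    levels = sorted(set(categories))
    if drop_first:
        levels = levels[1:]  # the dropped level becomes the all-zeros row
    return levels, [[int(c == lvl) for lvl in levels] for c in categories]


# Hypothetical sample data, for illustration only.
prices = fill_missing([to_float(p) for p in ["$1,200.50", "", "$800", "$1,000"]])
cities = merge_rare(["paris", "paris", "rome", "rome", "oslo"])
num_violations = [0, 3, 1, 0]
violation_boolean = [int(v > 0) for v in num_violations]

print(prices)
print(cities)
print(encode(cities))
print(encode(cities, drop_first=True))
print(violation_boolean)
```

With real tabular data the same steps are usually done with a dataframe library; for example, pandas provides `fillna` and `get_dummies(drop_first=True)`.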
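The text-processing steps above can also be sketched with the standard library alone. The function names and the two sample sentences are hypothetical, and the TF-IDF shown is one common unsmoothed form, `tf * log(N / df)`; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
import re
from collections import Counter


def clean(text):
    """Steps 1-2: keep letters and whitespace only, then lowercase."""
    return re.sub(r"[^A-Za-z\s]", " ", text).lower()


def tokens(text, n=1):
    """Split cleaned text into word n-grams (n=1 gives single words)."""
    words = clean(text).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def mean_word_length(text):
    """Step 3: characters in words divided by number of words."""
    words = tokens(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def bag_of_words(text, n=1):
    """Step 4: count how often each term occurs in one document."""
    return Counter(tokens(text, n))


def tf_idf(docs, n=1):
    """Steps 5-6: term frequency weighted by inverse document frequency."""
    counts = [bag_of_words(d, n) for d in docs]
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    result = []
    for c in counts:
        total = sum(c.values())
        result.append({
            term: (k / total) * math.log(len(docs) / doc_freq[term])
            for term, k in c.items()
        })
    return result


# Hypothetical sample documents, for illustration only.
docs = ["The cat sat!", "The dog sat on the mat."]
print(mean_word_length(docs[0]))
print(bag_of_words(docs[1]))
print(tf_idf(docs)[0])        # single words
print(tf_idf(docs, n=2)[0])   # bigrams
```

Words that occur in every document (here `the` and `sat`) get a weight of 0, which is the point of the IDF term: it down-weights terms that do not help tell documents apart.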