- Remove features with zero variance: they add no information
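A minimal sketch of the zero-variance check with scikit-learn's `VarianceThreshold` (toy matrix assumed; the constant middle column is dropped):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the second column is constant (zero variance).
X = np.array([[1.0, 5.0, 0.1],
              [2.0, 5.0, 0.4],
              [3.0, 5.0, 0.3]])

selector = VarianceThreshold(threshold=0.0)  # threshold=0 drops only constant columns
X_reduced = selector.fit_transform(X)
kept = selector.get_support()  # boolean mask of retained columns
```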
- Remove features with extreme cardinality: constant columns (one unique value) add no information, and ID-like columns (a unique value per row) add noise
- Remove irrelevant columns: they add noise to the model
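A cardinality screen can be sketched with pandas `nunique` (toy DataFrame assumed; column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],      # unique per row: ID-like, no signal
    "country": ["US", "US", "US", "US"],  # single value: constant column
    "plan": ["free", "pro", "free", "pro"],
})

n_rows = len(df)
cardinality = df.nunique()
# Drop constant columns and columns with one unique value per row.
drop_cols = cardinality[(cardinality <= 1) | (cardinality == n_rows)].index.tolist()
df_clean = df.drop(columns=drop_cols)
```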
- Inspect features with a seaborn pairplot to spot duplicate columns
- Drop highly correlated features if you are confident they only duplicate information and may bias the model
- Look at both the pairplot and the heatmap, and at Pearson's correlation coefficient
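One way to sketch the correlation-based drop with pandas (synthetic data assumed; `a_dup` is a near-copy of `a`; a heatmap of `corr` could be drawn with `seaborn.heatmap`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "a_dup": a * 2 + 0.01 * rng.normal(size=200),  # nearly duplicates "a"
    "b": rng.normal(size=200),
})

corr = df.corr().abs()  # absolute Pearson correlation matrix
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```

The 0.95 cutoff is a common rule of thumb, not a fixed rule; tune it per dataset.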
- Drop features whose scaled variance falls below a threshold: very low variance often contributes more noise than signal
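A sketch of thresholding scaled variance, assuming min-max scaling so one threshold is comparable across features (toy data; the near-constant second column falls below the cutoff):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.uniform(0, 10, size=100),   # varies widely
    np.r_[np.ones(99), 0.0],        # near-constant: 99 ones and a single zero
])

# Scale to [0, 1] so the variances of differently scaled features are comparable.
X_scaled = MinMaxScaler().fit_transform(X)
selector = VarianceThreshold(threshold=0.01)  # 0.01 is an illustrative cutoff
X_kept = selector.fit_transform(X_scaled)
```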
- Drop columns with missing values beyond a threshold (commonly 30%)
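The missing-value cutoff can be sketched in pandas (toy DataFrame assumed; `income` is 80% missing and gets dropped):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 35],
    "income": [np.nan, np.nan, np.nan, 50000, np.nan],  # 80% missing
    "city": ["NY", "LA", "SF", None, "NY"],
})

threshold = 0.30  # drop columns missing more than 30% of their values
missing_frac = df.isna().mean()  # fraction of NaNs per column
df_clean = df.drop(columns=missing_frac[missing_frac > threshold].index)
```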
- Extract a combined feature from seemingly redundant, correlated features:
- Use PCA
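A minimal PCA sketch, assuming two nearly identical measurements (e.g. left/right arm length) that collapse into one component:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
left_arm = rng.normal(70, 5, size=200)
right_arm = left_arm + rng.normal(0, 0.5, size=200)  # nearly the same measurement
X = np.column_stack([left_arm, right_arm])

pca = PCA(n_components=2).fit(X)
ratio = pca.explained_variance_ratio_          # share of variance per component
X_1d = PCA(n_components=1).fit_transform(X)    # one component captures almost all of it
```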
- Visualize the contribution of features with t-SNE
- Run t-SNE on the numeric features and visualize them in 2D
- Use a categorical feature as the `hue` of a scatterplot of the t-SNE embedding to identify driver features
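The t-SNE step can be sketched with scikit-learn (the Iris dataset is used as a stand-in; the seaborn call is shown as a comment):

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris(as_frame=True)
X = iris.data          # numeric features only
species = iris.target  # categorical label, used as the hue

# perplexity must be smaller than the number of samples (150 here)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# With seaborn: sns.scatterplot(x=emb[:, 0], y=emb[:, 1], hue=species)
# Clusters that separate cleanly by hue suggest that variable drives the structure.
```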
- Discard a model's less important features by filtering out those whose coefficient falls below a threshold:
- Recursive feature elimination (RFE):
- train the model, drop the feature with the lowest coefficient
- retrain the model, drop the next lowest-coefficient feature
- continue until the desired number of features remains
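The loop above is what scikit-learn's `RFE` implements; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE repeatedly fits the estimator and drops the lowest-|coefficient| feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3, step=1)
rfe.fit(StandardScaler().fit_transform(X), y)

selected = rfe.support_  # boolean mask of surviving features
ranking = rfe.ranking_   # 1 = selected; larger numbers were eliminated earlier
```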
- Voting from many models:
- Perform RFE on many models.
- Tally the votes each feature receives across all models
- The features that survive most often are the ones to keep
- Note: make sure the features are standardized and the models are regularized and cross-validated
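The voting scheme can be sketched by running RFE under several estimators and keeping majority-vote features (synthetic data; the three model choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=1)
X = StandardScaler().fit_transform(X)  # standardize before coefficient-based RFE

models = [LogisticRegression(max_iter=1000),
          LinearSVC(max_iter=5000),
          DecisionTreeClassifier(random_state=1)]

votes = np.zeros(X.shape[1], dtype=int)
for model in models:
    rfe = RFE(estimator=model, n_features_to_select=4).fit(X, y)
    votes += rfe.support_.astype(int)  # one vote per surviving feature

keep = votes >= 2  # keep features selected by a majority of the models
```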
- Use tree-based models' feature importances to find important features
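A minimal sketch with a random forest's `feature_importances_` (synthetic data assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # impurity-based, sums to 1.0
order = np.argsort(importances)[::-1]      # feature indices, most important first
```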
- Generate new features from existing features:
- e.g. an average arm length column from the left-arm and right-arm columns
- e.g. a BMI column from the weight and height columns
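Both examples above can be sketched in pandas (toy DataFrame; column names are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70.0, 85.0, 60.0],
    "height_m": [1.75, 1.80, 1.65],
    "left_arm_cm": [72.0, 76.0, 68.0],
    "right_arm_cm": [73.0, 75.0, 68.5],
})

# BMI = weight (kg) / height (m) squared
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
# Collapse two nearly duplicate measurements into a single feature
df["arm_cm"] = (df["left_arm_cm"] + df["right_arm_cm"]) / 2
df = df.drop(columns=["left_arm_cm", "right_arm_cm"])
```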