# importing libraries
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# learning the statistical parameters for each of the data and transforming
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
After rescaling, we can see that all of the values are in the range between 0 and 1.
Output:
[[ 0.353 0.744 0.59 0.354 0.0 0.501 0.234 0.483]
[ 0.059 0.427 0.541 0.293 0.0 0.396 0.117 0.167]
[ 0.471 0.92 0.525 0. 0.0 0.347 0.254 0.183]
[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.0 ]
[ 0.0 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
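For reference, min-max rescaling applies x_scaled = (x - x_min) / (x_max - x_min) to each column. A minimal sketch that reproduces the same result manually, assuming the X array prepared above:
# manual min-max rescaling, equivalent to MinMaxScaler(feature_range=(0, 1))
manual_rescaledX = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(manual_rescaledX[0:5, :]) # matches rescaledX from scaler.fit_transform(X)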
2. Binarize Data (Make Binary)
We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.
This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful.
We can create new binary attributes in Python using scikit-learn with the Binarizer class.
Code: Python code for binarization
# import libraries
from sklearn.preprocessing import Binarizer
import pandas
import numpy
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# initialising the Binarizer with a threshold of 0.0 and fitting it to the data
binarizer = Binarizer(threshold = 0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision = 3)
print(binaryX[0:5,:])
We can see that all values equal to or less than 0 are marked 0, and all values above 0 are marked 1.
Output:
[[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 0. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1. 1. 1. 1.]]
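Under the hood this is just an element-wise comparison against the threshold. A minimal sketch that reproduces the same result without the Binarizer class, assuming the X array prepared above:
# manual binarization, equivalent to Binarizer(threshold=0.0)
manual_binaryX = (X > 0.0).astype(float)
print(manual_binaryX[0:5, :]) # matches binaryX from the Binarizer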
3. Standardize Data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
We can standardize data using scikit-learn with the StandardScaler class.
Code: Python code to Standardize data (0 mean, 1 stdev)
# importing libraries
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# initialising the StandardScaler and fitting it to the data
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision = 3)
print(rescaledX[0:5,:])
The values for each attribute now have a mean value of 0 and a standard deviation of 1.
Output:
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
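StandardScaler applies z = (x - mean) / std to each column. A minimal sketch that reproduces the same result manually, assuming the X array prepared above:
# manual standardization, equivalent to StandardScaler()
manual_rescaledX = (X - X.mean(axis=0)) / X.std(axis=0)
print(manual_rescaledX[0:5, :]) # matches rescaledX from the StandardScaler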
### One-hot encoding: turn categorical values into numeric dummy values
import pandas as pd
df_dummies = pd.get_dummies(df["target"], drop_first=True)
df_numeric = pd.concat([df, df_dummies], axis=1)
df_ready = df_numeric.drop("target", axis=1)
### Drop missing values
print(df.isna().sum().sort_values())
df = df.dropna(subset=column_list)
### Impute missing values
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
y = df["target"].values
X_cat = df["cat_col"].values.reshape(-1, 1)
X_num = df.drop(["cat_col", "target"], axis=1).values
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=12)
# Impute the most frequent category for the categorical column
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat) # Fit and transform on the training set
X_test_cat = imp_cat.transform(X_test_cat) # Only transform, since the imputer was fitted on the training set
# Impute the mean (SimpleImputer's default strategy) for the numeric columns
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num) # Fit and transform on the training set
X_test_num = imp_num.transform(X_test_num) # Only transform, since the imputer was fitted on the training set
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)
### Scaling (only applied to X, never to y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Inspect dataset
df.head()
df.info()
df.describe() # Summary stats
# DEAL WITH MISSING VALUES
df.drop([1, 2, 3]) # Drop rows with index labels 1, 2 and 3
df.dropna(thresh=2) # Keep only rows with at least 2 non-missing values
df.dropna(subset=['C']) # Drop rows with missing values in the specified column
# Convert column types
df["C"] = df["C"].astype("float")
# Verify class imbalance
y.value_counts()
# Split into training and testing data (Consider class imbalance)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# STANDARDIZE DATASET
df.var() # Columns whose variance is much higher than the others are candidates for log normalization
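# A minimal sketch of log normalization for one high-variance column
# (the column name "high_var_col" is a placeholder, not from the original notes)
import numpy as np
df["high_var_col_log"] = np.log(df["high_var_col"]) # use np.log1p instead if the column contains zeros
print(df["high_var_col_log"].var()) # variance should now be on a scale comparable to the other columns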
# FEATURE ENGINEERING
# TEXT PROCESSING (cleaning / vectorizing / regular expressions)
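# A minimal sketch of text cleaning and TF-IDF vectorization
# (the column name "text_col" is a placeholder, not from the original notes)
import re
from sklearn.feature_extraction.text import TfidfVectorizer
df["text_clean"] = df["text_col"].str.lower().apply(lambda s: re.sub(r"[^a-z\s]", "", s)) # clean with a regular expression
vectorizer = TfidfVectorizer(stop_words="english")
text_features = vectorizer.fit_transform(df["text_clean"]) # sparse document-term matrix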
- Comes after data cleaning and Exploratory Data Analysis (EDA)
- Prerequisite for modeling
- Helps to:
  - Produce more reliable results
  - Improve model performance
- Inspect dataset
- See summary statistics
- Deal with missing values
- Convert to specified column types
- Split into training and testing set (Take class imbalance into account)
- Data leakage: information from outside the training set (e.g. test data) is used to train the model
- Standardize data: transform numeric data so it is approximately normally distributed and on a common scale
  - Features with disproportionately high variance can bias the model
  - Differences in scale among features can cause the model to underfit
  - Techniques: log-normalization, standard scaling
  - Tree-based models can be trained without standardization
  - Other models (e.g. linear models) and high-dimensional datasets require standardization
- Feature engineering (creating new features):
  - e.g. averaging similar features
  - e.g. vectorizing text
  - e.g. resampling time data (changing time granularity: from seconds to weeks, months, etc.)
  - e.g. one-hot encoding
  - e.g. regular expressions
- Feature selection (see the sketch after this list):
  - Remove duplicate features (they add bias to the model)
  - Remove strongly correlated features (they add bias to the model)
  - Remove noisy (irrelevant) features
  - Dimensionality reduction (e.g. PCA, LDA, RFE)
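A minimal sketch of correlation-based feature selection followed by PCA; the DataFrame name df_features and the 0.9 threshold are placeholders, not from the original notes:
import numpy as np
from sklearn.decomposition import PCA
# drop one feature from each strongly correlated pair
corr = df_features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_selected = df_features.drop(columns=to_drop)
# keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(df_selected)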
r'''Preprocessing an image dataset. (Suppose your .py file is inside the Dataset folder
and the labels are the folder names, e.g. C:\Users\Dataset\Cats\cat1.png)'''
import os, random
import cv2 as cv
import numpy as np

DIR = os.getcwd()
Labels = [i for i in os.listdir(DIR) if '.' not in i]  # class labels = folder names (entries without a dot)
data, Cut, X, y, New_W, New_H = [], [], [], [], 50, 50
# Count the samples in each class folder so every class can be cut to the same size
for Root, dirs, files in os.walk(DIR):
    if Root != DIR:
        Cut.append(len(os.listdir(Root)))
# Load each image as grayscale, normalize, resize and append it as [img, label]
for R, d, f in os.walk(DIR):
    if R != DIR:
        lbl = [name for name in Labels if name in R]
        Imgs = [cv.imread(os.path.join(R, img), 0) / 255.0 for img in os.listdir(R)]
        for img in Imgs[:min(Cut)]:
            data.append([cv.resize(img, (New_W, New_H)), lbl[0]])
        print(f'Num of imgs for {lbl} label ==> {len(Imgs[:min(Cut)])}')
random.shuffle(data)
for img, label in data:
    X.append(img)
    y.append(label)
X = np.array(X).reshape(-1, New_W, New_H, 1)  # 1 channel if grayscale; 3 if RGB
np.save(os.path.join(DIR, "X.npy"), X)  # save the arrays in the DIR directory
np.save(os.path.join(DIR, "y.npy"), np.array(y))