# importing libraries
import pandas
import scipy
import numpy
from sklearn.preprocessing import MinMaxScaler
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
# initialising the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# learning the statistical parameters for each of the data and transforming
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])
After rescaling, we can see that all of the values are in the range between 0 and 1.
Output:
[[ 0.353 0.744 0.59 0.354 0.0 0.501 0.234 0.483]
[ 0.059 0.427 0.541 0.293 0.0 0.396 0.117 0.167]
[ 0.471 0.92 0.525 0. 0.0 0.347 0.254 0.183]
[ 0.059 0.447 0.541 0.232 0.111 0.419 0.038 0.0 ]
[ 0.0 0.688 0.328 0.354 0.199 0.642 0.944 0.2 ]]
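For reference, min-max rescaling applies x_scaled = (x - x_min) / (x_max - x_min) to each column. A minimal sketch that reproduces the same result manually, assuming the X array prepared above:
# manual min-max rescaling, equivalent to MinMaxScaler(feature_range=(0, 1))
manual_rescaledX = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(manual_rescaledX[0:5, :]) # matches rescaledX from scaler.fit_transform(X)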
2. Binarize Data (Make Binary)
We can transform our data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.
This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values. It is also useful in feature engineering when you want to add new features that indicate something meaningful.
We can create new binary attributes in Python using scikit-learn with the Binarizer class.
Code: Python code for binarization
# import libraries
from sklearn.preprocessing import Binarizer
import pandas
import numpy
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# initialising the Binarizer with a threshold of 0.0 and fitting it to the data
binarizer = Binarizer(threshold = 0.0).fit(X)
binaryX = binarizer.transform(X)
# summarize transformed data
numpy.set_printoptions(precision = 3)
print(binaryX[0:5,:])
We can see that all values equal to or less than 0 are marked 0, and all values above 0 are marked 1.
Output:
[[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 0. 1. 1. 1.]
[ 1. 1. 1. 0. 0. 1. 1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1.]
[ 0. 1. 1. 1. 1. 1. 1. 1.]]
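Under the hood this is just an element-wise comparison against the threshold. A minimal sketch that reproduces the same result without the Binarizer class, assuming the X array prepared above:
# manual binarization, equivalent to Binarizer(threshold=0.0)
manual_binaryX = (X > 0.0).astype(float)
print(manual_binaryX[0:5, :]) # matches binaryX from the Binarizer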
3. Standardize Data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
We can standardize data using scikit-learn with the StandardScaler class.
Code: Python code to Standardize data (0 mean, 1 stdev)
# importing libraries
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
# data set link
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# data parameters
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
# preparing the dataframe using the data at the given link and the defined column names
dataframe = pandas.read_csv(url, names = names)
array = dataframe.values
# separate array into input and output components
X = array[:, 0:8]
Y = array[:, 8]
# initialising the StandardScaler and fitting it to the data
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision = 3)
print(rescaledX[0:5,:])
The values for each attribute now have a mean value of 0 and a standard deviation of 1.
Output:
[[ 0.64 0.848 0.15 0.907 -0.693 0.204 0.468 1.426]
[-0.845 -1.123 -0.161 0.531 -0.693 -0.684 -0.365 -0.191]
[ 1.234 1.944 -0.264 -1.288 -0.693 -1.103 0.604 -0.106]
[-0.845 -0.998 -0.161 0.155 0.123 -0.494 -0.921 -1.042]
[-1.142 0.504 -1.505 0.907 0.766 1.41 5.485 -0.02 ]]
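StandardScaler applies z = (x - mean) / std to each column. A minimal sketch that reproduces the same result manually, assuming the X array prepared above:
# manual standardization, equivalent to StandardScaler()
manual_rescaledX = (X - X.mean(axis=0)) / X.std(axis=0)
print(manual_rescaledX[0:5, :]) # matches rescaledX from the StandardScaler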
### One-hot encoding: turn categorical values into numeric dummy values
import pandas as pd
df_dummies = pd.get_dummies(df["target"], drop_first=True)
df_numeric = pd.concat([df, df_dummies], axis=1)
df_ready = df_numeric.drop("target", axis=1)
### Drop missing values
print(df.isna().sum().sort_values())
df = df.dropna(subset=column_list)
### Impute missing values
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
y = df["target"].values
X_cat = df["cat_col"].values.reshape(-1, 1)
X_num = df.drop(["cat_col", "target"], axis=1).values
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=12)
# Impute the most frequent category for the categorical column
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat) # Fit and transform on the training set
X_test_cat = imp_cat.transform(X_test_cat) # Only transform, since the imputer was fitted on the training set
# Impute the mean (SimpleImputer's default strategy) for the numeric columns
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num) # Fit and transform on the training set
X_test_num = imp_num.transform(X_test_num) # Only transform, since the imputer was fitted on the training set
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)
### Scaling (only applied to X, never to y)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Inspect dataset
df.head()
df.info()
df.describe() # Summary stats
# DEAL WITH MISSING VALUES
df.drop([1, 2, 3]) # Drop rows with index labels 1, 2 and 3
df.dropna(thresh=2) # Keep only rows with at least 2 non-missing values
df.dropna(subset=['C']) # Drop rows with missing values in the specified column
# Convert column types
df["C"] = df["C"].astype("float")
# Verify class imbalance
y.value_counts()
# Split into training and testing data (Consider class imbalance)
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# STANDARDIZE DATASET
df.var() # Columns whose variance is much higher than the others are candidates for log normalization
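# A minimal sketch of log normalization for one high-variance column
# (the column name "high_var_col" is a placeholder, not from the original notes)
import numpy as np
df["high_var_col_log"] = np.log(df["high_var_col"]) # use np.log1p instead if the column contains zeros
print(df["high_var_col_log"].var()) # variance should now be on a scale comparable to the other columns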
# FEATURE ENGINEERING
# TEXT PROCESSING (cleaning / vectorizing / regular expressions)
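# A minimal sketch of text cleaning and TF-IDF vectorization
# (the column name "text_col" is a placeholder, not from the original notes)
import re
from sklearn.feature_extraction.text import TfidfVectorizer
df["text_clean"] = df["text_col"].str.lower().apply(lambda s: re.sub(r"[^a-z\s]", "", s)) # clean with a regular expression
vectorizer = TfidfVectorizer(stop_words="english")
text_features = vectorizer.fit_transform(df["text_clean"]) # sparse document-term matrix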
- Comes after data cleaning and Exploratory Data Analysis (EDA)
- Prerequisite for modeling
- Helps to:
  - Produce more reliable results
  - Improve model performance
- Inspect dataset
- See summary statistics
- Deal with missing values
- Convert to specified column types
- Split into training and testing set (Take class imbalance into account)
- Data leakage: information from outside the training set (e.g. test data) is used to train the model
- Standardize data: transform numeric data so it is approximately normally distributed and on a common scale
  - Features with disproportionately high variance can bias the model
  - Differences in scale among features can cause the model to underfit
  - Techniques: log-normalization, standard scaling
  - Tree-based models can be trained without standardization
  - Other models (e.g. linear models) and high-dimensional datasets require standardization
- Feature engineering (creating new features):
  - e.g. averaging similar features
  - e.g. vectorizing text
  - e.g. resampling time data (changing time granularity: from seconds to weeks, months, etc.)
  - e.g. one-hot encoding
  - e.g. regular expressions
- Feature selection (see the sketch after this list):
  - Remove duplicate features (they add bias to the model)
  - Remove strongly correlated features (they add bias to the model)
  - Remove noisy (irrelevant) features
  - Dimensionality reduction (e.g. PCA, LDA, RFE)
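A minimal sketch of correlation-based feature selection followed by PCA; the DataFrame name df_features and the 0.9 threshold are placeholders, not from the original notes:
import numpy as np
from sklearn.decomposition import PCA
# drop one feature from each strongly correlated pair
corr = df_features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_selected = df_features.drop(columns=to_drop)
# keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(df_selected)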
r'''Preprocessing an image dataset. (Suppose your .py file is inside the Dataset folder
and the labels are the folder names, e.g. C:\Users\Dataset\Cats\cat1.png)'''
import os, random
import cv2 as cv
import numpy as np

DIR = os.getcwd()
Labels = [i for i in os.listdir(DIR) if '.' not in i]  # class labels = folder names (entries without a dot)
data, Cut, X, y, New_W, New_H = [], [], [], [], 50, 50
# Count the samples in each class folder so every class can be cut to the same size
for Root, dirs, files in os.walk(DIR):
    if Root != DIR:
        Cut.append(len(os.listdir(Root)))
# Load each image as grayscale, normalize, resize and append it as [img, label]
for R, d, f in os.walk(DIR):
    if R != DIR:
        lbl = [name for name in Labels if name in R]
        Imgs = [cv.imread(os.path.join(R, img), 0) / 255.0 for img in os.listdir(R)]
        for img in Imgs[:min(Cut)]:
            data.append([cv.resize(img, (New_W, New_H)), lbl[0]])
        print(f'Num of imgs for {lbl} label ==> {len(Imgs[:min(Cut)])}')
random.shuffle(data)
for img, label in data:
    X.append(img)
    y.append(label)
X = np.array(X).reshape(-1, New_W, New_H, 1)  # 1 channel if grayscale; 3 if RGB
np.save(os.path.join(DIR, "X.npy"), X)  # save the arrays in the DIR directory
np.save(os.path.join(DIR, "y.npy"), np.array(y))