Feature Engineering

Feature engineering is a critical stage in any machine learning project. Its main purpose is to extract features from the raw data and turn them into numerical values that can be used to train a machine learning model. In this notebook, we will explore the different types of features and how to generate them.

Label encoding

Label encoding assigns an integer code to each category:

import numpy as np
import sklearn.preprocessing as preprocessing

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
labelenc = preprocessing.LabelEncoder()
labelenc.fit(targets)
targets_trans = labelenc.transform(targets)
print("The original data")
print(targets)
print("The transformed data using LabelEncoder")
print(targets_trans)
The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transformed data using LabelEncoder
[2 2 1 0 1 3]

The same mapping must be applied to both the train and test datasets, so the encoder should be fitted once and reused on both. We can use .astype("category") and pandas.Series.cat.codes to achieve the same result:

import pandas as pd

df = pd.DataFrame({"col1": ["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"]})
print("The original types of DataFrame")
print(df.dtypes)
print("*"*30)
df["col1"] = df["col1"].astype("category")
print("The new types of DataFrame")
print(df.dtypes)
print("*"*30)
df["col1_label_encoding"] = df["col1"].cat.codes
print("The new column.")
df
The original types of DataFrame
col1    object
dtype: object
******************************
The new types of DataFrame
col1    category
dtype: object
******************************
The new column.
    col1  col1_label_encoding
0    Sun                    2
1    Sun                    2
2   Moon                    1
3  Earth                    0
4   Moon                    1
5  Venus                    3
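To keep the mapping consistent between splits, one option is to fix the category set from the training data and reuse it on the test data. A minimal sketch (the train/test split and the "Pluto" value are hypothetical, used only to show what happens to an unseen category):

```python
import pandas as pd

# Hypothetical train/test split to illustrate consistent encoding.
train = pd.DataFrame({"col1": ["Sun", "Moon", "Earth", "Sun"]})
test = pd.DataFrame({"col1": ["Moon", "Sun", "Pluto"]})

# Fix the category set from the training data so both splits share
# the same integer code for each category.
cat_dtype = pd.CategoricalDtype(categories=sorted(train["col1"].unique()))
train["col1_enc"] = train["col1"].astype(cat_dtype).cat.codes
# Categories unseen during training become -1 instead of a clashing code.
test["col1_enc"] = test["col1"].astype(cat_dtype).cat.codes
```

The -1 sentinel for unseen categories makes problems visible at prediction time, instead of silently assigning an unrelated code.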

One-hot encoding

Even though we generated numbers from the categories, most of the time these numbers have no order (i.e. they are nominal, not ordinal). In that case, the proper way of encoding is one-hot encoding.

import sklearn.preprocessing as preprocessing

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon",
                    "Venus"])
labelEnc = preprocessing.LabelEncoder()
new_target = labelEnc.fit_transform(targets)
onehotEnc = preprocessing.OneHotEncoder()
onehotEnc.fit(new_target.reshape(-1, 1))
targets_trans = onehotEnc.transform(new_target.reshape(-1, 1))
print("The original data")
print(targets)
print("The transformed data using OneHotEncoder")
print(targets_trans.toarray())
The original data
['Sun' 'Sun' 'Moon' 'Earth' 'Moon' 'Venus']
The transformed data using OneHotEncoder
[[0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]]
The same result can be obtained in pandas with get_dummies:

df_new = pd.get_dummies(df, columns=["col1"], prefix="Planet")
df_new
   col1_label_encoding  Planet_Earth  Planet_Moon  Planet_Sun  Planet_Venus
0                    2             0            0           1             0
1                    2             0            0           1             0
2                    1             0            1           0             0
3                    0             1            0           0             0
4                    1             0            1           0             0
5                    3             0            0           0             1
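In recent scikit-learn versions, OneHotEncoder accepts string columns directly, so the intermediate LabelEncoder step above is not required. A minimal sketch:

```python
import numpy as np
import sklearn.preprocessing as preprocessing

targets = np.array(["Sun", "Sun", "Moon", "Earth", "Moon", "Venus"])
# OneHotEncoder expects a 2-D array: one row per sample, one column per feature.
onehotEnc = preprocessing.OneHotEncoder()
targets_trans = onehotEnc.fit_transform(targets.reshape(-1, 1))
print(onehotEnc.categories_)   # the learned category order per feature
print(targets_trans.toarray())
```

The categories are sorted alphabetically per column, so the resulting columns match the LabelEncoder-based version above.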

Count encoding

Count encoding is a prevalent encoding method for tree-based models: each category is replaced by its frequency in the data. A drawback is that it might map different categories to the same value:

df["planet_count"] = df["col1"].map(df["col1"].value_counts().to_dict())
df.head()
    col1  col1_label_encoding  planet_count
0    Sun                    2             2
1    Sun                    2             2
2   Moon                    1             2
3  Earth                    0             1
4   Moon                    1             2

Mean Encoding

Mean encoding maps each category to the mean of the target variable for that category (we could also use statistics other than the mean, such as the variance or standard deviation):

df = pd.DataFrame({
    "col1": ["Sun", "Moon", "Sun", "Moon", "Moon", "Mars"],
    "price": [20, 30, 30, 35, 40, 55]
})
d = df.groupby(["col1"])["price"].mean().to_dict()
df["col1_price_mean"] = df["col1"].map(d)
df
col1 price col1_price_mean
0 Sun 20 25.0
1 Moon 30 35.0
2 Sun 30 25.0
3 Moon 35 35.0
4 Moon 40 35.0
5 Mars 55 55.0
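The same groupby pattern works for other statistics. A sketch using the standard deviation instead of the mean, on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Sun", "Moon", "Sun", "Moon", "Moon", "Mars"],
    "price": [20, 30, 30, 35, 40, 55]
})
# Map each category to the sample standard deviation of its prices.
# Note: single-member categories (here "Mars") get NaN with the sample std.
d = df.groupby(["col1"])["price"].std().to_dict()
df["col1_price_std"] = df["col1"].map(d)
```

The NaN for single-member categories is worth handling explicitly (e.g. filling with 0 or a global statistic) before training.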

Weight of Evidence (WOE) Encoding

WOE is a technique used to encode categorical features for classification tasks. For each category, we estimate the probability of each target class. For binary classification problems, WOE is defined as: \[ WOE = \log \frac{p(1)}{p(0)} \]

df = pd.DataFrame({
    "col1": ["Moon", "Sun", "Moon", "Sun", "Sun"],
    "Target": [1, 1, 0, 1, 0]
})
df["Target"] = df["Target"].astype("float64")
print("The original dataset")
print(df)
print("*" * 30)
d = df.groupby(["col1"])["Target"].mean().to_dict()
df["p1"] = df["col1"].map(d)
df["p0"] = 1 - df["p1"]
df["woe"] = np.log(df["p1"] / df["p0"])
print("The transformed dataset")
print(df)
The original dataset
   col1  Target
0  Moon     1.0
1   Sun     1.0
2  Moon     0.0
3   Sun     1.0
4   Sun     0.0
******************************
The transformed dataset
   col1  Target        p1        p0       woe
0  Moon     1.0  0.500000  0.500000  0.000000
1   Sun     1.0  0.666667  0.333333  0.693147
2  Moon     0.0  0.500000  0.500000  0.000000
3   Sun     1.0  0.666667  0.333333  0.693147
4   Sun     0.0  0.666667  0.333333  0.693147
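Note that the ratio is undefined when a category contains only one class (p(0) or p(1) is zero). A common remedy is to smooth the counts before taking the log; a minimal sketch, where the smoothing constant 0.5 is an assumption rather than a standard value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["Moon", "Sun", "Moon", "Sun", "Sun"],
    "Target": [1, 1, 1, 1, 0]   # every "Moon" row is positive -> p(0) = 0
})
stats = df.groupby("col1")["Target"].agg(["sum", "count"])
eps = 0.5  # smoothing constant (assumed value)
# Smoothed probability of the positive class per category.
p1 = (stats["sum"] + eps) / (stats["count"] + 2 * eps)
stats["woe"] = np.log(p1 / (1 - p1))
df["woe"] = df["col1"].map(stats["woe"])
```

Without smoothing, the "Moon" category here would get log(1/0), i.e. infinity; with it, every WOE value stays finite.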

Feature interaction

Feature interaction is a method where new features are generated by combining (concatenating, adding, multiplying, etc.) two or more existing features:

df = pd.DataFrame({
    "fea1": ["a", "b", "a", "b", "a"],
    "fea2": ["red", "yellow", "white", "blue", "red"]
})

print("The original feature matrix is")
print(df)

print("=" * 30)

df["fea1_fea2"] = df["fea1"].astype('str') + "_" + df["fea2"].astype('str')
print("The new feature matrix is")
print(df)
The original feature matrix is
  fea1    fea2
0    a     red
1    b  yellow
2    a   white
3    b    blue
4    a     red
==============================
The new feature matrix is
  fea1    fea2 fea1_fea2
0    a     red     a_red
1    b  yellow  b_yellow
2    a   white   a_white
3    b    blue    b_blue
4    a     red     a_red

Datetime features

We can decompose a datetime into year, month, day, hour, etc. features:

df = pd.DataFrame({"col1": [1549720105, 1556744905, 1569763805, 1579780105]})
df["col1"] = pd.to_datetime(df['col1'], unit='s')
print("The original dataset")
print(df)
print("*" * 30)
df["month"] = df["col1"].dt.month
df["week"] = df["col1"].dt.isocalendar().week
df["hour"] = df["col1"].dt.hour
print("The transformed dataset")
print(df)
The original dataset
                 col1
0 2019-02-09 13:48:25
1 2019-05-01 21:08:25
2 2019-09-29 13:30:05
3 2020-01-23 11:48:25
******************************
The transformed dataset
                 col1  month  week  hour
0 2019-02-09 13:48:25      2     6    13
1 2019-05-01 21:08:25      5    18    21
2 2019-09-29 13:30:05      9    39    13
3 2020-01-23 11:48:25      1     4    11
Another useful datetime feature is the interval between two dates:

df = pd.DataFrame({
    "col1": [1529720105, 1536744905, 1529763805, 1519780105],
    "col2": [1549720105, 1556744905, 1569763805, 1579780105]
})
print("The original dataset")
print(df)
df["col1"] = pd.to_datetime(df['col1'], unit='s')
df["col2"] = pd.to_datetime(df['col2'], unit='s')
df["interval"] = df["col2"] - df["col1"]
df["interval_days"] = df["interval"].dt.days
print("The transformed dataset")
print(df)
The original dataset
         col1        col2
0  1529720105  1549720105
1  1536744905  1556744905
2  1529763805  1569763805
3  1519780105  1579780105
The transformed dataset
                 col1                col2          interval  interval_days
0 2018-06-23 02:15:05 2019-02-09 13:48:25 231 days 11:33:20            231
1 2018-09-12 09:35:05 2019-05-01 21:08:25 231 days 11:33:20            231
2 2018-06-23 14:23:25 2019-09-29 13:30:05 462 days 23:06:40            462
3 2018-02-28 01:08:25 2020-01-23 11:48:25 694 days 10:40:00            694
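Beyond whole days, the interval can also be expressed in seconds via .dt.total_seconds(), which keeps the sub-day remainder that .dt.days truncates away. A sketch with the same kind of Unix-timestamp data:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1529720105, 1536744905],
    "col2": [1549720105, 1556744905]
})
df["col1"] = pd.to_datetime(df["col1"], unit="s")
df["col2"] = pd.to_datetime(df["col2"], unit="s")
# total_seconds() returns the full interval length as a float in seconds.
df["interval_seconds"] = (df["col2"] - df["col1"]).dt.total_seconds()
```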