Statistics

import matplotlib.pyplot as plt

Correlation

The Pearson coefficient between two columns measures their linear correlation. It is a number between -1 and 1 that gives the strength and direction of the relationship between two variables: \(\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}\)
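
As a minimal sketch of the definition, the coefficient can be computed by hand and compared with NumPy's built-in. The arrays `x` and `y` are synthetic data made up for illustration, not part of the original notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is roughly linear in x
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

# Straight from the definition: cov(X, Y) / (sigma_X * sigma_Y)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
rho_manual = cov_xy / (x.std() * y.std())

# np.corrcoef returns the correlation matrix; entry [0, 1] is the (x, y) coefficient
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```

For a whole DataFrame, `df.corr()` returns the same coefficient for every pair of numeric columns.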


Label Encoder

LabelEncoder can be used to normalize labels: each distinct label is mapped to an integer between 0 and n_classes - 1.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2])
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

Convert type

df[column].apply(str)
df[column].apply(float)
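
A small sketch of the same conversion with `astype`, which converts the whole column at once instead of calling the function per element. The frame `df_types` and its `price` column are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings
df_types = pd.DataFrame({"price": ["1.5", "2.0", "3.25"]})

# Element-wise conversion with apply
as_float_apply = df_types["price"].apply(float)

# Vectorized equivalent with astype
as_float_astype = df_types["price"].astype(float)

print(as_float_astype.dtype)
```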

Describe

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
df.describe()
               a          b          c          d          e
count  10.000000  10.000000  10.000000  10.000000  10.000000
mean    0.077571   0.075314   0.330951  -0.290290   0.082054
std     0.943048   1.092469   1.120891   1.014170   1.097351
min    -1.931986  -2.253685  -0.828068  -2.402229  -2.275520
25%    -0.184616  -0.435616  -0.301013  -0.810305  -0.367190
50%     0.245204   0.308878   0.022463   0.225803   0.077744
75%     0.579545   0.909817   0.609062   0.402518   0.814547
max     1.300892   1.315679   2.812638   0.529603   1.571569

Missing values visualization

df.iloc[1,2] = np.nan
df.iloc[3,4] = np.nan
import missingno as msno
msno.matrix(df)
<Axes: >

df['c'] = df['c'].fillna(0)  # assign back; in-place fillna on a selected column is deprecated in recent pandas
msno.matrix(df)
<Axes: >
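
When a plot is not needed, missing values can also be counted directly. This is a minimal sketch on a small made-up frame (`demo` is a hypothetical name), using only `isna` and `sum`.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame with two holes punched in it
demo = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=["a", "b", "c"])
demo.iloc[1, 2] = np.nan
demo.iloc[3, 0] = np.nan

# Number of missing values in each column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```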

Counter

df.dtypes
a    float64
b    float64
c    float64
d    float64
e    float64
dtype: object
df.a = df.a.apply(str)
df.dtypes
a     object
b    float64
c    float64
d    float64
e    float64
dtype: object
from collections import Counter
Counter(df.dtypes)
Counter({dtype('O'): 1, dtype('float64'): 4})

Skew

from scipy.stats import skew
def plot_skew(array):
    fig, axs = plt.subplots(1,2, figsize=(10,3))
    axs[0].plot(array)
    values, ranges = np.histogram(array, 100)
    axs[1].plot(ranges[:-1], values)
    axs[0].set_title(f'Skew: {skew(array)}')
    axs[1].set_title('Distribution')
    plt.show()
np.random.seed(42)
plot_skew(np.random.randn(10))
plot_skew(np.random.randn(1000))
plot_skew(np.random.randn(1000000))
plot_skew(np.arange(1000))
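
For reference, a sketch of what `skew` computes with its default settings: the third central moment divided by the second central moment to the power 1.5. The helper `sample_skewness` and the two sample arrays are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import skew

def sample_skewness(values):
    """Biased sample skewness: m3 / m2**1.5 (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)  # second central moment
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

rng = np.random.default_rng(42)
symmetric = rng.normal(size=10_000)          # skew close to 0
right_skewed = rng.exponential(size=10_000)  # long right tail, positive skew

print(sample_skewness(symmetric), skew(symmetric))
print(sample_skewness(right_skewed), skew(right_skewed))
```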

Box-Cox transformation

from scipy.special import boxcox1p
from scipy.special import boxcox

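The Box-Cox transformation is a power transform that reshapes a non-normal, strictly positive variable so that it is closer to a normal distribution. For a parameter \(\lambda\), it maps \(x\) to \(\frac{x^\lambda - 1}{\lambda}\) when \(\lambda \neq 0\) and to \(\ln x\) when \(\lambda = 0\).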

a = np.random.beta(1, 3, 5000)
plot_skew(a)
plot_skew(boxcox(a, 0.5))
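
A minimal sketch of the formula behind the two imported functions: `boxcox(x, lmbda)` is (x**lmbda - 1) / lmbda, or log(x) when lmbda is 0, and `boxcox1p(x, lmbda)` applies the same transform to 1 + x. The helper `boxcox_manual` and the array `x` are hypothetical, written only to check the formula.

```python
import numpy as np
from scipy.special import boxcox, boxcox1p

def boxcox_manual(x, lmbda):
    """Box-Cox transform written out by hand (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

x = np.array([0.1, 0.5, 1.0, 2.0])  # strictly positive inputs

print(boxcox_manual(x, 0.5))
print(boxcox(x, 0.5))
print(boxcox1p(x, 0.5))  # same as boxcox(1 + x, 0.5)
```

Here lambda is fixed by hand. If it should be estimated from the data instead, `scipy.stats.boxcox` returns both the transformed values and the fitted lambda when no lambda is passed.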

One-hot encoding

s = pd.Series(list('abca'))
s
0    a
1    b
2    c
3    a
dtype: object
pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
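
The same function also works on a DataFrame, encoding only the selected columns. A short sketch with a made-up frame (`frame`, `color`, and `size` are hypothetical names); `dtype=int` is passed so the indicator columns come out as 0/1 integers.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
frame = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Encode only the 'color' column; 'size' is left untouched
encoded = pd.get_dummies(frame, columns=["color"], dtype=int)
print(encoded)
```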

Exponential

Calculate exp(x) - 1 for all elements in the array: np.expm1
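
A short sketch of why `np.expm1` is useful: for very small x it keeps precision that `np.exp(x) - 1` loses, and it is the inverse of `np.log1p`. The sample values are made up for illustration.

```python
import numpy as np

x_small = 1e-10

naive = np.exp(x_small) - 1   # loses digits: exp(x) is rounded near 1 before subtracting
accurate = np.expm1(x_small)  # computed directly, stays close to x_small

print(naive, accurate)

# expm1 undoes log1p, a common pair for transforming skewed targets and back
values = np.array([0.0, 1.0, 10.0, 1000.0])
roundtrip = np.expm1(np.log1p(values))
print(roundtrip)
```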