# Linear Regression

Inspiration by Daniel Voigt Godoy’s books

# Linear regression

``````import platform
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.optim as optim
import torch.nn as nn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from shared.step_by_step import StepByStep
from scipy.linalg import norm

from torchviz import make_dot
plt.style.use('fivethirtyeight')``````
``````The autoreload extension is already loaded. To reload it, use:

# Generate some data

we’ll use numpy for this, and also need to split the data, can also use numpy for this

``````np.random.seed(43)

b_true = 2.
w_true = -0.5
N = 100

x = np.random.rand(N,1)
epsilon = 0.05 * np.random.randn(N,1)
y = w_true*x + b_true + epsilon

plt.plot(x,y,'.')
plt.show()``````

# Linear regression with sklearn

Of course we can make a fit using sklearn:

``````reg = LinearRegression().fit(x, y)
r2_coef = reg.score(x, y)
print(reg.coef_, reg.intercept_, r2_coef)``````
``[[-0.52894853]] [2.01635764] 0.9014715901595961``

but the point is to learn PyTorch and solve much bigger problems.

• data set is the object that holds features and labels together,
• split the data into train and valid,
• convert to pytorch tensors,
• create datasets,
``````np.random.seed(43)
indices = np.arange(N)
np.random.shuffle(indices)
train_indices = indices[:int(0.8*N)]
val_indices = indices[int(0.8*N):]

device = 'cuda' if torch.cuda.is_available() else 'cpu'

train_x = torch.tensor(x[train_indices], dtype=torch.float32, device=device)
train_y = torch.tensor(y[train_indices], dtype=torch.float32, device=device)
val_x = torch.tensor(x[val_indices], dtype=torch.float32, device=device)
val_y = torch.tensor(y[val_indices], dtype=torch.float32, device=device)

train_dataset = TensorDataset(train_x, train_y)
val_dataset = TensorDataset(val_x, val_y)

# Model, loss, and optimizer

``````torch.random.manual_seed(42)
model = torch.nn.Linear(1,1, bias=True, device=device)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()``````

# Train

``````model.reset_parameters()
sbs = StepByStep(model, optimizer, loss_fn)
sbs.train(30)``````
``sbs.model.state_dict()``
``OrderedDict([('weight', tensor([[-0.5267]])), ('bias', tensor([2.0177]))])``
``sbs.plot_losses()``

Note btw that `alex` and `sbs.model` are the same object:

``assert id(sbs.model) == id(model)``

# Predict

``````test = np.random.rand(100,1)
test_predictions = sbs.predict(test)
plt.plot(x,y,'.')
plt.plot(test,test_predictions,'.')
plt.show()``````

``sbs.save_checkpoint('linear.pth')``
``sbs.load_checkpoint('linear.pth')``

# Visualize model

One can use `make_dot(yhat)` locally. I can’t make graphviz work on GitHub, but the output looks like this:

# Set up tensorboard

One can add tensorboard to monitor losses, this will be important when having long training. We can start tensorboard from terminal using `tensorboard --logdir runs` (or from notebook if using extension via `%load_ext tensorboard`). The tensorboard should be running at http://localhost:6006/ (ignore `"TensorFlow installation not found"` message, we don’t need it). Make sure path is right, tensorboard will be empty if it can’t find the `runs` folder.

# Tips dataset

Let’s study the linear regression model from classical perspective. Let’s load a `tips` dataset where independent variables are: `total_bill, sex, smoker, day, size, time`, and depended variable is `tips`. First we simplify model by keeping only 1 independed variable, `total_bill`:

``````# Load the dataset
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Let’s split the data:

``train, valid = train_test_split(tips, test_size=0.2, random_state=42)``
``````plt.subplots(figsize=(20,12))
a = sns.heatmap(train.corr(numeric_only=False), cmap="seismic", annot=True, vmin=-1, vmax=1, fmt='.1f',square = True)``````
```---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[188], line 2
1 plt.subplots(figsize=(20,12))
----> 2 a = sns.heatmap(train.corr(numeric_only=False), cmap="seismic", annot=True, vmin=-1, vmax=1, fmt='.1f',square = True)

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/frame.py:10307, in DataFrame.corr(self, method, min_periods, numeric_only)
10305 cols = data.columns
10306 idx = cols.copy()
> 10307 mat = data.to_numpy(dtype=float, na_value=np.nan, copy=False)
10309 if method == "pearson":
10310     correl = libalgos.nancorr(mat, minp=min_periods)

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/frame.py:1843, in DataFrame.to_numpy(self, dtype, copy, na_value)
1841 if dtype is not None:
1842     dtype = np.dtype(dtype)
-> 1843 result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
1844 if result.dtype is not dtype:
1845     result = np.array(result, dtype=dtype, copy=False)

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/internals/managers.py:1770, in BlockManager.as_array(self, dtype, copy, na_value)
1768             arr = arr.astype(dtype, copy=False)
1769 else:
-> 1770     arr = self._interleave(dtype=dtype, na_value=na_value)
1771     # The underlying data was copied within _interleave
1772     copy = False

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/internals/managers.py:1829, in BlockManager._interleave(self, dtype, na_value)
1823 rl = blk.mgr_locs
1824 if blk.is_extension:
1825     # Avoid implicit conversion of extension blocks to object
1826
1827     # error: Item "ndarray" of "Union[ndarray, ExtensionArray]" has no
1828     # attribute "to_numpy"
-> 1829     arr = blk.values.to_numpy(  # type: ignore[union-attr]
1830         dtype=dtype,
1831         na_value=na_value,
1832     )
1833 else:
1834     arr = blk.get_values(dtype)

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/arrays/base.py:513, in ExtensionArray.to_numpy(self, dtype, copy, na_value)
482 def to_numpy(
483     self,
484     dtype: npt.DTypeLike | None = None,
485     copy: bool = False,
486     na_value: object = lib.no_default,
487 ) -> np.ndarray:
488     """
489     Convert to a NumPy ndarray.
490
(...)
511     numpy.ndarray
512     """
--> 513     result = np.asarray(self, dtype=dtype)
514     if copy or na_value is not lib.no_default:
515         result = result.copy()

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/arrays/_mixins.py:85, in ravel_compat.<locals>.method(self, *args, **kwargs)
82 @wraps(meth)
83 def method(self, *args, **kwargs):
84     if self.ndim == 1:
---> 85         return meth(self, *args, **kwargs)
87     flags = self._ndarray.flags
88     flat = self.ravel("K")

File ~/mambaforge/envs/website/lib/python3.10/site-packages/pandas/core/arrays/categorical.py:1609, in Categorical.__array__(self, dtype)
1607 ret = take_nd(self.categories._values, self._codes)
1608 if dtype and not is_dtype_equal(dtype, self.categories.dtype):
-> 1609     return np.asarray(ret, dtype)
1610 # When we're a Categorical[ExtensionArray], like Interval,
1611 # we need to ensure __array__ gets all the way to an
1612 # ndarray.
1613 return np.asarray(ret)

ValueError: could not convert string to float: 'No'```
``````X_train = train.total_bill.values.reshape(-1,1)
y_train = train.tip.values.reshape(-1,1)
X_valid = valid.total_bill.values.reshape(-1,1)
y_valid = valid.tip.values.reshape(-1,1)
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape``````
``((195, 1), (195, 1), (49, 1), (49, 1))``

## sklearn.LinearRegression

Let’s first use very simple linear regression model from sklearn, and only 2 columns:

``````lr = LinearRegression()
lr.fit(X_train, y_train)
r2 = lr.score(X_valid, y_valid)
sns.regplot(data=tips, x="total_bill", y="tip", line_kws={"color":"r","alpha":0.7,"lw":5})
plt.title(f'\$w_0\$ = {float(lr.intercept_):0.3f}, \$w_1\$ = {float(lr.coef_):0.3f}, \$r^2\$ = {r2:0.3f}')
plt.show()``````

So we see that on average people left 10.5% tip. The r2 score is 0.597.

## Manual

``````w0 = 0.5
w1 = 0.5
N = len(X_train)
alpha = 2e-4

w1_list = [w0]
w0_list = [w1]
for i in range(50):
y_new = w0 + X_train*w1
cost_function = 1/N *np.sum((y_new - y_train)**2)

w0 -= alpha*(1/N * 2 * np.sum(y_new - y_train))
w1 -= alpha*(1/N * 2 * np.sum((y_new - y_train)*X_train))

w0_list.append(w0)
w1_list.append(w1)

sns.lineplot(x=range(len(w0_list)), y=w0_list, color='red', label='w0')
sns.lineplot(x=range(len(w1_list)), y=w1_list, color='blue', label='w1')

plt.title(f'\$w_0\$ = {w0:0.3f}, \$w_1\$ = {w1:0.3f}')
plt.show()``````

Interesting that w1 is slightly off 12.1% vs 10% (regularization is not a culprit). Let’s plot together with the scatter plot:

``````# Create scatterplot
sns.scatterplot(x="total_bill", y="tip", data=tips)

# Add title and axis labels
plt.title("Tips vs Total Bill")
plt.xlabel("Total Bill")
plt.ylabel("Tip")

# seaborn plot a line
sns.lineplot(x=X_train.flatten(), y=w0 + w1*X_train.flatten(), color='red', label='Linear Regression')
# Show the plot
plt.show()``````

## Multivariate linear regression

Let’s now include all variables:

``_ = sns.boxplot(x="day", y="total_bill", data=tips)``
``_ = sns.boxplot(x="sex", y="tip", data=tips)``

This graph might not mean males are bigger tipers, since it might have been that more males ate in bigger groups as well. Plotting relative tip (i.e. tip/total_bill) might be more informative:

``````tips['relative_tip'] = tips['tip'] / tips['total_bill']
_ = sns.boxplot(x="sex", y="relative_tip", data=tips)``````

Indeed, women left larger percentage of tip (then again, they might have had smaller portionsl there are many angles one can look at this data). How about compare group sizes:

``_ = sns.violinplot(x="sex", y="size", data=tips)``

That seems very similar distribution.

``_ = sns.boxplot(x="size", y="tip", data=tips)``

# Multivariable linear regression

Let’s run multivariable linear regression, we first need to encode all the categorical variables into numerical:

``tips.dtypes``
``````total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object``````
``train.head()``
total_bill tip sex smoker day time size
228 13.28 2.72 Male No Sat Dinner 2
208 24.27 2.03 Male Yes Sat Dinner 2
96 27.28 4.00 Male Yes Fri Dinner 2
167 31.71 4.50 Male No Sun Dinner 4
84 15.98 2.03 Male No Thur Lunch 2
``````train_n = pd.get_dummies(train, columns=["sex"], prefix="sex")
train_n = pd.get_dummies(train_n, columns=["smoker"], prefix="smoker")
train_n = pd.get_dummies(train_n, columns=["time"], prefix="time")
train_n = pd.get_dummies(train_n, columns=["day"], prefix="day")``````
``````valid_n = pd.get_dummies(valid, columns=["sex"], prefix="sex")
valid_n = pd.get_dummies(valid_n, columns=["smoker"], prefix="smoker")
valid_n = pd.get_dummies(valid_n, columns=["time"], prefix="time")
valid_n = pd.get_dummies(valid_n, columns=["day"], prefix="day")``````
``````X_train = train_n.drop('tip', axis=1).values
y_train = train_n.tip.values.reshape(-1,1)
X_valid = valid_n.drop('tip', axis=1).values
y_valid = valid_n.tip.values.reshape(-1,1)
X_train.shape, y_train.shape, X_valid.shape, y_valid.shape``````
``((195, 12), (195, 1), (49, 12), (49, 1))``

Let’s train the model:

``````lr = LinearRegression()
lr.fit(X_train, y_train)
lr.coef_``````
``````array([[ 0.09469974,  0.23348393,  0.01440964, -0.01440964, -0.09617663,
0.09617663,  0.04747858, -0.04747858, -0.07564606,  0.10407492,
-0.08171038,  0.05328153]])``````

and these are the coefficients. Let’s predict on valid dataset:

``y_valid_pred = lr.predict(X_valid)``

and let’s plot the predicitons and targets:

``````# write a code to plot y_valid_pred vs y_valid
plt.scatter(y_valid, y_valid_pred)
plt.xlabel('Actual output')
plt.ylabel('Predicted output')
plt.title('y_valid_pred vs y_valid')
plt.show()``````

Measure of accuracy is R^2 (i.e. coefficient of determinaltion):

``````r2 = lr.score(X_valid, y_valid)
r2``````
``0.43730181943482505``

Which is the same as:

``````u = ((y_valid_pred - y_valid)**2).sum()
v = ((y_valid - y_valid.mean()) ** 2).sum()
1 - u/v``````
``0.43730181943482505``

And this is worse then univariable linear regression, meaning that in case of LinearRegression adding other variables made predictions worse.