
PyTorch Tutorial for Beginners: Build a Multiple Regression Model from Scratch

Hands-on PyTorch: Building a 3-layer neural network for multiple regression

Linear Regression with PyTorch | Image generated by AI. Google, 2025. https://gemini.google.com

For some time before LLMs became the hype, there was an almost visible line separating Machine Learning frameworks from Deep Learning frameworks.

The conversation centered on Scikit-Learn, XGBoost, and similar libraries for ML, while PyTorch and TensorFlow dominated the scene when Deep Learning was the subject.

Since the AI explosion, though, I have seen PyTorch dominate the scene much more than TensorFlow. Both frameworks are really powerful, enabling Data Scientists to solve many kinds of problems, Natural Language Processing being one of them, which has increased the popularity of Deep Learning once again.

Well, in this post, my idea is not to talk about NLP. Instead, I will work through a multivariable linear regression problem with two objectives in mind:

  • Teaching how to create a model using PyTorch
  • Sharing knowledge about Linear Regression that is not always found in other tutorials.

Let’s dive in.

Preparing the Data

Alright, let me spare you a fancy definition of Linear Regression. You have probably seen one too many times in countless tutorials all over the Internet. Suffice it to say that when you have a variable Y that you want to predict, and another variable X that can explain Y’s variation with a straight line, that is, in essence, Linear Regression.
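As a quick refresher, that straight line is found by minimizing squared errors, which has a closed-form solution. Here is a minimal sketch with made-up numbers (not the Abalone data):

```python
import numpy as np

# Toy data following y = 2x + 1 with a little noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, size=x.size)

# Least-squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(round(slope, 1), round(intercept, 1))  # close to 2.0 and 1.0
```

With more than one X column, the same idea extends to multiple regression, which is what we will build below.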

Dataset

For this exercise, let’s use the Abalone dataset [1].

Nash, W., Sellers, T., Talbot, S., Cawthorn, A., & Ford, W. (1994). Abalone [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C55C7W.

According to the dataset documentation, the age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age.

So, let us go ahead and load the data. Additionally, we will One Hot Encode the variable Sex, since it is the only categorical one.

# Data Load
from ucimlrepo import fetch_ucirepo
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from feature_engine.encoding import OneHotEncoder

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

# One Hot Encode Sex
ohe = OneHotEncoder(variables=['Sex'])
X = ohe.fit_transform(X)

# View
df = pd.concat([X,y], axis=1)

Here’s the dataset.

Dataset header. Image by the author.

So, in order to create a better model, let’s explore the data.

Exploring the Data

The first steps I like to perform when exploring a dataset are:

1. Checking the target variable’s distribution.

# Looking at our Target variable
plt.hist(y)
plt.title('Rings [Target Variable] Distribution');

The graphic shows that the target variable is not normally distributed. That can impact the regression, but usually can be corrected with a power transformation, such as log or Box-Cox.

Target variable distribution. Image by the author.
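To see why a power transformation such as the log helps, here is a small sketch with simulated right-skewed data (not the actual Rings column), comparing a simple skewness estimate before and after the transform:

```python
import numpy as np

def skewness(a):
    # Classic third-moment skewness estimate
    a = np.asarray(a, dtype=float)
    return np.mean((a - a.mean()) ** 3) / a.std() ** 3

rng = np.random.default_rng(0)
y = rng.lognormal(mean=2, sigma=0.5, size=5000)  # right-skewed, like ring counts

print(skewness(y))          # clearly positive (long right tail)
print(skewness(np.log(y)))  # close to 0: roughly symmetric after the transform
```

The log pulls in the long right tail, which is exactly what we will do to Rings later.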

2. Look at the statistical description.

The stats show us important information like the mean and standard deviation, and make it easy to spot discrepancies in the minimum or maximum values. The explanatory variables look fine: they fall within a small range and share the same scale. The target variable (Rings) is on a different scale.

# Statistical description
df.describe()

Statistical description. Image by the author.

Next, let’s check the correlations.

# Looking at the correlations
(df
 .drop(['Sex_M', 'Sex_I', 'Sex_F'],axis=1)
 .corr()
 .style
 .background_gradient(cmap='coolwarm')
)

Correlations. Image by the author.

The explanatory variables have a moderate to strong correlation with Rings. We can also see some collinearity between Whole_weight and Shucked_weight, Viscera_weight, and Shell_weight. Length and Diameter are also collinear. We can test removing some of them later.
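A common way to quantify collinearity like this is the variance inflation factor (VIF): regress each feature on the others and compute 1/(1 − R²). The sketch below uses simulated columns loosely mimicking the weight variables (an assumption for illustration, not the actual dataset):

```python
import numpy as np

def r_squared(A, b):
    # R^2 of regressing b on A (with an intercept)
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    resid = b - A1 @ coef
    return 1.0 - resid.var() / b.var()

def vif(X, j):
    # Variance inflation factor of column j: 1 / (1 - R^2)
    return 1.0 / (1.0 - r_squared(np.delete(X, j, axis=1), X[:, j]))

rng = np.random.default_rng(1)
shucked = rng.normal(0.4, 0.1, 1000)
shell = rng.normal(0.3, 0.1, 1000)
whole = shucked + shell + rng.normal(0, 0.01, 1000)  # nearly the sum of the others

X = np.column_stack([shucked, shell, whole])
print([round(vif(X, j)) for j in range(3)])  # all three are heavily inflated

X_reduced = X[:, :2]  # drop the redundant 'whole' column
print([round(vif(X_reduced, j), 2) for j in range(2)])  # back near 1
```

A VIF above roughly 5–10 is usually read as problematic, which is why we drop one of the redundant columns later.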

sns.pairplot(df);

When we plot the pairwise scatterplots and look at the relationship of the variables with Rings, we can quickly identify some problems:

  • The assumption of homoscedasticity is violated. This means that the relationship is not homogeneous in terms of variance.
  • Look how the plots form a cone shape, increasing the variance of Y as the X values increase. When estimating the value of Rings for higher values of the X variables, the estimate will not be very accurate.
  • The variable Height has at least two outliers that are very visible when Height > 0.3.

Pairplots no transformation. Image by the author.

Removing the outliers and transforming the target variable to logarithms will result in the next plot of the pairs. It is better, but still doesn’t solve the homoscedasticity problem.

Pairplots after transformation. Image by the author.
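The cone shape has a simple numerical reading: the residual spread grows with X. A quick simulation with made-up data makes the violation easy to verify:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 5000)
y = 10 * x + rng.normal(0, 1 + 4 * x)  # noise grows with x -> cone shape

# Compare the spread of y around its trend in the lower vs upper half of x
resid = y - 10 * x
low, high = resid[x < 0.5], resid[x >= 0.5]
print(low.std(), high.std())  # the upper half is clearly noisier
```

The same comparison on the Abalone residuals would show the violation persisting even after the log transform, as the pairplots suggest.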

Another quick exploration we can do is plotting some graphics to check the relationship of the variables when grouped by the Sex variable.

The variable Diameter has the most linear relationship when Sex=I, but that’s all.

# Create a FacetGrid with scatterplots
sns.lmplot(x="Diameter", y="Rings", hue="Sex", col="Sex", order=2, data=df);

Diameter x Rings. Image by the author.

On the other hand, Shell_weight has too much dispersion for high values, distorting the linear relationship.

# Create a FacetGrid with scatterplots
sns.lmplot(x="Shell_weight", y="Rings", hue="Sex", col="Sex", data=df);

Shell_weight x Rings. Image by the author.

All of this shows that a Linear Regression model would be really challenging for this dataset, and will probably fail. But we still want to do it.

By the way, I don’t remember seeing a post where we actually go through what went wrong. So, by doing this, we can also learn valuable lessons.

Modeling: Using Scikit-Learn

Let’s run the sklearn model and evaluate it using Root Mean Squared Error.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

df2 = df.query('Height < 0.3 and Rings > 2 ').copy()
X = df2.drop(['Rings'], axis=1)
y = np.log(df2['Rings'])

lr = LinearRegression()
lr.fit(X, y)

predictions = lr.predict(X)

df2['Predictions'] = np.exp(predictions)
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))
2.2383762717104916

If we look at the header, we can confirm that the model struggles with estimates for higher values (e.g., rows 0, 6, 7, and 9).

Header with predictions. Image by the author.

One Step Back: Trying Other Transformations

Alright. So what can we do now?

Probably remove more outliers and try again. Let’s use an unsupervised algorithm to find them: we will apply the Local Outlier Factor, flagging 5% of the observations as outliers and dropping them.

We will also remove the multicollinearity by dropping Whole_weight and Length.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# fetch dataset
abalone = fetch_ucirepo(id=1)

# data (as pandas dataframes)
X = abalone.data.features
y = abalone.data.targets

# One Hot Encode Sex
ohe = OneHotEncoder(variables=['Sex'])
X = ohe.fit_transform(X)

# Drop Whole Weight and Length (multicolinearity)
X.drop(['Whole_weight', 'Length'], axis=1, inplace=True)

# View
df = pd.concat([X,y], axis=1)

# Let's create a Pipeline to scale the data and find outliers using Local Outlier Factor
steps = [
('scale', StandardScaler()),
('LOF', LocalOutlierFactor(contamination=0.05))
]
# Fit and predict
outliers = Pipeline(steps).fit_predict(X)

# Add column
df['outliers'] = outliers

# Modeling
df2 = df.query('Height < 0.3 and Rings > 2 and outliers != -1').copy()
X = df2.drop(['Rings', 'outliers'], axis=1)
y = np.log(df2['Rings'])

lr = LinearRegression()
lr.fit(X, y)

predictions = lr.predict(X)

df2['Predictions'] = np.exp(predictions)
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))
2.238174395913869

Same result. Hmm….

Ok, we can keep playing with the variables and feature engineering, and we will start seeing some improvements here and there, like when we add the squares of Height, Diameter, and Shell_weight. That, added to the outlier treatment, drops the RMSE to 2.196.

# Second Order Variables
X['Diameter_2'] = X['Diameter'] ** 2
X['Height_2'] = X['Height'] ** 2
X['Shell_2'] = X['Shell_weight'] ** 2

Certainly, it is fair to note that every variable added to a Linear Regression model will impact the R² and can inflate it, giving a false idea that the model is improving when it is not. In this case, the model is actually improving, since the second-order variables add non-linear components to it. We can confirm that by calculating the adjusted R², which penalizes extra predictors: it went from 0.495 to 0.517.

# Adjusted R²
from sklearn.metrics import r2_score

r2 = r2_score(df2['Rings'], df2['Predictions'])
n = df2.shape[0]
p = X.shape[1]  # number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f'R²: {r2}')
print(f'Adjusted R²: {adj_r2}')

On the other hand, bringing back Whole_weight and Length can improve the numbers a little more, but I would not recommend it. If we do that, we reintroduce multicollinearity and inflate the importance of some variables’ coefficients, leading to potential estimation errors in the future.

Modeling: Using PyTorch

Ok. Now that we have a base model created, the idea is to create a Linear model using Deep Learning and try to beat the RMSE of 2.196.

Right. To start, let me state this upfront: Deep Learning models work better with scaled data. However, since our X variables are all on the same scale, we won’t need to worry about that. So let’s keep moving.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

We need to prepare the data for modeling with PyTorch. Here, we need some adjustments to make the data acceptable by the PyTorch framework, as it won’t take regular pandas dataframes.

  • Let’s use the same data frame from our base model.
  • Split X and Y
  • Transform the Y variable to log
  • Transform both to numpy arrays, since PyTorch won’t take dataframes.

df2 = df.query('Height < 0.3 and Rings > 2 and outliers != -1').copy()
X = df2.drop(['Rings', 'outliers'], axis=1)
y = np.log(df2[['Rings']])

# X and Y to Numpy
X = X.to_numpy()
y = y.to_numpy()

Next, using TensorDataset, we turn X and y into tensors and print one sample.

# Prepare with TensorDataset
# TensorDataset wraps the numpy arrays as a tensor-backed dataset
dataset = TensorDataset(torch.tensor(X).float(), torch.tensor(y).float())

input_sample, label_sample = dataset[0]
print(f'** Input sample: {input_sample}, \n** Label sample: {label_sample}')
** Input sample: tensor([0.3650, 0.0950, 0.2245, 0.1010, 0.1500, 1.0000, 
0.0000, 0.0000, 0.1332, 0.0090, 0.0225]), 
** Label sample: tensor([2.7081])

Then, using DataLoader, we can create batches of data. This means that the Neural Network will process batch_size rows at a time.

# Next, let's use DataLoader
batch_size = 500
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

PyTorch models are best defined as classes.

  • The class is based on the nn.Module, which is PyTorch’s base class for neural networks.
  • We define the model layers we want to use in the init method.
    • super().__init__() ensures the class will behave like a torch object.
  • The forward method describes what happens to the input when passed to the model.

Here, we pass it through Linear layers that we defined in the init method, and use ReLU activation functions to add some non-linearity to the model in the forward pass.

# 2. Creating a class
class AbaloneModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear1 = nn.Linear(in_features=X.shape[1], out_features=128)
    self.linear2 = nn.Linear(128, 64)
    self.linear3 = nn.Linear(64, 32)
    self.linear4 = nn.Linear(32, 1)

  def forward(self, x):
    x = self.linear1(x)
    x = nn.functional.relu(x)
    x = self.linear2(x)
    x = nn.functional.relu(x)
    x = self.linear3(x)
    x = nn.functional.relu(x)
    x = self.linear4(x)
    return x

# Instantiate model
model = AbaloneModel()
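As a sanity check, we can count the trainable parameters of this stack. The sketch below rebuilds the same layer sizes, assuming the 11 input features seen in the tensor sample printed earlier:

```python
import torch.nn as nn

# Same layer sizes as AbaloneModel, with 11 input features
# (the tensor sample printed earlier has 11 values)
layers = [nn.Linear(11, 128), nn.Linear(128, 64), nn.Linear(64, 32), nn.Linear(32, 1)]

# Each Linear layer contributes in*out weights plus out biases
total = sum(p.numel() for layer in layers for p in layer.parameters())
print(total)  # 11905 trainable weights and biases
```

That is a tiny network by Deep Learning standards, but more than enough capacity for this tabular problem.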

Next, let’s try the model for the first time using a script that simulates a Random Search.

  • Create an error criterion for model evaluation.
  • Create a list to hold the results and set best_loss to a high value, so it will be replaced by better loss numbers during the iterations.
  • Set the range for the learning rate. We will use exponents from -2 to -5 (i.e., from 0.01 to 0.00001).
  • Set a range for the momentum from 0.90 to 0.99.
  • Get the data.
  • Zero the gradients to clear calculations from previous iterations.
  • Fit the model.
  • Compute the loss and register the best model’s numbers.
  • Compute the gradients of the weights and biases with the backward pass.
  • Iterate N times and print the best model.

# Mean Squared Error (MSE) is standard for regression
criterion = nn.MSELoss()

# Random Search
values = []
best_loss = 999
for idx in range(1000):
  # Randomly sample a learning rate exponent between 2 and 5
  factor = np.random.uniform(2,5)
  lr = 10 ** -factor

  # Randomly select a momentum between 0.90 and 0.99
  momentum = np.random.uniform(0.90, 0.99)

  # 1. Get Data
  feature, target = dataset[:]
  # 2. Zero Gradients: Clear old gradients before the backward pass
  optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
  optimizer.zero_grad()
  # 3. Forward Pass: Compute prediction
  y_pred = model(feature)
  # 4. Compute Loss
  loss = criterion(y_pred, target)
  # 4.1 Register best Loss
  if loss.item() < best_loss:
    best_loss = loss.item()
    best_lr = lr
    best_momentum = momentum
    best_idx = idx

  # 5. Backward Pass: Compute gradient of the loss w.r.t W and b'
  loss.backward()
  # 6. Update Parameters: Adjust W and b using the calculated gradients
  optimizer.step()
  values.append([idx, lr, momentum, loss.item()])

print(f'n: {best_idx},lr: {best_lr}, momentum: {best_momentum}, loss: {best_loss}')
n: 999,lr: 0.004782946959508322, momentum: 0.9801209929050066, loss: 0.06135804206132889

Once we get the best learning rate and momentum, we can move on.

# --- 3. Loss Function and Optimizer ---

# Mean Squared Error (MSE) is standard for regression
criterion = nn.MSELoss()

# Stochastic Gradient Descent (SGD) with a small learning rate (lr)
optimizer = optim.SGD(model.parameters(), lr=0.004, momentum=0.98)

Then, we will re-train this model, using the same steps as before, but this time keeping the same learning rate and momentum.

Fitting a PyTorch model needs a longer script than the regular fit() method from Scikit-Learn. But it is not a big deal. The structure will always be similar to these steps:

  1. Activate training mode with model.train().
  2. Create a loop for the number of iterations you want. Each full pass over the data is called an epoch.
  3. Get the batches from the dataloader.
  4. Zero the gradients from previous passes with optimizer.zero_grad().
  5. Compute the predictions with model(X).
  6. Calculate the loss using criterion(y_pred, target).
  7. Do the backward pass to compute the gradients of the weights and biases: loss.backward().
  8. Update the weights and biases with optimizer.step().

We will train this model for 1,000 epochs (iterations). Here, we only add a step to track the best model along the way, so we make sure to keep the weights that produced the best loss.

# 4. Training
torch.manual_seed(42)
NUM_EPOCHS = 1001
loss_history = []
best_loss = 999

# Put model in training mode
model.train()

for epoch in range(NUM_EPOCHS):
  for data in dataloader:

    # 1. Get Data
    feature, target = data

    # 2. Zero Gradients: Clear old gradients before the backward pass
    optimizer.zero_grad()

    # 3. Forward Pass: Compute prediction
    y_pred = model(feature)

    # 4. Compute Loss
    loss = criterion(y_pred, target)
    loss_history.append(loss.item())

    # Track the best model
    if loss.item() < best_loss:
      best_loss = loss.item()
      # state_dict() returns live references, so save a detached copy
      best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}

    # 5. Backward Pass: Compute gradient of the loss w.r.t W and b'
    loss.backward()

    # 6. Update Parameters: Adjust W and b using the calculated gradients
    optimizer.step()

  # Print status every 200 epochs
  if epoch % 200 == 0:
    print(epoch, loss.item())
    print(f'Best Loss: {best_loss}')

# Restore the best weights after training
model.load_state_dict(best_model_state)
0 0.061786893755197525
Best Loss: 0.06033024191856384
200 0.036817338317632675
Best Loss: 0.03243456035852432
400 0.03307393565773964
Best Loss: 0.03077109158039093
600 0.032522525638341904
Best Loss: 0.030613820999860764
800 0.03488151729106903
Best Loss: 0.029514113441109657
1000 0.0369877889752388
Best Loss: 0.029514113441109657

Nice. The model is trained. Now it is time to evaluate.

Evaluation

Let’s check whether this model did better than the regular regression. For that, I will put the model in evaluation mode using model.eval(), so PyTorch knows it needs to switch from training behavior to inference mode. That turns off dropout and makes batch normalization use its running statistics, for example.
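As a minimal illustration of that mode switch, here is a standalone Dropout layer (our model doesn’t actually contain one; this is just to show the behavior):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))  # roughly half the entries zeroed, the rest scaled by 2

drop.eval()
print(drop(x))  # identical to the input: dropout is turned off
```

Our model has no dropout or normalization layers, but calling model.eval() is still good practice before inference.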

# Get features
features, targets = dataset[:]

# Get Predictions
model.eval()
with torch.no_grad():
  predictions = model(features)

# Add to dataframe
df2['Predictions'] = np.exp(predictions.detach().numpy())

# RMSE
print(root_mean_squared_error(df2['Rings'], df2['Predictions']))
2.1108551025390625

The improvement was modest, about 4%.

Let’s look at some predictions from each model.

Predictions from both models. Image by the author.

Both models get very similar results. They struggle more as the number of Rings grows. That is due to the cone-shaped spread of the target variable.

If we think that through for a moment:

  • As the number of Rings increases, there is more variance in the explanatory variables.
  • An abalone with 15 rings falls within a much wider range of measurement values than one with 4 rings.
  • This confuses the model, because it needs to draw a single line through the middle of data that is not that linear.

Before You Go

We learned a lot in this project:

  • How to explore data.
  • How to check if the linear model would be a good option.
  • How to create a PyTorch model for a multivariable Linear Regression.

In the end, we saw that a target variable whose variance is not homogeneous, even after power transformations, can lead to a low-performing model. Our model is still better than shooting the average value for every prediction, but the error remains high, at about 20% of the mean value.
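“Shooting the average” is a useful baseline to keep in mind: if you predict the target’s mean for every row, the RMSE equals the target’s standard deviation. A quick sketch with simulated values (not the real Rings column) confirms that:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(10, 3, 2000)  # stand-in target: mean ~10, std ~3

# Predict the mean for every observation
baseline_pred = np.full_like(y, y.mean())
rmse = np.sqrt(np.mean((y - baseline_pred) ** 2))

print(rmse, y.std())  # the two values match
```

So any model worth keeping must beat the target’s standard deviation, which both of ours do.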

We tried to use Deep Learning to improve the result, but all that power was not enough to lower the error considerably. I would probably go with the Scikit-Learn model, since it is simpler and more explainable.

Another option to try to improve the results would be creating a custom ensemble model combining a Random Forest with Linear Regression. But that is a task I leave to you, if you want.

If you liked this content, find me on my website.

https://gustavorsantos.me

GitHub Repository

The code for this exercise.

https://github.com/gurezende/Linear-Regression-PyTorch

References

[1] Abalone Dataset, UCI Machine Learning Repository (CC BY 4.0 license). https://archive.ics.uci.edu/dataset/1/abalone

[2] What does model.eval() do in PyTorch? https://stackoverflow.com/questions/60018578/what-does-model-eval-do-in-pytorch

https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.eval

[3] PyTorch Docs, torch.nn. https://docs.pytorch.org/docs/stable/nn.html

[4] Kaggle Notebook. https://www.kaggle.com/code/samlakhmani/s4e4-deeplearning-with-oof-strategy

[5] GitHub Repo. https://github.com/gurezende/Linear-Regression-PyTorch


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
