Neural Network

Neural networks are a fundamental technology driving many advancements in artificial intelligence (AI) and machine learning. Modeled after the structure and functionality of the human brain, neural networks excel at recognizing patterns, making decisions, and solving complex problems. Their applications span diverse fields, including healthcare, finance, entertainment, and beyond.

At its core, a neural network is a computational system inspired by the biological networks of neurons in the brain. It consists of layers of interconnected nodes (neurons), each designed to process and transmit information. Neural networks are typically organized into three main types of layers:

  • Input Layer: This layer receives the raw data input.
  • Hidden Layers: These intermediate layers process the data using mathematical functions. The more layers and neurons, the more complex patterns the network can identify—a concept known as deep learning.
  • Output Layer: This layer produces the final result, whether it’s a prediction, classification, or decision.

Neural networks operate by passing data through layers of neurons. Each connection between neurons has a weight, representing its importance. The steps involved include:

  • Forward Propagation: Input data flows through the network, with each neuron applying an activation function to determine the output.
  • Loss Calculation: The difference between the predicted output and the actual value (error) is measured using a loss function.
  • Backpropagation: The network adjusts the weights of its connections to minimize the error, using optimization techniques like gradient descent.

Neuron

The concept of a neuron is at the heart of artificial intelligence (AI), particularly in neural networks. Inspired by biological neurons in the human brain, an artificial neuron is a mathematical function that processes and transmits data. This seemingly simple unit is the fundamental building block of complex AI systems, enabling machines to learn, recognize patterns, and make decisions.

An artificial neuron, often called a “node” or “unit,” mimics the functionality of its biological counterpart. In the brain, neurons receive signals, process them, and transmit them to other neurons. Similarly, in AI, artificial neurons take numerical inputs, apply a transformation, and pass the result to other neurons in a network.

An artificial neuron has three main components:

Inputs

The inputs represent features or data points.
Each input p_x or i_x is associated with a weight w_x that signifies its importance. Depending on the layer, an input p_x is connected to a neuron of the previous layer, while an input i_x is connected to the raw input data of the network.

Node

The node aggregates the weighted inputs w_x and adds a bias term b.
The aggregation z is typically a summation:

z = \sum_x (w_x \: p_x) + b
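As a minimal sketch in Python/NumPy, the aggregation of a single node might look like this (the variable names mirror the symbols above):

import numpy as np

def aggregate(p, w, b):
    # z = sum(w_x * p_x) + b, the weighted sum of the inputs plus the bias
    return np.dot(w, p) + b

p = np.array([0.5, -1.2, 3.0])    # inputs (outputs of the previous layer)
w = np.array([0.4,  0.1, -0.6])   # one weight per input
b = 0.2                           # bias term
z = aggregate(p, w, b)            # z is then passed to the activation function
print(z)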

Activation Functions

The activation function applies a nonlinear transformation to the aggregated input and determines the neuron’s output, making the network capable of learning complex patterns.

The activation function introduces non-linearity, enabling the network to model complex relationships in data. Depending on the activation function f, the result is buffered in p or a.
Common activation functions include:

Sigmoid

Outputs values between 0 and 1. Ideal for binary classification.

f(z) = p = \frac{1}{1+\exp\left(-\frac{z}{T}\right)}

With parameter T = 1, the Sigmoid function f(z) is the standard logistic function.

The function is differentiable and provides a smooth gradient, preventing jumps in output values; this is reflected in the characteristic S-shape of the Sigmoid activation function. The derivative of the function is:

\frac{\partial p}{\partial z} = \frac{1}{T}\,f(z)\left(1 - f(z)\right)

The loss function commonly associated with the Sigmoid activation function is the binary cross-entropy loss L, particularly for binary classification problems. It is calculated as:

L(y,p)=-\frac{1}{M}\sum_{m=0}^{M-1}\left(y_m\:\log(p_m)+(1-y_m)\:\log(1-p_m)\right)

The formula calculates the loss for each individual network output sample p_m and then averages these losses over all M samples.
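A short NumPy sketch of the Sigmoid, its derivative, and the binary cross-entropy loss (the clipping guard against log(0) is an implementation detail added here, not part of the formulas):

import numpy as np

def sigmoid(z, T=1.0):
    return 1.0 / (1.0 + np.exp(-z / T))

def sigmoid_derivative(z, T=1.0):
    p = sigmoid(z, T)
    return (1.0 / T) * p * (1.0 - p)

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0])              # targets
p = sigmoid(np.array([2.0, -1.0, 0.5]))    # network outputs
print(binary_cross_entropy(y, p))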

ReLU (Rectified Linear Unit)

Outputs the input directly if positive; otherwise, outputs zero. Used extensively in deep networks due to its simplicity and efficiency.

f(z) = p = \max(0,z)

A neuron is deactivated if the output of its linear transformation is less than 0. Because only a subset of neurons is active at any time, the ReLU function is far more computationally efficient than the sigmoid and tanh functions.

ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.

All negative input values become zero immediately, which decreases the model’s ability to fit or train on the data properly. As a result, during backpropagation the weights and biases of some neurons are never updated. This can create dead neurons that never get activated, which is known as the dying ReLU problem.

Leaky ReLU

Leaky ReLU is an improved version of the ReLU function that addresses the dying ReLU problem, as it has a small positive slope in the negative region.

f(z) = p = \max(az,\,z) \xrightarrow{a=0.01} \max(0.01z,\,z)

The advantages of Leaky ReLU are the same as those of ReLU, with the addition that it enables backpropagation even for negative input values.

With this minor modification for negative input values, the gradient on the left side of the graph becomes non-zero, so dead neurons are no longer encountered in that region.

PLU (Piecewise Linear Unit)

The Piecewise Linear Unit (PLU) is a hybrid of tanh and ReLU and has been shown to outperform ReLU on a variety of tasks while avoiding the vanishing-gradient issue of tanh.

f(z) = p = \operatorname{sgn}(z) \cdot \min(|z|,\,a\:|z| + b) \xrightarrow{a=0.01,\:b=1-a} \operatorname{sgn}(z) \cdot \min(|z|,\,0.01\:|z| + 0.99)

Piecewise linear functions handle the depth of modern neural networks well because they do not compress large values, preserving information and allowing the network to learn complex representations.
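The three rectifier variants can be sketched in NumPy as follows, using the example values of a and b from the formulas above:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.01):
    return np.maximum(a * z, z)

def plu(z, a=0.01, b=0.99):
    # sgn(z) * min(|z|, a*|z| + b): identity for |z| <= 1, slope a outside
    return np.sign(z) * np.minimum(np.abs(z), a * np.abs(z) + b)

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))
print(leaky_relu(z))
print(plu(z))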

SoftMax

The SoftMax function converts a vector of raw scores (logits) into a probability distribution, where the probabilities of all classes sum to 1. This makes the SoftMax function ideal for multi-class classification problems. SoftMax is typically used in the output layer of a neural network for multi-class classification. It assigns a probability to each of the k classes, and the class with the highest probability is usually selected as the prediction.

The SoftMax function is defined as:

f(z)_k=p_k=\frac{e^{z_k}}{\sum\limits_{m=0}^{M-1}e^{z_m}}

The output of the SoftMax function for any input vector is a probability distribution. Each output lies between 0 and 1 and the sum of all outputs equals 1.

The loss function commonly associated with the SoftMax activation function is the categorical cross-entropy loss L, particularly for multi-class classification problems. It is calculated as:

L(y,p)=-\sum_{m=0}^{M-1}y_m\:\log(p_m)

When combining SoftMax with the cross-entropy loss, the derivative simplifies significantly. For a single data point, the gradient of the loss with respect to the logit z_k is:

\frac{\partial L}{\partial z_k}=-y_k+p_k\sum_{m=0}^{M-1}y_m=p_k-y_k

where the last step uses the fact that the one-hot labels y_m sum to 1.
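A NumPy sketch of SoftMax with the categorical cross-entropy loss; the last lines verify numerically that the gradient with respect to the logits is p − y (subtracting the maximum logit is a standard numerical-stability trick, not part of the definition):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max logit for numerical stability
    return e / np.sum(e)

def categorical_cross_entropy(y, p):
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1, -1.0])   # raw scores (logits) for k = 4 classes
y = np.array([0.0, 1.0, 0.0, 0.0])    # one-hot target
p = softmax(z)
print(p, p.sum())                      # a probability distribution summing to 1
print(categorical_cross_entropy(y, p))
print(p - y)                           # gradient of the loss w.r.t. the logits
print(np.argmax(p))                    # predicted class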

Backpropagation

Backpropagation is a fundamental algorithm used to train neural networks by minimizing the error or loss between predicted and actual outputs. It calculates the gradient of the loss function with respect to the network’s weights and biases, enabling efficient weight updates via gradient descent or related optimization algorithms.

Output Layer Gradients

Calculate the delta of the loss with respect to the output layer’s activations, and multiply it by the derivative of the output layer’s activation function to get the gradient:

\nabla_k=(p-y)\:\frac{\partial p}{\partial z}\bigg|_k

Hidden Layer Gradients

For each hidden layer use the chain rule to propagate gradients where l is the layer index:

\nabla_j^{(l-1)}=\frac{\partial p^{(l-1)}}{\partial z}\bigg|_j\:\sum_kw_{j,k}^{(l)}\nabla_k^{(l)}

Continue propagating gradients backward through all layers until reaching the input layer.

Weight and Bias Updates

Initialize the weights and biases with small random values. The magnitude of the initial weights should decrease as the number of incoming nodes increases (for example, scaling with the inverse square root of the fan-in).

Using the calculated gradients, update the weights w and biases b with an optimization algorithm such as Stochastic Gradient Descent (SGD):

w_{j,k}^{(l)}[n]=w_{j,k}^{(l)}[n-1]-\beta\:\nabla_k^{(l)}p_j^{(l-1)}
b_{k}^{(l)}[n]=b_{k}^{(l)}[n-1]-\beta\:\nabla_k^{(l)}

where the learning rate β is a value between 0 and 1.
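Putting the formulas together, here is a minimal end-to-end sketch of these update rules in NumPy: a tiny two-layer sigmoid network trained on the XOR toy problem (the data, layer sizes, and learning rate are illustrative choices, not part of the original text):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR toy data: 2 inputs, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initialization, scaled down with the number of incoming nodes
W1 = rng.normal(0, 1 / np.sqrt(2), (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1 / np.sqrt(4), (4, 1)); b2 = np.zeros(1)
beta = 0.5   # learning rate

for epoch in range(20000):
    for x, y in zip(X, Y):
        # Forward propagation
        z1 = x @ W1 + b1; p1 = sigmoid(z1)
        z2 = p1 @ W2 + b2; p2 = sigmoid(z2)
        # Output-layer gradient: (p - y) * dp/dz
        d2 = (p2 - y) * p2 * (1 - p2)
        # Hidden-layer gradient via the chain rule
        d1 = p1 * (1 - p1) * (W2 @ d2)
        # SGD updates: w -= beta * gradient * previous-layer output
        W2 -= beta * np.outer(p1, d2); b2 -= beta * d2
        W1 -= beta * np.outer(x, d1);  b1 -= beta * d1

print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(3))   # approx [0, 1, 1, 0]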

Rules for Building a Neural Network

  1. Scale the data so that the min/max values of the inputs map to -1/1 or 0/1, depending on the activation functions used.
  2. Start with one or two hidden layers, based on complexity. For a small number of inputs (low complexity), start with one hidden layer.
  3. The size of the first hidden layer depends on the complexity of the input data; for high complexity, increase its number of neurons. Choose each following layer to be about half the size of the previous one (see the sketch after this list).
  4. Use ReLU, Leaky ReLU, or PLU for intermediate layers.
  5. For the output layer, use Sigmoid for binary or multi-label classification, SoftMax for multi-class classification, and Linear for regression.
  6. Split the data into a training set and a test set.
  7. For binary classification (Sigmoid activation function), use binary cross-entropy.
    For multi-class classification, use categorical cross-entropy. For regression (Linear activation function), use mean squared error.
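Rules 1–3 can be translated into a short NumPy sketch; the 0/1 scaling and the halving heuristic follow the list above, while the helper names are hypothetical:

import numpy as np

def min_max_scale(X, lo=0.0, hi=1.0):
    # Rule 1: map each input column to [lo, hi]; use lo=-1.0 for tanh-like activations
    mn, mx = X.min(axis=0), X.max(axis=0)
    return lo + (X - mn) * (hi - lo) / (mx - mn)

def layer_sizes(first_hidden, n_outputs, n_hidden_layers=2):
    # Rules 2-3: size the first hidden layer by input complexity,
    # then halve each following hidden layer
    return [first_hidden // 2**i for i in range(n_hidden_layers)] + [n_outputs]

X = np.array([[71.0, 112.0, 1.6],
              [45.0, 110.0, 1.2],
              [34.0, 118.0, 0.7]])
print(min_max_scale(X))                        # all columns now in [0, 1]
print(layer_sizes(20, 1, n_hidden_layers=3))   # [20, 10, 5, 1]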

Examples

The following examples use datasets from the Kaggle community to demonstrate prediction and classification tasks solved by neural networks.

Heart Disease Prediction

This example demonstrating a prediction task uses the following dataset from the Kaggle community:

https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

The task is to predict the chance of a heart attack based on the parameters shown below:

age | sex | chest pain type | resting blood pressure | cholesterol level | fasting blood sugar | resting ecg | max heart rate | exercise induced angina | ST depression | slope ST segment | number of major vessels | thalassemia type | target
71 | 0 | 0 | 112 | 149 | 0 | 1 | 125 | 0 | 1.6 | 1 | 0 | 2 | 1
45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0
34 | 0 | 1 | 118 | 210 | 0 | 1 | 192 | 0 | 0.7 | 2 | 0 | 2 | 1
58 | 0 | 0 | 170 | 225 | 1 | 0 | 146 | 1 | 2.8 | 1 | 2 | 1 | 0
… | … | … | … | … | … | … | … | … | … | … | … | … | …

For the interpretation of the columns, see the dataset description.
Based on the rules described in the previous section, the neural network for this task has been designed as follows:

13 Input nodes → 20 PLU → 10 PLU → 5 PLU → 1 Sigmoid
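The author’s own simulator (linked below) was used for training, but for illustration the same design could be sketched in Keras, reusing the PLU definition from above; the optimizer and loss follow rules 5 and 7, and this sketch is an assumption, not the simulator’s actual implementation:

import tensorflow as tf

def plu(z, a=0.01, b=0.99):
    # Piecewise Linear Unit: sgn(z) * min(|z|, a*|z| + b)
    return tf.sign(z) * tf.minimum(tf.abs(z), a * tf.abs(z) + b)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation=plu, input_shape=(13,)),
    tf.keras.layers.Dense(10, activation=plu),
    tf.keras.layers.Dense(5, activation=plu),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, ...) after scaling the 13 inputs as in rule 1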

Note that different designs can have a different impact on the quality of the result, and the time needed to complete training successfully can vary.
With this neural network design, a drop of the loss can be observed after about 180,000 epochs using 230 data rows.

The performance of the trained network on 5 randomly chosen data rows is shown below:

A deviation can be observed in 1 of the 5 data rows: instead of a high chance of a heart attack (value 1), the network rated it at only 0.606.

age | sex | chest pain type | resting blood pressure | cholesterol level | fasting blood sugar | resting ecg | max heart rate | exercise induced angina | ST depression | slope ST segment | number of major vessels | thalassemia type | target shall | target is
64 | 0 | 2 | 140 | 313 | 0 | 1 | 133 | 0 | 0.2 | 2 | 0 | 3 | 1 | 0.988
46 | 1 | 2 | 150 | 231 | 0 | 1 | 147 | 0 | 3.6 | 1 | 0 | 2 | 0 | 0.001
64 | 1 | 0 | 120 | 246 | 0 | 0 | 96 | 1 | 2.2 | 0 | 1 | 2 | 0 | 0
48 | 1 | 1 | 130 | 245 | 0 | 0 | 180 | 0 | 0.2 | 1 | 0 | 2 | 1 | 0.606
39 | 0 | 2 | 94 | 199 | 1 | 1 | 179 | 0 | 0 | 2 | 0 | 2 | 1 | 0.988

The neural network simulator used for this example can be downloaded here.

Neural networks are highly effective for solving prediction tasks across a wide range of domains, thanks to their ability to model complex, non-linear relationships in data. Their effectiveness depends on the type of task, the quality of the data, and the architecture of the network.

Weather Type Classification

This example demonstrating a classification task uses the following dataset from the Kaggle community:

https://www.kaggle.com/datasets/nikhil7280/weather-type-classification

The task is to predict the weather type based on the parameters shown below:

temperature | humidity | wind speed | precipitation (%) | cloud cover | atmospheric pressure | uv index | season | visibility (km) | location | weather type
14 | 73 | 9.5 | 82 | partly cloudy | 1010.82 | 2 | winter | 3.5 | inland | rainy
39 | 96 | 8.5 | 71 | partly cloudy | 1011.43 | 7 | spring | 10 | inland | cloudy
30 | 64 | 7 | 16 | clear | 1018.72 | 5 | spring | 5.5 | mountain | sunny
-9 | 49 | 1.5 | 58 | partly cloudy | 1132.2 | 8 | spring | 16.5 | mountain | snowy
… | … | … | … | … | … | … | … | … | … | …

For the interpretation of the columns, see the dataset description.
Based on the rules described in the previous section, the neural network for this task has been designed as follows:

10 Input nodes → 12 PLU → 8 PLU → 4 SoftMax

Note that different designs can have a different impact on the quality of the result, and the time needed to complete training successfully can vary.
With this neural network design, a drop of the loss can be observed after about 15,000 epochs using 180 data rows.

The performance of the trained network on 5 randomly chosen data rows is shown below:

A deviation can be observed in 1 of the 5 data rows: instead of 100% rainy, the network rated it at only 47.5% rainy.

temperature | humidity | wind speed | precipitation (%) | cloud cover | atmospheric pressure | uv index | season | visibility (km) | location | weather type shall | weather type is
34 | 68 | 6.5 | 5 | 1 (clear) | 1021.02 | 6 | 1 (spring) | 5 | 0 (inland) | 100% sunny | 100% sunny
44 | 109 | 4.5 | 107 | 2 (overcast) | 1005 | 14 | 3 (autumn) | 1 | 1 (mountain) | 100% rainy | 47.5% rainy, 52.4% sunny, 0.1% cloudy
22 | 51 | 1.5 | 35 | 2 (overcast) | 1005.28 | 3 | 0 (winter) | 8.5 | 1 (mountain) | 100% cloudy | 100% cloudy
29 | 79 | 14.5 | 95 | 2 (overcast) | 994.49 | 2 | 2 (summer) | 2.5 | 0 (inland) | 100% rainy | 99.9% rainy, 0.1% sunny
0 | 76 | 12 | 54 | 0 (partly cloudy) | 993.54 | 1 | 0 (winter) | 1.5 | 1 (mountain) | 100% snowy | 99.9% snowy, 0.1% cloudy
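Before training, the categorical columns (cloud cover, season, location) must be converted to numbers. The integer codes visible in the table above can be collected in small mapping tables, for example (codes not appearing there, such as those for “cloudy” and “coastal”, are assumptions):

# Hypothetical mapping tables; codes are read off the results table above,
# except "cloudy" and "coastal", which do not appear there and are assumptions.
cloud_cover = {"partly cloudy": 0, "clear": 1, "overcast": 2, "cloudy": 3}
season = {"winter": 0, "spring": 1, "summer": 2, "autumn": 3}
location = {"inland": 0, "mountain": 1, "coastal": 2}

def encode(row):
    # Replace the three categorical entries of a raw data row by integer codes
    t, h, w, p, cc, ap, uv, s, vis, loc = row
    return [t, h, w, p, cloud_cover[cc], ap, uv, season[s], vis, location[loc]]

print(encode([14, 73, 9.5, 82, "partly cloudy", 1010.82, 2, "winter", 3.5, "inland"]))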

The neural network simulator used for this example can be downloaded here.

Neural networks are exceptionally well-suited for solving classification tasks across diverse domains, often delivering state-of-the-art performance. Their effectiveness stems from their ability to learn complex, non-linear patterns and extract hierarchical features directly from raw data.
