Table of content

  1. Motivation
  2. Introduction
  3. Finding weights and bias
  4. Activation function
  5. Loss function
  6. Gradient descent
  7. Algorithm
  8. Conclusion

 

We mostly use ready-made deep learning packages such as  like Scikit-learn, Keras, TensorFlow etc. and also most of our ML work is easily handled by this.

But for a machine learning practitioner it’s a must to know math and algorithms behind these libraries to get motivation and to stand more in the field.

 

Why do we need mathematical understanding ?

  1. It helps us in selecting the right algorithm , right number of  parameters, number of features etc.
  2. AI is in it’s developing phase. Everyday we introduce a number of advancements in it, so mathematical understanding helps us to grasp these techniques.
  3. It helps us in identifying underfitting and overfitting by Bia_Variance tradeoff.

So here, we will go through some basic math behind artificial neural networks which you will use throughout your learning.

 

Introduction

An artificial neural network is a computing system that works like our human brain. (Don’t worry if you don’t know the biological working of our brain.)

ANN is purely mathematical, so don’t bother about it.

Here is an artificial neuron

math-ann-formula-1

 math-ann-neural-network

 

Let's say you want to predict the chance that you or your friends will go to watch a movie at cinema or not.

Input feature may be like

  1. Whether friends is coming with or not. (Note: 0 denotes No and 1 denotes Yes)
  2. Is cinema hall near.
  3. Whether weather is cool or hot

 

Weight helps us to know which feature carries, how much importance in decision-making.

Like weights for a guy A who likes friend’s company may be like

w1=5; w2=3; w3=1

 

And for a guy B who is more concerned about weather condition and distance of cinema hall

w1=2 ; w2= 3; w3=3

 

Let us compute our output(whether a person will go to watch movie or not) as :

math-ann-formula-2

And  let us choose our decision boundary as 5, means a person will go to the cinema hall if Z >= 5.

Then we can conclude based on this condition that

  1. Person A will go if only his friends join him.
  2. Person B will go if friend is with him and cinema hall is near

                or

     weather is cool and cinema hall is near

 

We would be able to guess the 3rd condition when B will go.

So we can conclude:

math-ann-formula-3

Usually in AI, ML texts this (-5) refers as bias ‘b’ so finally

math-ann-formula-4

Finding weights and bias

In the practical world we face dataset whose weights are not known to us such that we can correctly classify our target.

Like for the below dataset , to classify flower species we need weights for each feature and a bias.

(NOTE: As above,  we discussed 2 class classification and it can be extended to multiclass using more layers.)

sample-features-target

So here we will discuss one mostly used technique for finding weights : Gradient descent method

 

Let we have a dataset containing input features as X = [x1 x2 x3 . . . . xn]

 

Here :

n = no. of feature

X= input matrix

And target variable as Y.

 

For Iris dataset we may take x1=sepal length, x2=sepal width ….. and y as class label.

And it will be mostly such that predicted output will get huge value, so we will use activation function such as Sigmoid to overcome it.

 

Activation Function

There are 3 most commonly  used activation function such as:

  1. Rectified Linear Activation (RELU)
  2. Logistic (Sigmoid)
  3. Hyperbolic Tangent (Tanh) 

 

We will discuss Sigmoid in this brief discussion as we go ahead.

 

Sigmoid Activation Function

sample-features-target

Mathematically

math-ann-formula-5

by analyzing above equation we can say that for large value of z , this will return value near to one.  

 

Note:

For further use,we will use weighted sum as

math-ann-formula-6

And predicted target variable as

math-ann-formula-7

Loss Function

To ensure that our model is going to get correct predicted value or not ,we use LOSS FUNCTION. Loss function measures error between predicted value and actual value.

There are many loss function we use in deep learning.

Here we will use cross-entropy loss function J(w,b) described as:

math-ann-formula-8

Note : here m = number of inputs.

 

For understanding let assume

math-ann-formula-9

math-ann-formula-10

So by above analysis, we can conclude that our goal is to minimize the loss function J(w,b) with proper choice of w and b. This process is called optimization.

 

Gradient Descent

One of optimization is Gradient descent which we will discuss here. let suppose our loss function graph looks like below and our goal is to find w , b such it’s value is minimum:

math-ann-loss function1

Step 1:

Initialize weights and bias randomly and find

math-ann-formula-11

 

math-ann-loss function2

Step 2 :

by above graph we can understand that: 

math-ann-formula-12

and by sufficient number of iteration we get our value of w that will have least error.

we will follow same procedure for finding bias(b).

 

Step 3:

After finding weight and bias we can simply get our output using these values.

 

Algorithm

Now we will develop a simple algorithm based on our mathematical understanding:

let we have dataset like below :

X1

X2

Y

0

0

0

0

1

0

1

0

0

1

1

1

 

math-ann-formula-13

math-ann-optimization-hand-notes

math-ann-formula-14

So Finally :

import math

x1=[0,0,1,1] #input x1
x2=[0,1,0,1] #input x2
y=[0,0,0,1]   #output

w1=0 #initialization of weight and bias
w2=0
b=0
L=0
dw1=0
dw2=0
db=0

for i in range(10000):
    a=[]
    z=[]
    for i in range(4):
        z.append(w1*x1[i]+w2*x2[i]+b) #weighted sum
        a.append(1/(1+math.exp(-z[i]))) #activation function
        L=L+(-(y[i]*math.log(a[i])+(1-y[i])*math.log(1-a[i]))) #Loss function
        dz=a[i]-y[i]
        dw1=dw1+dz*x1[i]  #weight and bias update
        dw2=dw2+dz*x2[i]
        db=db+dz
    dw1=dw1/4
    dw2=dw2/4
    db=db/4
    w1=w1-0.01*dw1
    w2=w2-0.01*dw2
    b=b-0.01*db

 

w1 #value of weights after optimization

3.514136610374613

w2

3.514136610374613

b

-5.475789101071454

#testing weight on input
yd=[]
for i in range(4):
    yd.append(x1[i]*w1+x2[i]*w2+b)

yd #negative values corresponds to 0 and positive to 1

[-5.475789101071454,
 -1.9616524906968413,
 -1.9616524906968413,
 1.5524841196777714]

          

Conclusion

In above discussion we developed a 1 layer neural network with 2 input algorithm purely based on maths and this understanding will help in getting hands on multi-layer neural network.

But it have a drawback that it can be applied only on linearly seperable datasets like:

math-ann-optimization

math-ann-not-linearly-seperable

However, with use of Radial basis function we can transform our features from linearly not separable to linearly separable. You can read more about Artificial Neural Networks for Classification.

 

 

About the Author's:

Pankaj Singariya

Pankaj Singariya

Pankaj Singariya is an ECE final year student of NIT Meghalaya. He loves to learn the mathematical concept behind data science and artificial intelligence and wants to research and work in this field. He has executed projects on data science and machine learning.

 

Mohan Rai

Mohan Rai is an Alumni of IIM Bangalore , he has completed his MBA from University of Pune and Bachelor of Science (Statistics) from University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and a Business Owner. He has more than 18 years’ experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.