
## Decision Tree

A decision tree is a predictive model that is widely used in machine learning. It repeatedly splits the given data based on the input features. It is a supervised learning technique in which the data is organised into nodes and leaves, much like a tree: internal nodes pose questions about the input, and leaves hold the answers (predictions). The tree usually starts from a single node, which is then split into the possible outcomes. Decision trees are used in statistics and data mining to solve classification and regression problems in tree form, work well on non-linear datasets, and let the user weigh several possible actions.

## Terms used in Decision tree

Root node - Also called the starting node; it has child nodes but no parent.

Leaf node - Also called an end node; it has no child nodes.

Internal node - Any node between the root node and a leaf node.

Splitting - The process of dividing a single node into multiple child nodes.

Branch - A subsection of the decision tree.

## Types of Decision Tree

### Regression tree

It is also known as a continuous-variable decision tree. As the name suggests, it deals with numerical data: both the input features and the predicted output are numbers.
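A minimal sketch of a regression tree using scikit-learn's `DecisionTreeRegressor` (the toy data below is made up purely for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy numeric data: y grows roughly like x squared
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.0, 4.2, 8.8, 16.1, 25.3, 35.9])

# Fit a shallow regression tree on the numeric data
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

# Predict a numeric value for an unseen input
print(reg.predict([[3.5]]))
```

The prediction is the mean of the training targets in the leaf that the input falls into, so it always lies within the range of the training outputs.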

### Classification Tree

It is also known as a categorical-variable decision tree. As the name suggests, it deals with categorical data, for example, classifying a car's price as low, medium, or high.

## Working of decision tree in machine learning

The process of predicting the target variable in machine learning is as follows:

1. Provide a dataset which contains a number of training instances, along with a target and features.
2. Fit a decision tree classification or regression model using `DecisionTreeClassifier()` or `DecisionTreeRegressor()`, depending on the dataset. Don't forget to set the split criterion while building the model (and also limit the tree's growth if overfitting is happening).
3. Your decision tree model is ready. To visualize the decision tree, use Graphviz.
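As a sketch of step 2, the criterion and a depth limit can be set directly in the constructor; `export_text` from `sklearn.tree` also gives a plain-text view of the fitted tree if Graphviz is not installed (the parameter values here are only illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Set the split criterion and cap the depth to limit overfitting
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# export_text prints the learned rules as indented text (no Graphviz needed)
print(export_text(clf, feature_names=list(iris.feature_names)))
```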

## Advantages of a decision tree

1. It is easy to implement and visualise.
2. It can handle both continuous and categorical data.
3. It needs less data cleaning than many other algorithms.
4. It lets the user identify relationships between variables easily, so it is useful in data exploration. It also indicates which feature matters most for prediction.
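The last point can be checked directly in scikit-learn: a fitted tree exposes a `feature_importances_` attribute that scores each feature's contribution to the splits, as in this small sketch on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# feature_importances_ holds one score per feature; the scores sum to 1
for name, score in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {score:.3f}")
```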

## Disadvantages of a decision tree

1. A decision tree sometimes overfits the data.
2. It is less accurate for continuous numerical data, since splitting discretizes the values.
3. A small change in the dataset can change the whole structure of the tree.

## Application of Decision tree

It is a basic algorithm used for both classification and regression models. Because a decision tree's output can be visualized, it is easy for the user to draw insights from the modelling process. A few examples of where a decision tree can be used are:

1. Banking - fraud detection.
2. Hospitals - flagging wrong diagnoses.

## Creating Decision Tree from basics using Python

### Step 1 : Import all the Libraries

```python
import pandas as pd
import numpy as np
import sklearn.datasets as ds                         # for built-in datasets
from sklearn.metrics import classification_report     # for the classification report and analysis
from sklearn.model_selection import train_test_split  # for train/test splitting
from sklearn.tree import DecisionTreeClassifier       # for the decision tree object
```

### Step 2 : Load DataSet

Load the dataset on which you want to build your decision tree model. For reference I am using the Iris dataset, which ships with scikit-learn.

```python
IRIS = ds.load_iris()
iris_df = pd.DataFrame(IRIS.data, columns=IRIS.feature_names)
iris_df['species'] = IRIS.target
iris_df.head()
```

Figure 1: Top 5 records of the Iris dataset, shown using the `head` function

### Step 3 : Exploratory Data Analysis

Now let's perform some basic checks to see whether the dataset has any null or NaN values.

```python
iris_df.info()
```

Figure 2: Output of the `info` function, used to check the Iris dataset for null values

```python
iris_df.shape
```

(150, 5)

```python
iris_df.isnull().any()
```

sepal length (cm)    False
sepal width (cm)     False
petal length (cm)    False
petal width (cm)     False
species              False
dtype: bool

### Step 4 : Data Set Preparation

Now we will split the dataset into train and test sets. The first split puts 80% of the Iris data into train and 20% into test. We will also create a second split with 50% of the data as train and 50% as test.

```python
# Segregate the features and target into separate objects
X = iris_df.iloc[:, 0:4]
Y = iris_df.iloc[:, 4]

# Splitting the data - 80:20 ratio
X_train_1, X_test_1, Y_train_1, Y_test_1 = train_test_split(X, Y, test_size=0.2, random_state=42)
print("Training set 1 split input- ", X_train_1.shape)
print("Testing set 1 split input- ", X_test_1.shape)

# Splitting the data - 50:50 ratio
X_train_2, X_test_2, Y_train_2, Y_test_2 = train_test_split(X, Y, test_size=0.5, random_state=42)
print("Training set 2 split input- ", X_train_2.shape)
print("Testing set 2 split input- ", X_test_2.shape)
```

Training set 1 split input-  (120, 4)
Testing set 1 split input-  (30, 4)

Training set 2 split input-  (75, 4)
Testing set 2 split input-  (75, 4)

### Step 5 : Build the Decision tree model

We will now build the decision tree model on the training data and later test it on the test data. Note that the first split keeps 80% of the data for training, which is not ideally a good proportion here: decision trees have a known tendency to overfit, so in situations like this it can help to feed the model a relatively smaller proportion of training data. The objective here, however, is simply to simulate both splits and compare the results.

```python
# Defining the decision tree algorithm
decisiontree_1 = DecisionTreeClassifier(random_state=0)
# Training the DT algorithm on the first train set
decisiontree_1.fit(X_train_1, Y_train_1)

# Defining the decision tree algorithm
decisiontree_2 = DecisionTreeClassifier(random_state=0)
# Training the DT algorithm on the second train set
decisiontree_2.fit(X_train_2, Y_train_2)
```

### Step 6 : Use the models to predict the data in test environment

Now we will evaluate the accuracy of each model on its respective test dataset.

```python
# Predicting the values for the first test set
y_pred_decisiontree_1 = decisiontree_1.predict(X_test_1)
print("Classification report for first model- \n", classification_report(Y_test_1, y_pred_decisiontree_1))

# Predicting the values for the second test set
y_pred_decisiontree_2 = decisiontree_2.predict(X_test_2)
print("Classification report for second model- \n", classification_report(Y_test_2, y_pred_decisiontree_2))
```

Figure 3: Classification reports for the two decision tree models

## Conclusion and Summary

As we can see, the accuracy of the decision tree stays above 90% even when we give it only 50% of the data for training. In a real production environment, with more complex and varied feature values, this may differ. One important aspect of a decision tree is its tendency to overlearn from the data, so as a precaution it is always good to hold back a substantial share of the instances from the training set. You can also use grid search to optimise your ML models; scikit-learn's GridSearchCV provides a reference implementation of grid search in Python.
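The grid-search idea can be sketched with `GridSearchCV` on the same Iris data; the hyperparameter values below are chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Candidate hyperparameters to search over (illustrative values)
param_grid = {"max_depth": [2, 3, 4, None], "criterion": ["gini", "entropy"]}

# 5-fold cross-validated grid search over the decision tree settings
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(iris.data, iris.target)

print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

`best_params_` then holds the combination with the highest cross-validated accuracy, and the refitted best model is available as `search.best_estimator_`.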