Table of Content

Before creating survival analysis model we have to understand what is survival analysis and how it can help us.  So, survival analysis is a branch of statistics which helps in modeling the time that a particular event may take to occur. With the help of survival analysis we can give the answers of many questions, like when the machine system failure will take place, survival time of patients affected by disease and how much time it will take to cure the disease. It is also helpful in insurance sector as it can predict the survival of policy holder or actuaries fix an appropriate insurance premium in underwriting.

## Terms Used in Survival Analysis

Events :  It is the activity which is currently going on or going to happen like death of a person from disease, time taken to cure from disease, time taken to cure by vaccine, failure of machines, etc.

Time :  It is a time taken from beginning of the observation to the end of it or we can say till the time when event is going to happen.

Censoring or Censored Observation : If the subject on which we are doing survival analysis doesn’t get affected by the defined event is known as censored. A censored subject may or may not have any event after the end of observation time.

There are three types of Censoring:

Right Censored : If we are not sure what happened to subject after certain point in time it is known as Right Censored. Its main reason is that the actual event time is greater than the censored time.

Ex: A patient who has not been cured at the time of observation

Left Censored : If we are not sure what happened to subject before certain point of time is known as Left Censored. Its main reason is that the actual event time is less than the censored time.

Interval Censored : If we are sure that event has occurred in a time interval but don’t know when it is actually occurred is known as Interval Censored.

Survival Function : It is the probability function that depends upon the time “t” of the study. The survival function gives the probability that the subject survived more than time “t”.

## Types of Survival Analysis

The process of survival analysis can be performed through various techniques like:

1.  Kaplan-Meier Estimator
2.  Cox Proportional Hazards Regression Analysis
3. Survivors and Hazards function rate
4. Survival Tree
5. Survival Random Forest
6. Parametric survival analytic model

## Kaplan-Meier Estimator

It is a non-parametric statistic used to estimate person surviving probability (survival function) from lifetime data. For example, survival time of patient after the diagnosis or from the time when the treatment starts.

It is one of the most used methods of survival analysis. The estimate may be useful to examine the probability of death or the recovery rate.

Three assumptions must be made before using Kaplan-Meier estimator:

1. Subjects that are censored have the same survival prospects as those who continue to be followed.
2. Survival probability is the same for all the subjects.
3. The event of interest happens at the specified time. The estimated survival time can be more accurately measured if the examination happens frequently.

The visual representation of this function is known as Kaplan-Meier Curve.

Mathematically Kaplan-Meier estimator represented as below: ni = number of subjects at risk before time t .

d= number of the desired event  at time t .

Parametric survival models and the Cox proportional hazards model may be useful to estimate covariate-adjusted survival rather than Kaplan-Meier estimator.

## Cox Proportional Hazards Regression analysis

It is basically a regression model commonly used in healthcare to investigate the relation between survival time of patients with other variables.

Where Kaplan-Meier estimator is used for univariate analysis, Cox proportional hazards is used for multivariate analysis.

It is expressed by the hazard function denoted by h(t) and it can be interpreted as the risk of dying at time t.

Mathematically it will be represented as: where,

t = survival time,

h(t) = hazard function determined by a set of n covariates,

h0 = baseline hazard.

The exponential expression called as Hazard Ratio(HR), if its value is greater than 1 it will indicate a covariate that is positively correlated with event probability which mean it is negatively correlated with survival length.

If,

HR > 1: Increase in Hazards

HR = 1: No effect

HR < 1: Reduction in Hazards

Let’s see how to build the survival analysis model using Haberman’s Survival dataset.

The dataset contains a set of 306 patients whose survival time is given in years from 1958 to 1970. It contains 4 features:

Age of patient at time of operation as Age.

Patient’s year of operation as Year_of_Operation.

No. of positive axillary nodes detected as Axillary_nodes.

Survival status :

1 = Patient survive 5 years or longer.

2 = Patient died within 5 years.

## Implementation of Survival Model

First we have to install lifelines library which contains survival analysis models.

```    # Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing Dataset

Patient_Age  Year_of_Operation  Axillary_nodes  Survival Status
0           30                 64               1                1
1           30                 62               3                1
2           30                 65               0                1
3           31                 59               2                1
4           31                 65               4                1

data.info()

RangeIndex: 306 entries, 0 to 305
Data columns (total 4 columns):
#   Column             Non-Null Count  Dtype
---  ------             --------------  -----
0   Patient_Age        306 non-null    int64
1   Year_of_Operation  306 non-null    int64
2   Axillary_nodes     306 non-null    int64
3   Survival Status    306 non-null    int64
dtypes: int64(4)
memory usage: 9.7 KB

data.describe()

Patient_Age  Year_of_Operation  Axillary_nodes  Survival Status
count   306.000000         306.000000      306.000000       306.000000
mean     52.457516          62.852941        4.026144         1.264706
std      10.803452           3.249405        7.189654         0.441899
min      30.000000          58.000000        0.000000         1.000000
25P      44.000000          60.000000        0.000000         1.000000
50P      52.000000          63.000000        1.000000         1.000000
75P      60.750000          65.750000        4.000000         2.000000
max      83.000000          69.000000       52.000000         2.000000

```
```    # Lets check the distribution of the target class in the data
data['Survival Status'].value_counts()

1    225
2     81
Name: Survival Status, dtype: int64

```

We will be needing the lifelines package. If you are using Jupyter Notebook then you need to

```    !pip install lifelines

```

else if you are using anaconda distribution, then open the anaconda prompt and give the following command in your terminal

```    pip install lifelines --user

# in both the cases you would be seeing the system executing a bunch of downloads and check to build your environment for survival analytics. Something like this

(base) C:\Users\Mohan Rai>pip install lifelines --user
Collecting lifelines
|????????????????????????????????| 348 kB 3.3 MB/s
Requirement already satisfied: matplotlib>=3.0 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from lifelines) (3.3.2)
Collecting formulaic<0.3,>=0.2.2
|????????????????????????????????| 55 kB 584 kB/s
Requirement already satisfied: numpy>=1.14.0 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from lifelines) (1.19.2)
Requirement already satisfied: pandas>=0.23.0 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from lifelines) (1.1.3)
Requirement already satisfied: scipy>=1.2.0 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from lifelines) (1.5.2)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (1.3.0)
Requirement already satisfied: cycler>=0.10 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (0.10.0)
Requirement already satisfied: python-dateutil>=2.1 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (2.8.1)
Requirement already satisfied: pillow>=6.2.0 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (8.0.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.3 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (2.4.7)
Requirement already satisfied: certifi>=2020.06.20 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from matplotlib>=3.0->lifelines) (2020.6.20)
Collecting interface-meta>=1.2
Collecting astor
Requirement already satisfied: wrapt in c:\users\Mohan Rai\anaconda3\lib\site-packages (from formulaic<0.3,>=0.2.2->lifelines) (1.12.1)
Requirement already satisfied: pytz>=2017.2 in c:\users\Mohan Rai\anaconda3\lib\site-packages (from pandas>=0.23.0->lifelines) (2020.1)
Requirement already satisfied: six in c:\users\Mohan Rai\anaconda3\lib\site-packages (from cycler>=0.10->matplotlib>=3.0->lifelines) (1.15.0)
Building wheel for autograd (setup.py) ... done
Stored in directory: c:\users\Mohan Rai\appdata\local\pip\cache\wheels\85\f5\d2\3ef47d3a836b17620bf41647222825b065245862d12aa62885
Building wheel for autograd-gamma (setup.py) ... done
Stored in directory: c:\users\Mohan Rai\appdata\local\pip\cache\wheels\16\a2\b6\582cfdfbeeccd469504a01af3bb952fd9e7eccba40995eafea

(base) C:\Users\Mohan Rai>

```
```    # Lets plot Kaplan-Meier Curve
from lifelines import KaplanMeierFitter
durations = data['Patient_Age']
event_observed = data['Survival Status']
kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
kmf.plot()

``` Figure 1 : Plot of Kaplan Meier Curve

In above graph, y-axis represents the probability of the subject still having not experienced the event after time t, where y = 1 means the patient is alive and y = 0 means patient didn’t survived and x-axis represents the time t.

```    # creating two cohorts
groups = data['Axillary_nodes']
i1 = (groups >= 1)
i2 = (groups < 1)

kmf_2 = KaplanMeierFitter()
## fit the model for 1st cohort
kmf_2.fit(durations[i1], event_observed[i1], label='at least one positive axillary detected')
a = kmf_2.plot()

## fit the model for 2nd cohort
kmf_2.fit(durations[i2], event_observed[i2], label='no positive axillary nodes detected')
kmf_2.plot(ax=a)

``` Figure 2 : Plot of Fitting Kaplan on two cohorts

From the above curve, we can say that approximately before 45 years both curve are overlapped which means a patient has high probability to survive if he is less than 45 years irrespective of surgery and for greater than 45 years, those patients who undergone through at least one surgery are more likely to not die sooner than the others who have not done any surgery yet.

```    kmf_2.event_table

removed  observed  censored  entrance  at_risk
event_at
0.0             0         0         0       136      136
30.0            1         1         0         0      136
33.0            1         1         0         0      135
34.0            2         2         0         0      134
35.0            1         1         0         0      132
..             ..        ..        ..        ..       ..
..             ..        ..        ..        ..       ..
..             ..        ..        ..        ..       ..
..             ..        ..        ..        ..       ..
70.0            4         4         0         0       11
72.0            3         3         0         0        7
73.0            2         2         0         0        4
74.0            1         1         0         0        2
76.0            1         1         0         0        1

```
```    kmf.predict(45)

0.7091503267973858

```
```    # Creating Cox Proportional Hazards Model
from lifelines import CoxPHFitter
cph = CoxPHFitter()

# Fit the data to train the model
cph.fit(data, 'Patient_Age', event_col='Survival Status')

# Have a look at the significance of the features
cph.print_summary()

model       = lifelines.CoxPHFitter
duration col = 'Patient_Age'
event col = 'Survival Status'
baseline estimation = breslow
number of observations = 306
number of events observed = 306
partial log-likelihood = -1446.95
time fit was run = 2021-06-15 12:01:43 UTC

---
coef  exp(coef)   se(coef)   coef lower 95%   coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
covariate
Year_of_Operation  -0.02       0.98       0.02            -0.06             0.01                 0.94                 1.01
Axillary_nodes      0.01       1.01       0.01            -0.00             0.03                 1.00                 1.03

z    p   -log2(p)
covariate
Year_of_Operation -1.39 0.16       2.61
Axillary_nodes     1.70 0.09       3.47
---
Concordance = 0.53
Partial AIC = 2897.90
log-likelihood ratio test = 4.50 on 2 df
-log2(p) of ll-ratio test = 3.25

```
```    # Have a look at the significance of the features
cph.plot()

``` Figure 3 : Plot of Covariate Significance for Predicting Survival Risk

Above summary shows the significance of the covariates in predicting the survival risk. Both features play a significant role in predicting the survival. The large CI indicates that more data would be needed.

Now, let’s see the probability of survival at patient level.

```    patient = [50,100,280]
pred = data.iloc[patient,1:3]

```

As we have seen that only year of operation and axillary nodes are important for model so we have selected only these feature for prediction of survival probability. Let’s predict and visualize the probability graph.

```    cph.predict_survival_function(pred).plot()

``` Figure 4 : Plot of Survival Prediction Function

Figure 4, shows that patient number 280 has higher chance of survival than other two patients at any time t.

## Conclusion

In this article we have learnt about the theory of Survival Analysis Model and implemented it using python. We have seen survival probability using Kaplan-Meier estimator and Cox Proportional Hazards Regression Model and at last we have done prediction for random patients and seen their probability of survival at any time t. Check this interesting application of Human Pose Estimation Using OpenCV in Python. 