Heart Disease Prediction Using Support Vector Machine (SVM)

Using the Heart Disease Cleveland Dataset from UCI

UTSAV VORA
5 min read · Feb 25, 2021

Introduction

Heart diseases, also called cardiovascular diseases, are the leading cause of death in all areas of the world except Africa. According to the WHO, they are a group of disorders of the heart and blood vessels that includes coronary heart disease, cerebrovascular disease, rheumatic heart disease, and others. Together these diseases resulted in around 17.9 million deaths (31% of all global deaths) in 2016, up from 12.3 million deaths (25.8% of all global deaths) in 1990.

In recent years, machine learning has grown rapidly as an emerging technology. It is now applied almost everywhere, including image recognition, speech recognition, traffic recognition, self-driving cars, and many more.

“Machine intelligence is the last invention that humanity will ever need to make.” - Nick Bostrom

In this article, I will apply the Support Vector Machine, one of the most popular machine learning techniques, to predict whether a person is suffering from heart disease, using the Cleveland dataset available here.

This project was done entirely in Google Colab using the Python programming language. The GitHub repository is available here.

Exploratory Data Analysis (EDA)

The Heart Disease Cleveland Dataset is one of the most famous datasets. It contains 76 attributes, but most published experiments use a subset of 14 of them. The 14 attributes are:

  1. age: in years
  2. sex: gender (1 = male; 0 = female)
  3. cp: chest pain type
    — Value 0: typical angina
    — Value 1: atypical angina
    — Value 2: non-anginal pain
    — Value 3: asymptomatic
  4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  7. restecg: resting electrocardiographic results
    — Value 0: normal
    — Value 1: having ST-T wave abnormality
    — Value 2: showing probable or definite left ventricular hypertrophy
  8. thalach: maximum heart rate achieved
  9. exang: exercise-induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment
    — Value 0: upsloping
    — Value 1: flat
    — Value 2: downsloping
  12. ca: number of major vessels (0–3) colored by fluoroscopy
  13. thal: 0 = normal; 1 = fixed defect; 2 = reversible defect
    and the label
  14. target: (1 = disease; 0 = no disease)

Let’s look at the values for each attribute by loading the data with the read_csv function of the pandas library.
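The original notebook output is not reproduced here, so the following is only a minimal sketch of the loading step. The three rows are made-up values in the 14-column layout listed above, and the file name heart.csv mentioned in the comment is an assumption, not something taken from the project.

```python
import io

import pandas as pd

# Three made-up rows in the 14-column layout described above (not real patient data).
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
45,0,1,130,204,0,0,172,0,1.4,2,0,2,1
58,1,0,128,259,0,1,130,1,3.0,1,2,2,0
"""

# With a real file this would be something like pd.read_csv("heart.csv")
# (file name assumed); StringIO keeps the sketch self-contained.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each attribute
```

Calling df.describe() afterwards gives a quick numeric summary of each column.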

Before proceeding with data analysis, we have to perform data preprocessing. First, we convert all the non-numeric data in the target attribute into numeric form using the get_dummies function of the pandas library.
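As a rough illustration of this conversion, the sketch below encodes a string-valued label column with get_dummies. The labels "absent" and "present", the small frame, and the renaming step are all assumptions made for the example; the actual raw labels in the project may differ.

```python
import pandas as pd

# Hypothetical frame whose label column holds strings instead of 0/1.
df = pd.DataFrame({
    "age": [63, 45, 58, 51],
    "target": ["present", "absent", "present", "absent"],
})

# get_dummies creates one indicator column per category. drop_first=True
# drops the first category ("absent"), leaving a single 0/1 column that is
# 1 where the label was "present".
encoded = pd.get_dummies(df, columns=["target"], drop_first=True, dtype=int)
encoded = encoded.rename(columns={"target_present": "target"})

print(encoded)
```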

Then, to check for null values (if any) in the dataset, we use the isnull function of the pandas library.
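A minimal sketch of the null check, using a tiny made-up frame with two deliberately missing values so that the counts are not all zero:

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values, for illustration only.
df = pd.DataFrame({
    "age": [63, 45, np.nan, 51],
    "chol": [233, 204, 259, np.nan],
    "target": [1, 1, 0, 0],
})

# isnull() marks each cell as True/False; sum() counts the True cells per column.
null_counts = df.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole dataset.
has_nulls = bool(df.isnull().values.any())
print(has_nulls)
```

If any column reports a non-zero count, the affected rows can be dropped with dropna() or filled with fillna() before modelling.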

Now, let us see at what ages heart disease is most often found. Here target = 1 indicates a person with heart disease and target = 0 indicates a person without it.

It can be seen that most people aged between 55 and 65 are suffering from heart disease. This is not surprising, since people in this age range often have problems such as high cholesterol, high blood pressure, and diabetes, which are among the prominent risk factors for heart disease.
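The original chart is not available here, but the same kind of age comparison can be sketched as a table. The ages and labels below are toy values chosen for illustration, and the 10-year age bands are an assumption rather than the binning used in the project.

```python
import pandas as pd

# Toy data for illustration only (not taken from the Cleveland dataset).
df = pd.DataFrame({
    "age":    [34, 41, 48, 52, 56, 58, 60, 62, 64, 67, 70, 57],
    "target": [0,  0,  1,  0,  1,  1,  1,  0,  1,  0,  1,  1],
})

# Group ages into 10-year bands; each band includes its upper edge.
bins = [25, 35, 45, 55, 65, 75]
labels = ["25-35", "35-45", "45-55", "55-65", "65-75"]
df["age_band"] = pd.cut(df["age"], bins=bins, labels=labels)

# Count people with (1) and without (0) heart disease in each band.
counts = pd.crosstab(df["age_band"], df["target"])
print(counts)
```

On the real dataset, the row with the largest count in the target = 1 column shows the age band where the disease is found most often.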

In the case of sex, heart disease is generally found more often in males than in females. Several reasons have been suggested for this, such as higher levels of anger or hostility in men and low testosterone levels in men.

Training and Testing Sets

Now let us divide the dataset into training and testing sets. In this project, I have split the dataset in a 70:30 ratio, that is, the training set contains 70% of the data and the testing set contains 30%.

First, we divide the attributes of the dataset into X and y, where y contains the target attribute to be predicted and X contains the attributes that describe the characteristics of heart disease. One may use all the attributes or only a selected subset; in this project, I have used all 13 attributes as X and the target attribute as y.

Finally, to split the dataset, we use the train_test_split function from sklearn.model_selection.
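The split described above can be sketched as follows. Random arrays stand in for the real data, and the row count of 303 and the random_state value are assumptions made for the example. With a DataFrame, X and y would typically be obtained as shown in the comments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 303 rows and 13 feature columns of random numbers
# (assumed sizes, not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))
y = rng.integers(0, 2, size=303)

# With a pandas DataFrame df, the equivalent would be:
#   X = df.drop(columns="target")
#   y = df["target"]

# test_size=0.3 gives the 70:30 split; random_state (assumed value)
# makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```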

Support Vector Machine

The Support Vector Machine (SVM) algorithm, a popular supervised learning tool, is one of the most robust prediction methods and can assist in both classification and regression problems.

Now that we have divided the dataset, let’s apply SVM and build our prediction model.

First, we import the SVM classifier available in sklearn and fit the model. To get better results, however, we must tune the parameters of the SVM, such as C and gamma.
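A minimal sketch of this first, untuned fit is shown below. It assumes the classifier in question is sklearn.svm.SVC with its default settings, and it uses synthetic data from make_classification in place of the heart disease data, so the printed accuracy says nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 13 features (not the Cleveland dataset).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# SVC with default parameters: RBF kernel, C=1.0, gamma="scale".
model = SVC()
model.fit(X_train, y_train)

# Predict labels for the unseen test rows and report accuracy.
y_pred = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

SVMs are sensitive to feature scale, so on real data it is common to standardize the features first, for example with StandardScaler. Whether the original project did this is not stated.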

Here, we have tuned only C and gamma and stored their candidate values in param_grid. After this, we apply grid search and fit the model again.
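The tuning step could look roughly like the sketch below, assuming GridSearchCV from sklearn.model_selection is the grid search being used. The candidate values in param_grid, the 5-fold cross-validation, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 13 features (not the Cleveland dataset).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate values for C and gamma (assumed; the article's grid is not shown).
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
}

# Try every C/gamma combination with 5-fold cross-validation on the
# training set, then refit the best combination on all training rows.
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)

# The fitted grid object predicts with the best estimator found.
y_pred_tuned = grid.predict(X_test)
```

A larger C penalizes misclassified training points more heavily, while a larger gamma makes the RBF kernel more local, so the two are usually tuned together.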

Results

Below are the confusion matrices and classification reports before and after applying parameter tuning.

Before Parameter Tuning
After Parameter Tuning
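Reports of this kind can be produced with the metrics functions in sklearn. The sketch below uses six hand-written labels purely to show the calls; they are not the project's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written example labels (hypothetical, for illustration only).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall, and F1-score for each class.
print(classification_report(y_true, y_pred))

# Overall fraction of correct predictions.
acc = accuracy_score(y_true, y_pred)
print(acc)
```

Running the same three calls once with the untuned model's predictions and once with the tuned model's predictions gives the before and after comparison.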

It can be inferred from the above results that the SVM with parameter tuning performs considerably better, reaching an accuracy of 82%.

Conclusion

Heart diseases are among the major health problems in today’s society, so it is important to detect and treat them as early as possible.

Manual assessment alone may not always be enough, and therefore machine learning techniques can be highly useful.

UTSAV VORA

I am an aspiring computer science engineer with a keen interest in learning about emerging technologies.