Exploratory Data Analysis (EDA) of the smoking_health_data_final dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The necessary libraries have been imported.

df = pd.read_csv('smoking_health_data_final.csv')

Our dataset has been assigned to the variable df.

df.head()
   age   sex  current_smoker  heart_rate  blood_pressure  cigs_per_day   chol
0   54  male             yes          95          110/72           NaN  219.0
1   45  male             yes          64          121/72           NaN  248.0
2   58  male             yes          81        127.5/76           NaN  235.0
3   42  male             yes          90        122.5/80           NaN  225.0
4   42  male             yes          62          119/80           NaN  226.0

df.isna().sum()
age                0
sex                0
current_smoker     0
heart_rate         0
blood_pressure     0
cigs_per_day      14
chol               7
dtype: int64

Out of 3900 rows, only the cigs_per_day and chol columns contain missing values: 14 and 7, respectively.
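
To put those counts in proportion (14 of 3900 is about 0.36%, and 7 of 3900 is about 0.18%), the missing values can be reported as a percentage per column. The following is a minimal sketch; the `sample` frame is a small hypothetical stand-in for the real data, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 10 rows, two columns with gaps.
sample = pd.DataFrame({
    "age": [54, 45, 58, 42, 42, 61, 39, 50, 47, 66],
    "cigs_per_day": [np.nan, 20, 0, np.nan, 15, 0, 30, 0, 10, 0],
    "chol": [219, 248, np.nan, 225, 226, 260, 198, 240, 233, 251],
})

# Count of missing values per column and their share of all rows, in percent.
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": sample.isna().mean() * 100,
})
print(missing)
```

Applied to the real df, the same two expressions would show whether dropping the affected rows costs a meaningful share of the data.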

df = df.dropna()
df.count()
age               3879
sex               3879
current_smoker    3879
heart_rate        3879
blood_pressure    3879
cigs_per_day      3879
chol              3879
dtype: int64

The rows containing missing values have been removed from the dataset (21 rows in total).
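
Note that dropna() removes rows, not individual values, so the number of rows lost can be smaller than the number of missing cells when one row has several gaps. Here the two numbers coincide (14 + 7 = 21 missing values and 3900 - 3879 = 21 rows), which suggests that no row is missing both columns. A small sketch of how to check this, using a hypothetical `sample` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row at index 3 is missing both columns.
sample = pd.DataFrame({
    "cigs_per_day": [np.nan, 20, 0, np.nan, 15],
    "chol": [219, np.nan, 235, np.nan, 226],
})

n_missing_cells = int(sample.isna().sum().sum())        # individual NaN cells
rows_with_gap = int(sample.isna().any(axis=1).sum())    # rows with at least one NaN

cleaned = sample.dropna()
rows_dropped = len(sample) - len(cleaned)

print(n_missing_cells, rows_with_gap, rows_dropped)
```

In this toy example there are four missing cells but only three rows are dropped.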

df.describe()
               age   heart_rate  cigs_per_day         chol
count  3879.000000  3879.000000   3879.000000  3879.000000
mean     49.543181    75.699149      9.163702   236.629286
std       8.565955    12.023013     12.035201    44.413846
min      32.000000    44.000000      0.000000   113.000000
25%      42.000000    68.000000      0.000000   206.000000
50%      49.000000    75.000000      0.000000   234.000000
75%      56.000000    82.000000     20.000000   263.000000
max      70.000000   143.000000     70.000000   696.000000

The mean age in the dataset is about 50 (49.5). The youngest person is 32 years old, and the oldest is 70.

sns.set_style("whitegrid")  # set the style before plotting so it applies to this figure
sns.histplot(data=df, x="age")
plt.show()
df["age"].describe()
count    3879.000000
mean       49.543181
std         8.565955
min        32.000000
25%        42.000000
50%        49.000000
75%        56.000000
max        70.000000
dtype: float64

sns.boxplot(y=df["age"])
plt.title("Age Boxplot")
plt.show()

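The boxplot shows no outliers in the age column. The distribution also looks roughly symmetric (the mean of 49.5 is close to the median of 49), although this is a visual impression and not a formal test of normality.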
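
One quick way to back up the impression of a bell-shaped distribution is to look at skewness and excess kurtosis, both of which are close to 0 for normally distributed data. This is only a sketch: the `age` series below is simulated and hypothetical, whereas the real ages are whole numbers bounded between 32 and 70, so they can be approximately normal at best.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df["age"]: 2000 draws from a normal distribution.
rng = np.random.default_rng(0)
age = pd.Series(rng.normal(loc=49.5, scale=8.6, size=2000), name="age")

skewness = age.skew()                        # near 0 for a symmetric distribution
excess_kurtosis = age.kurt()                 # near 0 for normal-like tails
mean_median_gap = age.mean() - age.median()  # near 0 when mean and median agree

print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}, "
      f"mean - median={mean_median_gap:.3f}")
```

On the real data, `df["age"].skew()` and `df["age"].kurt()` would give the corresponding values.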

size = df['sex'].value_counts()
labels = size.index
plt.figure(figsize=(6,6))
plt.pie(size, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title("Gender Distribution")
plt.show()

We can see that 53.6% of our dataset is female, and 46.4% is male.
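
The percentages shown on the pie chart can also be obtained as numbers. A minimal sketch, assuming a hypothetical `sex` series in place of df["sex"]:

```python
import pandas as pd

# Hypothetical stand-in for df["sex"]: 5 women and 3 men.
sex = pd.Series(["female", "male", "female", "female",
                 "male", "female", "male", "female"])

# normalize=True returns fractions instead of counts; convert to percent.
share = sex.value_counts(normalize=True).mul(100).round(1)
print(share)
```

With the real column, the result should match the 53.6% and 46.4% labels on the chart.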

fig = plt.figure(figsize=(12,12))
plt.subplot(2,3,1)
sns.boxplot(y=df["heart_rate"])
plt.title("Heart Rate Boxplot")
plt.subplot(2,3,2)
sns.boxplot(y=df["cigs_per_day"])
plt.title("Cigs Per Day Boxplot")
plt.subplot(2,3,3)
sns.boxplot(y=df["chol"])
plt.title("Cholesterol Boxplot")
plt.subplots_adjust(hspace=0.5, wspace=0.4)
plt.show()

heart_rate

  • The median is 75 bpm.
  • The middle half of the values lies between 68 and 82 bpm.
  • There are some low outliers (down to 44 bpm) and some high outliers (up to 143 bpm).
  • The majority of values fall within normal limits.

cigs_per_day (cigarettes per day)

  • The median is 0, so at least half of the individuals do not smoke.
  • Most data points are in the 0–20 cigarettes/day range (the 75th percentile is 20).
  • There are a few extreme outliers: heavy smokers who report 50, 60, or even 70 cigarettes/day.

chol (cholesterol)

  • The median is 234 mg/dL.
  • The middle half of the values lies between 206 and 263 mg/dL.
  • There are very low outliers (down to 113 mg/dL) and very high outliers (up to 696 mg/dL).
  • Even the 25th percentile (206 mg/dL) is above the commonly cited normal limit of 200 mg/dL, so most individuals have elevated cholesterol.
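
The points drawn as outliers follow from the default boxplot rule: anything more than 1.5 times the interquartile range (IQR) below the 25th percentile or above the 75th percentile. The sketch below applies that rule to the quartiles, minimum and maximum copied from the df.describe() output; the helper name `tukey_fences` is a hypothetical name chosen for illustration.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits beyond which a boxplot marks outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# (25%, 75%, min, max) copied from the df.describe() output.
stats = {
    "age": (42, 56, 32, 70),
    "heart_rate": (68, 82, 44, 143),
    "cigs_per_day": (0, 20, 0, 70),
    "chol": (206, 263, 113, 696),
}

for column, (q1, q3, lowest, highest) in stats.items():
    lower, upper = tukey_fences(q1, q3)
    print(f"{column}: fences=({lower}, {upper}), "
          f"low outliers={lowest < lower}, high outliers={highest > upper}")
```

Under this rule the limits are 47 and 103 bpm for heart rate, 50 cigarettes per day for cigs_per_day (the lower limit is negative), and 120.5 and 348.5 mg/dL for cholesterol. For age the limits are 21 and 77, which is why the age boxplot shows no outliers.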

sns.histplot(data=df, x="cigs_per_day", hue="sex")
plt.title("Relationship between smoking intensity and gender")
plt.show()

We can see that the majority of individuals who smoke more than 20 cigarettes per day are male.
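
The same observation can be checked numerically with a cross-tabulation. This is a sketch under assumptions: `sample` and the `intensity` column are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female", "male", "female", "female"],
    "cigs_per_day": [30, 0, 25, 40, 21, 10, 0, 5],
})

# Label each person by smoking intensity, then compute the share of each sex per label.
sample["intensity"] = np.where(sample["cigs_per_day"] > 20, "more than 20", "20 or fewer")
share_by_intensity = pd.crosstab(sample["intensity"], sample["sex"], normalize="index")
print(share_by_intensity)
```

Each row of the result sums to 1, so the "more than 20" row gives the proportion of men and women among heavy smokers directly.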

non_smoker = df[df["cigs_per_day"] == 0]

We are selecting individuals who do not smoke at all.

smoker = df[df["cigs_per_day"] > 20]

We are selecting individuals who smoke more than 20 cigarettes per day.

fig = plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.histplot(data=non_smoker, x="chol", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Non-Smoker")
plt.subplot(2,2,2)
sns.histplot(data=smoker, x="chol", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Smokes more than 20 cigarettes a day")
plt.subplot(2,2,3)
sns.histplot(data=non_smoker, x="heart_rate", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Non-Smoker")
plt.subplot(2,2,4)
sns.histplot(data=smoker, x="heart_rate", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Smokes more than 20 cigarettes a day")
plt.subplots_adjust(hspace=0.5)
plt.show()

There is no noticeable difference in heart rate or cholesterol between individuals who smoke more than 20 cigarettes per day and those who do not smoke at all. This may be due to the small number of data points in the dataset.
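
Histograms with very different group sizes are hard to compare by eye, so a summary table of each group can help. The following is a minimal sketch on a hypothetical `sample` frame; the group names mirror the non_smoker and smoker selections used here.

```python
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "cigs_per_day": [0, 0, 0, 30, 25, 40, 10],
    "heart_rate":   [70, 76, 73, 75, 80, 70, 72],
    "chol":         [230, 240, 250, 235, 245, 240, 220],
})

groups = {
    "non_smoker": sample[sample["cigs_per_day"] == 0],
    "heavy_smoker": sample[sample["cigs_per_day"] > 20],
}

# One block of rows per group: count, mean and median of both measurements.
summary = pd.concat({
    name: group[["heart_rate", "chol"]].agg(["count", "mean", "median"])
    for name, group in groups.items()
})
print(summary)
```

The count rows show how many individuals each comparison rests on, which matters when judging whether a small difference in means is meaningful.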

fig = plt.figure(figsize=(12,10))
palette = {"female": "purple", "male": "cyan"}
# Half-open age ranges so that boundary ages (40, 50, 60) appear in exactly one panel; the last range includes 70.
plt.subplot(2,2,1)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] >= 30) & (df["age"] < 40)], errorbar=None, palette=palette)
plt.title("Ages 30 to 39")
plt.subplot(2,2,2)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] >= 40) & (df["age"] < 50)], errorbar=None, palette=palette)
plt.title("Ages 40 to 49")
plt.subplot(2,2,3)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] >= 50) & (df["age"] < 60)], errorbar=None, palette=palette)
plt.title("Ages 50 to 59")
plt.subplot(2,2,4)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] >= 60) & (df["age"] <= 70)], errorbar=None, palette=palette)
plt.title("Ages 60 to 70")
plt.subplots_adjust(hspace=0.5)
plt.show()

These bar plots show the average number of cigarettes smoked per day at each age, split by gender, within each age group. In particular, they suggest that between the ages of 50 and 70, cigarette consumption is concentrated among men.
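
The age groups can also be built once with pd.cut and summarized in a single table. This is a sketch, not the notebook's method: `sample`, `bins`, `labels` and the `age_group` column are hypothetical names. Half-open bins ensure that boundary ages such as 40 or 50 fall into exactly one group.

```python
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "age": [32, 39, 40, 45, 50, 55, 60, 70],
    "sex": ["male", "female", "male", "female", "male", "female", "male", "female"],
    "cigs_per_day": [20, 10, 30, 0, 15, 5, 10, 0],
})

# right=False makes each bin include its lower edge and exclude its upper edge,
# so the last edge is 71 in order to include age 70.
bins = [30, 40, 50, 60, 71]
labels = ["30-39", "40-49", "50-59", "60-70"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels, right=False)

# Mean cigarettes per day for every age group and sex.
mean_cigs = (
    sample.groupby(["age_group", "sex"], observed=True)["cigs_per_day"]
    .mean()
    .unstack()
)
print(mean_cigs)
```

A table like this makes the comparison between men and women in each age group explicit instead of reading it off bar heights.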

fig = plt.figure(figsize=(6,4))
sns.set_style("whitegrid")
sns.lineplot(x="age", y="cigs_per_day", data=df[(df["cigs_per_day"] > 0) & (df["age"] > 50)], errorbar=None)
plt.title("The relationship between age and cigarette smoking")
plt.show()

We can see that, among smokers older than 50, the average number of cigarettes smoked per day decreases with age.
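
By default, sns.lineplot draws the mean of y at each x value, so the line corresponds to a groupby on age. The sketch below reproduces that aggregation and fits a straight line to quantify the trend; the `sample` frame is hypothetical and the linear fit is an added illustration, not something the original analysis performs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "age":          [45, 51, 51, 52, 52, 53, 54, 55],
    "cigs_per_day": [40, 30, 20, 22,  0, 20, 18, 15],
})

# Same filter as in the plot: smokers only, older than 50.
older_smokers = sample[(sample["cigs_per_day"] > 0) & (sample["age"] > 50)]

# Mean consumption at each age, which is what the line plot draws.
mean_by_age = older_smokers.groupby("age")["cigs_per_day"].mean()

# Slope of a least-squares line through those means (cigarettes per day per year of age).
slope, intercept = np.polyfit(mean_by_age.index.to_numpy(), mean_by_age.to_numpy(), deg=1)
print(mean_by_age)
print(f"slope = {slope:.2f}")
```

A negative slope on the real data would support the reading that consumption falls with age in this subgroup.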

sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.title("Correlation Heatmap")
plt.show()

After examining the heatmap, we found that:

  • There is a positive correlation between age and cholesterol, meaning that older individuals in the dataset tend to have higher cholesterol.
  • There is a negative correlation between age and the number of cigarettes smoked per day, meaning that older individuals tend to smoke fewer cigarettes.
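
The individual coefficients behind the heatmap can be read directly from the correlation matrix. A minimal sketch, assuming a hypothetical `sample` frame in place of df:

```python
import pandas as pd

# Hypothetical stand-in for df; "sex" is non-numeric and is skipped by numeric_only=True.
sample = pd.DataFrame({
    "age": [35, 42, 49, 56, 63, 70],
    "sex": ["male", "female", "male", "female", "male", "female"],
    "cigs_per_day": [30, 20, 20, 10, 5, 0],
    "chol": [200, 215, 225, 240, 250, 270],
})

corr = sample.corr(numeric_only=True)  # Pearson correlation by default

# Correlation of every other numeric column with age, sorted from most negative to most positive.
age_corr = corr["age"].drop("age").sort_values()
print(age_corr)
```

Printing the values this way also shows how strong each relationship is, which matters because a correlation can have the expected sign and still be weak.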

Dataset source: https://www.kaggle.com/datasets/jaceprater/smokers-health-data/

Exploratory Data Analysis (EDA) by ali irfan doğan