Exploratory Data AnalysIs (EDA) of the smokIng_health_data_fInal dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

The necessary libraries have been imported.

df = pd.read_csv('smoking_health_data_final.csv')

Our dataset has been assigned to the variable df.

df.head()

LABEL	age	sex	current_smoker	heart_rate	blood_pressure	cigs_per_day	chol
0	54	male	yes	95	110/72	NaN	219.0
1	45	male	yes	64	121/72	NaN	248.0
2	58	male	yes	81	127.5/76	NaN	235.0
3	42	male	yes	90	122.5/80	NaN	225.0
4	42	male	yes	62	119/80	NaN	226.0

df.isna().sum()

LABEL	0
age	0
sex	0
current_smoker	0
heart_rate	0
blood_pressure	0
cigs_per_day	14
chol	7

dtype: int64

In the [cigs_per_day] and [chol] columns, out of 3900 values each, we have only 14 and 7 missing values, respectively.

df = df.dropna()

df.count()

LABEL	0
age	3879
sex	3879
current_smoker	3879
heart_rate	3879
blood_pressure	3879
cigs_per_day	3879
chol	3879

dtype: int64

The missing values have been removed from the dataset (a total of 21 values).

df.describe()

LABEL	age	heart_rate	cigs_per_day	chol
count	3879.000000	3879.000000	3879.000000	3879.000000
mean	49.543181	75.699149	9.163702	236.629286
std	8.565955	12.023013	12.035201	44.413846
min	32.000000	44.000000	0.000000	113.000000
25%	42.000000	68.000000	0.000000	206.000000
50%	49.000000	75.000000	0.000000	234.000000
75%	56.000000	82.000000	20.000000	263.000000
max	70.000000	143.000000	70.000000	696.000000

The average age in the dataset is 50. The youngest person is at least 32 years old, and the oldest is 70 years old.

sns.histplot(data=df, x="age")
sns.set_style("whitegrid")
plt.show()

df["age"].describe()

LABEL	age
count	3879.000000
mean	49.543181
std	8.565955
min	32.000000
25%	42.000000
50%	49.000000
75%	56.000000
max	70.000000

dtype: float64

sns.boxplot(y=df["age"])
plt.title("Age Boxplot")
plt.show()

There are no outlier values in the age column of our dataset. The age data appears to be normally distributed.

size = df['sex'].value_counts()
labels = size.index
plt.figure(figsize=(6,6))
plt.pie(size, labels=labels, autopct='%1.1f%%', startangle=90) plt.title("Gender Distribution")
plt.show()

We can see that 53.6% of our dataset is female, and 46.4% is male.

fig = plt.figure(figsize=(12,12))
plt.subplot(2,3,1)
sns.boxplot(y=df["heart_rate"]) plt.title("Heart Rate Boxplot") plt.subplot(2,3,2)
sns.boxplot(y=df["cigs_per_day"]) plt.title("Cigs Per Day Boxplot") plt.subplot(2,3,3)
sns.boxplot(y=df["chol"]) plt.title("Cholesterol Boxplot") plt.subplots_adjust(hspace=0.5, wspace=0.4)
plt.show()

heart_rate

The median is around 75.
Most values are clustered between 70–85.
There are some low outliers (around 40–50 bpm) and some high outliers (120–140 bpm).
The majority of values fall within normal limits.

cigs_per_day (cigarettes per day)

The median is close to 0 (very low in the middle of the box).
Most data points are in the 0–20 cigarettes/day range.
There are a few extreme outliers, e.g., people smoking 50, 60, or even 70 cigarettes/day.
A few individuals are heavy smokers.

chol (cholesterol)

The median is around 240.
Most values are in the 200–270 mg/dL range.
There are very low outliers (around 100) and very high outliers (600–700 mg/dL).
Normal limits (generally <200 mg/dL) are exceeded; many individuals have high cholesterol.

sns.histplot(data=df, x="cigs_per_day", hue="sex") plt.title("Relationship between smoking intensity and gender")
plt.show()

We can see that the majority of individuals who smoke more than 20 cigarettes per day are male.

non_smoker = df[df["cigs_per_day"] == 0]

We are selecting individuals who do not smoke at all.

smoker = df[df["cigs_per_day"] > 20]

We are selecting individuals who smoke more than 20 cigarettes per day.

fig = plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.histplot(data=non_smoker, x="chol", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Non-Smoker")
plt.subplot(2,2,2)
sns.histplot(data=smoker, x="chol", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Smokes more than 20 cigarettes a day")
plt.subplot(2,2,3)
sns.histplot(data=non_smoker, x="heart_rate", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Non-Smoker")
plt.subplot(2,2,4)
sns.histplot(data=smoker, x="heart_rate", hue="sex", palette={"female": "purple", "male": "cyan"})
plt.title("Smokes more than 20 cigarettes a day") plt.subplots_adjust(hspace=0.5)
plt.show()

There is no noticeable difference in heart rate and cholesterol between individuals who smoke 20 cigarettes per day and those who do not smoke at all. This may be due to the small number of data points in the dataset.

fig = plt.figure(figsize=(12,10))
plt.subplot(2,2,1)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] > 30) & (df["age"] < 40)], errorbar=None, palette={"female": "purple", "male": "cyan"})
plt.title("Between 30 and 40 years old")
plt.subplot(2,2,2)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] > 40) & (df["age"] < 50)], errorbar=None, palette={"female": "purple", "male": "cyan"})
plt.title("Between 40 and 50 years old")
plt.subplot(2,2,3)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] > 50) & (df["age"] < 60)], errorbar=None, palette={"female": "purple", "male": "cyan"})
plt.title("Between 50 and 60 years old")
plt.subplot(2,2,4)
sns.barplot(x="age", y="cigs_per_day", hue="sex", data=df[(df["age"] > 60) & (df["age"] < 70)], errorbar=None, palette={"female": "purple", "male": "cyan"})
plt.title("Between 60 and 70 years old")
plt.subplots_adjust(hspace=0.5)
plt.show()

These bar plots allow us to observe cigarette consumption across age groups in comparison with gender. In particular, we conclude that the majority of smokers between the ages of 50 and 70 are male.

fig = plt.figure(figsize=(6,4))
sns.set_style("whitegrid")
sns.lineplot(x="age", y="cigs_per_day", data=df[(df["cigs_per_day"] > 0) & (df["age"] > 50)], errorbar=None)
plt.title("The relationship between age and cigarette smoking") plt.show()

We can see that overall cigarette consumption decreases after the age of 50.

sns.heatmap(df.corr(numeric_only=True), annot=True) plt.title("Correlation Heatmap")
plt.show()

After examining the heatmap, we found that:

There is a positive correlation between age and cholesterol, meaning that as age increases, cholesterol also tends to increase.
There is a negative correlation between age and the number of cigarettes smoked per day, meaning that as age increases, the amount of cigarettes consumed decreases.

https://www.kaggle.com/datasets/jaceprater/smokers-health-data/

Exploratory Data AnalysIs (EDA) by

ali irfan doğan
github
linkedin

Exploratory Data AnalysIs (EDA) of the smokIng_health_data_fInal dataset

Exploratory Data AnalysIs (EDA) by

Bunu paylaş:

Yorum bırakın Cevabı iptal et