17 Thu
TIL
Programmers AI School, Cohort 1
Week 3, DAY 4
4. Exploratory Data Analysis
Let's master data through exploratory data analysis. with Titanic Data
Library setup
Checking the analysis goal and variables
Looking over the data as a whole
Understanding the individual attributes of the data
If you focus too much on methodology, you can confuse or forget what the data essentially means. EDA is an approach that draws insight from the data itself!
Titanic : Machine Learning from Disaster
Good data from an EDA perspective
1. Checking the analysis goal and variables
I. Checking the analysis goal
What characteristics did the people who survived have?
II. Checking the variables
There are 10 variables in total
Variable : col name
Definition : col information
Key : encoding
survival : 1 = survived, 0 = died
pclass : ticket class
sex : sex
age : age in years; fractional if less than 1, and of the form xx.5 if estimated
sibsp : siblings or spouses aboard the Titanic
parch : parents or children aboard the Titanic
ticket : ticket number
fare : fare
cabin : cabin number
embarked : port of embarkation
C = Cherbourg, Q = Queenstown, S = Southampton
0. Library setup
## Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
## Jupyter/IPython magic: render plots inline (notebook only)
%matplotlib inline
## Assumes "train.csv" is in the same directory
## Load the data
titanic_df = pd.read_csv("./train.csv")
1. Checking the analysis goal and variables
What kind of people were the survivors of the Titanic?
## Check the first 5 rows
titanic_df.head(5)
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
Missing values may need to be filled in, removed, or handled in some other specific way.
## Check the data type of each column
titanic_df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
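The note above says missing values may have to be filled in or removed. As a minimal sketch of those two options, assuming a small made-up frame rather than the real train.csv (the values, and the choice of median and mode as fill values, are illustrative assumptions and not a method prescribed by these notes):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not taken from train.csv).
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy_df.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = toy_df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(len(dropped), "rows left after dropna()")
print(filled)
```

Dropping is simple but discards whole rows; filling keeps the rows at the cost of inventing values, so which one fits depends on how many values are missing in each column.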
2. Looking over the data as a whole
Are there correlations between variables?
Are there NA (missing) values?
Is the data size adequate (can we generalize)?
## Function that summarizes the whole dataset : .describe()
titanic_df.describe() # summarizes numeric columns only (note that Cabin and Embarked do not appear)
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
passenger id : probably not very meaningful?
survived : judging by the mean, more people died than expected
pclass : probably not very meaningful?
age : a baby with min = 0.42 was aboard
sibsp : a large family with max = 8 was aboard
parch : a large family with max = 6 was aboard
fare : min = 0, max = 512 => the mean is 32 but the max is 512, so the max is very likely an outlier (outlier : a data point that lies far from the rest of the distribution)
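One way to back up the outlier suspicion about Fare is the 1.5 x IQR rule (Tukey's fences). This is a swapped-in check that the notes themselves do not use; the sketch below only plugs in the quartiles and maximum printed by describe() above, and the variable names are hypothetical.

```python
# Quartiles and maximum of Fare, copied from the describe() output above.
q1, q3, fare_max = 7.9104, 31.0, 512.3292

iqr = q3 - q1                 # interquartile range
upper_fence = q3 + 1.5 * iqr  # values above this count as outliers under the rule

is_outlier = fare_max > upper_fence
print(f"upper fence = {upper_fence:.4f}, max fare flagged: {is_outlier}")
```

Under this rule the maximum fare sits far beyond the fence, which is consistent with the note above. The 1.5 multiplier is a convention, not a law, so treat the flag as a prompt to inspect the rows rather than a reason to delete them.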
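## Check the correlation coefficients!
titanic_df.corr(numeric_only=True) # numeric columns only; recent pandas versions raise an error on text columns otherwise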
             PassengerId   Survived     Pclass        Age      SibSp      Parch       Fare
PassengerId     1.000000  -0.005007  -0.035144   0.036847  -0.057527  -0.001652   0.012658
Survived       -0.005007   1.000000  -0.338481  -0.077221  -0.035322   0.081629   0.257307
Pclass         -0.035144  -0.338481   1.000000  -0.369226   0.083081   0.018443  -0.549500
Age             0.036847  -0.077221  -0.369226   1.000000  -0.308247  -0.189119   0.096067
SibSp          -0.057527  -0.035322   0.083081  -0.308247   1.000000   0.414838   0.159651
Parch          -0.001652   0.081629   0.018443  -0.189119   0.414838   1.000000   0.216225
Fare            0.012658   0.257307  -0.549500   0.096067   0.159651   0.216225   1.000000
The main diagonal is always 1.
Fare and Pclass move in opposite directions (correlation of about -0.55). Wouldn't a higher class mean a higher survival rate?
Note: Correlation is NOT Causation
Correlation : A up, B up, ... Causation : A -> B
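To make concrete what each coefficient in the table measures, here is a minimal sketch of Pearson's r (the default method of `DataFrame.corr`) in plain Python. The helper name `pearson_r` and the sample lists are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a constant factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: y rises with x, z falls as x rises.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
z = [10, 8, 6, 4, 2]

r_xx = pearson_r(x, x)  # a variable against itself: the diagonal of the matrix
r_xy = pearson_r(x, y)  # moves together: positive
r_xz = pearson_r(x, z)  # moves in opposite directions: negative
print(r_xx, r_xy, r_xz)
```

The formula is symmetric in its two arguments, so it cannot say which variable drives the other. That is one way to see why a coefficient alone never establishes causation.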
## Check missing values
## We can see which fields are empty, and the emptiness itself can carry meaning
titanic_df.isnull().sum()
# Missing values found in Age, Cabin, and Embarked!
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
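Raw counts are easier to judge as a share of all rows. A minimal sketch, using only the counts printed above and the 891 rows reported by describe(); the dictionary and variable names are hypothetical:

```python
# Missing-value counts and total row count, copied from the outputs above.
n_rows = 891
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# Share of rows with a missing value, per column.
missing_ratio = {col: count / n_rows for col, count in missing_counts.items()}

for col, ratio in missing_ratio.items():
    print(f"{col}: {ratio:.1%} missing")
```

On the real frame the same ratios should come out of `titanic_df.isnull().mean()`, since the mean of a boolean column is the fraction of True values. Cabin is missing for most passengers, while Embarked is missing for only two.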
3. Understanding the individual attributes of the data
What each feature is
How the meaning of a given value changes depending on a given column
Whether attributes are mapped appropriately (whether an attribute needs to be converted)
I. Survived Column
## How many survivors and how many deaths?
# titanic_df['Survived'].sum()
# counts only the survivors (rows where Survived == 1)
# full breakdown
titanic_df['Survived'].value_counts()
0 549
1 342
Name: Survived, dtype: int64
## Draw the survivor and death counts as a bar plot : sns.countplot()
sns.countplot(x='Survived', data=titanic_df) # plots the count for each category
plt.show()

II. Pclass
# Number of passengers per Pclass
titanic_df[['Pclass', 'Survived']].groupby(['Pclass']).count()
        Survived
Pclass
1            216
2            184
3            491
# Number of survivors?
titanic_df[['Pclass', 'Survived']].groupby(['Pclass']).sum()
        Survived
Pclass
1            136
2             87
3            119
# Survival rate?
titanic_df[['Pclass', 'Survived']].groupby(['Pclass']).mean() # sum / count
        Survived
Pclass
1       0.629630
2       0.472826
3       0.242363
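The `# sum / count` comment works because Survived is coded 0/1, so its mean within a group is the share of survivors (for example, 136 / 216 ≈ 0.629630 for first class). A minimal sketch on made-up rows, not the real train.csv, that checks both routes agree:

```python
import pandas as pd

# Hypothetical toy data (not taken from train.csv).
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

grouped = toy_df.groupby("Pclass")["Survived"]
rate_from_mean = grouped.mean()                    # direct route
rate_from_ratio = grouped.sum() / grouped.count()  # survivors / passengers

# Both routes give the same per-class survival rate.
max_gap = (rate_from_mean - rate_from_ratio).abs().max()
print(rate_from_mean)
print("largest difference:", max_gap)
```

This shortcut only holds for a 0/1 column; for any other coding, the mean would not be a rate.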
# Use a heatmap
sns.heatmap(titanic_df[['Pclass', 'Survived']].groupby(['Pclass']).mean())
plt.show()

III. Sex
titanic_df.groupby(['Sex', 'Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
# sns.catplot
sns.catplot(x='Sex', col='Survived', kind='count', data=titanic_df)
plt.show()

IV. Age
Reminder : Age has missing values!
titanic_df.describe()['Age']
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
## Tendency of Age for Survived 1 vs. 0
## figure (canvas) -> axes (frame) -> plot (graph)
fig, ax = plt.subplots(1, 1, figsize=(10, 5)) # 1 row, 1 column, figsize
sns.kdeplot(x=titanic_df[titanic_df.Survived == 1]['Age'], ax=ax)
sns.kdeplot(x=titanic_df[titanic_df.Survived == 0]['Age'], ax=ax)
plt.legend(['Survived', 'Dead'])
plt.show()

Appendix I. Sex + Pclass vs Survived
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=titanic_df)
plt.show()

Appendix II. Age + Pclass
## Age graph with Pclass
titanic_df['Age'][titanic_df.Pclass == 1].plot(kind='kde')
titanic_df['Age'][titanic_df.Pclass == 2].plot(kind='kde')
titanic_df['Age'][titanic_df.Pclass == 3].plot(kind='kde')
plt.legend(['1st class', '2nd class', '3rd class'])
plt.show()
