17 Thu
TIL
Let's master data through exploratory data analysis (EDA).
Prepare the libraries
Confirm the purpose of the analysis and the variables
Look over the data as a whole
Understand the individual attributes of the data
Focusing too much on methodology can lead you to confuse or forget the essential meaning of the data. EDA is an approach that draws insight from the data itself!
Titanic : Machine Learning from Disaster
Good data from an EDA perspective
I. Confirm the purpose of the analysis
What characteristics did the people who survived have?
II. Check the variables
There are 10 variables in total
Variable : column name
Definition : column information
Key : encoding

survival : 1 = survived, 0 = died
pclass : ticket class
sex : sex
age : age in years; fractional if less than 1, and estimated ages take the form xx.5
sibsp : number of siblings or spouses aboard the Titanic
parch : number of parents or children aboard the Titanic
ticket : ticket number
fare : fare
cabin : cabin number
embarked : port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
What kind of people were the survivors of the Titanic?
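A preview like the one that follows usually comes from loading the data and calling `head()`. The sketch below is a minimal illustration, not the original notebook: it rebuilds the five previewed rows by hand so that it runs without the CSV file. The variable name `titanic_df` and the `train.csv` file name mentioned in the comment are assumptions.

```python
import pandas as pd

columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
           "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# The five rows shown in this note's preview. With the real file at hand,
# something like pd.read_csv("train.csv") (assumed file name) would replace this.
rows = [
    (1, 0, 3, "Braund, Mr. Owen Harris", "male", 22.0, 1, 0,
     "A/5 21171", 7.2500, None, "S"),
    (2, 1, 1, "Cumings, Mrs. John Bradley (Florence Briggs Th...", "female",
     38.0, 1, 0, "PC 17599", 71.2833, "C85", "C"),
    (3, 1, 3, "Heikkinen, Miss. Laina", "female", 26.0, 0, 0,
     "STON/O2. 3101282", 7.9250, None, "S"),
    (4, 1, 1, "Futrelle, Mrs. Jacques Heath (Lily May Peel)", "female",
     35.0, 1, 0, "113803", 53.1000, "C123", "S"),
    (5, 0, 3, "Allen, Mr. William Henry", "male", 35.0, 0, 0,
     "373450", 8.0500, None, "S"),
]

titanic_df = pd.DataFrame(rows, columns=columns)

# head() returns the first five rows by default.
print(titanic_df.head())
```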
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Missing values may need to be filled in, removed, or handled with some other specific method.
Are the variables correlated with one another?
Are there NA (missing) values?
Is the data size adequate (can the results be generalized)?
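The size and missing-value checks in this list map onto a few pandas calls. The sketch below is only an illustration under assumptions: `sample` is a hypothetical five-row frame built from the previewed rows (a subset of columns), so its numbers will not match the full 891-row data.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample taken from the rows previewed in this note.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Data size: (number of rows, number of columns).
n_rows, n_cols = sample.shape

# Missing values: count of NA entries in each column.
missing_per_column = sample.isnull().sum()

# Summary statistics; by default only numeric columns are described,
# and "count" excludes missing values.
summary = sample.describe()

print(n_rows, n_cols)
print(missing_per_column)
print(summary)
```

Because `count` skips missing values, comparing it with the number of rows is a quick way to spot columns with gaps. In the full summary in this note, Age has a count of 714 against 891 rows, so 177 ages are missing.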
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
passenger id : does it carry much meaning?
survived : judging by the mean (0.38), more people died than I expected
pclass : does it carry much meaning?
age : a baby with min = 0.42 was aboard
sibsp : a large family with max = 8 was aboard
parch : a large family with max = 6 was aboard
fare : min = 0, max = 512. The mean is 32 but the max is 512, so the max is very likely an outlier. (outlier : a data point that lies far from the rest of the distribution)
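One common way to make the outlier suspicion about `fare` concrete is Tukey's 1.5 × IQR rule. The note itself does not name a method, so this is a swapped-in technique offered as a sketch. The quartiles and maximum are copied from the summary statistics in this note.

```python
# Fare quartiles and maximum, copied from the describe() output in this note.
q1 = 7.9104       # 25th percentile
q3 = 31.0         # 75th percentile
fare_max = 512.3292

# Tukey's rule (an assumption here, not the note's stated method):
# values above Q3 + 1.5 * IQR are flagged as potential outliers.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

is_outlier = fare_max > upper_fence
print(f"IQR = {iqr:.4f}, upper fence = {upper_fence:.4f}, max flagged: {is_outlier}")
```

Under this rule the upper fence sits at roughly 65.63, so the maximum fare of 512.33 is flagged by a wide margin.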
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.000000 | -0.369226 | 0.083081 | 0.018443 | -0.549500 |
| Age | 0.036847 | -0.077221 | -0.369226 | 1.000000 | -0.308247 | -0.189119 | 0.096067 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.308247 | 1.000000 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.189119 | 0.414838 | 1.000000 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.549500 | 0.096067 | 0.159651 | 0.216225 | 1.000000 |
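The main diagonal is always 1, because every variable is perfectly correlated with itself.
Fare and Pclass are negatively correlated (-0.55): the smaller the class number (that is, the higher the class), the higher the fare. Might a higher class also mean a higher survival rate?
Correlation is NOT causation.
Correlation : A goes up, B goes up, ... Causation : A -> B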
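A correlation matrix like the one above can be produced with `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below uses a hypothetical five-row `sample` built from the previewed rows, so its coefficients will differ from the full-data matrix. It only illustrates the two structural points noted here: a diagonal of 1 and the negative sign between Pclass and Fare.

```python
import pandas as pd

# Hypothetical five-row sample taken from the rows previewed in this note.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr()

# Each variable correlates perfectly with itself, so the diagonal is 1.
diagonal = [corr.loc[col, col] for col in sample.columns]

# A negative value means Fare tends to rise as the Pclass number falls.
pclass_fare = corr.loc["Pclass", "Fare"]

print(corr)
print(diagonal, pclass_fare)
```

A strong coefficient here still says nothing about which variable drives the other; it only measures how the two move together.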
What each feature is
What it means when particular data differ according to a particular column
Whether the attributes are matched appropriately (does any attribute need to be changed?)
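Count per class (number of passengers):

| Pclass | Survived |
|---|---|
| 1 | 216 |
| 2 | 184 |
| 3 | 491 |

Sum per class (number of survivors):

| Pclass | Survived |
|---|---|
| 1 | 136 |
| 2 | 87 |
| 3 | 119 |

Mean per class (survival rate):

| Pclass | Survived |
|---|---|
| 1 | 0.629630 |
| 2 | 0.472826 |
| 3 | 0.242363 |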
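The three per-class results above are what `count`, `sum`, and `mean` of `Survived` give when grouped by `Pclass`. The sketch below is an assumption about how such numbers can be obtained, not the note's original code. So that it runs without the CSV file, it rebuilds a two-column frame from the counts shown above, with non-survivors computed as count minus sum.

```python
import pandas as pd

# (Pclass, passengers, survivors), copied from the count and sum results above.
class_totals = [(1, 216, 136), (2, 184, 87), (3, 491, 119)]

pclass, survived = [], []
for cls, n_passengers, n_survived in class_totals:
    n_died = n_passengers - n_survived
    pclass += [cls] * n_passengers
    survived += [1] * n_survived + [0] * n_died

df = pd.DataFrame({"Pclass": pclass, "Survived": survived})

by_class = df.groupby("Pclass")["Survived"]
passengers = by_class.count()     # rows per class
survivors = by_class.sum()        # Survived is 0/1, so the sum counts survivors
survival_rate = by_class.mean()   # mean of a 0/1 column is the survival rate

print(pd.DataFrame({"count": passengers, "sum": survivors, "mean": survival_rate}))
```

Because `Survived` is coded 0/1, the mean is simply sum divided by count, for example 136 / 216 ≈ 0.6296 for first class.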
Reminder : there are missing values!
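As a reminder of the two simplest options for missing values, removing rows or filling them in, here is a minimal sketch. The `cabins` frame is a hypothetical sample built from the previewed rows, and the `"Unknown"` placeholder is an illustrative choice, not something this note prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: Cabin values of the five previewed passengers.
cabins = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Option 1: remove rows whose Cabin is missing.
dropped = cabins.dropna(subset=["Cabin"])

# Option 2: keep every row and fill the gaps with a placeholder value.
filled = cabins.fillna({"Cabin": "Unknown"})

print(len(cabins), len(dropped), len(filled))
```

Dropping keeps only complete rows at the cost of sample size, while filling keeps every row but adds values that were never observed. Which trade-off fits depends on the column and the analysis.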