6 Sun

Introduction to Kaggle Machine Learning, Taught by Industry Practitioners

Introduction to Decision Trees

Decision tree

  • A method commonly used in data mining

  • The goal is to build a model that predicts the value of a target variable from several input variables

  • Advantages

    • The way the algorithm works is intuitive

    • Training is fast

    • Each feature is evaluated on its own, so normalization is not required

  • Disadvantages

    • Prone to overfitting

      • Especially as the tree gets deeper

  • Implementation

    • sklearn.tree.DecisionTreeClassifier

    • sklearn.tree.DecisionTreeRegressor
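The overfitting point above can be seen in a small experiment. The following is a minimal sketch, assuming synthetic noisy data from `make_classification` as a stand-in (it is not the Titanic data, and the depths 2 and 5 are arbitrary choices for illustration). It compares training and test accuracy as `max_depth` grows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary data (hypothetical stand-in, not the Titanic set).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth.
    results[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(depth, results[depth])
```

An unrestricted tree memorizes the training set, so its training accuracy reaches 1.0 even though a fifth of the labels are noise. A large gap between the two numbers in a pair is the usual sign of overfitting.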

Introduction to the Titanic Data

  • Data on the passengers aboard the Titanic when it sank in 1912

  • Binary classification

    • 1 : survived

    • 0 : died

  • Number of records : 891

  • Features

Categorical Columns & Numerical Columns

Categorical columns

  • Data whose values are limited to a fixed set, such as [1, 2, 3] or ["inside", "outside"]

  • [sex, embarked, class, who, adult_male, deck, embark_town, alive, alone]

Numerical columns

  • Data whose values can fall anywhere along the number line, such as 1, 2, 3, 5, ... or 1.2, 4.51, 3.1415, ...

  • [age, sibsp, parch, fare]

Handling Categorical Columns - LabelEncoder

LabelEncoder

  • Machine learning algorithms such as scikit-learn's decision trees cannot process string values directly, so strings must first be converted to numbers. The preprocessing.LabelEncoder class provided by scikit-learn maps each distinct value to an integer code.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
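The same encoder can be applied to a DataFrame column. The sketch below uses a hypothetical four-row `embarked` column. Because `classes_` is sorted, each label's integer code is its position in that sorted array, which the `mapping` dictionary makes explicit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for a string-valued feature.
df = pd.DataFrame({"embarked": ["S", "C", "S", "Q"]})

le = LabelEncoder()
# fit_transform learns the classes and encodes the column in one call.
df["embarked_code"] = le.fit_transform(df["embarked"])

# classes_ is sorted, so the code of a label is its index in classes_.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(df)
```

Keeping the mapping around makes the encoded table easier to read later, since the integers carry no meaning on their own.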

Predicting Titanic Survivors with a Decision Tree

Algorithm used

  • DecisionTreeClassifier

Additional techniques applied

  • EDA, Exploratory Data Analysis

  • Data cleansing, handling missing values

df.info()

  • Shows each column's dtype and the number of rows that hold a value (the non-null count)

Loading the data

  • seaborn provides the Titanic dataset through load_dataset, which fetches it from seaborn's online example-data repository and caches it locally.

  • titanic_df = sns.load_dataset('titanic')  (after import seaborn as sns)

PART 1. EDA

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
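Reading the non-null counts above against the 891 rows, age is missing 177 values, deck 688, and embarked and embark_town 2 each. A quick way to get those numbers directly is `isna()`. The sketch below uses a hypothetical four-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with gaps, mimicking the age/deck columns.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": ["C", None, None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing = df.isna().sum()         # number of missing cells per column
missing_ratio = df.isna().mean()  # fraction of rows missing per column
print(missing)
print(missing_ratio)
```

With the real data the same two lines would be run on `titanic_df`.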

Split the columns into a categorical list and a numerical list. (This cuts down on repeated code in the later analysis.)

  • Categorical data has values limited to a few classes, such as [1, 2, 3] or ["inside", "outside"].

  • Numerical data has values that can fall anywhere along the number line, such as 1, 2, 3, 5, ... or 1.2, 4.51, 3.1415.

categorical_cols = ["sex", "embarked", "class", "who", "adult_male", "deck", "embark_town", "alive", "alone"]
numerical_cols = ["age","sibsp","parch","fare"]
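The two lists above are written by hand. As a starting point they can also be derived from the dtypes, sketched here on a hypothetical three-row frame. This is only a first pass: a column such as pclass is stored as an integer but encodes a ticket class, so the result still needs a manual review.

```python
import pandas as pd

# Hypothetical toy frame with a mix of dtypes.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "alone": [False, False, True],
    "age": [22.0, 38.0, 26.0],
    "pclass": [3, 1, 3],
})

# Integer and float columns count as "number"; bool and string columns do not.
numeric_candidates = df.select_dtypes(include="number").columns.tolist()
other_candidates = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_candidates)
print(other_candidates)
```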

Looking at summary statistics

The .describe() method gives rough summary statistics for each numeric column (count, mean, standard deviation, the 25th/50th/75th percentiles, and the minimum and maximum).

titanic_df.describe()

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
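The 25%, 50% and 75% rows are quantiles: the values below which that share of the data falls. A minimal sketch on a hypothetical five-value series shows that `describe()` and `quantile()` agree.

```python
import pandas as pd

# Hypothetical ages, not taken from the full dataset.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="age")

summary = ages.describe()
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(summary)
print(quartiles)
```

The 50% row is the median, which is less sensitive to outliers than the mean. That matters for a skewed column such as fare, whose mean (32.2) sits well above its median (14.5) in the table above.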

# .value_counts() counts how many rows fall into each category of a column
for col in categorical_cols:
    print(col + " counts::")
    print(titanic_df.loc[:, col].value_counts())
    print()

sex counts::
male      577
female    314
Name: sex, dtype: int64

embarked counts::
S    644
C    168
Q     77
Name: embarked, dtype: int64

class counts::
Third     491
First     216
Second    184
Name: class, dtype: int64

who counts::
man      537
woman    271
child     83
Name: who, dtype: int64

adult_male counts::
True     537
False    354
Name: adult_male, dtype: int64

deck counts::
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64

embark_town counts::
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

alive counts::
no     549
yes    342
Name: alive, dtype: int64

alone counts::
True     537
False    354
Name: alone, dtype: int64

Looking at the distributions visually

Let's plot the distributions of the numerical columns. A boxplot shows the summary statistics, and a histplot shows the distribution.

  • Time to put loops to real use. This code uses a loop to draw several charts.

  • plt.subplots creates several canvases (axes), arranged as nrows × ncols.

  • Inside the for loop, seaborn draws a chart on each canvas (ax). figure refers to the picture as a whole.

import matplotlib.pyplot as plt
import seaborn as sns

# One boxplot per numerical column, side by side
figure, ax_list = plt.subplots(nrows=1, ncols=4)
figure.set_size_inches(12,5)

for i in range(4):
    col = numerical_cols[i]
    sns.boxplot(data=titanic_df, y=col, showfliers=True, ax=ax_list[i])
    ax_list[i].set_title(f"distribution  '{col}'")

# One histogram per numerical column
figure, ax_list = plt.subplots(nrows=1, ncols=4)
figure.set_size_inches(12,3)

for i in range(4):
    sns.histplot(data=titanic_df.loc[:, numerical_cols[i]], ax=ax_list[i])
    ax_list[i].set_title(f"distribution  '{numerical_cols[i]}'")

Now plot the distributions of the categorical columns. Because they are categorical, countplot can count the rows in each category.

  • There are 9 categorical columns, so draw one chart per cell of a 3x3 layout.

  • ax_list_list is a two-dimensional array of the form [[...], [...]] (a NumPy array of axes). To iterate over it with a single for loop, unroll it into one dimension.

  • Use NumPy's .reshape() method to unroll it so that the one-dimensional ax_list holds the 9 canvases (ax).

figure, ax_list_list = plt.subplots(nrows=3, ncols=3);
figure.set_size_inches(10,10)

ax_list = ax_list_list.reshape(9)  # reshape the multi-dimensional array into the desired shape
print(ax_list_list.shape)
print(ax_list.shape)

for i in range(len(categorical_cols)):
    col = categorical_cols[i]
    sns.countplot(data=titanic_df, x=col, ax=ax_list[i])
    ax_list[i].set_title(col)

plt.tight_layout()
(3, 3)
(9,)
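The unrolling step does not depend on matplotlib, so it can be tried on a plain NumPy array. In this sketch `grid` is a hypothetical stand-in for the 3x3 array of axes. It shows `reshape(9)` next to two common alternatives.

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)  # stand-in for the 3x3 array of axes

flat_a = grid.reshape(9)    # what the tutorial code does: length given explicitly
flat_b = grid.reshape(-1)   # -1 lets NumPy infer the length
flat_c = grid.flatten()     # always returns a copy

print(grid.shape, flat_a.shape, flat_b.shape, flat_c.shape)
```

`reshape(-1)` and `flatten()` keep working if the layout later changes to, say, 2x4, whereas `reshape(9)` would have to be edited by hand.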

Digging meaningful information out of the data

Strictly speaking, this goes beyond the scope of EDA. Still, you are probably curious about what affected passengers' survival. Let's form a few hypotheses and draw charts to see which factors influenced survival.

Sex and survival

sns.countplot(data=titanic_df, x='sex', hue='survived');
  • The hue argument groups the bars in the chart by a given column

Passenger class and survival

sns.countplot(data=titanic_df, x='pclass', hue='survived');

Charting the 9 categorical columns by survival

# Pass the 'survived' column as hue to split each categorical column into survived/died
figure, ax_list_list = plt.subplots(nrows=3, ncols=3);
figure.set_size_inches(10,10)

ax_list = ax_list_list.reshape(9)
print(ax_list_list.shape)
print(ax_list.shape)

for i in range(len(categorical_cols)):
    col = categorical_cols[i]
    sns.countplot(data=titanic_df, x=col, ax=ax_list[i], hue='survived')
    ax_list[i].set_title(col)

plt.tight_layout()
(3, 3)
(9,)
  • Women had a higher survival rate than men (by number of deaths: men > women > children)

  • Passengers whose port of embarkation (embarked) was C had a higher survival rate

  • Survival rate ranks first class > second class > third class

  • Passengers on decks B, D, and E had higher survival rates

  • Passengers travelling alone had a lower survival rate
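Observations read off a count plot can be checked numerically. Because survived is coded 0/1, the mean of that column within a group is the group's survival rate. The sketch below uses six hypothetical rows rather than the real data.

```python
import pandas as pd

# Hypothetical toy rows; with the real data, use titanic_df instead.
df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "survived": [0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group equals the share of 1s in that group.
rate_by_sex = df.groupby("sex")["survived"].mean()
print(rate_by_sex)
```

Swapping "sex" for another categorical column (for example pclass or alone) gives the rate for that grouping.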

Histogram of age by survival

sns.histplot(data=titanic_df, x='age', hue='survived', bins=30, alpha=0.3);

Boxplot of age by sex and passenger class

sns.boxplot(data=titanic_df, x='sex', y='age', hue='pclass');

Part 2. Predicting Titanic Survivors with a Decision Tree

Filling in missing values

titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
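# numerical value: fill missing ages with the mean age
# (assign the result back; calling fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas)
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
# categorical value: fill with the most frequent category
titanic_df['deck'] = titanic_df['deck'].fillna(titanic_df['deck'].describe()['top'])
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].describe()['top'])
titanic_df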

     survived  pclass     sex        age  sibsp  parch     fare  embarked   class    who  adult_male  deck  embark_town  alive  alone
0           0       3    male  22.000000      1      0   7.2500         S   Third    man        True     C  Southampton     no  False
1           1       1  female  38.000000      1      0  71.2833         C   First  woman       False     C    Cherbourg    yes  False
2           1       3  female  26.000000      0      0   7.9250         S   Third  woman       False     C  Southampton    yes   True
3           1       1  female  35.000000      1      0  53.1000         S   First  woman       False     C  Southampton    yes  False
4           0       3    male  35.000000      0      0   8.0500         S   Third    man        True     C  Southampton     no   True
..        ...     ...     ...        ...    ...    ...      ...       ...     ...    ...         ...   ...          ...    ...    ...
886         0       2    male  27.000000      0      0  13.0000         S  Second    man        True     C  Southampton     no   True
887         1       1  female  19.000000      0      0  30.0000         S   First  woman       False     B  Southampton    yes   True
888         0       3  female  29.699118      1      2  23.4500         S   Third  woman       False     C  Southampton     no  False
889         1       1    male  26.000000      0      0  30.0000         C   First    man        True     C    Cherbourg    yes   True
890         0       3    male  32.000000      0      0   7.7500         Q   Third    man        True     C   Queenstown     no   True
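Only 203 of the 891 deck values were present, so after the most-frequent fill most rows carry the single value C. An alternative that keeps "missing" visible is to give it its own label. The sketch below is an illustration on a hypothetical three-row frame, not the approach used in these notes. It shows the mean fill, a most-frequent fill via `mode()`, and an explicit "Unknown" label side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap per strategy.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "embarked": ["S", None, "S"],
    "deck": ["C", None, None],
})

# Numerical column: fill with the mean of the observed values.
df["age"] = df["age"].fillna(df["age"].mean())
# Categorical column with few gaps: fill with the most frequent value.
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])
# Categorical column with many gaps: mark missing as its own category.
df["deck"] = df["deck"].fillna("Unknown")
print(df)
```

For a column with the pandas category dtype, such as deck in the seaborn data, the new label would first have to be registered with `cat.add_categories`.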

Categorical data cannot be fed to the model as-is, so use sklearn's preprocessing module to convert the categorical columns to numbers.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# Re-fit the encoder on each column in turn and overwrite the column with its integer codes
for col in ['sex', 'adult_male', 'alone', 'embarked', 'deck', 'who']:
    titanic_df[col] = le.fit_transform(titanic_df[col])
titanic_df

     survived  pclass  sex        age  sibsp  parch     fare  embarked   class  who  adult_male  deck  embark_town  alive  alone
0           0       3    1  22.000000      1      0   7.2500         2   Third    1           1     2  Southampton     no      0
1           1       1    0  38.000000      1      0  71.2833         0   First    2           0     2    Cherbourg    yes      0
2           1       3    0  26.000000      0      0   7.9250         2   Third    2           0     2  Southampton    yes      1
3           1       1    0  35.000000      1      0  53.1000         2   First    2           0     2  Southampton    yes      0
4           0       3    1  35.000000      0      0   8.0500         2   Third    1           1     2  Southampton     no      1
..        ...     ...  ...        ...    ...    ...      ...       ...     ...  ...         ...   ...          ...    ...    ...
886         0       2    1  27.000000      0      0  13.0000         2  Second    1           1     2  Southampton     no      1
887         1       1    0  19.000000      0      0  30.0000         2   First    2           0     1  Southampton    yes      1
888         0       3    0  29.699118      1      2  23.4500         2   Third    2           0     2  Southampton     no      0
889         1       1    1  26.000000      0      0  30.0000         0   First    1           1     2    Cherbourg    yes      1
890         0       3    1  32.000000      0      0   7.7500         1   Third    1           1     2   Queenstown     no      1

891 rows × 15 columns
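Label encoding assigns integers in sorted order (C=0, Q=1, S=2 for embarked), and a tree splits on those integers as if they were ordered. One-hot encoding is a different technique that avoids implying an order. The sketch below shows it with `pd.get_dummies` on a hypothetical four-row column. It is an alternative for comparison, not what these notes use.

```python
import pandas as pd

# Hypothetical toy column with three unordered categories.
df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One 0/1 indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df, columns=["embarked"], dtype=int)
print(one_hot)
```

The trade-off is width: a column with k categories becomes k columns, which is fine for embarked but grows quickly for high-cardinality features.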

# drop duplicated columns
drop_cols = ["class", "embark_town", "alive"]
titanic_df = titanic_df.drop(drop_cols, axis=1)
titanic_df

     survived  pclass  sex        age  sibsp  parch     fare  embarked  who  adult_male  deck  alone
0           0       3    1  22.000000      1      0   7.2500         2    1           1     2      0
1           1       1    0  38.000000      1      0  71.2833         0    2           0     2      0
2           1       3    0  26.000000      0      0   7.9250         2    2           0     2      1
3           1       1    0  35.000000      1      0  53.1000         2    2           0     2      0
4           0       3    1  35.000000      0      0   8.0500         2    1           1     2      1
..        ...     ...  ...        ...    ...    ...      ...       ...  ...         ...   ...    ...
886         0       2    1  27.000000      0      0  13.0000         2    1           1     2      1
887         1       1    0  19.000000      0      0  30.0000         2    2           0     1      1
888         0       3    0  29.699118      1      2  23.4500         2    2           0     2      0
889         1       1    1  26.000000      0      0  30.0000         0    1           1     2      1
890         0       3    1  32.000000      0      0   7.7500         1    1           1     2      1

Preparing the training data

X = titanic_df.iloc[:,1:]
y = titanic_df['survived']

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split into 80% training data and 20% test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
y_pred = dt_clf.predict(X_test)

print('Prediction accuracy: %.2f' % accuracy_score(y_test, y_pred))

Prediction accuracy: 0.81
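The split above is drawn at random with no `random_state`, so the reported accuracy will shift from run to run. Cross-validation averages over several splits and gives a steadier estimate. The sketch below assumes synthetic data from `make_classification` as a stand-in for X and y, and `max_depth=4` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
# Five folds: each fold serves once as the held-out test set.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

With the Titanic features, passing the prepared X and y in place of the synthetic arrays would work the same way. Fixing `random_state` in `train_test_split` is the simpler option when a single reproducible split is enough.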
