6 Sun

Introduction to Kaggle Machine Learning, Taught by Industry Practitioners

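Introduction to Decision Trees

Decision Tree

A method commonly used in data mining
The goal is to build a model that predicts the value of a target variable from several input variables
Advantages
- The way the algorithm works is intuitive
- Training is fast
- Each feature is evaluated on its own, so no normalization is needed
Disadvantages
- Prone to overfitting
  - Especially as the tree gets deeper
Implementation
- sklearn.tree.DecisionTreeClassifier
- sklearn.tree.DecisionTreeRegressor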
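
The overfitting risk listed above can be seen by comparing training and test accuracy at different depths. The following is a minimal sketch, assuming synthetic data from make_classification (not the Titanic data) with 10% of the labels flipped at random; the sample sizes and depths are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for this sketch, not the Titanic set).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"max_depth={depth}: train={scores[depth][0]:.2f}, "
          f"test={scores[depth][1]:.2f}")
```

An unrestricted tree typically memorizes the training set, noise included, so its training accuracy says little about how it will do on unseen data. Limiting max_depth is one common way to rein this in.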

Introduction to the Titanic Disaster Data

Data on the passengers aboard the Titanic when it sank in 1912
Binary Classification
- 1: survived
- 0: died
Number of records: 891
Features

Categorical Columns & Numerical Columns

Categorical columns

Data whose values are limited to a fixed set, such as [1, 2, 3] or ["inside", "outside"]
[sex, embarked, class, who, adult_male, deck, embark_town, alive, alone]

Numerical columns

Data whose values can fall anywhere on the number line, such as 1, 2, 3, 5, ... or 1.2, 4.51, 3.1415, ...
[age, sibsp, parch, fare]
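
A first-pass split into the two groups can be derived from the column dtypes. This is a rough sketch on a small hand-made DataFrame (the values are illustrative, not loaded from the dataset). Note that dtype is only a heuristic: an integer column such as pclass would land in the numerical group even though it behaves like a category, so the result usually needs a manual review.

```python
import pandas as pd

# Small hand-made frame standing in for the real data (illustrative values).
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "alone": [False, False, True],
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.2833, 7.925],
})

# Numeric dtypes (int, float) go to the numerical list; booleans are not
# treated as numbers by select_dtypes, so they fall in the other group.
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(exclude="number").columns.tolist()

print("numerical:", numerical)
print("categorical:", categorical)
```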

Handling Categorical Columns - LabelEncoder

LabelEncoder

Most machine learning algorithms cannot process string values directly, so they must be converted to numbers first. The preprocessing.LabelEncoder class in scikit-learn converts string values to integer codes.

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
>>> le.inverse_transform([0, 0, 1, 2])
array([1, 1, 2, 6])

>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> list(le.classes_)
['amsterdam', 'paris', 'tokyo']
>>> le.transform(["tokyo", "tokyo", "paris"])
array([2, 2, 1]...)
>>> list(le.inverse_transform([2, 2, 1]))
['tokyo', 'tokyo', 'paris']
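
One behavior worth knowing before applying the encoder to real data: it only knows the labels it saw during fit, and it assigns codes in sorted order of those labels. The sketch below reuses the city names from the example above and adds "london" as a hypothetical unseen label.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["paris", "tokyo", "amsterdam"])

# Codes follow the sorted order of the fitted labels.
classes = list(le.classes_)

# A label that was not present during fit cannot be transformed.
unseen_error = None
try:
    le.transform(["london"])
except ValueError as err:
    unseen_error = err

print(classes)
print("unseen label rejected:", unseen_error is not None)
```

In practice this means the encoder should be fitted on data that contains every category that will appear later, or unseen values need to be handled before calling transform.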

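Predicting Titanic Survivors with a Decision Tree

Algorithm used

DecisionTreeClassifier

Additional techniques applied

EDA (Exploratory Data Analysis)
Data cleansing: handling missing values

df.info()

Shows each column's dtype and the number of rows that have a value (the non-null count)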

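Loading the data

seaborn's load_dataset function provides the Titanic dataset (it is downloaded on first use and then cached).
import seaborn as sns
titanic_df = sns.load_dataset('titanic')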

PART 1. EDA

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
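
The info() output shows non-null counts, so missing values have to be read off by subtraction (for example, 891 - 714 = 177 missing ages). A more direct view is to count the missing entries per column. Below is a small sketch on a hand-made DataFrame with illustrative values; on the real data the same two calls would be applied to titanic_df.

```python
import numpy as np
import pandas as pd

# Hand-made frame with a few gaps (illustrative values only).
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.2833, 7.925, 53.1],
})

missing_counts = df.isna().sum()   # number of missing entries per column
missing_ratio = df.isna().mean()   # share of missing entries per column

# Show only the columns that actually have gaps
print(missing_counts[missing_counts > 0])
```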

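Split the columns into a categorical list and a numerical list. (This cuts down on repeated code in the analysis that follows.)

Categorical data takes values from a limited set of classes, such as [1, 2, 3] or ["inside", "outside"].
Numerical data takes values that can fall anywhere on the number line, such as 1, 2, 3, 5, ... or 1.2, 4.51, 3.1415.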

categorical_cols = ["sex", "embarked", "class", "who", "adult_male", "deck", "embark_town", "alive", "alone"]

numerical_cols = ["age","sibsp","parch","fare"]

Examining summary statistics

The .describe() method shows rough summary statistics for each numerical column (count, mean, standard deviation, min/max, and the 25th/50th/75th percentiles).

titanic_df.describe()

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
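
The 25%/50%/75% rows are percentiles: the value below which that share of the data falls, with 50% being the median. A quick sketch on five made-up ages shows that describe() agrees with quantile() and median().

```python
import pandas as pd

# Five made-up ages, chosen so the percentiles fall exactly on data points.
ages = pd.Series([2.0, 20.0, 28.0, 38.0, 80.0], name="age")

summary = ages.describe()
q25 = ages.quantile(0.25)   # 25th percentile
q75 = ages.quantile(0.75)   # 75th percentile

print(summary)
print("25th percentile:", q25, "| median:", ages.median(), "| 75th percentile:", q75)
```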

# .value_counts() counts how many rows fall into each category of a column
for col in categorical_cols:
    print(col + " counts::")
    print(titanic_df.loc[:, col].value_counts())
    print()

sex counts::
male      577
female    314
Name: sex, dtype: int64

embarked counts::
S    644
C    168
Q     77
Name: embarked, dtype: int64

class counts::
Third     491
First     216
Second    184
Name: class, dtype: int64

who counts::
man      537
woman    271
child     83
Name: who, dtype: int64

adult_male counts::
True     537
False    354
Name: adult_male, dtype: int64

deck counts::
C    59
B    47
D    33
E    32
A    15
F    13
G     4
Name: deck, dtype: int64

embark_town counts::
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

alive counts::
no     549
yes    342
Name: alive, dtype: int64

alone counts::
True     537
False    354
Name: alone, dtype: int64
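
By default value_counts() leaves out missing values, which is why the embarked counts above add up to 889 rather than 891. The sketch below, on a made-up Series, shows two optional parameters that are often useful here: dropna=False to include the missing entries and normalize=True to get proportions instead of counts.

```python
import pandas as pd

# Made-up values with one missing entry.
embarked = pd.Series(["S", "C", "S", "Q", None, "S"], name="embarked")

counts = embarked.value_counts()                       # missing values excluded
counts_with_nan = embarked.value_counts(dropna=False)  # missing values counted too
shares = embarked.value_counts(normalize=True)         # proportions of non-missing rows

print(counts_with_nan)
print(shares)
```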

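Looking at the distributions visually

Let's plot the distributions of the numerical columns. A boxplot shows the summary statistics, and a histplot shows the distribution.

Now let's put loops to work: this code uses a loop to draw several charts.
plt.subplots creates a grid of axes to draw on (nrows × ncols).
Inside the for loop, seaborn draws a chart on each axes object (ax). figure refers to the whole figure.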

import matplotlib.pyplot as plt

# One boxplot per numerical column, side by side
figure, ax_list = plt.subplots(nrows=1, ncols=4)
figure.set_size_inches(12,5)

for i in range(4):
    col = numerical_cols[i]
    sns.boxplot(data=titanic_df, y=col, showfliers=True, ax=ax_list[i])
    ax_list[i].set_title(f"distribution '{col}'")

# One histogram per numerical column
figure, ax_list = plt.subplots(nrows=1, ncols=4)
figure.set_size_inches(12,3)

for i in range(4):
    sns.histplot(data=titanic_df.loc[:, numerical_cols[i]], ax=ax_list[i])
    ax_list[i].set_title(f"distribution '{numerical_cols[i]}'")

Now plot the distributions of the categorical columns. Since they are categorical, countplot can count the rows in each category.

There are 9 categorical columns, so we draw one chart per cell in a 3x3 layout.
ax_list_list is a 2D (3x3) NumPy array of axes. To loop over it with a single for loop, we flatten it to 1D.
The NumPy method .reshape() produces the 1D array ax_list holding the 9 axes (ax).

figure, ax_list_list = plt.subplots(nrows=3, ncols=3)
figure.set_size_inches(10,10)

ax_list = ax_list_list.reshape(9)  # reshape the 3x3 array into a 1D array of length 9
print(ax_list_list.shape)
print(ax_list.shape)

for i in range(len(categorical_cols)):
    col = categorical_cols[i]
    sns.countplot(data=titanic_df, x=col, ax=ax_list[i])
    ax_list[i].set_title(col)

plt.tight_layout()

(3, 3)
(9,)
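
The 3x3 layout fits exactly because there are 9 columns. When the number of columns does not fill the grid, a variation like the one below can help. It is a sketch under assumptions: the five column names are a hypothetical subset, the Agg backend is chosen only so the code runs without a display, and set_title stands in for the real plotting call.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

cols = ["sex", "embarked", "who", "deck", "alone"]  # hypothetical subset: 5 columns

figure, ax_grid = plt.subplots(nrows=2, ncols=3, figsize=(9, 6))
ax_flat = ax_grid.flatten()  # 2x3 array of axes -> 1D array of 6 axes

# zip pairs each axes object with a column and stops at the shorter sequence
for ax, col in zip(ax_flat, cols):
    ax.set_title(col)  # a real plot call, e.g. sns.countplot(..., ax=ax), would go here

# Hide the panels that were not used
for ax in ax_flat[len(cols):]:
    ax.set_visible(False)

figure.tight_layout()
print(ax_grid.shape, ax_flat.shape)
```

flatten() does the same job as reshape(9) here without hard-coding the number of cells.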

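Digging meaningful information out of the data

Strictly speaking, this goes beyond the scope of EDA. Still, aren't you curious about what affected whether a passenger survived? Let's set up a few hypotheses and plot charts to see which factors influence survival.

Sex and survival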

sns.countplot(data=titanic_df, x='sex', hue='survived');

The hue parameter splits each bar by the values of the given column, so the groups can be compared within one chart

Passenger class and survival

sns.countplot(data=titanic_df, x='pclass', hue='survived');

Plotting the 9 categorical columns by survival

# Pass the 'survived' column as hue to split each category into survived/died
figure, ax_list_list = plt.subplots(nrows=3, ncols=3)
figure.set_size_inches(10,10)

ax_list = ax_list_list.reshape(9)
print(ax_list_list.shape)
print(ax_list.shape)

for i in range(len(categorical_cols)):
    col = categorical_cols[i]
    sns.countplot(data=titanic_df, x=col, ax=ax_list[i], hue='survived')
    ax_list[i].set_title(col)

plt.tight_layout()

(3, 3)
(9,)

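Women survived at a higher rate than men (by the who column: women > children > men)
Passengers who embarked at C had a higher survival rate
Survival rate decreases in the order first class > second class > third class
Passengers on decks B, D, and E had higher survival rates
Passengers traveling alone had a lower survival rate

Histogram of age by survival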
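
Count plots show absolute numbers, so groups of different sizes are hard to compare by eye. Because survived is coded 0/1, the mean of survived within each group is that group's survival rate. The sketch below uses a tiny made-up table (not real passenger data) to show the calculation.

```python
import pandas as pd

# Tiny made-up table, only to illustrate the calculation.
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "male",
            "female", "female", "female", "female"],
    "survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column within each group = share of 1s = survival rate
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)
```

On the real data the same pattern would be, for example, titanic_df.groupby("who")["survived"].mean().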

sns.histplot(data=titanic_df, x='age', hue='survived', bins=30, alpha=0.3);

Boxplot of age by sex and passenger class

sns.boxplot(data=titanic_df, x='sex', y='age', hue='pclass');

PART 2. Predicting Titanic Survivors with a Decision Tree

Filling in missing values

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB

# numerical value: fill missing ages with the mean age
titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].mean())
# categorical values: fill with the most frequent value
titanic_df['deck'] = titanic_df['deck'].fillna(titanic_df['deck'].mode()[0])
titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])

titanic_df

(Output: the full DataFrame, rows 0 to 890, with the columns survived, pclass, sex, age, sibsp, parch, fare, embarked, class, who, adult_male, deck, embark_town, alive, and alone. Rows that were missing age, such as row 888, now show the mean, 29.699118.)

The model cannot work with categorical data directly, so we use sklearn's preprocessing module to convert the categorical data to numbers

from sklearn import preprocessing

# Encode each categorical column as integer codes (the encoder is refit for every column)
le = preprocessing.LabelEncoder()
for col in ['sex', 'adult_male', 'alone', 'embarked', 'deck', 'who']:
    titanic_df[col] = le.fit_transform(titanic_df[col])

titanic_df

(Output: the same DataFrame after encoding. sex, embarked, who, adult_male, deck, and alone now hold integer codes; class, embark_town, and alive still show their original labels.)

891 rows × 15 columns
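
Reusing a single encoder across columns means only the mapping of the last column survives. If the original labels are needed later (for example, to label a plot or interpret a tree split), one option is to keep one encoder per column. This is a sketch on a small made-up DataFrame; the encoders dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up frame standing in for the real data.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "S"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["sex", "embarked"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Each column's mapping can be reversed with its own encoder.
original_sex = list(encoders["sex"].inverse_transform(toy["sex"]))
print(toy)
print(original_sex)
```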

# drop columns that duplicate others: class (= pclass), embark_town (= embarked), alive (= survived, the target)
drop_cols = ["class", "embark_town", "alive"]
titanic_df = titanic_df.drop(drop_cols, axis=1)
titanic_df

(Output: the DataFrame with the remaining columns survived, pclass, sex, age, sibsp, parch, fare, embarked, who, adult_male, deck, and alone.)
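
Whether two columns really carry the same information can be checked before dropping one of them. The helper below is a hypothetical function written for this sketch (it is not part of pandas), and the toy rows are illustrative: it tests whether every value in one column always pairs with a single value in the other, in both directions.

```python
import pandas as pd

# Illustrative rows; 'class' is just another spelling of 'pclass' here.
toy = pd.DataFrame({
    "pclass": [3, 1, 3, 1, 2],
    "class": ["Third", "First", "Third", "First", "Second"],
    "fare": [7.25, 71.2833, 7.925, 53.1, 13.0],
})

def is_redundant(df, col_a, col_b):
    """Return True if col_a and col_b map one-to-one onto each other."""
    a_to_b = df.groupby(col_a)[col_b].nunique().le(1).all()
    b_to_a = df.groupby(col_b)[col_a].nunique().le(1).all()
    return bool(a_to_b and b_to_a)

print(is_redundant(toy, "pclass", "class"))  # same information, different labels
print(is_redundant(toy, "pclass", "fare"))   # fare varies within a class
```

Dropping alive matters most: it is the target under another name, so leaving it among the features would let the model read the answer directly.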

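Preparing the training data

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = titanic_df.iloc[:,1:]   # every column except the first ('survived')
y = titanic_df['survived']

# Use 80% as training data and 20% as test data (the split is random, so results vary between runs).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
y_pred = dt_clf.predict(X_test)

print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))

Accuracy: 0.81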
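
A single random 80/20 split gives an accuracy that shifts from run to run. One common way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from make_classification as a stand-in for X and y, and max_depth=4 is an arbitrary choice for illustration, not a tuned value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and target (an assumption).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# A depth limit keeps the tree from memorizing the training folds.
clf = DecisionTreeClassifier(max_depth=4, random_state=0)

# 5-fold cross-validation: train on 4 folds, score on the 5th, repeat 5 times
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print("fold accuracies:", scores.round(2))
print("mean accuracy: %.2f" % scores.mean())
```

With the Titanic data, the same call would take the X and y built above in place of the synthetic arrays.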
