DAY 1 : EDA

210823

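When working with any dataset, you need a good grasp of its distribution and statistical characteristics, because these directly affect model performance. Let's take a look at them here.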

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as img
import cv2
import os
# Load the training metadata CSV and display it
data = pd.read_csv('./input/data/train/train.csv')
data

          id  gender   race  age                    path
0     000001  female  Asian   45  000001_female_Asian_45
1     000002  female  Asian   52  000002_female_Asian_52
2     000004    male  Asian   54    000004_male_Asian_54
3     000005  female  Asian   58  000005_female_Asian_58
4     000006  female  Asian   59  000006_female_Asian_59
...      ...     ...    ...  ...                     ...
2695  006954    male  Asian   19    006954_male_Asian_19
2696  006955    male  Asian   19    006955_male_Asian_19
2697  006956    male  Asian   19    006956_male_Asian_19
2698  006957    male  Asian   20    006957_male_Asian_20
2699  006959    male  Asian   19    006959_male_Asian_19

2700 rows × 5 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700 entries, 0 to 2699
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2700 non-null   object
 1   gender  2700 non-null   object
 2   race    2700 non-null   object
 3   age     2700 non-null   int64 
 4   path    2700 non-null   object
dtypes: int64(1), object(4)
memory usage: 105.6+ KB
data.describe(include='all')

            id  gender   race          age                  path
count     2700    2700   2700  2700.000000                  2700
unique    2699       2      1          NaN                  2700
top     003397  female  Asian          NaN  006729_male_Asian_19
freq         2    1658   2700          NaN                     1
mean       NaN     NaN    NaN    37.708148                   NaN
std        NaN     NaN    NaN    16.985904                   NaN
min        NaN     NaN    NaN    18.000000                   NaN
25%        NaN     NaN    NaN    20.000000                   NaN
50%        NaN     NaN    NaN    36.000000                   NaN
75%        NaN     NaN    NaN    55.000000                   NaN
max        NaN     NaN    NaN    60.000000                   NaN
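
One thing stands out in the summary: `id` has 2700 rows but only 2699 unique values, and the top value `003397` has a frequency of 2, so one id appears twice. A minimal sketch of how such rows could be pulled out for inspection follows. The small `sample` frame and the helper name `find_duplicate_ids` are made up for illustration; on the real data you would pass `data` instead.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the real file is not reproduced here.
sample = pd.DataFrame({
    "id": ["000001", "000002", "000002", "000004"],
    "gender": ["female", "female", "male", "male"],
    "age": [45, 52, 52, 54],
})


def find_duplicate_ids(df, column="id"):
    """Return every row whose value in `column` appears more than once."""
    # keep=False marks all occurrences of a repeated value, not just the later ones
    mask = df[column].duplicated(keep=False)
    return df[mask].sort_values(column)


dupes = find_duplicate_ids(sample)
print(dupes)
```

Looking at the duplicated rows side by side should show whether they are the same person entered twice or two different people sharing an id.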

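# Count samples for each (gender, age) pair, ordered by gender and then age
group = data.groupby('gender')['age'].value_counts().sort_index()
# Two bar charts side by side; sharey=True keeps the counts on the same scale
fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True)
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[1].bar(group['female'].index, group['female'], color='tomato')
plt.show()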
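
The two bar charts above put each gender in its own panel. To compare them on a single axis, for example as a line chart, the same counts can be arranged with one row per age and one column per gender. This is only a sketch on a tiny made-up frame (`sample` is hypothetical); the real `data` frame has the same `gender` and `age` columns.

```python
import pandas as pd

# Hypothetical miniature of the dataset, used only to show the reshaping
sample = pd.DataFrame({
    "gender": ["male", "male", "female", "female", "female", "male"],
    "age": [19, 19, 19, 45, 45, 20],
})

# Rows: age, columns: gender, values: number of samples.
# fill_value=0 makes ages that are missing for one gender show up as 0.
age_by_gender = (
    sample.groupby(["age", "gender"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)
print(age_by_gender)
```

With matplotlib installed, `age_by_gender.plot(kind='line')` would then draw one line per gender.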
# Root folder; each value in the 'path' column names a sub-folder inside it
DATA_DIR = './input/data/train/images/'
# The seven images expected in each sub-folder
FILES = ['mask1.jpg', 'mask2.jpg', 'mask3.jpg', 'mask4.jpg', 'mask5.jpg', 'incorrect_mask.jpg', 'normal.jpg']
fig, axes = plt.subplots(5, 7, figsize=(20, 20), dpi=150)

# One grid row per randomly sampled person, one column per image type
for j in range(5):
    sample = data.sample(1)
    path = sample['path'].values[0]
    sample_path = os.path.join(DATA_DIR, path)
    for i, name in enumerate(FILES):
        image_path = os.path.join(sample_path, name)
        image = img.imread(image_path)
        axes[j][i].imshow(image)
        axes[j][i].axis('off')
        axes[j][i].set_title(name[:-4])  # drop the '.jpg' extension

plt.show()
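
The loop above assumes that every folder contains all seven files under exactly these names; `img.imread` fails as soon as one of them is missing. A minimal sketch of a check that could be run first is shown below. The helper name `missing_files` is made up for illustration, and the demonstration uses a temporary directory rather than the real image folders.

```python
import os
import tempfile

FILES = ['mask1.jpg', 'mask2.jpg', 'mask3.jpg', 'mask4.jpg', 'mask5.jpg',
         'incorrect_mask.jpg', 'normal.jpg']


def missing_files(folder, expected=FILES):
    """Return the expected file names that are not present in `folder`."""
    present = set(os.listdir(folder))
    return [name for name in expected if name not in present]


# Demonstration on a throwaway directory holding only the five mask images
with tempfile.TemporaryDirectory() as tmp:
    for name in FILES[:5]:
        open(os.path.join(tmp, name), "w").close()
    result = missing_files(tmp)

print(result)
```

Applied to each `os.path.join(DATA_DIR, path)`, a non-empty result would flag folders with missing or differently named files before any plotting is attempted.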

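Improvements

  1. There are still more characteristics worth examining that I did not look at closely.

    • Count per gender, gender by age (as a line chart), comparison after grouping ages, and a check for exceptional cases

  2. Right now these notes are nothing but code, so they explain very little. I need to write explanations.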
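
Two of the items in the first point, the count per gender and the comparison after grouping ages, could be sketched roughly as follows. The `sample` frame is made up, and the bin edges (under 30, 30 to 59, 60 and over) are an assumption chosen only for illustration, not a grouping taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "age": [45, 52, 19, 20, 60, 30],
})

# Number of samples per gender
gender_counts = sample["gender"].value_counts()

# Assumed age groups; right=False makes the bins [0, 30), [30, 60), [60, 200)
bins = [0, 30, 60, 200]
labels = ["<30", "30-59", "60+"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels, right=False)

# Rows: age group, columns: gender, values: number of samples
group_table = pd.crosstab(sample["age_group"], sample["gender"])

print(gender_counts)
print(group_table)
```

On the real data the same calls would run on `data` directly; the bin edges are worth adjusting after looking at the age distribution.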
