DAY 1 : EDA

210823

Whenever you work with a dataset, you need a solid grasp of its distribution and statistical characteristics, because these characteristics are directly tied to performance. Let's take a look at them here.

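# Libraries for data handling, plotting, and image loading
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as img
import cv2
import os

# Load the training metadata (one row per image folder)
data = pd.read_csv('./input/data/train/train.csv')
data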

          id  gender   race  age                    path
0     000001  female  Asian   45  000001_female_Asian_45
1     000002  female  Asian   52  000002_female_Asian_52
2     000004    male  Asian   54    000004_male_Asian_54
3     000005  female  Asian   58  000005_female_Asian_58
4     000006  female  Asian   59  000006_female_Asian_59
...      ...     ...    ...  ...                     ...
2695  006954    male  Asian   19    006954_male_Asian_19
2696  006955    male  Asian   19    006955_male_Asian_19
2697  006956    male  Asian   19    006956_male_Asian_19
2698  006957    male  Asian   20    006957_male_Asian_20
2699  006959    male  Asian   19    006959_male_Asian_19

2700 rows × 5 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700 entries, 0 to 2699
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2700 non-null   object
 1   gender  2700 non-null   object
 2   race    2700 non-null   object
 3   age     2700 non-null   int64 
 4   path    2700 non-null   object
dtypes: int64(1), object(4)
memory usage: 105.6+ KB
data.describe(include='all')

            id  gender   race          age                  path
count     2700    2700   2700  2700.000000                  2700
unique    2699       2      1          NaN                  2700
top     003397  female  Asian          NaN  006729_male_Asian_19
freq         2    1658   2700          NaN                     1
mean       NaN     NaN    NaN    37.708148                   NaN
std        NaN     NaN    NaN    16.985904                   NaN
min        NaN     NaN    NaN    18.000000                   NaN
25%        NaN     NaN    NaN    20.000000                   NaN
50%        NaN     NaN    NaN    36.000000                   NaN
75%        NaN     NaN    NaN    55.000000                   NaN
max        NaN     NaN    NaN    60.000000                   NaN
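
The summary above reports 2699 unique `id` values for 2700 rows, with `003397` as the top value at frequency 2, so one id appears to occur twice. A minimal sketch of how such rows could be pulled out for inspection follows. The helper name `duplicated_ids` and the toy frame are hypothetical and only illustrate the call; on the real data you would pass `data` instead.

```python
import pandas as pd


def duplicated_ids(df, column="id"):
    """Return every row whose value in `column` occurs more than once."""
    # keep=False marks all occurrences of a repeated value, not only the later ones
    mask = df[column].duplicated(keep=False)
    # A stable sort places repeated values next to each other in their original order
    return df[mask].sort_values(column, kind="stable")


# Hypothetical toy rows (not taken from train.csv), used only to show the behaviour
toy = pd.DataFrame({"id": ["a01", "a02", "a01"], "age": [20, 31, 20]})
print(duplicated_ids(toy))
```

Running `duplicated_ids(data)` on the loaded frame should list the rows sharing an id, which makes it easy to see whether they describe the same person or are a labeling error.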

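# Count rows per (gender, age) pair, ordered by gender and then age
group = data.groupby('gender')['age'].value_counts().sort_index()
# Side-by-side bar charts of the age distribution, sharing the y-axis
fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True)
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[0].set_title('male')
axes[1].bar(group['female'].index, group['female'], color='tomato')
axes[1].set_title('female')
plt.show()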
DATA_DIR = './input/data/train/images/'
# The seven image files expected inside each folder
FILES = ['mask1.jpg', 'mask2.jpg', 'mask3.jpg', 'mask4.jpg', 'mask5.jpg', 'incorrect_mask.jpg', 'normal.jpg']
fig, axes = plt.subplots(5, 7, figsize = (20, 20), dpi=150)

# Each grid row shows the seven images of one randomly sampled folder
for j in range(5):
    sample =  data.sample(1)
    path = sample['path'].values[0]
    sample_path = os.path.join(DATA_DIR, path)
    for i, name in enumerate(FILES):
        image_path = os.path.join(sample_path, name)
        image = img.imread(image_path)
        axes[j][i].imshow(image)
        axes[j][i].axis('off')
        # Use the file name without its extension as the title
        axes[j][i].set_title(os.path.splitext(name)[0])

plt.show()

Improvements

  1. There are still more features worth examining that I did not look at closely.

    • Counts per gender, gender by age (as a line graph), comparison across grouped age ranges, and a review of exceptional cases

  2. Right now the page is nothing but code, so it explains very little. Explanations need to be written.
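
The gender counts and the grouped-age comparison listed above could be sketched as follows. The sample rows are copied from the head and tail of the table shown earlier; the helper name `summarize`, the bin edges (30 and 60), and the group labels are assumptions chosen for illustration, not something this page defines.

```python
import pandas as pd

# Rows copied from the head and tail of the table displayed earlier on this page
sample = pd.DataFrame({
    "gender": ["female", "female", "male", "female", "female",
               "male", "male", "male", "male", "male"],
    "age": [45, 52, 54, 58, 59, 19, 19, 19, 20, 19],
})


def summarize(df, edges=(0, 30, 60, float("inf")), labels=("<30", "30-59", "60+")):
    """Return (rows per gender, gender x age-group count table).

    The default edges and labels are illustrative assumptions.
    """
    gender_counts = df["gender"].value_counts()
    # right=False makes every bin closed on the left: [0, 30), [30, 60), [60, inf)
    age_group = pd.cut(df["age"], bins=list(edges), labels=list(labels), right=False)
    table = pd.crosstab(df["gender"], age_group)
    return gender_counts, table


gender_counts, table = summarize(sample)
print(gender_counts)
print(table)
```

Calling `summarize(data)` on the full frame would give the same two summaries for all 2700 rows; different edges can be passed in to try other age groupings.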
