DAY 1 : EDA
210823
Whenever you work with a dataset, you need a solid grasp of its distribution and statistical characteristics, because they directly affect model performance. Let's take a look at them.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as img
import cv2
import os
data = pd.read_csv('./input/data/train/train.csv')
data
          id  gender   race  age                    path
0     000001  female  Asian   45  000001_female_Asian_45
1     000002  female  Asian   52  000002_female_Asian_52
2     000004    male  Asian   54    000004_male_Asian_54
3     000005  female  Asian   58  000005_female_Asian_58
4     000006  female  Asian   59  000006_female_Asian_59
...      ...     ...    ...  ...                     ...
2695  006954    male  Asian   19    006954_male_Asian_19
2696  006955    male  Asian   19    006955_male_Asian_19
2697  006956    male  Asian   19    006956_male_Asian_19
2698  006957    male  Asian   20    006957_male_Asian_20
2699  006959    male  Asian   19    006959_male_Asian_19
2700 rows × 5 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700 entries, 0 to 2699
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      2700 non-null   object
 1   gender  2700 non-null   object
 2   race    2700 non-null   object
 3   age     2700 non-null   int64
 4   path    2700 non-null   object
dtypes: int64(1), object(4)
memory usage: 105.6+ KB
data.describe(include='all')
            id  gender   race          age                  path
count     2700    2700   2700  2700.000000                  2700
unique    2699       2      1          NaN                  2700
top     003397  female  Asian          NaN  006729_male_Asian_19
freq         2    1658   2700          NaN                     1
mean       NaN     NaN    NaN    37.708148                   NaN
std        NaN     NaN    NaN    16.985904                   NaN
min        NaN     NaN    NaN    18.000000                   NaN
25%        NaN     NaN    NaN    20.000000                   NaN
50%        NaN     NaN    NaN    36.000000                   NaN
75%        NaN     NaN    NaN    55.000000                   NaN
max        NaN     NaN    NaN    60.000000                   NaN
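The summary above reports 2699 unique values in `id` for 2700 rows, with `003397` appearing twice, so one id is shared by two rows. A minimal sketch of how such rows could be located is shown below. It runs on a small toy frame (the rows are hypothetical, not taken from train.csv); on the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, not the real data).
# ids are kept as strings so that leading zeros survive.
toy = pd.DataFrame({
    "id": ["000001", "000002", "000002", "000004"],
    "gender": ["female", "female", "male", "male"],
    "age": [45, 52, 52, 54],
})

# keep=False marks every row whose id also appears in another row.
dup_rows = toy[toy["id"].duplicated(keep=False)]
print(dup_rows)

# Number of distinct ids that occur more than once.
n_dup_ids = int((toy["id"].value_counts() > 1).sum())
print(n_dup_ids)
```

Looking at the duplicated rows side by side shows whether they are true duplicates or two different people who were given the same id.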
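# Count how many people there are at each age, separately for each gender.
group = data.groupby('gender')['age'].value_counts().sort_index()
# Side-by-side bar charts with a shared y-axis so the two genders are comparable.
fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True)
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[0].set_title('male')
axes[1].bar(group['female'].index, group['female'], color='tomato')
axes[1].set_title('female')
plt.show()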

DATA_DIR = './input/data/train/images/'
FILES = ['mask1.jpg', 'mask2.jpg', 'mask3.jpg', 'mask4.jpg', 'mask5.jpg', 'incorrect_mask.jpg', 'normal.jpg']
# One row per sampled person, one column per image file.
fig, axes = plt.subplots(5, 7, figsize=(20, 20), dpi=150)
for j in range(5):
    # Pick one random person; 'path' is the name of that person's image folder.
    sample = data.sample(1)
    path = sample['path'].values[0]
    sample_path = os.path.join(DATA_DIR, path)
    for i, name in enumerate(FILES):
        image_path = os.path.join(sample_path, name)
        image = img.imread(image_path)
        axes[j][i].imshow(image)
        axes[j][i].axis('off')
        # Use the file name without its '.jpg' extension as the panel title.
        axes[j][i].set_title(name[:-4])
plt.show()
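The loop above calls `data.sample(1)` once per row of the grid, so the same person can be drawn more than once and the figure changes on every run. One possible alternative, sketched here on a toy frame with hypothetical folder names, is to draw all five rows in a single call with a fixed seed. The seed value `0` is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for the 'path' column (hypothetical folder names).
toy = pd.DataFrame({"path": [f"person_{k}" for k in range(10)]})

# sample(n) draws without replacement by default, so the 5 rows are distinct;
# random_state fixes the seed so the same rows come back on every run.
picked = toy.sample(5, random_state=0)
paths = picked["path"].tolist()
print(paths)
```

Iterating over `paths` with `enumerate` would then take the place of `for j in range(5)` in the plotting loop.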


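Improvements
There are still more characteristics worth examining, but I did not look at them closely.
Counts per gender, gender by age (as a line graph), comparison after grouping ages, review of exceptional cases
Right now the page is nothing but code, so it explains very little. I need to write explanations.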
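Some of the follow-up checks listed above (counts per gender, ages grouped into bins, and a gender comparison per age group) could be sketched with pandas as follows. The frame is a toy stand-in with hypothetical rows, and the bin edges (30 and 60) are an assumption made only for illustration, not values taken from the source.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows, not the real data).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "age": [45, 52, 54, 19, 60, 29],
})

# 1) Number of rows per gender.
gender_counts = toy["gender"].value_counts()

# 2) Group ages into bins. The edges are an illustrative assumption;
#    right=False makes each bin include its left edge, e.g. [30, 60).
bins = [0, 30, 60, 200]
labels = ["<30", "30-59", ">=60"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# 3) Age group x gender table, ready for comparison or plotting.
table = pd.crosstab(toy["age_group"], toy["gender"])

print(gender_counts)
print(table)
```

On the real data, `pd.crosstab(data['age'], data['gender']).plot()` should give the per-age line graph, one line per gender, provided matplotlib is installed.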