DAY 1 : EDA
210823
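Before working with any dataset, you need a solid grasp of its distribution and statistical characteristics, because they directly affect model performance. Let's take a look at them.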
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.image as img
import cv2
import os
data = pd.read_csv('./input/data/train/train.csv')
data
          id  gender   race  age                    path
0     000001  female  Asian   45  000001_female_Asian_45
1     000002  female  Asian   52  000002_female_Asian_52
2     000004    male  Asian   54    000004_male_Asian_54
3     000005  female  Asian   58  000005_female_Asian_58
4     000006  female  Asian   59  000006_female_Asian_59
...      ...     ...    ...  ...                     ...
2695  006954    male  Asian   19    006954_male_Asian_19
2696  006955    male  Asian   19    006955_male_Asian_19
2697  006956    male  Asian   19    006956_male_Asian_19
2698  006957    male  Asian   20    006957_male_Asian_20
2699  006959    male  Asian   19    006959_male_Asian_19

2700 rows × 5 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2700 entries, 0 to 2699
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      2700 non-null   object
 1   gender  2700 non-null   object
 2   race    2700 non-null   object
 3   age     2700 non-null   int64
 4   path    2700 non-null   object
dtypes: int64(1), object(4)
memory usage: 105.6+ KB
data.describe(include='all')
            id  gender   race          age                  path
count     2700    2700   2700  2700.000000                  2700
unique    2699       2      1          NaN                  2700
top     003397  female  Asian          NaN  006729_male_Asian_19
freq         2    1658   2700          NaN                     1
mean       NaN     NaN    NaN    37.708148                   NaN
std        NaN     NaN    NaN    16.985904                   NaN
min        NaN     NaN    NaN    18.000000                   NaN
25%        NaN     NaN    NaN    20.000000                   NaN
50%        NaN     NaN    NaN    36.000000                   NaN
75%        NaN     NaN    NaN    55.000000                   NaN
max        NaN     NaN    NaN    60.000000                   NaN
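The summary above reports 2699 unique `id` values across 2700 rows, with `003397` appearing twice, so one id is duplicated. A minimal sketch of how such rows could be pulled out for inspection follows. The helper name `find_duplicate_ids` and the toy frame are hypothetical and only stand in for `train.csv`; the values in it are made up.

```python
import pandas as pd


def find_duplicate_ids(df: pd.DataFrame, key: str = "id") -> pd.DataFrame:
    """Return every row whose `key` value occurs more than once."""
    # keep=False marks all members of a duplicated group, not just the repeats.
    mask = df.duplicated(subset=key, keep=False)
    return df[mask].sort_values(key)


# Toy frame standing in for train.csv (illustrative values only).
toy = pd.DataFrame({
    "id": ["000001", "000002", "000002", "000004"],
    "gender": ["female", "male", "female", "male"],
    "age": [45, 52, 52, 54],
})

dupes = find_duplicate_ids(toy)
print(dupes)
```

On the real data this would be called as `find_duplicate_ids(data)`. Whether the two rows sharing an id are a labeling error or two different people is something only a look at the images can settle.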
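# Count how many people there are at each age, separately for each gender.
group = data.groupby('gender')['age'].value_counts().sort_index()
# Draw the two distributions side by side on a shared y-axis for easy comparison.
fig, axes = plt.subplots(1, 2, figsize=(15, 7), sharey=True)
axes[0].bar(group['male'].index, group['male'], color='royalblue')
axes[0].set_title('male')
axes[1].bar(group['female'].index, group['female'], color='tomato')
axes[1].set_title('female')
plt.show()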

DATA_DIR = './input/data/train/images/'
FILES = ['mask1.jpg', 'mask2.jpg', 'mask3.jpg', 'mask4.jpg', 'mask5.jpg', 'incorrect_mask.jpg', 'normal.jpg']
# One row per randomly sampled person, one column per image type.
fig, axes = plt.subplots(5, 7, figsize=(20, 20), dpi=150)
for j in range(5):
    # Pick one person at random and build the path to their image folder.
    sample = data.sample(1)
    path = sample['path'].values[0]
    sample_path = os.path.join(DATA_DIR, path)
    for i, name in enumerate(FILES):
        image_path = os.path.join(sample_path, name)
        image = img.imread(image_path)
        axes[j][i].imshow(image)
        axes[j][i].axis('off')
        # Use the file name without its '.jpg' extension as the title.
        axes[j][i].set_title(name[:-4])
plt.show()


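Improvements
There are still more features worth examining that I did not look at closely.
Counts per gender, gender by age (as a line plot), comparison across grouped age ranges, and a review of exceptional cases.
Right now the page is nothing but code, so it explains very little. I need to write out explanations.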
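As a starting point for the items listed above, here is a rough sketch of the gender counts and the grouped-age comparison. It runs on a toy frame with made-up values rather than `train.csv`, and the age bin edges (`<30`, `30-59`, `60+`) are an assumption chosen for illustration, not something taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for train.csv (illustrative values only).
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "female", "male"],
    "age": [19, 45, 20, 58, 60, 33],
})

# 1) Number of people per gender.
gender_counts = toy["gender"].value_counts()

# 2) Bucket ages into groups. The edges are an assumption;
#    right=False makes each bin closed on the left: [0, 30), [30, 60), [60, 200).
bins = [0, 30, 60, 200]
labels = ["<30", "30-59", "60+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels, right=False)

# 3) Gender counts within each age group.
table = pd.crosstab(toy["age_group"], toy["gender"])

print(gender_counts)
print(table)
```

The same `pd.crosstab` call with the raw `age` column instead of `age_group` gives one row per age and one column per gender, and calling `.plot()` on that result would draw the per-age gender counts as a line plot.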