3 Thu

Introduction to Kaggle Machine Learning, Taught by Industry Practitioners

Tools for Machine Learning and Data Analysis

numpy

  • A tool for working with arrays

  • Short for Numerical Python

  • Makes multidimensional data easy to process

pandas

  • A tool for working with data tables

  • The name is read as Python Data Analysis Library (it also derives from "panel data")

  • Handles two-dimensional tabular data very well

Matplotlib

  • A visualization package for drawing graphs and showing distributions

  • Modeled on the coding style of MATLAB, which is widely used in research

    • Matlab-style Plotting Library

  • Feature-rich but somewhat cumbersome to use

Seaborn

  • An easy-to-use Python visualization package built as a wrapper around matplotlib

    • That is, calling a Seaborn function calls matplotlib functions underneath

  • Draws varied, polished graphs with simpler code than matplotlib

  • matplotlib commands can still be used as they are

numpy, a tool for working with arrays and matrices

Importing numpy

import numpy as np

Creating arrays from lists

np.array(list)

np_arr1 = np.array([1, 2, 3, 4])

np_arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

np_arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

Checking the shape

np_arr.shape
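As a quick check of what shape reports, the sketch below builds the three arrays from the notes and prints the shape of each. Nothing here is new apart from the print calls.

```python
import numpy as np

# The 1-D, 2-D, and 3-D arrays from the notes above.
np_arr1 = np.array([1, 2, 3, 4])
np_arr2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
np_arr3 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# shape is a tuple with one entry per dimension.
print(np_arr1.shape)  # (4,)
print(np_arr2.shape)  # (2, 4)
print(np_arr3.shape)  # (2, 2, 2)
```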

Initialization functions

  • Fill with 0 : np.zeros(shape)

  • Fill with 1 : np.ones(shape)

  • Array of random numbers (standard normal) : np.random.randn(rows, columns)
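A minimal sketch of the three initialization functions. The 2 x 3 size and the variable names are chosen only for illustration. Note that np.zeros and np.ones take the shape as one tuple, while np.random.randn takes each dimension as a separate argument.

```python
import numpy as np

zeros = np.zeros((2, 3))      # 2 x 3 array filled with 0.0
ones = np.ones((2, 3))        # 2 x 3 array filled with 1.0
rand = np.random.randn(2, 3)  # 2 x 3 samples from the standard normal distribution

print(zeros)
print(ones)
print(rand.shape)  # (2, 3); the values differ on every run
```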

Indexing and slicing

  • Indexing : np_arr[n][m]

    • Indices start at 0

  • Slicing : np_arr[start:end:interval]

    • The element at end is not included.
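The indexing and slicing rules can be sketched as follows. The array and variable names are illustrative; arr[n, m] is shown alongside arr[n][m] because both select the same element.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

first = arr[0][0]         # row 0, column 0 -> 1 (indices start at 0)
same = arr[1, 2]          # equivalent to arr[1][2] -> 7
row_slice = arr[0, 1:3]   # columns 1 and 2 of row 0; column 3 is excluded
stepped = arr[1, ::2]     # every second element of row 1

print(first, same)        # 1 7
print(row_slice)          # [2 3]
print(stepped)            # [5 7]
```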

Broadcasting, Aggregation

  • Broadcasting

    • When an arithmetic operation is applied between a multidimensional numpy array and a single number, the operation is applied to every element of the array

  • Aggregation

    • sum, mean, prod, max, min, argmax, argmin
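A short sketch of broadcasting with a single number and of a few of the aggregation functions listed above. The array values are made up for illustration.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# Broadcasting: the scalar is applied to every element.
plus_ten = arr + 10
doubled = arr * 2

# Aggregation: over the whole array, or along one axis.
total = arr.sum()             # 21
col_means = arr.mean(axis=0)  # mean of each column
largest_pos = arr.argmax()    # position of the maximum in the flattened array

print(plus_ten)     # [[11 12 13] [14 15 16]]
print(col_means)    # [2.5 3.5 4.5]
print(largest_pos)  # 5
```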

numpy practice

pandas, a data analysis tool for tables with rows and columns

Why pandas was created

  • To attach labels to the rows and columns of data

One-dimensional data

  • Use a Series

Series access, operations, and creation

  • Indexing by row or column name : sr.loc[label]

  • Indexing by integer position : sr.iloc[position]

    • When slicing, loc includes the end point and iloc does not

  • Creating a Series : new_sr = pd.Series([1, 2, 3, 4], name='apple', index=['a', 'b', 'xs', 'e11'])

    • name defaults to None

    • index defaults to 0, 1, 2, 3, ...
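The points above can be sketched with the new_sr Series from the notes. The extra variable names (by_label, default_sr, and so on) are only for illustration.

```python
import pandas as pd

new_sr = pd.Series([1, 2, 3, 4], name='apple', index=['a', 'b', 'xs', 'e11'])

by_label = new_sr.loc['b']          # label-based access -> 2
by_pos = new_sr.iloc[1]             # position-based access -> 2

label_slice = new_sr.loc['a':'xs']  # loc slice includes 'xs': 3 elements
pos_slice = new_sr.iloc[0:2]        # iloc slice excludes position 2: 2 elements

# Without name and index, the defaults are None and 0, 1, 2, ...
default_sr = pd.Series([10, 20, 30])

print(len(label_slice), len(pos_slice))  # 3 2
print(default_sr.name, list(default_sr.index))  # None [0, 1, 2]
```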

Two-dimensional data

  • Use a DataFrame.

  • A structure made by bundling several Series together

Summary of DataFrame operations

  • Element-wise operations

    • abs() : absolute value

    • isna() : whether a value is NA (missing)

    • notna() : whether a value is valid (not missing)

    • pow() : exponentiation

  • Operations along an axis (axis = 0 or 1)

    • mean() : mean

    • median() : median

    • max(), min() : maximum, minimum

    • sum(), prod() : sum, product

    • idxmax(), idxmin() : index of the largest element, index of the smallest element

  • Cumulative operations along an axis (axis = 0 or 1)

    • cummax(), cummin() : cumulative maximum, cumulative minimum

    • cumprod(), cumsum() : cumulative product, cumulative sum
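A small sketch of one operation from each group above, on a made-up 2 x 3 DataFrame. With axis=0 the operation runs down each column; with axis=1 it runs across each row.

```python
import pandas as pd

# Illustrative values only, with negatives so abs() has a visible effect.
df = pd.DataFrame([[0, -1, 2], [3, 4, -5]], index=[0, 1], columns=['a', 'b', 'c'])

abs_df = df.abs()               # element-wise absolute value
squared = df.pow(2)             # element-wise square

col_mean = df.mean(axis=0)      # one mean per column: a 1.5, b 1.5, c -1.5
row_sum = df.sum(axis=1)        # one sum per row: 1 and 2
col_idxmax = df.idxmax(axis=0)  # row label of each column's maximum: 1, 1, 0

running = df.cumsum(axis=0)     # cumulative sum down each column

print(col_mean)
print(row_sum)
print(running)
```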

  • Sorting

    • df.sort_values(sort_key, axis=axis, ascending=True)

    • df.rank(axis=axis, ascending=True)

  • Creation

    • pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=[0, 1], columns=['a', 'b', 'c'])
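A sketch of sorting and ranking on the DataFrame created above. With the default axis, the sort key is a column name and rows are reordered. With axis=1, the key is a row label and columns are reordered.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]], index=[0, 1], columns=['a', 'b', 'c'])

# Reorder rows by the values in column 'a', largest first.
by_a_desc = df.sort_values('a', ascending=False)

# Reorder columns by the values in row 0, largest first.
by_row0_desc = df.sort_values(0, axis=1, ascending=False)

# Rank the values within each column (1.0 is the smallest).
ranks = df.rank(axis=0, ascending=True)

print(list(by_a_desc.index))       # [1, 0]
print(list(by_row0_desc.columns))  # ['c', 'b', 'a']
print(ranks)
```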

Loading and saving CSV files

  • Saving : df.to_csv('filename')

    • For Korean-language data, specify encoding='cp949' if the file will be read by software that expects that encoding (pandas writes UTF-8 by default)

  • Loading : pd.read_csv('filename')

    • Likewise, specify the encoding explicitly for Korean-language data
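A minimal round-trip sketch, assuming a small made-up table with Korean text. The file name, the temporary directory, and index=False (which leaves the row index out of the file) are choices made for this example and are not part of the notes.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data containing Korean text.
df = pd.DataFrame({'name': ['철수', '영희'], 'score': [90, 85]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scores.csv')
    # Write and read back with the same explicit encoding.
    df.to_csv(path, encoding='cp949', index=False)
    loaded = pd.read_csv(path, encoding='cp949')

print(loaded)
```

Reading a cp949 file without the encoding argument typically fails with a decoding error or yields garbled text, because read_csv assumes UTF-8.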

pandas practice

<Exercise>

From my_df, extract the stock price of A005950 on 2020-09-16

[7] my_df.loc['2020-09-16', 'A005950']

9080.0

<Exercise>

From my_df, extract the stock prices of A005950 from 2020-09-10 to 2020-09-15

[8] my_df.loc['2020-09-10':'2020-09-15', 'A005950']

2020-09-10    9280.0
2020-09-11    9360.0
2020-09-14    9400.0
2020-09-15    9390.0
Name: A005950, dtype: float64

<Exercise>

From my_df, extract the stock prices of A005930 and A005950 from 2020-09-10 to 2020-09-17 at two-day intervals

[9] my_df.loc['2020-09-10':'2020-09-17':2, 'A005950']

2020-09-10    9280.0
2020-09-14    9400.0
2020-09-16    9080.0
Name: A005950, dtype: float64
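As a sketch of how a stepped loc slice behaves, the example below uses a small hypothetical table (demo_df, with made-up tickers X and Y and made-up prices), since my_df itself is not reproduced in these notes. It assumes the index holds date strings. The step counts rows, that is trading days, rather than calendar days, and passing a list of column names selects several stocks at once.

```python
import pandas as pd

# Hypothetical prices; the index skips the weekend of 09-12 and 09-13.
demo_df = pd.DataFrame(
    {'X': [100.0, 101.0, 102.0, 103.0, 104.0],
     'Y': [50.0, 51.0, 52.0, 53.0, 54.0]},
    index=['2020-09-10', '2020-09-11', '2020-09-14', '2020-09-15', '2020-09-16'],
)

# Every second row between the two labels (both ends inclusive), two columns.
every_other = demo_df.loc['2020-09-10':'2020-09-16':2, ['X', 'Y']]

print(every_other)
```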

<Exercise>

From my_df, compute the mean stock price of A005950 over the entire period

[10] my_df.A005950.mean()

9664.09090909091

<Exercise>

From my_df, compute the daily upper price limit for A005980 from 2020-09-14 onward (upper limit: a 30% increase)

[19] my_df.loc['2020-09-14':, 'A005980'] * 1.3

2020-09-14    872.3
2020-09-15    872.3
2020-09-16    872.3
2020-09-17    872.3
2020-09-18    872.3
2020-09-21    872.3
2020-09-22    872.3
2020-09-23    872.3
2020-09-24    872.3
2020-09-25    872.3
2020-09-28    872.3
2020-09-29    872.3
2020-10-05    872.3
2020-10-06    872.3
2020-10-07    872.3
2020-10-08    872.3
2020-10-12    872.3
2020-10-13    872.3
2020-10-14    872.3
Name: A005980, dtype: float64

<Exercise>

From my_df, compute the return of every stock over the period 2020-09-17 to 2020-09-24 (in %)

[23] (my_df.loc['2020-09-24'] / my_df.loc['2020-09-17'] - 1) * 100

Symbol
A005930   -2.857143
A005940   -4.077253
A005950   -6.756757
A005960   -4.545455
A005980    0.000000
A005990   -3.625000
dtype: float64

<Exercise>

From my_df, sort the stock prices on 2020-09-16 in descending order

[24] my_df.sort_values('2020-09-16', axis='columns', ascending=False)

<Exercise>

From my_df, compute the return of every stock from 2020-09-09 to 2020-09-18 and print the ranking (highest return first), buying at the 9/9 close and selling at the 9/18 close

[26] ((my_df.loc['2020-09-18'] / my_df.loc['2020-09-09'] - 1) * 100).rank(ascending=False)

Symbol
A005930    1.0
A005940    2.0
A005950    5.0
A005960    6.0
A005980    4.0
A005990    3.0
dtype: float64
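The return-and-rank calculation can be sketched on a tiny hypothetical table (demo_df, with made-up tickers and prices), assuming one row per date and one column per stock. Dividing two rows gives a Series indexed by ticker, and rank(ascending=False) assigns 1.0 to the highest return.

```python
import pandas as pd

# Hypothetical closing prices on the buy date and the sell date.
demo_df = pd.DataFrame(
    {'X': [100.0, 150.0], 'Y': [200.0, 150.0], 'Z': [50.0, 50.0]},
    index=['2020-09-09', '2020-09-18'],
)

# Percentage return per stock: sell price / buy price - 1, times 100.
returns = (demo_df.loc['2020-09-18'] / demo_df.loc['2020-09-09'] - 1) * 100

# Rank 1.0 goes to the highest return.
ranks = returns.rank(ascending=False)

print(returns)  # X 50.0, Y -25.0, Z 0.0
print(ranks)    # X 1.0, Y 3.0, Z 2.0
```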
