4 Mon

TIL

[Inflearn] Conquering Data Analysis and Visualization with Just Two Pages of Documentation

Reshaping a DataFrame with df.sort_values, rename, sort_index, and reset_index - Reshaping Data

import pandas as pd
import numpy as np
import seaborn as sns # ์‹œ๊ฐํ™” ํˆด
df = sns.load_dataset("mpg")
df.shape
(398, 9)
df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5          70    usa                ford torino

df.sort_values('mpg')
Order rows by values of a column (low to high).
df.sort_values('mpg',ascending=False)
Order rows by values of a column (high to low).
df.sort_values('mpg').head()

      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin              name
28    9.0          8         304.0       193.0    4732          18.5          70    usa          hi 1200d
25   10.0          8         360.0       215.0    4615          14.0          70    usa         ford f250
26   10.0          8         307.0       200.0    4376          15.0          70    usa         chevy c20
103  11.0          8         400.0       150.0    4997          14.0          73    usa  chevrolet impala
124  11.0          8         350.0       180.0    3664          11.0          73    usa  oldsmobile omega

df.sort_values('mpg', ascending=False).head()
# ascending defaults to True

      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin                  name
322  46.6          4          86.0        65.0    2110          17.9          80   japan             mazda glc
329  44.6          4          91.0        67.0    1850          13.8          80   japan   honda civic 1500 gl
325  44.3          4          90.0        48.0    2085          21.7          80  europe  vw rabbit c (diesel)
394  44.0          4          97.0        52.0    2130          24.6          82  europe             vw pickup
326  43.4          4          90.0        48.0    2335          23.7          80  europe    vw dasher (diesel)

df.sort_values?
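sort_values also accepts a list of columns with a matching list of per-column ascending flags, which is how ties get broken. A minimal sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the mpg data; values are hypothetical.
df = pd.DataFrame({"mpg": [18.0, 15.0, 18.0], "weight": [3504, 3693, 3436]})

# Sort by mpg descending; break ties by weight ascending.
out = df.sort_values(["mpg", "weight"], ascending=[False, True])
```

The two rows with mpg 18.0 end up ordered by their weights.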
df.rename(columns = {'y':'year'})
Rename the columns of a DataFrame
df = df.rename(columns = {'model_year' : 'year'})
df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  year origin                       name
0  18.0          8         307.0       130.0    3504          12.0    70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5    70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0    70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0    70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5    70    usa                ford torino
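rename returns a new frame rather than modifying in place, so assigning the result back (as above) is what makes the change stick. A minimal sketch:

```python
import pandas as pd

# Tiny frame with the column to rename; data is hypothetical.
df = pd.DataFrame({"model_year": [70, 71]})

# rename returns a copy; reassign to keep the renamed frame.
df = df.rename(columns={"model_year": "year"})
```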

df.sort_index()
Sort the index of a DataFrame
df.reset_index()
Reset index of DataFrame to row numbers, moving
index to columns
df.sort_index().head(10)

    mpg  cylinders  displacement  horsepower  weight  acceleration  year origin                       name
0  18.0          8         307.0       130.0    3504          12.0    70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5    70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0    70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0    70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5    70    usa                ford torino
5  15.0          8         429.0       198.0    4341          10.0    70    usa           ford galaxie 500
6  14.0          8         454.0       220.0    4354           9.0    70    usa           chevrolet impala
7  14.0          8         440.0       215.0    4312           8.5    70    usa          plymouth fury iii
8  14.0          8         455.0       225.0    4425          10.0    70    usa           pontiac catalina
9  15.0          8         390.0       190.0    3850           8.5    70    usa         amc ambassador dpl

df.reset_index().head(10)
# moves the old index into an 'index' column and creates a fresh 0..n-1 row index

   index   mpg  cylinders  displacement  horsepower  weight  acceleration  year origin                       name
0      0  18.0          8         307.0       130.0    3504          12.0    70    usa  chevrolet chevelle malibu
1      1  15.0          8         350.0       165.0    3693          11.5    70    usa          buick skylark 320
2      2  18.0          8         318.0       150.0    3436          11.0    70    usa         plymouth satellite
3      3  16.0          8         304.0       150.0    3433          12.0    70    usa              amc rebel sst
4      4  17.0          8         302.0       140.0    3449          10.5    70    usa                ford torino
5      5  15.0          8         429.0       198.0    4341          10.0    70    usa           ford galaxie 500
6      6  14.0          8         454.0       220.0    4354           9.0    70    usa           chevrolet impala
7      7  14.0          8         440.0       215.0    4312           8.5    70    usa          plymouth fury iii
8      8  14.0          8         455.0       225.0    4425          10.0    70    usa           pontiac catalina
9      9  15.0          8         390.0       190.0    3850           8.5    70    usa         amc ambassador dpl
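Whether the old index is worth keeping depends on the situation; reset_index(drop=True) discards it instead of turning it into a column. A minimal sketch (the index values are hypothetical):

```python
import pandas as pd

# A frame with a non-default index, e.g. left over from sorting.
df = pd.DataFrame({"a": [10, 20]}, index=[5, 3])

kept = df.reset_index()              # old index becomes an 'index' column
dropped = df.reset_index(drop=True)  # old index is discarded entirely
```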

df.drop(columns=['Length','Height'])
Drop columns from DataFrame
df.drop(columns=['mpg','year']).head(10)

   cylinders  displacement  horsepower  weight  acceleration origin                       name
0          8         307.0       130.0    3504          12.0    usa  chevrolet chevelle malibu
1          8         350.0       165.0    3693          11.5    usa          buick skylark 320
2          8         318.0       150.0    3436          11.0    usa         plymouth satellite
3          8         304.0       150.0    3433          12.0    usa              amc rebel sst
4          8         302.0       140.0    3449          10.5    usa                ford torino
5          8         429.0       198.0    4341          10.0    usa           ford galaxie 500
6          8         454.0       220.0    4354           9.0    usa           chevrolet impala
7          8         440.0       215.0    4312           8.5    usa          plymouth fury iii
8          8         455.0       225.0    4425          10.0    usa           pontiac catalina
9          8         390.0       190.0    3850           8.5    usa         amc ambassador dpl

df.drop(columns=['mpg','year', 'name']).head(10)

   cylinders  displacement  horsepower  weight  acceleration origin
0          8         307.0       130.0    3504          12.0    usa
1          8         350.0       165.0    3693          11.5    usa
2          8         318.0       150.0    3436          11.0    usa
3          8         304.0       150.0    3433          12.0    usa
4          8         302.0       140.0    3449          10.5    usa
5          8         429.0       198.0    4341          10.0    usa
6          8         454.0       220.0    4354           9.0    usa
7          8         440.0       215.0    4312           8.5    usa
8          8         455.0       225.0    4425          10.0    usa
9          8         390.0       190.0    3850           8.5    usa
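drop also works the other way around: axis-based syntax does the same thing, and errors="ignore" keeps it from raising when a listed column is absent. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame with a few of the mpg columns; values are hypothetical.
df = pd.DataFrame({"mpg": [18.0], "year": [70], "name": ["ford torino"]})

out = df.drop(columns=["mpg", "year"])
# Equivalent older spelling: df.drop(["mpg", "year"], axis=1)
safe = df.drop(columns=["mpg", "no_such_col"], errors="ignore")
```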

Making Tidy Data with melt and pivot - Reshaping Data, Method Chaining

pd.melt(df)
Gather columns into rows.
pd.melt?
df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
df

   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6

pd.melt(df, id_vars=['A'], value_vars=['B'])

   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5

pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

pd.melt(df, value_vars=['A','B','C'])

  variable value
0        A     a
1        A     b
2        A     c
3        B     1
4        B     3
5        B     5
6        C     2
7        C     4
8        C     6

pd.melt(df, value_vars=['A','B','C']).rename(columns={
'variable' : 'var',
'value' : 'val'})

  var val
0   A   a
1   A   b
2   A   c
3   B   1
4   B   3
5   B   5
6   C   2
7   C   4
8   C   6
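melt can also name the output columns directly via var_name and value_name, which avoids chaining a separate rename as above. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b"], "B": [1, 3]})

# var_name / value_name replace the default 'variable' / 'value' headers.
long = pd.melt(df, id_vars=["A"], value_vars=["B"],
               var_name="var", value_name="val")
```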

df.pivot(columns='var', values='val')
Spread rows into columns.
df.pivot?
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df

   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t

df2 = df.pivot(index='foo', columns='bar', values='baz')
# values that were in rows ('bar') move up into columns
df2

bar  A  B  C
foo
one  1  2  3
two  4  5  6

df3 = df.pivot(index='foo', columns='bar', values='baz').reset_index()
df3

bar  foo  A  B  C
0    one  1  2  3
1    two  4  5  6

df3.melt(id_vars=['foo'], value_vars=['A','B','C'])

   foo bar  value
0  one   A      1
1  two   A      4
2  one   B      2
3  two   B      5
4  one   C      3
5  two   C      6

df3.melt(id_vars=['foo'], value_vars=['A','B','C']).sort_values('bar')

   foo bar  value
0  one   A      1
1  two   A      4
2  one   B      2
3  two   B      5
4  one   C      3
5  two   C      6

df3.melt(id_vars=['foo'], value_vars=['A','B','C']).sort_values(['foo', 'bar'])

   foo bar  value
0  one   A      1
2  one   B      2
4  one   C      3
1  two   A      4
3  two   B      5
5  two   C      6

df3.melt(id_vars=['foo'], value_vars=['A','B','C']).sort_values(['foo', 'bar']).rename(columns = {'value': 'baz'})

   foo bar  baz
0  one   A    1
2  one   B    2
4  one   C    3
1  two   A    4
3  two   B    5
5  two   C    6
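The pivot / melt pair above is a round trip: pivot spreads rows into columns, and melt gathers them back into tidy form. A sketch with a smaller version of the same frame — note that after reset_index the wide frame's columns carry the name 'bar', so melt reuses it as the variable column automatically:

```python
import pandas as pd

df = pd.DataFrame({"foo": ["one", "one", "two", "two"],
                   "bar": ["A", "B", "A", "B"],
                   "baz": [1, 2, 4, 5]})

wide = df.pivot(index="foo", columns="bar", values="baz").reset_index()
tidy = (wide.melt(id_vars=["foo"], value_vars=["A", "B"], value_name="baz")
            .sort_values(["foo", "bar"])
            .reset_index(drop=True))
```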

Combining Series and DataFrames with pd.concat([df1, df2]) - Reshaping Data

pd.concat([df1,df2])
Append rows of DataFrames
pd.concat([df1,df2], axis=1)
Append columns of DataFrames
pd.concat?
s1 = pd.Series(['a', 'b'])
s1
0    a
1    b
dtype: object
s2 = pd.Series(['c', 'd'])
s2
0    c
1    d
dtype: object
pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object
pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object
pd.concat([s1, s2], keys=['s1', 's2'])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object
pd.concat([s1, s2], keys=['s1', 's2'],
...           names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object
df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
df1

  letter  number
0      a       1
1      b       2

df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
df2

  letter  number
0      c       3
1      d       4

pd.concat([df1, df2])

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
df3

  letter  number animal
0      c       3    cat
1      d       4    dog

# concatenating two DataFrames with different columns
pd.concat([df1, df3])

  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog

pd.concat([df1, df3], join="inner")

  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
...                    columns=['animal', 'name'])
df4

   animal    name
0    bird   polly
1  monkey  george

df5 = pd.DataFrame([1], index=['a'])
df5

   0
a  1

df6 = pd.DataFrame([2], index=['a'])
df6

   0
a  2

pd.concat([df5, df6])

   0
a  1
a  2

pd.concat([df5, df6], verify_integrity=True)
# raises ValueError: indexes have overlapping values
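The examples above can be condensed into one sketch: ignore_index reindexes the result, and verify_integrity turns a duplicated index into an explicit error instead of a silent one:

```python
import pandas as pd

s1 = pd.Series(["a", "b"])
s2 = pd.Series(["c", "d"])
stacked = pd.concat([s1, s2], ignore_index=True)  # fresh 0..n-1 index

# verify_integrity raises when the concatenated index has duplicates.
df5 = pd.DataFrame([1], index=["a"])
df6 = pd.DataFrame([2], index=["a"])
try:
    pd.concat([df5, df6], verify_integrity=True)
    dup_detected = False
except ValueError:
    dup_detected = True
```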

Combining DataFrames with merge using the left, right, inner, and outer options - Combine Data Sets

import pandas as pd
adf = pd.DataFrame({"x1" : ["A", "B", "C"], "x2" : [1, 2, 3]})
adf

  x1  x2
0  A   1
1  B   2
2  C   3

bdf = pd.DataFrame({"x1" : ["A", "B", "D"], "x3" : ["T", "F", "T"]})
bdf

  x1 x3
0  A  T
1  B  F
2  D  T

pd.merge(adf, bdf,
how='left', on='x1')
Join matching rows from bdf to adf.
pd.merge(adf, bdf,
how='right', on='x1')
Join matching rows from adf to bdf.
pd.merge(adf, bdf,
how='inner', on='x1')
Join data. Retain only rows in both sets.
pd.merge(adf, bdf,
how='outer', on='x1')
Join data. Retain all values, all rows.
pd.merge(adf, bdf, how='left', on='x1')

  x1  x2   x3
0  A   1    T
1  B   2    F
2  C   3  NaN

pd.merge(adf, bdf, how='right', on='x1')

  x1   x2 x3
0  A  1.0  T
1  B  2.0  F
2  D  NaN  T

pd.merge(adf, bdf, how='inner', on='x1')

  x1  x2 x3
0  A   1  T
1  B   2  F

pd.merge(adf, bdf, how='outer', on='x1')

  x1   x2   x3
0  A  1.0    T
1  B  2.0    F
2  C  3.0  NaN
3  D  NaN    T
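The four join types above differ only in which keys survive; a compact sketch with the same adf/bdf frames:

```python
import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"], "x3": ["T", "F", "T"]})

left = pd.merge(adf, bdf, how="left", on="x1")    # keep every adf key
outer = pd.merge(adf, bdf, how="outer", on="x1")  # keep keys from both sides
```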

adf[adf.x1.isin(bdf.x1)]
All rows in adf that have a match in bdf.

adf[~adf.x1.isin(bdf.x1)]
All rows in adf that do not have a match in bdf.
adf.x1.isin(bdf.x1)
0     True
1     True
2    False
Name: x1, dtype: bool
adf[adf.x1.isin(bdf.x1)]

  x1  x2
0  A   1
1  B   2

adf[~adf.x1.isin(bdf.x1)]

  x1  x2
2  C   3
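The isin filters above are effectively a semi-join and an anti-join: they keep or drop adf rows by key membership without pulling in any bdf columns. A minimal sketch:

```python
import pandas as pd

adf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
bdf = pd.DataFrame({"x1": ["A", "B", "D"]})

matched = adf[adf.x1.isin(bdf.x1)]     # semi-join: rows with a match in bdf
unmatched = adf[~adf.x1.isin(bdf.x1)]  # anti-join: rows without a match
```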

pd.merge(ydf, zdf)
Rows that appear in both ydf and zdf
(Intersection).

pd.merge(ydf, zdf, how='outer')
Rows that appear in either or both ydf and zdf
(Union).

pd.merge(ydf, zdf, how='outer',
indicator=True)
.query('_merge == "left_only"')
.drop(columns=['_merge'])
Rows that appear in ydf but not zdf (Setdiff).
ydf = pd.DataFrame({"x1" : ["A", "B", "C"], "x2" : [1, 2, 3]})
ydf

  x1  x2
0  A   1
1  B   2
2  C   3

zdf = pd.DataFrame({"x1" : ["B", "C", "D"], "x2" : [2, 3, 4]})
zdf

  x1  x2
0  B   2
1  C   3
2  D   4

pd.merge(ydf, zdf)
# default : inner join

  x1  x2
0  B   2
1  C   3

pd.merge(ydf, zdf, how='outer')

  x1  x2
0  A   1
1  B   2
2  C   3
3  D   4

pd.merge(ydf, zdf, how='outer', indicator=True)

  x1  x2      _merge
0  A   1   left_only
1  B   2        both
2  C   3        both
3  D   4  right_only

pd.merge(ydf, zdf, how='outer', indicator=True).query('_merge == "left_only"')

  x1  x2     _merge
0  A   1  left_only

pd.merge(ydf, zdf, how='outer',indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

  x1  x2
0  A   1
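The indicator/query/drop chain above is the standard recipe for a set difference; as one runnable sketch:

```python
import pandas as pd

ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# indicator=True adds a '_merge' column marking each row's origin;
# keeping 'left_only' rows gives ydf minus zdf.
setdiff = (pd.merge(ydf, zdf, how="outer", indicator=True)
             .query('_merge == "left_only"')
             .drop(columns=["_merge"]))
```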

concat์€ ์œ„์•„๋ž˜๋กœ ํ•ฉ์น  ๋•Œ, mergs๋Š” ์ขŒ์šฐ๋กœ ํ•ฉ์น  ๋•Œ ์‚ฌ์šฉํ•˜๋ฉด ์ข‹๋‹ค

ํŒŒ์ด์ฌ ํŒ๋‹ค์Šค๋กœ groupby ํ™œ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ ์ง‘๊ณ„๋ฅผ ํ™œ์šฉํ•˜๊ธฐ - Group Data

import pandas as pd
import seaborn as sns
df.groupby(by="col")
Return a GroupBy object,
grouped by values in column
named "col".

df.groupby(level="ind")
Return a GroupBy object,
grouped by values in index
level named "ind".

size()
Size of each group.

agg(function)
Aggregate group using function.
df = sns.load_dataset("mpg")
df.head()

    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year origin                       name
0  18.0          8         307.0       130.0    3504          12.0          70    usa  chevrolet chevelle malibu
1  15.0          8         350.0       165.0    3693          11.5          70    usa          buick skylark 320
2  18.0          8         318.0       150.0    3436          11.0          70    usa         plymouth satellite
3  16.0          8         304.0       150.0    3433          12.0          70    usa              amc rebel sst
4  17.0          8         302.0       140.0    3449          10.5          70    usa                ford torino

df.groupby(by="origin")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000017AD720BE88>
df.groupby(by="origin").size()
origin
europe     70
japan      79
usa       249
dtype: int64
df['origin'].value_counts()
usa       249
japan      79
europe     70
Name: origin, dtype: int64
df.groupby(by="origin").max()

         mpg  cylinders  displacement  horsepower  weight  acceleration  model_year                         name
origin
europe  44.3          6         183.0       133.0    3820          24.8          82             vw rabbit custom
japan   46.6          6         168.0       132.0    2930          21.0          82  toyouta corona mark ii (sw)
usa     39.0          8         455.0       230.0    5140          22.2          82           pontiac ventura sj

df.groupby(by="origin").min()

         mpg  cylinders  displacement  horsepower  weight  acceleration  model_year                     name
origin
europe  16.2          4          68.0        46.0    1825          12.2          70              audi 100 ls
japan   18.0          3          70.0        52.0    1613          11.4          70              datsun 1200
usa      9.0          4          85.0        52.0    1800           8.0          70  amc ambassador brougham

df.groupby(by="origin")['weight'].mean()
origin
europe    2423.300000
japan     2221.227848
usa       3361.931727
Name: weight, dtype: float64
df.groupby(by="origin")['weight'].median()
origin
europe    2240
japan     2155
usa       3365
Name: weight, dtype: int64
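Several statistics can also come back in one frame by passing a list of aggregations to agg, instead of calling mean, median, etc. one at a time. A minimal sketch on toy data (the weights are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"origin": ["usa", "usa", "japan"],
                   "weight": [3504, 3693, 2130]})

# One row per group, one column per aggregation.
stats = df.groupby("origin")["weight"].agg(["size", "mean", "max"])
```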
df.groupby?
df.groupby(['origin', 'model_year'])['cylinders'].mean()
origin  model_year
europe  70            4.000000
        71            4.000000
        72            4.000000
        73            4.000000
        74            4.000000
        75            4.000000
        76            4.250000
        77            4.000000
        78            4.833333
        79            4.250000
        80            4.111111
        81            4.500000
        82            4.000000
japan   70            4.000000
        71            4.000000
        72            3.800000
        73            4.250000
        74            4.000000
        75            4.000000
        76            4.500000
        77            4.166667
        78            4.000000
        79            4.000000
        80            4.076923
        81            4.333333
        82            4.000000
usa     70            7.636364
        71            6.200000
        72            6.888889
        73            7.241379
        74            6.266667
        75            6.400000
        76            6.363636
        77            6.222222
        78            6.000000
        79            6.260870
        80            4.285714
        81            4.923077
        82            4.300000
Name: cylinders, dtype: float64
pd.DataFrame(df.groupby(['origin', 'model_year'])['cylinders'].mean())

                   cylinders
origin model_year
europe 70           4.000000
       71           4.000000
       72           4.000000
       73           4.000000
       74           4.000000
       75           4.000000
       76           4.250000
       77           4.000000
       78           4.833333
       79           4.250000
       80           4.111111
       81           4.500000
       82           4.000000
japan  70           4.000000
       71           4.000000
       72           3.800000
       73           4.250000
       74           4.000000
       75           4.000000
       76           4.500000
       77           4.166667
       78           4.000000
       79           4.000000
       80           4.076923
       81           4.333333
       82           4.000000
usa    70           7.636364
       71           6.200000
       72           6.888889
       73           7.241379
       74           6.266667
       75           6.400000
       76           6.363636
       77           6.222222
       78           6.000000
       79           6.260870
       80           4.285714
       81           4.923077
       82           4.300000

df.pivot_table?
Signature:
df.pivot_table(
    values=None,
    index=None,
    columns=None,
    aggfunc='mean',
    fill_value=None,
    margins=False,
    dropna=True,
    margins_name='All',
    observed=False,
) -> 'DataFrame'
df2 = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])
df2

   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12

df2.shift(1)

     a    b     c
1  NaN  NaN   NaN
2  4.0  7.0  10.0
3  5.0  8.0  11.0

df2.shift(-1)

     a    b     c
1  5.0  8.0  11.0
2  6.0  9.0  12.0
3  NaN  NaN   NaN

df2['a'].shift(2)
1    NaN
2    NaN
3    4.0
Name: a, dtype: float64
df2['b'].shift(-1)
1    8.0
2    9.0
3    NaN
Name: b, dtype: float64
df2['b'] = df2['b'].shift(-1)
df2

   a    b   c
1  4  9.0  10
2  5  NaN  11
3  6  NaN  12
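shift moves values along the index: a positive period lags (first slots become NaN), a negative one leads (last slots become NaN). A minimal sketch:

```python
import pandas as pd

s = pd.Series([4, 5, 6])
lag = s.shift(1)    # values move down one slot; the first becomes NaN
lead = s.shift(-1)  # values move up one slot; the last becomes NaN
```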

df['model_year']
0      70
1      70
2      70
3      70
4      70
       ..
393    82
394    82
395    82
396    82
397    82
Name: model_year, Length: 398, dtype: int64
df['model_year'].rank(method='max')
# tied values all receive the largest rank in their tie group
0       29.0
1       29.0
2       29.0
3       29.0
4       29.0
       ...  
393    398.0
394    398.0
395    398.0
396    398.0
397    398.0
Name: model_year, Length: 398, dtype: float64
df['model_year'].rank(method='min')
# tied values all receive the smallest rank in their tie group
0        1.0
1        1.0
2        1.0
3        1.0
4        1.0
       ...  
393    368.0
394    368.0
395    368.0
396    368.0
397    368.0
Name: model_year, Length: 398, dtype: float64
df['model_year'].rank(method='min').value_counts()
86.0     40
245.0    36
183.0    34
368.0    31
153.0    30
339.0    29
281.0    29
310.0    29
1.0      29
217.0    28
58.0     28
30.0     28
126.0    27
Name: model_year, dtype: int64
df['model_year'].rank(pct=True).head()
# pct=True returns the rank as a fraction of the total count (a percentile)
0    0.037688
1    0.037688
2    0.037688
3    0.037688
4    0.037688
Name: model_year, dtype: float64
df['model_year'].rank(method='first').head()
# ties are broken by order of appearance
0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
Name: model_year, dtype: float64
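The rank methods above are easiest to compare side by side on a tiny series with a tie:

```python
import pandas as pd

s = pd.Series([70, 70, 82])
r_min = s.rank(method="min")      # ties get the smallest rank in the group
r_max = s.rank(method="max")      # ties get the largest rank in the group
r_first = s.rank(method="first")  # ties broken by order of appearance
```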
df.rank?
df2

a

b

c

1

4

9.0

10

2

5

NaN

11

3

6

NaN

12

df2.cumsum()
# running (cumulative) sums

    a    b   c
1   4  9.0  10
2   9  NaN  21
3  15  NaN  33

df2['b'] = [9.0, 13.0, 11.5]
df2.cummax()
# b's third value is 11.5, smaller than 13, so the running max stays at 13.0

   a     b   c
1  4   9.0  10
2  5  13.0  11
3  6  13.0  12

df2.cummin()

   a    b   c
1  4  9.0  10
2  4  9.0  10
3  4  9.0  10

df2.cumprod()
# ๋ˆ„์ ๊ณฑ

     a       b     c
1    4     9.0    10
2   20   117.0   110
3  120  1345.5  1320
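The cumulative family all scan the series once and carry a running value; a minimal sketch on a small series:

```python
import pandas as pd

s = pd.Series([4, 5, 6])
totals = s.cumsum()   # running sum
peaks = s.cummax()    # running maximum
prods = s.cumprod()   # running product
```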

๋งˆํฌ๋‹ค์šด ์ˆ˜์‹์ž…๋ ฅ

์ด๋ฒˆ์ฃผ์ฐจ ๊ฐ•์˜๊ฐ€ ์ˆ˜์‹์„ ์ž…๋ ฅํ•  ์ผ์ด ๋งŽ์•„ ์•„๋ž˜ ์‚ฌ์ดํŠธ๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค. (๋ฌผ๋ก  ๋‚˜์—๊ฒŒ)

https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

[Programmers AI School Cohort 1] Week 5, Day 1

Machine Learning Basics - Introduction

What is machine learning?

  • Machine learning: the study of computer algorithms that improve automatically through experience.

  • Training data: input vectors and target values.

  • The output of a machine learning algorithm is a function that predicts the target values.

  • In digit recognition, the input vector is a handwritten-digit image and the target is the predicted digit from 0 to 9 — more precisely, one of 10 classes (where class numbers start from 1).

Core Concepts

  • Training phase: the stage where the function y(x) is determined based on the training data.

  • Test set: new data used to evaluate the model.

  • Generalization: the ability of a model to make correct predictions on new data it has not encountered during training.

  • Supervised learning: classification and regression.

  • Unsupervised learning: clustering.

Polynomial Curve Fitting

  • Polynomial Curve Fitting

  • This is a regression problem.

  • Find a function passing through the points (or, failing that, coming as close as possible).

  • Training data: input vectors and target values.

  • Goal: predict the target value for a new input vector.

  • Probability theory: provides a mathematical framework for quantifying and expressing the uncertainty of predictions.

  • Decision theory: provides a methodology for making optimal predictions based on the probabilistic representation.

๊ณผ์†Œ์ ํ•ฉ๊ณผ ๊ณผ๋Œ€์ ํ•ฉ

์‹ค์ œ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ํฌ๊ธฐ์— ๋น„ํ•ด ๋„ˆ๋ฌด ๊ณ ์ฐจ์› ํ•จ์ˆ˜ ๋˜๋Š” ๋„ˆ๋ฌด ์ €์ฐจ์› ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์‹ค์ œ ์„ฑ๋Šฅ์—์„œ ์—๋Ÿฌ๊ฐ€ ๋งŽ์ด ๋ฐœ์ƒํ•œ๋‹ค.

ERMS=2E(wโˆ—)/N E_{RMS} = \sqrt {2E(w^*)/N} : Root Mean Square

๋˜ํ•œ, ๊ณ ์ฐจ์› ํ•จ์ˆ˜๋”๋ผ๋„ ๋งŽ์€ ์–‘์— ๋ฐ์ดํ„ฐ๊ฐ€ ์กด์žฌํ•œ๋‹ค๋ฉด ๊ณผ๋Œ€์ ํ•ฉ์ด ๋ฐœ์ƒํ•  ๊ฐ€๋Šฅ์„ฑ์ด ์ ์–ด์ง„๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์‹ค์ œ๋กœ ๋จธ์‹ ๋Ÿฌ๋‹์˜ ๋ฐ์ดํ„ฐ ์ˆ˜๋Š” ๊ต‰์žฅํžˆ ๋งŽ๋‹ค๋Š” ์ .

๊ทœ์ œํ™”(Regularization)

ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ’์ด ๋„ˆ๋ฌด ์ปค์ง€์ง€ ์•Š๋„๋ก ํ•˜๋Š” ๋ฐฉ๋ฒ•.

์ด ๋•Œ ๋„ˆ๋ฌด ์‹ฌํ•˜๊ฒŒ ํ•˜๋ฉด ๊ณผ๋Œ€/๊ณผ์†Œ ์ ํ•ฉ์ด ์˜ˆ๊ธฐ์น˜ ์•Š๊ฒŒ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Œ.

Machine Learning Basics - Probability Theory 1

Random Variables

A random variable X is a function that maps each element e of the sample space S to a real value X(e) = x.

  • Uppercase X, Y, ... : random variables

  • Lowercase x, y, ... : values a random variable can take

  • The probability P is a function that maps subsets of S to real values.

  • ex) S = {HH, HT, TH, TT}; tossing a coin twice

  • X(HH) = 2, X(HT) = 1, X(TH) = 1, X(TT) = 0; the number of heads

  • P[X = 1] = P[{HT, TH}] = $\frac{2}{4} = \frac{1}{2}$

Continuous Random Variables

Given the cumulative distribution function $F(x) = P[X \in (-\infty, x)]$, if for a random variable X with CDF F(x) there exists a function f(x) satisfying $F(x) = \int_{-\infty}^{x} f(t)\,dt$, then X is called a continuous random variable and f(x) is called the probability density function of X.

  • To make the random variable explicit we write F_X(x), f_X(x); for density functions, p(x) is also used.

ํ™•๋ฅ  ๋ณ€์ˆ˜์˜ ์„ฑ์งˆ

  • ๋ง์…ˆ ๋ฒ•์น™

  • ๊ณฑ์…ˆ ๋ฒ•์น™

  • ๋ฒ ์ด์ฆˆ ํ™•๋ฅ  (posterior ์‚ฌํ›„ํ™•๋ฅ , likelihood ๊ฐ€๋Šฅ์„ฑ, prior ์‚ฌ์ „ํ™•๋ฅ , marginal normalization ๊ฒฝ๊ณ„ํ™•๋ฅ )

ํ™•๋ฅ ๋ณ€์ˆ˜์˜ ํ•จ์ˆ˜

ํ™•๋ฅ ๋ณ€์ˆ˜ X์˜ ํ•จ์ˆ˜ Y = f(X)๋„ ํ™•๋ฅ ๋ณ€์ˆ˜์ด๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ํ™•๋ฅ  ๋ณ€์ˆ˜ X๊ฐ€ ์ฃผ(week)์˜ ์ˆ˜๋กœ ํ‘œํ˜„๋˜์—ˆ๋‹ค๊ณ  ํ•˜๋ฉด ์ผ(day)์˜ ์ˆ˜๋กœ ํ‘œํ˜„๋œ ์ƒˆ๋กœ์šด ํ™•๋ฅ ๋ณ€์ˆ˜๋ฅผ ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • Y = 7X

  • P[14 <= Y <= 21] = P[2 <= X <= 3]

  • py(y)=px(x)โˆฃdxdyโˆฃ p_y(y) = p_x(x)|\frac {dx} {dy} |
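Applied to the weeks-to-days example Y = 7X, the density rule gives concretely:

```latex
p_y(y) = p_x\left(\frac{y}{7}\right)\left|\frac{dx}{dy}\right| = \frac{1}{7}\,p_x\left(\frac{y}{7}\right)
```

The factor 1/7 compensates for the stretching of the axis, so the density still integrates to 1.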

k์ฐจ์›์˜ ํ™•๋ฅ ๋ณ€์ˆ˜ ๋ฒกํ„ฐ x = (x1, ... , xk)๊ฐ€ ์ฃผ์–ด์งˆ ๋•Œ, k๊ฐœ์˜ x์— ๊ด€ํ•œ ํ•จ์ˆ˜๋“ค์€ ์ƒˆ๋กœ์šด ํ™•๋ฅ ๋ณ€์ˆ˜๋ฒกํ„ฐ y = (y1, ... yk)๋ฅผ ์ •์˜ํ•œ๋‹ค. ๊ฐ„๋žตํ•˜๊ฒŒ y = (x)๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ๋งŒ์•ฝ y = g(x)๊ฐ€ ์ผ๋Œ€์ผ ๋ณ€ํ™˜์ธ ๊ฒฝ์šฐ(x = w(y)๋กœ ์œ ์ผํ•œ ํ•ด๋ฅผ ๊ฐ€์งˆ ๋•Œ), y์˜ ๊ฒฐํ•ฉํ™•๋ฅ ๋ฐ€๋„ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • py(y1,...,yk)=px(x1,...,xk)โˆฃJโˆฃ p_y (y_1, ... , y_k) = p_x(x_1, ... , x_k)|J|

  • where J = dx1dy1dx1dy2...dx1dykdx2dy1............dxkdy1......dxkdyk \begin{matrix} \frac {dx_1} {dy_1} & \frac {dx_1} {dy_2} & ... & \frac {dx_1} {dy_k} \\ \frac {dx_2} {dy_1} & ... & ... & ... \\ ... \\ \frac {dx_k} {dy_1} & ... & ... & \frac {dx_k} {dy_k} \end{matrix}

์˜ˆ์ œ

px1,x2(x1,x2)=eโˆ’(x1+x2),x1>0,x2>0 p_{x1,x2} (x1, x2) = e^{-(x_1+x_2)}, x1 > 0, x2 > 0์ผ ๋•Œ, y1=x1,y2=x1+x2 y_1 = x_1, y_2 = x_1 + x_2 ์— ์˜ํ•ด์„œ ์ •์˜๋˜๋Š” y์˜ pdf๋Š”?

fy1,y2(y1,y2)=fx1,x2(x1,x2)โˆฃJโˆฃ=fx1,x2(y1,y2โˆ’y1)=eโˆ’{โˆ’y1+y2โˆ’y1}=ey2 f{y_1, y_2}(y_1, y_2) = f{x_1, x_2}(x_1, x_2)|J| = f{x_1, x_2}(y_1, y_2 - y_1) = e^{-\{-y1 + y2 -y1\}} = e^{y2}

0<y1<y2<โˆž 0 < y_1 < y_2 < \infty

fy1(y1)=โˆซy1โˆžeโˆ’y2=โˆ’eโˆ’y2โˆฃy1โˆž=eโˆ’y1 f_{y_1}(y_1) = \int^\infty _{y_1} e^{-y_2} = -e^{-y_2} |^\infty _{y_1} = e^{-y_1}

With the Inverse CDF Technique, points can be sampled uniformly at random inside a circle of radius r.

Expected Value

The average value of a function f(x) under a probability distribution p(x).

Variance and Covariance

Variance of f(x): the degree to which the values of f(x) are spread out around the expected value.

Two different viewpoints for interpreting probability: frequentist vs. Bayesian

Frequentist: based on the frequencies of repeatable events.

Bayesian: expresses uncertainty quantitatively.

The probability that the Arctic ice will have melted away by the end of this century concerns an event that is not repeatable.

Frequentists compute probabilities using estimators, and the uncertainty of the estimated parameters is obtained with the bootstrap method.

The Bayesian approach has the advantage that prior knowledge can be included in the model as a prior probability.

Normal Distribution

The Gaussian distribution for a single variable x.
