1 Fri

TIL

[Inflearn] Mastering Data Analysis and Visualization with Just Two Cheat Sheets

For my first AI School project, I wanted to study Pandas in more depth. I also expect to keep using Pandas for visualization later on, so this seemed like a good time to learn it properly. Let's go!

Understanding pandas DataFrames and Series - Syntax

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

import pandas as pd
df = pd.DataFrame(
        {"a" : [4, 5, 6],
        "b" : [7, 8, 9],
        "c" : [10, 11, 12]},
            index = [1, 2, 3])

Each column of this DataFrame is a Series, and a single row selected from it is also returned as a Series. If no index is given, it defaults to [0, 1, 2, ...].
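
As a small illustration of the point above, the sketch below checks the types directly. The variable names (`col`, `row`) are made up for this example, and the frame is built without an explicit index so that the default one is visible.

```python
import pandas as pd

# No index argument, so pandas assigns the default 0, 1, 2, ...
df = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]})

col = df["a"]    # one column
row = df.loc[0]  # one row, selected by its index label

print(type(col).__name__)  # Series
print(type(row).__name__)  # Series
print(list(df.index))      # [0, 1, 2]
```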

Basic DataFrame operations

df
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12

Let's grab a specific column!

df["a"]
1    4
2    5
3    6
Name: a, dtype: int64

Viewing multiple columns!

df[["a", "b"]]
   a  b
1  4  7
2  5  8
3  6  9

Viewing the row with index label n

df.loc[3]
a     6
b     9
c    12
Name: 3, dtype: int64

Viewing the rows for several index labels

df.loc[[1,2]]
   a  b   c
1  4  7  10
2  5  8  11

Viewing specific rows and columns (write the row first, then the column)

df.loc[1, "b"]
7

df.loc[[1, 2], ["a","b"]]
   a  b
1  4  7
2  5  8

Creating a pandas DataFrame and fetching data - Syntax

Choosing Kernel - Restart & Clear Output wipes all execution results, which is handy for working through the notebook again as review!

df = pd.DataFrame(
        [[4, 7, 10],
        [5, 8, 11],
        [6, 9, 12]],
        index=[1, 2, 3],
        columns=['a', 'b', 'c'])
df
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12

The following two DataFrame definitions produce identical results.

pd.DataFrame(
        {"a" : [4, 5, 6],
        "b" : [7, 8, 9],
        "c" : [10, 11, 12]},
            index = [1, 2, 3])
df = pd.DataFrame(
        [[4, 7, 10],
        [5, 8, 11],
        [6, 9, 12]],
        index=[1, 2, 3],
        columns=['a', 'b', 'c'])
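
One way to confirm this rather than compare the printed tables by eye is `DataFrame.equals`. This is only a sketch; the names `from_dict` and `from_rows` are invented for the example.

```python
import pandas as pd

# Column-oriented construction: one dict key per column
from_dict = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

# Row-oriented construction: one inner list per row
from_rows = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

# equals compares shape, values, and dtypes
same = from_dict.equals(from_rows)
print(same)  # True
```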

Specifying the index - by using tuples, a DataFrame can have multiple index levels (a MultiIndex).

df = pd.DataFrame(
        {"a" : [4 ,5, 6],
        "b" : [7, 8, 9],
        "c" : [10, 11, 12]},
        index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2)],
        names=['n','v']))
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
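
With two index levels, rows can be looked up by the outer level alone, by a full (n, v) tuple, or by the inner level. The sketch below is one possible way to do this with `loc` and `xs`; the result names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2)], names=["n", "v"]
    ),
)

outer = df.loc["d"]          # every row whose n level is "d"
single = df.loc[("d", 2)]    # the one row with n == "d" and v == 2 (a Series)
inner = df.xs(2, level="v")  # every row whose v level is 2

print(list(outer.index))  # [1, 2]
print(single["b"])        # 8
print(list(inner["a"]))   # [5, 6]
```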

Indexing a pandas DataFrame with comparison operators - Subset Observations(Rows)

Indexing (filtering) on a specific column

df[df.a > 7]
Empty DataFrame
Columns: [a, b, c]
Index: []

df[df.a < 7]
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12

df.b > 7
n  v
d  1    False
   2     True
e  2     True
Name: b, dtype: bool

The following two snippets are equivalent.

df[df.b > 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12

df[df['b'] > 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12

df.OO and df['OO'] refer to the same column; in both forms, mind the capitalization. The difference is that dot notation can raise an error when the column name contains special characters (such as spaces) or Hangul.
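
A short sketch of where the two notations diverge. The column names "unit price" and "count" are invented for this example; "count" was picked because it collides with an existing DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({"unit price": [100, 200], "count": [1, 2]})

# A name with a space only works with brackets;
# df.unit price would be a SyntaxError.
prices = df["unit price"]

# Brackets return the column ...
counts = df["count"]

# ... while the dot form finds the DataFrame.count method instead.
attr = df.count

print(list(prices))    # [100, 200]
print(list(counts))    # [1, 2]
print(callable(attr))  # True
```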

df = pd.DataFrame(
        {"a" : [4 ,5, 6, 6],
        "b" : [7, 8, 9, 9],
        "c" : [10, 11, 12, 12]},
        index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2), ('e', 3)],
        names=['n','v']))
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12

df.drop_duplicates(): a method that removes duplicate rows

df.drop_duplicates()
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12

However, printing df again shows that the original is unchanged.

df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12

In that case, set the inplace option to True as shown below. However, using inplace is not recommended in pandas.

df.drop_duplicates(inplace=True)
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12

So the following approach is recommended instead.

df2 = df.drop_duplicates()
df2
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12

Entering the following shows the documentation for the function.

df.drop_duplicates?
'''
Signature:
df.drop_duplicates(
    subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
    keep: Union[str, bool] = 'first',
    inplace: bool = False,
    ignore_index: bool = False,
) -> Union[ForwardRef('DataFrame'), NoneType]
'''
df = pd.DataFrame(
        {"a" : [4 ,5, 6, 6],
        "b" : [7, 8, 9, 9],
        "c" : [10, 11, 12, 12]},
        index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2), ('e', 3)],
        names=['n','v']))

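When removing duplicate rows, you can choose to keep the last occurrence instead of the first. Note that only the final expression, df, is displayed below, and since the result was not assigned, df still contains all four rows.

df.drop_duplicates(keep = 'last')
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12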

Summary: drop_duplicates is used to remove duplicate rows.
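
To make the effect of the keep option visible, the sketch below assigns each result to its own variable and prints which index entries survive. The variable names are illustrative; `keep=False` is a further option from the signature shown earlier that drops every duplicated row.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 6], "b": [7, 8, 9, 9], "c": [10, 11, 12, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2), ("e", 3)], names=["n", "v"]
    ),
)

keep_first = df.drop_duplicates()             # default: keep the first copy
keep_last = df.drop_duplicates(keep="last")   # keep the last copy
keep_none = df.drop_duplicates(keep=False)    # drop every duplicated row

print(keep_first.index.tolist())  # [('d', 1), ('d', 2), ('e', 2)]
print(keep_last.index.tolist())   # [('d', 1), ('d', 2), ('e', 3)]
print(keep_none.index.tolist())   # [('d', 1), ('d', 2)]
print(len(df))                    # 4, the original is untouched
```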

Logic in Python - Subset Observations(Rows)

df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12

df[df.b != 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12
  3  6  9  12

isin(): checks, row by row, whether each value is among the given arguments

df.column.isin?
Object `df.column.isin` not found.

column => must be replaced with the name of an actual column. Also, the argument to isin must be list-like.

df.a.isin([5])
n  v
d  1    False
   2     True
e  2    False
   3    False
Name: a, dtype: bool
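
The boolean Series returned by isin is usually fed straight back into the DataFrame as a filter. The following is a minimal sketch on a plain-indexed frame; the names `mask`, `selected`, and `excluded` are made up, and the final part shows what happens when a bare scalar is passed instead of a list.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6, 6], "b": [7, 8, 9, 9]})

mask = df["a"].isin([4, 6])  # True where a is 4 or 6
selected = df[mask]          # rows that match
excluded = df[~mask]         # rows that do not match

print(list(selected.index))  # [0, 2, 3]
print(list(excluded.index))  # [1]

# A scalar is not list-like, so pandas rejects it
try:
    df["a"].isin(5)
    scalar_rejected = False
except TypeError:
    scalar_rejected = True
print(scalar_rejected)  # True
```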

isnull(): checks whether each value is null

import numpy as np
df = pd.DataFrame(
        {"a" : [4 ,5, 6, 6, np.nan],
        "b" : [7, 8, np.nan, 9, 9],
        "c" : [10, 11, 12, np.nan, 12]},
        index = pd.MultiIndex.from_tuples(
        [('d',1),('d',2),('e',2), ('e', 3), ('e', 4)],
        names=['n','v']))
df
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0

pd.isnull(df)
         a      b      c
n v
d 1  False  False  False
  2  False  False  False
e 2  False   True  False
  3  False  False   True
  4   True  False  False

df['a'].isnull()
n  v
d  1    False
   2    False
e  2    False
   3    False
   4     True
Name: a, dtype: bool
df['b'].isnull().sum()
1

notnull: checks whether each value is non-null

pd.notnull(df)
         a      b      c
n v
d 1   True   True   True
  2   True   True   True
e 2   True  False   True
  3   True   True  False
  4  False   True   True

df.notnull()
         a      b      c
n v
d 1   True   True   True
  2   True   True   True
e 2   True  False   True
  3   True   True  False
  4  False   True   True

The two snippets above produce the same result.

df.a.notnull()
n  v
d  1     True
   2     True
e  2     True
   3     True
   4    False
Name: a, dtype: bool

and, or, not, xor, any, all

These correspond to &, |, ~, ^, df.any(), and df.all(), respectively.

df.a.isnull()
n  v
d  1    False
   2    False
e  2    False
   3    False
   4     True
Name: a, dtype: bool
~df.a.isnull()
n  v
d  1     True
   2     True
e  2     True
   3     True
   4    False
Name: a, dtype: bool
df
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0

df[(df.b == 7) & (df.a == 5)]
Empty DataFrame
Columns: [a, b, c]
Index: []

df[(df.b == 7) & (df.a == 4)]
       a    b     c
n v
d 1  4.0  7.0  10.0
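
The operators from the list that were not exercised above (|, ^, any, all) can be tried the same way. This sketch reuses the same values on a default index for brevity, and the variable names are illustrative. Each comparison is wrapped in parentheses because & , | and ^ bind more tightly than == in Python.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 6, np.nan],
     "b": [7, 8, np.nan, 9, 9],
     "c": [10, 11, 12, np.nan, 12]}
)

either = df[(df.a == 4) | (df.b == 9)]       # or: at least one side is True
exactly_one = df[(df.a == 6) ^ (df.b == 9)]  # xor: exactly one side is True

has_null = df.isnull().any()                  # per column: any null value?
complete_rows = df[df.notnull().all(axis=1)]  # rows where every value is present

print(list(either.index))         # [0, 3, 4]
print(list(exactly_one.index))    # [2, 4]
print(has_null.tolist())          # [True, True, True]
print(list(complete_rows.index))  # [0, 1]
```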

Previewing data with head, tail, and sample - Subset Observations(Rows)

df.head(): returns the first n rows

The default is 5 rows.

df.head(3)
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0

df.tail(): returns the last n rows

df.tail(4)
       a    b     c
n v
d 2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0

df.sample(frac=0.5)

df.sample(frac = m)

Here 0 <= m <= 1. That fraction of the rows is drawn at random, so the rows come back in shuffled order.

df.sample(frac=0.5)
       a    b     c
n v
e 3  6.0  9.0   NaN
d 2  5.0  8.0  11.0

df.sample(frac=0.5)
       a    b     c
n v
e 4  NaN  9.0  12.0
d 1  4.0  7.0  10.0

df.sample(frac=1)
       a    b     c
n v
e 2  6.0  NaN  12.0
d 2  5.0  8.0  11.0
e 4  NaN  9.0  12.0
  3  6.0  9.0   NaN
d 1  4.0  7.0  10.0

df.sample(n=10)

df.sample(n = m)

Here m is a natural number. (It cannot exceed the total number of rows unless replace=True is passed.)

df.sample(n = 5)
       a    b     c
n v
e 4  NaN  9.0  12.0
d 2  5.0  8.0  11.0
e 3  6.0  9.0   NaN
  2  6.0  NaN  12.0
d 1  4.0  7.0  10.0

df.sample(n = 3)
       a    b     c
n v
d 2  5.0  8.0  11.0
e 3  6.0  9.0   NaN
  4  NaN  9.0  12.0

Use frac to sample by fraction and n to sample by count.
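
Because sample is random, each call above returned different rows. If repeatable results are needed, the random_state parameter can fix the seed. The sketch below also shows what happens when n is larger than the number of rows; the seed value 0 and the variable names are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6, 6, 7]})

# Same seed, same rows
first_draw = df.sample(frac=0.5, random_state=0)
second_draw = df.sample(frac=0.5, random_state=0)
print(first_draw.equals(second_draw))  # True

# Asking for more rows than exist fails by default ...
try:
    df.sample(n=10)
    oversample_failed = False
except ValueError:
    oversample_failed = True
print(oversample_failed)  # True

# ... but works when sampling with replacement
with_replacement = df.sample(n=10, replace=True, random_state=0)
print(len(with_replacement))  # 10
```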

Indexing data with iloc, nlargest, and nsmallest - Subset Observations(Rows)

df.iloc[:]

Selects rows by integer position, over the range given by the slice.

df.iloc[:]
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0

df.iloc[1:]
       a    b     c
n v
d 2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0

df.iloc[3:4]
       a    b    c
n v
e 3  6.0  9.0  NaN
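
Since iloc works on positions while loc (used earlier) works on labels, the same slice can select different rows. A minimal sketch on the first example frame, whose labels start at 1 so that positions and labels differ; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

by_position = df.iloc[1:2]  # position 1 only; the end position is excluded
by_label = df.loc[1:2]      # labels 1 and 2; the end label is included

print(list(by_position.index))  # [2]
print(list(by_label.index))     # [1, 2]
print(df.iloc[0]["a"])          # 4, the first row by position
print(df.loc[1]["a"])           # 4, the row whose label is 1
```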

df.nlargest(n, 'value')

Returns the n rows with the largest values in the value column, in descending order.

df = pd.DataFrame(
        {"a" : [1, 10, 8, 11, -1],
         "b" : list('abcde'),
         "c" : [1.0, 2.0, np.nan, 3.0, 4.0]})
df
    a  b    c
0   1  a  1.0
1  10  b  2.0
2   8  c  NaN
3  11  d  3.0
4  -1  e  4.0

df.nlargest(3, 'a')
    a  b    c
3  11  d  3.0
1  10  b  2.0
2   8  c  NaN

# df.nlargest(1, 'b')
# Column b is not numeric, so this raises a TypeError
df.nlargest(5, 'c')
# NaN is not a number, so the row containing it is left out
    a  b    c
4  -1  e  4.0
3  11  d  3.0
1  10  b  2.0
0   1  a  1.0
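
For a non-numeric column such as b, one possible workaround (not part of the course material, just a sketch) is to sort and take the head instead. Note that sort_values keeps the NaN row and places it last by default, unlike the nlargest output above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 10, 8, 11, -1],
     "b": list("abcde"),
     "c": [1.0, 2.0, np.nan, 3.0, 4.0]}
)

# Strings sort alphabetically, so descending order puts "e" first
top_b = df.sort_values("b", ascending=False).head(1)

# NaN is kept and placed at the end (na_position defaults to "last")
top_c = df.sort_values("c", ascending=False).head(5)

print(list(top_b["b"]))     # ['e']
print(list(top_c.index))    # [4, 3, 1, 0, 2]
```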

df.nsmallest(n, 'value')

Returns the n rows with the smallest values in the value column, in ascending order.

df.nsmallest(1, 'a')
   a  b    c
4 -1  e  4.0

df.nsmallest(4, 'a')
    a  b    c
4  -1  e  4.0
0   1  a  1.0
2   8  c  NaN
1  10  b  2.0
