1 Fri
TIL
[Inflearn] Mastering Data Analysis and Visualization with Just Two Pages of Documentation
I wanted to study Pandas further for my first AI School project. I also expect to use Pandas-based visualization later on, so this seemed like a good time to learn it. Let's go!
Understanding pandas DataFrames and Series - Syntax
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
import pandas as pd
df = pd.DataFrame(
{"a" : [4, 5, 6],
"b" : [7, 8, 9],
"c" : [10, 11, 12]},
index = [1, 2, 3])
A single column of this DataFrame is called a Series (a single row selected on its own also comes back as a Series). The default index is [0, 1, ...].
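To check the Series point for myself, here is a minimal sketch. The variable names column, row and default_df are my own and not from the course.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

column = df["a"]   # one column
row = df.loc[1]    # one row, selected by its index label

print(type(column).__name__)  # Series
print(type(row).__name__)     # Series

# Without an explicit index, pandas numbers the rows 0, 1, 2, ...
default_df = pd.DataFrame({"a": [4, 5, 6]})
print(list(default_df.index))  # [0, 1, 2]
```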
Basic DataFrame operations
df
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
Let's pull out a specific column!
df["a"]
1 4
2 5
3 6
Name: a, dtype: int64
Viewing several columns at once
df[["a", "b"]]
   a  b
1  4  7
2  5  8
3  6  9
Viewing the row with index label n
df.loc[3]
a 6
b 9
c 12
Name: 3, dtype: int64
Viewing the rows for several index labels
df.loc[[1,2]]
   a  b   c
1  4  7  10
2  5  8  11
Viewing specific rows and columns: write them in row, column order
df.loc[1, "b"]
7
df.loc[[1, 2], ["a","b"]]
   a  b
1  4  7
2  5  8
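Since the index here starts at 1, it is worth confirming that loc works on labels rather than positions. This is a small sketch of my own; first_b, sliced and missing_label_raises are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

# loc looks up labels, so 1 means "the row labelled 1", not "the second row".
first_b = df.loc[1, "b"]
print(first_b)  # 7

# A label slice with loc includes both endpoints.
sliced = df.loc[1:2, "a":"b"]
print(sliced.shape)  # (2, 2)

# There is no row labelled 0 in this frame, so this lookup fails.
try:
    df.loc[0]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
print(missing_label_raises)  # True
```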
Creating pandas DataFrames and retrieving data - Syntax
Kernel > Restart & Clear Output wipes every cell output, which makes the notebook easy to rerun for review.
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])
df
   a  b   c
1  4  7  10
2  5  8  11
3  6  9  12
The following two definitions build the same DataFrame:
pd.DataFrame(
{"a" : [4, 5, 6],
"b" : [7, 8, 9],
"c" : [10, 11, 12]},
index = [1, 2, 3])
df = pd.DataFrame(
[[4, 7, 10],
[5, 8, 11],
[6, 9, 12]],
index=[1, 2, 3],
columns=['a', 'b', 'c'])
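One way to confirm that the column-wise and row-wise constructions match is DataFrame.equals. This is a quick sketch; from_dict, from_rows and same are names I made up.

```python
import pandas as pd

# Column-oriented: one dict key per column.
from_dict = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=[1, 2, 3],
)

# Row-oriented: one inner list per row, column names given separately.
from_rows = pd.DataFrame(
    [[4, 7, 10], [5, 8, 11], [6, 9, 12]],
    index=[1, 2, 3],
    columns=["a", "b", "c"],
)

same = from_dict.equals(from_rows)
print(same)  # True
```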
Specifying the index: by passing tuples to pd.MultiIndex.from_tuples, a DataFrame can have several index levels.
df = pd.DataFrame(
{"a" : [4 ,5, 6],
"b" : [7, 8, 9],
"c" : [10, 11, 12]},
index = pd.MultiIndex.from_tuples(
[('d',1),('d',2),('e',2)],
names=['n','v']))
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
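The lecture only builds the MultiIndex, so here is a rough sketch of how rows could then be selected from it. The use of loc with a tuple and of xs is my own addition, as are the names outer, one_row and v_is_2.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6], "b": [7, 8, 9], "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2)], names=["n", "v"]
    ),
)

# Outer level only: every row whose "n" label is "d".
outer = df.loc["d"]
print(outer.shape)  # (2, 3)

# A full (n, v) tuple identifies exactly one row, returned as a Series.
one_row = df.loc[("d", 2)]
print(one_row["b"])  # 8

# xs selects on an inner level by name.
v_is_2 = df.xs(2, level="v")
print(list(v_is_2["a"]))  # [5, 6]
```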
Indexing pandas DataFrames with comparison operators - Subset Observations(Rows)
Indexing (filtering) on a specific column
df[df.a > 7]
Empty DataFrame
Columns: [a, b, c]
Index: []
df[df.a < 7]
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
df.b > 7
n  v
d  1    False
   2     True
e  2     True
Name: b, dtype: bool
The following two snippets are equivalent.
df[df.b > 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12
df[df['b'] > 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12
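df.OO and df['OO'] refer to the same column; note that column names are case-sensitive. The difference is that dot access can fail when the name contains special characters or Korean text, so bracket access is the safer choice.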
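To see where dot access breaks down, I tried a small sketch. The column names "count" and "unit price" are hypothetical, picked only to trigger the problem.

```python
import pandas as pd

# Hypothetical column names chosen to show where dot access stops working.
df = pd.DataFrame({"count": [1, 2, 3], "unit price": [10, 20, 30], "a": [4, 5, 6]})

# Plain identifier-like names work either way.
same_column = df.a.equals(df["a"])
print(same_column)  # True

# "count" collides with the DataFrame.count method, so the attribute is the method.
print(callable(df.count))  # True
print(list(df["count"]))   # [1, 2, 3]

# A name containing a space is not valid attribute syntax; use brackets.
print(list(df["unit price"]))  # [10, 20, 30]
```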
df = pd.DataFrame(
{"a" : [4 ,5, 6, 6],
"b" : [7, 8, 9, 9],
"c" : [10, 11, 12, 12]},
index = pd.MultiIndex.from_tuples(
[('d',1),('d',2),('e',2), ('e', 3)],
names=['n','v']))
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12
df.drop_duplicates(): a method that removes duplicate rows
df.drop_duplicates()
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
However, printing df again shows that df itself is unchanged.
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12
To modify df itself, set the inplace option to True as shown below. However, using inplace is generally not recommended in pandas.
df.drop_duplicates(inplace=True)
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
Instead, assigning the result to a new variable is recommended:
df2 = df.drop_duplicates()
df2
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
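The difference between the two styles can be sketched as below. I use a plain default index to keep it short, and deduped and scratch are my own names.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6, 6], "b": [7, 8, 9, 9], "c": [10, 11, 12, 12]})

# Default: a new DataFrame comes back and the original keeps all four rows.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 4 3

# inplace=True changes the object it is called on, so work on a copy
# when the original still matters.
scratch = df.copy()
scratch.drop_duplicates(inplace=True)
print(len(df), len(scratch))  # 4 3
```

Assigning the result keeps the original available for comparison, which is probably why that style is the one recommended above.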
In a notebook, typing the following displays the documentation for that function.
df.drop_duplicates?
'''
Signature:
df.drop_duplicates(
subset: Union[Hashable, Sequence[Hashable], NoneType] = None,
keep: Union[str, bool] = 'first',
inplace: bool = False,
ignore_index: bool = False,
) -> Union[ForwardRef('DataFrame'), NoneType]
'''
df = pd.DataFrame(
{"a" : [4 ,5, 6, 6],
"b" : [7, 8, 9, 9],
"c" : [10, 11, 12, 12]},
index = pd.MultiIndex.from_tuples(
[('d',1),('d',2),('e',2), ('e', 3)],
names=['n','v']))
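When removing duplicate rows, keep='last' keeps the last occurrence instead of the first. The result is not assigned here, so df itself still shows all four rows.
df.drop_duplicates(keep = 'last')
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12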
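Because the output above is just df, the effect of keep is easy to miss. Here is a minimal sketch comparing the options; keep=False is an extra option I looked up, not something from the lecture.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 6], "b": [7, 8, 9, 9], "c": [10, 11, 12, 12]},
    index=pd.MultiIndex.from_tuples(
        [("d", 1), ("d", 2), ("e", 2), ("e", 3)], names=["n", "v"]
    ),
)

keep_first = df.drop_duplicates()            # keeps ("e", 2)
keep_last = df.drop_duplicates(keep="last")  # keeps ("e", 3)
keep_none = df.drop_duplicates(keep=False)   # drops both duplicated rows

print(list(keep_first.index))  # [('d', 1), ('d', 2), ('e', 2)]
print(list(keep_last.index))   # [('d', 1), ('d', 2), ('e', 3)]
print(list(keep_none.index))   # [('d', 1), ('d', 2)]
```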
Summary: drop_duplicates is used to remove duplicate rows.
Logic in Python - Subset Observations(Rows)
df
     a  b   c
n v
d 1  4  7  10
  2  5  8  11
e 2  6  9  12
  3  6  9  12
df[df.b != 7]
     a  b   c
n v
d 2  5  8  11
e 2  6  9  12
  3  6  9  12
isin(): checks, row by row, whether the value is among the given arguments
df.column.isin?
Object `df.column.isin` not found.
Here column must be replaced with the name of an actual column. Also, the argument to isin must be list-like.
df.a.isin([5])
n  v
d  1    False
   2     True
e  2    False
   3    False
Name: a, dtype: bool
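The boolean Series from isin can be used directly as a filter. A small sketch under my own naming (mask, selected, excluded), with a default index for brevity:

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6, 6], "b": [7, 8, 9, 9]})

# True where a is 4 or 6; the argument is a list of allowed values.
mask = df["a"].isin([4, 6])
print(list(mask))  # [True, False, True, True]

selected = df[mask]    # rows that match
excluded = df[~mask]   # rows that do not match
print(list(selected["b"]))  # [7, 9, 9]
print(list(excluded["b"]))  # [8]
```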
isnull(): checks for the presence of null values
import numpy as np
df = pd.DataFrame(
{"a" : [4 ,5, 6, 6, np.nan],
"b" : [7, 8, np.nan, 9, 9],
"c" : [10, 11, 12, np.nan, 12]},
index = pd.MultiIndex.from_tuples(
[('d',1),('d',2),('e',2), ('e', 3), ('e', 4)],
names=['n','v']))
df
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0
pd.isnull(df)
         a      b      c
n v
d 1  False  False  False
  2  False  False  False
e 2  False   True  False
  3  False  False   True
  4   True  False  False
df['a'].isnull()
n  v
d  1    False
   2    False
e  2    False
   3    False
   4     True
Name: a, dtype: bool
df['b'].isnull().sum()
1
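The same sum trick extends to the whole frame. This sketch is my own extension, with a default index and the made-up names missing_per_column and rows_with_nan.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 6, np.nan], "b": [7, 8, np.nan, 9, 9], "c": [10, 11, 12, np.nan, 12]}
)

# True counts as 1, so summing the mask counts missing values per column.
missing_per_column = {col: int(n) for col, n in df.isnull().sum().items()}
print(missing_per_column)  # {'a': 1, 'b': 1, 'c': 1}

# Rows that contain at least one missing value.
rows_with_nan = df[df.isnull().any(axis=1)]
print(list(rows_with_nan.index))  # [2, 3, 4]
```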
notnull: checks whether each value is non-null
pd.notnull(df)
        a      b      c
n v
d 1  True   True   True
  2  True   True   True
e 2  True  False   True
  3  True   True  False
  4  False  True   True
df.notnull()
        a      b      c
n v
d 1  True   True   True
  2  True   True   True
e 2  True  False   True
  3  True   True  False
  4  False  True   True
The two snippets above produce the same result.
df.a.notnull()
n  v
d  1     True
   2     True
e  2     True
   3     True
   4    False
Name: a, dtype: bool
and, or, not, xor, any, all
These correspond to &, |, ~, ^, df.any(), and df.all(), respectively.
df.a.isnull()
n  v
d  1    False
   2    False
e  2    False
   3    False
   4     True
Name: a, dtype: bool
~df.a.isnull()
n  v
d  1     True
   2     True
e  2     True
   3     True
   4    False
Name: a, dtype: bool
df
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0
df[(df.b == 7) & (df.a == 5)]
Empty DataFrame
Columns: [a, b, c]
Index: []
df[(df.b == 7) & (df.a == 4)]
       a    b     c
n v
d 1  4.0  7.0  10.0
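Only & appears in the examples, so here is a sketch of the other operators on the same values. It uses a default index, and both, either and one_only are my own names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [4, 5, 6, 6, np.nan], "b": [7, 8, np.nan, 9, 9], "c": [10, 11, 12, np.nan, 12]}
)

# Each comparison needs its own parentheses: & and | bind tighter than ==.
both = df[(df.a == 6) & (df.b == 9)]
either = df[(df.a == 4) | (df.b == 9)]
one_only = df[(df.a == 6) ^ (df.b == 9)]
print(list(both.index))      # [3]
print(list(either.index))    # [0, 3, 4]
print(list(one_only.index))  # [2, 4]

# any() / all() reduce a boolean column to a single answer.
print(bool(df.a.isnull().any()))  # True
print(bool(df.a.isnull().all()))  # False
```

Comparisons against NaN evaluate to False, which is why the rows with missing values only show up when the other condition is true.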
Previewing data with head, tail, and sample - Subset Observations(Rows)
df.head(): returns the first n rows
The default is 5 rows.
df.head(3)
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
df.tail(): returns the last n rows
df.tail(4)
       a    b     c
n v
d 2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0
df.sample(frac=0.5)
df.sample(frac = m)
Here 0 <= m <= 1. That fraction of the rows is drawn at random, so the index order comes out shuffled.
df.sample(frac=0.5)
       a    b     c
n v
e 3  6.0  9.0   NaN
d 2  5.0  8.0  11.0
df.sample(frac=0.5)
       a    b     c
n v
e 4  NaN  9.0  12.0
d 1  4.0  7.0  10.0
df.sample(frac=1)
       a    b     c
n v
e 2  6.0  NaN  12.0
d 2  5.0  8.0  11.0
e 4  NaN  9.0  12.0
  3  6.0  9.0   NaN
d 1  4.0  7.0  10.0
df.sample(n=10)
df.sample(n = m)
Here m is a natural number (it cannot exceed the total number of rows unless replace=True is passed).
df.sample(n = 5)
       a    b     c
n v
e 4  NaN  9.0  12.0
d 2  5.0  8.0  11.0
e 3  6.0  9.0   NaN
  2  6.0  NaN  12.0
d 1  4.0  7.0  10.0
df.sample(n = 3)
       a    b     c
n v
d 2  5.0  8.0  11.0
e 3  6.0  9.0   NaN
  4  NaN  9.0  12.0
Use frac to sample by proportion and n to sample by count.
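Since every call to sample gives different rows, a fixed seed helps when reviewing. The sketch below uses the random_state and replace parameters, which I looked up myself and were not covered in the lecture. The five-row frame is made up.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5)})

# random_state fixes the seed, so the same rows come back every run.
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)
print(first.equals(second))  # True

# frac=0.4 of 5 rows is 2 rows.
print(len(df.sample(frac=0.4, random_state=0)))  # 2

# Asking for more rows than exist only works with replacement.
try:
    df.sample(n=10)
    oversample_raises = False
except ValueError:
    oversample_raises = True
print(oversample_raises)  # True
print(len(df.sample(n=10, replace=True)))  # 10
```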
Indexing data with iloc, nlargest, and nsmallest - Subset Observations(Rows)
df.iloc[:]
Selects rows by integer position over the given slice range.
df.iloc[:]
       a    b     c
n v
d 1  4.0  7.0  10.0
  2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0
df.iloc[1:]
       a    b     c
n v
d 2  5.0  8.0  11.0
e 2  6.0  NaN  12.0
  3  6.0  9.0   NaN
  4  NaN  9.0  12.0
df.iloc[3:4]
       a    b     c
n v
e 3  6.0  9.0   NaN
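The contrast between iloc (position) and loc (label) is easiest to see on a frame whose labels do not start at 0. This is my own minimal sketch, reusing the small frame from the first section.

```python
import pandas as pd

df = pd.DataFrame({"a": [4, 5, 6]}, index=[1, 2, 3])

# iloc counts positions from 0; the end of a slice is excluded.
print(list(df.iloc[1:].index))   # [2, 3]
print(list(df.iloc[0:1].index))  # [1]

# loc uses the labels themselves; the end of a slice is included.
print(list(df.loc[1:2].index))   # [1, 2]

# Position 0 and label 1 are the same row here.
same_row = df.iloc[0].equals(df.loc[1])
print(same_row)  # True
```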
df.nlargest(n, 'value')
Returns the n rows with the largest values in the value column, in descending order.
df = pd.DataFrame(
{"a" : [1, 10, 8, 11, -1],
"b" : list('abcde'),
"c" : [1.0, 2.0, np.nan, 3.0, 4.0]})
df
    a  b    c
0   1  a  1.0
1  10  b  2.0
2   8  c  NaN
3  11  d  3.0
4  -1  e  4.0
df.nlargest(3, 'a')
    a  b    c
3  11  d  3.0
1  10  b  2.0
2   8  c  NaN
# df.nlargest(1, 'b')
# b is not numeric, so this raises a TypeError
df.nlargest(5, 'c')
# NaN is not a number, so that row does not appear in the output
    a  b    c
4  -1  e  4.0
3  11  d  3.0
1  10  b  2.0
0   1  a  1.0
df.nsmallest(n, 'value')
Returns the n rows with the smallest values in the value column, in ascending order.
df.nsmallest(1, 'a')
    a  b    c
4  -1  e  4.0
df.nsmallest(4, 'a')
    a  b    c
4  -1  e  4.0
0   1  a  1.0
2   8  c  NaN
1  10  b  2.0
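As far as I can tell, nlargest and nsmallest on a column without ties give the same rows as sorting and then taking head. The comparison with sort_values below is my own sketch, not something stated in the lecture.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 10, 8, 11, -1], "b": list("abcde"), "c": [1.0, 2.0, np.nan, 3.0, 4.0]}
)

# Three largest values of column a, largest first.
top3 = df.nlargest(3, "a")
by_sort = df.sort_values("a", ascending=False).head(3)
print(list(top3.index))      # [3, 1, 2]
print(top3.equals(by_sort))  # True

# Two smallest values of column a, smallest first.
bottom2 = df.nsmallest(2, "a")
print(list(bottom2.index))   # [4, 0]
```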