(Python Lecture 5-2) Python data handling

210804

Comma-Separated Values

  • CSV: a text file whose fields are separated by commas (,)

  • A data format for using spreadsheet-style (Excel) data regardless of the program

  • Variants use tabs (TSV) or spaces (SSV) as the separator

  • Collectively these are called character-separated values (CSV)

  • In Excel, CSV files can be produced with the "Save As" feature

Reading a CSV file

line_counter = 0  # counts the lines read from the file
data_header = []  # list holding the field names
customer_list = []  # list of per-customer lists
with open("customers.csv") as customer_data:  # open customers.csv as the file object customer_data
    while True:
        data = customer_data.readline()  # read one line at a time into data
        if not data:  # an empty string means end of file, so leave the loop
            break
        if line_counter == 0:  # the first line holds the field names
            data_header = data.rstrip("\n").split(",")  # split on "," and store in data_header
        else:
            customer_list.append(data.rstrip("\n").split(","))  # every other line is a record, split on ","
        line_counter += 1

print("Header :\t", data_header)  # print the field names
for i, customer in enumerate(customer_list[:10]):  # print the data (first 10 rows only)
    print("Data", i, ":\t\t", customer)
print(len(customer_list))  # print the total number of records

Notes

  • When CSV data is handled as plain text, commas (",") that appear inside a field need preprocessing

  • Python provides the csv module to make handling CSV files simple

  • Example data: korea_foot_traffic_data.csv (from http://www.data.go.kr)

  • The example data describes foot traffic in major commercial districts in Korea; because it is written in Korean, the text encoding needs to be handled
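
To make the first note concrete, here is a minimal sketch that compares a plain `split(",")` with `csv.reader` on a line whose field contains a comma. The sample line is made up for illustration and is not taken from the example data.

```python
import csv
import io

# Hypothetical sample line: the second field contains a comma inside quotes.
line = 'Kim,"Seoul, Gangnam",42\n'

# A naive split cuts the quoted field into two pieces.
naive = line.strip().split(",")

# csv.reader understands the quotes and keeps the field intact.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four items while `csv.reader` yields three, which is the kind of mismatch the preprocessing note warns about.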

Using the csv module

import csv
# f is assumed to be an already opened text file object
reader = csv.reader(f,
        delimiter=',', quotechar='"',
        quoting=csv.QUOTE_ALL)

  • delimiter : the character that separates fields; the default is ,

  • lineterminator : the line ending written by the writer; the default is \r\n (the reader detects line endings on its own)

  • quotechar : the character that wraps a field's text; the default is "

  • quoting : how much quoting with quotechar is applied; the default is QUOTE_MINIMAL
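
The parameters above can be sketched with a small round trip through an in-memory buffer. The rows, the tab delimiter, and the `"\n"` line terminator are choices made for this illustration only.

```python
import csv
import io

# Hypothetical rows; the second one contains both quotes and a comma.
rows = [["name", "comment"], ["Kim", 'said "hi", then left']]

# Write with a tab delimiter, quote every field, and end lines with "\n".
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", quotechar='"',
                    quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Read it back with the same delimiter and quotechar.
reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
restored = list(reader)
print(restored == rows)
```

As long as the reader is given the same delimiter and quotechar as the writer, the quoted field comes back unchanged, embedded quotes and comma included.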

Web

  • The World Wide Web (WWW), or "the web" for short

  • The formal name of the information space we browse over the internet every day

  • First proposed by Tim Berners-Lee in 1989, originally for exchanging information among physicists

  • Uses the HTTP protocol to send and receive data and the HTML format to display it

How it works

  1. Request : web address, form, headers, etc.

  2. Processing : the server handles the request, e.g. database work

  3. Response : the result is returned as HTML, XML, etc.

  4. Rendering : the browser displays the HTML or XML
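
As a rough sketch of step 1, the standard library can assemble a request from a web address, form data, and a header without sending anything. The URL, form fields, and header value below are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, used only for illustration.
url = "http://example.com/search"
form = {"q": "python", "page": "1"}

# Encode the form into the request body; supplying a body makes this a POST.
body = urlencode(form).encode("utf-8")
request = Request(url, data=body, headers={"User-Agent": "lecture-demo"})

print(request.get_method())
print(request.full_url)
print(request.data)
```

Passing this object to `urllib.request.urlopen` would actually send it and return the server's response (step 3); that call is left out here because it needs network access.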

HTML

  • HyperText Markup Language

  • A language for expressing information on the web in a structured way

  • Uses tags to mark elements such as titles, paragraphs, and links

  • Every element is enclosed in angle brackets

    <title>Hello, World</title> # title element, whose value is Hello, World

  • Every HTML document has a tree-shaped containment (nesting) structure

  • In general, a web page's HTML source file is downloaded by the computer and then interpreted and displayed by the web browser
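
The tree-shaped nesting can be made visible with a minimal sketch built on the standard library's `html.parser`. The `TreePrinter` class and the sample page are hypothetical, and the sketch assumes every tag in the input is explicitly closed.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Records each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # leaving the element, go back up one level


# Hypothetical page used only for illustration.
html = ("<html><head><title>Hello, World</title></head>"
        "<body><p>Hi</p></body></html>")

parser = TreePrinter()
parser.feed(html)
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indented output mirrors the containment structure: `head` and `body` sit inside `html`, and `title` and `p` sit one level deeper.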

Regular expressions

Also called regexp or regex

  • A formal notation for defining complex string patterns

  • Used to extract the set of strings that follow a particular rule

010-0000-0000 ^\d{3}-\d{4}-\d{4}$
203.252.101.40 ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
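
The two patterns can be tried in Python with the `re` module. In this sketch the dots in the IP pattern are escaped as `\.` so that they match a literal dot rather than any character; the third test string is made up to show the difference.

```python
import re

# Phone number shape: 3 digits, 4 digits, 4 digits, separated by hyphens.
phone_pattern = re.compile(r"^\d{3}-\d{4}-\d{4}$")

# IPv4 shape: four groups of 1 to 3 digits separated by literal dots.
ip_pattern = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")

print(bool(phone_pattern.match("010-0000-0000")))
print(bool(ip_pattern.match("203.252.101.40")))
print(bool(ip_pattern.match("203a252b101c40")))  # letters where dots belong
```

These patterns only check the shape of the string; the IP pattern does not verify that each group is within 0 to 255.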

Practice

1) Go to the regex playground (http://www.regexr.com/)

2) Paste the document you want to test into the Text area

3) Search it with a regular expression

XML

  • A language that marks up data with tags (markup) describing its structure and meaning

  • Values appear between an opening and a closing tag, so structured information can be expressed

  • Its syntax resembles HTML, and it is a common way to store data

  • Like HTML, XML is a structured markup language

  • It can be parsed with regular expressions

  • However, more convenient tools have been developed

  • Here it is parsed with beautifulsoup, one of the most widely used parsers

Example

<?xml version="1.0"?>
<cat>
    <name>Nabi</name>
    <breed>Siamese</breed>
    <age>6</age>
    <neutered>yes</neutered>
    <declawed>no</declawed>
    <registration_number>Izz138bod</registration_number>
    <owner>Lee Gangju</owner>
</cat>
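
The remark that XML can be parsed with regular expressions can be sketched as follows. The record is a shortened, English-tagged version of a cat record like the one above, and the pattern is an assumption that only handles flat elements containing plain text, which is one reason a real parser is usually preferred.

```python
import re

# Shortened sample record, used only for illustration.
xml_text = """<?xml version="1.0"?>
<cat>
    <name>Nabi</name>
    <breed>Siamese</breed>
    <age>6</age>
</cat>"""

# Capture the tag name and its text; \1 requires the matching closing tag.
# [^<]* refuses nested tags, so the outer <cat> element is skipped.
pairs = re.findall(r"<(\w+)>([^<]*)</\1>", xml_text)
record = dict(pairs)
print(record)
```

Attributes, nested elements, or comments would all defeat this pattern, so it is best read as a demonstration rather than a general approach.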

BeautifulSoup

  • A popular tool for scraping markup languages such as HTML and XML

  • Relies on parsers such as lxml and html5lib

  • Relatively slow, but easy to use

# Import the module
from bs4 import BeautifulSoup

# Create the object (books_xml is assumed to be a string holding the XML text)
soup = BeautifulSoup(books_xml, "lxml")

# Find every matching tag with find_all
soup.find_all("author")

JSON

  • JavaScript Object Notation

  • Originally the data-object notation of JavaScript, the language of the web

  • Concise, so it is easy for both machines and humans to read

  • Small in size and easy to convert to and from code

  • For these reasons it is widely used as a replacement for XML

  • Similar to Python's dict type: data is written as key:value pairs

  • The json module makes parsing and saving straightforward

  • Reading and writing map directly to and from the dict type

  • Most APIs offered on the web exchange information as JSON

  • This includes nearly every major site, such as Facebook, Twitter, and GitHub

  • Look up each site's developer API documentation to see how to use it
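
The correspondence with the dict type can be sketched as a round trip through a JSON string. The record below is made up for illustration.

```python
import json

# Hypothetical record mixing strings, numbers, a list, and a boolean.
record = {"name": "Zara", "age": 7, "tags": ["first", "demo"], "active": True}

# dict -> JSON text
text = json.dumps(record)
print(text)

# JSON text -> dict
restored = json.loads(text)
print(restored == record)
```

Note that Python's `True` is written as JSON `true`, and that the round trip returns an equal dict for values like these that JSON can represent directly.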

Comparing XML and JSON

<?xml version="1.0" encoding="UTF-8" ?>
<employees>
    <name>Shyam</name>
    <email>shyamjaiswal@gmail.com</email>
</employees>
<employees>
    <name>Bob</name>
    <email>bob32@gmail.com</email>
</employees>
<employees>
    <name>Jai</name>
    <email>jai87@gmail.com</email>
</employees>

JSON

{"employees":[
    {"name":"Shyam",
    "email":"shyamjaiswal@gmail.com"},
    {"name":"Bob",
    "email":"bob32@gmail.com"},
    {"name":"Jai",
    "email":"jai87@gmail.com"} ]
} 
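
To put a rough number on the size difference, the sketch below serializes the same three records both ways and compares the lengths. The wrapping `<staff>` root element is an assumption added so the generated XML is well formed, and both outputs are written without extra whitespace.

```python
import json
import xml.etree.ElementTree as ET

employees = [
    {"name": "Shyam", "email": "shyamjaiswal@gmail.com"},
    {"name": "Bob", "email": "bob32@gmail.com"},
    {"name": "Jai", "email": "jai87@gmail.com"},
]

# Compact JSON: no spaces after separators.
json_text = json.dumps({"employees": employees}, separators=(",", ":"))

# Equivalent XML; the <staff> root is hypothetical (see the lead-in).
root = ET.Element("staff")
for person in employees:
    node = ET.SubElement(root, "employees")
    for key, value in person.items():
        ET.SubElement(node, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), len(xml_text))
```

Because every XML element name appears twice (opening and closing tag), the XML version comes out longer for this data.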

Reading JSON data

import json
with open("json_example.json", "r", encoding="utf8") as f:
    contents = f.read()
    json_data = json.loads(contents)
    print(json_data["employees"])

Writing JSON data

import json
dict_data = {'Name': 'Zara', 'Age': 7, 'Class': 'First'}
with open("data.json", "w") as f:
    json.dump(dict_data, f)
