๐Ÿšดโ€โ™‚๏ธ
TIL
  • MAIN
  • : TIL?
  • : WIL
  • : Plan
  • : Retrospective
    • 21Y
      • Wait a moment!
      • 9M 2W
      • 9M1W
      • 8M4W
      • 8M3W
      • 8M2W
      • 8M1W
      • 7M4W
      • 7M3W
      • 7M2W
      • 7M1W
      • 6M5W
      • 1H
    • ์ƒˆ์‚ฌ๋žŒ ๋˜๊ธฐ ํ”„๋กœ์ ํŠธ
      • 2ํšŒ์ฐจ
      • 1ํšŒ์ฐจ
  • TIL : ML
    • Paper Analysis
      • BERT
      • Transformer
    • Boostcamp 2st
      • [S]Data Viz
        • (4-3) Seaborn ์‹ฌํ™”
        • (4-2) Seaborn ๊ธฐ์ดˆ
        • (4-1) Seaborn ์†Œ๊ฐœ
        • (3-4) More Tips
        • (3-3) Facet ์‚ฌ์šฉํ•˜๊ธฐ
        • (3-2) Color ์‚ฌ์šฉํ•˜๊ธฐ
        • (3-1) Text ์‚ฌ์šฉํ•˜๊ธฐ
        • (2-3) Scatter Plot ์‚ฌ์šฉํ•˜๊ธฐ
        • (2-2) Line Plot ์‚ฌ์šฉํ•˜๊ธฐ
        • (2-1) Bar Plot ์‚ฌ์šฉํ•˜๊ธฐ
        • (1-3) Python๊ณผ Matplotlib
        • (1-2) ์‹œ๊ฐํ™”์˜ ์š”์†Œ
        • (1-1) Welcome to Visualization (OT)
      • [P]MRC
        • (2๊ฐ•) Extraction-based MRC
        • (1๊ฐ•) MRC Intro & Python Basics
      • [P]KLUE
        • (5๊ฐ•) BERT ๊ธฐ๋ฐ˜ ๋‹จ์ผ ๋ฌธ์žฅ ๋ถ„๋ฅ˜ ๋ชจ๋ธ ํ•™์Šต
        • (4๊ฐ•) ํ•œ๊ตญ์–ด BERT ์–ธ์–ด ๋ชจ๋ธ ํ•™์Šต
        • [NLP] ๋ฌธ์žฅ ๋‚ด ๊ฐœ์ฒด๊ฐ„ ๊ด€๊ณ„ ์ถ”์ถœ
        • (3๊ฐ•) BERT ์–ธ์–ด๋ชจ๋ธ ์†Œ๊ฐœ
        • (2๊ฐ•) ์ž์—ฐ์–ด์˜ ์ „์ฒ˜๋ฆฌ
        • (1๊ฐ•) ์ธ๊ณต์ง€๋Šฅ๊ณผ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ
      • [U]Stage-CV
      • [U]Stage-NLP
        • 7W Retrospective
        • (10๊ฐ•) Advanced Self-supervised Pre-training Models
        • (09๊ฐ•) Self-supervised Pre-training Models
        • (08๊ฐ•) Transformer (2)
        • (07๊ฐ•) Transformer (1)
        • 6W Retrospective
        • (06๊ฐ•) Beam Search and BLEU score
        • (05๊ฐ•) Sequence to Sequence with Attention
        • (04๊ฐ•) LSTM and GRU
        • (03๊ฐ•) Recurrent Neural Network and Language Modeling
        • (02๊ฐ•) Word Embedding
        • (01๊ฐ•) Intro to NLP, Bag-of-Words
        • [ํ•„์ˆ˜ ๊ณผ์ œ 4] Preprocessing for NMT Model
        • [ํ•„์ˆ˜ ๊ณผ์ œ 3] Subword-level Language Model
        • [ํ•„์ˆ˜ ๊ณผ์ œ2] RNN-based Language Model
        • [์„ ํƒ ๊ณผ์ œ] BERT Fine-tuning with transformers
        • [ํ•„์ˆ˜ ๊ณผ์ œ] Data Preprocessing
      • Mask Wear Image Classification
        • 5W Retrospective
        • Report_Level1_6
        • Performance | Review
        • DAY 11 : HardVoting | MultiLabelClassification
        • DAY 10 : Cutmix
        • DAY 9 : Loss Function
        • DAY 8 : Baseline
        • DAY 7 : Class Imbalance | Stratification
        • DAY 6 : Error Fix
        • DAY 5 : Facenet | Save
        • DAY 4 : VIT | F1_Loss | LrScheduler
        • DAY 3 : DataSet/Lodaer | EfficientNet
        • DAY 2 : Labeling
        • DAY 1 : EDA
        • 2_EDA Analysis
      • [P]Stage-1
        • 4W Retrospective
        • (10๊ฐ•) Experiment Toolkits & Tips
        • (9๊ฐ•) Ensemble
        • (8๊ฐ•) Training & Inference 2
        • (7๊ฐ•) Training & Inference 1
        • (6๊ฐ•) Model 2
        • (5๊ฐ•) Model 1
        • (4๊ฐ•) Data Generation
        • (3๊ฐ•) Dataset
        • (2๊ฐ•) Image Classification & EDA
        • (1๊ฐ•) Competition with AI Stages!
      • [U]Stage-3
        • 3W Retrospective
        • PyTorch
          • (10๊ฐ•) PyTorch Troubleshooting
          • (09๊ฐ•) Hyperparameter Tuning
          • (08๊ฐ•) Multi-GPU ํ•™์Šต
          • (07๊ฐ•) Monitoring tools for PyTorch
          • (06๊ฐ•) ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
          • (05๊ฐ•) Dataset & Dataloader
          • (04๊ฐ•) AutoGrad & Optimizer
          • (03๊ฐ•) PyTorch ํ”„๋กœ์ ํŠธ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ
          • (02๊ฐ•) PyTorch Basics
          • (01๊ฐ•) Introduction to PyTorch
      • [U]Stage-2
        • 2W Retrospective
        • DL Basic
          • (10๊ฐ•) Generative Models 2
          • (09๊ฐ•) Generative Models 1
          • (08๊ฐ•) Sequential Models - Transformer
          • (07๊ฐ•) Sequential Models - RNN
          • (06๊ฐ•) Computer Vision Applications
          • (05๊ฐ•) Modern CNN - 1x1 convolution์˜ ์ค‘์š”์„ฑ
          • (04๊ฐ•) Convolution์€ ๋ฌด์—‡์ธ๊ฐ€?
          • (03๊ฐ•) Optimization
          • (02๊ฐ•) ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ - MLP (Multi-Layer Perceptron)
          • (01๊ฐ•) ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ณธ ์šฉ์–ด ์„ค๋ช… - Historical Review
        • Assignment
          • [ํ•„์ˆ˜ ๊ณผ์ œ] Multi-headed Attention Assignment
          • [ํ•„์ˆ˜ ๊ณผ์ œ] LSTM Assignment
          • [ํ•„์ˆ˜ ๊ณผ์ œ] CNN Assignment
          • [ํ•„์ˆ˜ ๊ณผ์ œ] Optimization Assignment
          • [ํ•„์ˆ˜ ๊ณผ์ œ] MLP Assignment
      • [U]Stage-1
        • 1W Retrospective
        • AI Math
          • (AI Math 10๊ฐ•) RNN ์ฒซ๊ฑธ์Œ
          • (AI Math 9๊ฐ•) CNN ์ฒซ๊ฑธ์Œ
          • (AI Math 8๊ฐ•) ๋ฒ ์ด์ฆˆ ํ†ต๊ณ„ํ•™ ๋ง›๋ณด๊ธฐ
          • (AI Math 7๊ฐ•) ํ†ต๊ณ„ํ•™ ๋ง›๋ณด๊ธฐ
          • (AI Math 6๊ฐ•) ํ™•๋ฅ ๋ก  ๋ง›๋ณด๊ธฐ
          • (AI Math 5๊ฐ•) ๋”ฅ๋Ÿฌ๋‹ ํ•™์Šต๋ฐฉ๋ฒ• ์ดํ•ดํ•˜๊ธฐ
          • (AI Math 4๊ฐ•) ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• - ๋งค์šด๋ง›
          • (AI Math 3๊ฐ•) ๊ฒฝ์‚ฌํ•˜๊ฐ•๋ฒ• - ์ˆœํ•œ๋ง›
          • (AI Math 2๊ฐ•) ํ–‰๋ ฌ์ด ๋ญ์˜ˆ์š”?
          • (AI Math 1๊ฐ•) ๋ฒกํ„ฐ๊ฐ€ ๋ญ์˜ˆ์š”?
        • Python
          • (Python 7-2๊ฐ•) pandas II
          • (Python 7-1๊ฐ•) pandas I
          • (Python 6๊ฐ•) numpy
          • (Python 5-2๊ฐ•) Python data handling
          • (Python 5-1๊ฐ•) File / Exception / Log Handling
          • (Python 4-2๊ฐ•) Module and Project
          • (Python 4-1๊ฐ•) Python Object Oriented Programming
          • (Python 3-2๊ฐ•) Pythonic code
          • (Python 3-1๊ฐ•) Python Data Structure
          • (Python 2-4๊ฐ•) String and advanced function concept
          • (Python 2-3๊ฐ•) Conditionals and Loops
          • (Python 2-2๊ฐ•) Function and Console I/O
          • (Python 2-1๊ฐ•) Variables
          • (Python 1-3๊ฐ•) ํŒŒ์ด์ฌ ์ฝ”๋”ฉ ํ™˜๊ฒฝ
          • (Python 1-2๊ฐ•) ํŒŒ์ด์ฌ ๊ฐœ์š”
          • (Python 1-1๊ฐ•) Basic computer class for newbies
        • Assignment
          • [์„ ํƒ ๊ณผ์ œ 3] Maximum Likelihood Estimate
          • [์„ ํƒ ๊ณผ์ œ 2] Backpropagation
          • [์„ ํƒ ๊ณผ์ œ 1] Gradient Descent
          • [ํ•„์ˆ˜ ๊ณผ์ œ 5] Morsecode
          • [ํ•„์ˆ˜ ๊ณผ์ œ 4] Baseball
          • [ํ•„์ˆ˜ ๊ณผ์ œ 3] Text Processing 2
          • [ํ•„์ˆ˜ ๊ณผ์ œ 2] Text Processing 1
          • [ํ•„์ˆ˜ ๊ณผ์ œ 1] Basic Math
    • ๋”ฅ๋Ÿฌ๋‹ CNN ์™„๋ฒฝ ๊ฐ€์ด๋“œ - Fundamental ํŽธ
      • ์ข…ํ•ฉ ์‹ค์Šต 2 - ์บ๊ธ€ Plant Pathology(๋‚˜๋ฌด์žŽ ๋ณ‘ ์ง„๋‹จ) ๊ฒฝ์—ฐ ๋Œ€ํšŒ
      • ์ข…ํ•ฉ ์‹ค์Šต 1 - 120์ข…์˜ Dog Breed Identification ๋ชจ๋ธ ์ตœ์ ํ™”
      • ์‚ฌ์ „ ํ›ˆ๋ จ ๋ชจ๋ธ์˜ ๋ฏธ์„ธ ์กฐ์ • ํ•™์Šต๊ณผ ๋‹ค์–‘ํ•œ Learning Rate Scheduler์˜ ์ ์šฉ
      • Advanced CNN ๋ชจ๋ธ ํŒŒํ—ค์น˜๊ธฐ - ResNet ์ƒ์„ธ์™€ EfficientNet ๊ฐœ์š”
      • Advanced CNN ๋ชจ๋ธ ํŒŒํ—ค์น˜๊ธฐ - AlexNet, VGGNet, GoogLeNet
      • Albumentation์„ ์ด์šฉํ•œ Augmentation๊ธฐ๋ฒ•๊ณผ Keras Sequence ํ™œ์šฉํ•˜๊ธฐ
      • ์‚ฌ์ „ ํ›ˆ๋ จ CNN ๋ชจ๋ธ์˜ ํ™œ์šฉ๊ณผ Keras Generator ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์ดํ•ด
      • ๋ฐ์ดํ„ฐ ์ฆ๊ฐ•์˜ ์ดํ•ด - Keras ImageDataGenerator ํ™œ์šฉ
      • CNN ๋ชจ๋ธ ๊ตฌํ˜„ ๋ฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ ๊ธฐ๋ณธ ๊ธฐ๋ฒ• ์ ์šฉํ•˜๊ธฐ
    • AI School 1st
    • ํ˜„์—… ์‹ค๋ฌด์ž์—๊ฒŒ ๋ฐฐ์šฐ๋Š” Kaggle ๋จธ์‹ ๋Ÿฌ๋‹ ์ž…๋ฌธ
    • ํŒŒ์ด์ฌ ๋”ฅ๋Ÿฌ๋‹ ํŒŒ์ดํ† ์น˜
  • TIL : Python & Math
    • Do It! ์žฅ๊ณ +๋ถ€ํŠธ์ŠคํŠธ๋žฉ: ํŒŒ์ด์ฌ ์›น๊ฐœ๋ฐœ์˜ ์ •์„
      • Relations - ๋‹ค๋Œ€๋‹ค ๊ด€๊ณ„
      • Relations - ๋‹ค๋Œ€์ผ ๊ด€๊ณ„
      • ํ…œํ”Œ๋ฆฟ ํŒŒ์ผ ๋ชจ๋“ˆํ™” ํ•˜๊ธฐ
      • TDD (Test Driven Development)
      • template tags & ์กฐ๊ฑด๋ฌธ
      • ์ •์  ํŒŒ์ผ(static files) & ๋ฏธ๋””์–ด ํŒŒ์ผ(media files)
      • FBV (Function Based View)์™€ CBV (Class Based View)
      • Django ์ž…๋ฌธํ•˜๊ธฐ
      • ๋ถ€ํŠธ์ŠคํŠธ๋žฉ
      • ํ”„๋ก ํŠธ์—”๋“œ ๊ธฐ์ดˆ๋‹ค์ง€๊ธฐ (HTML, CSS, JS)
      • ๋“ค์–ด๊ฐ€๊ธฐ + ํ™˜๊ฒฝ์„ค์ •
    • Algorithm
      • Programmers
        • Level1
          • ์†Œ์ˆ˜ ๋งŒ๋“ค๊ธฐ
          • ์ˆซ์ž ๋ฌธ์ž์—ด๊ณผ ์˜๋‹จ์–ด
          • ์ž์—ฐ์ˆ˜ ๋’ค์ง‘์–ด ๋ฐฐ์—ด๋กœ ๋งŒ๋“ค๊ธฐ
          • ์ •์ˆ˜ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ๋ฐฐ์น˜ํ•˜๊ธฐ
          • ์ •์ˆ˜ ์ œ๊ณฑ๊ทผ ํŒ๋ณ„
          • ์ œ์ผ ์ž‘์€ ์ˆ˜ ์ œ๊ฑฐํ•˜๊ธฐ
          • ์ง์‚ฌ๊ฐํ˜• ๋ณ„์ฐ๊ธฐ
          • ์ง์ˆ˜์™€ ํ™€์ˆ˜
          • ์ฒด์œก๋ณต
          • ์ตœ๋Œ€๊ณต์•ฝ์ˆ˜์™€ ์ตœ์†Œ๊ณต๋ฐฐ์ˆ˜
          • ์ฝœ๋ผ์ธ  ์ถ”์ธก
          • ํฌ๋ ˆ์ธ ์ธํ˜•๋ฝ‘๊ธฐ ๊ฒŒ์ž„
          • ํ‚คํŒจ๋“œ ๋ˆ„๋ฅด๊ธฐ
          • ํ‰๊ท  ๊ตฌํ•˜๊ธฐ
          • ํฐ์ผ“๋ชฌ
          • ํ•˜์ƒค๋“œ ์ˆ˜
          • ํ•ธ๋“œํฐ ๋ฒˆํ˜ธ ๊ฐ€๋ฆฌ๊ธฐ
          • ํ–‰๋ ฌ์˜ ๋ง์…ˆ
        • Level2
          • ์ˆซ์ž์˜ ํ‘œํ˜„
          • ์ˆœ์œ„ ๊ฒ€์ƒ‰
          • ์ˆ˜์‹ ์ตœ๋Œ€ํ™”
          • ์†Œ์ˆ˜ ์ฐพ๊ธฐ
          • ์†Œ์ˆ˜ ๋งŒ๋“ค๊ธฐ
          • ์‚ผ๊ฐ ๋‹ฌํŒฝ์ด
          • ๋ฌธ์ž์—ด ์••์ถ•
          • ๋ฉ”๋‰ด ๋ฆฌ๋‰ด์–ผ
          • ๋” ๋งต๊ฒŒ
          • ๋•…๋”ฐ๋จน๊ธฐ
          • ๋ฉ€์ฉกํ•œ ์‚ฌ๊ฐํ˜•
          • ๊ด„ํ˜ธ ํšŒ์ „ํ•˜๊ธฐ
          • ๊ด„ํ˜ธ ๋ณ€ํ™˜
          • ๊ตฌ๋ช…๋ณดํŠธ
          • ๊ธฐ๋Šฅ ๊ฐœ๋ฐœ
          • ๋‰ด์Šค ํด๋Ÿฌ์Šคํ„ฐ๋ง
          • ๋‹ค๋ฆฌ๋ฅผ ์ง€๋‚˜๋Š” ํŠธ๋Ÿญ
          • ๋‹ค์Œ ํฐ ์ˆซ์ž
          • ๊ฒŒ์ž„ ๋งต ์ตœ๋‹จ๊ฑฐ๋ฆฌ
          • ๊ฑฐ๋ฆฌ๋‘๊ธฐ ํ™•์ธํ•˜๊ธฐ
          • ๊ฐ€์žฅ ํฐ ์ •์‚ฌ๊ฐํ˜• ์ฐพ๊ธฐ
          • H-Index
          • JadenCase ๋ฌธ์ž์—ด ๋งŒ๋“ค๊ธฐ
          • N๊ฐœ์˜ ์ตœ์†Œ๊ณต๋ฐฐ์ˆ˜
          • N์ง„์ˆ˜ ๊ฒŒ์ž„
          • ๊ฐ€์žฅ ํฐ ์ˆ˜
          • 124 ๋‚˜๋ผ์˜ ์ˆซ์ž
          • 2๊ฐœ ์ดํ•˜๋กœ ๋‹ค๋ฅธ ๋น„ํŠธ
          • [3์ฐจ] ํŒŒ์ผ๋ช… ์ •๋ ฌ
          • [3์ฐจ] ์••์ถ•
          • ์ค„ ์„œ๋Š” ๋ฐฉ๋ฒ•
          • [3์ฐจ] ๋ฐฉ๊ธˆ ๊ทธ๊ณก
          • ๊ฑฐ๋ฆฌ๋‘๊ธฐ ํ™•์ธํ•˜๊ธฐ
        • Level3
          • ๋งค์นญ ์ ์ˆ˜
          • ์™ธ๋ฒฝ ์ ๊ฒ€
          • ๊ธฐ์ง€๊ตญ ์„ค์น˜
          • ์ˆซ์ž ๊ฒŒ์ž„
          • 110 ์˜ฎ๊ธฐ๊ธฐ
          • ๊ด‘๊ณ  ์ œ๊ฑฐ
          • ๊ธธ ์ฐพ๊ธฐ ๊ฒŒ์ž„
          • ์…”ํ‹€๋ฒ„์Šค
          • ๋‹จ์†์นด๋ฉ”๋ผ
          • ํ‘œ ํŽธ์ง‘
          • N-Queen
          • ์ง•๊ฒ€๋‹ค๋ฆฌ ๊ฑด๋„ˆ๊ธฐ
          • ์ตœ๊ณ ์˜ ์ง‘ํ•ฉ
          • ํ•ฉ์Šน ํƒ์‹œ ์š”๊ธˆ
          • ๊ฑฐ์Šค๋ฆ„๋ˆ
          • ํ•˜๋…ธ์ด์˜ ํƒ‘
          • ๋ฉ€๋ฆฌ ๋›ฐ๊ธฐ
          • ๋ชจ๋‘ 0์œผ๋กœ ๋งŒ๋“ค๊ธฐ
        • Level4
    • Head First Python
    • ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์œ„ํ•œ SQL
    • ๋‹จ ๋‘ ์žฅ์˜ ๋ฌธ์„œ๋กœ ๋ฐ์ดํ„ฐ ๋ถ„์„๊ณผ ์‹œ๊ฐํ™” ๋ฝ€๊ฐœ๊ธฐ
    • Linear Algebra(Khan Academy)
    • ์ธ๊ณต์ง€๋Šฅ์„ ์œ„ํ•œ ์„ ํ˜•๋Œ€์ˆ˜
    • Statistics110
  • TIL : etc
    • [๋”ฐ๋ฐฐ๋Ÿฐ] Kubernetes
    • [๋”ฐ๋ฐฐ๋Ÿฐ] Docker
      • 2. ๋„์ปค ์„ค์น˜ ์‹ค์Šต 1 - ํ•™์ŠตํŽธ(์ค€๋น„๋ฌผ/์‹ค์Šต ์œ ํ˜• ์†Œ๊ฐœ)
      • 1. ์ปจํ…Œ์ด๋„ˆ์™€ ๋„์ปค์˜ ์ดํ•ด - ์ปจํ…Œ์ด๋„ˆ๋ฅผ ์“ฐ๋Š”์ด์œ  / ์ผ๋ฐ˜ํ”„๋กœ๊ทธ๋žจ๊ณผ ์ปจํ…Œ์ด๋„ˆํ”„๋กœ๊ทธ๋žจ์˜ ์ฐจ์ด์ 
      • 0. ๋“œ๋””์–ด ์ฐพ์•„์˜จ Docker ๊ฐ•์˜! ์™•์ดˆ๋ณด์—์„œ ๋„์ปค ๋งˆ์Šคํ„ฐ๋กœ - OT
    • CoinTrading
      • [๊ฐ€์ƒ ํ™”ํ ์ž๋™ ๋งค๋งค ํ”„๋กœ๊ทธ๋žจ] ๋ฐฑํ…Œ์ŠคํŒ… : ๊ฐ„๋‹จํ•œ ํ…Œ์ŠคํŒ…
    • Gatsby
      • 01 ๊นƒ๋ถ ํฌ๊ธฐ ์„ ์–ธ
  • TIL : Project
    • Mask Wear Image Classification
    • Project. GARIGO
  • 2021 TIL
    • CHANGED
    • JUN
      • 30 Wed
      • 29 Tue
      • 28 Mon
      • 27 Sun
      • 26 Sat
      • 25 Fri
      • 24 Thu
      • 23 Wed
      • 22 Tue
      • 21 Mon
      • 20 Sun
      • 19 Sat
      • 18 Fri
      • 17 Thu
      • 16 Wed
      • 15 Tue
      • 14 Mon
      • 13 Sun
      • 12 Sat
      • 11 Fri
      • 10 Thu
      • 9 Wed
      • 8 Tue
      • 7 Mon
      • 6 Sun
      • 5 Sat
      • 4 Fri
      • 3 Thu
      • 2 Wed
      • 1 Tue
    • MAY
      • 31 Mon
      • 30 Sun
      • 29 Sat
      • 28 Fri
      • 27 Thu
      • 26 Wed
      • 25 Tue
      • 24 Mon
      • 23 Sun
      • 22 Sat
      • 21 Fri
      • 20 Thu
      • 19 Wed
      • 18 Tue
      • 17 Mon
      • 16 Sun
      • 15 Sat
      • 14 Fri
      • 13 Thu
      • 12 Wed
      • 11 Tue
      • 10 Mon
      • 9 Sun
      • 8 Sat
      • 7 Fri
      • 6 Thu
      • 5 Wed
      • 4 Tue
      • 3 Mon
      • 2 Sun
      • 1 Sat
    • APR
      • 30 Fri
      • 29 Thu
      • 28 Wed
      • 27 Tue
      • 26 Mon
      • 25 Sun
      • 24 Sat
      • 23 Fri
      • 22 Thu
      • 21 Wed
      • 20 Tue
      • 19 Mon
      • 18 Sun
      • 17 Sat
      • 16 Fri
      • 15 Thu
      • 14 Wed
      • 13 Tue
      • 12 Mon
      • 11 Sun
      • 10 Sat
      • 9 Fri
      • 8 Thu
      • 7 Wed
      • 6 Tue
      • 5 Mon
      • 4 Sun
      • 3 Sat
      • 2 Fri
      • 1 Thu
    • MAR
      • 31 Wed
      • 30 Tue
      • 29 Mon
      • 28 Sun
      • 27 Sat
      • 26 Fri
      • 25 Thu
      • 24 Wed
      • 23 Tue
      • 22 Mon
      • 21 Sun
      • 20 Sat
      • 19 Fri
      • 18 Thu
      • 17 Wed
      • 16 Tue
      • 15 Mon
      • 14 Sun
      • 13 Sat
      • 12 Fri
      • 11 Thu
      • 10 Wed
      • 9 Tue
      • 8 Mon
      • 7 Sun
      • 6 Sat
      • 5 Fri
      • 4 Thu
      • 3 Wed
      • 2 Tue
      • 1 Mon
    • FEB
      • 28 Sun
      • 27 Sat
      • 26 Fri
      • 25 Thu
      • 24 Wed
      • 23 Tue
      • 22 Mon
      • 21 Sun
      • 20 Sat
      • 19 Fri
      • 18 Thu
      • 17 Wed
      • 16 Tue
      • 15 Mon
      • 14 Sun
      • 13 Sat
      • 12 Fri
      • 11 Thu
      • 10 Wed
      • 9 Tue
      • 8 Mon
      • 7 Sun
      • 6 Sat
      • 5 Fri
      • 4 Thu
      • 3 Wed
      • 2 Tue
      • 1 Mon
    • JAN
      • 31 Sun
      • 30 Sat
      • 29 Fri
      • 28 Thu
      • 27 Wed
      • 26 Tue
      • 25 Mon
      • 24 Sun
      • 23 Sat
      • 22 Fri
      • 21 Thu
      • 20 Wed
      • 19 Tue
      • 18 Mon
      • 17 Sun
      • 16 Sat
      • 15 Fri
      • 14 Thu
      • 13 Wed
      • 12 Tue
      • 11 Mon
      • 10 Sun
      • 9 Sat
      • 8 Fri
      • 7 Thu
      • 6 Wed
      • 5 Tue
      • 4 Mon
      • 3 Sun
      • 2 Sat
      • 1 Fri
  • 2020 TIL
    • DEC
      • 31 Thu
      • 30 Wed
      • 29 Tue
      • 28 Mon
      • 27 Sun
      • 26 Sat
      • 25 Fri
      • 24 Thu
      • 23 Wed
      • 22 Tue
      • 21 Mon
      • 20 Sun
      • 19 Sat
      • 18 Fri
      • 17 Thu
      • 16 Wed
      • 15 Tue
      • 14 Mon
      • 13 Sun
      • 12 Sat
      • 11 Fri
      • 10 Thu
      • 9 Wed
      • 8 Tue
      • 7 Mon
      • 6 Sun
      • 5 Sat
      • 4 Fri
      • 3 Tue
      • 2 Wed
      • 1 Tue
    • NOV
      • 30 Mon
Powered by GitBook



(01๊ฐ•) Intro to NLP, Bag-of-Words

210906

1. Intro to Natural Language Processing (NLP)

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์™€ ๊ด€๋ จ๋œ ํ•™๋ฌธ ๋ถ„์•ผ์™€ ๋ฐœ์ „ ๋™ํ–ฅ

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋Š” ๋ฌธ์žฅ๊ณผ ๋‹จ์–ด๋ฅผ ์ดํ•ดํ•˜๋Š” Natural Language Understanding ์ด๋ผ ํ•˜๋Š” NLU์™€ ์ด๋Ÿฌํ•œ ์ž์—ฐ์–ด๋ฅผ ์ƒํ™ฉ์— ๋”ฐ๋ผ ์ ์ ˆํžˆ ์ƒ์„ฑํ•˜๋Š” Natural Language Generation์ด๋ผ ํ•˜๋Š” NLG์˜ ๋‘ ๊ฐ€์ง€ ํƒœ์Šคํฌ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ถ„์•ผ๋Š” ๋น„์ „๊ณผ ํ•จ๊ป˜ ๊ธ‰์†๋„๋กœ ๋ฐœ์ „ํ•˜๊ณ  ์žˆ๋Š” ๋ถ„์•ผ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๋ถ„์•ผ๊ฐ€ ์ž์—ฐ์–ด ๊ธฐ์ˆ ์—์„œ ์„ ๋‘ ๋ถ„์•ผ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ๋“ค์€ ACL, EMNLP, NAACL ์ด๋ผ๋Š” ํ•™ํšŒ์— ๋ฐœํ‘œ๋œ๋‹ค.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ๊ธฐ์ˆ ๋“ค์„ ๋‹ค๋ฃฌ๋‹ค.

  • Low-level parsing

    • Tokenization: splitting a given sentence into word-level units.

    • Stemming: the word "study" can take many inflected forms, such as "studying" or "studied", and Korean endings vary even more widely, as in "하늘은 맑다. 맑지만, 맑고". A computer must be able to recognize that these variants share the same meaning; stemming means extracting the root (stem) of such words.

    • ๊ฐ ๋‹จ์–ด๋ฅผ ์˜๋ฏธ๋‹จ์œ„๋กœ ์ค€๋น„ํ•˜๊ธฐ ์œ„ํ•œ ๊ฐ€์žฅ ๋กœ์šฐ๋ ˆ๋ฒจ์˜ ์ž‘์—…์ด๋‹ค.

  • Word and phrase level

    • Named Entity Recognition (NER): the task of recognizing proper nouns made up of a single word or multiple words. A phrase like "New York Times" must be interpreted as one named entity, not word by word.

    • Part-of-speech (POS) tagging: the task of identifying the part of speech or grammatical role of each word within a sentence: which word is the subject, a verb, an object, an adverb, or an adjectival phrase, and which part of the sentence such a phrase modifies.

    • noun-phrase chunking

    • dependency parsing

    • coreference resolution

  • Sentence level

    • Sentiment analysis: predicting whether a given sentence is positive or negative. "I love you" should be judged positive and "I hate you" negative, and "this movie was not that bad" should be judged positive despite containing the word "bad".

    • Machine translation: when translating the phrase "I studied math" into "나는 수학을 공부했어", the model must match each word to an appropriate Korean word and respect Korean grammar.

  • Multi-sentence and paragraph level

    • Entailment prediction: predicting the logical entailment or contradiction between two sentences. Given "John got married yesterday" and "At least one person got married yesterday", if the first sentence is true, the second must also be true. The sentence "No one got married yesterday" contradicts the first sentence.

    • Question answering: reading-comprehension-based question answering. For example, searching Google for "where did Napoleon die" used to simply list websites containing those words, but nowadays the engine understands the question precisely and places the answer at the very top of the search results.

    • Dialog systems: natural language processing technology for carrying on a conversation, as in chatbots.

    • Summarization: the task of condensing a given document (such as a news article or paper) into a one-line summary.

์ž์—ฐ์–ด๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ธฐ์ˆ ๋กœ Text mining์ด๋ผ๋Š” ํ•™๋ฌธ๋„ ์กด์žฌํ•œ๋‹ค. ์ด ๋ถ„์•ผ๋Š” ๋น…๋ฐ์ดํ„ฐ ๋ถ„์„๊ณผ ๋งŽ์€ ๊ด€๋ จ์ด ์žˆ๋‹ค. ๋งŽ์€ ๋ฐ์ดํ„ฐ์˜ ํ‚ค์›Œ๋“œ๋ฅผ ์‹œ๊ฐ„์ˆœ์œผ๋กœ ๋ฝ‘์•„์„œ ํŠธ๋ Œ๋“œ๋ฅผ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ํŠน์ •์ธ์˜ ์ด๋ฏธ์ง€๊ฐ€ ๊ณผ๊ฑฐ์—๋Š” ์–ด๋• ๊ณ  ์–ด๋– ํ•œ ์‚ฌ๊ฑด์ด ๋ฐœ์ƒํ•˜๋ฉด์„œ ํ˜„์žฌ๋Š” ์–ด๋– ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์•Œ์•„๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

  • ํšŒ์‚ฌ์—์„œ ์ƒํ’ˆ์„ ์ถœ์‹œํ–ˆ์„ ๋•Œ๋„ ์ƒํ’ˆ์— ๋Œ€ํ•ด์„œ ์‚ฌ๋žŒ๋“ค์ด ๋งํ•˜๋Š” ํ‚ค์›Œ๋“œ๋ฅผ ๋ถ„์„ํ•ด์„œ ์ƒํ’ˆ์— ๋Œ€ํ•œ ์†Œ๋น„์ž ๋ฐ˜์‘์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

  • ์ด๋Ÿฌํ•œ ๊ณผ์ •์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋‹จ์–ด์ง€๋งŒ ๋น„์Šทํ•œ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง€๋Š” ํ‚ค์›Œ๋“œ๋“ค์„ ๊ทธ๋ฃนํ•‘ํ•ด์„œ ๋ถ„์„ํ•  ํ•„์š”๊ฐ€ ์ƒ๊ธฐ๊ฒŒ ๋˜์—ˆ๊ณ  ์ด๋ฅผ ์ž๋™์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ฒ•์œผ๋กœ์จ Topic Modeling ๋˜๋Š” Document clustering ๋“ฑ์˜ ๊ธฐ์ˆ ์ด ์กด์žฌํ•œ๋‹ค.

  • ๋˜, ์‚ฌํšŒ๊ณผํ•™๊ณผ๋„ ๋ฐ€์ ‘ํ•œ ๊ด€๋ จ์ด ์žˆ๋Š”๋ฐ, "ํŠธ์œ„ํ„ฐ๋‚˜ ํŽ˜์ด์Šค๋ถ์˜ ์†Œ์…œ ๋ฏธ๋””์–ด๋ฅผ ๋ถ„์„ํ–ˆ๋”๋‹ˆ ์‚ฌ๋žŒ๋“ค์€ ์–ด๋– ํ•œ ์‹ ์กฐ์–ด๋ฅผ ๋งŽ์ด ์“ฐ๊ณ  ์ด๋Š” ํ˜„๋Œ€์˜ ์–ด๋– ํ•œ ์‚ฌํšŒ ํ˜„์ƒ๊ณผ ๊ด€๋ จ์ด ์žˆ๋‹ค" ๋˜๋Š” "์ตœ๊ทผ ํ˜ผ๋ฐฅ์ด๋ผ๋Š” ๋‹จ์–ด๋ฅผ ๋งŽ์ด ์“ฐ๋Š” ๊ฒƒ์œผ๋กœ ๋ณด์•„ ํ˜„๋Œ€ ์‚ฌ๋žŒ๋“ค์˜ ํŒจํ„ด์ด ์–ด๋– ํ•˜๊ฒŒ ๋ณ€ํ™”ํ•œ๋‹ค" ๋ผ๋Š” ์‚ฌํšŒ์ ์ธ ์ธ์‚ฌ์ดํŠธ๋ฅผ ์–ป๋Š”๋ฐ์—๋„ ์ด๋Ÿฌํ•œ ํ…์ŠคํŠธ ๋งˆ์ด๋‹์ด ๋งŽ์ด ์‚ฌ์šฉ๋œ๋‹ค.

  • Conferences in this area include KDD, WWW, WSDM, CIKM, and ICWSM.

๋งˆ์ง€๋ง‰์œผ๋กœ Information retrieval, ์ •๋ณด ๊ฒ€์ƒ‰์ด๋ผ๋Š” ๋ถ„์•ผ๊ฐ€ ์กด์žฌํ•œ๋‹ค. ์ด๋Š” ๊ตฌ๊ธ€์ด๋‚˜ ๋„ค์ด๋ฒ„ ๋“ฑ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ฒ€์ƒ‰ ๊ธฐ์ˆ ์„ ์—ฐ๊ตฌํ•˜๋Š” ๋ถ„์•ผ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ํ˜„์žฌ ๊ฒ€์ƒ‰ ๊ธฐ์ˆ ์€ ์–ด๋А ์ •๋„ ์„ฑ์ˆ™ํ•œ ์ƒํƒœ์ด๋‹ค.(๊ทธ๋งŒํผ ๋ฐœ์ „์ด ๋งŽ์ด ๋˜์—ˆ๋‹ค๋Š” ๋œป) ๊ทธ๋ž˜์„œ ๊ธฐ์ˆ ๋ฐœ์ „๋„ ์•ž์„œ ์†Œ๊ฐœํ•œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋‚˜ ํ…์ŠคํŠธ๋งˆ์ด๋‹์— ๋น„ํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ๋А๋ฆฐ ๋ถ„์•ผ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ •๋ณด๊ฒ€์ƒ‰์˜ ํ•œ ๋ถ„์•ผ๋กœ์„œ ์ถ”์ฒœ์‹œ์Šคํ…œ์ด๋ผ๋Š” ๋ถ„์•ผ๊ฐ€ ์žˆ๋Š”๋ฐ, ์–ด๋– ํ•œ ์‚ฌ๋žŒ์ด ๊ด€์‹ฌ์žˆ์„ ๋ฒ•ํ•œ ๋…ธ๋ž˜๋‚˜ ์˜์ƒ์„ ์ž๋™์œผ๋กœ ์ถ”์ฒœํ•ด ์ฃผ๋Š” ๊ธฐ์ˆ ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๊ธฐ์ˆ ์„ ๊ฒ€์ƒ‰์—”์ง„ ๋ณด๋‹ค ์ ๊ทน์ ์ด๊ณ  ์ž๋™ํ™”๋œ ์ƒˆ๋กœ์šด ์‹œ์Šคํ…œ์ด๋‹ค. ๋˜, ์ƒ์—…์ ์œผ๋กœ๋„ ์ƒ๋‹นํ•œ ์ž„ํŒฉํŠธ๋ฅผ ๊ฐ€์ง„ ์‹œ์Šคํ…œ์ด๋‹ค.

Trends of NLP

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ๋Š” ์ปดํ“จํ„ฐ ๋น„์ „๊ณผ ์˜์ƒ ์ฒ˜๋ฆฌ ๊ธฐ์ˆ ์— ๋น„ํ•ด ๋ฐœ์ „์€ ๋”๋””์ง€๋งŒ ๊พธ์ค€ํžˆ ๋ฐœ์ „ํ•ด์˜ค๊ณ  ์žˆ๋‹ค. ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ์ˆซ์ž๋กœ ์ด๋ฃจ์–ด์ง„ ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅ๋ฐ›๊ธฐ ๋•Œ๋ฌธ์— ์ฃผ์–ด์ง„ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹จ์–ด ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌํ•˜๊ณ  ๋‹จ์–ด๋ฅผ ํŠน์ • ์ฐจ์›์˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ณผ์ •์„ ๊ฑฐ์น˜๊ฒŒ ๋œ๋‹ค. ์–ด๋– ํ•œ ๋‹จ์–ด๋ฅผ ๋ฒกํ„ฐ ๊ณต๊ฐ„์˜ ํ•œ ์ ์œผ๋กœ ๋‚˜ํƒ€๋‚ธ๋‹ค๋Š” ์˜๋ฏธ๋กœ ์›Œ๋“œ ์ž„๋ฒ ๋”ฉ ์ด๋ผ๊ณ  ํ•œ๋‹ค.

๋‹จ์–ด๋“ค์˜ ์ˆœ์„œ์— ๋”ฐ๋ผ ์˜๋ฏธ๊ฐ€ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋Š”๋ฐ ์ด๋ฅผ ์ธ์‹ํ•˜๊ธฐ ์œ„ํ•ด RNN์ด๋ผ๋Š” ๊ตฌ์กฐ๊ฐ€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์— ์ž๋ฆฌ์žก๊ฒŒ ๋˜์—ˆ๊ณ  LSTM๊ณผ ์ด๋ฅผ ๋‹จ์ˆœํ™”ํ•œ GRU๋“ฑ์˜ ๋ชจ๋ธ์ด ๋งŽ์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.

2017๋…„์— ๊ตฌ๊ธ€์—์„œ ๋ฐœํ‘œํ•œ self-attention module์ธ Transformer๊ฐ€ ๋“ฑ์žฅํ•˜๋ฉด์„œ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๊ฐ€์ ธ์™”๋‹ค. ๊ทธ๋ž˜์„œ ํ˜„์žฌ ๋Œ€๋ถ€๋ถ„์˜ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์€ Transformer๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ Transformer๋Š” ์ดˆ๊ธฐ์— ๊ธฐ๊ณ„๋ฒˆ์—ญ์„ ๋ชฉ์ ์œผ๋กœ ๋งŒ๋“ค์–ด์กŒ๋‹ค.

๋”ฅ๋Ÿฌ๋‹์ด ์žˆ๊ธฐ์ „์˜ ๊ธฐ๊ณ„๋ฒˆ์—ญ์€ ์ „๋ฌธ๊ฐ€๊ฐ€ ๊ณ ๋ คํ•œ ํŠน์ • Rules์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด๋ฃจ์–ด์กŒ๋Š”๋ฐ, ๋„ˆ๋ฌด๋‚˜ ๋งŽ์€ ์˜ˆ์™ธ์ƒํ™ฉ๊ณผ ์–ธ์–ด์˜ ๋‹ค์–‘ํ•œ ์ƒํ™ฉ ํŒจํ„ด์„ ์ผ์ผ์ด ๋Œ€์‘ํ•˜๋Š” ๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ–ˆ๋‹ค. ์ดํ›„ RNN์„ ์‚ฌ์šฉํ–ˆ๋”๋‹ˆ ์„ฑ๋Šฅ์ด ์›”๋“ฑํžˆ ์ข‹์•„์กŒ๊ณ  ์ƒ์šฉํ™”๋˜์—ˆ๋‹ค. ์ดํ›„ ์„ฑ๋Šฅ์ด ์˜ค๋ฅผ๋Œ€๋กœ ์˜ค๋ฅธ ๋ถ„์•ผ์—์„œ Transformer๊ฐ€ ๋”์šฑ ๋” ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ๊ณ  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์˜์ƒ์ฒ˜๋ฆฌ, ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ ์˜ˆ์ธก, ์‹ ์•ฝ ๊ฐœ๋ฐœ์ด๋‚˜ ์‹ ๋ฌผ์งˆ ๊ฐœ๋ฐœ๋“ฑ์—๋„ ๋‹ค์–‘ํ•˜๊ฒŒ ์ ์šฉ๋˜์–ด ์„ฑ๋Šฅํ–ฅ์ƒ์„ ์ด๋ฃจ์–ด๋‚ด๊ณ ์žˆ๋‹ค.

์ด์ „์—๋Š” ๊ฐ๊ฐ์˜ ๋ถ„์•ผ์—์„œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์˜€๋Š”๋ฐ ํ˜„์žฌ๋Š” self-attention module์„ ๋‹จ์ˆœํžˆ ์Œ“์•„๊ฐ€๋ฉด์„œ ๋ชจ๋ธ์˜ ํฌ๊ธฐ๋ฅผ ํ‚ค์šฐ๊ณ  ์ด ๋ชจ๋ธ์„ ๋Œ€๊ทœ๋ชจ ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ์ž๊ฐ€ ์ง€๋„ ํ•™์Šต, Self-supervised training์„ ํ†ตํ•ด ๋ ˆ์ด๋ธ”์ด ํ•„์š”ํ•˜์ง€ ์•Š์€ ๋ฒ”์šฉ์  ํƒœ์Šคํฌ๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์„ ํ•™์Šตํ•œ๋‹ค. ์ดํ›„, ์‚ฌ์ „์— ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํฐ ๊ตฌ์กฐ์˜ ๋ณ€ํ™”์—†์ด๋„ ์›ํ•˜๋Š” ํƒœ์Šคํฌ์— transfer learning์˜ ํ˜•ํƒœ๋กœ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ๊ธฐ์กด์— ์—ฌ๋Ÿฌ ๋ถ„์•ผ์— ๊ฐœ๋ณ„์ ์ธ ๋ชจ๋ธ์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์›”๋“ฑํžˆ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๊ฐ€์ง€๊ฒŒ ๋˜์—ˆ๋‹ค.

์ž์—ฐ์–ด ์ฒ˜๋ฆฌ์—์„œ ์ž๊ฐ€ ์ง€๋„ ํ•™์Šต์ด๋ผ๋Š” ๊ฒƒ์€, "I _____ math" ๋ผ๋Š” ๋ฌธ์žฅ์—์„œ ๋นˆ์นธ์— ๋“ค์–ด๊ฐ€์•ผ ํ•  ๋‹จ์–ด๊ฐ€ ์ •ํ™•ํžˆ study์ธ๊ฒƒ์„ ๋งž์ถ”์ง€๋Š” ๋ชปํ•˜๋”๋ผ๋„ ์ด ๋‹จ์–ด๊ฐ€ ๋™์‚ฌ๋ผ๋Š” ๊ฒƒ๊ณผ ์•ž๋’ค ๋ฌธ๋งฅ์„ ๊ณ ๋ คํ•ด math์™€ I๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ด์–ด์งˆ ๋งŒํ•œ ๋‹จ์–ด๋ผ๋Š” ๊ฒƒ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‹ค. ์ •๋ฆฌํ•˜๋ฉด, ์–ธ์–ด์˜ ๋ฌธ๋ฒ•์ ์ด๊ณ  ์˜๋ฏธ๋ก ์ ์ธ ์ง€์‹์„ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์ด ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

๊ทธ๋Ÿฌ๋‚˜, ์ž๊ฐ€์ง€๋„ํ•™์Šต์œผ๋กœ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋ ค๋ฉด ์—„์ฒญ๋‚œ ๋Œ€๊ทœ๋ชจ์˜ ๋ฐ์ดํ„ฐ์…‹์ด ํ•„์š”ํ•˜๋‹ค. ํ…Œ์Šฌ๋ผ์—์„œ ๋ฐœํ‘œํ•œ ๋ฐ”์— ์˜ํ•˜๋ฉด GPT-3๋ฅผ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ์ „๊ธฐ์„ธ๋งŒ ์ˆ˜์‹ญ์–ต์›์ด๋‹ค. ๊ทธ๋ž˜์„œ ์ด๋Ÿฌํ•œ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ณณ์€ ๋ง‰๊ฐ•ํ•œ ์ž๋ณธ๋ ฅ์„ ์ง€๋‹Œ ๊ตฌ๊ธ€์ด๋‚˜ ํŽ˜์ด์Šค๋ถ, OpenAPI ๋“ฑ๊ณผ ๊ฐ™์€ ์ผ๋ถ€ ์†Œ์ˆ˜์˜ ๊ธฐ๊ด€์—์„œ ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค.

2. Bag-of-Words

Bag-of-Words Representation

Step 1. Constructing the vocabulary containing unique words

  • Example sentences: โ€œJohn really really loves this movieโ€œ, โ€œJane really likes this songโ€

  • Vocabulary: {โ€œJohnโ€œ, โ€œreallyโ€œ, โ€œlovesโ€œ, โ€œthisโ€œ, โ€œmovieโ€œ, โ€œJaneโ€œ, โ€œlikesโ€œ, โ€œsongโ€}

  • ์‚ฌ์ „์—์„œ ์ค‘๋ณต๋œ ๋‹จ์–ด๋Š” ํ•œ๋ฒˆ๋งŒ ๋“ฑ๋ก๋œ๋‹ค.

Step 2. Encoding unique words to one-hot vectors

  • ์šฐ์„  Categoricalํ•œ ๋‹จ์–ด๋“ค์„ One-hot vector๋กœ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๊ฐ€๋Šฅํ•œ Words๊ฐ€ 8๊ฐœ์ด๋ฏ€๋กœ ์ฐจ์›์„ 8๋กœ ์„ค์ •ํ•˜๋ฉด ๊ฐ ๋‹จ์–ด๋งˆ๋‹ค ํŠน์ • ์ธ๋ฑ์Šค๊ฐ€ 1์ธ ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

  • Vocabulary: {โ€œJohnโ€œ, โ€œreallyโ€œ, โ€œlovesโ€œ, โ€œthisโ€œ, โ€œmovieโ€œ, โ€œJaneโ€œ, โ€œlikesโ€œ, โ€œsongโ€}

    • John: [1 0 0 0 0 0 0 0]

    • really: [0 1 0 0 0 0 0 0]

    • loves: [0 0 1 0 0 0 0 0]

    • this: [0 0 0 1 0 0 0 0]

    • movie: [0 0 0 0 1 0 0 0]

    • Jane: [0 0 0 0 0 1 0 0]

    • likes: [0 0 0 0 0 0 1 0]

    • song: [0 0 0 0 0 0 0 1]

    • ์ด ๊ฑฐ๋ฆฌ๋Š” ์œ ํด๋ฆฌ๋“œ ๊ฑฐ๋ฆฌ๋ผ๊ณ ๋„ ํ•œ๋‹ค.

  • For any pair of words, cosine similarity is 0

  • ๋‹จ์–ด์˜ ์˜๋ฏธ์— ์ƒ๊ด€์—†์ด ๋‹จ์–ด์˜ ๋ฒกํ„ฐ ํ‘œํ˜„ํ˜•์„ ์‚ฌ์šฉํ•œ๋‹ค.

์ด๋Ÿฌํ•œ ์›ํ•ซ๋ฒกํ„ฐ๋“ค์˜ ํ•ฉ์œผ๋กœ ๋ฌธ์žฅ์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ Bag-of-Words ๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค. ๊ทธ ์ด์œ ๋Š” ์ฃผ์–ด์ง„ ๋ฌธ์žฅ ๋ณ„๋กœ ๊ฐ€๋ฐฉ์„ ์ค€๋น„ํ•˜๊ณ , ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฌธ์žฅ์— ์žˆ๋Š” ๋‹จ์–ด๋“ค์„ ํ•ด๋‹นํ•˜๋Š” ๊ฐ€๋ฐฉ์— ๋„ฃ์–ด์ค€ ๋’ค ์ด ์ˆ˜๋ฅผ ์„ธ์„œ ์ตœ์ข… ๋ฒกํ„ฐ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๊ธฐ ๋–„๋ฌธ์ด๋‹ค.

  • Sentence 1: โ€œJohn really really loves this movieโ€œ

    • John + really + really + loves + this + movie: [1 2 1 1 1 0 0 0]

  • Sentence 2: โ€œJane really likes this songโ€

    • Jane + really + likes + this + song: [0 1 0 1 0 1 1 1]
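
The two Bag-of-Words vectors above can be reproduced in a few lines; a minimal sketch that assumes simple whitespace tokenization and the vocabulary order listed earlier:

```python
vocab = ["John", "really", "loves", "this", "movie", "Jane", "likes", "song"]

def bag_of_words(sentence, vocab):
    # Count how often each vocabulary word occurs in the sentence
    tokens = sentence.split()
    return [tokens.count(word) for word in vocab]

print(bag_of_words("John really really loves this movie", vocab))
# [1, 2, 1, 1, 1, 0, 0, 0]
print(bag_of_words("Jane really likes this song", vocab))
# [0, 1, 0, 1, 0, 1, 1, 1]
```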

์ด์ œ ์ด๋Ÿฌํ•œ Bag of Words๋กœ ๋‚˜ํƒ€๋‚ธ ๋ฌธ์„œ๋ฅผ ์ •ํ•ด์ง„ ์นดํ…Œ๊ณ ๋ฆฌ๋‚˜ ํด๋ž˜์Šค ์ค‘์— ํ•˜๋‚˜๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ฐฉ๋ฒ• NaiveBayes๋ฅผ ์•Œ์•„๋ณด์ž.

  • ์šฐ์„  ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์นดํ…Œ๊ณ ๋ฆฌ ํ˜น์€ ํด๋ž˜์Šค๊ฐ€ C ๋งŒํผ ์žˆ๋‹ค๊ณ  ํ•˜์ž.

    • ์ฃผ์–ด์ง„ ๋ฌธ์„œ๋ฅผ ์ •์น˜, ๊ฒฝ์ œ, ๋ฌธํ™”, ์Šคํฌ์ธ ์˜ 4๊ฐœ์˜ ์ฃผ์ œ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด C = 4 ์ด๋‹ค.

  • ์–ด๋– ํ•œ ๋ฌธ์„œ d๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ์ด ๋ฌธ์„œ d์˜ ํด๋ž˜์Šค c๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์กฐ๊ฑด๋ถ€ ํ™•๋ฅ ๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๊ณ  ์ด ์ค‘ ๊ฐ€์žฅ ํฐ ๊ฐ’์ด ํ•ด๋‹น๋œ๋‹ค. MAP๋Š” Maximum A Posteriori์˜ ์ค„์ž„๋ง์ด๋‹ค.

์ด ๋•Œ ๋ฒ ์ด์ง€์•ˆ ๋ฃฐ์„ ํ†ตํ•ด ๋‘๋ฒˆ์งธ ์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด์งˆ ์ˆ˜ ์žˆ๋‹ค. P(d)๋Š” ํŠน์ • ๋ฌธ์„œ d๊ฐ€ ๋ฝ‘ํž ํ™•๋ฅ ์ธ๋ฐ, d๋ผ๋Š” ๋ฌธ์„œ๋Š” ๊ณ ์ •๋œ ํ•˜๋‚˜์˜ ๋ฌธ์„œ๋กœ ๋ณผ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์ƒ์ˆ˜๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๊ณ  ๊ทธ๋ž˜์„œ ๋ฌด์‹œํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ’์ด๋œ๋‹ค.

์ด ๋•Œ P(d|c)๋Š” d์•ˆ์— ์žˆ๋Š” words๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ ๊ฐ words๊ฐ€ ๋…๋ฆฝ์ ์ด๋ผ๋ฉด ๊ฐ๊ฐ์˜ ๊ณฑ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋ž˜์„œ ์šฐ๋ฆฌ๋Š” ๋ฌธ์„œ๊ฐ€ ์ฃผ์–ด์ง€๊ธฐ ์ด์ „์˜ ๊ฐ ํด๋ž˜์Šค๊ฐ€ ๋‚˜ํƒ€๋‚  ํ™•๋ฅ  P(c)์™€ ํŠน์ • ํด๋ž˜์Šค๊ฐ€ ๊ณ ์ •๋˜์–ด ์žˆ์„ ๋•Œ ๊ฐ ์›Œ๋“œ๊ฐ€ ๋‚˜ํƒ€๋‚  ํ™•๋ฅ  P(d|c)๋ฅผ ์ถ”์ •ํ•จ์œผ๋กœ์จ NaiveBayes Classifier๊ฐ€ ํ•„์š”ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ชจ๋‘ ์ถ”์ •ํ•  ์ˆ˜ ์žˆ๊ฒŒ๋œ๋‹ค.

๋งŒ์•ฝ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์˜ˆ์‹œ๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•˜์ž.

๊ทธ๋Ÿฌ๋ฉด ๊ฐ๊ฐ์˜ ํด๋ž˜์Šค๊ฐ€ ๋‚˜ํƒ€๋‚  ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

์ดํ›„, ํด๋ž˜์Šค๊ฐ€ ๊ณ ์ •๋  ๋•Œ ๊ฐ ๋‹จ์–ด๊ฐ€ ๋‚˜ํƒ€๋‚  ํ™•๋ฅ ์„ ์ถ”์ •ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

ํ™•๋ฅ ์„ ์ถ”์ •ํ•  ๋•Œ๋Š” ๊ฐ ํด๋ž˜์Šค์— ์กด์žฌํ•˜๋Š” ์ „์ฒด ๋‹จ์–ด์˜ ์ˆ˜์™€ ํ•ด๋‹น ํด๋ž˜์Šค์—์„œ ๋‹จ์–ด์˜ ๋นˆ๋„ ์ˆ˜์˜ ๋Œ€ํ•œ ๋น„์œจ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.

  • CV๋Š” 14๊ฐœ์˜ ๋‹จ์–ด, NLP๋Š” 10๊ฐœ์˜ ๋‹จ์–ด๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.

๊ฒฐ๊ตญ ๋งˆ์ง€๋ง‰ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๊ฐ€ ์†ํ•  ํด๋ž˜์Šค๋Š” ๊ฐ๊ฐ์˜ ํ™•๋ฅ  ๊ณฑ์œผ๋กœ ๊ตฌํ•ด์„œ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‹ค.

  • ์ด ๋•Œ ๊ฐ๊ฐ์˜ ๋‹จ์–ด๋Š” ๋…๋ฆฝ์ด๋ผ๋Š” ๊ฐ€์ •์ด ๊ผญ ์žˆ์–ด์•ผ ํ•œ๋‹ค.

NaiveBayes Classifier๋Š” ํด๋ž˜์Šค์˜ ๊ฐœ์ˆ˜๊ฐ€ 3๊ฐœ ์ด์ƒ์ด์–ด๋„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

๋˜, ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์— ์—†๋Š” ๋‹จ์–ด๊ฐ€ ๋“ฑ์žฅํ–ˆ์„ ๊ฒฝ์šฐ์—๋Š” ๊ทธ ์™ธ์˜ ๋‹จ์–ด๊ฐ€ ์•„๋ฌด๋ฆฌ ํŠน์ • ํด๋ž˜์Šค์™€ ๋ฐ€์ ‘ํ•˜๋”๋ผ๋„ ๋ฌด์กฐ๊ฑด 0์˜ ๊ฐ’์„ ๊ฐ€์ง€๊ฒŒ ๋˜์–ด ํ•ด๋‹น ํด๋ž˜์Šค๋กœ ๋ถ„๋ฅ˜๋˜๋Š” ๊ฒƒ์ด ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋ž˜์„œ ์ถ”๊ฐ€์ ์ธ Regularization ๊ธฐ๋ฒ•์ด ์ ์šฉ๋˜์–ด์„œ ํ™œ์šฉ์ด ๋œ๋‹ค.

๋˜, ์—ฌ๊ธฐ์„œ๋Š” ํ™•๋ฅ ์„ ์ถ”์ •ํ•  ๋•Œ ์ „์ฒด ๊ฐœ์ˆ˜์™€ ์ผ๋ถ€ ๊ฐœ์ˆ˜์˜ ๋น„์œจ๋กœ ์ถ”์ •ํ–ˆ์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” MLE, Maximum Likelihood Estimation์ด๋ผ๋Š” ์ด๋ก ์ ์œผ๋กœ ํƒ„ํƒ„ํ•œ ์œ ๋„๊ณผ์ •์„ ํ†ตํ•ด์„œ ๋„์ถœ์ด ๋œ๋‹ค.

์‹ค์Šต

ํ•„์š” ํŒจํ‚ค์ง€

!pip install konlpy
# Various Korean morphological analyzers are implemented as classes
from konlpy import tag 
from tqdm import tqdm
from collections import defaultdict
import math
  • konlpy, short for "KOrean NLP in pYthon", is a Python package for Korean language processing.

ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋Š” ์•„๋ž˜์™€ ๊ฐ™์œผ๋ฉฐ ๊ธ์ •์ ์ธ ๋ฆฌ๋ทฐ์ด๋ฉด 1, ๋ถ€์ •์ ์ธ ๋ฆฌ๋ทฐ์ด๋ฉด 0์ธ ๋‘ ๊ฐ€์ง€ ํด๋ž˜์Šค๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๋‹ค.

train_data = [
  "์ •๋ง ๋ง›์žˆ์Šต๋‹ˆ๋‹ค. ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค.",
  "๊ธฐ๋Œ€ํ–ˆ๋˜ ๊ฒƒ๋ณด๋‹จ ๋ณ„๋กœ์˜€๋„ค์š”.",
  "๋‹ค ์ข‹์€๋ฐ ๊ฐ€๊ฒฉ์ด ๋„ˆ๋ฌด ๋น„์‹ธ์„œ ๋‹ค์‹œ ๊ฐ€๊ณ  ์‹ถ๋‹ค๋Š” ์ƒ๊ฐ์ด ์•ˆ ๋“œ๋„ค์š”.",
  "์™„์ „ ์ตœ๊ณ ์ž…๋‹ˆ๋‹ค! ์žฌ๋ฐฉ๋ฌธ ์˜์‚ฌ ์žˆ์Šต๋‹ˆ๋‹ค.",
  "์Œ์‹๋„ ์„œ๋น„์Šค๋„ ๋‹ค ๋งŒ์กฑ์Šค๋Ÿฌ์› ์Šต๋‹ˆ๋‹ค.",
  "์œ„์ƒ ์ƒํƒœ๊ฐ€ ์ข€ ๋ณ„๋กœ์˜€์Šต๋‹ˆ๋‹ค. ์ข€ ๋” ๊ฐœ์„ ๋˜๊ธฐ๋ฅผ ๋ฐ”๋ž๋‹ˆ๋‹ค.",
  "๋ง›๋„ ์ข‹์•˜๊ณ  ์ง์›๋ถ„๋“ค ์„œ๋น„์Šค๋„ ๋„ˆ๋ฌด ์นœ์ ˆํ–ˆ์Šต๋‹ˆ๋‹ค.",
  "๊ธฐ๋…์ผ์— ๋ฐฉ๋ฌธํ–ˆ๋Š”๋ฐ ์Œ์‹๋„ ๋ถ„์œ„๊ธฐ๋„ ์„œ๋น„์Šค๋„ ๋‹ค ์ข‹์•˜์Šต๋‹ˆ๋‹ค.",
  "์ „๋ฐ˜์ ์œผ๋กœ ์Œ์‹์ด ๋„ˆ๋ฌด ์งฐ์Šต๋‹ˆ๋‹ค. ์ €๋Š” ๋ณ„๋กœ์˜€๋„ค์š”.",
  "์œ„์ƒ์— ์กฐ๊ธˆ ๋” ์‹ ๊ฒฝ ์ผ์œผ๋ฉด ์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค. ์กฐ๊ธˆ ๋ถˆ์พŒํ–ˆ์Šต๋‹ˆ๋‹ค."
]
train_labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

test_data = [
  "์ •๋ง ์ข‹์•˜์Šต๋‹ˆ๋‹ค. ๋˜ ๊ฐ€๊ณ  ์‹ถ๋„ค์š”.",
  "๋ณ„๋กœ์˜€์Šต๋‹ˆ๋‹ค. ๋˜๋„๋ก ๊ฐ€์ง€ ๋งˆ์„ธ์š”.",
  "๋‹ค๋ฅธ ๋ถ„๋“ค๊ป˜๋„ ์ถ”์ฒœ๋“œ๋ฆด ์ˆ˜ ์žˆ์„ ๋งŒํผ ๋งŒ์กฑํ–ˆ์Šต๋‹ˆ๋‹ค.",
  "์„œ๋น„์Šค๊ฐ€ ์ข€ ๋” ๊ฐœ์„ ๋˜์—ˆ์œผ๋ฉด ์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ถ„์ด ์ข€ ๋‚˜๋นด์Šต๋‹ˆ๋‹ค."
]
tokenizer = tag.Okt()
  • tokenizer๋Š” konlpy์—์„œ ์ œ๊ณตํ•˜๋Š” Okt๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” Open Korea Text์˜ ์ค€๋ง์ด๋‹ค.

  • ๊ทธ ์™ธ์—๋„ Mecab, Komoran, Hannanum, Kkma ๋ผ๋Š” ํ˜•ํƒœ์†Œ ๋ถ„์„๊ธฐ(Tokenizer)๊ฐ€ ์žˆ๋‹ค.

def make_tokenized(data):
  tokenized = []  # review data split into word units

  for sent in tqdm(data):
    tokens = tokenizer.morphs(sent)
    tokenized.append(tokens)

  return tokenized
  • sent๋Š” sentence๋ฅผ ์ง€์นญํ•˜๋Š” ๋ณ€์ˆ˜์ด๋ฉฐ ๊ฐ data์— ์žˆ๋Š” ๋ง๋ญ‰์น˜์—์„œ ํ•œ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ์˜๋ฏธํ•œ๋‹ค.

  • morphs ํ•จ์ˆ˜๋Š” ํ…์ŠคํŠธ๋ฅผ ํ˜•ํƒœ์†Œ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„๋Š” ํ•จ์ˆ˜์ด๋‹ค.

  • tokenize ๋œ ๋‹จ์–ด๋“ค์€ tokenized ์— ์ถ”๊ฐ€๋˜๊ณ  ์ตœ์ข…์ ์œผ๋กœ ๋ฐ˜ํ™˜๋œ๋‹ค.

train_tokenized = make_tokenized(train_data)
test_tokenized = make_tokenized(test_data)
train_tokenized
[['์ •๋ง', '๋ง›์žˆ์Šต๋‹ˆ๋‹ค', '.', '์ถ”์ฒœ', 'ํ•ฉ๋‹ˆ๋‹ค', '.'],
 ['๊ธฐ๋Œ€ํ–ˆ๋˜', '๊ฒƒ', '๋ณด๋‹จ', '๋ณ„๋กœ', '์˜€๋„ค์š”', '.'],
 ['๋‹ค',
  '์ข‹์€๋ฐ',
  '๊ฐ€๊ฒฉ',
  '์ด',
  '๋„ˆ๋ฌด',
  '๋น„์‹ธ์„œ',
  '๋‹ค์‹œ',
  '๊ฐ€๊ณ ',
  '์‹ถ๋‹ค๋Š”',
  '์ƒ๊ฐ',
  '์ด',
  '์•ˆ',
  '๋“œ๋„ค',
  '์š”',
  '.'],
 ['์™„์ „', '์ตœ๊ณ ', '์ž…๋‹ˆ๋‹ค', '!', '์žฌ', '๋ฐฉ๋ฌธ', '์˜์‚ฌ', '์žˆ์Šต๋‹ˆ๋‹ค', '.'],
 ['์Œ์‹', '๋„', '์„œ๋น„์Šค', '๋„', '๋‹ค', '๋งŒ์กฑ์Šค๋Ÿฌ์› ์Šต๋‹ˆ๋‹ค', '.'],
 ['์œ„์ƒ',
  '์ƒํƒœ',
  '๊ฐ€',
  '์ข€',
  '๋ณ„๋กœ',
  '์˜€์Šต๋‹ˆ๋‹ค',
  '.',
  '์ข€',
  '๋”',
  '๊ฐœ์„ ',
  '๋˜',
  '๊ธฐ๋ฅผ',
  '๋ฐ”๋ž๋‹ˆ๋‹ค',
  '.'],
 ['๋ง›', '๋„', '์ข‹์•˜๊ณ ', '์ง์›', '๋ถ„๋“ค', '์„œ๋น„์Šค', '๋„', '๋„ˆ๋ฌด', '์นœ์ ˆํ–ˆ์Šต๋‹ˆ๋‹ค', '.'],
 ['๊ธฐ๋…์ผ',
  '์—',
  '๋ฐฉ๋ฌธ',
  'ํ–ˆ๋Š”๋ฐ',
  '์Œ์‹',
  '๋„',
  '๋ถ„์œ„๊ธฐ',
  '๋„',
  '์„œ๋น„์Šค',
  '๋„',
  '๋‹ค',
  '์ข‹์•˜์Šต๋‹ˆ๋‹ค',
  '.'],
 ['์ „๋ฐ˜', '์ ', '์œผ๋กœ', '์Œ์‹', '์ด', '๋„ˆ๋ฌด', '์งฐ์Šต๋‹ˆ๋‹ค', '.', '์ €', '๋Š”', '๋ณ„๋กœ', '์˜€๋„ค์š”', '.'],
 ['์œ„์ƒ', '์—', '์กฐ๊ธˆ', '๋”', '์‹ ๊ฒฝ', '์ผ์œผ๋ฉด', '์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค', '.', '์กฐ๊ธˆ', '๋ถˆ์พŒํ–ˆ์Šต๋‹ˆ๋‹ค', '.']]

ํ•™์Šต ๋ฐ์ดํ„ฐ๊ธฐ์ค€์œผ๋กœ ๊ฐ€์žฅ ๋งŽ์ด ๋“ฑ์žฅํ•œ ๋‹จ์–ด๋ถ€ํ„ฐ ์ˆœ์„œ๋Œ€๋กœ Vocaburary์— ์ถ”๊ฐ€ํ•œ๋‹ค.

word_count = defaultdict(int)  # Key: word, Value: number of occurrences

for tokens in tqdm(train_tokenized):
  for token in tokens:
    word_count[token] += 1
word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
print(len(word_count))
66
  • ์ด ๋“ฑ๋ก๋œ ๋‹จ์–ด ์ˆ˜๋Š” 66๊ฐœ์ด๋ฉฐ, word_count ์—๋Š” ๋นˆ๋„์ˆ˜๊ฐ€ ๋†’์€ ๊ฒƒ๋ถ€ํ„ฐ ์ •๋ ฌ๋˜์–ด ์ €์žฅ๋œ๋‹ค.

word_count
[('.', 14),
 ('๋„', 7),
 ('๋ณ„๋กœ', 3),
 ('๋‹ค', 3),
 ('์ด', 3),
 ('๋„ˆ๋ฌด', 3),
 ('์Œ์‹', 3),
 ('์„œ๋น„์Šค', 3),
 ('์˜€๋„ค์š”', 2),
 ('๋ฐฉ๋ฌธ', 2),
 ('์œ„์ƒ', 2),
 ('์ข€', 2),
 ('๋”', 2),
 ('์—', 2),
 ('์กฐ๊ธˆ', 2),
 ('์ •๋ง', 1),
 --- ์ดํ•˜ ์ƒ๋žต ---

์ดํ›„, ๊ฐ ๋‹จ์–ด๋งˆ๋‹ค index๋ฅผ ๋ถ€์—ฌํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌํ˜„ํ•œ๋‹ค.

w2i = {}  # Key: word, Value: the word's index
for pair in tqdm(word_count):
  if pair[0] not in w2i:
    w2i[pair[0]] = len(w2i)
  • If a word is not yet in w2i, it is added with the dictionary's current length as its index, so indices are assigned in frequency order.

w2i
{'!': 35,
 '.': 0,
 '๊ฐ€': 41,
 '๊ฐ€๊ฒฉ': 23,
 '๊ฐ€๊ณ ': 26,
 '๊ฐœ์„ ': 43,
 '๊ฒƒ': 20,
 '๊ธฐ๋…์ผ': 52,
 '๊ธฐ๋Œ€ํ–ˆ๋˜': 19,
 '๊ธฐ๋ฅผ': 45,
 '๋„ˆ๋ฌด': 5,
 '๋Š”': 61,
 '๋‹ค': 3,
 '๋‹ค์‹œ': 25,
 --- rest omitted ---
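The indexing logic above can be illustrated standalone on a short (word, count) list (a toy subset of the real word_count):

```python
# Sorted (word, count) pairs, as produced by the word_count step above.
word_count = [('.', 14), ('๋„', 7), ('๋ณ„๋กœ', 3)]

w2i = {}  # Key: word, Value: index of the word
for word, _count in word_count:
    if word not in w2i:
        w2i[word] = len(w2i)  # next free index = current size of the dict

print(w2i)  # {'.': 0, '๋„': 1, '๋ณ„๋กœ': 2}
```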

๋ชจ๋ธ Class ๊ตฌํ˜„

NaiveBayes Classifier ๋ชจ๋ธ ํด๋ž˜์Šค๋ฅผ ๊ตฌํ˜„ํ•œ๋‹ค.

  • self.k: constant used for smoothing.

  • self.w2i: the vocab built earlier.

  • self.priors: prior probability of each class.

  • self.likelihoods: likelihood of each token conditioned on a given class.

class NaiveBayesClassifier():
  def __init__(self, w2i, k=0.1):
    self.k = k
    self.w2i = w2i
    self.priors = {}
    self.likelihoods = {}

  def train(self, train_tokenized, train_labels):
    self.set_priors(train_labels)  # Compute priors.
    self.set_likelihoods(train_tokenized, train_labels)  # Compute likelihoods.

  def inference(self, tokens):
    log_prob0 = 0.0
    log_prob1 = 0.0

    for token in tokens:
      if token in self.likelihoods:  # Only consider words seen during training.
        log_prob0 += math.log(self.likelihoods[token][0])
        log_prob1 += math.log(self.likelihoods[token][1])

    # Finally, account for the prior.
    log_prob0 += math.log(self.priors[0])
    log_prob1 += math.log(self.priors[1])

    if log_prob0 >= log_prob1:
      return 0
    else:
      return 1

  def set_priors(self, train_labels):
    class_counts = defaultdict(int)
    for label in tqdm(train_labels):
      class_counts[label] += 1
    
    for label, count in class_counts.items():
      self.priors[label] = class_counts[label] / len(train_labels)

  def set_likelihoods(self, train_tokenized, train_labels):
    token_dists = {}  # Occurrence count of each word under each class.
    class_counts = defaultdict(int)  # Total count of all tokens appearing in each class.

    for i, label in enumerate(tqdm(train_labels)):
      count = 0
      for token in train_tokenized[i]:
        if token in self.w2i:  # Only consider tokens in the vocab built from the training data.
          if token not in token_dists:
            token_dists[token] = {0:0, 1:0}
          token_dists[token][label] += 1
          count += 1
      class_counts[label] += count

    for token, dist in tqdm(token_dists.items()):
      if token not in self.likelihoods:
        self.likelihoods[token] = {
            0:(token_dists[token][0] + self.k) / (class_counts[0] + len(self.w2i)*self.k),
            1:(token_dists[token][1] + self.k) / (class_counts[1] + len(self.w2i)*self.k),
        }

Let's dig in and analyze it!

__init__ and train

class NaiveBayesClassifier():
  def __init__(self, w2i, k=0.1):
    self.k = k
    self.w2i = w2i
    self.priors = {}
    self.likelihoods = {}

  def train(self, train_tokenized, train_labels):
    self.set_priors(train_labels)  # Compute priors.
    self.set_likelihoods(train_tokenized, train_labels)  # Compute likelihoods.
  • The class is initialized with a smoothing constant k and the pre-built vocab; from these it will compute each class's prior probability and each token's likelihood under each class.

  • Remember the formula explained above!?

  • Computing P(c) there is the job of set_priors, and computing P(d|c) is the job of set_likelihoods.
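The formula referred to here is the standard Naive Bayes decision rule, restated with the P(c) and P(d|c) notation used above (tokens t_1, ..., t_n make up document d):

```latex
\hat{c} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)
        = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c)
        = \underset{c \in C}{\operatorname{argmax}}\; P(c)\prod_{i=1}^{n} P(t_i \mid c)
```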

set_priors

  def set_priors(self, train_labels):
    class_counts = defaultdict(int)
    for label in tqdm(train_labels):
      class_counts[label] += 1
    
    for label, count in class_counts.items():
      self.priors[label] = class_counts[label] / len(train_labels)
  • set_priors is implemented as above and takes train_labels as input, which looks like train_labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0].

  • class_counts counts how many times each label occurs. With the train_labels above, it would be:

    • class_counts[0] = 5

    • class_counts[1] = 5

  • priors is simply each label's share of the total count. It would be:

    • prior[0] = 5/10 = 1/2

    • prior[1] = 5/10 = 1/2
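As a sanity check, the prior computation above can be reproduced standalone on the example labels (a minimal sketch, outside the class):

```python
from collections import defaultdict

train_labels = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

# Count how many examples carry each label.
class_counts = defaultdict(int)
for label in train_labels:
    class_counts[label] += 1

# Prior of a class = its label count / total number of examples.
priors = {label: count / len(train_labels) for label, count in class_counts.items()}
print(priors)  # {1: 0.5, 0: 0.5}
```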

set_likelihoods

  def set_likelihoods(self, train_tokenized, train_labels):
    token_dists = {}  # Occurrence count of each word under each class.
    class_counts = defaultdict(int)  # Total count of all tokens appearing in each class.

    for i, label in enumerate(tqdm(train_labels)):
      count = 0
      for token in train_tokenized[i]:
        if token in self.w2i:  # Only consider tokens in the vocab built from the training data.
          if token not in token_dists:
            token_dists[token] = {0:0, 1:0}
          token_dists[token][label] += 1
          count += 1
      class_counts[label] += count

    for token, dist in tqdm(token_dists.items()):
      if token not in self.likelihoods:
        self.likelihoods[token] = {
            0:(token_dists[token][0] + self.k) / (class_counts[0] + len(self.w2i)*self.k),
            1:(token_dists[token][1] + self.k) / (class_counts[1] + len(self.w2i)*self.k),
        }
  • Don't be intimidated just because likelihoods showed up; the implementation here is simple.

  • Lines 5-13

    • A nested loop over the training data gives access to each label and its corresponding sentence.

    • There is a condition that the token must be in w2i, yet every token here is in w2i, so why is the condition there? If the dataset were very large, we could not store every token in the vocab and embed them all: the bigger the vocabulary, the bigger the embedding table, and that can cause memory problems. So when building a vocab it is common to skip tokens with low frequency (e.g., appearing 5 times or fewer). Here, because the dataset is very small, every token was added to w2i regardless of frequency, so read the check purely as a conventional idiom (normally useful, but with no effect in this notebook)!

    • token_dists is the variable that records how many times each token appears in positive and in negative reviews!

    • class_counts is the variable that records, while scanning the tokens, how many tokens appeared in positive reviews and how many in negative ones!

  • Lines 15-20

    • Once token_dists and class_counts have been collected, the likelihood of each token is computed from them.

    • Probability that a token is used positively = (count of the token in positive reviews) / (total number of tokens in positive reviews).

    • The negative case works the same way, and the k and len(w2i) * k added to the numerator and denominator are a smoothing technique that avoids the zero-probability problem!
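The effect of the smoothing terms can be checked on hypothetical counts (vocab size 66 as in this notebook, k = 0.1; the class-0 token total of 50 is made up): without smoothing, an unseen token has likelihood 0, which would make the log in inference undefined.

```python
k = 0.1
vocab_size = 66      # len(w2i) in this notebook
class0_total = 50    # hypothetical total token count observed in class 0

# A token that never appeared in class 0:
unsmoothed = 0 / class0_total                         # 0.0 -> log(0) is undefined
smoothed = (0 + k) / (class0_total + vocab_size * k)  # small but strictly positive

print(unsmoothed)          # 0.0
print(round(smoothed, 5))  # 0.00177
```

Because every one of the len(w2i) vocab tokens gets the same +k in its numerator, the len(w2i) * k in the denominator keeps the smoothed likelihoods summing to 1 over the vocabulary.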

inference

  def inference(self, tokens):
    log_prob0 = 0.0
    log_prob1 = 0.0

    for token in tokens:
      if token in self.likelihoods:  # Only consider words seen during training.
        log_prob0 += math.log(self.likelihoods[token][0])
        log_prob1 += math.log(self.likelihoods[token][1])

    # Finally, account for the prior.
    log_prob0 += math.log(self.priors[0])
    log_prob1 += math.log(self.priors[1])

    if log_prob0 >= log_prob1:
      return 0
    else:
      return 1
  • This code returns the positive or negative class for a test example.

  • The log probabilities for positive and negative are first initialized to 0.

  • Each token's likelihood of being positive or negative is then added in. The if statement exists for the same conventional reason explained above: the model cannot judge a token it never saw during training, so only tokens learned at training time are considered.

  • Taking the log does not change which value is larger, but the log form turns products into sums, which reduces computational cost (and avoids numerical underflow), so we switch to log likelihoods.

  • Finally, the class with the larger value, positive or negative, is returned.
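The product-to-sum identity behind the log trick can be verified directly (the likelihood values here are made up):

```python
import math

prior = 0.5
likelihoods = [0.2, 0.05, 0.5]   # hypothetical P(token | class) values

# Naive computation: multiply raw probabilities (underflows for long documents).
prob = prior
for p in likelihoods:
    prob *= p

# Log-space computation: the product becomes a sum of logs.
log_prob = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(math.isclose(math.log(prob), log_prob))  # True
```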

๋ชจ๋ธ ํ•™์Šต ๋ฐ ํ…Œ์ŠคํŠธ

classifier = NaiveBayesClassifier(w2i)
classifier.train(train_tokenized, train_labels)
preds = []
for test_tokens in tqdm(test_tokenized):
  pred = classifier.inference(test_tokens)
  preds.append(pred)
preds
[1, 0, 1, 0]

All the test predictions came out correct:

  • "์ •๋ง ์ข‹์•˜์Šต๋‹ˆ๋‹ค. ๋˜ ๊ฐ€๊ณ  ์‹ถ๋„ค์š”." = positive

  • "๋ณ„๋กœ์˜€์Šต๋‹ˆ๋‹ค. ๋˜๋„๋ก ๊ฐ€์ง€ ๋งˆ์„ธ์š”." = negative

  • "๋‹ค๋ฅธ ๋ถ„๋“ค๊ป˜๋„ ์ถ”์ฒœ๋“œ๋ฆด ์ˆ˜ ์žˆ์„ ๋งŒํผ ๋งŒ์กฑํ–ˆ์Šต๋‹ˆ๋‹ค." = positive

  • "์„œ๋น„์Šค๊ฐ€ ์ข€ ๋” ๊ฐœ์„ ๋˜์—ˆ์œผ๋ฉด ์ข‹๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋ถ„์ด ์ข€ ๋‚˜๋นด์Šต๋‹ˆ๋‹ค." = negative

So I experimented with 3 additional examples of my own.

  • ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹์„ ์ฐธ๊ณ ํ•ด๋„ ์•Œ ์ˆ˜ ์—†๋Š” ๊ธ์ • ํ‘œํ˜„

    • "๋ง›๋„ ์—†๊ณ  ์„œ๋น„์Šค๋„ ๋ณ„๋กœ์ง€๋งŒ ์ข…์—…์›์ด ์ด๋ป์„œ ๋˜ ๊ฐˆ๊ฑฐ์—์š”"

  • ๋งค์šฐ ๋งŽ์€ ๋ถ€์ •ํ‘œํ˜„์ด ์žˆ์ง€๋งŒ ๊ฒฐ๊ตญ ๊ธ์ • ํ‘œํ˜„

    • "์„œ๋น„์Šค๋„ ๋ณ„๋กœ์˜€๋„ค์š”. ๋„ˆ๋ฌด ๋น„์‹ธ์„œ ๊ฐ€๊ณ  ์‹ถ์ง€ ์•Š๊ณ  ์œ„์ƒ ์ƒํƒœ๊ฐ€ ์กฐ๊ธˆ ๋ถˆ์พŒํ–ˆ์Šต๋‹ˆ๋‹ค. ์Œ์‹๋„ ๋„ˆ๋ฌด ์งฐ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์šฐ๋ฆฌ ์—„๋งˆ ๊ฐ€๊ฒŒ๋ผ์„œ ์ถ”์ฒœํ•ฉ๋‹ˆ๋‹ค."

  • ๋งค์šฐ ๋งŽ์€ ๊ธ์ •ํ‘œํ˜„์ด ์žˆ์ง€๋งŒ ๊ฒฐ๊ตญ ๋ถ€์ • ํ‘œํ˜„

    • "์ •๋ง ๋ง›์žˆ์Šต๋‹ˆ๋‹ค. ์™„์ „ ์ตœ๊ณ ์ž…๋‹ˆ๋‹ค!. ์Œ์‹๋„ ๋ถ„์œ„๊ธฐ๋„ ์„œ๋น„์Šค๋„ ๋‹ค ์ข‹์•˜์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ฐ€๊ฒฉ์ด ๋„ˆ๋ฌด ๋น„์‹ธ์„œ ๋ณ„๋กœ์˜€๋„ค์š”."

๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • ๊ธฐ์กด ๋ฐ์ดํ„ฐ์…‹์„ ์ฐธ๊ณ ํ•ด๋„ ์•Œ ์ˆ˜ ์—†๋Š” ๊ธ์ • ํ‘œํ˜„ => ๊ธ์ •

    • ์ฒœ์žฐ๊ฐ€?

    • "๋ง›๋„" => ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ ๊ธ์ •์—์„œ๋งŒ ์‚ฌ์šฉ

    • "์„œ๋น„์Šค๋„" => ํ•™์Šต ๋ฐ์ดํ„ฐ์—์„œ ๊ธ์ •์—์„œ๋งŒ ์‚ฌ์šฉ

    • ์ด๋Ÿฌํ•œ ์ด์œ ๋กœ ๊ธ์ •์ด ๋‚˜์˜จ๊ฒƒ์œผ๋กœ ๋ณด์ž„. ๋ฐ˜๋Œ€๋กœ "๋ง›๋„ ์—†๊ณ  ์„œ๋น„์Šค๋„ ๋ณ„๋กœ๋„ค์š”" ์— ๋Œ€ํ•ด์„œ๋„ ๊ธ์ •์ด ๋‚˜์˜จ๋‹ค.

  • ๋งค์šฐ ๋งŽ์€ ๋ถ€์ •ํ‘œํ˜„์ด ์žˆ์ง€๋งŒ ๊ฒฐ๊ตญ ๊ธ์ • ํ‘œํ˜„ => ๋ถ€์ •

    • ๋ถ€์ • ๋‹จ์–ด๊ฐ€ ํ›จ์”ฌ ๋งŽ์•„์„œ ๋ถ€์ •, ๊ธ์ • ํ‘œํ˜„์€ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์—†์Œ

  • ๋งค์šฐ ๋งŽ์€ ๊ธ์ •ํ‘œํ˜„์ด ์žˆ์ง€๋งŒ ๊ฒฐ๊ตญ ๋ถ€์ • ํ‘œํ˜„ => ๊ธ์ •

    • ์œ„์™€ ๋งˆ์ฐฌ๊ฐ€์ง€์ด๋‹ค. ๊ธ์ • ํ‘œํ˜„๊ณผ ๋ถ€์ • ํ‘œํ˜„ ๋ชจ๋‘ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ์žˆ์ง€๋งŒ ๊ฐœ์ˆ˜์˜ ์ฐจ์ด๋กœ ๊ธ์ •.


Last updated 3 years ago
