7 Thu (2021 JAN)

TIL

[AI School 1st] Week 5, DAY 3

End-to-End Machine Learning Project

๋ถ€๋™์‚ฐ ํšŒ์‚ฌ์— ๊ณ ์šฉ๋œ ๋ฐ์ดํ„ฐ ๊ณผํ•™์ž๊ฐ€ ํ”„๋กœ์ ํŠธ๋ฅผ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€(E2E) ์ง„ํ–‰ํ•˜๋Š” ๊ณผ์ •

  1. Look at the big picture.

  2. Get the data.

  3. Explore and visualize the data to gain insights.

  4. Prepare the data for machine learning algorithms.

  5. Select a model and train it.

  6. Fine-tune your model.

  7. Present your solution.

  8. Launch, monitor, and maintain your system.

1. Look at the Big Picture

Problem to solve: use California census data to build a model of housing prices in California.

How should we build it? Manually, by experts? With complex hand-written rules? With machine learning?

๋ฌธ์ œ์ •์˜

  • ์ง€๋„ํ•™์Šต, ๋น„์ง€๋„ํ•™์Šต, ๊ฐ•ํ™”ํ•™์Šต ์ค‘์— ์–ด๋–ค ๊ฒฝ์šฐ์ธ๊ฐ€?

  • = ์ง€๋„ํ•™์Šต

  • ๋ถ„๋ฅ˜๋ฌธ์ œ์ธ๊ฐ€ ํšŒ๊ท€๋ฌธ์ œ์ธ๊ฐ€?

  • = ํšŒ๊ท€๋ฌธ์ œ

  • ๋ฐฐ์น˜ํ•™์Šต, ์˜จ๋ผ์ธํ•™์Šต ์ค‘ ์–ด๋–ค ๊ฒƒ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”๊ฐ€?

  • = ๋ฐฐ์น˜ํ•™์Šต

Selecting a Performance Measure

Root Mean Square Error, RMSE

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^2}$$

  • $m$ : the number of samples in the dataset

  • $x^{(i)}$ : the vector of all feature values of the i-th sample

  • $y^{(i)}$ : the label of the i-th sample
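The RMSE formula above can be sanity-checked directly with NumPy; this is a minimal sketch, with hypothetical prediction and label arrays rather than the housing data:

```python
import numpy as np

def rmse(predictions, labels):
    """Root mean square error: sqrt of the mean squared difference."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return np.sqrt(np.mean((predictions - labels) ** 2))

# toy values for illustration (each prediction is off by 1)
print(rmse([2.0, 4.0], [1.0, 3.0]))  # -> 1.0
```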

2. ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ

Setting Up the Workspace

  • Create an ML directory

$ export ML_PATH="$HOME/ml" # You can change the path if you prefer
$ mkdir -p $ML_PATH
  • ๊ฐ€์ƒํ™˜๊ฒฝ ์„ค์ •

$ cd $ML_PATH
$ virtualenv env
  • Install the packages

$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter

๋ฐ์ดํ„ฐ ๋‹ค์šด๋กœ๋“œ

# Python โ‰ฅ3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn โ‰ฅ0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
import os
import tarfile
import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
fetch_housing_data()
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

๋ฐ์ดํ„ฐ ๊ตฌ์กฐ ํ›‘์–ด๋ณด๊ธฐ

housing = load_housing_data()
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |

# ์ข€ ๋” ์ž์„ธํ•œ ๋ฐ์ดํ„ฐ ์ •๋ณด๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Œ
housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
housing.describe()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show();
Saving figure attribute_histogram_plots

Create a Test Set

์ข‹์€ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋˜์ง€ ์•Š๊ณ  ๋ชจ๋ธ ํ‰๊ฐ€๋งŒ์„ ์œ„ํ•œ "ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹"์„ ๋”ฐ๋กœ ๊ตฌ๋ถ„ํ•ด์•ผํ•œ๋‹ค. ์ดˆ๊ธฐ์— ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ .

np.random.seed(42)
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
a = np.random.permutation(10)
a
array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])
train_set, test_set = split_train_test(housing, 0.2) # train/test data split
len(train_set), len(test_set)
(16512, 4128)
  • What is the problem with the approach above? If it is run multiple times, samples can cross over from the test set into the training set, or vice versa.

  • Solution: split the data using each sample's identifier.

from zlib import crc32 # hashing function

# does a given sample belong to the test set?
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
# crc32(...) & 0xffffffff keeps the hash as an unsigned 32-bit integer;
# the sample goes to the test set when that hash is below test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |

housing_with_id.head() # the data with the index column added

|   | index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |

  • ์œ„ ๋ฐฉ๋ฒ•์˜ ๋ฌธ์ œ์ ์€? : ๋ฐ์ดํ„ฐ ๋ฒ ์ด์Šค ๊ฐฑ์‹  ์‹œ ํ–‰๋ฒˆํ˜ธ ์ˆœ์„œ๊ฐ€ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ

  • id๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐ ์•ˆ์ „ํ•œ feature๋“ค์„ ์‚ฌ์šฉํ•ด์•ผ ํ•จ

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
train_set.head()

|   | index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY | -122192.12 |
| 1 | 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY | -122182.14 |
| 2 | 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY | -122202.15 |
| 3 | 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY | -122212.15 |
| 4 | 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY | -122212.15 |

test_set.head()

|   | index | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 59 | 59 | -122.29 | 37.82 | 2.0 | 158.0 | 43.0 | 94.0 | 57.0 | 2.5625 | 60000.0 | NEAR BAY | -122252.18 |
| 60 | 60 | -122.29 | 37.83 | 52.0 | 1121.0 | 211.0 | 554.0 | 187.0 | 3.3929 | 75700.0 | NEAR BAY | -122252.17 |
| 61 | 61 | -122.29 | 37.82 | 49.0 | 135.0 | 29.0 | 86.0 | 23.0 | 6.1183 | 75000.0 | NEAR BAY | -122252.18 |
| 62 | 62 | -122.29 | 37.81 | 50.0 | 760.0 | 190.0 | 377.0 | 122.0 | 0.9011 | 86100.0 | NEAR BAY | -122252.19 |
| 67 | 67 | -122.29 | 37.80 | 52.0 | 1027.0 | 244.0 | 492.0 | 147.0 | 2.6094 | 81300.0 | NEAR BAY | -122252.20 |

We want the feature proportions of the training set to appear identically in the test set; hence stratified sampling is needed.

Stratified sampling

  • Divide the whole dataset into homogeneous groups called strata, and draw the right number of samples from each stratum so that the test set is representative of the whole dataset.

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing["median_income"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2de4e5de648>
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5]) # the bins must be chosen appropriately
housing["income_cat"].value_counts()
3    7236
2    6581
4    3639
5    2362
1     822
Name: income_cat, dtype: int64
housing["income_cat"].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x2de4f0f2748>
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
strat_train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   longitude           16512 non-null  float64 
 1   latitude            16512 non-null  float64 
 2   housing_median_age  16512 non-null  float64 
 3   total_rooms         16512 non-null  float64 
 4   total_bedrooms      16354 non-null  float64 
 5   population          16512 non-null  float64 
 6   households          16512 non-null  float64 
 7   median_income       16512 non-null  float64 
 8   median_house_value  16512 non-null  float64 
 9   ocean_proximity     16512 non-null  object  
 10  income_cat          16512 non-null  category
dtypes: category(1), float64(9), object(1)
memory usage: 1.4+ MB
strat_test_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 5241 to 2398
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   longitude           4128 non-null   float64 
 1   latitude            4128 non-null   float64 
 2   housing_median_age  4128 non-null   float64 
 3   total_rooms         4128 non-null   float64 
 4   total_bedrooms      4079 non-null   float64 
 5   population          4128 non-null   float64 
 6   households          4128 non-null   float64 
 7   median_income       4128 non-null   float64 
 8   median_house_value  4128 non-null   float64 
 9   ocean_proximity     4128 non-null   object  
 10  income_cat          4128 non-null   category
dtypes: category(1), float64(9), object(1)
memory usage: 359.0+ KB
housing["income_cat"].value_counts() / len(housing)
# proportions in the full dataset
3    0.350581
2    0.318847
4    0.176308
5    0.114438
1    0.039826
Name: income_cat, dtype: float64
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
# proportions in the test set: nearly identical to the full dataset
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props # how different are the random and stratified splits?
# stratified sampling yields lower error

|   | Overall | Stratified | Random | Rand. %error | Strat. %error |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.039826 | 0.039729 | 0.040213 | 0.973236 | -0.243309 |
| 2 | 0.318847 | 0.318798 | 0.324370 | 1.732260 | -0.015195 |
| 3 | 0.350581 | 0.350533 | 0.358527 | 2.266446 | -0.013820 |
| 4 | 0.176308 | 0.176357 | 0.167393 | -5.056334 | 0.027480 |
| 5 | 0.114438 | 0.114583 | 0.109496 | -4.318374 | 0.127011 |

# revert to the original state (drop the income_cat column)
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

3. ๋ฐ์ดํ„ฐ ์ดํ•ด๋ฅผ ์œ„ํ•œ ํƒ์ƒ‰๊ณผ ์‹œ๊ฐํ™”

# ๋ฐ์ดํ„ฐ ๋ณต์‚ฌ๋ณธ ๋งŒ๋“ค๊ธฐ (ํ›ˆ๋ จ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ค์ง€ ์•Š๊ธฐ ์œ„ํ•ด)
housing = strat_train_set.copy()

Visualizing Geographical Data

housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")
Saving figure bad_visualization_plot

๋ฐ€์ง‘๋œ ์˜์—ญ ํ‘œ์‹œ

  • alpha์˜ต์…˜

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
Saving figure better_visualization_plot

Displaying more information

  • s: circle radius => population

  • c: color => price

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()
save_fig("housing_prices_scatterplot")
Saving figure housing_prices_scatterplot
# Download the California map image
images_path = os.path.join(PROJECT_ROOT_DIR, "images", "end_to_end_project")
os.makedirs(images_path, exist_ok=True)

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
filename = "california.png" # the map image
# fetch the image so the imread below can find it
urllib.request.urlretrieve(DOWNLOAD_ROOT + "images/end_to_end_project/" + filename,
                           os.path.join(images_path, filename))
import matplotlib.image as mpimg
california_img=mpimg.imread(os.path.join(images_path, filename))
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
Saving figure california_housing_prices_plot

์œ„์—์„œ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋Š” ์‚ฌ์‹ค์€(์ฃผํƒ๊ฐ€๊ฒฉ์ด ๋†’์€ ์ง€์—ญ)?

์ƒ๊ด€๊ด€๊ณ„(Correlations) ๊ด€์ฐฐํ•˜๊ธฐ

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64

Checking correlations with scatter_matrix

# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix

# Look at just a few attributes
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
Saving figure scatter_matrix_plot
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
save_fig("income_vs_house_value_scatterplot")
Saving figure income_vs_house_value_scatterplot

์œ„์—์„œ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋Š” ์‚ฌ์‹ค๋“ค?

  • 50๋งŒ๋ถˆ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’์— ๋Œ€ํ•ด ์„ ์ฒ˜๋Ÿผ ๋‚˜ํƒ€๋‚จ, ๋˜ํ•œ ์ค‘๊ฐ„์— ํฌ๋ฏธํ•œ ์„  ์กด์žฌ

    ๋น„์ •์ƒ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ๋ฐ์ดํ„ฐ๋“ค์€ ๊ฐ€๋Šฅํ•˜๋ฉด train data set์—์„œ ์ œ๊ฑฐํ•ด์ฃผ๋Š” ๊ฒƒ์ด ๋ชจ๋ธํ•™์Šต์— ๋„์›€์ด ๋จ

Experimenting with Attribute Combinations

  • You can define new features as combinations of existing features (attributes).

  • For example: rooms per household, the ratio of bedrooms, persons per household.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

์œ„์—์„œ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋Š” ์‚ฌ์‹ค๋“ค?

  • bedrooms_per_room : ๊ฐ•ํ•œ ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ€์ง

  • rooms_per_household : 2๋ฒˆ์งธ๋กœ ๋†’์€ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๊ฐ€์ง

์ƒˆ๋กœ ๋งŒ๋“  feature๊ฐ€ ์ง‘์ด ์–ผ๋งˆ๋‚˜ ํฐ์ง€ ๊ฐ„์ ‘์ ์œผ๋กœ ๋“œ๋Ÿฌ๋ƒ„

๋ฐ์ดํ„ฐ ํƒ์ƒ‰๊ณผ์ •์€ ๋Œ€๋ถ€๋ถ„ ํ•œ ๋ฒˆ์œผ๋กœ ๋๋‚˜์ง€ ์•Š๊ณ  ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ๋ฌธ์ œ์ ์„ ๋ถ„์„ํ•œ ๋’ค ๋‹ค์‹œ ์‹คํ–‰ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

4. Prepare the Data for Machine Learning Algorithms

Data preparation can be viewed as a data transformation process.

Manual transformation vs. automated transformation (writing functions)

Advantages of automating data transformations:

  • You can easily reproduce the transformations on new data.

  • You gradually build a library of transformations that you can reuse later.

  • In a live system, raw data can be fed straight into the algorithm.

  • You can easily try out several data transformation approaches.

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

๋ฐ์ดํ„ฐ ์ •์ œ(Data Cleaning)

๋ˆ„๋ฝ๋œ ๊ฐ’(missing values) ๋‹ค๋ฃจ๋Š” ๋ฐฉ๋ฒ•๋“ค

  • ํ•ด๋‹น ๊ตฌ์—ญ์„ ์ œ๊ฑฐ(ํ–‰์„ ์ œ๊ฑฐ)

  • ํ•ด๋‹น ํŠน์„ฑ์„ ์ œ๊ฑฐ(์—ด์„ ์ œ๊ฑฐ)

  • ์–ด๋–ค ๊ฐ’์œผ๋กœ ์ฑ„์›€(0, ํ‰๊ท , ์ค‘๊ฐ„๊ฐ’ ๋“ฑ)

housing.isnull().any(axis=1)
17606    False
18632    False
14650    False
3230     False
3555     False
         ...  
6563     False
12053    False
13908    False
11159    False
15775    False
Length: 16512, dtype: bool
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head() # True if there is a null feature
sample_incomplete_rows

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income ocean_proximity
4629     -118.30     34.07                18.0       3759.0             NaN      3296.0      1462.0         2.2708       <1H OCEAN
6068     -117.86     34.01                16.0       4632.0             NaN      3038.0       727.0         5.1762       <1H OCEAN
17923    -121.97     37.35                30.0       1955.0             NaN       999.0       386.0         4.6328       <1H OCEAN
13656    -117.30     34.05                 6.0       2155.0             NaN      1039.0       391.0         1.6675          INLAND
19252    -122.79     38.48                 7.0       6837.0             NaN      3468.0      1405.0         3.1662       <1H OCEAN

sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # option 1

Empty DataFrame
Columns: [longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, ocean_proximity]
Index: []

sample_incomplete_rows.drop("total_bedrooms", axis=1)       # option 2

       longitude  latitude  housing_median_age  total_rooms  population  households  median_income ocean_proximity
4629     -118.30     34.07                18.0       3759.0      3296.0      1462.0         2.2708       <1H OCEAN
6068     -117.86     34.01                16.0       4632.0      3038.0       727.0         5.1762       <1H OCEAN
17923    -121.97     37.35                30.0       1955.0       999.0       386.0         4.6328       <1H OCEAN
13656    -117.30     34.05                 6.0       2155.0      1039.0       391.0         1.6675          INLAND
19252    -122.79     38.48                 7.0       6837.0      3468.0      1405.0         3.1662       <1H OCEAN

median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"].fillna(median, inplace=True) # option 3
median
433.0
sample_incomplete_rows

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income ocean_proximity
4629     -118.30     34.07                18.0       3759.0           433.0      3296.0      1462.0         2.2708       <1H OCEAN
6068     -117.86     34.01                16.0       4632.0           433.0      3038.0       727.0         5.1762       <1H OCEAN
17923    -121.97     37.35                30.0       1955.0           433.0       999.0       386.0         4.6328       <1H OCEAN
13656    -117.30     34.05                 6.0       2155.0           433.0      1039.0       391.0         1.6675          INLAND
19252    -122.79     38.48                 7.0       6837.0           433.0      3468.0      1405.0         3.1662       <1H OCEAN

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median") # fill in missing values with the median
# The median can only be computed on numeric attributes, so create a copy without the text attribute
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
SimpleImputer(strategy='median')
imputer.statistics_
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
housing_num.median().values
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])

Now we can use the trained imputer object to replace the missing values with the medians.

X = imputer.transform(housing_num)
X
array([[-121.89  ,   37.29  ,   38.    , ...,  710.    ,  339.    ,
           2.7042],
       [-121.93  ,   37.05  ,   14.    , ...,  306.    ,  113.    ,
           6.4214],
       [-117.2   ,   32.77  ,   31.    , ...,  936.    ,  462.    ,
           2.8621],
       ...,
       [-116.4   ,   34.09  ,    9.    , ..., 2098.    ,  765.    ,
           3.2723],
       [-118.01  ,   33.82  ,   31.    , ..., 1356.    ,  356.    ,
           4.0625],
       [-122.45  ,   37.77  ,   52.    , ..., 1269.    ,  639.    ,
           3.575 ]])

X above is a NumPy array. We can convert it back into a pandas DataFrame.

housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

Let's check that the values were filled in properly.

sample_incomplete_rows.index.values
array([ 4629,  6068, 17923, 13656, 19252], dtype=int64)
housing_num.loc[sample_incomplete_rows.index.values] # the rows that had missing values

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income
4629     -118.30     34.07                18.0       3759.0             NaN      3296.0      1462.0         2.2708
6068     -117.86     34.01                16.0       4632.0             NaN      3038.0       727.0         5.1762
17923    -121.97     37.35                30.0       1955.0             NaN       999.0       386.0         4.6328
13656    -117.30     34.05                 6.0       2155.0             NaN      1039.0       391.0         1.6675
19252    -122.79     38.48                 7.0       6837.0             NaN      3468.0      1405.0         3.1662

housing_tr.loc[sample_incomplete_rows.index.values] # missing values filled in by the imputer

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income
4629     -118.30     34.07                18.0       3759.0           433.0      3296.0      1462.0         2.2708
6068     -117.86     34.01                16.0       4632.0           433.0      3038.0       727.0         5.1762
17923    -121.97     37.35                30.0       1955.0           433.0       999.0       386.0         4.6328
13656    -117.30     34.05                 6.0       2155.0           433.0      1039.0       391.0         1.6675
19252    -122.79     38.48                 7.0       6837.0           433.0      3468.0      1405.0         3.1662

Estimator, Transformer, Predictor

  • Estimator: an object that estimates model parameters based on a dataset (e.g., the imputer). The estimation itself is performed by the fit() method, which takes one dataset as a parameter (for supervised learning, the dataset holding the labels is passed as an additional parameter).

  • Transformer: an estimator that transforms a dataset (like the imputer). The transformation is performed by the transform() method, which returns the transformed dataset.

  • Predictor: some estimators can produce predictions for a given new dataset. The LinearRegression we used earlier is a predictor. A predictor's predict() method takes a new dataset and returns predictions, and its score() method returns an evaluation metric for those predictions.
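The fit()/transform() contract above can be sketched with a tiny hand-rolled transformer. MedianImputer below is a made-up class for illustration only; scikit-learn's SimpleImputer does the same thing for real:

```python
import numpy as np

class MedianImputer:
    """Minimal transformer following scikit-learn's fit/transform convention."""

    def fit(self, X, y=None):
        # Estimator: learn one parameter per column from the dataset
        self.statistics_ = np.nanmedian(X, axis=0)
        return self

    def transform(self, X):
        # Transformer: apply the learned parameters to (possibly new) data
        X = np.array(X, dtype=float)
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.statistics_[cols]
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
print(MedianImputer().fit_transform(X).ravel())  # [1. 2. 2. 4.]
```

Because the class exposes the same method names, it can be dropped into a Pipeline just like the built-in transformers.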

Handling Text and Categorical Attributes

housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
14650      NEAR OCEAN
3230           INLAND
3555        <1H OCEAN
19480          INLAND
8879        <1H OCEAN
13685          INLAND
4937        <1H OCEAN
4861        <1H OCEAN

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])
ordinal_encoder.categories_ # the mapping of categories to integer codes
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

Problems with this representation?

  • Model training is easier when "two samples with similar feature values are actually similar" holds.

    Here the order of the encoded values does not reflect how close a district is to the ocean.

One-hot encoding

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

์œ„ ์ถœ๋ ฅ์„ ๋ณด๋ฉด ์ผ๋ฐ˜์ ์ธ ๋ฐฐ์—ด์ด ์•„๋‹ˆ๊ณ  "sparse matrix"์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

housing_cat_1hot.toarray() # class์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’์ด 1
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
cat_encoder = OneHotEncoder(sparse=False) # the sparse option controls whether a dense array or a sparse matrix is returned
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
cat_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

๋‚˜๋งŒ์˜ ๋ณ€ํ™˜๊ธฐ(Custom Transformers) ๋งŒ๋“ค๊ธฐ

Scikit-Learn์ด ์œ ์šฉํ•œ ๋ณ€ํ™˜๊ธฐ๋ฅผ ๋งŽ์ด ์ œ๊ณตํ•˜์ง€๋งŒ ํ”„๋กœ์ ํŠธ๋ฅผ ์œ„ํ•ด ํŠน๋ณ„ํ•œ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ์ž‘์—…์„ ํ•ด์•ผ ํ•  ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ด ๋•Œ ๋‚˜๋งŒ์˜ ๋ณ€ํ™˜๊ธฐ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐ˜๋“œ์‹œ ๊ตฌํ˜„ํ•ด์•ผ ํ•  method๋“ค

  • fit()

  • transform()

์•„๋ž˜์˜ custom tranformer๋Š” rooms_per_household, population_per_household ๋‘ ๊ฐœ์˜ ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋ฐ์ดํ„ฐ์…‹์— ์ถ”๊ฐ€ํ•˜๋ฉฐ add_bedrooms_per_room = True๋กœ ์ฃผ์–ด์ง€๋ฉด bedrooms_per_room ํŠน์„ฑ๊นŒ์ง€ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. add_bedrooms_per_room์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ.

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    # X: a NumPy array
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # concatenate
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Numpy๋ฐ์ดํ„ฐ๋ฅผ DataFrame์œผ๋กœ ๋ณ€ํ™˜

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

      longitude latitude housing_median_age total_rooms total_bedrooms population households median_income ocean_proximity rooms_per_household population_per_household
17606   -121.89    37.29                 38        1568            351        710        339        2.7042       <1H OCEAN             4.62537                   2.0944
18632   -121.93    37.05                 14         679            108        306        113        6.4214       <1H OCEAN             6.00885                  2.70796
14650    -117.2    32.77                 31        1952            471        936        462        2.8621      NEAR OCEAN             4.22511                  2.02597
3230    -119.61    36.31                 25        1847            371       1460        353        1.8839          INLAND             5.23229                  4.13598
3555    -118.59    34.23                 17        6592           1525       4459       1463        3.0347       <1H OCEAN             4.50581                  3.04785

ํŠน์„ฑ ์Šค์ผ€์ผ๋ง(Feature Scaling)

  • Min-max scaling: 0๊ณผ 1์‚ฌ์ด์˜ ๊ฐ’์ด ๋˜๋„๋ก ์กฐ์ •

  • ํ‘œ์ค€ํ™”(standardization): ํ‰๊ท ์ด 0, ๋ถ„์‚ฐ์ด 1์ด ๋˜๋„๋ก ๋งŒ๋“ค์–ด ์คŒ(์‚ฌ์ดํ‚ท๋Ÿฐ์˜ StandardScaler์‚ฌ์šฉ)
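Both scalers can be written out by hand in a couple of lines. A minimal NumPy sketch (scikit-learn's MinMaxScaler and StandardScaler do the same thing per column):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Min-max scaling: shift and rescale so values land in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the standard deviation
x_std = (x - x.mean()) / x.std()

print(x_minmax)  # [0.   0.25 0.5  0.75 1.  ]
print(round(float(x_std.mean()), 6), round(float(x_std.std()), 6))
```

Min-max scaling is sensitive to outliers (one extreme value squeezes everything else), which is one reason standardization is often preferred.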

๋ณ€ํ™˜ ํŒŒ์ดํ”„๋ผ์ธ(Transformation Pipelines)

์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ณ€ํ™˜์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์•ผ ํ•  ๊ฒฝ์šฐ Pipeline class๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํŽธํ•ฉ๋‹ˆ๋‹ค.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

The constructor takes a list of (name, estimator) pairs.

All but the last step must be transformers (i.e., they must have a fit_transform() method).

Calling the pipeline's fit() method calls fit_transform() on each transformer in order, passing one step's output as the next step's input. On the final step, only fit() is called.

housing_num_tr
array([[-1.15604281,  0.77194962,  0.74333089, ..., -0.31205452,
        -0.08649871,  0.15531753],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.21768338,
        -0.03353391, -0.83628902],
       [ 1.18684903, -1.34218285,  0.18664186, ..., -0.46531516,
        -0.09240499,  0.4222004 ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.3469342 ,
        -0.03055414, -0.52177644],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.02499488,
         0.06150916, -0.30340741],
       [-1.43579109,  0.99645926,  1.85670895, ..., -0.22852947,
        -0.09586294,  0.10180567]])

๊ฐ ์—ด(column) ๋งˆ๋‹ค ๋‹ค๋ฅธ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ ์šฉํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค!

์˜ˆ๋ฅผ ๋“ค์–ด ์ˆ˜์น˜ํ˜• ํŠน์„ฑ๋“ค๊ณผ ๋ฒ”์ฃผํ˜• ํŠน์„ฑ๋“ค์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ๋ณ€ํ™˜์ด ํ•„์š”ํ•˜๋‹ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ColumnTransformer๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
housing_prepared.shape, housing.shape # 16 columns = 8 numeric + 3 added by CombinedAttributesAdder + 5 one-hot
((16512, 16), (16512, 9))

5. ๋ชจ๋ธ ํ›ˆ๋ จ(Train a Model)

๋“œ๋””์–ด ๋ชจ๋ธ์„ ํ›ˆ๋ จ์‹œํ‚ฌ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค!

์ง€๋‚œ ์‹œ๊ฐ„์— ๋ฐฐ์› ๋˜ ์„ ํ˜•ํšŒ๊ท€๋ชจ๋ธ(linear regression)์„ ์‚ฌ์šฉํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
LinearRegression()

๋ชจ๋ธํ›ˆ๋ จ์€ ๋”ฑ 3์ค„์˜ ์ฝ”๋“œ๋ฉด ์ถฉ๋ถ„ํ•ฉ๋‹ˆ๋‹ค!

๋ช‡ ๊ฐœ์˜ ์ƒ˜ํ”Œ์— ๋ชจ๋ธ์„ ์ ์šฉํ•ด์„œ ์˜ˆ์ธก๊ฐ’์„ ํ™•์ธํ•ด๋ณด๊ณ  ์‹ค์ œ๊ฐ’๊ณผ ๋น„๊ตํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

lin_reg.coef_
array([-55650.4116403 , -56716.45236929,  13732.83841856,  -1933.1277138 ,
         7330.04062103, -45708.26306673,  45455.47519691,  74714.39134154,
         6605.12802802,   1042.95709453,   9249.75886697, -18016.52432168,
       -55219.15208555, 110357.78363967, -22479.84008184, -14642.2671506 ])
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(lin_reg.coef_, attributes), reverse=True)
[(110357.78363966991, 'ISLAND'),
 (74714.39134153843, 'median_income'),
 (45455.47519691441, 'households'),
 (13732.83841855541, 'housing_median_age'),
 (9249.75886697368, 'bedrooms_per_room'),
 (7330.040621029702, 'total_bedrooms'),
 (6605.128028015065, 'rooms_per_hhold'),
 (1042.9570945281878, 'pop_per_hhold'),
 (-1933.127713800795, 'total_rooms'),
 (-14642.267150598302, 'NEAR OCEAN'),
 (-18016.52432168299, '<1H OCEAN'),
 (-22479.840081835082, 'NEAR BAY'),
 (-45708.263066728214, 'population'),
 (-55219.15208555335, 'INLAND'),
 (-55650.41164030249, 'longitude'),
 (-56716.45236929203, 'latitude')]
# Transform and predict on a few samples
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared).round(decimals=1))
Predictions: [210644.6 317768.8 210956.4  59219.  189747.6]
print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

Let's measure the RMSE over the full training set.

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
68628.19819848923

ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์˜ RMSE๊ฐ€ ์ด ๊ฒฝ์šฐ์ฒ˜๋Ÿผ ํฐ ๊ฒฝ์šฐ => ๊ณผ์†Œ์ ํ•ฉ(under-fitting)

๊ณผ์†Œ์ ํ•ฉ์ด ์ผ์–ด๋‚˜๋Š” ์ด์œ ?

  • ํŠน์„ฑ๋“ค(features)์ด ์ถฉ๋ถ„ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜์ง€ ๋ชปํ•จ

  • ๋ชจ๋ธ์ด ์ถฉ๋ถ„ํžˆ ๊ฐ•๋ ฅํ•˜์ง€ ๋ชปํ•จ

๊ฐ•๋ ฅํ•œ ๋น„์„ ํ˜•๋ชจ๋ธ์ธ DecisionTreeRegressor๋ฅผ ์‚ฌ์šฉํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels) # train
DecisionTreeRegressor(random_state=42)
housing_predictions = tree_reg.predict(housing_prepared) # predict
tree_mse = mean_squared_error(housing_labels, housing_predictions) # RMSE
tree_rmse = np.sqrt(tree_mse)
tree_rmse
0.0

์ด ๋ชจ๋ธ์ด ์„ ํ˜•๋ชจ๋ธ๋ณด๋‹ค ๋‚ซ๋‹ค๊ณ  ๋งํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”? ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์„๊นŒ์š”?

  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•œ ๊ฒ€์ฆ

    ์ด๋Ÿฐ์‹์œผ๋กœ ํ•˜๋ฉด, ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์„ ๋“ค์—ฌ๋‹ค๋ณด๊ฒŒ ๋˜๊ณ  ํ•™์Šตํ•˜๋Š” ๊ณผ์ •์— ์˜ํ–ฅ์„ ๋ฏธ์นจ. ๋˜๋‹ค๋ฅธ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด์„œ ์ข‹์ง€ ๋ชปํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง€๊ฒŒ ๋จ

  • ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์˜ ์ผ๋ถ€๋ฅผ ๊ฒ€์ฆ๋ฐ์ดํ„ฐ(validation data)์…‹์œผ๋กœ ๋ถ„๋ฆฌํ•ด์„œ ๊ฒ€์ฆ

  • k-๊ฒน ๊ต์ฐจ ๊ฒ€์ฆ(k-fold cross-validation)

๊ต์ฐจ ๊ฒ€์ฆ(Cross-Validation)์„ ์‚ฌ์šฉํ•œ ํ‰๊ฐ€

๊ฒฐ์ •ํŠธ๋ฆฌ ๋ชจ๋ธ์— ๋Œ€ํ•œ ํ‰๊ฐ€

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

์„ ํ˜•ํšŒ๊ท€๋ชจ๋ธ์— ๋Œ€ํ•œ ํ‰๊ฐ€

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.6740017983425

Evaluating RandomForestRegressor

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42) # n_estimators = number of trees
forest_reg.fit(housing_prepared, housing_labels)
RandomForestRegressor(random_state=42)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
18603.515021376355
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Scores: [49519.80364233 47461.9115823  50029.02762854 52325.28068953
 49308.39426421 53446.37892622 48634.8036574  47585.73832311
 53490.10699751 50021.5852922 ]
Mean: 50182.303100336096
Standard deviation: 2097.0810550985693

The random forest model gives the best results (model selection).

After selecting the model, proceed to fine-tuning.

6. Fine-Tune Your Model

After choosing the type of model, you need to fine-tune it: the process of finding the best hyperparameters for training the model.

Grid Search

Instead of trying hyperparameter combinations by hand, it is better to use GridSearchCV.

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3ร—4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2ร—3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
grid_search.best_params_ # the best hyperparameter combination
{'max_features': 8, 'n_estimators': 30}
grid_search.best_estimator_ # the model trained with the best hyperparameters is stored as well
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
cvres = grid_search.cv_results_

# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ์— ๋”ฐ๋ผ์„œ mean_score๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ฐ”๋€Œ๋Š”์ง€
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

๋žœ๋ค ํƒ์ƒ‰(Randomized Search)

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ์˜ ์ˆ˜๊ฐ€ ํฐ ๊ฒฝ์šฐ์— ์œ ๋ฆฌ. ์ง€์ •ํ•œ ํšŸ์ˆ˜๋งŒํผ๋งŒ ํ‰๊ฐ€.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000018C1A5BC508>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000018C1A5AD148>},
                   random_state=42, scoring='neg_mean_squared_error')
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
49150.70756927707 {'max_features': 7, 'n_estimators': 180}
51389.889203389284 {'max_features': 5, 'n_estimators': 15}
50796.155224308866 {'max_features': 3, 'n_estimators': 72}
50835.13360315349 {'max_features': 5, 'n_estimators': 21}
49280.9449827171 {'max_features': 7, 'n_estimators': 122}
50774.90662363929 {'max_features': 3, 'n_estimators': 75}
50682.78888164288 {'max_features': 3, 'n_estimators': 88}
49608.99608105296 {'max_features': 5, 'n_estimators': 100}
50473.61930350219 {'max_features': 3, 'n_estimators': 150}
64429.84143294435 {'max_features': 5, 'n_estimators': 2}

ํŠน์„ฑ ์ค‘์š”๋„, ์—๋Ÿฌ ๋ถ„์„

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813423, 'median_income'),
 (0.16478099356159054, 'INLAND'),
 (0.10879295677551575, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, '<1H OCEAN'),
 (0.0028564746373201584, 'NEAR OCEAN'),
 (0.0019604155994780706, 'NEAR BAY'),
 (6.0280386727366e-05, 'ISLAND')]

7. Final Evaluation on the Test Set

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
47730.22690385927

8. Launch, Monitor, and Maintain Your System

For deployment to production it is good practice to build and save a single pipeline that includes both the data preprocessing and the model's prediction.

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ]) 

full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
# ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ, ๋ชจ๋ธ ํ•™์Šต ๋ฐ ์˜ˆ์ธก ํ•˜๋‚˜๋กœ ๋ฌถ๊ณ  ์‹คํ–‰
array([210644.60459286, 317768.80697211, 210956.43331178,  59218.98886849,
       189747.55849879])
some_data

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income ocean_proximity
17606    -121.89     37.29                38.0       1568.0           351.0       710.0       339.0         2.7042       <1H OCEAN
18632    -121.93     37.05                14.0        679.0           108.0       306.0       113.0         6.4214       <1H OCEAN
14650    -117.20     32.77                31.0       1952.0           471.0       936.0       462.0         2.8621      NEAR OCEAN
3230     -119.61     36.31                25.0       1847.0           371.0      1460.0       353.0         1.8839          INLAND
3555     -118.59     34.23                17.0       6592.0          1525.0      4459.0      1463.0         3.0347       <1H OCEAN

my_model = full_pipeline_with_predictor
import joblib
joblib.dump(my_model, "my_model.pkl")
# save the model (including its parameters) as a pickle file
my_model_loaded = joblib.load("my_model.pkl")
my_model_loaded.predict(some_data)
array([210644.60459286, 317768.80697211, 210956.43331178,  59218.98886849,
       189747.55849879])

Monitoring the System After Launch

  • Over time the model goes stale and its performance degrades

    It is therefore important to keep monitoring the system after launch ❗

    If possible, build a monitoring system that checks whether the live system is still working well

  • Automated monitoring: for a recommender system, are sales of the recommended products dropping?

  • Manual monitoring: for image classification, have experts review a sample of the classified images

  • If results get worse, check:

    • Has the quality of the input data degraded? A broken sensor?

    • A shift in trends? Seasonal effects?

Maintenance

  • Regularly collect (and label) fresh data

  • Fold the fresh data in as the new test set, and move the current test set into the training set

  • Retrain, then evaluate and compare the current model and the new model on the new test data
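The maintenance steps above can be sketched as a small loop. Everything here (the data, the hidden true relation, the dataset names) is a hypothetical stand-in, not taken from the housing notebook:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)

def make_y(X):
    # Hypothetical true relation plus noise
    return X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(len(X))

# Old training set, old test set, and newly collected (labeled) data
X_old, X_old_test, X_new = rng.rand(200, 3), rng.rand(50, 3), rng.rand(50, 3)
y_old, y_old_test, y_new = make_y(X_old), make_y(X_old_test), make_y(X_new)

# Current model was trained on the old training data only
current_model = LinearRegression().fit(X_old, y_old)

# Maintenance step: fold the old test set into training,
# keep the newly collected data as the new test set
X_train = np.vstack([X_old, X_old_test])
y_train = np.concatenate([y_old, y_old_test])
new_model = LinearRegression().fit(X_train, y_train)

# Compare both models on the new test data
results = {}
for name, model in [("current", current_model), ("new", new_model)]:
    results[name] = np.sqrt(mean_squared_error(y_new, model.predict(X_new)))
    print(f"{name} model RMSE: {results[name]:.4f}")
```

The new model replaces the current one only if it does at least as well on the new test data.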

[AI School 1st Cohort] Week 5, DAY 2

Machine Learning Basics - Decision Theory

What is decision theory?

Making an optimal decision for a new value x based on the probability model p(x, t)

  • Inference stage: finding the joint probability distribution

  • Decision stage: making the optimal decision given the probabilities of the situation

๊ฒฐ์ •์˜์—ญ - ์ด์ง„๋ถ„๋ฅ˜

๋ฌด์Šจ ๋ง์ธ์ง€ ๋ชจ๋ฅด๊ฒ ์ง€๋งŒ, ๊ทธ๋ž˜ํ”„ ๋ฉด์ ์ด ์˜ค๋ฅ˜๋ฅผ ์˜๋ฏธํ•˜๊ณ  ์˜ค๋ฅ˜๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ์ชฝ์œผ๋กœ ํ•ด์•ผํ•˜๋Š”๋ฐ ๊ทธ ๋ถ€๋ถ„์ด ๋‘ ๊ทธ๋ž˜ํ”„์˜ ๊ต์ 

๊ฒฐ์ •์ด๋ก ์˜ ๋ชฉํ‘œ (๋ถ„๋ฅ˜์˜ ๊ฒฝ์šฐ)

๊ฒฐํ•ฉํ™•๋ฅ ๋ถ„ํฌ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ์ตœ์ ์˜ ๊ฒฐ์ •์˜์—ญ๋“ค์„ ์ฐพ๋Š” ๊ฒƒ.

๊ธฐ๋Œ€์†์‹ค ์ตœ์†Œํ™”

๋ชจ๋“  ๊ฒฐ์ •์ด ๋™์ผํ•œ ๋ฆฌ์Šคํฌ๋ฅผ ๊ฐ–์ง€ ์•Š์Œ

  • ์•”์ด ์•„๋‹Œ๋ฐ ์•”์ธ ๊ฒƒ์œผ๋กœ ์ง„๋‹จ

  • ์•”์ด ๋งž๋Š”๋ฐ ์•”์ด ์•„๋‹Œ ๊ฒƒ์œผ๋กœ ์ง„๋‹จ

๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋ชจ๋“  ์ง€์‹์€ ํ™•๋ฅ ๋ถ„ํฌ๋กœ ํ‘œํ˜„๋œ๋‹ค. ํ•œ ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ์˜ ์‹ค์ œ ํด๋ž˜์Šค๋ฅผ ๊ฒฐ์ •๋ก ์ ์œผ๋กœ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ทธ๊ฒƒ์˜ ํ™•๋ฅ ๋งŒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ๋‹ค. ์ฆ‰, ์šฐ๋ฆฌ๊ฐ€ ๊ด€์ฐฐํ•  ์ˆ˜ ์žˆ๋Š” ์ƒ˜ํ”Œ์€ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ํ†ตํ•ด์„œ ์ƒ์„ฑ๋œ ๊ฒƒ์ด๋‹ค.

Machine Learning Basics - Linear Regression

A method that models the given data with a straight line. A linear function has the form y = ax + b, where a is the slope and b is the y-intercept. The plot below shows data generated from a line with slope 2 and y-intercept -5.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
''' numpy.random.RandomState is a class that bundles random-number generators.
RandomState exposes generators for many probability distributions,
e.g. numpy.random.uniform (draw from a uniform distribution), numpy.random.normal (draw from a normal distribution), etc.
Each method takes a size argument, which defaults to None.
If size is None, a single value is generated and returned. If size is an integer, a 1-D array filled with random values is returned.
If size is a tuple, an array of that shape filled with random values is returned. '''
rng = np.random.RandomState(1)

x = 10 * rng.rand(50) # values between 0 and 10
y = 2 * x - 5 + rng.randn(50)

plt.scatter(x, y);
# ๊ท ์ผ๋ถ„ํฌ์˜ m-1๊นŒ์ง€์˜ ์ •์ˆ˜ ๋‚œ์ˆ˜๋ฅผ ์ƒ์„ฑ
np.random.randint(6)
4
# m-row, n-column array of random values drawn uniformly from [0, 1)
np.random.rand(3, 3)
array([[0.41972662, 0.91250462, 0.32922597],
       [0.35029654, 0.08989692, 0.93321008],
       [0.04695859, 0.02030855, 0.82914045]])
# m-row, n-column array of random values from the standard normal (Gaussian) distribution with mean 0 and std 1
np.random.randn(3, 3)
array([[ 1.85523496, -1.2565864 , -0.83251712],
       [-0.99798501, -0.94665524,  0.73052923],
       [-0.11112169,  0.12296838,  1.37482645]])
x
array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00,
       1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00,
       3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00,
       2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00,
       4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00,
       8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00,
       8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01,
       1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00,
       9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00,
       6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00,
       9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00,
       1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00,
       2.87775339e+00, 1.30028572e+00])
x.shape
(50,)
# ์ฐจ์›์„ ๋Š˜๋ ค์ค€๋‹ค.
# ์•„๋ž˜์— ์‚ฌ์šฉํ•  model.fit์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ 2์ฐจ์› array๋ฅผ ๋ฐ›๋Š”๋‹ค.
x[:, np.newaxis][:10]
array([[4.17022005e+00],
       [7.20324493e+00],
       [1.14374817e-03],
       [3.02332573e+00],
       [1.46755891e+00],
       [9.23385948e-01],
       [1.86260211e+00],
       [3.45560727e+00],
       [3.96767474e+00],
       [5.38816734e+00]])
# np.linspace(0, 10, 1000) creates 1000 evenly spaced values from 0 to 10
np.linspace(0, 10, 1000)[range(50,1000,20)], np.linspace(0, 10, 1000)[[0, 500, 999]]
(array([0.5005005 , 0.7007007 , 0.9009009 , 1.1011011 , 1.3013013 ,
        1.5015015 , 1.7017017 , 1.9019019 , 2.1021021 , 2.3023023 ,
        2.5025025 , 2.7027027 , 2.9029029 , 3.1031031 , 3.3033033 ,
        3.5035035 , 3.7037037 , 3.9039039 , 4.1041041 , 4.3043043 ,
        4.5045045 , 4.7047047 , 4.9049049 , 5.10510511, 5.30530531,
        5.50550551, 5.70570571, 5.90590591, 6.10610611, 6.30630631,
        6.50650651, 6.70670671, 6.90690691, 7.10710711, 7.30730731,
        7.50750751, 7.70770771, 7.90790791, 8.10810811, 8.30830831,
        8.50850851, 8.70870871, 8.90890891, 9.10910911, 9.30930931,
        9.50950951, 9.70970971, 9.90990991]),
 array([ 0.        ,  5.00500501, 10.        ]))

Scikit-Learn์˜ LinearRegression estimator๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์œ„ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ํ‘œํ˜„ํ•˜๋Š” ์ง์„ ์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค.

from sklearn.linear_model import LinearRegression

# ๋ชจ๋ธ์˜ ํด๋ž˜์Šค ์ •์˜
model = LinearRegression(fit_intercept=True) 

# fit์˜ ์ธ์ž๋“ค๋กœ ํ•™์Šต ๋ฐ์ดํ„ฐ x, ๋ ˆ์ด๋ธ” ๋ฐ์ดํ„ฐ y๋กœ ์ „๋‹ฌ๋œ๋‹ค

model.fit(x[:, np.newaxis], y) 

# Generate new input data and predict on it
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);

x and y are drawn as scatter points; xfit and yfit are also points, but since there are 1000 of them they appear as a line.

๋ชจ๋ธ ํ•™์Šต์ด ๋๋‚œ ํ›„ ํ•™์Šต๋œ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์€ model."ํŒŒ๋ผ๋ฏธํ„ฐ์ด๋ฆ„"_ ์˜ ํ˜•ํƒœ๋กœ ์ €์žฅ๋œ๋‹ค. ๊ธฐ์šธ๊ธฐ์™€ y์ ˆํŽธ์€ ์•„๋ž˜์™€ ๊ฐ™์ด ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ๋‹ค.

print("Model slope : ", model.coef_[0])
print("Model intercept : ", model.intercept_)
Model slope :  2.0272088103606944
Model intercept :  -4.9985770855532

The LinearRegression estimator handles not only 1-D inputs like the example above but also linear models with multidimensional inputs. A multidimensional linear model has the form y = a0 + a1x1 + a2x2 + ... Geometrically, this amounts to fitting a hyperplane to the data.

# 100๊ฐœ์˜ ํ–‰, 3๊ฐœ์˜ ์—ด์„ ๊ฐ€์ง„ ๋žœ๋ค๊ฐ’ ์ƒ์„ฑ
rng.rand(100, 3)[:5], rng.rand(100, 3).shape
(array([[0.76778898, 0.53600849, 0.03985993],
        [0.13479312, 0.1934164 , 0.3356638 ],
        [0.05231295, 0.60511678, 0.51206103],
        [0.61746101, 0.43235559, 0.84770047],
        [0.45405906, 0.01540352, 0.87306815]]),
 (100, 3))
rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])

model.fit(X, y)
print("Model intercept : ", model.intercept_)
print("Model slope : ", model.coef_)
Model intercept :  0.5000000000000087
Model slope :  [ 1.5 -2.   1. ]

y๊ฐ’๋“ค์€ ๋žœ๋คํ•˜๊ฒŒ ์ƒ์„ฑ๋œ 3์ฐจ์›์˜ x๊ฐ’๊ณผ ๊ณ„์ˆ˜๋“ค์„ ๊ณฑํ•จ์œผ๋กœ์จ ์ƒ์„ฑ๋˜์—ˆ๋Š”๋ฐ, linear regression์„ ํ†ตํ•ด์„œ ์ด ๊ณ„์ˆ˜๋“ค์„ ๊ณ„์‚ฐํ•ด๋‚ผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋งŒ์•ฝ ๋ฐ์ดํ„ฐ๊ฐ€ ์„ ํ˜•์ ์ธ ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋‹ค๋ฉด?

Linear Basis Function Models

One way to model nonlinear data with a linear function is to use basis functions. For example, suppose we use a linear function of the form y = a0 + a1x1 + a2x2 + a3x3 + ... Here x1, x2, x3, etc. can be generated from the 1-D variable x: xn = fn(x), where fn is called a basis function. If we use the basis function fn(x) = x^n, the final model becomes y = a0 + a1x + a2x^2 + a3x^3 + ... This model is still a linear function of the coefficients. So by expanding the 1-D variable x into multiple dimensions through basis functions, we can still use a linear model.
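That the model stays linear in the coefficients can be checked directly: build the design matrix [1, x, x^2] by hand and solve ordinary least squares. A minimal sketch with synthetic data (the coefficients 0.5, 1.0, -0.2 are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
x = 10 * rng.rand(50)
y = 0.5 + 1.0 * x - 0.2 * x**2 + 0.01 * rng.randn(50)

# Basis expansion f_n(x) = x^n turns the 1-D input into a design matrix;
# the model is linear in the coefficients, so ordinary least squares applies
X = np.column_stack([np.ones_like(x), x, x**2])  # columns: 1, x, x^2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [0.5, 1.0, -0.2]
```

Scikit-Learn's PolynomialFeatures, introduced below, automates exactly this column-stacking step.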

Polynomial Basis Functions

Functions of the form f(x) = x^n are called polynomial basis functions. Scikit-Learn already ships with a transformer called PolynomialFeatures.

from sklearn.preprocessing import PolynomialFeatures
x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
poly.fit_transform(x[:, None])
array([[ 2.,  4.,  8.],
       [ 3.,  9., 27.],
       [ 4., 16., 64.]])
# Each row contains [x, x^2, x^3], e.g. the first row is 2, 2^2, 2^3

PolynomialFeatures converted the single input feature into three features (x, x^2, x^3). This transformed data can then be fed into a linear model.

Let's apply a 7th-degree polynomial transform

from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(PolynomialFeatures(7),
                          LinearRegression())

With a multidimensional transform like this, we can model complex data. For example, let's generate data with a sine function and model it.

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
plt.scatter(x, y);
poly_model.fit(x[:, np.newaxis], y)
yfit = poly_model.predict(xfit[:, np.newaxis])

plt.scatter(x, y)
plt.plot(xfit, yfit);

Gaussian Basis Functions

Let's try a basis function other than the polynomial one. The Gaussian basis function is defined as exp{-(x - u_j)^2 / (2s^2)}

u_j determines the location of the function and s its width. We can try to represent the given data as a sum of several Gaussian basis functions.
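A minimal sketch of this idea on the same sine data as above. The `gaussian_features` helper, the 20 evenly spaced centers, and the width of 1.0 are all assumptions made for illustration (this is not a Scikit-Learn built-in transformer):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gaussian_features(x, centers, width):
    """Map 1-D x to exp(-(x - u_j)^2 / (2 s^2)) for each center u_j."""
    return np.exp(-(x[:, np.newaxis] - centers[np.newaxis, :]) ** 2
                  / (2 * width ** 2))

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

centers = np.linspace(0, 10, 20)  # u_j: evenly spaced basis locations
X = gaussian_features(x, centers, width=1.0)
model = LinearRegression().fit(X, y)

# Predict on a dense grid; the fit is a weighted sum of the Gaussians
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(gaussian_features(xfit, centers, width=1.0))
```

As with the polynomial case, the model is still linear in the coefficients; only the basis functions changed.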

Last updated 4 years ago
