[AI School, Cohort 1] Week 5, Day 3
End-to-End Machine Learning Project
This walkthrough follows a data scientist hired by a real estate company who carries a project from start to finish (E2E):
Explore and visualize the data to gain insights.
Prepare the data for machine learning algorithms.
Select a model and train it.
Fine-tune the model.
Present the solution.
Launch, monitor, and maintain the system.
1. Look at the Big Picture
Problem to solve: build a model of housing prices in California using California census data.
How should it be built? Manually by experts? Through complex hand-written rules? With machine learning?
Frame the problem
Is it supervised learning, unsupervised learning, or reinforcement learning?
Is it a classification problem or a regression problem?
Should batch learning or online learning be used?
Select a performance measure
Root Mean Square Error (RMSE)
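As a minimal sketch of the metric (the function name `rmse` and the sample numbers are illustrative and not part of the course material), RMSE is the square root of the mean of the squared differences between predictions and labels:

```python
import math

def rmse(labels, predictions):
    """Root Mean Square Error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical house values and predictions, in dollars
labels = [200000.0, 300000.0, 150000.0]
predictions = [210000.0, 290000.0, 150000.0]
error = rmse(labels, predictions)
```

Because the errors are squared before averaging, large mistakes weigh more heavily than small ones.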
2. Get the Data
Set up the working environment
$ export ML_PATH="$HOME/ml" # You can change the path if you prefer
$ mkdir -p $ML_PATH
$ cd $ML_PATH
$ virtualenv env
$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
Download the data
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Scikit-Learn ≥0.20 is required (compare numerically; string comparison misorders versions)
import sklearn
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)
# Common imports
import numpy as np
import os
# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "end_to_end_project"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)
# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory, download the archive, and extract housing.csv
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
Take a Quick Look at the Data Structure
fetch_housing_data() # download and extract the dataset before loading it
housing = load_housing_data()
housing.head()
# Shows more detailed information about the data
housing.info()
Copy <class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
Copy housing["ocean_proximity"].value_counts()
Copy <1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
Copy import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show();
Copy Saving figure attribute_histogram_plots
Create a Test Set
To build a good model, a separate "test set" must be set aside that is never used for training and serves only to evaluate the model. It is customary to split it off early.
np.random.seed(42)
# For illustration only. Sklearn has train_test_split()
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
Copy a = np.random.permutation(10)
a
Copy array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])
Copy train_set, test_set = split_train_test(housing, 0.2) # train/test data split
len(train_set), len(test_set)
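What is the problem with the approach above? If the split is run several times (for example after the dataset is updated), samples that were in the test set can end up in the training set, and vice versa.
Solution: split using each sample's identifier.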
from zlib import crc32 # hash function
# Returns True if the sample belongs to the test set
def test_set_check(identifier, test_ratio):
    # & 0xffffffff keeps the hash as an unsigned 32-bit value; comparing it with
    # test_ratio * 2**32 selects roughly test_ratio of all identifiers
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
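The benefit of hashing identifiers can be checked with a small standard-library sketch. The helper `in_test_set` and the two id lists are hypothetical; the ids are hashed from their 8-byte little-endian form, which is assumed here to mirror what `np.int64` provides on common platforms:

```python
from zlib import crc32

def in_test_set(identifier, test_ratio):
    # Hash the identifier's 8-byte representation; the mask keeps an unsigned 32-bit value
    digest = crc32(int(identifier).to_bytes(8, "little", signed=True)) & 0xffffffff
    return digest < test_ratio * 2**32

ids_v1 = list(range(1000))
ids_v2 = list(range(1200))  # the same dataset after 200 new rows were appended

test_v1 = {i for i in ids_v1 if in_test_set(i, 0.2)}
test_v2 = {i for i in ids_v2 if in_test_set(i, 0.2)}

# Every old sample keeps its original assignment after the update
stable = test_v1 == {i for i in test_v2 if i < 1000}
test_fraction = len(test_v1) / len(ids_v1)
```

Since the decision depends only on the identifier, adding rows never moves an existing sample between the two sets.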
Copy housing_with_id = housing.reset_index() # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
housing.head()
housing_with_id.head() # data with the index column added
What is the problem with the approach above? When the database is updated, the row-number order can change.
Stable features should be used to build the id.
Copy housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
train_set.head()
We want the proportions of a feature in the training set to appear the same in the test set as well.
Stratified sampling is therefore needed.
Stratified sampling
The full dataset is divided into homogeneous groups called strata, and the right number of samples is drawn from each stratum so that the test data represents the full dataset well.
Copy from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
Copy housing["median_income"].hist()
Copy <matplotlib.axes._subplots.AxesSubplot at 0x2de4e5de648>
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5]) # the bin edges must be chosen sensibly
Copy housing["income_cat"].value_counts()
Copy 3 7236
2 6581
4 3639
5 2362
1 822
Name: income_cat, dtype: int64
Copy housing["income_cat"].hist()
Copy <matplotlib.axes._subplots.AxesSubplot at 0x2de4f0f2748>
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
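To make the idea behind stratified splitting concrete, here is a minimal standard-library sketch. The function `stratified_split` and the toy category list are hypothetical; it illustrates the principle only and does not reproduce StratifiedShuffleSplit's exact behavior:

```python
import random
from collections import Counter, defaultdict

def stratified_split(categories, test_ratio, seed=42):
    """Return (train_indices, test_indices), sampling test_ratio from each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, category in enumerate(categories):
        by_stratum[category].append(index)
    train, test = [], []
    for indices in by_stratum.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# Hypothetical income categories: 10% in category 1, 60% in 2, 30% in 3
categories = [1] * 100 + [2] * 600 + [3] * 300
train_idx, test_idx = stratified_split(categories, test_ratio=0.2)
test_counts = Counter(categories[i] for i in test_idx)
```

Each category contributes the same share to the test set as it has in the full data, which is the property a purely random split cannot guarantee.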
Copy strat_train_set.info()
Copy <class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 16512 non-null float64
1 latitude 16512 non-null float64
2 housing_median_age 16512 non-null float64
3 total_rooms 16512 non-null float64
4 total_bedrooms 16354 non-null float64
5 population 16512 non-null float64
6 households 16512 non-null float64
7 median_income 16512 non-null float64
8 median_house_value 16512 non-null float64
9 ocean_proximity 16512 non-null object
10 income_cat 16512 non-null category
dtypes: category(1), float64(9), object(1)
memory usage: 1.4+ MB
Copy strat_test_set.info()
Copy <class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 5241 to 2398
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 4128 non-null float64
1 latitude 4128 non-null float64
2 housing_median_age 4128 non-null float64
3 total_rooms 4128 non-null float64
4 total_bedrooms 4079 non-null float64
5 population 4128 non-null float64
6 households 4128 non-null float64
7 median_income 4128 non-null float64
8 median_house_value 4128 non-null float64
9 ocean_proximity 4128 non-null object
10 income_cat 4128 non-null category
dtypes: category(1), float64(9), object(1)
memory usage: 359.0+ KB
housing["income_cat"].value_counts() / len(housing)
# proportions in the full dataset
Copy 3 0.350581
2 0.318847
4 0.176308
5 0.114438
1 0.039826
Name: income_cat, dtype: float64
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
# proportions in the test set: almost identical to those of the full dataset
Copy 3 0.350533
2 0.318798
4 0.176357
5 0.114583
1 0.039729
Name: income_cat, dtype: float64
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
compare_props = pd.DataFrame({
"Overall": income_cat_proportions(housing),
"Stratified": income_cat_proportions(strat_test_set),
"Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props # how far the random split and the stratified split each deviate from the overall proportions
# the stratified sample has the lower error
# Restore the data to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
3. Explore and Visualize the Data to Gain Understanding
# Make a copy of the data (to avoid damaging the training data)
housing = strat_train_set.copy()
Visualizing geographical data
Copy housing.plot(kind="scatter", x="longitude", y="latitude")
save_fig("bad_visualization_plot")
Copy Saving figure bad_visualization_plot
Highlighting high-density areas
Copy housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
save_fig("better_visualization_plot")
Copy Saving figure better_visualization_plot
Showing more information
s: circle radius => population; c: color => median house value
Copy housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population", figsize=(10,7),
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)
plt.legend()
save_fig("housing_prices_scatterplot")
Copy Saving figure housing_prices_scatterplot
Copy # Download the California image
images_path = os.path.join(PROJECT_ROOT_DIR, "images", "end_to_end_project")
os.makedirs(images_path, exist_ok=True)
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
filename = "california.png" # map image; no download step is shown here, so the file must already exist in images_path
Copy import matplotlib.image as mpimg
california_img=mpimg.imread(os.path.join(images_path, filename))
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=housing['population']/100, label="Population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=False, alpha=0.4,
)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
Copy Saving figure california_housing_prices_plot
What can be observed above (which areas have high housing prices)?
Looking for Correlations
Copy corr_matrix = housing.corr()
Copy corr_matrix["median_house_value"].sort_values(ascending=False)
Copy median_house_value 1.000000
median_income 0.687160
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population -0.026920
longitude -0.047432
latitude -0.142724
Name: median_house_value, dtype: float64
Checking correlations with scatter_matrix
# from pandas.tools.plotting import scatter_matrix # For older versions of Pandas
from pandas.plotting import scatter_matrix
# Look at only a few attributes
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")
Copy Saving figure scatter_matrix_plot
Copy housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
plt.axis([0, 16, 0, 550000])
save_fig("income_vs_house_value_scatterplot")
Copy Saving figure income_vs_house_value_scatterplot
What can be observed above?
The values at $500,000 appear as a line, and fainter lines are also visible in the middle of the range.
Removing such anomalous-looking data from the training set, where possible, helps model training.
Experimenting with Attribute Combinations
New attributes can be defined by combining several existing attributes (features).
For example: rooms per household, the ratio of bedrooms to rooms, and people per household.
Copy housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
Copy corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Copy median_house_value 1.000000
median_income 0.687160
rooms_per_household 0.146285
total_rooms 0.135097
housing_median_age 0.114110
households 0.064506
total_bedrooms 0.047689
population_per_household -0.021985
population -0.026920
longitude -0.047432
latitude -0.142724
bedrooms_per_room -0.259984
Name: median_house_value, dtype: float64
What can be observed above?
bedrooms_per_room: has a clearly negative correlation (-0.26) with the median house value, much stronger than total_bedrooms or total_rooms alone.
rooms_per_household: has the second-highest positive correlation coefficient.
The new features indirectly reveal how large the houses are.
Data exploration is rarely finished in one pass; usually you build a model, analyze its problems, and then come back to explore again.
4. Prepare the Data for Machine Learning Algorithms
Data preparation can be viewed as a data transformation process.
Manual data transformation vs. automated transformation (writing functions)
Advantages of automating data transformation
Transformations can easily be reproduced on new data.
You build up a library of transformations that can be reused in the future.
In a live system, raw data can easily be turned into input for the algorithm.
Different data transformations can easily be tried out.
Copy housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()
Data Cleaning
Ways to handle missing values:
Remove the affected districts (drop the rows)
Remove the affected attribute (drop the column)
Fill them with some value (0, the mean, the median, etc.)
Copy housing.isnull().any(axis=1)
Copy 17606 False
18632 False
14650 False
3230 False
3555 False
...
6563 False
12053 False
13908 False
11159 False
15775 False
Length: 16512, dtype: bool
Copy sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head() # True if there is a null feature
sample_incomplete_rows
Copy sample_incomplete_rows.dropna(subset=["total_bedrooms"]) # option 1
Copy sample_incomplete_rows.drop("total_bedrooms", axis=1) # option 2
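median = housing["total_bedrooms"].median()
# option 3: assign the result back instead of using inplace=True on a column selection
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)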
Copy sample_incomplete_rows
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median") # fill missing values with the median
# The median can only be computed on numerical attributes, so create a copy without the text attribute
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
SimpleImputer(strategy='median')
imputer.statistics_
Copy array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. ,
408. , 3.5409])
Copy housing_num.median().values
Copy array([-118.51 , 34.26 , 29. , 2119.5 , 433. , 1164. ,
408. , 3.5409])
The fitted imputer object can now be used to replace missing values with the medians.
Copy X = imputer.transform(housing_num)
Copy array([[-121.89 , 37.29 , 38. , ..., 710. , 339. ,
2.7042],
[-121.93 , 37.05 , 14. , ..., 306. , 113. ,
6.4214],
[-117.2 , 32.77 , 31. , ..., 936. , 462. ,
2.8621],
...,
[-116.4 , 34.09 , 9. , ..., 2098. , 765. ,
3.2723],
[-118.01 , 33.82 , 31. , ..., 1356. , 356. ,
4.0625],
[-122.45 , 37.77 , 52. , ..., 1269. , 639. ,
3.575 ]])
X above is a NumPy array. It can be turned back into a pandas DataFrame.
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)
Let's check that the missing values were filled in correctly.
Copy sample_incomplete_rows.index.values
Copy array([ 4629, 6068, 17923, 13656, 19252], dtype=int64)
housing_num.loc[sample_incomplete_rows.index.values] # rows that contain NA values
housing_tr.loc[sample_incomplete_rows.index.values] # the NA values have been filled in by the imputer
Estimator: an object that estimates model parameters from a dataset is called an estimator (the imputer, for example). The estimation itself is performed by the fit() method, which takes a dataset as a parameter (for supervised learning, the dataset holding the labels is passed as an additional parameter).
Transformer: an estimator that transforms a dataset (like the imputer) is called a transformer. The transformation is performed by the transform() method, which returns the transformed dataset.
Predictor: some estimators can make predictions for a given new dataset. The LinearRegression model used earlier is also a predictor. A predictor's predict() method takes a new dataset and returns predictions, and its score() method returns an evaluation metric for those predictions.
Handling Text and Categorical Attributes
Copy housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
Copy from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
Copy array([[0.],
[0.],
[4.],
[1.],
[0.],
[1.],
[0.],
[1.],
[0.],
[0.]])
ordinal_encoder.categories_ # array of category values
Copy [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]
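What is the problem with this representation?
Model training is easier when "the closer two feature values are, the more similar the two samples are" holds.
The order of these encoded values says nothing about how close a district is to the ocean.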
One-hot encoding
Copy from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
Copy <16512x5 sparse matrix of type '<class 'numpy.float64'>'
with 16512 stored elements in Compressed Sparse Row format>
The output above shows that this is not an ordinary array but a "sparse matrix".
housing_cat_1hot.toarray() # the entry for each sample's category is 1
Copy array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
...,
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
# The sparse option selects dense array vs. sparse matrix output
# (the parameter is named sparse_output in scikit-learn 1.2 and later)
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
Copy array([[1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.],
...,
[0., 1., 0., 0., 0.],
[1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0.]])
Copy cat_encoder.categories_
Copy [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
dtype=object)]
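To make the encoding concrete, here is a small plain-Python sketch of what one-hot encoding does. The function `one_hot_encode` and the four-sample list are illustrative only; OneHotEncoder itself is implemented differently and can return sparse matrices:

```python
def one_hot_encode(values):
    """Return (categories, rows) where each row has a single 1.0 at its category's position."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[position[value]] = 1.0
        rows.append(row)
    return categories, rows

sample = ["<1H OCEAN", "NEAR OCEAN", "INLAND", "<1H OCEAN"]
categories, encoded = one_hot_encode(sample)
```

Every category gets its own column, so no artificial ordering between categories is implied.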
Scikit-Learn provides many useful transformers, but a project often needs its own special data processing. In that case you can write a custom transformer.
Methods that must be implemented: fit(), transform(), and fit_transform() (TransformerMixin supplies fit_transform() when the class inherits from it).
The custom transformer below adds two new attributes, rooms_per_household and population_per_household, to the dataset, and also adds bedrooms_per_room when add_bedrooms_per_room=True. add_bedrooms_per_room is a hyperparameter.
from sklearn.base import BaseEstimator, TransformerMixin
# column indices in the NumPy array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    # X is a NumPy array
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # concatenate the new columns to the right of X
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
Convert the NumPy data to a DataFrame
Copy housing_extra_attribs = pd.DataFrame(
housing_extra_attribs,
columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
index=housing.index)
housing_extra_attribs.head()
Feature Scaling
Min-max scaling: rescales values so that they range from 0 to 1.
Standardization: rescales values to have zero mean and unit variance (use Scikit-Learn's StandardScaler).
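The two scaling methods can be sketched with the standard library as follows. The helper names and the `rooms` values are made up for illustration; the population standard deviation is used, on the assumption that this matches StandardScaler's convention:

```python
import statistics

def min_max_scale(values):
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

rooms = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
scaled = min_max_scale(rooms)
standardized = standardize(rooms)
```

Min-max scaling is bounded but sensitive to outliers, while standardization has no fixed range but is much less affected by a single extreme value.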
When several transformations must be applied in sequence, the Pipeline class is convenient.
Copy from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
('imputer', SimpleImputer(strategy="median")),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
A Pipeline takes a list of (name, estimator) pairs.
All steps except the last must be transformers (they must have a fit_transform() method).
Calling the pipeline's fit() method calls fit_transform() on each transformer in order, passing the output of one step as the input to the next. For the last step, only fit() is called.
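The chaining behavior can be illustrated with a toy sketch. `AddOne`, `Center`, and `MiniPipeline` are hypothetical classes written for this example; they are not Scikit-Learn's implementation, which handles many more cases:

```python
class AddOne:
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [x + 1 for x in X]
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class Center:
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)  # learned from the training data only
        return self
    def transform(self, X):
        return [x - self.mean_ for x in X]
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs
    def fit_transform(self, X, y=None):
        for _name, step in self.steps:
            X = step.fit_transform(X, y)  # output of one step feeds the next
        return X
    def transform(self, X):
        for _name, step in self.steps:
            X = step.transform(X)
        return X

pipeline = MiniPipeline([("add_one", AddOne()), ("center", Center())])
train_out = pipeline.fit_transform([1.0, 2.0, 3.0])
new_out = pipeline.transform([10.0])
```

Note that `transform()` on new data reuses the mean learned during fitting, which is why the same fitted pipeline must be applied to the test set.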
Copy array([[-1.15604281, 0.77194962, 0.74333089, ..., -0.31205452,
-0.08649871, 0.15531753],
[-1.17602483, 0.6596948 , -1.1653172 , ..., 0.21768338,
-0.03353391, -0.83628902],
[ 1.18684903, -1.34218285, 0.18664186, ..., -0.46531516,
-0.09240499, 0.4222004 ],
...,
[ 1.58648943, -0.72478134, -1.56295222, ..., 0.3469342 ,
-0.03055414, -0.52177644],
[ 0.78221312, -0.85106801, 0.18664186, ..., 0.02499488,
0.06150916, -0.30340741],
[-1.43579109, 0.99645926, 1.85670895, ..., -0.22852947,
-0.09586294, 0.10180567]])
A different pipeline can also be applied to each column!
For example, if the numerical attributes and the categorical attributes need separate transformations, ColumnTransformer can be used as shown below.
Copy from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
("num", num_pipeline, num_attribs),
("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
Copy array([[-1.15604281, 0.77194962, 0.74333089, ..., 0. ,
0. , 0. ],
[-1.17602483, 0.6596948 , -1.1653172 , ..., 0. ,
0. , 0. ],
[ 1.18684903, -1.34218285, 0.18664186, ..., 0. ,
0. , 1. ],
...,
[ 1.58648943, -0.72478134, -1.56295222, ..., 0. ,
0. , 0. ],
[ 0.78221312, -0.85106801, 0.18664186, ..., 0. ,
0. , 0. ],
[-1.43579109, 0.99645926, 1.85670895, ..., 0. ,
1. , 0. ]])
Copy housing_prepared.shape, housing.shape
Copy ((16512, 16), (16512, 9))
5. Train a Model
We are finally ready to train a model!
Let's use the linear regression model covered in the previous session.
Copy from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Training the model takes just three lines of code!
Let's apply the model to a few samples and compare the predictions with the actual values.
Copy array([-55650.4116403 , -56716.45236929, 13732.83841856, -1933.1277138 ,
7330.04062103, -45708.26306673, 45455.47519691, 74714.39134154,
6605.12802802, 1042.95709453, 9249.75886697, -18016.52432168,
-55219.15208555, 110357.78363967, -22479.84008184, -14642.2671506 ])
Copy extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(lin_reg.coef_, attributes), reverse=True)
Copy [(110357.78363966991, 'ISLAND'),
(74714.39134153843, 'median_income'),
(45455.47519691441, 'households'),
(13732.83841855541, 'housing_median_age'),
(9249.75886697368, 'bedrooms_per_room'),
(7330.040621029702, 'total_bedrooms'),
(6605.128028015065, 'rooms_per_hhold'),
(1042.9570945281878, 'pop_per_hhold'),
(-1933.127713800795, 'total_rooms'),
(-14642.267150598302, 'NEAR OCEAN'),
(-18016.52432168299, '<1H OCEAN'),
(-22479.840081835082, 'NEAR BAY'),
(-45708.263066728214, 'population'),
(-55219.15208555335, 'INLAND'),
(-55650.41164030249, 'longitude'),
(-56716.45236929203, 'latitude')]
# Try the data transformation and prediction on a few samples
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared).round(decimals=1))
Copy Predictions: [210644.6 317768.8 210956.4 59219. 189747.6]
Copy print("Labels:", list(some_labels))
Copy Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
Let's measure the RMSE on the whole training set.
Copy from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
When the RMSE on the training set is as large as it is here => underfitting
Why does underfitting happen?
The features do not provide enough information.
The model is not powerful enough.
Let's try DecisionTreeRegressor, a powerful nonlinear model.
Copy from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels) # train
DecisionTreeRegressor(random_state=42)
housing_predictions = tree_reg.predict(housing_prepared) # predict
tree_mse = mean_squared_error(housing_labels, housing_predictions) # MSE
tree_rmse = np.sqrt(tree_mse) # RMSE
tree_rmse
Can we say this model is better than the linear model? How can we tell?
Validating with the test set
Doing it this way means peeking at the test set, which influences the training process, so results on yet another test set are more likely to be poor.
Validating with part of the training set held out as a validation set
k-fold cross-validation
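As a rough sketch of how k-fold cross-validation partitions the data (the generator `k_fold_indices` is hypothetical and uses contiguous, unshuffled folds for simplicity):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample is used for validation exactly once and for training k-1 times, so the model is trained and evaluated k times without touching the test set.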
Evaluation using cross-validation
Evaluating the decision tree model
Copy from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
Copy Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
71115.88230639 75585.14172901 70262.86139133 70273.6325285
75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004
Evaluating the linear regression model
Copy lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Copy Scores: [66782.73843989 66960.118071 70347.95244419 74739.57052552
68031.13388938 71193.84183426 64969.63056405 68281.61137997
71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.6740017983425
Evaluating RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42) # n_estimators = number of trees
forest_reg.fit(housing_prepared, housing_labels)
Copy RandomForestRegressor(random_state=42)
Copy housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
Copy from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Copy Scores: [49519.80364233 47461.9115823 50029.02762854 52325.28068953
49308.39426421 53446.37892622 48634.8036574 47585.73832311
53490.10699751 50021.5852922 ]
Mean: 50182.303100336096
Standard deviation: 2097.0810550985693
The random forest model gives the best results (model selection).
After selecting a model, tune it in detail.
6. Fine-Tune Your Model
Once the type of model has been chosen, it needs to be fine-tuned. This can be described as the process of finding the best hyperparameters for training the model.
Grid Search
Instead of trying hyperparameter combinations by hand, it is better to use GridSearchCV.
Copy from sklearn.model_selection import GridSearchCV
param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
scoring='neg_mean_squared_error',
return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
Copy GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
param_grid=[{'max_features': [2, 4, 6, 8],
'n_estimators': [3, 10, 30]},
{'bootstrap': [False], 'max_features': [2, 3, 4],
'n_estimators': [3, 10]}],
return_train_score=True, scoring='neg_mean_squared_error')
grid_search.best_params_ # the best hyperparameters
{'max_features': 8, 'n_estimators': 30}
grid_search.best_estimator_ # the model trained with the best hyperparameters is stored as well
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
cvres = grid_search.cv_results_
# How the mean score changes with each hyperparameter combination
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Copy 63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
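The 18 rows above correspond to the 12 + 6 combinations in the grid. A minimal sketch of how such a grid can be enumerated (the helper `expand_grid` is hypothetical and is not how GridSearchCV is implemented internally):

```python
from itertools import product

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

def expand_grid(grid):
    """List every hyperparameter combination described by a list of grids."""
    combinations = []
    for subgrid in grid:
        names = sorted(subgrid)
        for values in product(*(subgrid[name] for name in names)):
            combinations.append(dict(zip(names, values)))
    return combinations

candidates = expand_grid(param_grid)
n_fits = len(candidates) * 5  # each candidate is trained once per fold with cv=5
```

The number of fits grows multiplicatively with each added hyperparameter value, which is why exhaustive search quickly becomes expensive.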
Randomized Search
Useful when the number of hyperparameter combinations is large. Only the specified number of combinations (n_iter) is evaluated.
Copy from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
'n_estimators': randint(low=1, high=200),
'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
Copy RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000018C1A5BC508>,
'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000018C1A5AD148>},
random_state=42, scoring='neg_mean_squared_error')
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Copy 49150.70756927707 {'max_features': 7, 'n_estimators': 180}
51389.889203389284 {'max_features': 5, 'n_estimators': 15}
50796.155224308866 {'max_features': 3, 'n_estimators': 72}
50835.13360315349 {'max_features': 5, 'n_estimators': 21}
49280.9449827171 {'max_features': 7, 'n_estimators': 122}
50774.90662363929 {'max_features': 3, 'n_estimators': 75}
50682.78888164288 {'max_features': 3, 'n_estimators': 88}
49608.99608105296 {'max_features': 5, 'n_estimators': 100}
50473.61930350219 {'max_features': 3, 'n_estimators': 150}
64429.84143294435 {'max_features': 5, 'n_estimators': 2}
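Rather than reading the best row off the printed list, the fitted search object also exposes the winning combination directly. The following is a minimal sketch on synthetic stand-in data (an assumption; the notes use the housing set), with smaller parameter ranges chosen only to keep it fast.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: not the housing data used in the notes)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(low=5, high=30),
        "max_features": randint(low=1, high=8),
    },
    n_iter=5, cv=3, scoring="neg_mean_squared_error", random_state=42,
)
search.fit(X, y)

# best_params_ holds the sampled combination with the highest mean CV score
best_params = search.best_params_
# best_score_ is a negative MSE, so negate it before taking the square root
best_rmse = np.sqrt(-search.best_score_)
print(best_params, best_rmse)
```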
Feature Importance and Error Analysis
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813423, 'median_income'),
(0.16478099356159054, 'INLAND'),
(0.10879295677551575, 'pop_per_hhold'),
(0.07334423551601243, 'longitude'),
(0.06290907048262032, 'latitude'),
(0.056419179181954014, 'rooms_per_hhold'),
(0.053351077347675815, 'bedrooms_per_room'),
(0.04114379847872964, 'housing_median_age'),
(0.014874280890402769, 'population'),
(0.014672685420543239, 'total_rooms'),
(0.014257599323407808, 'households'),
(0.014106483453584104, 'total_bedrooms'),
(0.010311488326303788, '<1H OCEAN'),
(0.0028564746373201584, 'NEAR OCEAN'),
(0.0019604155994780706, 'NEAR BAY'),
(6.0280386727366e-05, 'ISLAND')]
7. Final Evaluation on the Test Set
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
8. Launch, Monitoring, and System Maintenance
To deploy to production, it is best to build and save a single pipeline that includes both the data preprocessing and the model's prediction step.
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
# Bundles data preprocessing, model training, and prediction into one pipeline and runs it
array([210644.60459286, 317768.80697211, 210956.43331178, 59218.98886849,
189747.55849879])
my_model = full_pipeline_with_predictor
import joblib
joblib.dump(my_model, "my_model.pkl")
# Serialize the fitted pipeline, including its learned parameters, to a pickle file
my_model_loaded = joblib.load("my_model.pkl")
my_model_loaded.predict(some_data)
array([210644.60459286, 317768.80697211, 210956.43331178, 59218.98886849,
189747.55849879])
Monitoring the system after launch
Over time the model becomes stale and its performance degrades.
It is therefore important to keep monitoring the system after launch.
If possible, build a monitoring system that checks whether the system is still working well.
Automatic monitoring: for a recommender system, is the sales volume of recommended products dropping?
Manual monitoring: for image classification, have experts review a sample of the classified images.
When results get worse
Has the quality of the input data degraded? A sensor failure?
A change in trends? Seasonal factors?
Maintenance
Regularly collect new (labeled) data.
Use the new data as the test set and fold the current test set into the training data.
After retraining, evaluate and compare the current model and the new model on the new test set.
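The maintenance loop above can be sketched as follows. This is only an illustration under stated assumptions: the data source `make_batch`, the batch sizes, and the use of a plain LinearRegression are hypothetical stand-ins, not the housing pipeline from the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)

def make_batch(n):
    """Hypothetical data source: y = 2x - 5 plus unit Gaussian noise."""
    X = 10 * rng.rand(n, 1)
    y = 2 * X[:, 0] - 5 + rng.randn(n)
    return X, y

X_train, y_train = make_batch(200)  # current training data
X_test, y_test = make_batch(50)     # current test data
X_new, y_new = make_batch(50)       # newly collected, labeled data

current_model = LinearRegression().fit(X_train, y_train)

# Fold the current test set into the training data; the new data becomes the test set
X_train_new = np.vstack([X_train, X_test])
y_train_new = np.concatenate([y_train, y_test])
new_model = LinearRegression().fit(X_train_new, y_train_new)

def rmse(model, X, y):
    """Root mean squared error of model predictions on (X, y)."""
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# Compare both models on the same, previously unseen test set
rmse_current = rmse(current_model, X_new, y_new)
rmse_new = rmse(new_model, X_new, y_new)
print("current:", rmse_current, "new:", rmse_new)
```

Evaluating both models on the same fresh test set keeps the comparison fair, since neither model has seen that data during training.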
[AI School Cohort 1] Week 5 DAY 2
Machine Learning Basics - Decision Theory
What is decision theory?
Given a new value x, making the optimal decision based on the probability model p(x, t).
Inference step: computing the joint probability distribution.
Decision step: making the optimal decision given the probabilities of the situation.
Decision regions - binary classification
I don't fully understand this part, but the area under the curves represents the error, which we want to minimize; the boundary that minimizes it lies at the intersection of the two curves.
Goal of decision theory (for classification)
Finding the optimal decision regions given the joint probability distribution.
Minimizing the expected loss
Not every decision carries the same risk:
Diagnosing cancer when there is none
Diagnosing no cancer when there is cancer
All knowledge about the data is expressed as a probability distribution. We assume that we do not know a sample's true class deterministically, only its probability. In other words, the samples we observe are generated from a probability distribution.
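To make expected-loss minimization concrete, here is a small sketch for the cancer example. The loss matrix and the posterior probabilities are made-up numbers chosen for illustration (assumptions, not values from the lecture): missing a cancer is assumed to cost 100 times as much as a false alarm.

```python
import numpy as np

# Hypothetical loss matrix L[true_class, decision]; classes: 0 = healthy, 1 = cancer.
# Row 1, column 0 (cancer present but diagnosed healthy) carries the largest loss.
loss = np.array([[0.0, 1.0],
                 [100.0, 0.0]])

def best_decision(posterior, loss):
    """Return the decision with the smallest expected loss, plus all expected losses."""
    # expected[j] = sum_k p(C_k | x) * L[k, j]
    expected = posterior @ loss
    return int(np.argmin(expected)), expected

# Hypothetical posterior p(C_k | x): 95% healthy, 5% cancer
posterior = np.array([0.95, 0.05])
decision, expected = best_decision(posterior, loss)
print(decision, expected)
```

With these assumed numbers the expected loss of deciding "healthy" is 0.05 * 100 = 5.0, while deciding "cancer" costs 0.95 * 1 = 0.95, so the lower-risk decision is "cancer" even though "healthy" is the more probable class.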
Machine Learning Basics - Linear Regression
A method for modeling the given data with a straight line.
A straight-line function has the form $y = ax + b$,
where a is the slope and b is the y-intercept.
The plot below shows data generated from a straight line with slope 2 and y-intercept -5.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
''' numpy.random.RandomState is a class that bundles random number generator functions.
RandomState provides generators for many different probability distributions,
e.g. numpy.random.uniform (samples from a uniform distribution), numpy.random.normal (samples from a normal distribution), etc.
Each method takes size as an argument, which defaults to None.
If size is None, a single value is generated and returned. If size is an integer, a 1-D array filled with random values is returned.
If size is a tuple, an array of that shape filled with random values is returned. '''
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)  # uniform values in [0, 10)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);
# Draw a random integer from the discrete uniform distribution over 0 to m-1 (here m = 6)
np.random.randint(6)
# Create an array with m rows and n columns of random numbers drawn uniformly from [0, 1)
np.random.rand(3, 3)
array([[0.41972662, 0.91250462, 0.32922597],
[0.35029654, 0.08989692, 0.93321008],
[0.04695859, 0.02030855, 0.82914045]])
# Create an array with m rows and n columns of random numbers drawn from the standard normal distribution (mean 0, standard deviation 1)
np.random.randn(3, 3)
array([[ 1.85523496, -1.2565864 , -0.83251712],
[-0.99798501, -0.94665524, 0.73052923],
[-0.11112169, 0.12296838, 1.37482645]])
array([4.17022005e+00, 7.20324493e+00, 1.14374817e-03, 3.02332573e+00,
1.46755891e+00, 9.23385948e-01, 1.86260211e+00, 3.45560727e+00,
3.96767474e+00, 5.38816734e+00, 4.19194514e+00, 6.85219500e+00,
2.04452250e+00, 8.78117436e+00, 2.73875932e-01, 6.70467510e+00,
4.17304802e+00, 5.58689828e+00, 1.40386939e+00, 1.98101489e+00,
8.00744569e+00, 9.68261576e+00, 3.13424178e+00, 6.92322616e+00,
8.76389152e+00, 8.94606664e+00, 8.50442114e-01, 3.90547832e-01,
1.69830420e+00, 8.78142503e+00, 9.83468338e-01, 4.21107625e+00,
9.57889530e+00, 5.33165285e+00, 6.91877114e+00, 3.15515631e+00,
6.86500928e+00, 8.34625672e+00, 1.82882773e-01, 7.50144315e+00,
9.88861089e+00, 7.48165654e+00, 2.80443992e+00, 7.89279328e+00,
1.03226007e+00, 4.47893526e+00, 9.08595503e+00, 2.93614148e+00,
2.87775339e+00, 1.30028572e+00])
# Add a dimension.
# model.fit, used below, expects the training data as a 2-D array.
x[:, np.newaxis][:10]
array([[4.17022005e+00],
[7.20324493e+00],
[1.14374817e-03],
[3.02332573e+00],
[1.46755891e+00],
[9.23385948e-01],
[1.86260211e+00],
[3.45560727e+00],
[3.96767474e+00],
[5.38816734e+00]])
# Create an array of 1000 evenly spaced values from 0 to 10
np.linspace(0, 10, 1000)[range(50,1000,20)], np.linspace(0, 10, 1000)[[0, 500, 999]]
(array([0.5005005 , 0.7007007 , 0.9009009 , 1.1011011 , 1.3013013 ,
1.5015015 , 1.7017017 , 1.9019019 , 2.1021021 , 2.3023023 ,
2.5025025 , 2.7027027 , 2.9029029 , 3.1031031 , 3.3033033 ,
3.5035035 , 3.7037037 , 3.9039039 , 4.1041041 , 4.3043043 ,
4.5045045 , 4.7047047 , 4.9049049 , 5.10510511, 5.30530531,
5.50550551, 5.70570571, 5.90590591, 6.10610611, 6.30630631,
6.50650651, 6.70670671, 6.90690691, 7.10710711, 7.30730731,
7.50750751, 7.70770771, 7.90790791, 8.10810811, 8.30830831,
8.50850851, 8.70870871, 8.90890891, 9.10910911, 9.30930931,
9.50950951, 9.70970971, 9.90990991]),
array([ 0. , 5.00500501, 10. ]))
Scikit-Learn's LinearRegression estimator can be used to find the straight line that best fits the data above.
from sklearn.linear_model import LinearRegression
# Instantiate the model class
model = LinearRegression(fit_intercept=True)
# fit takes the training data x and the label data y as arguments
model.fit(x[:, np.newaxis], y)
# Generate new input data and predict on it
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);
x and y are drawn as points with plt.scatter.
xfit and yfit are drawn with plt.plot, which connects the 1000 closely spaced points,
so they appear as a smooth line.
After training, the learned parameters are stored as attributes of the form model.<parameter_name>_ (with a trailing underscore). The slope and y-intercept can be printed as follows.
print("Model slope : ", model.coef_[0])
print("Model intercept : ", model.intercept_)
Model slope : 2.0272088103606944
Model intercept : -4.9985770855532
The LinearRegression estimator can handle not only one-dimensional inputs, as in the example above, but also linear models with multidimensional inputs. A multidimensional linear model has the following form.
$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots$
Geometrically, this amounts to fitting a hyperplane to the data.
# Generate random values with 100 rows and 3 columns
rng.rand(100, 3)[:5], rng.rand(100, 3).shape
(array([[0.76778898, 0.53600849, 0.03985993],
[0.13479312, 0.1934164 , 0.3356638 ],
[0.05231295, 0.60511678, 0.51206103],
[0.61746101, 0.43235559, 0.84770047],
[0.45405906, 0.01540352, 0.87306815]]),
(100, 3))
rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 3)
y = 0.5 + np.dot(X, [1.5, -2., 1.])
model.fit(X, y)
print("Model intercept : ", model.intercept_)
print("Model slope : ", model.coef_)
Model intercept : 0.5000000000000087
Model slope : [ 1.5 -2. 1. ]
The y values were generated by multiplying randomly generated three-dimensional x values by the coefficients, and linear regression was able to recover those coefficients.
What if the data do not have a linear relationship?
Linear Basis Function Models
One way to model nonlinear data with a linear function is to use basis functions.
For example, suppose we use the following linear function.
$y = a_0 + a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots$
Here $x_1, x_2, x_3$, and so on can be generated from a one-dimensional $x$:
$x_n = f_n(x)$, where $f_n$ is called a basis function. If we use the basis function $f_n(x) = x^n$, the final model becomes
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots$
This model is still linear in its coefficients. So by expanding the one-dimensional variable $x$ into multiple dimensions through basis functions, we can still use a linear model.
Polynomial Basis Function
A function of the form $f_n(x) = x^n$ is called a polynomial basis function. Scikit-Learn already includes a transformer for this called PolynomialFeatures.
from sklearn.preprocessing import PolynomialFeatures
x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
poly.fit_transform(x[:, None])
array([[ 2., 4., 8.],
[ 3., 9., 27.],
[ 4., 16., 64.]])
'''
array([[ 2., 4., 8.],    powers of 2: 2^1, 2^2, 2^3
       [ 3., 9., 27.],   powers of 3: 3^1, 3^2, 3^3
       [ 4., 16., 64.]]) powers of 4: 4^1, 4^2, 4^3
'''
We can see that PolynomialFeatures has transformed the one-dimensional array into a three-dimensional one, with one column per power. Data transformed this way can be fed to a linear model.
Let's apply a degree-7 transformation.
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(PolynomialFeatures(7),
                           LinearRegression())
Using multidimensional transformations lets us model complex data. For example, let's generate data with the sine function and model it.
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
plt.scatter(x, y);
poly_model.fit(x[:, np.newaxis], y)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);
Gaussian Basis Function
Let's try a basis function other than the polynomial one. The Gaussian basis function is defined as follows.
$\exp\left\{-\frac{(x - u_j)^2}{2 s^2}\right\}$
$u_j$ determines the location of the function and $s$ its width. We can try to represent the given data as a sum of several Gaussian basis functions.
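As a rough sketch of this idea, the snippet below expands a one-dimensional input into Gaussian basis features and fits an ordinary LinearRegression on top. The helper `gaussian_features`, the choice of 10 evenly spaced centers, and the width of 1.0 are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gaussian_features(x, centers, width):
    """Map 1-D inputs to one column per center: exp(-(x - u_j)^2 / (2 s^2))."""
    x = np.asarray(x, dtype=float)
    diff = x[:, np.newaxis] - centers[np.newaxis, :]
    return np.exp(-(diff ** 2) / (2 * width ** 2))

# Same noisy sine data as in the polynomial example
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

centers = np.linspace(0, 10, 10)  # assumed: 10 evenly spaced centers u_j
width = 1.0                       # assumed: width s

# The model stays linear in its coefficients; only the inputs are transformed
Phi = gaussian_features(x, centers, width)
gauss_model = LinearRegression().fit(Phi, y)

xfit = np.linspace(0, 10, 1000)
yfit = gauss_model.predict(gaussian_features(xfit, centers, width))
```

Plotting `xfit` against `yfit` over a scatter of `x` and `y`, as in the polynomial example, shows how well the sum of Gaussians follows the sine curve; the number of centers and the width control how flexible the fit is.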