[DAY 35] Machine Learning Practice

[Chunjae Education] Project-Based Big Data Service Developer Training Course, Cohort 9
Study date: 2024.08.30

📕 Topics Covered

Developing a clustering model for SNS ad targeting

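📗 Project Work Log

1) Project title

Customer clustering model for SNS ad targeting

2) Project goal

Identify the customer segments with the highest purchase rate, then build an SNS target-marketing strategy around them to maximize marketing efficiency and improve customer satisfaction

3) Dataset

Data: SNS user information and whether each user purchased the product
- Analyze purchase patterns based on customer age and annual income, cluster the customers into segments, and identify the target customers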

4) Workflow

① Package imports

Main packages used: pandas, seaborn, numpy, matplotlib, scikit-learn

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

② Load and summarize the data

df = pd.read_csv("sns_data.csv")
df.info()  # info() prints its report itself and returns None, so no print() is needed
print(df.describe())

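③ Data preprocessing

Handle missing values and remove outliers

# Drop rows with missing values, then drop rows at or above the 95th percentile of EstimatedSalary
df = df.dropna()
df = df[df['EstimatedSalary'] < df['EstimatedSalary'].quantile(0.95)].copy()  # copy() avoids SettingWithCopyWarning below

# Encode 'Gender' as 0 (Male) / 1 (Female)
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})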
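
One thing worth checking after the encoding step: `Series.map` with a dict returns NaN for any value that is not a key, so a stray spelling would silently reintroduce missing values after `dropna()` has already run. A minimal sketch of that check, using a hypothetical toy series (the lowercase "female" entry is made up for illustration and is not from the actual dataset):

```python
import pandas as pd

# Hypothetical toy column; the lowercase "female" is an invented inconsistency
gender = pd.Series(["Male", "Female", "female", "Male"])

# Same mapping as in the preprocessing step
encoded = gender.map({"Male": 0, "Female": 1})

# Values that were not in the mapping come back as NaN
unmapped = gender[encoded.isna()].unique()
print("Unmapped values:", list(unmapped))
print("NaN count after encoding:", int(encoded.isna().sum()))
```

If the list is not empty, the labels can be normalized (for example with `str.strip()` and `str.capitalize()`) before mapping.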

④ Scaling and variable transformation

Scale with StandardScaler

scaler = StandardScaler()
df[['Age', 'EstimatedSalary']] = scaler.fit_transform(df[['Age', 'EstimatedSalary']])
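
For reference, StandardScaler subtracts each column's mean and divides by its population standard deviation (ddof=0). The sketch below checks that against a manual z-score on hypothetical toy values standing in for Age and EstimatedSalary; the numbers are not from the project data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values: columns stand in for Age and EstimatedSalary
X = np.array([[25.0, 30000.0],
              [35.0, 60000.0],
              [45.0, 90000.0],
              [55.0, 150000.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score: (x - column mean) / column population std (NumPy's default ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("Matches manual z-score:", np.allclose(scaled, manual))
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds: ", scaled.std(axis=0).round(6))
```

Because KMeans and Ward linkage both rely on Euclidean distance, putting Age and EstimatedSalary on the same scale keeps the salary column from dominating the distance.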

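⑤ Exploratory data analysis (EDA)

# Check the distributions of Age and EstimatedSalary (one panel each so they do not overlap)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(df['Age'], kde=True, ax=axes[0])
sns.histplot(df['EstimatedSalary'], kde=True, ax=axes[1])
plt.tight_layout()
plt.show()

# Inspect the latent cluster structure with t-SNE
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)  # fixed seed so the embedding is reproducible
tsne_results = tsne.fit_transform(df[['Age', 'EstimatedSalary']])
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=df['Gender'])  # color points by encoded Gender
plt.title("t-SNE Visualization")
plt.show()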

⑥ Model training

KMeans / Hierarchical Clustering

# KMeans clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df['Cluster_KMeans'] = kmeans.fit_predict(df[['Age', 'EstimatedSalary']])

# Hierarchical clustering
# Ward linkage always uses Euclidean distance; the `affinity` argument was
# deprecated in scikit-learn 1.2 and later removed, so it is omitted here
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
df['Cluster_Hierarchical'] = hc.fit_predict(df[['Age', 'EstimatedSalary']])
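
Since two different algorithms are fit on the same features, it can help to check how much their partitions agree. The sketch below is an addition, not part of the original workflow: it uses the adjusted Rand index (`adjusted_rand_score`), which ignores how the cluster IDs happen to be numbered, on synthetic blobs that stand in for the scaled Age/EstimatedSalary data.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the scaled features (assumption: 4 well-separated groups)
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

labels_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels_km, labels_hc)
print(f"Adjusted Rand index (KMeans vs. Ward): {ari:.3f}")
```

On the real data the same call would take `df['Cluster_KMeans']` and `df['Cluster_Hierarchical']` as its two arguments.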

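⑦ Performance evaluation and visualization

# Elbow Method
inertia = []
for k in range(1, 10):
    # Use a separate name so the fitted k=4 `kmeans` model from step ⑥ is not overwritten
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(df[['Age', 'EstimatedSalary']])
    inertia.append(km.inertia_)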
plt.plot(range(1, 10), inertia, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()

# Silhouette Score
score_kmeans = silhouette_score(df[['Age', 'EstimatedSalary']], df['Cluster_KMeans'])
score_hc = silhouette_score(df[['Age', 'EstimatedSalary']], df['Cluster_Hierarchical'])
print(f"KMeans Silhouette Score: {score_kmeans}")
print(f"Hierarchical Clustering Silhouette Score: {score_hc}")
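
The elbow plot can be ambiguous, so a silhouette scan over several values of k is a common cross-check. The following is a minimal sketch on synthetic data (the blobs and the range 2 to 8 are assumptions for illustration, not the project's data or settings); silhouette needs at least two clusters, which is why the scan starts at k=2.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled Age / EstimatedSalary features
centers = [[0, 0], [5, 5], [0, 5], [5, 0]]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all samples, in [-1, 1]

# Pick the k with the highest mean silhouette
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Best k by silhouette:", best_k)
```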

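5) Project results

Implemented features
- Customer clustering with K-Means and Hierarchical Clustering
- Selection of the optimal number of clusters (k=4)
- Evaluation of each cluster using the silhouette score and purchase rate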
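
The purchase-rate part of the evaluation is not shown in the workflow code, so here is one way it could be sketched. The column name `Purchased` (0/1) is an assumption about the dataset, and the toy rows below are invented for illustration only.

```python
import pandas as pd

# Invented toy rows; `Purchased` is an assumed 0/1 column name
toy = pd.DataFrame({
    "Cluster_KMeans": [0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3],
    "Purchased":      [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Per-cluster size and purchase rate (the mean of a 0/1 column is the rate)
summary = (
    toy.groupby("Cluster_KMeans")["Purchased"]
       .agg(customers="count", purchase_rate="mean")
       .sort_values("purchase_rate", ascending=False)
)
print(summary)

# The cluster with the highest purchase rate is the candidate target segment
target_cluster = summary.index[0]
print("Target cluster:", target_cluster)
```

Looking at `customers` next to `purchase_rate` matters because a very small cluster can show a high rate by chance.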

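6) Troubleshooting

Error: "ValueError: array must not contain infs or NaNs"
- The error was traced to outliers in the annual income data
Fix: Applied a log transform to the annual income variable and then standardized it, which reduced the influence of the outliers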
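
A minimal sketch of the log-then-standardize fix, under a few assumptions: the salary values below are hypothetical, `np.log1p` is used as the log transform (the post does not say which one was applied), and a finiteness check is added first because this error message refers to infinite or missing values in the input array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical annual incomes; the last value is an invented extreme outlier
salary = np.array([15000.0, 32000.0, 48000.0, 61000.0, 75000.0, 900000.0])

# Guard: log1p needs finite values greater than -1; incomes should be non-negative
if not (np.isfinite(salary).all() and (salary >= 0).all()):
    raise ValueError("salary contains NaN, inf, or negative values")

# Standardize the raw values and the log-transformed values for comparison
raw_scaled = StandardScaler().fit_transform(salary.reshape(-1, 1)).ravel()
log_scaled = StandardScaler().fit_transform(np.log1p(salary).reshape(-1, 1)).ravel()

print("Largest z-score, raw:       ", round(float(raw_scaled.max()), 3))
print("Largest z-score, log1p first:", round(float(log_scaled.max()), 3))
```

The log transform compresses the long right tail, so the outlier ends up closer to the rest of the values after standardization.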

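7) Skills gained from the project

Segmenting customers with clustering models and identifying target customers
Evaluating performance based on silhouette score and purchase rate
Applying EDA techniques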

📙 Tomorrow's Schedule

Machine learning practice
