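Background and Goals
Competition within the industry
Competition in civil aviation is no longer limited to the three major airlines: newly emerging small carriers, privately owned airlines, and even foreign aviation giants are joining in. Airline capacity exceeds demand and products are increasingly homogeneous, so airlines have gradually shifted from competing on price and service to competing for customers.
Competition from outside the industry
The expansion of high-speed rail and other fast train services has hit airlines hard.
Goals
Use airline customer data to segment customers.
Analyze the characteristics of each customer segment and compare customer value across segments.
Provide personalized services to segments of different value and design marketing strategies accordingly.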
    import pandas as pd

Dataset download
Reading the data
    data = pd.read_csv('air_data.csv', encoding='gb18030')
    data.head()

       MEMBER_NO  FFP_DATE   FIRST_FLIGHT_DATE  GENDER  FFP_TIER  WORK_CITY    WORK_PROVINCE  WORK_COUNTRY  AGE   LOAD_TIME  ...
    0  54993      2006/11/2  2008/12/24         男      6         .            北京           CN            31.0  2014/3/31  ...
    1  28065      2007/2/19  2007/8/3           男      6         NaN          北京           CN            42.0  2014/3/31  ...
    2  55106      2007/2/1   2007/8/30          男      6         .            北京           CN            40.0  2014/3/31  ...
    3  21189      2008/8/22  2008/8/23          男      5         Los Angeles  CA             US            64.0  2014/3/31  ...
    4  39546      2009/4/10  2009/4/15          男      6         贵阳         贵州           CN            48.0  2014/3/31  ...

       ...  ADD_Point_SUM  Eli_Add_Point_Sum  L1Y_ELi_Add_Points  Points_Sum  L1Y_Points_Sum  Ration_L1Y_Flight_Count  Ration_P1Y_Flight_Count  Ration_P1Y_BPS  Ration_L1Y_BPS  Point_NotFlight
    0  ...  39992          114452             111100              619760      370211          0.509524                 0.490476                 0.487221        0.512777        50
    1  ...  12000          53288              53288               415768      238410          0.514286                 0.485714                 0.489289        0.510708        33
    2  ...  15491          55202              51711               406361      233798          0.518519                 0.481481                 0.481467        0.518530        26
    3  ...  0              34890              34890               372204      186100          0.434783                 0.565217                 0.551722        0.448275        12
    4  ...  22704          64969              64969               338813      210365          0.532895                 0.467105                 0.469054        0.530943        39

    5 rows × 44 columns
Data exploration
View summary information for the data

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 62988 entries, 0 to 62987
    Data columns (total 44 columns):
     #   Column                   Non-Null Count  Dtype
    ---  ------                   --------------  -----
     0   MEMBER_NO                62988 non-null  int64
     1   FFP_DATE                 62988 non-null  object
     2   FIRST_FLIGHT_DATE        62988 non-null  object
     3   GENDER                   62985 non-null  object
     4   FFP_TIER                 62988 non-null  int64
     5   WORK_CITY                60719 non-null  object
     6   WORK_PROVINCE            59740 non-null  object
     7   WORK_COUNTRY             62962 non-null  object
     8   AGE                      62568 non-null  float64
     9   LOAD_TIME                62988 non-null  object
     10  FLIGHT_COUNT             62988 non-null  int64
     11  BP_SUM                   62988 non-null  int64
     12  EP_SUM_YR_1              62988 non-null  int64
     13  EP_SUM_YR_2              62988 non-null  int64
     14  SUM_YR_1                 62437 non-null  float64
     15  SUM_YR_2                 62850 non-null  float64
     16  SEG_KM_SUM               62988 non-null  int64
     17  WEIGHTED_SEG_KM          62988 non-null  float64
     18  LAST_FLIGHT_DATE         62988 non-null  object
     19  AVG_FLIGHT_COUNT         62988 non-null  float64
     20  AVG_BP_SUM               62988 non-null  float64
     21  BEGIN_TO_FIRST           62988 non-null  int64
     22  LAST_TO_END              62988 non-null  int64
     23  AVG_INTERVAL             62988 non-null  float64
     24  MAX_INTERVAL             62988 non-null  int64
     25  ADD_POINTS_SUM_YR_1      62988 non-null  int64
     26  ADD_POINTS_SUM_YR_2      62988 non-null  int64
     27  EXCHANGE_COUNT           62988 non-null  int64
     28  avg_discount             62988 non-null  float64
     29  P1Y_Flight_Count         62988 non-null  int64
     30  L1Y_Flight_Count         62988 non-null  int64
     31  P1Y_BP_SUM               62988 non-null  int64
     32  L1Y_BP_SUM               62988 non-null  int64
     33  EP_SUM                   62988 non-null  int64
     34  ADD_Point_SUM            62988 non-null  int64
     35  Eli_Add_Point_Sum        62988 non-null  int64
     36  L1Y_ELi_Add_Points       62988 non-null  int64
     37  Points_Sum               62988 non-null  int64
     38  L1Y_Points_Sum           62988 non-null  int64
     39  Ration_L1Y_Flight_Count  62988 non-null  float64
     40  Ration_P1Y_Flight_Count  62988 non-null  float64
     41  Ration_P1Y_BPS           62988 non-null  float64
     42  Ration_L1Y_BPS           62988 non-null  float64
     43  Point_NotFlight          62988 non-null  int64
    dtypes: float64(12), int64(24), object(8)
    memory usage: 21.1+ MB

Seven of the 44 columns contain missing values (GENDER, WORK_CITY, WORK_PROVINCE, WORK_COUNTRY, AGE, SUM_YR_1, SUM_YR_2).
Eight columns have dtype object.
Check the dataset for duplicates
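    data.shape

    (62988, 44)

    data.drop_duplicates().shape

    (62988, 44)

The shape is unchanged after dropping duplicates, so the dataset contains no duplicate rows.
View the missing values in detail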
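Note that `drop_duplicates()` with no arguments only flags rows that match in every column. A member who appears twice with slightly different values would pass that check unnoticed. The sketch below uses a small made-up frame (the values are illustrative, not taken from `air_data.csv`) to contrast a full-row check with a check on a key column; treating `MEMBER_NO` as the key is an assumption.

```python
import pandas as pd

# Toy frame standing in for the airline data (values are made up).
toy = pd.DataFrame({
    "MEMBER_NO": [101, 102, 102, 103],
    "FLIGHT_COUNT": [10, 4, 5, 7],
})

# Full-row comparison: no two rows are identical, so nothing is flagged.
full_row_dupes = toy.duplicated().sum()

# Key-based comparison: member 102 appears twice.
key_dupes = toy.duplicated(subset=["MEMBER_NO"]).sum()

print(full_row_dupes, key_dupes)
```

On the real data, the same check would be `data.duplicated(subset=['MEMBER_NO']).sum()`.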
    data.isna().sum()

    MEMBER_NO                     0
    FFP_DATE                      0
    FIRST_FLIGHT_DATE             0
    GENDER                        3
    FFP_TIER                      0
    WORK_CITY                  2269
    WORK_PROVINCE              3248
    WORK_COUNTRY                 26
    AGE                         420
    LOAD_TIME                     0
    FLIGHT_COUNT                  0
    BP_SUM                        0
    EP_SUM_YR_1                   0
    EP_SUM_YR_2                   0
    SUM_YR_1                    551
    SUM_YR_2                    138
    SEG_KM_SUM                    0
    WEIGHTED_SEG_KM               0
    LAST_FLIGHT_DATE              0
    AVG_FLIGHT_COUNT              0
    AVG_BP_SUM                    0
    BEGIN_TO_FIRST                0
    LAST_TO_END                   0
    AVG_INTERVAL                  0
    MAX_INTERVAL                  0
    ADD_POINTS_SUM_YR_1           0
    ADD_POINTS_SUM_YR_2           0
    EXCHANGE_COUNT                0
    avg_discount                  0
    P1Y_Flight_Count              0
    L1Y_Flight_Count              0
    P1Y_BP_SUM                    0
    L1Y_BP_SUM                    0
    EP_SUM                        0
    ADD_Point_SUM                 0
    Eli_Add_Point_Sum             0
    L1Y_ELi_Add_Points            0
    Points_Sum                    0
    L1Y_Points_Sum                0
    Ration_L1Y_Flight_Count       0
    Ration_P1Y_Flight_Count       0
    Ration_P1Y_BPS                0
    Ration_L1Y_BPS                0
    Point_NotFlight               0
    dtype: int64

Data preprocessing
Preprocess the data according to the business logic
Delete records whose fare is missing
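To judge how much data a row-dropping rule would discard, the missing-value counts above can be read as a share of rows: 551 of 62988 rows (about 0.9%) lack SUM_YR_1 and 138 (about 0.2%) lack SUM_YR_2. A minimal sketch of such a summary, run on a made-up toy frame rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (values are made up for illustration).
toy = pd.DataFrame({
    "SUM_YR_1": [1200.0, np.nan, 0.0, 560.0],
    "SUM_YR_2": [800.0, 300.0, np.nan, 0.0],
    "AGE": [31.0, 42.0, np.nan, 64.0],
    "FLIGHT_COUNT": [10, 4, 5, 7],
})

# Count and share of missing values per column.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share": toy.isna().mean(),
})

# Keep only columns that actually have gaps, largest first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Replacing `toy` with `data` should give the same kind of table for the airline dataset.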
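    # Keep only rows where both yearly fare columns are present
    data = data[~data['SUM_YR_1'].isna()]
    data = data[~data['SUM_YR_2'].isna()]

Discard records where the fare is 0 in both years while the average discount rate is nonzero and the total flight distance is greater than 0

    index_yr_1 = data['SUM_YR_1'] == 0
    index_yr_2 = data['SUM_YR_2'] == 0
    index_avg_discount = data['avg_discount'] != 0
    index_seg_km_sum = data['SEG_KM_SUM'] > 0
    # Drop rows that satisfy all four conditions at once; .copy() avoids
    # chained-assignment warnings when new columns are added later
    data = data[~(index_yr_1 & index_yr_2 & index_avg_discount & index_seg_km_sum)].copy()

Feature construction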
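Before building features, the combined filter above can be sanity-checked on a toy frame. Only rows that meet all four conditions together are dropped: a customer with zero fare in both years who nonetheless flew a positive distance at a nonzero discount. The values below are made up for illustration.

```python
import pandas as pd

# Four illustrative customers (made-up values).
toy = pd.DataFrame({
    "SUM_YR_1":     [0.0, 0.0, 500.0, 0.0],
    "SUM_YR_2":     [0.0, 300.0, 0.0, 0.0],
    "avg_discount": [0.8, 0.9, 0.7, 0.0],
    "SEG_KM_SUM":   [1500, 2000, 800, 0],
})

# A row is inconsistent only if every condition holds simultaneously.
inconsistent = (
    (toy["SUM_YR_1"] == 0)
    & (toy["SUM_YR_2"] == 0)
    & (toy["avg_discount"] != 0)
    & (toy["SEG_KM_SUM"] > 0)
)

kept = toy[~inconsistent]
print(kept.index.tolist())
```

Row 0 is the only one removed. Row 3 also has zero fare in both years, but it survives because its discount and distance are zero as well.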
Convert the membership enrollment date and the observation-window end date to datetime

    data['FFP_DATE'] = pd.to_datetime(data['FFP_DATE'])
    data['LOAD_TIME'] = pd.to_datetime(data['LOAD_TIME'])

Construct the LRFMC features
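    # L: membership length in months (days / 30)
    data['L'] = (data['LOAD_TIME'] - data['FFP_DATE']).dt.days / 30
    # R: days from the last flight to the end of the observation window
    data['R'] = data['LAST_TO_END']
    # F: number of flights in the observation window
    data['F'] = data['FLIGHT_COUNT']
    # M: total distance flown
    data['M'] = data['SEG_KM_SUM']
    # C: average discount rate
    data['C'] = data['avg_discount']
    new_data = data[['L', 'R', 'F', 'M', 'C']]

    # Standardize each feature to zero mean and unit variance
    # (StandardScaler is z-score standardization, not min-max normalization)
    from sklearn.preprocessing import StandardScaler
    standard_scaler = StandardScaler()
    new_data = standard_scaler.fit_transform(new_data)

Run cluster analysis on the selected features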
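The number of clusters is fixed at K = 5 in what follows, with no stated justification. One common heuristic for checking such a choice is the elbow method: fit K-means for a range of K and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below is an illustration on synthetic data from `make_blobs`, not the author's procedure. The sample size, the range of K and `random_state` are all assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the five standardized LRFMC features.
X, _ = make_blobs(n_samples=600, centers=4, n_features=5, random_state=0)
X = StandardScaler().fit_transform(X)

# Fit K-means for several K and record the within-cluster sum of squares.
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

To apply this to the airline data, replace `X` with the standardized `new_data` array.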
    from sklearn.cluster import KMeans
    K = 5
    kmeans_model = KMeans(n_clusters=K)
    kmeans_model.fit(new_data)

    KMeans(n_clusters=5)

    # Cluster centers (one row per cluster; columns in the order L, R, F, M, C)
    kmeans_model.cluster_centers_

    array([[ 0.06233588, -0.00491015, -0.22698558, -0.23067735,  2.18473313],
           [-0.31475997,  1.68627777, -0.57395798, -0.53669472, -0.17389533],
           [ 0.48278764, -0.7994434 ,  2.48377146,  2.42496111,  0.30842462],
           [ 1.16095498, -0.37732257, -0.0867671 , -0.09463321, -0.159118  ],
           [-0.70027213, -0.41492263, -0.16102864, -0.1609829 , -0.2536324 ]])

    # Cluster labels: number of customers assigned to cluster 4
    sum(kmeans_model.labels_ == 4)

    24665

Visualization
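Before plotting, it can help to express the cluster centers in the original units, since standardized values such as 2.18 are hard to interpret as a discount rate. `StandardScaler.inverse_transform` maps the centers back, and `np.bincount` gives the size of every cluster at once. The sketch below runs on randomly generated LRFMC-like data. The value ranges and `random_state=0` are assumptions. Without a fixed `random_state`, cluster numbering can differ from run to run.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in for the LRFMC features (ranges are made up).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "L": rng.uniform(12, 120, 300),
    "R": rng.uniform(1, 730, 300),
    "F": rng.integers(2, 200, 300),
    "M": rng.uniform(500, 300000, 300),
    "C": rng.uniform(0.3, 1.5, 300),
})

# Standardize, then cluster with a fixed seed for reproducibility.
scaler = StandardScaler()
scaled = scaler.fit_transform(features)
model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaled)

# Map the centers back to original units and attach cluster sizes.
centers = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_),
    columns=features.columns,
)
centers["size"] = np.bincount(model.labels_, minlength=5)
print(centers.round(2))
```

With the objects from this article, the equivalent call would be `standard_scaler.inverse_transform(kmeans_model.cluster_centers_)`.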
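    # Draw a radar chart
    import matplotlib.pyplot as plt
    # Use a font that can render Chinese labels, and keep minus signs readable
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    # Create a polar axis for the radar chart
    ax = plt.subplot(111, polar=True)
    plt.show()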
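The plotting code above creates a polar axis but draws nothing on it. One way the radar chart could be completed is sketched below, using the standardized cluster centers printed earlier, with one closed polygon per cluster. The output filename `lrfmc_radar.png`, the legend placement and the `Agg` backend are assumptions made so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); drop this in a notebook
import matplotlib.pyplot as plt
import numpy as np

labels = ["L", "R", "F", "M", "C"]
# Standardized cluster centers copied from the K-means output above.
centers = np.array([
    [0.06233588, -0.00491015, -0.22698558, -0.23067735, 2.18473313],
    [-0.31475997, 1.68627777, -0.57395798, -0.53669472, -0.17389533],
    [0.48278764, -0.7994434, 2.48377146, 2.42496111, 0.30842462],
    [1.16095498, -0.37732257, -0.0867671, -0.09463321, -0.159118],
    [-0.70027213, -0.41492263, -0.16102864, -0.1609829, -0.2536324],
])

# One angle per feature; repeat the first angle to close each polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
closed_angles = np.concatenate([angles, angles[:1]])

fig, ax = plt.subplots(subplot_kw={"polar": True})
for i, center in enumerate(centers):
    values = np.concatenate([center, center[:1]])
    ax.plot(closed_angles, values, label=f"cluster {i}")

ax.set_xticks(angles)
ax.set_xticklabels(labels)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
fig.savefig("lrfmc_radar.png", bbox_inches="tight")
```

Read against the centers, cluster 2 stands out on F and M (frequent, long-distance flyers) and cluster 0 on C (high average discount rate), while cluster 1 has the largest R, meaning its members have gone longest without flying.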
