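Airbnb Short-Term Rentals

Raw Data

The datasets used are shown in the figure below. They are fairly rich and can be explored along several dimensions.

Understanding the Data

Importing the Data

The largest dataset is calendar_detail, with about ten million rows recording the daily status of each listing. Next come listings_detail and listings, while the reviews files mostly contain user reviews.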

import pandas as pd
import numpy as np

path1 = '/home/jhon/Desktop/DATA/renting/calendar_detail.csv'
path2 = '/home/jhon/Desktop/DATA/renting/listings_detail.csv'
path3 = '/home/jhon/Desktop/DATA/renting/listings.csv'
path4 = '/home/jhon/Desktop/DATA/renting/reviews.csv'
path5 = '/home/jhon/Desktop/DATA/renting/reviews_detail.csv'

calendar = pd.read_csv(path1)
listings = pd.read_csv(path3)
reviews = pd.read_csv(path5)
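With roughly ten million rows, it can help to load only the columns that are needed and to set types during the read. The following is a minimal sketch, not the approach used in this post: it reads a tiny in-memory sample (the rows and the name calendar_small are made up for illustration) instead of the real file.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for calendar_detail.csv (hypothetical sample rows).
sample_csv = io.StringIO(
    "listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights\n"
    "1165040,2019-04-17,f,$511.00,$511.00,1,1125\n"
    "1165040,2019-04-18,t,$511.00,$511.00,1,1125\n"
    '1165041,2019-04-17,t,"$1,200.00","$1,200.00",2,30\n'
)

# Load only the columns needed later, store the t/f flag as a category,
# and parse the date column while reading.
calendar_small = pd.read_csv(
    sample_csv,
    usecols=["listing_id", "date", "available", "price"],
    dtype={"listing_id": "int64", "available": "category"},
    parse_dates=["date"],
)

print(calendar_small.dtypes)
print(calendar_small.memory_usage(deep=True).sum(), "bytes")
```

On the real file, the same arguments would be passed together with the path instead of the in-memory buffer.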

Basic Information

These are the fields in calendar_detail. Note that prices are given in US dollars and prefixed with "$", so they need to be cleaned before use.

calendar.head(2)
# Output
   listing_id        date available    price adjusted_price  minimum_nights  maximum_nights
0     1165040  2019-04-17         f  $511.00        $511.00             1.0          1125.0
1     1165040  2019-04-18         t  $511.00        $511.00             1.0          1125.0

Fields in listings

listings.columns
# Output
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Fields in reviews

reviews.columns
# Output
Index(['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments'], dtype='object')

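This basic information matters for every later step.

Data Cleaning

First check whether the data contains missing values. The simplest and bluntest fix is to drop the affected rows, which is what is done here: with about ten million rows in total, removing a few hundred has no meaningful effect.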

calendar.isnull().sum()
# Output
listing_id          0
date                0
available           0
price               0
adjusted_price      0
minimum_nights    358
maximum_nights    358
dtype: int64

The date column is also converted to a datetime type.

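calendar1 = calendar.copy()
calendar1.dropna(inplace=True)
# pd.to_datetime is used instead of astype('datetime64'),
# which newer pandas versions reject when no unit is given
calendar1['date'] = pd.to_datetime(calendar1['date'])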
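Before dropping rows, it is worth confirming how small the affected share really is. The following is a minimal sketch on made-up data (the frame df and its values are purely illustrative) that measures the missing share per column, drops the incomplete rows, and converts the dates.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the calendar table (values are made up).
df = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2019-04-17", "2019-04-18", "2019-04-17", "2019-04-18"],
    "minimum_nights": [1.0, np.nan, 2.0, 2.0],
})

# Fraction of missing values per column.
missing_share = df.isnull().mean()

# Drop incomplete rows on a copy, then parse the dates with an explicit format.
cleaned = df.dropna().copy()
cleaned["date"] = pd.to_datetime(cleaned["date"], format="%Y-%m-%d")

rows_dropped = len(df) - len(cleaned)
print(missing_share)
print("rows dropped:", rows_dropped)
```

In the real data, 358 missing rows out of about ten million is a share of roughly 0.004%, which supports simply dropping them.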

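In this step the dollar sign is removed and the column is converted to a numeric type. After the dollar sign is gone, the thousands separators remain and must be removed as well.
price and adjusted_price are different fields: adjusted_price is the discounted price, but in practice the two barely differ. For simplicity, the price field is used.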

# Keep the part after the dollar sign
calendar1['price'] = calendar1['price'].str.split('$', expand=True)[1]
calendar1['adjusted_price'] = calendar1['adjusted_price'].str.split('$', expand=True)[1]
# Remove thousands separators and convert to float
calendar1['adjusted_price'] = calendar1['adjusted_price'].str.replace(',', '').astype('float')
calendar1['price'] = calendar1['price'].str.replace(',', '').astype('float')
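The two cleaning steps can also be wrapped in a small helper. This is a sketch under my own assumptions: parse_price is a hypothetical name, not part of pandas or of the original code, and the sample values are made up. It also shows one way to check how often price and adjusted_price actually agree.

```python
import pandas as pd

def parse_price(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,200.00' to floats (hypothetical helper)."""
    stripped = (
        series.str.replace("$", "", regex=False)   # literal dollar sign
              .str.replace(",", "", regex=False)   # thousands separator
    )
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(stripped, errors="coerce")

price = parse_price(pd.Series(["$511.00", "$1,200.00", "$85.50"]))
adjusted = parse_price(pd.Series(["$511.00", "$1,100.00", "$85.50"]))

# Share of rows where the two price fields agree.
share_equal = (price == adjusted).mean()
print(price.tolist())
print(share_equal)
```

Passing regex=False makes it explicit that "$" is a literal character rather than a regular-expression anchor.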

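Listing Dimension

Data Preparation

There are many possible directions to explore. The analysis starts with the listing itself as the unit.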

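# Average the numeric columns per listing; numeric_only avoids errors on
# the non-numeric columns (date, available) in newer pandas versions
calendar_g = calendar1.groupby(by='listing_id').mean(numeric_only=True)
a = pd.merge(calendar_g, listings, left_on='listing_id', right_on='id')
# Number of missing values (count() would return the total row count instead)
a['neighbourhood_group'].isnull().sum()
# Copy the selection so that new columns can be added without warnings
house = a[['id', 'room_type', 'minimum_nights_x', 'maximum_nights', 'longitude', 'latitude', 'price_x', 'price_y', 'reviews_per_month', 'availability_365']].copy()
# rate: average daily price (calendar) divided by the listed price (listings)
house.eval('rate=price_x/price_y', inplace=True)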
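The aggregate-then-merge pattern is easier to follow on a few rows. The sketch below uses made-up toy tables (calendar_toy and listings_toy are illustrative names) and adds validate="one_to_one" as an optional safety check that each listing appears once on both sides of the join.

```python
import pandas as pd

# Toy daily calendar: two listings, two days each (values are made up).
calendar_toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "available": ["t", "f", "t", "t"],
    "price": [100.0, 120.0, 300.0, 500.0],
})
# Toy listings table; listing 3 has no calendar rows.
listings_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [100, 320, 50],
})

# Mean of numeric columns per listing; the string column is skipped.
avg = calendar_toy.groupby("listing_id").mean(numeric_only=True).reset_index()

# Inner join: listings without calendar rows are dropped.
merged = pd.merge(
    avg, listings_toy,
    left_on="listing_id", right_on="id",
    how="inner", validate="one_to_one",
)

# price_x comes from the calendar, price_y from the listings table.
merged["rate"] = merged["price_x"] / merged["price_y"]
print(merged[["id", "room_type", "price_x", "price_y", "rate"]])
```

Because both tables have a price column, pandas appends the _x and _y suffixes, which is where the names price_x and price_y come from.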

Next, listings are grouped into levels by price.

house['level'] = pd.cut(house['price_x'], bins=[0, 200, 400, 600, 800, 1500, 3000, 8000, 50000], labels=[1, 2, 3, 4, 5, 6, 7, 8])
house['level'].value_counts()
# Output
2    9288
3    6970
1    4707
4    3199
5    2489
6    1117
7     563
8     104
Name: level, dtype: int64
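One detail of pd.cut is worth keeping in mind: by default the bins are open on the left and closed on the right, so a price of exactly 200 falls in level 1, while a price of 0 or anything above 50000 gets no level at all. A small sketch with made-up prices:

```python
import pandas as pd

bins = [0, 200, 400, 600, 800, 1500, 3000, 8000, 50000]
labels = [1, 2, 3, 4, 5, 6, 7, 8]

# Made-up prices chosen to hit the bin edges.
prices = pd.Series([0.0, 150.0, 200.0, 200.01, 7999.0, 60000.0])

# Default right=True gives intervals (0, 200], (200, 400], ...
levels = pd.cut(prices, bins=bins, labels=labels)
print(pd.DataFrame({"price": prices, "level": levels}))
```

Rows that fall outside every bin come back as NaN, so a quick check of levels.isna().sum() shows whether the chosen bins cover the data.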

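Data Visualization

Geographic Distribution

I used Tableau for the visualizations here; plotly, pychart, and similar tools would work too. See the figure below.

Zooming in on the distribution of listings in the urban area:

The analysis revealed the following patterns:

Listings are concentrated mainly around the city center;
The higher the listing level, the more dispersed the distribution;
Even within urban Beijing, listings are highly clustered.

Concentration around the city center is to be expected: population density there is higher, and so is the demand for accommodation. Higher-level listings are more dispersed, possibly because guests with more spending power prefer to relax and unwind away from the city.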

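Number and Price of Listings by Type

Listings are divided into 8 levels by price, and there are three room types.

The chart shows that levels 1, 2, 3, and 4 account for the vast majority of listings.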
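The numbers behind a chart like this can also be produced directly in pandas. The following is a sketch on made-up rows (house_toy is an illustrative stand-in for the house table) that counts listings per level and room type and computes the mean price per cell.

```python
import pandas as pd

# Toy stand-in for the house table (values are made up).
house_toy = pd.DataFrame({
    "level": [1, 1, 2, 2, 3],
    "room_type": ["Shared room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price_x": [90.0, 150.0, 250.0, 380.0, 550.0],
})

# Number of listings for each (level, room_type) combination.
counts = pd.crosstab(house_toy["level"], house_toy["room_type"])

# Mean price for each combination; empty cells stay NaN.
mean_price = house_toy.pivot_table(
    index="level", columns="room_type", values="price_x", aggfunc="mean"
)

# Share of all listings that falls into each level.
share_by_level = counts.sum(axis=1) / counts.values.sum()

print(counts)
print(mean_price)
print(share_by_level)
```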

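Profit Ratio

Here the real price comes from calendar_detail, while the baseline price comes from the listings data. The real price fluctuates from day to day: it may rise on holidays and dip a little in the off season. The price in the listings data is an average price, or possibly a cost price; for now it is simply treated as the baseline, and the profit ratio is the real price divided by it.
The chart clearly shows that higher-level listings have a higher profit ratio. By room type, Shared room has the highest ratio. The reasons behind this would have to be worked out against the actual business context.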
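The comparison by level and by room type comes down to a grouped aggregate of the ratio. This is a sketch on made-up numbers (house_toy and its values are illustrative only, so the results say nothing about the real data); reporting the median and the group size next to the mean helps judge whether a high average rests on only a few listings.

```python
import pandas as pd

# Toy stand-in for the house table (values are made up).
house_toy = pd.DataFrame({
    "level": [1, 1, 2, 2, 2],
    "room_type": ["Shared room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt"],
    "price_x": [110.0, 150.0, 330.0, 360.0, 240.0],   # average daily price
    "price_y": [100.0, 150.0, 300.0, 300.0, 300.0],   # baseline listed price
})

# Ratio of average daily price to baseline price.
house_toy["rate"] = house_toy["price_x"] / house_toy["price_y"]

# Mean, median, and group size of the ratio per level and per room type.
rate_by_level = house_toy.groupby("level")["rate"].agg(["mean", "median", "count"])
rate_by_type = house_toy.groupby("room_type")["rate"].agg(["mean", "median", "count"])

print(rate_by_level)
print(rate_by_type)
```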

Time Dimension

The following analyzes the data along the time dimension. The data is first regrouped and then aggregated.

# Copy the selection so that new columns can be added without warnings
calendar2 = calendar1[['date', 'listing_id', 'price']].copy()
calendar2['level'] = pd.cut(calendar2['price'], bins=[0, 200, 400, 600, 800, 1500, 3000, 8000, 50000], labels=[1, 2, 3, 4, 5, 6, 7, 8])
calendar2['year'] = calendar1['date'].dt.year
calendar2['month'] = calendar1['date'].dt.month
calendar2['day'] = calendar1['date'].dt.day
# Row count and mean price per level, year, and month
b = calendar2.groupby(by=['level', 'year', 'month']).agg({'date': 'count', 'price': 'mean'})
b = b.dropna()
b = b.reset_index()

b
# Output
     level  year  month    date         price
0        1  2019      4   75776    146.795780
1        1  2019      5  163579    146.405504
2        1  2019      6  159471    146.166908
3        1  2019      7  163156    146.144101
4        1  2019      8  163746    146.820570
..     ...   ...    ...     ...           ...
99       8  2019     12    3278  13714.726052
100      8  2020      1    3158  13337.024066
101      8  2020      2    2954  13336.299932
102      8  2020      3    3148  13337.912961
103      8  2020      4    1513  13361.962327
104 rows × 5 columns

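One caveat for the time dimension: the row counts for April 2019 and April 2020 are much lower than for other months because only about half a month of data was collected for each.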
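Partial months can be detected rather than spotted by eye. The sketch below is one possible approach on made-up data (the frame cal and its date range are illustrative): it compares the number of distinct days observed in each month with the calendar length of that month and normalizes the row count per observed day.

```python
import pandas as pd

# Toy calendar for one listing, starting mid-April (values are made up).
dates = pd.date_range("2019-04-17", "2019-06-30", freq="D")
cal = pd.DataFrame({"date": dates, "price": 100.0})
cal["year"] = cal["date"].dt.year
cal["month"] = cal["date"].dt.month
cal["days_in_month"] = cal["date"].dt.days_in_month

monthly = cal.groupby(["year", "month"]).agg(
    rows=("date", "count"),               # number of calendar rows
    days_observed=("date", "nunique"),    # distinct days with data
    days_in_month=("days_in_month", "first"),
)

# A month is complete only if every calendar day appears in the data.
monthly["complete"] = monthly["days_observed"] == monthly["days_in_month"]
# Rows per observed day are comparable across full and partial months.
monthly["rows_per_day"] = monthly["rows"] / monthly["days_observed"]
print(monthly)
```

Comparing rows_per_day instead of raw counts keeps partial months from looking like a drop in supply.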

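Summary

This dataset has too few features to investigate what drives short-term rental prices. About the only candidates are latitude and longitude against price, and their correlation is fairly low, so the exploration stops here.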
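For reference, a correlation check of this kind takes only a few lines. The sketch below runs on randomly generated data (the frame toy is synthetic and unrelated to the real listings, so its numbers mean nothing); it only illustrates computing Pearson and Spearman correlations between price and coordinates.

```python
import numpy as np
import pandas as pd

# Synthetic data: coordinates and prices drawn independently of each other.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "latitude": rng.normal(39.9, 0.2, n),
    "longitude": rng.normal(116.4, 0.2, n),
    "price_x": rng.lognormal(5.5, 0.8, n),
})

# Pearson measures linear association; Spearman works on ranks and is
# less sensitive to the long right tail typical of prices.
pearson = toy.corr(method="pearson")["price_x"].drop("price_x")
spearman = toy.corr(method="spearman")["price_x"].drop("price_x")

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}))
```

A low linear correlation with raw coordinates does not rule out a spatial effect, since price may depend on distance to the center rather than on latitude or longitude alone.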
