Digital and Data Accelerator
Author: Danger Crew
Last update: April 22, 2024

Big Data Project – Hong Kong Rental Analysis

Hong Kong is well known as one of the cities with the highest housing prices. However, when we look into the factors that affect those prices, web scraping and machine learning reveal some unexpected insights.

Author: Henry Kuang, DANGER Bootcamp Alumni

The project workflow is as follows:

  1. Web scraping from Centaline
  2. Data cleansing and forming a dataframe
  3. Exploratory data analysis using Tableau
  4. Machine learning (ML)
  5. ML model visualization through TabPy in Tableau
  6. Observation
  7. Conclusion

Web scraping from Centaline

Web scraping means using a programming language such as Python to collect data from the web automatically and repeatedly, instead of copying and pasting it from the website by hand.

Centaline is one of the websites that allow scraping, so I chose it as my data source. From Centaline, I collected nine features for each rental unit: Big District (there are eighteen big districts), District Name (there are eighty-nine smaller districts), MTR Duration (the time it takes to reach the MTR station), Estate Type (phase, normal, single), Floor (high, medium, low), Net Size, Size, Bedroom Count and Building Age.
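
To illustrate the idea, here is a minimal sketch of a scraping loop. It is not the project's actual code: the URL, the `page` parameter, the `data` key and the field names in `FEATURE_KEYS` are all hypothetical placeholders, since the real ones depend on how the site serves its listings. The page fetcher is passed in as a function so the loop can be tried offline with fake data.

```python
import json
from typing import Callable
from urllib.request import urlopen

# Hypothetical JSON keys; the real field names depend on the site's responses.
FEATURE_KEYS = [
    "big_district", "district_name", "mtr_duration", "estate_type", "floor",
    "net_size", "size", "bedroom_count", "building_age", "rent_price",
]


def fetch_page_over_http(page: int) -> str:
    """Download one page of listings as raw JSON text (placeholder URL)."""
    url = f"https://example.com/api/rent-listings?page={page}"  # hypothetical
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def scrape_listings(fetch_page: Callable[[int], str], max_pages: int) -> list:
    """Request page after page and keep only the features of interest."""
    rows = []
    for page in range(1, max_pages + 1):
        listings = json.loads(fetch_page(page)).get("data", [])
        if not listings:  # an empty page means there is nothing left to fetch
            break
        for item in listings:
            # dict.get returns None when a key is missing from the response
            rows.append({key: item.get(key) for key in FEATURE_KEYS})
    return rows


def fake_fetch_page(page: int) -> str:
    """Offline stand-in for the website, returning made-up listings."""
    pages = {
        1: {"data": [{"big_district": "Eastern", "size": 500, "rent_price": 18000}]},
        2: {"data": [{"big_district": "Sha Tin", "size": 420}]},
    }
    return json.dumps(pages.get(page, {"data": []}))


rows = scrape_listings(fake_fetch_page, max_pages=10)
print(len(rows), rows[1]["rent_price"])  # 2 None
```

Swapping `fake_fetch_page` for `fetch_page_over_http` would turn the same loop into a live scraper, once the placeholder URL and keys are replaced with real ones.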

Here is some of my web scraping code:

Data cleansing and forming a dataframe

Since what I scraped is JSON-formatted data, some of the values come back as “None” because the key is missing. At first, I thought a missing bedroom count meant the flat had no bedroom, and a missing building age meant the building was newly built with an age of almost zero. But when I checked the website, I found that the data was missing simply because the site did not show it on the page, so I could not scrape it. Finally, for data cleansing, I dropped the Sale Price column, then dropped the rows with missing values for Rent Price, Building Age, Bedroom Count, Size, Floor or MTR Duration.
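
The cleansing steps described above can be sketched with pandas as follows. The three records are made up for illustration, and the column names are assumed to match the feature names used in this post; the real dataframe may be labelled differently.

```python
import pandas as pd

# Made-up records standing in for the scraped JSON; None marks a missing key.
records = [
    {"Rent Price": 18000, "Sale Price": None, "Building Age": 30,
     "Bedroom Count": 2, "Size": 500, "Floor": "high", "MTR Duration": 10},
    {"Rent Price": 15000, "Sale Price": None, "Building Age": None,
     "Bedroom Count": 1, "Size": 380, "Floor": "low", "MTR Duration": 5},
    {"Rent Price": None, "Sale Price": 6500000, "Building Age": 12,
     "Bedroom Count": 3, "Size": 700, "Floor": "medium", "MTR Duration": 8},
]
df = pd.DataFrame(records)

# Columns that must be present for a row to be usable in the rent model.
required = ["Rent Price", "Building Age", "Bedroom Count",
            "Size", "Floor", "MTR Duration"]

clean = (
    df.drop(columns=["Sale Price"])   # sale prices are not used in a rental analysis
      .dropna(subset=required)        # drop rows missing any required value
      .reset_index(drop=True)
)
print(len(df), "rows before,", len(clean), "rows after")
```

Dropping rows rather than filling in zeros matches the finding above: a missing value here means the page did not show it, not that the true value is zero.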

Exploratory data analysis using Tableau

  • The Big District with the highest unit rent and the one with the lowest unit rent
  • The District with the highest unit rent and the one with the lowest unit rent
  • The Big District whose flats have the highest average building age and the one with the lowest average building age
  • The distribution of unit rent across estate types
  • The distribution of unit rent across floor levels

https://public.tableau.com/app/profile/liangyi.kuang/viz/HousingEDA_17110272004660/Analysis?publish=yes

Machine learning (ML)

After collecting the data, I can predict the rent of a unit from its features by using machine learning. Unlike the conventional approach, in which we would work out a result by hand-writing a series of if-else conditions, a machine learning model learns the relationship itself from a large number of inputs (the eight features) and outputs (the rent). Once the model is trained, you can predict an output (rent) by feeding new inputs (the eight features) into it.

For the model, I chose Linear Regression because it has two benefits: a short training time and the ability to extract the coefficient (weight) of each feature. Since some of the features are categorical rather than numerical, a one-hot encoder is also needed to turn the categorical data into columns the computer can work with.

Here is some of the machine learning code:

ML model visualization through TabPy in Tableau

The above shows an example: a flat in Eastern District with an MTR Duration of 10 minutes, a size of 500, a net size of 600, a bedroom count of 5, estate type phase, on a high floor, with a building age of 66. The estimated rent for these inputs is HK$12,363.
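
A function served through TabPy receives each Tableau argument as a list and returns a list of results. The sketch below shows one way such a function could wrap a trained model. The function name `predict_rent`, the column names and the `ConstantModel` placeholder are all assumptions for illustration; in practice the placeholder would be the fitted pipeline, and its constant output is not a real estimate.

```python
import numpy as np
import pandas as pd


class ConstantModel:
    """Placeholder with the same predict() interface as a fitted model."""

    def predict(self, frame):
        return np.full(len(frame), 10000.0)  # dummy value, not a real estimate


model = ConstantModel()  # replace with the trained pipeline


def predict_rent(big_district, mtr_duration, size, net_size,
                 bedroom_count, estate_type, floor, building_age):
    """Each argument arrives as a list; return one rent per row."""
    frame = pd.DataFrame({
        "Big District": big_district,
        "MTR Duration": mtr_duration,
        "Size": size,
        "Net Size": net_size,
        "Bedroom Count": bedroom_count,
        "Estate Type": estate_type,
        "Floor": floor,
        "Building Age": building_age,
    })
    return model.predict(frame).tolist()


# Deployment needs a running TabPy server, so it is left commented out:
# from tabpy.tabpy_tools.client import Client
# client = Client("http://localhost:9004/")
# client.deploy("predict_rent", predict_rent, "Predicts monthly rent", override=True)

# Local check with the example inputs from this section.
estimate = predict_rent(["Eastern"], [10], [500], [600], [5],
                        ["phase"], ["high"], [66])
```

Calling the function locally with single-element lists, as in the last lines, is a quick way to check it before deploying it for Tableau to query.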

Observation

Most people may think, “The more bedrooms a flat has, the higher its rent,” since more building material is used to partition off each bedroom.

In fact, the coefficient shows the opposite: the more bedrooms a flat has, the lower its rent. Each additional bedroom reduces the predicted rent by $1753.70830145.
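
In a linear regression, a coefficient is the change in the predicted value for a one-unit increase in that feature while every other feature stays the same. A small worked example using the bedroom coefficient reported above (the helper name `rent_change` is only for illustration):

```python
BEDROOM_COEFFICIENT = -1753.70830145  # HK$ per extra bedroom, as reported above


def rent_change(extra_bedrooms, coefficient=BEDROOM_COEFFICIENT):
    """Change in predicted rent when only the bedroom count changes."""
    return coefficient * extra_bedrooms


# Two flats identical in every other feature, one with two more bedrooms:
two_more = rent_change(2)
print(f"{two_more:.2f}")  # -3507.42
```

Note that size is held fixed in this comparison, so the coefficient describes splitting the same floor area into more rooms, not a larger flat.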

I think the reason is that flats without bedrooms are usually smaller and cheaper than flats with bedrooms, so applying for a mortgage is easier, which raises the average price per square foot of flats without bedrooms. This higher average price per square foot for open-plan units may be why the rent falls as the number of bedrooms rises.

Conclusion

After finishing this project, I have experienced the difficulty of web scraping, data cleansing and visualizing the model in TabPy. If the model were to be improved, it would be by adding features such as views and developers. However, adding these features would make data cleansing harder and significantly reduce the amount of usable data, so I gave them up. Here are the root documents of my project for your reference:

https://drive.google.com/drive/folders/1FpFlbkN_aox9qYqrrS9VqhLH3RXKn60t?usp=sharing