```{python}
# Not every package imported here is used directly in this .qmd file;
# several are used in the companion .ipynb notebooks that underpin this report
# (cleaning/preprocessing, Word2Vec, SVM, etc.), so they are kept for reproducibility.
import os
import spacy
import pandas as pd
import numpy as np
import geopandas as gpd
import re
import math
import string
import unicodedata
import gensim
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.gridspec import GridSpec
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import matplotlib.patheffects as PathEffects
import nltk
import seaborn as sns
import ast
import umap
import zipfile
import requests
from PIL import Image
import contextily as ctx
import urllib.request
from PIL import ImageDraw
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist
from scipy.ndimage import convolve
from shapely.geometry import Point
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import dendrogram, linkage
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk import ngrams, FreqDist
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora.dictionary import Dictionary
from gensim.matutils import Sparse2Corpus
from gensim.matutils import corpus2dense
from gensim.models import tfidfmodel
from gensim.models import Word2Vec
from gensim.models import TfidfModel
from gensim.models import KeyedVectors
from gensim.models.ldamodel import LdaModel
from graphviz import Digraph
from IPython.display import Image
from joblib import dump
from joblib import load
from bs4 import BeautifulSoup
from wordcloud import WordCloud, STOPWORDS
```
```{python}
# get the current directory
current_dir = os.getcwd()
# GitHub raw URLs for the bibliography (saved locally as bio.bib) and harvard-cite-them-right.csl
# Automatically download the BibTeX file
bib_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/reference.bib"
# create local path for saving
local_bib_path = os.path.join(current_dir, "bio.bib")
# download and save the .bib file
response = requests.get(bib_url)
with open(local_bib_path, 'wb') as file:
    file.write(response.content)
csl_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/harvard-cite-them-right.csl"
# create local path for saving
local_csl_path = os.path.join(current_dir, "harvard-cite-them-right.csl")
# download and save the .csl file
response = requests.get(csl_url)
with open(local_csl_path, 'wb') as file:
    file.write(response.content)
```
---
bibliography: bio.bib
csl: harvard-cite-them-right.csl
title: DeskB's Group Project
execute:
  echo: false
jupyter: python3
format:
  html:
    theme:
      - minty
      - css/web.scss
    code-copy: true
    code-link: true
    toc: true
    toc-title: On this page
    toc-depth: 3
    toc_float:
      collapsed: false
      smooth_scroll: true
  pdf:
    include-in-header:
      text: |
        \addtokomafont{disposition}{\rmfamily}
    mainfont: Spectral
    sansfont: Roboto
    monofont: JetBrainsMono-Regular
    papersize: a4
    geometry:
      - top=25mm
      - left=40mm
      - right=30mm
      - bottom=25mm
      - heightrounded
    toc: false
    number-sections: false
    colorlinks: true
    highlight-style: github
jupyter:
  jupytext:
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
    jupytext_version: 1.15.2
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
    name: python3
---
## Declaration of Authorship {.unnumbered .unlisted}
We, \[**DeskB**\], confirm that the work presented in this assessment is our own. Where information has been derived from other sources, we confirm that this has been indicated in the work. Where a Large Language Model such as ChatGPT has been used we confirm that we have made its contribution to the final submission clear.
Date: 19th December 2023
Student Numbers: 20017359 23032922 23081403 23103585 23130397
## Brief Group Reflection
| What Went Well | What Was Challenging |
|------------------|----------------------|
| data description | plotting |
| data cleaning | SVM classifier model |
## Priorities for Feedback
Are there any areas on which you would appreciate more detailed feedback if we're able to offer it?
Frankly, we encountered a lot of confusion about the topic of this assessment, especially during topic selection: among all the predictive topics on the website, we could not propose a very specific question and structure at the beginning. How to build a bridge between an NLP-based recommendation system for branding and a valuable, well-informed proposal for STL regulation was the key issue for us.
So, if convenient, we would like to know whether we structured the whole report with a solid logical chain, and whether we successfully proposed constructive and feasible suggestions. We would also like to know what NLP analysis used for a proposal should look like in a real company project.
```{=html}
<style type="text/css">
.duedate {
border: dotted 2px red;
background-color: rgb(255, 235, 235);
height: 50px;
line-height: 50px;
margin-left: 40px;
margin-right: 40px;
margin-top: 10px;
margin-bottom: 10px;
color: rgb(150,100,100);
text-align: center;
}
</style>
```
{{< pagebreak >}}
# Response to Questions
```{python}
# check the "Data" folder
data_dir = os.path.join(current_dir, "Data")
if not os.path.exists(data_dir):
os.makedirs(data_dir)
# check the "Model" folder
model_dir = os.path.join(current_dir, "Model")
if not os.path.exists(model_dir):
os.makedirs(model_dir)
# check the "Images" folder
iamges_dir = os.path.join(current_dir, "Images")
if not os.path.exists(iamges_dir):
os.makedirs(iamges_dir)
```
```{python}
# Download and read the csv file remotely from url
host = 'http://data.insideairbnb.com'
path = 'united-kingdom/england/london/2023-09-06/data'
file = 'listings.csv.gz'
url = f'{host}/{path}/{file}'
# Read the local csv file if it already exists; otherwise download and save it
if os.path.exists(file):
    Airbnb_Listing = pd.read_csv(file, compression='gzip', low_memory=False)
else:
    Airbnb_Listing = pd.read_csv(url, compression='gzip', low_memory=False)
    Airbnb_Listing.to_csv(os.path.join("Data", "listing.csv"))
```
```{python}
# Download and read the gpkg file remotely from url
host = 'https://data.london.gov.uk'
path = 'download/london_boroughs/9502cdec-5df0-46e3-8aa1-2b5c5233a31f'
file = 'London_Boroughs.gpkg'
url = f'{host}/{path}/{file}'
# Read the local gpkg file if it already exists; otherwise download and save it
if os.path.exists(file):
    London_boroughs = gpd.read_file(file, low_memory=False)
else:
    London_boroughs = gpd.read_file(url, low_memory=False)
    London_boroughs.to_file(os.path.join("Data", "London_Boroughs.gpkg"), driver='GPKG')
```
```{python}
data_dir = os.path.join(current_dir, "Data")
zip_url = "https://data.london.gov.uk/download/statistical-gis-boundary-files-london/08d31995-dd27-423c-a987-57fe8e952990/London-wards-2018.zip"
local_zip_path = os.path.join(data_dir, "London-wards-2018.zip")
response = requests.get(zip_url)
with open(local_zip_path, 'wb') as file:
    file.write(response.content)
with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)
London_wards = gpd.read_file(os.path.join("Data","London-wards-2018_ESRI","London_Ward.shp"))
```
## 1. Who collected the data?
The dataset was collected by [Murray Cox](https://en.wikipedia.org/wiki/Inside_Airbnb) through automatic scraping from the Airbnb website, specifically for the Inside Airbnb project.
## 2. Why did they collect it?
The [Inside Airbnb](http://insideairbnb.com/about) project aims to provide an independent perspective, helping the public, researchers, and policymakers understand how Airbnb affects urban housing affordability and community dynamics. It offers insights for policy discussions and social understanding of Airbnb's role in urban environments.
## 3. How was the data collected?
[listings.csv](http://data.insideairbnb.com/united-kingdom/england/london/2023-09-06/data/listings.csv.gz) : Inside Airbnb collects its data primarily by scraping information from the Airbnb website. This process involves the following steps (a minimal sketch follows the list):
- Web Scraping: Inside Airbnb employs scripts to rapidly and extensively extract Airbnb listing data, imitating human browsing.
- Data Extraction: Information about each listing, such as location, price, availability and host details, is extracted and compiled.
- Data Aggregation: Aggregated data forms a database for analyzing Airbnb trends and insights across cities and regions.
- Regular Updates: The scraping process is repeated periodically to keep the database current, capturing new listings and updates to existing ones.
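The sketch below illustrates, in heavily simplified form, what one scraping step of this kind could look like. It is only an illustration of the general technique: the URL, fields and `listing_urls` variable are hypothetical, and this is not Inside Airbnb's actual pipeline.

```python
# Illustrative sketch only; not Inside Airbnb's actual code. URL, fields and listing_urls are hypothetical.
import requests
from bs4 import BeautifulSoup

def scrape_listing(url):
    """Fetch one listing page and pull out a few fields, imitating a single human visit."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "title": soup.title.string if soup.title else None,   # page title as a proxy for the listing name
        "paragraph_text": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }

# Aggregation and regular updates would then look something like:
# listings = [scrape_listing(u) for u in listing_urls]     # listing_urls is a hypothetical crawl frontier
# pd.DataFrame(listings).to_csv("listings_snapshot.csv")   # re-run periodically to keep the snapshot current
```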
## 4. How does the method of collection impact the completeness and/or accuracy of its representation of the process it seeks to study, and what wider issues does this raise?
The dataset is mostly obtained by scraping information from the Airbnb website, so the breadth and depth of the information publicly available on the site may be limited. For instance, detailed information about certain listings might not be fully disclosed, or website terms might restrict access to some data. Moreover, legal and ethical considerations in web scraping, such as data privacy and usage rights, may affect the integrity and accuracy of the data. The content of the website changes dynamically, but data scraping occurs at intervals, which means the data might not be updated in real time, potentially leading to information gaps[@prentice_addressing_2023].
## 5. What ethical considerations does the use of this data raise?
### 5.1 Privacy issues
A key question is whether the dataset has the owners' consent to disclose their information, e.g., house location and name. Geocoded data is privacy-sensitive and highly likely to expose personal privacy when used to study demographic patterns and behaviours[@van_den_bemt_teaching_2018]. It is therefore crucial to obtain the consent of the owners to ensure that their privacy is not infringed upon.
### 5.2 Legal compliance
Usage of the dataset should comply with laws and regulations such as the GDPR and the DPA, as well as EDPS guidance. The EDPS 2015 report states that it is not enough to comply with the law in today's digital environment; we must also consider the ethical dimensions of data processing[@hasselbalch_making_2019]. Legal compliance and ethical considerations should therefore be closely combined in the digital age.
### 5.3 Social responsibility
It is critical to use the dataset correctly, as exposing certain data may result in inequity and bias. The Fairness and Openness report[@walker_consumer_2019] emphasizes how to use information responsibly and ethically, and the importance of resisting the labelling of low-income communities, races, etc. For example, a significant gap in housing prices between different neighbourhoods may reflect economic differences, which may in turn affect perceptions of the social status of those areas. To avoid unwanted consequences, it is necessary to examine how the tagged attributes of the data are disclosed.
### 5.4 Data security
Some sensitive information in the dataset must be stored securely to prevent unauthorized access and misuse. By adjusting the norms of network data use, it is possible to effectively guarantee data security and raise the level of companies' ethical behaviour when processing data[@culnan_how_2009]. Attention to data security can thus prevent unscrupulous individuals from collecting housing data for profit or surveillance purposes.
## 6. With reference to the data (*i.e.* using numbers, figures, maps, and descriptive statistics), what does an analysis of Hosts and Listing types suggest about the nature of Airbnb lets in London?
### 6.1 Why should we choose the textual information?
Many studies have analyzed various aspects of Airbnb listings, including price[@zhang_key_2017], spatial distribution[@la_location_2021] and room type[@voltes-dorta_drivers_2020]. However, the textual description, which has more untapped potential than the numeric fields, also plays a crucial role in shaping renters' first impressions of a listing and in facilitating successful rental transactions. Therefore, we scrutinize the textual features of the data, then generalize, classify and summarize insightful conclusions correlated with branding potential[@ji_analysis_2021].
### 6.2 What can we dig from the textual information?
The dataset contains two textual fields drawn from hosts' self-promotion: 'description' and 'amenities'. The 'description' column describes the advantages and characteristics of a listing, while 'amenities' lists the facilities affiliated with it.
After some [cleaning and preprocessing](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Processing_Airbnb_listing_normalising.ipynb), there are two sets of questions corresponding to these two columns.
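The full cleaning pipeline lives in the linked notebook; the non-executed sketch below illustrates the kind of normalization we assume was applied to the 'description' text (HTML stripping, lower-casing, tokenization, stop-word removal). The function name and exact steps are our reconstruction, not a verbatim copy of the notebook.

```python
# Sketch of the assumed text normalization; the authoritative version is in the linked notebook.
import re
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
stop_words = set(stopwords.words("english"))

def normalize_text(raw):
    """Strip HTML, keep letters only, lower-case, tokenize, and drop stop words and very short tokens."""
    text = BeautifulSoup(str(raw), "html.parser").get_text(separator=" ")
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    return [tok for tok in word_tokenize(text) if tok not in stop_words and len(tok) > 2]

# Airbnb_Listing["description_norm"] = Airbnb_Listing["description"].fillna("").apply(normalize_text)
```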
#### 6.2.1 Which topics do hosts like to focus on when promoting their properties?
We use an LDA model to extract topics and the most frequent keywords within them. After fitting the model iteratively with different numbers of topics, we determine that the best number of topics for summarizing the 'description' column is 16 (*Figure1a*).
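The coherence scan itself was run in the linked notebook (it takes roughly half an hour), so only its saved outputs are downloaded below. The non-executed sketch that follows shows the general shape of that scan, using a plain bag-of-words corpus rather than the notebook's TF-IDF matrix; `tokenized_descriptions` (a list of token lists) is an assumed variable.

```python
# Sketch of the topic-number scan; 'tokenized_descriptions' (one token list per listing) is assumed.
from gensim.corpora.dictionary import Dictionary
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def coherence_by_topic_count(tokenized_descriptions, candidate_counts=range(2, 21, 2)):
    dictionary = Dictionary(tokenized_descriptions)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_descriptions]
    scores = {}
    for k in candidate_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       random_state=42, passes=5)
        cm = CoherenceModel(model=lda, texts=tokenized_descriptions,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()   # higher coherence = more interpretable topics
    return scores                        # the peak of this curve is the topic count plotted in Figure 1a
```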
```{python}
current_dir = os.getcwd()
coherence_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/coherence_values.csv"
# create local path for saving
local_coherence_path = os.path.join(current_dir, "Data","coherence_values.csv")
# download and save the coherence values .csv
response = requests.get(coherence_url)
with open(local_coherence_path, 'wb') as file:
    file.write(response.content)
# Because fitting the LDA model takes several minutes,
# we read the model's output directly from the project's GitHub;
# the detailed code can be found in the project repository.
LDAtopicwords_url = "https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/lda_topics_and_words.csv"
# create local path for saving
local_LDAtopicwords_path = os.path.join(current_dir, "Data","lda_topics_and_words.csv")
# download and save the topics-and-words .csv
response = requests.get(LDAtopicwords_url)
with open(local_LDAtopicwords_path, 'wb') as file:
    file.write(response.content)
```
```{python}
# read coherence_values.csv
LDA_topic_coherence_frame = pd.read_csv(os.path.join("Data","coherence_values.csv"))
# read the LDA model output
LDA_topics_and_words_frame = pd.read_csv(os.path.join("Data","lda_topics_and_words.csv"))
```
```{python, fig.cap="Figure1a: best number of topics for summarizing key words", #Figure1a}
# create the line chart
fig, ax1 = plt.subplots(figsize=(12, 6))
ax1.plot(LDA_topic_coherence_frame['Topic_Num'], LDA_topic_coherence_frame['Coherence_Score'], marker='o')
ax1.set_title('Coherence Scores across Different Numbers of Topics')
ax1.set_xlabel('Number of Topics')
ax1.set_ylabel('Coherence Score')
ax1.grid(True)
# add the label for Y value
for x, y in zip(LDA_topic_coherence_frame['Topic_Num'], LDA_topic_coherence_frame['Coherence_Score']):
    ax1.annotate(f'{y:.3f}', (x, y), textcoords="offset points", xytext=(0, 5), ha='center')
# add extra space for label
#plt.subplots_adjust(bottom=0.2)
# labels of the picture on the bottom
#fig.text(0.5, 0.08, 'Figure1a: Best number of topics for summarizing key words', ha='center', va='bottom')
#add (a) in the left top corner
fig.text(0.1, 0.9, '(a)', ha='left', va='top', fontsize=14, color='black', weight='bold')
plt.savefig(os.path.join("Images","CoherenceScoreOfLDA.png"))
plt.show()
```
```{python, fig.cap="Figure1b: Topics and key words", #Figure1b}
fig, axes = plt.subplots(4, 4, figsize=(24,24))
axes = axes.flatten()
# plot wordcloud for each topic
for i, topic in enumerate(LDA_topics_and_words_frame['Topic'].unique()):
    topic_data = LDA_topics_and_words_frame[LDA_topics_and_words_frame['Topic'] == topic]
    word_frequencies = {row['Word']: row['Weight'] for index, row in topic_data.iterrows()}
    wordcloud = WordCloud(width=400, height=400, background_color='white').generate_from_frequencies(word_frequencies)
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].axis('off')
    axes[i].set_title(f'Topic {topic}', fontsize=15)
plt.tight_layout()
# add extra space for label
plt.subplots_adjust(bottom=0.1)
# labels of the picture on the bottom
# add (b) in the left top corner
fig.text(0, 0.97, '(b)', ha='left', va='top', fontsize=30, color='black', weight='bold')
fig.text(0.5, 0.05, 'Figure1:(a)Variation of LDA Model Coherence Scores with Topic Quantity.\n(b)Airbnb Listing Topic Analysis: LDA Modeling and Keyword Visualization', ha='center', va='bottom', fontsize=30)
plt.savefig(os.path.join("Images","LDA_topic16_wordcloud.png"))
plt.show()
```
The [LDA process](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/LDA_Modelling_by_TFIDFmatrix.ipynb) takes about 30 minutes to run, so we save its output and re-read it remotely from GitHub. The results (*Figure1b*) show that among the 16 topics, some mainly describe location, such as *topic8* and *topic6*, while others contain information about facilities and adjectives describing the surrounding environment, such as *topic13* and *topic14*. In short, these keywords illustrate the general features of Airbnb listings, which are essential to the recommendation algorithms underlying the platform's branding[@mody_airbnb_2018].
#### 6.2.2 Do listings in the same neighbourhood, or with the same spatial location, share similar amenities?
Amenities are highly categorizable: '500Mb-WiFi' and 'high-speed Internet access' essentially mean the same thing. Thus, we need to identify similarities between amenities, much as dictionaries group synonyms. We use the [Word2Vec model](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Word2Vec_Modelling_by_SVM.ipynb) to classify the voluminous words and phrases, and then apply UMAP[@stalder_self-supervised_2023] for better visualization in *Figure2a*.
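The Word2Vec model is trained in the linked notebook and only loaded below; the non-executed sketch that follows shows how such a model could be fitted on the amenity token lists. The hyperparameters are inferred from the saved model's file name (`word2vec-d500-w40.model`) and the `amenity_token_lists` variable is assumed.

```python
# Sketch of the Word2Vec training step; the saved model loaded below was produced by the linked notebook.
from gensim.models import Word2Vec

def train_amenity_word2vec(amenity_token_lists, out_path="Model/word2vec-d500-w40.model"):
    """amenity_token_lists: one list of amenity tokens per listing (assumed variable)."""
    model = Word2Vec(
        sentences=amenity_token_lists,
        vector_size=500,   # inferred from 'd500' in the file name
        window=40,         # inferred from 'w40'; amenity order carries little meaning, so a wide window is used
        min_count=5,       # drop very rare amenity tokens (assumed threshold)
        workers=4,
    )
    model.save(out_path)
    return model

# Example: model.wv.most_similar("wifi") shows which amenities the model treats as near-synonyms.
```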
```{python}
# import Word2Vec Model remotely
word2vec_url = "https://github.com/BohaoSuCC/Groupwork_DeskB/raw/main/Model/word2vec-d500-w40.model"
# create local path for saving
local_word2vec_path = os.path.join(current_dir, "Model","word2vec.model")
# download and save the Word2Vec .model file
response = requests.get(word2vec_url)
with open(local_word2vec_path, 'wb') as file:
    file.write(response.content)
word2vec_model = Word2Vec.load(os.path.join("Model","word2vec.model"))
```
```{python}
# read csv after norm and split
amenities_norm_split = pd.read_csv("https://raw.githubusercontent.com/BohaoSuCC/Groupwork_DeskB/main/Data/amenities_norm_split.csv",low_memory=False)
```
```{python}
NormListing_url = "https://github.com/BohaoSuCC/Groupwork_DeskB/raw/main/Data/Airbnb_listing_norm_min.zip"
local_NormListing_path = os.path.join(data_dir, "Airbnb_listing_norm_min.zip")
response = requests.get(NormListing_url)
with open(local_NormListing_path, 'wb') as file:
    file.write(response.content)
with zipfile.ZipFile(local_NormListing_path, 'r') as zip_ref:
    zip_ref.extractall(data_dir)
Airbnb_Listing = pd.read_csv(os.path.join("Data","Airbnb_listing_norm_min.csv"))
```
```{python}
Airbnb_Listing = pd.read_csv(os.path.join("Data","Airbnb_listing_norm_min.csv"))
texts_word2vec = Airbnb_Listing['amenities_norm']
# convert every word in column 'amenities' into a list
amenities_ast_literal = amenities_norm_split
amenities_ast_literal = amenities_ast_literal.drop('Unnamed: 0', axis=1)
list_of_lists = amenities_ast_literal.apply(lambda row: [item for item in row if item is not None], axis=1).tolist()
```
```{python}
# define the vectorizing function
def vectorize(text, model):
    # split the text into words and filter out words not in the model's vocabulary
    words = [word for word in text if word in model.wv.key_to_index]
    # if no words remain, return a zero vector
    if len(words) == 0:
        return np.zeros(model.vector_size)
    # otherwise return the mean of all word vectors
    word_vectors = [model.wv[word] for word in words]
    return np.mean(word_vectors, axis=0)
Airbnb_Listing['amenities_vector'] = pd.Series(list_of_lists).apply(lambda x: vectorize(x, word2vec_model))
amenities_vector = Airbnb_Listing['amenities_vector']
```
```{python}
import warnings
warnings.filterwarnings('ignore')
# reduce the dimensionality with UMAP
# convert pd.Series to np.array
amenities_vector_nparray = amenities_vector.to_numpy()
numpy_array = np.array([np.array(x) for x in amenities_vector_nparray])
reducer = umap.UMAP(n_components=2,n_neighbors=10,min_dist=0.9)
embedding = reducer.fit_transform(numpy_array)
```
```{python}
# calculate the centroid
center = (np.median(embedding, axis=0)+np.mean(embedding, axis=0))*0.5
# get the moving amount
translation = -center
# transform all points
translated_embedding = embedding + translation
# verify the new centroid
new_center = translated_embedding.mean(axis=0)
#print(f"New center after translation: {new_center}")
```
```{python, fig.cap="Figure2a: Features clustering after UMAP", #Figure2a}
import warnings
warnings.filterwarnings('ignore')
mag = np.sqrt(np.power(translated_embedding[:,0],2) + np.power(translated_embedding[:,1],2)).reshape(-1,1)
angle = np.arctan2(translated_embedding[:,1], translated_embedding[:,0])
# normalize the angle to [0, 1]
angle = (angle - np.min(angle)) / (np.max(angle) - np.min(angle))
# standardize the magnitude (z-score)
mag = (mag - np.mean(mag)) / np.std(mag)
# sigmoid scaling to squash the magnitude into (0, 1)
mag = 1 / (1 + np.exp(-mag))
circ_colors = mpl.colors.hsv_to_rgb(np.concatenate((angle.reshape(-1,1),
np.ones_like(mag).reshape(-1,1),
mag.reshape(-1,1)),
axis=1))
color_info = np.concatenate((translated_embedding, circ_colors), axis=1)
# create the fig
fig, ax = plt.subplots(figsize=(10, 6))
# scatter plot
ax.scatter(translated_embedding[:, 0], translated_embedding[:, 1], color=circ_colors, s=0.5)
ax.axis('off')
# add (a) in the left top corner
fig.text(0.1, 0.9, '(a)', ha='left', va='top', fontsize=10, color='black', weight='bold')
plt.savefig(os.path.join("Images","Word2Vec_2D_UMAP_Projection.png"), dpi=150)
# The image rendered directly by Quarto would be multi-layered and very slow to reload in the PDF output,
# so we re-read the rendered picture from disk instead;
# it is exactly the same version as the figure saved above.
plt.close(fig)
#plt.show()
```
![](Images/Word2Vec_2D_UMAP_Projection.png)
```{python}
# save the color info in Airbnb_Listing
Airbnb_Listing['Word2Vec_UMAP_Xcor'] = color_info[:, 0]
Airbnb_Listing['Word2Vec_UMAP_Ycor'] = color_info[:, 1]
Airbnb_Listing['Word2Vec_UMAP_colorR'] = color_info[:, 2]
Airbnb_Listing['Word2Vec_UMAP_colorG'] = color_info[:, 3]
Airbnb_Listing['Word2Vec_UMAP_colorB'] = color_info[:, 4]
Airbnb_Listing = Airbnb_Listing.drop(['amenities_vector'], axis=1)
```
```{python, fig.cap="Figure2b: spatial distribution of Listing's similarities", #Figure2b}
# Convert the pandas DataFrame (Airbnb_listing.csv) to a GeoDataFrame
# using the longitude/latitude coordinates
gdf_listing = gpd.GeoDataFrame(Airbnb_Listing, geometry=gpd.points_from_xy(Airbnb_Listing.longitude, Airbnb_Listing.latitude))
# Set the CRS
gdf_listing.set_crs("EPSG:4326", inplace=True) # (EPSG:4326)
#print("Converting successful")
# Drop NAs of columns ['amenities_norm','longitude','latitude']
gdf_listing = gdf_listing.dropna(subset=['amenities_norm','longitude','latitude'])
#print(f"Now gdf has {gdf_listing.shape[0]:,} rows and {gdf_listing.shape[1]:,} columns.")
gdf_listing = gdf_listing.to_crs(epsg=3857)
London_boroughs = London_boroughs.to_crs(epsg=3857)
London_wards = London_wards.to_crs(epsg=3857)
#print("gdf_listing CRS:", gdf_listing.crs)
#print("London_boroughs CRS:", London_boroughs.crs)
#print("London_wards CRS:", London_wards.crs)
# plot the map
fig, ax = plt.subplots(figsize=(16, 16))
London_boroughs.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.4)
London_wards.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.2)
# extract the coordinates and RGB info from gdf_listing
x = gdf_listing.geometry.x
y = gdf_listing.geometry.y
colors = gdf_listing[['Word2Vec_UMAP_colorR', 'Word2Vec_UMAP_colorG', 'Word2Vec_UMAP_colorB']].values # RGB info
brightness_factor = 1.5
colors_brightened = np.clip(colors * brightness_factor, 0, 1) # make sure the value is between [0,1]
ax.scatter(x, y, color=colors_brightened, s=40, alpha=0.1)
"""
subax = plt.axes([0.1, 0.2, 0.3, 0.4])  # position in the bottom-left corner
"""
# add the label for boroughs
for idx, row in London_boroughs.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['name'], fontsize=7, color='white', ha='center', va='center', alpha=0.7,
                   path_effects=[PathEffects.withStroke(linewidth=0.5, foreground='black')])
# add the label for wards
for idx, row in London_wards.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['NAME'], fontsize=2, color='black', ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.2, foreground='white')])
"""
x_min, x_max, y_min, y_max = -25000, 5000, 6695000, 6725000
subax.set_xlim(x_min, x_max)
subax.set_ylim(y_min, y_max)
London_boroughs.boundary.plot(ax=subax, edgecolor='black', linewidth=1, alpha=0.4)
London_wards.boundary.plot(ax=subax, edgecolor='black', linewidth=0.5, alpha=0.2)
subax.scatter(x, y,
color=colors_brightened, s=40,
vmax=0.4, vmin=-0.5, alpha=0.2)
"""
#OSM map,
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik, alpha = 0.7)
plt.subplots_adjust(bottom=0.1) # set extra space for label
# hide the axes
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
# add (b) in the left top corner
fig.text(0.1, 0.79, '(b)', ha='left', va='top', fontsize=20, color='black', weight='bold')  # panel label position adjusted manually
fig.text(0.5, 0.15, 'Figure2:(a)Spectrum of Features: A UMAP Clustering of Word Embeddings.\n(b) Geographic Distribution of Residential Similarities in London', ha='center', va='bottom', fontsize=20)
# save as PNG file,150 dpi
fig.savefig(os.path.join("Images","Word2Vec_OSM_geospace.png"), dpi=350)
# The image rendered directly by Quarto would be multi-layered and very slow to reload in the PDF output,
# so we read the rendered picture from disk instead;
# it is exactly the same version as the figure saved above.
plt.close(fig)
#plt.show()
```
![](Images/Word2Vec_OSM_geospace.png)
In *Figure2b*, each colour represents the amenity profile of a property, and areas with similar colours indicate highly similar amenity features across properties. This allows us to determine whether the properties in a specific area or community exhibit homogeneity (highly similar colours) or heterogeneity (more varied colours) in their listing features.
### 6.3 Which indicator should guide branding?
Even though Airbnb, as a responsible company, should take communities and regulation into consideration, the essence of branding and recommendation systems is still profit. This raises the question: which indicator could represent a listing's potential economic opportunities for branding or promotion?
```{python}
# compute the average total income across listings
average_income_forlisting = Airbnb_Listing['sum_income'].mean()
#print((Airbnb_Listing['price'] >= 2000).sum())
#print(f"Data frame is {Airbnb_Listing.shape[0]:,} x {Airbnb_Listing.shape[1]:,}")
# keep only listings with 'price' below 2000
Airbnb_Listing = Airbnb_Listing[Airbnb_Listing['price'] < 2000]
# check dataframe's shape
#print(f"Data frame is {Airbnb_Listing.shape[0]:,} x {Airbnb_Listing.shape[1]:,}")
Airbnb_Listing['profitable'] = (Airbnb_Listing['sum_income'] >= average_income_forlisting).astype(int)
median_income_forlisting = Airbnb_Listing['sum_income'].median()
# Convert the pandas DataFrame (Airbnb_listing.csv) to a GeoDataFrame
# using the longitude/latitude coordinates
gdf_listing = gpd.GeoDataFrame(Airbnb_Listing, geometry=gpd.points_from_xy(Airbnb_Listing.longitude, Airbnb_Listing.latitude))
# Set the CRS
gdf_listing.set_crs("EPSG:4326", inplace=True) # (EPSG:4326)
#print("Converting successful")
# Drop NAs of columns ['description','amenities']
gdf_listing = gdf_listing.dropna(subset=['amenities_norm'])
#print(f"Now gdf has {gdf_listing.shape[0]:,} rows and {gdf_listing.shape[1]:,} columns.")
```
```{python}
import warnings
warnings.filterwarnings('ignore')
gdf_listing = gdf_listing.to_crs(epsg=3857)
London_boroughs = London_boroughs.to_crs(epsg=3857)
London_wards = London_wards.to_crs(epsg=3857)
#print("gdf_listing CRS:", gdf_listing.crs)
#print("London_boroughs CRS:", London_boroughs.crs)
#print("London_boroughs CRS:", London_wards.crs)
"""
# add borough names
gdf_listing_with_borough = gpd.sjoin(gdf_listing, London_boroughs, how='left', op='within')
gdf_listing_with_borough = gdf_listing_with_borough.rename(columns={'name': 'borough_name'})
# add ward names
gdf_listing_with_borough_wards = gpd.sjoin(gdf_listing_with_borough, London_wards, how='left', op='within')
gdf_listing_with_borough_wards = gdf_listing_with_borough_wards.rename(columns={'NAME': 'ward_name'})
"""
gdf_listing['log_sum_income'] = np.log(gdf_listing['sum_income'])
gdf_listing['log_sum_income'].value_counts()
gdf_listing_dropinf = gdf_listing[gdf_listing['log_sum_income'] != -np.inf]
```
```{python, fig.cap="Figure3a: Statistical distribution of Listings' profit-cost ratio", #Figure3a}
import warnings
warnings.filterwarnings('ignore')
# normalize the data between 0 and 1
min_val = gdf_listing_dropinf['log_sum_income'].min()
max_val = gdf_listing_dropinf['log_sum_income'].max()
gdf_listing_dropinf['log_sum_income_normalized'] = (gdf_listing_dropinf['log_sum_income'] - min_val) / (max_val - min_val)
# modify the range to [-1, 1]
gdf_listing_dropinf['log_sum_income_normalized_scaled'] = gdf_listing_dropinf['log_sum_income_normalized'] * 2 - 1
median_num_income = np.median(gdf_listing_dropinf['log_sum_income_normalized_scaled'],axis=0)
gdf_listing_dropinf['log_sum_income_normalized_scaled'] = gdf_listing_dropinf['log_sum_income_normalized_scaled'] - median_num_income
fig, ax = plt.subplots(figsize=(10, 6))
gdf_listing_dropinf['log_sum_income'].hist(bins=150, ax=ax, alpha=0.5, label='Original Data')
gdf_listing_dropinf['log_sum_income_normalized_scaled'].hist(bins=150, ax=ax, alpha=0.5, label='Normalized & Scaled Data')
ax.legend()
plt.subplots_adjust(bottom=0.15) # save extra space for label
# add (a) in the left top corner
fig.text(0.02, 0.95, '(a)', ha='left', va='top', fontsize=14, color='black', weight='bold')
# save to png,150 dpi
fig.savefig(os.path.join("Images","Profit-cost ratio distribution.png"), dpi=150)
plt.show()
```
```{python, fig.cap="Figure3b: spatial distribution of Listings profit-cost ratio", #Figure3b}
import warnings
warnings.filterwarnings('ignore')
fig, ax = plt.subplots(figsize=(24,24))
# Jenks breaks
#breaks = jenkspy.jenks_breaks(gdf_listing_dropinf['log_sum_income_normalized_scaled'],n_classes=15)
# set the breaks manually
breaks = [-1,-0.75,-0.5,-0.4,-0.25,-0.20,-0.10,-0.05,-0.04,-0.03,-0.02,-0.01,-0.005,0,0.005,0.01,0.02,0.03,0.04,0.05,0.10,0.20,0.25,0.4,0.5,0.75,1,2]
# add the label for areas
for idx, row in London_boroughs.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['name'], fontsize=7, color='black', ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.5, foreground='white')])
# add the label for wards
for idx, row in London_wards.iterrows():
    centroid = row.geometry.centroid
    text = ax.text(centroid.x, centroid.y, row['NAME'], fontsize=2, color='black', ha='center', va='center', alpha=0.5,
                   path_effects=[PathEffects.withStroke(linewidth=0.2, foreground='white')])
# classify the data with breaks
gdf_listing_dropinf['income_category'] = np.digitize(gdf_listing_dropinf['log_sum_income_normalized_scaled'], breaks)
#plot the boundary of wards and boroughs
London_boroughs.boundary.plot(ax=ax, edgecolor='black', linewidth=1, alpha=0.4)
London_wards.boundary.plot(ax=ax, edgecolor='black', linewidth=0.5, alpha=0.2)
# scatter plot
scatter = ax.scatter(gdf_listing_dropinf.geometry.x, gdf_listing_dropinf.geometry.y,
c=gdf_listing_dropinf['log_sum_income_normalized_scaled'], edgecolors=None, s=40, cmap='bwr_r',
vmax=0.4, vmin=-0.5, alpha=0.2)
# add the color bar
cbar = plt.colorbar(scatter, ax=ax, label='Profit-cost Ratio', shrink=0.5, pad=0.02)
cbar.ax.set_aspect(20)
# hide the axes
ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik, alpha=0.7) #OSM map
plt.subplots_adjust(bottom=0.1) # save extra space for label
# add (b) in the left top corner
# reference :https://www.datascience.ch/articles/self-learning-change-urban-housing-street-level
fig.text(0.08, 0.75, '(b)', ha='left', va='top', fontsize=24, color='black', weight='bold')
fig.text(0.45, 0.2, 'Figure3:(a)Statistical Distribution of Annual Revenue for Listings in London.\n(b)Geographical Distribution of Cost-Benefit Ratio for Listings in London', ha='center', va='bottom', fontsize=24)
# save to png,150 dpi
fig.savefig(os.path.join("Images","Listings_profit_ratio.png"), dpi=350)
# The image rendered directly by Quarto would be multi-layered and very slow to reload in the PDF output,
# so we read the rendered picture from disk instead;
# it is exactly the same version as the figure saved above.
plt.close(fig)
#plt.show()
```
![](Images/Listings_profit_ratio.png)
We use several numeric columns to calculate an estimated total income for every listing. Although this is technically an approximation, its distribution is roughly normal (*Figure3a*) and it aligns with the data published by [Inside Airbnb](http://insideairbnb.com/london). We then compare each listing's 'sum_income' with the average for its ward to obtain that listing's 'profit-cost ratio'. Finally, we standardize the data and visualize it on the map (*Figure3b*). The blue areas indicate potential for more profit and more lets, and should be highlighted and read alongside *Figure2b* when branding and promoting.
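The exact income formula lives in the preprocessing notebook; the non-executed sketch below is only one plausible, hypothetical reconstruction, estimating nights booked from recent review counts. The column names (`number_of_reviews_ltm`, `price`) follow the standard Inside Airbnb schema and the review-rate and stay-length parameters are assumptions.

```python
# Hypothetical reconstruction of the income estimate; the formula actually used upstream is in the preprocessing notebook.
def estimate_annual_income(df, review_rate=0.5, avg_stay_nights=3):
    """price * estimated nights booked, with bookings inferred from reviews in the last twelve months."""
    est_bookings = df["number_of_reviews_ltm"] / review_rate       # assumes ~50% of guests leave a review
    est_nights = (est_bookings * avg_stay_nights).clip(upper=365)  # cap at one year
    return df["price"] * est_nights

# Airbnb_Listing["sum_income_est"] = estimate_annual_income(Airbnb_Listing)
# Airbnb_Listing["profitable"] = (Airbnb_Listing["sum_income_est"]
#                                 >= Airbnb_Listing["sum_income_est"].mean()).astype(int)
```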
### 6.4 How does the indicator correlate with textual information?
By using an SVM model to predict the 'profit-cost ratio' from the textual information, we obtain a [trained model](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Processing_Modeling/Word2Vec_Modelling_by_SVM.ipynb) with an accuracy of more than **85%**, which could help the Airbnb platform or the government evaluate listings before they are promoted and recommended to potential renters.
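The classifier itself is trained in the linked notebook; the non-executed sketch below shows the setup we assume, with the listing-level Word2Vec vectors as features and the 'profitable' flag as the label. The hyperparameters are illustrative rather than the notebook's exact values.

```python
# Sketch of the SVM classifier trained in the linked notebook; hyperparameters here are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_profitability_svm(amenity_vectors, profitable_labels):
    X = np.vstack(amenity_vectors)                       # one mean Word2Vec vector per listing
    y = np.asarray(profitable_labels)                    # 1 = above-average income, 0 = below
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))    # the linked notebook reports >85% on its split
    return clf, acc
```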
### 6.5 Summary
After the analysis, we have the key topics and words for better generalization (*Figure1*), the spatial distribution of listing features for better classification (*Figure2*), and the spatial distribution of the 'profit-cost ratio' for better investment (*Figure3*), all of which can be used to inform strategies for Airbnb, landlords, communities and governments (*Figure4*).
```{python}
# It is complicated and not very effective to draw the framework diagram with the graphviz package in Python,
# so we drew it in XML (draw.io) and reload the image remotely from our GitHub.
# The GitHub URL for the XML is: https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/mid_processing_files/outline.drawio.xml
```
![Figure4: Framework diagram linking the textual analysis to strategies for Airbnb, landlords, communities and government](https://github.com/BohaoSuCC/Groupwork_DeskB/blob/main/Images/Framework_Diagram.png?raw=true)
## 7. Drawing on your previous answers, and supporting your response with evidence (e.g. figures, maps, and statistical analysis/models), how *could* this data set be used to inform the regulation of Short-Term Lets (STL) in London?
### 7.1 Short-Term Lets (STL)
In an effort to preserve the city's existing housing supply, the government legalized short-term rentals in London for a maximum of 90 days per calendar year through the [2011 Localism Act](https://www.legislation.gov.uk/ukpga/2011/20/contents/enacted) and the [2015 Deregulation Act](https://www.legislation.gov.uk/ukpga/2015/20/contents/enacted). Nevertheless, a number of studies[@jefferson-jones_can_2015] point out that this regulation is not always adhered to in practice. Most [Airbnb listings](https://www.london.gov.uk/programmes-strategies/housing-and-land/housing-and-land-publications/housing-research-note-short-term-and-holiday-letting-london) (77%) did respect the 90-day limit. Among the listings surpassing the 90-day limit, the [average estimated occupancy](https://commonslibrary.parliament.uk/research-briefings/cbp-8395/) was 145 nights a year; of these lettings, 6,140 (55%) were entire homes and 5,000 (45%) were private rooms. Hence, much of the existing research [@shabrina_airbnb_2022] has focused on the role of Airbnb as the most prominent and prevalent online platform for short-term lets in the UK and internationally.
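As an illustration of how this dataset could support monitoring of the 90-day rule, the non-executed sketch below flags entire-home listings whose estimated booked nights exceed the threshold. The occupancy estimate and the column names are assumptions based on the standard Inside Airbnb schema, not an official methodology.

```python
# Illustrative 90-day check; the occupancy estimate and column names are assumptions.
def flag_over_90_days(df, review_rate=0.5, avg_stay_nights=3):
    est_bookings = df["number_of_reviews_ltm"] / review_rate        # assumed review rate
    est_nights = (est_bookings * avg_stay_nights).clip(upper=365)   # assumed average stay length
    over_limit = df[(est_nights > 90) & (df["room_type"] == "Entire home/apt")]
    return over_limit[["id", "host_id", "neighbourhood_cleansed"]]  # candidates for registration checks

# over = flag_over_90_days(Airbnb_Listing)
# over.groupby("neighbourhood_cleansed").size().sort_values(ascending=False).head(10)
```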
### 7.2 Airbnb Branding
To enhance the Airbnb platform strategically, leveraging text features for branding and recommendation algorithms is crucial. Based on the comparison between *Figure2b* and *Figure3b*, and on the perspectives raised in Question 5, the following strategies could be implemented:
#### 7.2.1 Positive feedback cycle:
In regions with lower occupancy rates, adjusting the recommendation algorithm helps balance occupancy rates across different areas. This proactive approach mitigates property-vacancy concerns and boosts hosts' profitability, thereby fostering a dynamic equilibrium within London's housing market. Moreover, for listings with high rental profitability, additional positive feedback incentivizes competitive listings, creating positive feedback cycles and promoting business operations beneficial to both Airbnb and landlords.
#### 7.2.2 Negative Homogeneous listing:
Considering the potential contribution of housing homogeneity to market distortions [@zhou_asymmetric_2015; @Nieuwland_2018], in areas such as London Bridge & West Bermondsey, where low income rates and highly similar property features coincide, the platform and housing departments should explore incorporating text-based features. By leveraging these features, authorities can identify and filter out homogeneous listings in concentrated areas, as sketched below. This approach could help the platform brand homogeneous properties rationally over time and schedule promotion around different peak demand periods, as well as promote a more balanced housing landscape.
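One way this could be operationalized with the features built in Question 6 is to score each ward by how similar its listings' amenity vectors are to one another; wards with very high average pairwise cosine similarity would be candidates for such filtering. The non-executed sketch below assumes the vector and ward column names, which do not appear verbatim in the earlier cells.

```python
# Illustrative ward-level homogeneity score based on the amenity vectors from Question 6; column names are assumed.
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

def ward_homogeneity(gdf, vector_col="amenities_vector", ward_col="ward_name"):
    scores = {}
    for ward, group in gdf.groupby(ward_col):
        if len(group) < 2:
            continue
        vectors = np.vstack(group[vector_col].values)
        sim = 1 - cdist(vectors, vectors, metric="cosine")         # pairwise cosine similarity
        scores[ward] = sim[np.triu_indices_from(sim, k=1)].mean()  # average over distinct pairs
    return pd.Series(scores).sort_values(ascending=False)          # high values = near-identical amenity mixes

# ward_homogeneity(gdf_listing).head(10)   # e.g. to check wards around London Bridge & West Bermondsey
```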
#### 7.2.3 Airbnb's trade-off:
In pursuing its core business interests, Airbnb undoubtedly seeks to foster a positive cycle by promoting competitive listings to renters [@Hoffman2020]. However, this could inadvertently contribute to homogeneity, counteracting the intended positive cycle [@H_bscher_2022]. Hence, personalized guidance could help hosts in competitive areas modify their amenities and descriptions to enhance their appeal to renters. As discussed in Question 5, once Airbnb holds such valuable textual information, it has a social responsibility to establish a framework for communication and collaboration with hosts and to provide insights into market trends. Overall, the trade-off between promoting competitiveness and maintaining area diversity should be approached flexibly, by implementing a dynamic system that takes into account local preferences, seasonal variations, and emerging trends.
### 7.3 Government Regulatory Options
Furukawa & Onuki's tri-categorical definition [@furukawa_design_2019] indicates that effective policies should be less restrictive for Primary Hosted & Unhosted Short-term lets within appropriate timeframes, while regulating Nonprimary short-term lets more firmly to provide the right incentives to landlords to rent long-term.
#### 7.3.1 Tailored Policies Based on Spatial Distribution Features
Tailoring policies for diverse community types is essential. In high-density areas, consider limiting the addition of new listings to prevent overcrowding. In contrast, for areas with lower occupancy rates, policies can encourage landlords to adopt more proactive occupancy promotion strategies.
#### 7.3.2 Dynamic Policy Adjustments for Supply-Demand Balance
Utilize spatial distribution features to monitor market dynamics and make adjustments based on actual demand. In high-demand areas, policies can be more flexible, encouraging short-term rentals, while in oversupplied regions, stricter policies can reduce vacancy rates. Connect the identified branding opportunities with STL regulations to balance encouraging tourism and preventing negative impacts on housing markets. Regulation should preserve the uniqueness and solve the shortages for areas with distinctive features.
#### 7.3.3 Encouraging Landlord Engagement in Community Development
Airbnb transforms residential communities into tourist spaces and changes the socio-cultural landscape of urban neighborhoods. It specifically propagates the experience of 'living like a local'[@Ferreri_2018], but this consumption of everyday local residential life has implications for the well-being of long-term tenants, including the disruption and erasure of long-term communities and housing insecurity[@Rozena_2021]. Critical urbanists [@Cocola_Gant_2019; @Freytag_2018] have accordingly linked Airbnb to touristification/gentrification - 'Airbnbification'[@T_rnberg_2022]. Governments can consider incentivizing landlords to participate in community development, aiming to increase the 90-day occupancy rate. This not only reduces long-term property vacancies but also fosters community vitality and helps maintain supply-demand equilibrium.
#### 7.3.4 Create a Registration Service to Bridge Gaps in Data
In a context where data limitations constrain research and decision-making outcomes [@Fonda2021], a registration service could provide some of the information necessary to bridge this gap. Utilizing statistical analysis and modelling, regulatory decisions can be evidence-based, considering the unique characteristics of each area. A collaborative effort between cities and Airbnb is suggested for the development of a centralized registration platform. A streamlined online monitoring and fine-collection system could significantly enhance planning authorities' ability to balance housing prices and availability, while also improving community well-being.
## References