AirBNB: Sydney & Melbourne

- The code in this project is written in Python 3.6.6 :: Anaconda custom (64-bit). The following additional libraries have been used:

- nltk for the Vader Sentiment Analyzer. The Vader lexicon has been downloaded with nltk.downloader.download('vader_lexicon')

- wordcloud to generate wordclouds from the text of the reviews

The notebooks use the following imports:

- from nltk.corpus import stopwords # stopwords, also used to detect language

- from nltk import word_tokenize

- from nltk.tokenize import RegexpTokenizer

- from nltk.stem import WordNetLemmatizer

- from gensim.corpora.dictionary import Dictionary

- from gensim.models.tfidfmodel import TfidfModel

- from gensim.models.ldamodel import LdaModel

Remove punctuation and numbers: Punctuation and numbers do not help much in processing the text. If included, they only increase the size of the bag of words created in the last step and reduce the efficiency of the algorithm.

Stemming: Reduce each word to its root.

Convert each word to lower case: It is useless to keep the same word in different cases (e.g. ‘good’ and ‘GOOD’).
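
The three cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: the function name `clean_text` and the optional `stem` argument are assumptions made for this example, and the regular expression keeps only the letters a-z, so accented characters are dropped along with digits and punctuation.

```python
import re

# Assumption: a token is a run of lowercase ASCII letters, so punctuation,
# digits, and non-ASCII characters all act as separators and are discarded.
TOKEN_RE = re.compile(r"[a-z]+")


def clean_text(text, stem=None):
    """Lowercase the text, drop punctuation and numbers, optionally stem."""
    tokens = TOKEN_RE.findall(text.lower())
    if stem is not None:
        # `stem` is any callable mapping a word to its root.
        tokens = [stem(tok) for tok in tokens]
    return tokens


print(clean_text("GOOD location, 5 mins from the station!!"))
# → ['good', 'location', 'mins', 'from', 'the', 'station']
```

For the stemming step, a stemmer such as `nltk.stem.PorterStemmer().stem` could be passed as the `stem` argument; it is left optional here so the sketch runs without nltk installed.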

Analysis Constraints

1) Only the period from January 2010 to December 2015 was considered in the analysis.

2) Only English words with ASCII characters (33-165) were considered, which removed approximately 10,000 records from the data set. This decision was taken because of time constraints and a lack of familiarity with libraries suited to processing non-English languages.

3) Records without a comment were not kept in the data set.

4) No information was available on text character limits for descriptions entered by hosts or comments entered by reviewers.

5) Demographic information was not available for either hosts or reviewers, so no further analysis of authorship could be made.
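
Constraints 1 to 3 amount to a row filter, which might look something like the sketch below. The record layout (a dict with hypothetical `date` and `comments` keys) and the function names are assumptions for illustration; the character-code range is taken from constraint 2 as stated, and whitespace is allowed so that multi-word comments pass. A character check like this does not by itself confirm that a comment is in English.

```python
from datetime import date

# Analysis window from constraint 1.
START, END = date(2010, 1, 1), date(2015, 12, 31)
# Character-code range quoted in constraint 2.
LOW, HIGH = 33, 165


def in_char_range(text, low=LOW, high=HIGH):
    """True if every non-whitespace character falls inside the code range."""
    return all(ch.isspace() or low <= ord(ch) <= high for ch in text)


def keep_record(record):
    """Apply constraints 1-3 to one review record (hypothetical dict layout)."""
    comment = (record.get("comments") or "").strip()
    if not comment:                              # constraint 3: no comment
        return False
    if not (START <= record["date"] <= END):     # constraint 1: time period
        return False
    return in_char_range(comment)                # constraint 2: characters


reviews = [
    {"date": date(2014, 5, 1), "comments": "Great place!"},
    {"date": date(2014, 5, 2), "comments": ""},
    {"date": date(2016, 1, 3), "comments": "Nice host"},
    {"date": date(2013, 7, 9), "comments": "Très bien"},
]
kept = [r for r in reviews if keep_record(r)]
print(len(kept))
```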

sentiment_analysis.ipynb, with the code to analyze the text reviews
Review_ASCII_Negative.csv: 6098 rows, 19 columns after cleaning
Review_ASCII_Positive.csv: 161278 rows, 19 columns after cleaning

Primitive Text Analysis

Description Vs Comment Analysis(Primitive Tokenization).ipynb
topic modeling.ipynb

Processing

Data Dimensioning.ipynb
Exploring.ipynb
scratchpad-Copy1.ipynb
FeatureAdditions.ipynb
Polarity Analysis.ipynb
Test.ipynb
dictionary.csv
dictionary.gensim
language_detect.ipynb
lda.model
lda.model.expElogbeta.npy
lda.model.id2word
lda.model.state

What we did: we used text analysis to understand individual customers' reviews and the root cause of customer complaints.

After cleaning up the data, we ran a sentiment analyzer over each reviewer comment. We then split the scores at 0.25: reviews scoring below 0.25 were treated as negative and reviews scoring above 0.25 as positive.
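
The split at 0.25 can be sketched as below. The scores in the example are made-up illustration values, not results from the data set. In practice they would presumably come from something like `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader`, but which Vader score was used is an assumption here, as is the choice to count a score of exactly 0.25 as positive.

```python
def split_by_polarity(scored_reviews, threshold=0.25):
    """Split (text, score) pairs into negative and positive review lists."""
    negative, positive = [], []
    for text, score in scored_reviews:
        # Assumption: a score exactly at the threshold counts as positive.
        if score >= threshold:
            positive.append(text)
        else:
            negative.append(text)
    return negative, positive


# Hypothetical scores, for illustration only.
scored = [
    ("Lovely flat, great host", 0.86),
    ("Dirty and noisy", -0.62),
    ("It was okay", 0.22),
]
negative, positive = split_by_polarity(scored)
print(negative)  # → ['Dirty and noisy', 'It was okay']
print(positive)  # → ['Lovely flat, great host']
```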

We then ran a topic model over the two sets separately, hoping to gauge what the main talking points of the negative and positive reviews were. However, we did not filter out stop words well enough, so other languages, numbers, punctuation, and unstemmed word variants ('awesome', 'awwwwesome', 'awesomesause') made it through.

With more time, it would have been better to filter out all words in the comments that were not adjectives or nouns. Polarity already gave us the context of the words, so at that point we would be looking at the cause.

To go further into the topic, a POS tag filter would be necessary. A POS tag filter is more about the context of the features than their frequencies. Topic modelling tries to map recurring patterns of terms into topics, but not every term is equally important contextually. For example, the POS tag IN contains terms such as “within”, “upon”, and “except”; CD contains “one”, “two”, “hundred”, etc.; MD contains “may”, “must”, etc. These terms are the supporting words of a language and can be removed by studying their POS tags.
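
A POS tag filter of this kind could be sketched as follows. This was not implemented in the project; the function names are hypothetical, and the input is assumed to be a list of (word, tag) pairs with Penn Treebank tags, which is the shape `nltk.pos_tag` returns. The example sentence is tagged by hand so the sketch runs without a tagger.

```python
# Tags for supporting words named in the text: prepositions, numbers, modals.
DROP_TAGS = {"IN", "CD", "MD"}


def drop_support_words(tagged_tokens, drop_tags=DROP_TAGS):
    """Remove tokens whose POS tag marks them as supporting words."""
    return [word for word, tag in tagged_tokens if tag not in drop_tags]


def keep_nouns_and_adjectives(tagged_tokens):
    """Stricter variant: keep only nouns (NN*) and adjectives (JJ*)."""
    return [word for word, tag in tagged_tokens if tag.startswith(("NN", "JJ"))]


# Hand-tagged example, for illustration only.
tagged = [
    ("one", "CD"), ("must", "MD"), ("stay", "VB"), ("within", "IN"),
    ("walking", "VBG"), ("distance", "NN"), ("lovely", "JJ"), ("beach", "NN"),
]
print(drop_support_words(tagged))
# → ['stay', 'walking', 'distance', 'lovely', 'beach']
print(keep_nouns_and_adjectives(tagged))
# → ['distance', 'lovely', 'beach']
```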

A collocation is a sequence of words that co-occur more often than would be expected by chance. What we want to achieve now is to find the collocations with high importance in the text and display them as the main takeaways of our reviews.
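
One common way to score "more often than chance" is pointwise mutual information (PMI) over adjacent word pairs. The sketch below is an illustration of that idea using only the standard library; PMI, the function name `top_collocations`, and the `min_count` and `top_n` parameters are assumptions for this example rather than the method used in the notebooks.

```python
import math
from collections import Counter


def top_collocations(tokens, min_count=2, top_n=5):
    """Rank adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        p_pair = count / n_bi
        # Probability of the pair if the two words were independent.
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scored.append(((w1, w2), math.log2(p_pair / p_indep)))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


tokens = ("great location great host clean room great location "
          "close to train great location").split()
for pair, pmi in top_collocations(tokens):
    print(pair, round(pmi, 2))
```

A positive PMI means the pair appears together more often than its individual word frequencies would predict. The `min_count` cutoff matters on real review text, where pairs seen only once would otherwise dominate the ranking.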

The Vader Sentiment Analyzer is due to C.J. Hutto and Eric Gilbert, from the paper "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text".

Melbourne Reviews-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

Melbourne Listings-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

Melbourne Calendar-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

Sydney Reviews-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

Sydney Listings-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

Sydney Calendar-IBM Cloud Data Services/Economy & Business
Inside Airbnb:http://creativecommons.org/publicdomain/zero/1.0

AirBNB

Sydney & Melbourne

Data Analytics Bootcamp Project 3

Sydney

Melbourne

Summary

Once we have our typical analytics looking at locations, pricing trends and AirBNB user trends, we take a deeper dive to look into the qualitative aspect of the dataset: consumer reviews.

The Idea?
Is it possible to classify negative and positive reviews based on text?

Text Cleaning & Preprocessing

Installation

Text Cleaning & Preprocessing

File Descriptions

What could be improved

Licensing, Authors, Acknowledgements
