AirBNB

Sydney & Melbourne

Data Analytics Bootcamp Project 3

Raphael Vasquez, Molly Eskelson,

Kabrina Ramnath, Omar Trejo

Summary

Using a dataset of AirBNB reviews from the Australian cities of Sydney and Melbourne, we extract data for market research and apply a machine learning model to the textual analysis of the reviews.

Once we have our typical analytics looking at locations, pricing trends and AirBNB user trends, we take a deeper dive to look into the qualitative aspect of the dataset: consumer reviews.

The Idea?
Is it possible to classify negative and positive reviews based on text?

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such topic model: an unsupervised machine-learning model that takes documents as input and finds topics as output. It builds a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. LDA treats each document as a mixture of topics in certain proportions and each topic as a weighted list of keywords, so the model also reports what percentage of each document is about each topic. Once you give the algorithm the number of topics, it rearranges the topic distribution within the documents and the keyword distribution within the topics until it reaches a good composition of topic-keyword distributions.
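As a rough illustration of the mechanics described above, here is a minimal sketch of LDA in plain Python. It uses collapsed Gibbs sampling, which is a different inference method from the gensim LdaModel that the notebooks import, and it is meant only to show how topic assignments get rearranged. The function name toy_lda, the hyperparameters alpha and beta, and the sample documents are all assumptions made for this example.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustration only)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly re-sample the topic of each token given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Topic proportions per document, and the top words of each topic.
    doc_mix = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    topics = [(+topic_word[k]).most_common(3) for k in range(num_topics)]
    return doc_mix, topics


# Made-up, already tokenized reviews.
sample_docs = [
    ["clean", "room", "clean", "bed"],
    ["noisy", "street", "noisy", "night"],
    ["clean", "bed", "quiet", "room"],
]
doc_mix, topics = toy_lda(sample_docs, num_topics=2)
for proportions in doc_mix:
    print([round(p, 2) for p in proportions])
print(topics)
```

Each row printed for doc_mix is one document's share of each topic, and each entry of topics is a short weighted word list, which matches the two outputs described above.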


The dataset comprises three main tables:

  • listings - Detailed listings data showing 96 attributes for each of the listings. Some of the attributes used in the analysis are price (continuous), longitude (continuous), latitude (continuous), listing_type (categorical), is_superhost (categorical), neighbourhood (categorical) and ratings (continuous), among others.
  • reviews - Detailed reviews given by the guests, with 6 attributes. Key attributes include date (datetime), listing_id (discrete), reviewer_id (discrete) and comment (textual).
  • calendar - Provides booking details for the next year by listing. Four attributes in total: listing_id (discrete), date (datetime), available (categorical) and price (continuous).

Text Cleaning & Preprocessing

  • Remove punctuation and numbers: punctuation and numbers don't help much in processing the text. If included, they just increase the size of the bag of words that we create in the last step and reduce the efficiency of the algorithm.
  • Stemming: reduce each word to its root.
  • Convert each word to lower case: it is useless to keep the same word in different cases (e.g. ‘good’ and ‘GOOD’).
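
The three cleaning steps listed above can be sketched as follows. This is a minimal standard-library example, not the code from the notebooks: the helper names clean_review and naive_stem are hypothetical, and the suffix stripper is a deliberately crude stand-in for a real stemmer or lemmatizer (the project imports nltk's WordNetLemmatizer).

```python
import re


def naive_stem(word):
    """Very crude suffix stripper, used only to illustrate the idea of stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text):
    """Lower-case the text, drop punctuation and digits, then stem each token."""
    text = text.lower()
    # Replace anything that is not a lower-case letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [naive_stem(token) for token in text.split()]


print(clean_review("GOOD location, 5 stars!!"))
```

Lower-casing happens first so that 'good' and 'GOOD' end up as the same token before anything is counted.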

Now we can visualize the distributions of the sentiment indicators for both datasets. We don't expect major differences between Sydney and Melbourne.

The code in this project is written in Python 3.6.6 :: Anaconda custom (64-bit). The following additional libraries have been used:

  • nltk for the Vader Sentiment Analyzer, tokenization, stopwords and lemmatization. The Vader lexicon is downloaded with nltk.downloader.download('vader_lexicon').
  • wordcloud to generate word clouds from the text of the reviews.
  • gensim for the dictionary, TF-IDF and LDA models.

The notebooks use the following imports:

    from nltk.corpus import stopwords  # stopwords to detect language
    from nltk import word_tokenize
    from nltk.tokenize import RegexpTokenizer
    from nltk.stem import WordNetLemmatizer
    from gensim.corpora.dictionary import Dictionary
    from gensim.models.tfidfmodel import TfidfModel
    from gensim.models.ldamodel import LdaModel


Analysis Constraints

  1) Only the period from January 2010 to December 2015 was considered in the analysis.
  2) Only English text made up of ASCII characters (33-165) was considered, which removed approximately 10,000 records from the data set. This decision was made because of time constraints and a lack of familiarity with libraries for processing non-English languages.
  3) Records without a comment were not kept in the data set.
  4) No information was available on character limits for descriptions written by hosts or comments written by reviewers.
  5) Demographic information was not available for either hosts or reviewers, so no further analysis of authorship could be made.
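
Constraints 1 to 3 above amount to a row filter, which could look roughly like the sketch below. The function names, the sample rows and the exact character range are assumptions: the sketch accepts standard printable ASCII (codes 33 to 126) plus ordinary whitespace, which is narrower than the 33-165 range quoted above, and the notebooks may have implemented the check differently.

```python
from datetime import date

ALLOWED_WHITESPACE = " \t\r\n"


def is_plain_ascii(text):
    """True if every character is printable ASCII or ordinary whitespace."""
    return all(ch in ALLOWED_WHITESPACE or 33 <= ord(ch) <= 126 for ch in text)


def keep_review(comment, review_date,
                start=date(2010, 1, 1), end=date(2015, 12, 31)):
    """Apply the comment, date-range and ASCII constraints to one review."""
    if comment is None or not comment.strip():
        return False  # constraint 3: no comment
    if not (start <= review_date <= end):
        return False  # constraint 1: outside the analysis period
    return is_plain_ascii(comment)  # constraint 2: ASCII-only text


# Made-up rows of (comment, review date).
rows = [
    ("Great place!", date(2014, 5, 1)),
    ("Très bien", date(2014, 5, 1)),
    ("", date(2014, 5, 1)),
    ("Nice host", date(2016, 1, 1)),
]
kept = [comment for comment, when in rows if keep_review(comment, when)]
print(kept)
```

An ASCII check is only a proxy for "English": it drops accented and non-Latin text, but it cannot tell English from other languages written in plain ASCII.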

The Jupyter notebooks and supporting files included in this project are:

  • sentiment_analysis.ipynb, with the code to analyze the text reviews
  • Review_ASCII_Negative.csv, with 6098 rows and 19 columns after cleaning
  • Review_ASCII_Positive.csv, with 161278 rows and 19 columns after cleaning
  • Primative Text Analysis
      • Description Vs Comment Analysis(Primitive Tokenization).ipynb
      • topic modeling.ipynb
  • Processing
      • Data Dimensioning.ipynb
      • Exploring.ipynb
      • scratchpad-Copy1.ipynb
      • FeatureAdditions.ipynb
      • Polarity Analysis.ipynb
      • Test.ipynb
      • dictionary.csv
      • dictionary.gensim
      • language_detect.ipynb
      • lda.model
      • lda.model.expElogbeta.npy
      • lda.model.id2word
      • lda.model.state

What we did: we used text analysis to understand individual customers' reviews and the root causes of customer complaints.


After cleaning up the data, we ran a sentiment analyzer over each reviewer comment. We then split the comments at a score of 0.25: scores below 0.25 were treated as negative reviews and scores above 0.25 as positive reviews.
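
The split described above can be sketched as a small function. The name split_by_sentiment and the sample scores are hypothetical, and the input is assumed to be one sentiment score per comment (for example the compound score from nltk's Vader analyzer, which is an assumption here). The text does not say which side a score of exactly 0.25 falls on, so this sketch treats it as negative.

```python
def split_by_sentiment(scored_reviews, threshold=0.25):
    """Split (comment, score) pairs into negative and positive comment lists."""
    negative, positive = [], []
    for comment, score in scored_reviews:
        if score > threshold:
            positive.append(comment)
        else:
            # Assumption: a score exactly at the threshold counts as negative.
            negative.append(comment)
    return negative, positive


# Made-up comments with made-up sentiment scores.
scored = [("Loved it", 0.9), ("Dirty room", -0.6), ("It was fine", 0.25)]
negative, positive = split_by_sentiment(scored)
print(negative)
print(positive)
```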


We then ran a topic model over the two sets separately, hoping to gain some insight into the main talking points of the negative and positive reviews. However, we did not do a good enough job of filtering out the stop words, so other languages, numbers, punctuation and unnormalized word variants ('awesome', 'awwwwesome', 'awesomesause') made it through.


If we had more time, it would have been better to filter out every word in the comments that was not an adjective or a noun. We already had the sentiment context of the words from the polarity scores, so at that point we would be looking at what the cause is.
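
Keeping only nouns and adjectives, as suggested above, could look roughly like this. The sketch assumes the tokens have already been tagged with Penn Treebank tags, for instance by nltk.pos_tag. The function name and the tagged sample are made up for illustration.

```python
# Penn Treebank tags for nouns start with "NN" and adjectives with "JJ".
KEEP_PREFIXES = ("NN", "JJ")


def keep_nouns_and_adjectives(tagged_tokens):
    """Return only the words whose POS tag marks a noun or an adjective."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEEP_PREFIXES)]


# Hypothetical output of a POS tagger for a short review fragment.
tagged = [
    ("the", "DT"), ("noisy", "JJ"), ("street", "NN"), ("within", "IN"),
    ("two", "CD"), ("blocks", "NNS"), ("may", "MD"), ("bother", "VB"),
]
print(keep_nouns_and_adjectives(tagged))
```

Filtering by tag prefix keeps plural nouns (NNS), proper nouns (NNP) and comparative adjectives (JJR) without listing every tag separately.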


  • To go further into the topic, a POS (part-of-speech) tag filter would be necessary. A POS tag filter is more about the context of the features than their frequencies. Topic modeling tries to map recurring patterns of terms into topics, but not every term is equally important contextually. For example, the POS tag IN contains terms such as “within”, “upon” and “except”; CD contains “one”, “two”, “hundred”, etc.; MD contains “may”, “must”, etc. These terms are the supporting words of a language and can be removed by inspecting their POS tags.
  • A collocation is a sequence of words that co-occur more often than would be expected by chance. What we want to achieve now is to find collocations that have a high importance in the text and display them as the main takeaways of our reviews.
  • The Vader Sentiment Analyzer is due to C.J. Hutto and Eric Gilbert, from the paper "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text".
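
To make the collocation idea above concrete, here is a minimal sketch that scores adjacent word pairs by pointwise mutual information (PMI), one common way to measure "more often than expected by chance". The function name, the min_count cutoff and the sample tokens are assumptions. nltk also ships collocation finders that could be used instead.

```python
import math
from collections import Counter


def top_collocations(tokens, min_count=2, top_n=5):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_independent = (unigrams[first] / n_tokens) * (unigrams[second] / n_tokens)
        scored.append(((first, second), math.log2(p_pair / p_independent)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Made-up token stream from a handful of positive reviews.
tokens = "great location great location clean room near great beach".split()
collocations = top_collocations(tokens)
print(collocations)
```

A positive PMI means the pair appears together more often than its individual word frequencies would predict.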

  • Melbourne Reviews - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0

  • Melbourne Listings - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0

  • Melbourne Calendar - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0

  • Sydney Reviews - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0

  • Sydney Listings - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0

  • Sydney Calendar - IBM Cloud Data Services / Economy & Business
    Inside Airbnb: http://creativecommons.org/publicdomain/zero/1.0