
Devfest Data Science Track 2019

To run this without any environment setup, go here (this might take a minute or so to load).

Requirements for this track: Basic knowledge of Python.

What do I need to get started?

Before we get started, we have to set our environment up. This guide was written in Python 3.6. If you haven’t already, download the latest version of Python (https://www.anaconda.com/download) and Pip. Once you have Python and Pip installed, launch a Jupyter notebook.

Once you have your notebook up and running, you can download all the data (yelp.csv) from GitHub. Make sure you have the data in the same directory as your notebook and then we’re good to go!

A Quick Note on Jupyter

For those of you who are unfamiliar with Jupyter notebooks, here is a brief review of the functions that will be particularly useful for moving along with this tutorial.

In the image below, you’ll see three buttons labeled 1-3 that will be important for you to get a grasp of – the save button (1), add cell button (2), and run cell button (3).

#ignore this code; only used for image
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "https://camo.githubusercontent.com/722c3d4565c0378b0b2902181281a76dbcf2506d/68747470733a2f2f7777772e7477696c696f2e636f6d2f626c6f672f77702d636f6e74656e742f75706c6f6164732f323031372f30392f717769674b704f7370683332416377524e4247415079506638383565736f346e534f756e677a4845614a35635a6365454836523941774e395a5169315558324b3444574b324e7676515941356e61704f497a2d7063666736597a644371534e475155507639625231706f4a365064336e5572546f5a314450337752485a6869455f446246624c737a2e706e67")

The first button is the button you’ll use to save your work as you go along (1). Feel free to choose when to save your work.

Next, we have the “add cell” button (2). Cells are blocks of code that you can run together. They are the building blocks of Jupyter Notebook because they let you run code incrementally without having to run everything at once. Throughout this tutorial, you’ll see lines of code blocked off – each one should correspond to a cell.

Lastly, there’s the “run cell” button (3). Jupyter Notebook doesn’t automatically run your code for you; you have to tell it when by clicking this button. As with the add cell button, once you’ve written each block of code from this tutorial into a cell, you should run it to see the output (if any). Any expected output is also shown in this tutorial so you know what to expect. Make sure to run your code as you go along, because many blocks of code in this tutorial rely on previous cells.

Background

You’ve likely heard the phrase ‘data science’ at some point in your life, whether in the news, in a statistics or computer science course, or during your walk over to Ferris for lunch. To demystify the term, let’s first ask ourselves what we mean by data.

Data is another ambiguous term, but more so because it can encompass so much. Anything that can be collected or transcribed can be data, whether it’s numerical, text, images, sounds, anything!

What is Data Science?

Data Science is where computer science and statistics intersect to solve problems involving sets of data. This can be simple statistical analysis to compute means, medians, standard deviations for a numerical dataset, but it can also mean creating robust algorithms.

In other words, it’s taking techniques developed in the areas of statistics and math and using them to learn from some sort of data source.

Is data science the same as machine learning?

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. While they do have overlap, they are not the same! The ultimate goal of data science is to use data for some sort of insight, and that can often include learning how to do prediction from historical data. But it’s not the full picture. Visualization, data acquisition and storage are just as important as using machine learning to “predict the future.”

Why is Data Science important?

Data Science has so much potential! By using data in creative and innovative ways, we can gain a lot of insight on the world, whether that be in economics, biology, sociology, math - any topic you can think of, data science has its own unique application.


Data

Our data contains 10,000 reviews, with the following information for each one:

Data Info
business_id   ID of the business being reviewed.
date          Date the review was posted.
review_id     ID of the posted review.
stars         The 1-5 star rating given to the business.
text          Text of the review.
type          Type of text.
user_id       ID of the user who posted the review.
cool, useful, funny   Votes the review received from other users.

Importing the dataset

Firstly, let’s import the necessary Python libraries. NLTK is pretty much the standard Python library for text processing, and it has many useful features. Today, we will just use NLTK for stopword removal.

In your terminal: conda install -c anaconda nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords

# Should return True if the package was successfully downloaded
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /home/alan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

True

A DataFrame is a two-dimensional data structure: data is aligned in a tabular fashion in rows and columns. Import the Yelp reviews CSV file and store it in a Pandas DataFrame called yelp. A Comma-Separated Values (CSV) file is a plain-text file that contains a list of data; these files are often used for exchanging data between different applications.
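As a quick illustration of the CSV-to-dataframe step (using made-up data, not the Yelp file), read_csv can also parse CSV text straight from an in-memory buffer:

```python
import io
import pandas as pd

# A tiny CSV "file" held in memory -- illustrative data only
csv_text = "stars,text\n5,Great food\n1,Terrible service\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2): two rows, two columns
```
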

yelp = pd.read_csv('yelp.csv')

Let’s get some basic information about the data. The .shape attribute tells us the number of rows and columns in the dataframe.

yelp.shape
(10000, 10)

We can learn more using .head(), and .describe().

head() - This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.

describe() - Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

yelp.head()
business_id date review_id stars text type user_id cool useful funny
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0
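To see what describe() produces, here is a sketch on a small made-up frame (the Yelp output has the same row labels, just different numbers and more columns):

```python
import pandas as pd

# Illustrative frame, not the Yelp data
toy = pd.DataFrame({'stars': [1, 5, 5, 4, 5]})

# Prints count, mean, std, min, quartiles, and max for each numeric column
print(toy.describe())
```
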

To get insight into the length of each review, we can create a new column in yelp called text length. This column will store the number of characters in each review.

yelp['text length'] = yelp['text'].str.len()

We can now see the text length column in our dataframe. Here I used Jupyter notebook’s pretty-printing (default print format) of dataframes, because it shows you the beginning and end, which can be useful when you have sorted data.

yelp
business_id date review_id stars text type user_id cool useful funny text length
0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 My wife took me here on my birthday for breakf... review rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 889
1 ZRJwVLyzEJq1VAihDhYiow 2011-07-27 IjZ33sJrzXqU-0X6U8NwyA 5 I have no idea why some people give bad review... review 0a2KyEL0d3Yb1V6aivbIuQ 0 0 0 1345
2 6oRAC4uyJCsJl1X0WZpVSA 2012-06-14 IESLBzqUCLdSzSqm0eCSxQ 4 love the gyro plate. Rice is so good and I als... review 0hT2KtfLiobPvh6cDC8JQg 0 1 0 76
3 _1QQZuf4zZOyFCvXc0o6Vg 2010-05-27 G-WvGaISbqqaMHlNnByodA 5 Rosie, Dakota, and I LOVE Chaparral Dog Park!!... review uZetl9T0NcROGOyFfughhg 1 2 0 419
4 6ozycU1RpktNG2-1BroVtw 2012-01-05 1uJFq2r5QfJG_6ExMRCaGw 5 General Manager Scott Petello is a good egg!!!... review vYmM4KTsC8ZfQBg-j5MWkw 0 0 0 469
5 -yxfBYGB6SEqszmxJxd97A 2007-12-13 m2CKSsepBCoRYWxiRUsxAg 4 Quiessence is, simply put, beautiful. Full wi... review sqYN3lNgvPbPCTRsMFu27g 4 3 1 2094
6 zp713qNhx8d9KCJJnrw1xA 2010-02-12 riFQ3vxNpP4rWLk_CSri2A 5 Drop what you're doing and drive here. After I... review wFweIWhv2fREZV_dYkz_1g 7 7 4 1565
7 hW0Ne_HTHEAgGF1rAdmR-g 2012-07-12 JL7GXJ9u4YMx7Rzs05NfiQ 4 Luckily, I didn't have to travel far to make m... review 1ieuYcKS7zeAv_U15AB13A 0 1 0 274
8 wNUea3IXZWD63bbOQaOH-g 2012-08-17 XtnfnYmnJYi71yIuGsXIUA 4 Definitely come for Happy hour! Prices are ama... review Vh_DlizgGhSqQh4qfZ2h6A 0 0 0 349
9 nMHhuYan8e3cONo3PornJA 2010-08-11 jJAIXA46pU1swYyRCdfXtQ 5 Nobuo shows his unique talents with everything... review sUNkXg8-KFtCMQDV6zRzQg 0 1 0 186
10 AsSCv0q_BWqIe3mX2JqsOQ 2010-06-16 E11jzpKz9Kw5K7fuARWfRw 5 The oldish man who owns the store is as sweet ... review -OMlS6yWkYjVldNhC31wYg 1 3 1 298
11 e9nN4XxjdHj4qtKCOPq_vg 2011-10-21 3rPt0LxF7rgmEUrznoH22w 5 Wonderful Vietnamese sandwich shoppe. Their ba... review C1rHp3dmepNea7XiouwB6Q 1 1 0 321
12 h53YuCiIDfEFSJCQpk8v1g 2010-01-11 cGnKNX3I9rthE0-TH24-qA 5 They have a limited time thing going on right ... review UPtysDF6cUDUxq2KY-6Dcg 1 2 0 433
13 WGNIYMeXPyoWav1APUq7jA 2011-12-23 FvEEw1_OsrYdvwLV5Hrliw 4 Good tattoo shop. Clean space, multiple artist... review Xm8HXE1JHqscXe5BKf0GFQ 1 2 0 593
14 yc5AH9H71xJidA_J2mChLA 2010-05-20 pfUwBKYYmUXeiwrhDluQcw 4 I'm 2 weeks new to Phoenix. I looked up Irish ... review JOG-4G4e8ae3lx_szHtR8g 1 1 0 1206
15 Vb9FPCEL6Ly24PNxLBaAFw 2011-03-20 HvqmdqWcerVWO3Gs6zbrOw 2 Was it worth the 21$ for a salad and small piz... review ylWOj2y7TV2e3yYeWhu2QA 0 2 0 705
16 supigcPNO9IKo6olaTNV-g 2008-10-12 HXP_0Ul-FCmA4f-k9CqvaQ 3 We went here on a Saturday afternoon and this ... review SBbftLzfYYKItOMFwOTIJg 3 4 2 1469
17 O510Re68mOy9dU490JTKCg 2010-05-03 j4SIzrIy0WrmW4yr4--Khg 5 okay this is the best place EVER! i grew up sh... review u1KWcbPMvXFEEYkZZ0Yktg 0 0 0 363
18 b5cEoKR8iQliq-yT2_O0LQ 2009-03-06 v0cTd3PNpYCkTyGKSpOfGA 3 I met a friend for lunch yesterday. \n\nLoved ... review UsULgP4bKA8RMzs8dQzcsA 5 6 4 1161
19 4JzzbSbK9wmlOBJZWYfuCg 2011-11-17 a0lCu-j2Sk_kHQsZi_eNgw 4 They've gotten better and better for me in the... review nDBly08j5URmrHQ2JCbyiw 1 1 1 726
20 8FNO4D3eozpIjj0k3q5Zbg 2008-10-08 MuqugTuR5DdIPcZ2IVP3aQ 3 DVAP....\n\nYou have to go at least once in yo... review C6IOtaaYdLIT5fWd7ZYIuA 2 4 1 565
21 tdcjXyFLMKAsvRhURNOkCg 2011-06-28 LmuKVFh03Uz318VKnUWrxA 5 This place shouldn't even be reviewed - becaus... review YN3ZLOdg8kpnfbVcIhuEZA 1 1 2 104
22 eFA9dqXT5EA_TrMgbo03QQ 2011-07-13 CQYc8hgKxV4enApDkx0IhA 5 first time my friend and I went there... it wa... review 6lg55RIP23VhjYEBXJ8Njw 0 0 0 148
23 IJ0o6b8bJFAbG6MjGfBebQ 2010-09-05 Dx9sfFU6Zn0GYOckijom-g 1 U can go there n check the car out. If u wanna... review zRlQEDYd_HKp0VS3hnAffA 0 1 1 594
24 JhupPnWfNlMJivnWB5druA 2011-05-22 cFtQnKzn2VDpBedy_TxlvA 5 I love this place! I have been coming here for... review 13xj6FSvYO0rZVRv5XZp4w 0 1 0 294
25 wzP2yNpV5p04nh0injjymA 2010-05-26 ChBeixVZerfFkeO0McdlbA 4 This place is great. A nice little ole' fashi... review rLtl8ZkDX5vH5nAx9C3q5Q 0 0 0 1012
26 qjmCVYkwP-HDa35jwYucbQ 2013-01-03 kZ4TzrVX6qeF0OvrVTGVEw 5 I love love LOVE this place. My boss (who is i... review fpItLlgimq0nRltWOkuJJw 0 0 0 921
27 wct7rZKyZqZftzmAU-vhWQ 2008-03-21 B5h25WK28rJjx4KHm4gr7g 4 Not that my review will mean much given the mo... review RRTraCQw77EU4yZh0BBTag 2 4 1 550
28 vz2zQQSjy-NnnKLZzjjoxA 2011-03-30 Y_ERKao0J5WsRiCtlKSNSA 4 Came here for breakfast yesterday, it had been... review EP3cGJvYiuOwumerwADplg 1 1 1 1011
29 i213sY5rhkfCO8cD-FPr1A 2012-07-12 hre97jjSwon4bn1muHKOJg 4 Always reliably good. Great beer selection as... review kpbhy1zPewGDmdNfNqQp-g 0 1 0 225
... ... ... ... ... ... ... ... ... ... ... ...
9970 R6aazv8FB-6BeanY3ag8kw 2009-09-26 gP17ykqduf3AlewSaRb61w 5 This place is super cute lunch joint. I had t... review mtoKqaQjGPWEc5YZbrYV9w 0 0 0 432
9971 JOZqBKIOB8WEBAWm7v1JFA 2008-07-22 QI9rfeWrZnvK5ojz8cEoRg 5 The staff is great, the food is great, even th... review uBAMd01ZtGXaHrRD6THNzg 1 2 1 318
9972 OllL0G9Kh_k1lx-2vrFDXQ 2012-10-23 U23UfuxN9DpAU0Dslc5KjQ 4 Yay, even though I miss living in Coronado I a... review Gh1EXuS42DY3rV_MzFpJpg 0 0 0 411
9973 XHr5mXFgobOHoxbPJxmYdg 2009-09-28 udMiWjeG0OGcb4nNddDkBg 5 Wow! Went on a Sunday around 11am - busy but ... review yRYNx24kUDRRBfJu1Rcojg 0 0 0 353
9974 cdacUBBL2tDbDnB1EfhpQw 2009-12-16 bVU-_x9ijxjEImNluy84OA 2 If Cowboy Ciao is the best restaurant in Scott... review V9Uqt00HXwXT6mzsVCjMAw 0 0 0 473
9975 EWMwV5V9BxNs_U6nNVMeqw 2007-10-20 g4LsVAoafmUDHiS-_yN4tA 5 When I lived in Phoenix, I was a regular at Fe... review TLj3XaclA7V4ldJ5yNP-9Q 1 1 0 1015
9976 iDYzGVIF1TDWdjHNgNjCVw 2009-09-11 bKjMcpNj0xSu2UI2EFQn1g 3 I was looking for chile rellenos and this plac... review 2tUCLMHQKz4kA1VlRB_w0Q 0 0 0 465
9977 iDYzGVIF1TDWdjHNgNjCVw 2012-10-30 qaNZyCUJA6Yp0mvPBCknPQ 5 Why did I wait so long to try this neighborhoo... review Id-8-NMEKxeXBR44eUdDeA 3 6 3 2918
9978 9Y3aQAVITkEJYe5vLZr13w 2010-04-01 ZoTUU6EJ1OBNr7mhqxHBLw 5 This is the place for a fabulos breakfast!! I ... review vasHsAZEgLZGJDTlIweUYQ 0 1 0 493
9979 GV1P1x9eRb4iZHCxj5_IjA 2012-12-07 eVUs1C4yaVJNrc7SGTAheg 5 Highly recommend. This is my second time here ... review bJFdmJJxfXgCYA5DMmyeqQ 2 2 1 244
9980 GHYOl_cnERMOhkCK_mGAlA 2011-07-03 Q-y3jSqccdytKxAyo1J0Xg 5 5 stars for the great $5 happy hour specials. ... review xZvRLPJ1ixhFVomkXSfXAw 6 6 4 393
9981 AX8lx9wHNYT45lyd7pxaYw 2008-11-27 IyunTh7jnG7v3EYwfF3hPw 5 We brought the entire family to Giuseppe's las... review fczQCSmaWF78toLEmb0Zsw 10 9 5 885
9982 KV-yJLmlODfUG1Mkds6kYw 2012-02-25 rIgZgxJPWTacq3mV6DfWfg 4 Best corned beef sandwich I've had anywhere at... review J-oVr0th2Y7ltPPOwy0Z8Q 0 0 0 240
9983 24V8QQWO6VaVggHdxjQQ_A 2010-06-06 PqiIeFOiVr-tj_FtHGAH2g 3 3.5 stars. \n\nWe decided to check this place ... review LaEj3VpQh7bgpAZLzSRRrw 1 4 1 861
9984 wepFVY82q_tuDzG6lQjHWw 2012-02-12 spusZYROtBKw_5tv3gYm4Q 1 Went last night to Whore Foods to get basics t... review W7zmm1uzlyUkEqpSG7PlBw 0 1 2 1673
9985 EMGkbiCMfMTflQux-_JY7Q 2012-10-17 wB-f0xfx7WIyrOsRJMkDOg 4 Awesome food! Little pricey but delicious. Lov... review 9MJAacmjxtctbI3xncsK5Q 0 0 0 68
9986 oCA2OZcd_Jo_ggVmUx3WVw 2012-03-31 ijPZPKKWDqdWOIqYkUsJJw 4 I came here in December and look forward to my... review yzwPJdn6yd2ccZqfy4LhUA 0 0 0 647
9987 r-a-Cn9hxdEnYTtVTB5bMQ 2012-04-07 j9HwZZoBBmJgOlqDSuJcxg 1 The food is delicious. The service: discrimi... review toPtsUtYoRB-5-ThrOy2Fg 0 0 0 200
9988 xY1sPHTA2RGVFlh5tZhs9g 2012-06-02 TM8hdYqs5Zi1jO5Yrq6E0g 4 For our first time we had a great time! Our se... review GvaNZY4poCcd3H4WxHjrLQ 0 2 0 496
9989 mQUC-ATrFuMQSaDQb93Pug 2011-10-01 ta2P9joJqeFB8BzFp-AzjA 5 Great food and service! Country food at its best! review fKaO8fR1IAcfvZb6cBrs2w 0 1 0 49
9990 R8VwdLyvsp9iybNqRvm94g 2011-10-03 pcEeHdAJPoFNF23es0kKWg 5 Yes I do rock the hipster joints. I dig this ... review b92Y3tyWTQQZ5FLifex62Q 1 1 1 263
9991 WJ5mq4EiWYAA4Vif0xDfdg 2011-12-05 EuHX-39FR7tyyG1ElvN1Jw 5 Only 4 stars? \n\n(A few notes: The folks that... review hTau-iNZFwoNsPCaiIUTEA 1 1 0 908
9992 f96lWMIAUhYIYy9gOktivQ 2009-03-10 YF17z7HWlMj6aezZc-pVEw 5 I'm not normally one to jump at reviewing a ch... review W_QXYA7A0IhMrvbckz7eVg 2 3 2 1326
9993 maB4VHseFUY2TmPtAQnB9Q 2011-06-27 SNnyYHI9rw9TTltVX3TF-A 4 Judging by some of the reviews, maybe I went o... review T46gxPbJMWmlLyr7GxQLyQ 1 1 0 426
9994 L3BSpFvxcNf3T_teitgt6A 2012-03-19 0nxb1gIGFgk3WbC5zwhKZg 5 Let's see...what is there NOT to like about Su... review OzOZv-Knlw3oz9K5Kh5S6A 1 2 1 1968
9995 VY_tvNUCCXGXQeSvJl757Q 2012-07-28 Ubyfp2RSDYW0g7Mbr8N3iA 3 First visit...Had lunch here today - used my G... review _eqQoPtQ3e3UxLE4faT6ow 1 2 0 668
9996 EKzMHI1tip8rC1-ZAy64yg 2012-01-18 2XyIOQKbVFb6uXQdJ0RzlQ 4 Should be called house of deliciousness!\n\nI ... review ROru4uk5SaYc3rg8IU7SQw 0 0 0 881
9997 53YGfwmbW73JhFiemNeyzQ 2010-11-16 jyznYkIbpqVmlsZxSDSypA 4 I recently visited Olive and Ivy for business ... review gGbN1aKQHMgfQZkqlsuwzg 0 0 0 1425
9998 9SKdOoDHcFoxK5ZtsgHJoA 2012-12-02 5UKq9WQE1qQbJ0DJbc-B6Q 2 My nephew just moved to Scottsdale recently so... review 0lyVoNazXa20WzUyZPLaQQ 0 0 0 880
9999 pF7uRzygyZsltbmVpjIyvw 2010-10-16 vWSmOhg2ID1MNZHaWapGbA 5 4-5 locations.. all 4.5 star average.. I think... review KSBFytcdjPKZgXKQnYQdkA 0 0 0 461

10000 rows × 11 columns

Exploring the dataset and Creating Visualizations

Let’s visualise the data a little more by plotting some graphs with the Seaborn library. Seaborn’s FacetGrid allows us to create a grid of histograms placed side by side. We can use FacetGrid to see if there’s any relationship between our newly created text length feature and the stars rating.

g = sns.FacetGrid(data=yelp, col='stars')
g.map(plt.hist, 'text length', bins=50)

Seems like overall, the distribution of text length is similar across all five ratings. However, the number of text reviews seems to be skewed a lot higher towards the 4-star and 5-star ratings (that is - there are more 4 and 5-star ratings). This may cause some issues later on in the process.

Next, let’s create a box plot of the text length for each star rating. The advantage of a box plot is in comparing distributions across many different groups all at once.

sns.boxplot(x='stars', y='text length', data=yelp)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9459243048>


From the plot, looks like the 1-star and 2-star ratings have much longer text, but there are many outliers (which can be seen as points above the boxes). An outlier can distort results, such as dragging the mean in a certain direction, and can lead to faulty conclusions being made. Because of this, maybe text length won’t be such a useful feature to consider after all.

Correlation is used to test relationships between quantitative variables or categorical variables. In other words, it’s a measure of how things are related. Let’s group the data by the star rating, and see if we can find a correlation among the user comments: cool, useful, and funny. We can use the .corr() method from Pandas to find any correlations in the dataframe.

A “groupby” operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and compute operations on these groups.
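The split-apply-combine pattern can be seen on a tiny made-up frame (not the Yelp data):

```python
import pandas as pd

# Illustrative data: four reviews, two star levels
toy = pd.DataFrame({'stars': [1, 1, 5, 5],
                    'useful': [2, 4, 1, 3]})

# Split rows by 'stars', apply mean() to each group, combine into one frame
# Mean 'useful' votes: stars 1 -> 3.0, stars 5 -> 2.0
print(toy.groupby('stars').mean())
```
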

stars = yelp.groupby('stars').mean()
stars.corr()
cool useful funny text length
cool 1.000000 -0.743329 -0.944939 -0.857664
useful -0.743329 1.000000 0.894506 0.699881
funny -0.944939 0.894506 1.000000 0.843461
text length -0.857664 0.699881 0.843461 1.000000

To visualise these correlations, we can use Seaborn’s heatmap. A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors.

sns.heatmap(data=stars.corr(), annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x7f9458901828>


Looking at the map, funny is strongly correlated with useful, and useful seems strongly correlated with text length. We can also see a negative correlation between cool and the other three features.

Independent and dependent variables

Our task is to predict if a review is either bad or good, so let’s just grab reviews that are either 1 or 5 stars from the yelp dataframe. We can store the resulting reviews in a new dataframe called yelp_class.

yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]
yelp_class.shape
(4086, 11)

We can see from .shape that yelp_class only has 4086 reviews, compared to the 10,000 reviews in the original dataset. This is because we aren’t taking into account the reviews rated 2, 3, and 4 stars.

In machine learning and statistics, classification is a supervised learning approach in which a program learns from labelled input data and then uses that learning to classify new observations. The task may be bi-class (like identifying whether an email is spam or non-spam) or multi-class.

Next, let’s create the X and y for our classification task. X will be the text column of yelp_class, and y will be the stars column.

X = yelp_class['text']
y = yelp_class['stars']

Text pre-processing

The main issue with our data is that it is all in plain-text format.

X[0]
'My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.\n\nDo yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I\'ve ever had.  I\'m pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.\n\nWhile EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best "toast" I\'ve ever had.\n\nAnyway, I can\'t wait to go back!'

The classification algorithm will need some sort of feature vector in order to perform the classification task. The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word in a text will be represented by one number.

A feature vector is just a vector that contains information describing an object’s important characteristics. In image processing, features can take many forms. For example: a simple feature representation of an image is the raw intensity value of each pixel.
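A hand-rolled sketch of the bag-of-words idea (made-up documents, no library): build a vocabulary of unique words, then represent each document as a vector of word counts.

```python
from collections import Counter

# Two tiny "reviews" -- illustrative only
docs = ["good food good service", "bad food"]

# Vocabulary: every unique word across the corpus, in a fixed order
vocab = sorted(set(" ".join(docs).split()))   # ['bad', 'food', 'good', 'service']

# Each document becomes a vector of counts over the vocabulary
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vectors)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

CountVectorizer, used below, does essentially this (plus tokenisation options) at scale.
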

First, let’s write a function that will split a message into its individual words, and return a list. We will also remove the very common words (such as “the”, “a”, “an”, etc.), also known as stopwords. To do this, we can take advantage of the NLTK library. The function below removes punctuation, stopwords, and returns a list of the remaining words, or tokens.

import string

def text_process(text):
   
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]

To check if the function works, let’s pass in some random text and see if it gets processed correctly.

sample_text = "Hey there! This is a sample review, which happens to contain punctuations."
text_process(sample_text)
['Hey', 'sample', 'review', 'happens', 'contain', 'punctuations']

Vectorisation

There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms.

At the moment, we have our reviews as lists of tokens (a list of words). To enable Scikit-learn algorithms to work on our text, we need to convert each review into a vector.

We can use Scikit-learn’s CountVectorizer to convert the text collection into a matrix of token counts. You can imagine this resulting matrix as a 2-D matrix, where each row is a review and each column is a unique word.

Let’s import CountVectorizer and fit an instance to our review text (stored in X), passing in our text_process function as the analyser.

from sklearn.feature_extraction.text import CountVectorizer
bow_transformer = CountVectorizer(analyzer=text_process).fit(X)

Since there are many reviews, we can expect a lot of zero counts for the presence of any given word in the collection. Because of this, Scikit-learn will output a sparse matrix (a matrix made up mostly of zero values).
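A small sketch of what a sparse matrix stores, using SciPy directly (made-up counts, not our review matrix):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero count matrix, stored in Compressed Sparse Row format
dense = np.array([[0, 2, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 3]])
sparse = csr_matrix(dense)

print(sparse.nnz)  # 3 non-zero entries out of 12 cells
print(100.0 * sparse.nnz / (dense.shape[0] * dense.shape[1]))  # density: 25.0 percent
```

Only the non-zero values (and their positions) are stored, which is why this format scales to thousands of reviews and tens of thousands of words.
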

Now, we can look at the size of the vocabulary stored in the vectoriser (based on X) like this:

len(bow_transformer.vocabulary_)
26435

To illustrate how the vectoriser works, let’s try a random review and get its bag-of-words counts as a vector. Here’s the twenty-fifth review as plain text:

review_25 = X[24]
review_25
"I love this place! I have been coming here for ages.\nMy favorites: Elsa's Chicken sandwich, any of their burgers, dragon chicken wings, china's little chicken sandwich, and the hot pepper chicken sandwich. The atmosphere is always fun and the art they display is very abstract but totally cool!"

Now let’s see our review represented as a vector:

bow_25 = bow_transformer.transform([review_25])
bow_25
<1x26435 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

This means that there are 24 unique words in the review (after removing stopwords). Two of them appear thrice, and the rest appear only once. Let’s go ahead and check which ones appear thrice:

print(bow_transformer.get_feature_names()[11443])
print(bow_transformer.get_feature_names()[22077])
chicken
sandwich

Now that we’ve seen how the vectorisation process works, we can transform our X series into a sparse matrix. To do this, let’s use the .transform() method on our bag-of-words transformed object.

X = bow_transformer.transform(X)

We can check out the shape of our new X.

print('Shape of Sparse Matrix: ', X.shape)
print('Amount of Non-Zero occurrences: ', X.nnz)
# Percentage of non-zero values
density = (100.0 * X.nnz / (X.shape[0] * X.shape[1]))
print("Density: {}".format((density)))
Shape of Sparse Matrix:  (4086, 26435)
Amount of Non-Zero occurrences:  222391
Density: 0.2058920276658241

Training data and test data

The data we use is usually split into training data and test data. The training set contains known outputs, and the model learns on this data so that it can generalise to other data later on. We then use the test set to evaluate the model’s predictions on data it has never seen.

Now that we have finished processing the review text in X, it’s time to split our X and y into a training and a test set using train_test_split from Scikit-learn. We will use 30% of the dataset for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

Training our model

What is Naive Bayes algorithm?

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
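The fruit example can be made concrete with a small numeric sketch. All probabilities below are made up for illustration: each class is scored as its prior multiplied by the per-feature likelihoods, as if the features were independent.

```python
# Made-up numbers illustrating the 'naive' independence assumption
p_apple, p_orange = 0.5, 0.5               # class priors
p_red = {'apple': 0.8, 'orange': 0.1}      # P(red | class)
p_round = {'apple': 0.9, 'orange': 0.9}    # P(round | class)

# Score = prior * product of feature likelihoods
score_apple = p_apple * p_red['apple'] * p_round['apple']      # 0.5 * 0.8 * 0.9
score_orange = p_orange * p_red['orange'] * p_round['orange']  # 0.5 * 0.1 * 0.9

# The classifier picks whichever class scores higher -- here, apple
print(round(score_apple, 2), round(score_orange, 3))  # 0.36 0.045
```
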

Multinomial Naive Bayes is a specialised version of Naive Bayes designed for text documents, where the features are word counts. Let’s build a Multinomial Naive Bayes model and fit it to our training set (X_train and y_train).

from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Testing and evaluating our model

Our model has now been trained! It’s time to see how well it predicts the ratings of previously unseen reviews (reviews from the test set). First, let’s store the predictions as a separate numpy array called preds.

preds = nb.predict(X_test)

Next, let’s evaluate our predictions against the actual ratings (stored in y_test) using confusion_matrix and classification_report from Scikit-learn.

from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, preds))
print('\n')
print(classification_report(y_test, preds))
[[157  71]
 [ 24 974]]


              precision    recall  f1-score   support

           1       0.87      0.69      0.77       228
           5       0.93      0.98      0.95       998

   micro avg       0.92      0.92      0.92      1226
   macro avg       0.90      0.83      0.86      1226
weighted avg       0.92      0.92      0.92      1226

Looks like our model has achieved 92% accuracy! This means that our model can predict whether a user liked a local business or not, based on what they typed!
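The 92% figure can be verified directly from the confusion matrix shown earlier: the diagonal entries are the correct predictions.

```python
# Correct predictions: true 1-star (157) + true 5-star (974)
correct = 157 + 974
total = 157 + 71 + 24 + 974  # all test reviews
accuracy = correct / total
print(round(accuracy, 2))  # 0.92
```
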

Data Bias

Although our model achieved quite a high accuracy, there are some issues with bias caused by the dataset. Let’s take some singular reviews, and see what rating our model predicts for each one.

Machine bias is the effect of erroneous assumptions in machine learning processes. Bias reflects problems related to the gathering or use of data, where systems draw improper conclusions about data sets.

Predicting a single positive review

positive_review = yelp_class['text'][59]
positive_review
"This restaurant is incredible, and has the best pasta carbonara and the best tiramisu I've had in my life. All the food is wonderful, though. The calamari is not fried. The bread served with dinner comes right out of the oven, and the tomatoes are the freshest I've tasted outside of my mom's own garden. This is great attention to detail.\n\nI can no longer eat at any other Italian restaurant without feeling slighted. This is the first place I want take out-of-town visitors I'm looking to impress.\n\nThe owner, Jon, is helpful, friendly, and really cares about providing a positive dining experience. He's spot on with his wine recommendations, and he organizes wine tasting events which you can find out about by joining the mailing list or Facebook page."

Seems like someone had the time of their life at this place, right? We can expect our model to predict a rating of 5 for this review.

positive_review_transformed = bow_transformer.transform([positive_review])
nb.predict(positive_review_transformed)[0]
5

Our model thinks this review is positive, just as we expected.

Predicting a single negative review

negative_review = yelp_class['text'][281]
negative_review

'Still quite poor both in service and food. maybe I made a mistake and ordered Sichuan Gong Bao ji ding for what seemed like people from canton district. Unfortunately to get the good service U have to speak Mandarin/Cantonese. I do speak a smattering but try not to use it as I never feel confident about the intonation. \n\nThe dish came out with zichini and bell peppers (what!??)  Where is the peanuts the dried fried red peppers and the large pieces of scallion. On pointing this out all I got was " Oh you like peanuts.. ok I will put some on" and she then proceeded to get some peanuts and sprinkle it on the chicken.\n\nWell at that point I was happy that atleast the chicken pieces were present else she would probably end up sprinkling raw chicken pieces on it like the raw peanuts she dumped on top of the food. \n\nWell then  I spoke a few chinese words and the scowl turned into a smile and she then became a bit more friendlier. \n\nUnfortunately I do not condone this type of behavior. It is all in poor taste...'
negative_review_transformed = bow_transformer.transform([negative_review])
nb.predict(negative_review_transformed)[0]
1

Our model is right again!

Where does the model go wrong?

Here’s another negative review. Let’s see if the model predicts this one correctly.

another_negative_review = yelp_class['text'][140]
another_negative_review
"Other than the really great happy hour prices, its hit or miss with this place. More often a miss. :(\n\nThe food is less than average, the drinks NOT strong ( at least they are inexpensive) , but the service is truly hit or miss.\n\nI'll pass."
negative_review_transformed = bow_transformer.transform([another_negative_review])
nb.predict(negative_review_transformed)[0]
5

Why the incorrect prediction?

One explanation as to why this may be the case is that our initial dataset had a much higher number of 5-star reviews than 1-star reviews. This means that the model is more biased towards positive reviews compared to negative ones.
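To see how a skewed class balance alone can tip a Naive Bayes model, here is a toy sketch (made-up counts, not the Yelp data): both classes see identical word evidence, so the majority class wins purely on its prior.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Ten "reviews" that all contain the same single word once,
# but nine are labelled 5 and only one is labelled 1
X_toy = np.ones((10, 1))
y_toy = np.array([5] * 9 + [1])

clf = MultinomialNB().fit(X_toy, y_toy)

# With identical evidence, the class prior decides: the model predicts 5
print(clf.predict(np.ones((1, 1)))[0])  # 5
```
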

Introduction adapted from: https://github.com/adicu/data-science/blob/master/All%20Levels.ipynb