GGR274 Homework 3

Logistics

Due date: The homework is due 23:59 on Monday, January 30.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_3.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw3 assignment. (See our MarkUs Guide for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.

Task 1: Reading in a data set

In this lab, you’ll work with movie reviews from IMDB, part of a larger data set which was curated in 2011 by AI researchers at Stanford University (source).

Since we covered in lecture how to read data from files, you’ll start by doing that here. In the code cell below, write code to open the movie review dataset file, called reviews.txt, and then use the readlines file method to read the lines of the file. Store the list of lines in a variable called reviews.

Note: remember to pass the encoding="utf-8" argument to open() to ensure that all characters are read without error.
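As a minimal sketch of the reading pattern described above, the following writes a tiny stand-in file and reads it back with readlines. The file name sample_reviews.txt and its two made-up lines are assumptions for illustration only; they are not part of the real dataset. The sketch uses a with block, which closes the file automatically; calling open() and later close() explicitly works as well.

```python
import os
import tempfile

# Hypothetical stand-in for reviews.txt: two short made-up reviews, one per line.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, "sample_reviews.txt")

with open(sample_path, "w", encoding="utf-8") as sample_file:
    sample_file.write("Loved it, très bien!\n")
    sample_file.write("Not my favourite.\n")

# Read the lines back, passing the same encoding that was used to write the file.
with open(sample_path, encoding="utf-8") as sample_file:
    sample_lines = sample_file.readlines()

print(len(sample_lines))    # number of lines read
print(repr(sample_lines[0]))  # repr shows the trailing newline kept by readlines
```

Note that readlines keeps the newline character at the end of each line, so it counts toward the length of each string.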

# Write your code here

review_file = open('reviews.txt', encoding="utf-8")
reviews = review_file.readlines()
review_file.close()  # close the file once its lines have been read

# The following code is provided to help you check your work
print(f"The first movie review:\n\n{reviews[0]}")
The first movie review:

I just watched it last night and it was great.I can see why some ppl have ill feelings towards it from a rugby fan and maori culture point of view but other than that I have no idea what's so negative about it. The movie is great. It has a lot of heart. Very inspiring and encouraging to all ages. Great family movie! They did a pretty good job considering that it was a budget movie. I love movies based on true stories/events. I was raised around rugby all my life, it is a great game but I was never really taken to it because (please forgive me if I offend anyone, nothing personal this is just how I saw it) I thought, their trainings are not as ruthless or hard, the players are not as disciplined and don't seemed as serious like other sportmen and it looked like it's all just muscle and blooming tackling each other etc. But after watching Forever strong, I was like, wow! I was proud! It did good things for rugby (well it changed my view of rugby) and also the New Zealnd Haka. I actually cried. I am not even New Zealander and I was proud of their culture. Didn't even know what the chant meant until this movie. The movie is NOT about rugby techniques or rugby, it's not even about New Zealand All Blacks or the Haka or etc......Mother of pearls!!!!! hahaha SHUX!<br /><br />So to all you beautiful negative ppl, You are missing the point! I am sure if they had the means, it would have been better, the haka is in there because that was part of Highland Rugby culture, tradition or what ever you want to call it. <br /><br />So any new members on this site such as myself, please don't be put off by those negative comments. See it for yourself! Must see movie! There is a lot you can learn from this movie, ppl of all ages. It definitely makes you want to be a better person and be humble! This movie reminded me of a lot of things that I already know and was raised with but I kinda lost along the way! Loved it! Happy reading ppls and All the best!<br /><br />Muawha!

Task 2: Basic statistics

As you can probably guess, this is likely a fairly large dataset! Let’s use Python to help make sense of it.

2a. Number of reviews

The dataset stores each review on a single line. Using this fact and how you read in the data above (with readlines), compute the number of reviews in the dataset and store the result in a variable called num_reviews.

# Write your code here
num_reviews = len(reviews)

# The following code is provided to help you check your work
print(f"There are {num_reviews} reviews in the dataset.")
There are 2000 reviews in the dataset.

2b. Total and average review length (counting characters)

Now let’s start computing on the individual reviews.

First, use the approach from lecture to create a new list called review_lengths that contains the length, in terms of number of characters, of each review in the data set. (Hint: start with review_lengths = [] and then use a for loop.)

Then using review_lengths, define the following variables with the described values:

  • total_length: the total number of characters across all of the reviews

  • max_length: the length of the longest review

  • min_length: the length of the shortest review

  • average_length: the average number of characters per review

Hint: use the functions len/sum/max/min to compute these values.
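The two hints above can be illustrated on a toy list. This is only a sketch: the strings in sample_texts are made up and the variable names are chosen for illustration, so the numbers have nothing to do with the real dataset.

```python
# Made-up example strings, not taken from the dataset.
sample_texts = ["good", "terrible", "ok"]

# Build a list of lengths with a for loop, starting from an empty list.
sample_lengths = []
for text in sample_texts:
    sample_lengths.append(len(text))

print(sample_lengths)       # [4, 8, 2]
print(sum(sample_lengths))  # 14
print(max(sample_lengths))  # 8
print(min(sample_lengths))  # 2

# The average is the total divided by the number of items.
sample_average = sum(sample_lengths) / len(sample_lengths)
print(sample_average)
```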

# Write your code here
review_lengths = []

for review in reviews:
    review_lengths.append(len(review))

total_length = sum(review_lengths)
max_length = max(review_lengths)
min_length = min(review_lengths)
average_length = total_length / len(review_lengths)


# The following code is provided to help you check your work
print(f"The length of the first review is {review_lengths[0]}.")
print(f"The total review length is {total_length}.")
print(f"The maximum review length is {max_length}.")
print(f"The minimum review length is {min_length}.")
print(f"The average review length is {average_length}.")
The length of the first review is 1985.
The total review length is 2657205.
The maximum review length is 7127.
The minimum review length is 91.
The average review length is 1328.6025.

2c. Total and average review lengths (words)

Now you’ll repeat part 2b, except counting words instead of characters.

Specifically, define a new list variable called review_word_lengths that contains the length, in terms of number of words, of each review in the data set. Use the str.split method to split the reviews by spaces. Don’t worry about punctuation marks appearing in the words; that’s expected.

Then using review_word_lengths, define the following variables with the described values:

  • total_word_length: the total number of words across all of the reviews

  • max_word_length: the length of the longest review (in words)

  • min_word_length: the length of the shortest review (in words)

  • average_word_length: the average number of words per review
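To see what str.split produces, here is a small sketch on a made-up string (sample_review is an assumption for illustration). Calling split() with no argument splits on any run of whitespace and ignores the trailing newline, while split(" ") splits on every single space character, so the two can give different counts.

```python
# Made-up review text with a double space and a trailing newline.
sample_review = "Great movie!  Loved it, really.\n"

# No argument: split on runs of whitespace; punctuation stays attached.
sample_words = sample_review.split()
print(sample_words)       # ['Great', 'movie!', 'Loved', 'it,', 'really.']
print(len(sample_words))  # 5

# Explicit single-space separator: the double space yields an empty string.
sample_pieces = sample_review.split(" ")
print(sample_pieces)       # ['Great', 'movie!', '', 'Loved', 'it,', 'really.\n']
print(len(sample_pieces))  # 6
```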

# Write your code here
review_word_lengths = []

for review in reviews:
    review_word_lengths.append(len(review.split()))

total_word_length = sum(review_word_lengths)
max_word_length = max(review_word_lengths)
min_word_length = min(review_word_lengths)
average_word_length = total_word_length / len(review_word_lengths)


# The following code is provided to help you check your work
print(f"The length (in words) of the first review is {review_word_lengths[0]}.")
print(f"The total review length (in words) is {total_word_length}.")
print(f"The maximum review length (in words) is {max_word_length}.")
print(f"The minimum review length (in words) is {min_word_length}.")
print(f"The average review length (in words) is {average_word_length}.")
The length (in words) of the first review is 380.
The total review length (in words) is 468849.
The maximum review length (in words) is 1167.
The minimum review length (in words) is 16.
The average review length (in words) is 234.4245.

Task 3: Identifying positive and negative reviews

The movie review dataset was originally curated for performing sentiment analysis, which is the use of “natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.” (Wikipedia)

In this task, you’ll perform a very rudimentary sentiment analysis on the movie reviews in our dataset. After writing the code and looking at the results, we’ll also ask you to reflect on the limitations of our current process and how we might improve it.

3a. Categorizing reviews

Create three new empty lists: positive_reviews, negative_reviews, and ambiguous_reviews. Then loop through the reviews and for each review:

  • If the review contains the string "best" but not the string "worst", append it to positive_reviews.

  • If the review contains the string "worst" but not the string "best", append it to negative_reviews.

  • If the review contains both the strings "best" and "worst", append it to ambiguous_reviews.

  • If the review doesn’t contain either of the strings, don’t append it to any list.

Use ____ in ____ and ____ not in ____ to check whether one string is contained inside another string (or not). These comparisons are case-sensitive, so for example a review containing "BEST" but not "best" won’t be categorized as positive.

It is possible to complete this task using one for loop or multiple for loops (do whatever makes the most sense to you).
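The behaviour of in and not in on strings can be checked on short made-up examples (the strings below are assumptions for illustration, not reviews from the dataset). The checks are case-sensitive, and they look for the text anywhere inside the string, including inside a longer word.

```python
sample_review = "Simply the BEST film of the year."

print("best" in sample_review)       # False: the case does not match
print("BEST" in sample_review)       # True
print("worst" not in sample_review)  # True

# in checks for a substring, so a longer word containing "best" also matches.
print("best" in "A bestselling novel.")  # True
```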

# Write your code here
positive_reviews = []
negative_reviews = []
ambiguous_reviews = []

for review in reviews:
    if "best" in review and "worst" not in review:
        positive_reviews.append(review)
    elif "best" not in review and "worst" in review:
        negative_reviews.append(review)
    elif "best" in review and "worst" in review:
        ambiguous_reviews.append(review)


# The following code is provided to help you check your work
print(f"Here is the first positive review:\n\n{positive_reviews[0]}\n")
print(f"Here is the first negative review:\n\n{negative_reviews[0]}\n")
print(f"Here is the first ambiguous review:\n\n{ambiguous_reviews[0]}\n")
Here is the first positive review:

I just watched it last night and it was great.I can see why some ppl have ill feelings towards it from a rugby fan and maori culture point of view but other than that I have no idea what's so negative about it. The movie is great. It has a lot of heart. Very inspiring and encouraging to all ages. Great family movie! They did a pretty good job considering that it was a budget movie. I love movies based on true stories/events. I was raised around rugby all my life, it is a great game but I was never really taken to it because (please forgive me if I offend anyone, nothing personal this is just how I saw it) I thought, their trainings are not as ruthless or hard, the players are not as disciplined and don't seemed as serious like other sportmen and it looked like it's all just muscle and blooming tackling each other etc. But after watching Forever strong, I was like, wow! I was proud! It did good things for rugby (well it changed my view of rugby) and also the New Zealnd Haka. I actually cried. I am not even New Zealander and I was proud of their culture. Didn't even know what the chant meant until this movie. The movie is NOT about rugby techniques or rugby, it's not even about New Zealand All Blacks or the Haka or etc......Mother of pearls!!!!! hahaha SHUX!<br /><br />So to all you beautiful negative ppl, You are missing the point! I am sure if they had the means, it would have been better, the haka is in there because that was part of Highland Rugby culture, tradition or what ever you want to call it. <br /><br />So any new members on this site such as myself, please don't be put off by those negative comments. See it for yourself! Must see movie! There is a lot you can learn from this movie, ppl of all ages. It definitely makes you want to be a better person and be humble! This movie reminded me of a lot of things that I already know and was raised with but I kinda lost along the way! Loved it! Happy reading ppls and All the best!<br /><br />Muawha!


Here is the first negative review:

Not having seen the film in its commercial debut, we just caught with it via DVD. Expecting the worst, "Hitch" proved to be a pleasant experience because of the three principals in it. Thanks to Andy Tenant's direction, the film has an easy pace, and while predictable, the comedy has some winning moments.<br /><br />Hitch is a sort of "date coordinator" for losers like Albert, who is not exactly what one would consider a hunk. Yet, Albert is a genuine guy who, without some professional help would go unnoticed by the same women he would like to take out. Enter Hitch, to prepare him to overcome the obstacles that he can't overcome, and even though Albert stays overweight and never gets to master social graces, he conquers us because he is a real, in sharp contrast with all the phonies making the rounds in Manhattan.<br /><br />The basic mistake most production designers make, when preparing locales for Hollywood films, is how out of touch with reality they are. The apartments in which they situate these characters are so rare to find that only by the magic of the movies can these people live in places likes these. Evidently most of the movie people are dealing with fantasy since most city dwellers would kill for spaces so fabulous as the ones they show in the movies, let alone these same people depicted in the film would not be able to afford them.<br /><br />Will Smith is a charismatic actor. He has a disarming way to charm without doing much. The surprise of the movie though, is Kevin James, who as the overweight Albert, not only win us over, but he proves he can hold his own in his scenes with Mr. Smith. Eva Mendez is fine as the main interest of Hitch. In minor roles we see Adam Arkin, Amber Valletta, Michael Rappaport, and Phillip Bosco, among others.<br /><br />"Hitch" is a fun film to watch thanks to the inspired direction by Andy Tenant.


Here is the first ambiguous review:

In my personal opinion - «The Patriot» is one of the best Steven Seagal movies.<br /><br />I've heard people say it's the worst one ever, it's not like SS etc. I disagree. As a highly spiritual person, a great master Seagal established a good tradition in action movies. He always has a good background, great action, high professionalism and a clever message. This movie has it all. You have good shooting scenes, great aikido. Although there isn't a lot of it, it shows us its peaceful side. This change in his film making only proves his spiritual growth (he doesn't kill Chisolm's buddy in the end).<br /><br />«The Patriot» is definitely one of the best films from the «filmmaker's» point of view which I have seen lately. You have great panoramic shots of Montana, we see real American nature and beautiful wildlife(among others - horses and flowers). The soundtrack also deserves a few words. During the film I had a great opportunity to listen to classical American-cowboy-western music(not Country though). Similar music was heard in «Back to the Future Prt.3». SS's acting has greatly improved since his last films. His role is unfamiliar to him(unlike cops & commandos), but he does a good job playing the-retired-doctor-from-the-government. His acting is convincing and his lines are good.<br /><br />I was really pleased with the cast. LQ Jones proves that life & death walk the Earth together, Whitney Yellow Robe plays a beautiful and clever scientist, Camilla Belle makes a great appearance as McClaren's daughter.<br /><br />Mr.Seagal discusses the much debated «Real American» tradition and the militia squads, providing his own point of view(he likes the Constitution just fine, but chubby bearded men have nothing to do with it). Also good points are raised regarding the Eastern-Western Medicine system and nature.<br /><br />Seagal's best. And opening new horizons in his film career.<br /><br />

3b. Analysis

Read through some of the reviews in each category list (you can modify the print calls we provided to display more of them). Decide whether you agree with the categorization of these reviews.

Then, create a new Markdown cell below this one, and write one or two paragraphs responding to the following prompt:

Describe two different limitations of our very basic review categorization. What problems are caused by these limitations, and for what kinds of reviews might these problems occur? Using just what you’ve learned in the course so far, are there ways we could address these limitations, at least partially?
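As one possible direction only, and not the expected answer, the sketch below checks for a whole word while ignoring case. The helper name contains_word, the set of punctuation it strips, and the sample strings are all assumptions made for illustration. It relies on str.lower, str.split, and str.strip.

```python
def contains_word(review, target):
    """Return True if target appears as a whole word in review, ignoring case."""
    for word in review.lower().split():
        # Remove common punctuation from both ends of the word before comparing.
        if word.strip(".,!?;:\"'()") == target:
            return True
    return False

print(contains_word("Simply the BEST!", "best"))      # True
print(contains_word("A bestselling novel.", "best"))  # False
print("best" in "A bestselling novel.")               # True (plain substring check)
```

A check like this still cannot tell how a word is being used. The first negative review shown above, for example, contains "Expecting the worst" but goes on to praise the film, so it would be categorized the same way.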