GGR274 Homework 3

Logistics#

Due date: The homework is due at 23:59 on Monday, January 30.

You will submit your work on MarkUs. To submit your work:

Download this file (Homework_3.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)
Submit this file to MarkUs under the hw3 assignment. (See our MarkUs Guide for detailed instructions.)

All homework assignments will take place in a Jupyter notebook (like this one). When you are done, you will download the notebook and submit it to MarkUs.

Task 1: Reading in a data set#

In this homework, you’ll work with movie reviews from IMDB, part of a larger dataset that was curated in 2011 by AI researchers at Stanford University (source).

Since we covered in lecture how to read data from files, you’ll start by doing that here. In the code cell below, write code to open the movie review dataset file, called reviews.txt, and then use the readlines file method to read the lines of the file. Store the list of lines in a variable called reviews.

Note: remember to pass the encoding="utf-8" argument to open() to ensure that all characters are read without error.
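As a minimal sketch of this pattern on a stand-in file (the file name sample.txt and its contents are hypothetical, created only so the example runs on its own; this is not reviews.txt), reading lines might look like the following. The with statement is one common way to open files and may differ from the style shown in lecture.

```python
import os
import tempfile

# Hypothetical sample file, written here only so the example is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("first line, café\nsecond line\n")

# Open the file for reading and collect its lines in a list.
with open(sample_path, encoding="utf-8") as f:
    sample_lines = f.readlines()

print(sample_lines)
```

Note that readlines keeps the newline character at the end of each line, so it is included when a line's characters are counted.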

# Write your code here


# The following code is provided to help you check your work
print(f"The first movie review:\n\n{reviews[0]}")

Task 2: Basic statistics#

As you can probably guess, this is a fairly large dataset! Let’s use Python to help make sense of it.

2a. Number of reviews#

The dataset stores each review on a single line. Using this fact and how you read in the data above (with readlines), compute the number of reviews in the dataset and store the result in a variable called num_reviews.

# Write your code here


# The following code is provided to help you check your work
print(f"There are {num_reviews} reviews in the dataset.")

2b. Total and average review length (counting characters)#

Now let’s start computing on the individual reviews.

First, use the approach from lecture to create a new list called review_lengths that contains the length, in terms of number of characters, of each review in the dataset. (Hint: start with review_lengths = [] and then use a for loop.)

Then, using review_lengths, define the following variables with the described values:

total_length: the total number of characters across all of the reviews
max_length: the length of the longest review
min_length: the length of the shortest review
average_length: the average number of characters per review

Hint: use the functions len/sum/max/min to compute these values.
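To illustrate the accumulate-then-summarize pattern described above, here is a small sketch on toy data. The list words and all variable names in it are hypothetical and unrelated to the review dataset.

```python
# Hypothetical toy data, not the review dataset.
words = ["apple", "fig", "banana"]

# Build a list of lengths, one entry per item.
word_lengths = []
for word in words:
    word_lengths.append(len(word))

# Summarize the list with built-in functions.
total = sum(word_lengths)
longest = max(word_lengths)
shortest = min(word_lengths)
average = total / len(word_lengths)  # sum divided by the number of items

print(word_lengths, total, longest, shortest, average)
```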

# Write your code here



# The following code is provided to help you check your work
print(f"The length of the first review is {review_lengths[0]}.")
print(f"The total review length is {total_length}.")
print(f"The maximum review length is {max_length}.")
print(f"The minimum review length is {min_length}.")
print(f"The average review length is {average_length}.")

2c. Total and average review length (counting words)#

Now you’ll repeat part 2b, except counting words instead of characters.

Specifically, define a new list variable called review_word_lengths that contains the length, in terms of number of words, of each review in the dataset. Use the str.split method to split each review by spaces. Don’t worry about punctuation marks appearing in the words; that’s expected.
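As a rough illustration of how str.split behaves, the sketch below splits a made-up sentence (hypothetical, not taken from the dataset) in two ways. Calling split() with no argument splits on any run of whitespace, while split(" ") splits on each single space. Which form is expected here is not stated, so treat the comparison as background only.

```python
# Hypothetical sentence with a double space and a trailing newline.
sentence = "A great film,  truly great!\n"

# No argument: split on runs of whitespace; the trailing newline is dropped.
by_whitespace = sentence.split()

# Explicit space: every single space is a separator, so the double space
# produces an empty string, and the newline stays attached to the last word.
by_space = sentence.split(" ")

print(by_whitespace, len(by_whitespace))
print(by_space, len(by_space))
```

In both cases punctuation stays attached to the neighbouring word, as with "film," above.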

Then using review_word_lengths, define the following variables with the described values:

total_word_length: the total number of words across all of the reviews
max_word_length: the length of the longest review (in words)
min_word_length: the length of the shortest review (in words)
average_word_length: the average number of words per review

# Write your code here


# The following code is provided to help you check your work
print(f"The length (in words) of the first review is {review_word_lengths[0]}.")
print(f"The total review length (in words) is {total_word_length}.")
print(f"The maximum review length (in words) is {max_word_length}.")
print(f"The minimum review length (in words) is {min_word_length}.")
print(f"The average review length (in words) is {average_word_length}.")

Task 3: Identifying positive and negative reviews#

The movie review dataset was originally curated for performing sentiment analysis, which is the use of “natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.” (Wikipedia)

In this task, you’ll perform a very rudimentary sentiment analysis on the movie reviews in our dataset. After writing the code and looking at the results, we’ll also ask you to reflect on the limitations of our current process and how we might improve it.

3a. Categorizing reviews#

Create three new empty lists: positive_reviews, negative_reviews, and ambiguous_reviews. Then loop through the reviews and for each review:

If the review contains the string "best" but not the string "worst", append it to positive_reviews.
If the review contains the string "worst" but not the string "best", append it to negative_reviews.
If the review contains both the strings "best" and "worst", append it to ambiguous_reviews.
If the review doesn’t contain either of the strings, don’t append it to any list.

Use ____ in ____ and ____ not in ____ to check whether one string is contained inside another string (or not). These comparisons are case-sensitive, so for example a review containing "BEST" but not "best" won’t be categorized as positive.
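The following sketch shows these membership checks on a made-up sentence. The text and the search strings are hypothetical and deliberately different from the ones used in this task.

```python
# Hypothetical sentence, not taken from the dataset.
text = "The soundtrack was GOOD, the plot was slow."

has_slow = "slow" in text          # True: "slow" appears in the text
has_good = "good" in text          # False: only "GOOD" appears, and case matters
lacks_fast = "fast" not in text    # True: "fast" never appears

# Checks can be combined with `and` to require several conditions at once.
slow_but_not_fast = "slow" in text and "fast" not in text

print(has_slow, has_good, lacks_fast, slow_but_not_fast)
```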

It is possible to complete this task using one for loop or multiple for loops (do whatever makes the most sense to you).

# Write your code here


# The following code is provided to help you check your work
print(f"Here is the first positive review:\n\n{positive_reviews[0]}\n")
print(f"Here is the first negative review:\n\n{negative_reviews[0]}\n")
print(f"Here is the first ambiguous review:\n\n{ambiguous_reviews[0]}\n")

3b. Analysis#

Read through some of the reviews in each category list (you can modify the print calls we provided to display more of them). Decide whether you agree with the categorization of these reviews.

Then, create a new Markdown cell below this one, and write one or two paragraphs responding to the following prompt:

Describe two different limitations of our very basic review categorization. What problems are caused by these limitations, and for what kinds of reviews might these problems occur? Using just what you’ve learned in the course so far, are there ways we could address these limitations, at least partially?