EEB125 Homework 2: Reading and manipulating some data

Logistics

Due date: The homework is due on Tuesday, January 21 at 11:59pm.

You will submit your work on MarkUs. To submit your work:

  1. Download this file (Homework_2.ipynb) from JupyterHub. (See our JupyterHub Guide for detailed instructions.)

  2. Submit this file to MarkUs under the hw2 assignment. (See our MarkUs Guide for detailed instructions.)

All homework assignments will be completed in a Jupyter notebook (like this one).

Overview

This week, you will build on and combine some of the tools we covered in lecture to answer a data science question: What is the most common fossil taxon found in Canada? We will use data from the Paleobiology Database (https://paleobiodb.org/#/) to answer this question.

Task 1: Read in the data file

Problem 1a. Open the file pbdb_data.csv in Python and read in the lines.

Open the data file we will use this week, pbdb_data.csv. Assign the result to a variable named file. Next, read in the lines and assign the output to a variable called lines.
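As a reminder of the general pattern, here is a minimal sketch that opens a text file and reads its lines into a list. The file toy.csv and the names toy_path, toy_file, and toy_lines are made up for illustration; the sketch writes its own small file first so that it runs anywhere.

```python
import os
import tempfile

# Build a tiny stand-in file so the example runs anywhere.
# "toy.csv" and its contents are hypothetical, not the homework data.
toy_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(toy_path, "w") as out:
    out.write("id,name\n1,apple\n2,pear\n")

# Open the file for reading and pull every line into a list of strings.
toy_file = open(toy_path, "r")
toy_lines = toy_file.readlines()
toy_file.close()

print(len(toy_lines))  # number of lines, header included
print(toy_lines[0])    # each line keeps its trailing newline character
```

Note that readlines() returns one string per line, and each string still ends with a newline character.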

# Write your code here

Problem 1b. Separate the header from the rest of the data

Create a variable named header and assign the first line of the file to it. Create a second variable called data and assign the rest of the lines to it.
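If list indexing and slicing feel rusty, the short sketch below shows both on a made-up list. The names toy_lines, toy_first, and toy_rest are hypothetical and are not the variables you are asked to create.

```python
# A made-up list of lines, standing in for the output of readlines().
toy_lines = ["id,name\n", "1,apple\n", "2,pear\n"]

# Index 0 picks out a single item: the first line.
toy_first = toy_lines[0]

# The slice [1:] keeps everything from index 1 to the end, as a new list.
toy_rest = toy_lines[1:]

print(toy_first)
print(toy_rest)
```

Indexing returns a single string, while slicing returns a list, even if that list contains only one item.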

# Write your code here

Problem 1c. Interpret the data file

Please explain what data are contained in this file by examining the contents of the first line (the “header”). What can this header tell us about the data contained within this file? (2pts) Please pick one or two of the entries in this header and explain or speculate about what information they might contain. (2pts)

# Run this cell to print the headers. Do not modify this cell.
print(header)

WRITE YOUR RESPONSE HERE.

Problem 2a: Loop through the file and split up the data fields

Create an empty list and assign it to the variable line_lengths. Loop through the lines of our data file. Remove any whitespace from the start and end of each line, and split each line using the comma as a delimiter. Find out how long the resulting list is and append the result to line_lengths.

After the loop, find the unique values of line_lengths and assign the result to unique_line_lengths.
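The sketch below shows what strip(), split(), and len() each do to a single made-up string. The string raw and the other names are hypothetical; in your solution you will apply the same steps inside a loop.

```python
# A made-up line with extra whitespace at both ends.
raw = "  Toronto, Ontario ,Canada \n"

# strip() removes whitespace (including the newline) from the two ends only.
cleaned = raw.strip()

# split(",") cuts the string at every comma and returns a list of pieces.
fields = cleaned.split(",")

print(fields)       # ['Toronto', ' Ontario ', 'Canada']
print(len(fields))  # 3
```

Notice that the middle field still has spaces around it: strip() only trims the ends of the string it is called on, not the pieces produced later by split().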

# Write your code here


# Run this code after creating line_lengths to see all the unique line lengths
unique_line_lengths = set(line_lengths)
print(unique_line_lengths)

Problem 2b: Interpret this step

In the previous step, you found that some of the lines in our data file were split into different lengths. Why do you suppose this might have happened? Could it pose a problem for our analyses? In subsequent weeks, we will learn techniques for cleaning up our data a bit more rigorously, but for now we can just move on as-is. (4pts)

WRITE YOUR RESPONSE HERE.

Problem 2c: Extract and clean the taxon names from each line

Create an empty list and assign it to the variable taxa. Modify our loop to extract from each line the entry corresponding to accepted_name in the header. Remove any whitespace from the start and end of this string, and convert the string to all caps. Append each of these cleaned names to taxa.
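One possible way to find which position a column occupies is the list method index(), sketched below on made-up data. The header, row, and column name "fruit" are hypothetical; counting the position by hand from the printed header works just as well.

```python
# Made-up header and data row, standing in for lines from a CSV file.
toy_header = "id,fruit,colour\n"
toy_row = "7, apple ,red\n"

# Split the header into column names and look up where "fruit" sits.
columns = toy_header.strip().split(",")
position = columns.index("fruit")

# Pull out the field at that position from the row, then tidy it up.
value = toy_row.strip().split(",")[position]
tidy = value.strip().upper()

print(position)  # 1
print(tidy)      # APPLE
```

This only works as intended when the row splits into the same number of fields as the header.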

# Write your code here

Problem 3: Number of fossil taxa in Canada

Next, find out how many unique fossil taxa have been reported in Canada. First, we will want to find all of the unique values in our list taxa. You can do this using the code provided below, which will convert our list, taxa, into another Python data type called a ‘set’, which removes duplicates. Assign the results from the provided code to the variable unique. Next, find how many unique taxa there are. Assign the results to the variable num_taxa.
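To see how a set removes duplicates, here is a small sketch on a made-up list. The names toy_list and toy_unique are hypothetical.

```python
# A made-up list with repeated entries.
toy_list = ["APPLE", "PEAR", "APPLE", "FIG", "PEAR"]

# Converting to a set keeps one copy of each distinct value.
toy_unique = set(toy_list)

print(toy_unique)       # the order of items in a set is not guaranteed
print(len(toy_unique))  # 3
```

Sets are unordered, so the items may print in a different order each time you run the code.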

unique = set(taxa)  # this provided instruction will remove duplicates by converting taxa to a set data type.

# Write your code here

Problem 4a: The most frequent taxa in our dataset

Create an empty list and assign it to the variable num_occur. Loop over the items in unique and, for each taxon, find the number of times it appears in taxa. Print the taxon name and its number of occurrences, and append that number to num_occur. The list is used for autograding; you will use the printed output to interpret your results.
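Counting how often a value appears in a list can be done with the list method count(), sketched here on a made-up list. The names toy_list, n_apple, and n_kiwi are hypothetical.

```python
# A made-up list with repeated entries.
toy_list = ["APPLE", "PEAR", "APPLE", "FIG", "PEAR"]

# count() returns how many items in the list are equal to its argument.
n_apple = toy_list.count("APPLE")
n_kiwi = toy_list.count("KIWI")

print("APPLE", n_apple)  # APPLE 2
print("KIWI", n_kiwi)    # KIWI 0
```

A value that never appears gives a count of 0 rather than an error.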

# Write your code here

Problem 4b: Find the most common taxon in the dataset

That output was pretty long and hard to sort through. We will learn tools to do this later, but for now, we will get our answer using only functionality from base Python. We can use our variable unique to easily find the most common taxon in our dataset using the max() function built into Python. We just need to use a slightly funky syntax to do so: the key argument tells max() to compare the taxa by how many times each one appears in taxa, rather than alphabetically. Run the provided code to find the most common taxon in the Canadian fossil record.

# Run this cell. Do not change the code in this cell.

most_common = max(unique, key=taxa.count)
print(most_common)
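If the key argument is unfamiliar, the sketch below compares max() with and without it on made-up lists. The names words and toy_list are hypothetical.

```python
words = ["fig", "banana", "kiwi"]

# Without key, strings are compared alphabetically.
alphabetical_max = max(words)

# With key=len, each word is ranked by its length instead.
longest = max(words, key=len)

# With key=toy_list.count, each distinct item is ranked by how
# often it appears in toy_list.
toy_list = ["APPLE", "PEAR", "APPLE", "FIG"]
toy_most_common = max(set(toy_list), key=toy_list.count)

print(alphabetical_max)  # kiwi
print(longest)           # banana
print(toy_most_common)   # APPLE
```

If two items tie for the highest count, max() returns whichever one it encounters first, and with a set that order is not guaranteed.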

Problem 5a: Interpret your results

What is the name of the most common taxon in our dataset? (2pts) Google it. Can you think of any reasons why this taxon is so commonly found in Canada? There are no incorrect answers – feel free to speculate. (2pts)

WRITE YOUR RESPONSE HERE.

Problem 5b: Interpret your results

Scroll through your results from Problem 4a. Google the names of a few of the taxa that have occurred many times (let’s say any that occur more than 30 times). Do you notice anything in common about these lineages? Comment on any patterns you notice. (2pts)

WRITE YOUR RESPONSE HERE.