TDM 20100: Project 3 — 2023

Motivation: The need to search files and datasets based on text is common during various parts of the data wrangling process. As an example, grep is a powerful UNIX tool that allows you to search text using regular expressions. Regular expressions are a structured method for searching for specified patterns. Regular expressions can be very complicated. (Even professionals can make critical mistakes.) With that being said, learning some of the basics will come in handy, regardless of the language in which you are using regular expressions.

Regular expressions are not something you will be able to completely escape from. They exist in some way, shape, and form in all major programming languages. Even if you are less interested in UNIX tools, you should definitely take the time to learn regular expressions.

Context: We’ve just begun to learn the basics of navigating a file system in UNIX using various terminal commands. Now we will go into more depth with one of the most useful command line tools, grep, and experiment with regular expressions using grep, R, and later on, Python.

Scope: grep, regular expression basics, utilizing regular expression tools in R and Python

Learning Objectives
  • Use grep to search for patterns within a dataset.

  • Use cut to section off and slice up data from the command line.

  • Use wc to count the number of lines of input.

Make sure to read about and use the template found here, and review the important information about project submissions here.

Dataset(s)

The following questions will use the files in this directory:

  • /anvil/projects/tdm/data/consumer_complaints/

and, in particular, several questions will focus on the data in this file:

  • /anvil/projects/tdm/data/consumer_complaints/processed.csv

grep stands for (g)lobally search for a (r)egular (e)xpression and (p)rint matching lines. As such, to best demonstrate grep, we will be using it with textual data.

Let’s assume for a second that we didn’t provide you with the location of this project’s dataset, and you didn’t know the name of the file either. With all of that being said, you do know that it is the only dataset with the text "That’s the sort of fraudy fraudulent fraud that Wells Fargo defrauds its fraud-victim customers with. Fraudulently." in it. You can use the grep command to search for the dataset.

You can start in the /anvil/projects/tdm/data directory to reduce the amount of text being searched. In addition, use a wildcard (*) to restrict the search to only the directories that start with con inside the /anvil/projects/tdm/data directory, such as

/anvil/projects/tdm/data/con*

Just know that you’d eventually find the file without using the wildcard, but we don’t want to waste your time.
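If you would like to preview which directories the wildcard matches before running a search, one optional check is to list them; the -d option tells ls to show the matching directory names themselves rather than their contents:

ls -d /anvil/projects/tdm/data/con*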

Use man to read about some of the options for grep. For example, you’ll want to search recursively through the entire contents of the directories starting with con, using the -R or -r option.
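For example, the following command opens the manual page for grep (press q to quit the pager):

man grep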

grep -Rin 'fraudy fraudulent fraud' /anvil/projects/tdm/data/con*
  • -R: This flag tells grep to search recursively, meaning it will traverse the specified directory and all of its subdirectories, looking for the pattern in every file it encounters.

  • -i: This flag makes the search case-insensitive, so "FRAUDY", "Fraudy", and "fraudy" would all match.

  • -n: With this flag, grep also displays the line numbers in the files where the matches are found.

  • 'fraudy fraudulent fraud': This is the pattern grep is looking for. It will search for the exact phrase "fraudy fraudulent fraud" in files.

  • /anvil/projects/tdm/data/con*: This is the path where grep should start its search. Specifically, it tells grep to look in the /anvil/projects/tdm/data/ directory and search in all files and directories starting with con.

When you search for this sentence in the file, make sure that the single quote you type in "That’s" is a regular ASCII single quote; otherwise, you will not find the sentence. Alternatively, just search for a unique part of the sentence that is unlikely to appear in another file.

Question 1 (1 pt)

  1. Write a grep command that finds the dataset containing the text "朝阳区", searching all directories that start with air inside the /anvil/projects/tdm/data directory. As with the example given above, your search should be case-insensitive, and your command needs to display the line numbers where the text is found.

Question 2 (1.5 pts)

  1. Use the head command to print out the first line only from the file /anvil/projects/tdm/data/consumer_complaints/processed.csv.

    Using the head command, we can (in general) quickly print out the first n lines of a file. A CSV file typically has a header row that explains what data each column holds. (A combined example of the commands for this question appears after the hints below.)

    head -n numberoflines filename
  2. Print out the first 5 lines of 3 columns, namely Date Received, Issue, and Company response to consumer, from the file /anvil/projects/tdm/data/consumer_complaints/processed.csv.

    Use the cat command to view the file contents, the head command to limit the number of rows, and the cut command to select columns.

    cat filename | head -n rowNumbers | cut -d 'delimiterhere' -f field1,field2,...
  3. For the single line where we heard about "the sort of fraudy fraudulent fraud", print out these 4 columns: Date Received, Issue, Consumer complaint narrative, and Company response to consumer.

Use the cat, head, tail, and cut commands to isolate the single line and the 4 columns.

You can find the exact line of the file where the "fraudy fraudulent fraud" text occurs by using the -n option of grep. That option reports the line number, which can then be used with head and tail to isolate the single line.

cat filename | grep 'patternhere' | cut -d 'delimiterhere' -f field1,field2,field3,field4
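As a hedged illustration of how these commands fit together, here is a sketch using a hypothetical comma-delimited file somefile.csv; the file name, the line number 42, and the field positions are placeholders, so substitute your own values:

# print only the header row (the first line) of the file
head -n 1 somefile.csv

# print the first 5 lines of fields 1, 2, and 3 from a comma-delimited file
cat somefile.csv | head -n 5 | cut -d ',' -f 1,2,3

# if grep -n reports a match on (say) line 42, isolate just that line
head -n 42 somefile.csv | tail -n 1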

Question 3 (2 pts)

  1. From the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, use a one-line statement to create a new dataset called midwest.csv that satisfies the following requirements:

    • it will only contain the data for these five states: Indiana (IN), Ohio (OH), Illinois (IL), Wisconsin (WI), and Michigan (MI)

    • it will only contain these five columns: Date Received, Issue, Consumer complaint narrative, Company response to consumer, and state

      • Be careful that you don’t accidentally match lines with a word like "AGILE" in them (IL is the state code for Illinois, and it also appears inside the word "AGILE"). See the sketch after this question’s hints for one way to avoid this.

      • Use the '>' redirection operator to create the new file, e.g.,

        createthefile > midwest.csv
  2. Report how many rows of data are in the new file, and find the size of the new file in megabytes.

  • Use wc to count rows

  • Use cut to isolate just the data we ask for. For example, print only the number of rows, and print only the value (in MB) of the size of the file:

cut -d 'delimiterhere' -f positionofrequestedfield

Your output should look like this:

520953

not like this:

520953 /home/x-nzhou1/midwest.csv
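Putting these hints together, here is a minimal sketch on a hypothetical file states.csv (the file names, field positions, and the two example states are placeholders; the pattern also assumes the state field is neither the first nor the last field on the line). Anchoring the two-letter codes between the delimiter commas is one way to avoid matching letter pairs buried inside longer words like "AGILE":

# keep only rows whose state field is IN or OH, then select fields 1 and 2
cat states.csv | grep -E ',(IN|OH),' | cut -d ',' -f 1,2 > subset.csv

# print just the row count (wc -l alone would also print the file name)
wc -l subset.csv | cut -d ' ' -f 1

# print just the size in megabytes (du -m output is tab-delimited)
du -m subset.csv | cut -f 1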

Question 4 (1.5 pts)

  1. Use the grep command on the new dataset midwest.csv to find the number of rows that contain one (or more) of the following words (the search should be case-insensitive): "improper", "struggling", or "incorrect". A generic sketch of this kind of search follows.
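Here is the general shape of such a search, sketched with made-up words and a hypothetical file notes.csv: with grep -E, the | character means "or", -i makes the match case-insensitive, and -c prints the count of matching lines instead of the lines themselves:

# count the lines containing at least one of the three words
grep -icE 'apple|banana|cherry' notes.csv

Piping the matching lines to wc -l instead of using -c would produce the same count.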

Question 5 (2 pts)

  1. In the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, which date appears the most in the Date received column?

  2. In the file /anvil/projects/tdm/data/consumer_complaints/processed.csv, for each category in the Product column, how many times does that product type appear in the data set? (A generic frequency-count sketch, applicable to both parts, follows.)
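Both parts can be answered with a classic command-line frequency recipe. Here is a hedged sketch on a hypothetical comma-delimited file somefile.csv whose first field holds the value of interest (the file name and field position are placeholders, and this simple approach assumes the fields contain no embedded commas):

# skip the header row, then count how many times each distinct value in field 1 appears, most frequent first
cut -d ',' -f 1 somefile.csv | tail -n +2 | sort | uniq -c | sort -nr | head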

Project 03 Assignment Checklist

  • Code used to solve questions 1 through 5

  • Output from running the code

  • Copy the code and outputs to a new Jupyter notebook

    • firstname-lastname-project03.ipynb.

  • Submit files through Gradescope

Please make sure to double-check that your submission is complete and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it, to make sure that what you think you submitted was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.