STAT 29000: Project 3 — Spring 2021
Motivation: Web scraping takes practice, and it is important to work through a variety of common tasks so that you know how to handle them the next time you run into them. In this project, we will use a variety of scraping tools to scrape data from trulia.com.
Context: In the previous project, we got our first taste of actually scraping data from a website and using a parser to extract the information we were interested in. In this project, we will introduce some tasks that will require you to use a tool that lets you interact with a browser: selenium.
Scope: python, web scraping, selenium
Questions
Question 1
Visit trulia.com. Many websites have a similar interface, i.e. a bold and centered search bar for the user to interact with. Using selenium, write Python code that first finds the `input` element, and then types "West Lafayette, IN" followed by an emulated "Enter/Return". Confirm your code works by printing the URL after that process completes.
A video walkthrough accompanies this question; that video is relevant for Question 2 as well.
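Below is a minimal sketch of this process, assuming a Firefox driver on your PATH; the XPath used to locate the search box is a guess at Trulia's markup and may need adjusting:

```python
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("https://www.trulia.com")

# find the search bar, type the query, and emulate pressing Enter/Return
search = driver.find_element_by_xpath("//input[@type='text']")
search.send_keys("West Lafayette, IN")
search.send_keys(Keys.RETURN)

time.sleep(5)  # give the results page a moment to load
print(driver.current_url)
driver.quit()
```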
- Python code used to solve the problem.
- Output from running your code.
Question 2
Use your code from question (1) to test out the following queries:
- West Lafayette, IN (City, State)
- 47906 (Zip)
- 4505 Kahala Ave, Honolulu, HI 96816 (Full address)
If you look closely, you will see that there are patterns in the URL. For example, the following link would probably bring up homes in Crawfordsville, IN: trulia.com/IN/Crawfordsville. That said, if you only had a zip code, like 47933, it would not be easy to guess www.trulia.com/IN/Crawfordsville/47933/; this is one reason why the search bar is useful.
If you used XPath expressions to complete question (1), instead use a different method to find the `input` element. If you used a different method, use XPath expressions this time.
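For instance, if you used XPath expressions for question (1), a CSS selector is one alternative way to locate the same element. A sketch, assuming `driver` was created as in question (1) and that the selector below matches Trulia's markup:

```python
import time

from selenium.webdriver.common.keys import Keys

queries = [
    "West Lafayette, IN",
    "47906",
    "4505 Kahala Ave, Honolulu, HI 96816",
]

# run each query through the search bar and print the resulting URL
for query in queries:
    driver.get("https://www.trulia.com")
    search = driver.find_element_by_css_selector("input[type='text']")
    search.send_keys(query)
    search.send_keys(Keys.RETURN)
    time.sleep(5)
    print(driver.current_url)
```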
- Python code used to solve the problem.
- Output from running your code.
Question 3
Let's call the page that appears after a city/state or zip code search a "sales page". For example:
![Trulia sales page](./images/trulia.png)
Use `requests` to scrape the entire page: www.trulia.com/IN/West_Lafayette/47906/. Use `lxml.html` to parse the page and get all of the `img` elements that make up the house pictures on the left side of the website.
Make sure you are actually scraping what you think you are scraping! Try printing your HTML to confirm it has the content you think it should have.
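For instance, a minimal check of the raw HTML (the 1000-character slice is arbitrary; print more if you need to):

```python
import requests

url = "https://www.trulia.com/IN/West_Lafayette/47906/"
response = requests.get(url)

# print the beginning of the raw HTML to confirm it contains listings
print(response.text[:1000])
```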
Are you human? Depends. Sometimes, if you add a header to your request, the website won't ask whether you are human. Let's pretend we are Firefox.
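A sketch of one way to do that; the exact User-Agent string is only an example:

```python
import requests

# send a browser-like User-Agent header with the request
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0"
}
response = requests.get("https://www.trulia.com/IN/West_Lafayette/47906/", headers=headers)
print(response.status_code)
```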
Okay, after all of that work, you may have discovered that only a few images have actually been scraped. If you cycle through all of the `img` elements and try to print the value of the `src` attribute, this will be clear:
import lxml.html

tree = lxml.html.fromstring(response.text)
elements = tree.xpath("//img")

for element in elements:
    print(element.attrib.get("src"))
This is because the webpage is not immediately, completely loaded; this is common website behavior that makes things appear faster. If you pay close attention when you load www.trulia.com/IN/Crawfordsville/47933/ and quickly scroll down, you will see images that are still slowly finishing rendering. What we need to do to fix this is use `selenium` (instead of `lxml.html`) to behave like a human and scroll prior to scraping the page! Try using the following code to slowly scroll down the page before finding the elements:
import time

# driver setup and get the url

# needed to get the window size set right and scroll in headless mode
myheight = driver.execute_script('return document.body.scrollHeight')
driver.set_window_size(1080, myheight + 100)

def scroll(driver, scroll_point):
    driver.execute_script(f'window.scrollTo(0, {scroll_point});')
    time.sleep(5)

scroll(driver, myheight*1/4)
scroll(driver, myheight*2/4)
scroll(driver, myheight*3/4)
scroll(driver, myheight*4/4)

# find_elements_by_*
At the time of writing, there should be about 86 links to images of homes.
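Putting the pieces together, here is a minimal end-to-end sketch, assuming a headless Firefox driver (geckodriver on your PATH) and sleep times that you may need to adjust:

```python
import time

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get("https://www.trulia.com/IN/West_Lafayette/47906/")
time.sleep(10)  # let the initial page load

# scroll down in quarters so the lazily loaded images render
myheight = driver.execute_script('return document.body.scrollHeight')
driver.set_window_size(1080, myheight + 100)
for i in range(1, 5):
    driver.execute_script(f'window.scrollTo(0, {myheight*i/4});')
    time.sleep(5)

# collect the img elements and print their src attributes
for element in driver.find_elements_by_tag_name("img"):
    print(element.get_attribute("src"))

driver.quit()
```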
- Python code used to solve the problem.
- Output from running your code.
Question 4
Write a function called `avg_house_cost` that accepts a zip code as an argument and returns the average cost of the first page of homes. Now, to make this a more meaningful statistic, filter for "3+" beds and then find the average. Test `avg_house_cost` out on the zip code 47906 and print the average cost.
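A skeleton sketch of one way to structure the function, assuming a Firefox driver; the search-box XPath and the `data-testid` selector below are guesses at Trulia's markup that will likely need adjusting, and the "3+" beds filter click is left as a TODO:

```python
import re
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def avg_house_cost(zipcode):
    driver = webdriver.Firefox()
    driver.get("https://www.trulia.com")

    # search by zip code, as in questions (1) and (2)
    search = driver.find_element_by_xpath("//input[@type='text']")
    search.send_keys(str(zipcode))
    search.send_keys(Keys.RETURN)
    time.sleep(15)  # give the sales page plenty of time to load

    # TODO: click the "Beds" filter and select "3+" before collecting prices

    # strip non-numeric characters from each price, then average
    price_elements = driver.find_elements_by_xpath("//div[@data-testid='property-price']")
    prices = [int(re.sub(r"[^0-9]", "", e.text)) for e in price_elements]
    driver.quit()
    return sum(prices) / len(prices)

print(avg_house_cost(47906))
```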
You will want to wait a solid 10-15 seconds for the sales page to load before trying to select or click on anything.
Your results may end up including prices for "Homes Near \<ZIPCODE\>". This is okay. Even better if you manage to remove those results.
You can use the following code to remove the non-numeric text from a string and then convert it to an integer:
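A minimal sketch using `re`; the sample price string is only an illustration:

```python
import re

def price_to_int(price: str) -> int:
    # keep only the digits, then convert to an integer
    return int(re.sub(r"[^0-9]", "", price))

print(price_to_int("$229,900"))  # prints 229900
```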
- Python code used to solve the problem.
- Output from running your code.