Funda Scraper: Spider in the Housing Market

These past few years, the housing market in the Netherlands has been wild. With low interest rates and a strong economy, demand for housing has increased, leading to rising property prices and a shortage of available properties. The government has introduced several measures to address the housing shortage, including increasing the pace of construction and reducing bureaucratic hurdles for new housing development. Despite these efforts, the housing market remains competitive, with high demand driving up prices and making it difficult for first-time buyers to enter the market.

Everyone who is planning on moving or looking to buy a house in the Netherlands has most likely heard of the popular housing website Funda. Funda is a website where houses from all over the Netherlands are listed for rent or for sale. The website is visited 43 million times a month, which is a lot, considering the population of the country is only 17 million! On the website you can look at pictures of the listed places and check their details and characteristics. One of its most important features is the possibility to book appointments to view the houses.

Funda is very fast paced. Unless you are looking for high-priced houses or houses in a very specific remote area, chances are you will have to act quickly to book a viewing. You have to check the website every hour and react fast to new listings, because viewings for most properties fill up within a few hours. Funda offers no active notifications for new listings. It does offer one email a day for search terms of your choosing, but, as I said, that will not be fast enough in most cases. So, to take matters into my own hands, I built a Funda web scraper that runs in the background, checks every few minutes whether a new listing has been posted, and sends a notification to my phone when one is.

The Goal

When it comes to building web scrapers, there are a lot of possibilities. For this project I set a few goals that I wanted to achieve:

  • Funda offers a daily update on your search term. However, this is not fast enough. I wanted my scraper to detect new listings within minutes of them coming online.
  • I didn’t want a solution that I would have to check manually every so often, because checking Funda every hour to be on time for a new listing is exactly the stress I want to prevent. For this project I wanted to receive a notification of sorts when a new listing was placed on the website.
  • Lastly, I wanted it to be legal and not affect the Funda website whatsoever. It should be a lightweight solution that simulates someone obsessively checking the website a few times per hour.

The Idea

Web scraping is extracting data from webpages, and it can be done in many different ways. Because I made this in Python, there are a few popular libraries that come in handy: BeautifulSoup, Selenium, and Scrapy. I used the latter. Scrapy uses something called spiders to scan pages for the requested data. These spiders are able to follow linked pages to get the data and return it in a usable format: items. More on spiders later.

The next step was to make the spider run every so often. Again, there are a lot of options. The one I chose was a package called scrapyscript, which lets you run spiders from a script instead of the command line. That, in turn, made it possible to use the schedule package to run the spider every few minutes.

The last step was to send out a notification whenever a listing was added to the website. I considered using email, but I didn’t want to give my Python code access to one of my email addresses or create a new one, so I decided against that option. A great alternative, and the one I ended up using, was PushBullet. PushBullet is an app that links notifications between your phone and computer. It is most often used to get notifications like texts from your phone onto your computer, but it also has a great, easy-to-use API for sending “pushes” to a device. More on the implementation later.

Robots.txt

Most big, popular websites have a robots exclusion protocol, or simply: robots.txt. This file tells robots which areas of the website should not be crawled. Because we are pointing our spider at specific URLs, it is important to follow these rules. Funda’s robots.txt states the following:

# Prevent bots from indexing combinations of locations
Disallow: /koop/*,*
Disallow: /huur/*,*

So to abide by these rules, this solution does not pass combinations of locations to the spider.
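
As a small safeguard, the configured URLs can also be checked against this pattern before they are handed to the spider. The sketch below is my own illustration (url_is_allowed is a hypothetical helper, and it assumes that combined locations show up as a comma in the URL path, as the wildcard pattern above suggests):

from urllib.parse import urlparse

def url_is_allowed(url):
    # Per the robots.txt rules above, /koop/ and /huur/ paths that combine
    # locations (visible as a comma in the path) should not be crawled.
    path = urlparse(url).path
    if path.startswith(("/koop/", "/huur/")) and "," in path:
        return False
    return True

print(url_is_allowed("https://www.funda.nl/koop/gemeente-vught/"))       # True
print(url_is_allowed("https://www.funda.nl/koop/amsterdam,rotterdam/"))  # False (hypothetical combined URL)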

Scrapy Spider

Scrapy is an open-source Python library that is used to crawl websites. When starting a Scrapy project, you run the command:

scrapy startproject <project name>
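
For example, assuming the project is called funda_scraper, the generated skeleton looks roughly like this:

funda_scraper/
    scrapy.cfg
    funda_scraper/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py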

Scrapy makes a directory and fills it with useful (and less useful) content. The main parts of a Scrapy project are its so-called “spiders”. A spider is a class that defines how to crawl a set of webpages: it is what visits the pages and returns the data found there. A spider has one main function, parse(), which is the default callback used to handle the responses for the URLs given to it. In this project I defined my spider as follows:

from scrapy import Spider
from scrapy.loader import ItemLoader

# FundaScraperItem lives in the project's items.py (project name assumed here)
from funda_scraper.items import FundaScraperItem


class FundaSpider(Spider):
    name = 'funda_spider'
    # start_urls are passed in from the script via scrapyscript's Job (see below)

    def parse(self, response):

        # every listing on the search results page sits in an li.search-result element
        listings = response.css('li.search-result')
        for lstng in listings:

            loader = ItemLoader(item=FundaScraperItem(), selector=lstng)

            # extract the raw text/attributes for each field from the listing block
            loader.add_css('street_name', '.search-result__header-title.fd-m-none::text')
            loader.add_css('postal_code', '.search-result__header-subtitle.fd-m-none::text')
            loader.add_css('price', '.search-result-price::text')
            loader.add_css('living_space', 'ul.search-result-kenmerken span ::text')
            loader.add_css('plot_size', 'ul.search-result-kenmerken span ::text')
            loader.add_css('nr_of_rooms', 'ul.search-result-kenmerken li ::text')
            loader.add_css('url', 'div.search-result__header-title-col a::attr(href)')

            yield loader.load_item()

The parse function uses an ItemLoader to collect the data. It scours the webpage looking for a CSS selector (li.search-result in this case) and picks each result apart to fill the fields of an item. The items a spider scrapes are defined in the appropriately named items.py file. In this file an item class is created, which defines the structure of a scraped listing and the preprocessing applied to each field. In this project the item class was very simple:

from scrapy import Item, Field
from itemloaders.processors import MapCompose, TakeFirst

# clean_and_strip, keep_ints and complete_url are the custom input processors shown below.

# TakeLast is not a Scrapy built-in; a minimal counterpart to TakeFirst
# that returns the last non-empty value instead:
class TakeLast:
    def __call__(self, values):
        for value in reversed(values):
            if value is not None and value != '':
                return value


class FundaScraperItem(Item):
    street_name = Field(
        input_processor=MapCompose(clean_and_strip),
        output_processor=TakeFirst()
    )
    postal_code = Field(
        input_processor=MapCompose(clean_and_strip),
        output_processor=TakeFirst()
    )
    price = Field(
        input_processor=MapCompose(keep_ints),
        output_processor=TakeFirst()
    )
    living_space = Field(
        input_processor=MapCompose(keep_ints),
        output_processor=TakeFirst()
    )
    plot_size = Field(
        input_processor=MapCompose(keep_ints),
        output_processor=TakeLast()
    )
    nr_of_rooms = Field(
        input_processor=MapCompose(keep_ints),
        output_processor=TakeLast()
    )
    url = Field(
        input_processor=MapCompose(complete_url),
        output_processor=TakeFirst()
    )

Each field has an input processor and an output processor. The output processors pick a single value from the list of extracted elements (TakeFirst ships with Scrapy; TakeLast is a small custom equivalent that keeps the last value instead). The input processors are custom and look like this:

import re

def clean_and_strip(text):
    # remove stray line breaks and surrounding whitespace
    text = text.replace("\r\n", "").strip()
    return text

def keep_ints(text):
    # drop every character that is not a digit (currency symbols, dots, units)
    text = re.sub("[^0-9]", "", text)
    return text

def complete_url(text):
    # the scraped href is relative, so prepend the domain
    return "https://www.funda.nl" + text

The Schedule

The schedule package lets you run a function on a timer:

import time

import schedule

if __name__ == "__main__":
    # periodic_checker runs the spider and handles any new listings;
    # database, url_list, token, etc. come from config.py (see below)
    schedule.every(15).minutes.do(periodic_checker, database=database,
                                  url_list=url_list, token=token,
                                  send_notification=send_notification,
                                  open_links=open_links)
    while True:
        schedule.run_pending()
        time.sleep(1)

As mentioned before, I used scrapyscript to run the spider from script. The code used to run the spider is very simple:

from scrapyscript import Job, Processor

def run_spider(spider, url_list):
    # wrap the spider in a Job, run it, and collect the scraped items as dicts
    funda_job = Job(spider, start_urls=url_list)
    processor = Processor(settings=None)
    fetched_listings = processor.run(funda_job)
    return [dict(x) for x in fetched_listings]

This returns a list with the fetched items for each listing on the website. Every time this code runs, another function checks whether these listings already exist in the .jsonl file. If they do, nothing happens; when a new one is found, it is saved to the file and a notification is pushed.
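
That checking function itself is not shown here, but a minimal sketch of the logic could look like the following; check_new_listings is a hypothetical name, the exact fields used in the notification body are my own choice, and it relies on the pushbullet_notification function shown in the next section:

import json

def check_new_listings(listings, database, token, send_notification=True):
    # Load the URLs of every listing we have already seen (one JSON object per line).
    try:
        with open(database, "r", encoding="utf-8") as f:
            known_urls = {json.loads(line).get("url") for line in f if line.strip()}
    except FileNotFoundError:
        known_urls = set()

    new_listings = [l for l in listings if l.get("url") not in known_urls]

    # Append the new listings to the database and push a notification for each.
    with open(database, "a", encoding="utf-8") as f:
        for listing in new_listings:
            f.write(json.dumps(listing) + "\n")
            if send_notification:
                pushbullet_notification(
                    title="New listing on Funda!",
                    body=f"{listing.get('street_name')} - {listing.get('url')}",
                    token=token,
                )
    return new_listings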

The Notification

To get started working with PushBullet I used this tutorial. In short, PushBullet has PC and Android applications. After setting up my account and finding my access token, I used the following code to send push notifications to my phone whenever a new listing was found.

import json

import requests

def pushbullet_notification(title, body, token):
    # a "note" push with a title and a body
    msg = {"type": "note", "title": title, "body": body}

    # PushBullet's pushes endpoint, authenticated with the access token
    resp = requests.post('https://api.pushbullet.com/v2/pushes',
                         data=json.dumps(msg),
                         headers={'Authorization': 'Bearer ' + token,
                                  'Content-Type': 'application/json'})

    if resp.status_code != 200:
        raise Exception('Error', resp.status_code)
    else:
        print('Message sent!')

The code takes a title and the body of the notification and puts it in the correct format before sending it off using the access token.

Using the Scraper

All configuration variables are set in a config.py script. These can be configured to have the scraper run in convenient ways. The config file looks as follows:

ONE_LOOP = False
OPEN_LINKS = True
SEND_NOTIFICATION = True
DATABASE = "listing_database.jsonl"
URL_LIST = ['https://www.funda.nl/koop/gemeente-den-bosch/200000-400000/dakterras/tuin/sorteer-datum-af/',
            'https://www.funda.nl/koop/gemeente-vught/200000-400000/dakterras/tuin/sorteer-datum-af/']
PUSHBULLET_TOKEN = "string_of_your_token"
  • ONE_LOOP decides if the script should run just one time or if it should periodically check for new listings.
  • OPEN_LINKS decides if a link to the listing should be opened when a new one is found.
  • SEND_NOTIFICATION decides if notifications of new listings should be sent to the phone.
  • DATABASE is a reference to the .jsonl-file that holds all previously seen listings.
  • PUSHBULLET_TOKEN holds the pushbullet access token.
  • URL_LIST is a list that holds URLs for all housing specifications that should be checked. It is important to set the search results to “sorteer datum aflopend” so that the newest listings appear at the top; this is easily checked by looking at the end of the URL and confirming it contains /sorteer-datum-af/.
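
The wiring from config.py to the scheduler is not shown above; it is essentially just importing the module in the main script. A rough sketch, assuming the same names as in the schedule snippet earlier:

import config

# values used by the schedule snippet shown earlier
database = config.DATABASE
url_list = config.URL_LIST
token = config.PUSHBULLET_TOKEN
send_notification = config.SEND_NOTIFICATION
open_links = config.OPEN_LINKS

if config.ONE_LOOP:
    # run the check a single time and exit instead of scheduling it
    periodic_checker(database=database, url_list=url_list, token=token,
                     send_notification=send_notification, open_links=open_links)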

The Result

This scraper helped me be one of the first to reach out for an appointment whenever a new listing hit Funda. Looking for houses can be stressful, and having to check the website multiple times a day does not help. I am happy with this project and proud to say we found a nice place to call home.