Navigating Sydney’s Property Market: A Data-Driven Approach with dlt + dbt core
Introduction
As a data analyst transitioning into data engineering, I’m intrigued by the array of tools available in the market. However, my real excitement lies in rolling up my sleeves and embarking on my first cloud project.
dlt (data load tool) is a new Python library that automates the tedious parts of data ingestion: loading, schema management, data type detection, and scalable extraction. dbt Core is an open-source tool that enables data teams to apply analytics engineering best practices to data transformation, and Tableau Public is a free platform to explore, create, and publicly share data visualizations online.
These elements are reflected in the diagram below and further clarified in greater detail later in the article:
Problem
Navigating the property market can be daunting for people seeking homes in Sydney. While Realestate.com.au is a popular platform, obtaining reliable market insights poses a challenge.
In this competitive market, which suburbs offer dream homes that fit within our budget?
Solution
I. Data Ingestion
Rather than scraping real estate websites, I opted for the Realty in AU API via Api Dojo. Because of the API’s maximum page limit, I decided to start by retrieving sold properties from the last 12 months.
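The page cap shapes how much data one query can return, so extraction has to walk pages until the limit or an empty result. A minimal sketch of that loop, where `fetch_page`, `MAX_PAGES`, and the stubbed payloads are hypothetical stand-ins rather than the actual API client:

```python
from typing import Callable

MAX_PAGES = 50  # hypothetical cap; the real API enforces its own limit


def fetch_sold_listings(fetch_page: Callable[[int], list[dict]]) -> list[dict]:
    """Collect listings page by page until the cap or an empty page."""
    listings: list[dict] = []
    for page in range(1, MAX_PAGES + 1):
        batch = fetch_page(page)
        if not batch:  # no more results for this query
            break
        listings.extend(batch)
    return listings


# Usage with a stubbed API that returns two pages of fake listings:
pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
result = fetch_sold_listings(lambda p: pages.get(p, []))
```

Stopping on the first empty page keeps the loop cheap even when a suburb has far fewer results than the cap allows.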
With dlt’s incremental loading, only new or changed data is loaded, which keeps transfers low-latency and low-cost. The schema was inferred automatically, and the data was staged in a Google Cloud Storage bucket before being loaded into BigQuery.
The dlt pipeline I created for ingestion is in this repo. The ingested tables are shown below.
dlt feels like the perfect solution for anyone who has scratched the surface of Python development. To be able to have schemas ready for transformation in such a short space of time is truly… transformational.
II. Clean and transform data
Among the ten ingested tables, the focus was on realestate_analytics_tiered_results_results, which houses sold listings’ details such as sold_price and sold_date.
Through structured modeling layers (Staging, Intermediate, and Reporting), we prepared modular pieces for a comprehensive Tableau report.
The series of transformations helps identify, among other things, potential duplicate listings. Duplicates are produced when:
- the same listing is returned from different API calls as a listing from a surrounding suburb (one unique record);
- a property listing is replaced by a new one due to an unresolved issue with the previous buyer (both records unique).

An intermediate model, int_realestate_duplicates.sql, employs logic to pinpoint unique listings. It assumes that properties with the same address and sold dates within a 7-day window are duplicates.
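The model’s rule (same address, sold dates within 7 days) can be sketched in plain Python; the actual model is SQL in dbt, and the field names here simply mirror the description:

```python
from datetime import date


def unique_listings(listings: list[dict]) -> list[dict]:
    """Keep one listing per (address, 7-day sold_date window), earliest sale first."""
    kept: list[dict] = []
    for listing in sorted(listings, key=lambda x: x["sold_date"]):
        is_duplicate = any(
            k["address"] == listing["address"]
            and abs((k["sold_date"] - listing["sold_date"]).days) <= 7
            for k in kept
        )
        if not is_duplicate:
            kept.append(listing)
    return kept


listings = [
    {"address": "1 Smith St", "sold_date": date(2023, 6, 1), "sold_price": 900_000},
    {"address": "1 Smith St", "sold_date": date(2023, 6, 4), "sold_price": 905_000},  # within 7 days: duplicate
    {"address": "1 Smith St", "sold_date": date(2023, 9, 1), "sold_price": 950_000},  # genuine resale: kept
]
deduped = unique_listings(listings)
```

The third record survives because its sold date falls well outside the 7-day window, which is exactly the resale-vs-duplicate distinction the model needs to make.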
The dbt structure I created for data transformation can be found in this repo.
III. Data Visualization
Tableau Public, while limited in native connectors, was leveraged for data visualization. To automate the pipeline, Google Sheets served as a conduit: sheets were used for reporting, with data refreshed on a schedule and synced to the Tableau dashboard.
Final Product
The resulting dashboard addresses key queries, including the number of listed properties sold in the last 12 months, areas with the highest property sales, and the average cost of a 3-bedroom property. This personalized insight equips homebuyers with the knowledge to identify suburbs aligning with their budget and preferences.
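As an illustration of the kind of aggregation behind these dashboard figures, here is how an average 3-bedroom sold price per suburb could be computed from the cleaned listings; the records below are toy data, not the real results:

```python
from collections import defaultdict


def avg_price_by_suburb(listings: list[dict], bedrooms: int = 3) -> dict[str, float]:
    """Average sold_price per suburb for listings with the given bedroom count."""
    prices: dict[str, list[float]] = defaultdict(list)
    for listing in listings:
        if listing["bedrooms"] == bedrooms:
            prices[listing["suburb"]].append(listing["sold_price"])
    return {suburb: sum(p) / len(p) for suburb, p in prices.items()}


sample = [
    {"suburb": "Parramatta", "bedrooms": 3, "sold_price": 1_000_000},
    {"suburb": "Parramatta", "bedrooms": 3, "sold_price": 1_100_000},
    {"suburb": "Blacktown", "bedrooms": 2, "sold_price": 700_000},  # filtered out
]
averages = avg_price_by_suburb(sample)
```

In the pipeline itself, this grouping lives in the Reporting-layer dbt models, so the dashboard only reads pre-aggregated tables.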
Link to the Tableau dashboard.
Further development
- Realestate.com.au lacks certain sold listings; some properties were listed only on Domain.com.au (the second-largest platform in Australia).
- By integrating NSW Property Sales Information (PSI), we can cross-reference property prices for enhanced accuracy.
- This iteration primarily aimed to establish a foundation for analyzing properties, tailored to my specific needs and objectives. For further development, the real estate dataset could be expanded to cover NSW or even all of Australia.
Thank you for reading this far. If you found this article useful, like and share it, and feel free to reach out via LinkedIn and GitHub; all comments are appreciated 🚀