Semester Project: Luxury Apartment Web Crawler — A. Dziedzic, D. Leckner

Andrew Dziedzic
Web Mining [IS688, Spring 2022]
11 min read · May 3, 2022

--

The key problem in this semester project was to generate insights for an individual or household contemplating a move to a luxury apartment in a different U.S. city: what do luxury apartments look like in terms of pricing, pricing fluctuations, and volatility? Additionally, if you are satisfied with the city you currently reside in, would a transition to a different apartment type within that city be worthwhile? The main motivation is to provide granular, transparent insight across various U.S. cities so that an individual, a couple, or a small family can understand where a move would benefit them, and to inform them of pricing fluctuations so they can be confident and comfortable in deciding where to potentially relocate.

We take the perspective of an individual who rents an apartment, specifically a luxury apartment, in a large city within the United States. We analyze a single real estate corporation's luxury apartment rental listings to look for insights into the cost of living in a luxury apartment in a popular U.S. city. Specifically, we look at pricing trends and price increases for studio, 1-bedroom, and 2-bedroom apartments across several popular U.S. cities. This generates key information for people who work remotely, or who plan to transition to a full-time remote job, about where they can live at a lower cost than in other U.S. cities; potentially thousands of individuals would be interested in it. Renters in a luxury apartment in one of these cities will be able to see pricing trends for the various apartment types, as well as any unusual and significant price increases or price gouging. We track various KPIs along with other variables such as apartment type (studio, 1-bedroom, 2-bedroom), amenities, square footage, location, etc.
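As a rough illustration only (the column names and values below are invented for demonstration and do not reproduce our exact schema), a single daily record tracked per available unit might look like the following:

```python
import pandas as pd

# Hypothetical example of one daily record; names and values are illustrative.
sample_record = {
    "scrape_date": "2022-04-01",        # day the listing was captured
    "city": "New York, NY",
    "building": "View 34",              # one of the UDR buildings crawled
    "unit_type": "1 Bedroom",           # Studio / 1 Bedroom / 2 Bedroom
    "square_feet": 700,
    "monthly_rent_usd": 4500,
    "features": "River view; Corner unit",
    "available_date": "2022-05-15",
}

df = pd.DataFrame([sample_record])
print(df.dtypes)
```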

The code for the daily feed and ingestion of new data runs each day and correctly acquires the data required. Once that code was complete, the next step was to expand the data to include additional U.S. cities. This required very little change to the code: several lines had to be altered, but not many. After the expansion to other cities, we performed data quality assurance to make sure the new apartment buildings and apartment types were correctly captured and ingested alongside the historical data. With other cities incorporated into the daily feed, we performed the same analysis and insight generation already completed for apartment buildings in New York City, and compared apartment types across cities to highlight the major differences in pricing, features, square footage, and availability dates. One piece of unforeseen work in the analysis was parsing the key features of each apartment type: specific calculations had to be performed to parse the free-text string describing each unit's features. Once the parsing was working correctly, after many attempts, we could see at a granular level which specific apartment features are significant factors in high or low monthly rents (a sketch of this parsing is shown below). Within the analysis portion, we also created, modified, enhanced, and built out a set of visualizations for insight generation; 10–12 visualizations were created and are now updated and reviewed on a daily/weekly basis for key changes. The basic analysis technique has been constructed and only needs to be maintained and modified going forward.
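As a minimal sketch of that parsing step, assuming the features arrive as a single delimited string (the real delimiter and feature names may differ), one could one-hot encode them with pandas along these lines:

```python
import pandas as pd

# Hypothetical listings; the real delimiter and feature names may differ.
df = pd.DataFrame({
    "unit": ["A-101", "B-202"],
    "monthly_rent": [4850, 4100],
    "features": ["River view; Corner unit; Espresso Brown finish",
                 "Smart home package; Maple White finish"],
})

# One-hot encode the free-text features so each becomes its own 0/1 column,
# ready to compare against monthly rent as an explanatory variable.
feature_flags = df["features"].str.get_dummies(sep="; ")
df = pd.concat([df, feature_flags], axis=1)
print(df)
```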

There has been substantial discussion about several problems with our first approach to this analysis. For example, data storage was an initial concern: in a professional environment, the data would be stored in SAP, Snowflake, or a similar platform, with Alteryx workflows maintaining a stable data pipeline. Ensuring that we could continue to capture daily data and consolidate it into a single master file for analysis was also a concern, but this is no longer an issue. The current amount of data is workable and we can perform analysis to generate insights; however, if this project were to continue for 6–12 months, handling millions of records would have to be addressed with professional tools such as SAP, Google Cloud, Microsoft Azure, or Amazon Redshift. Another challenge was the time needed to develop the web crawler/scraper for this project: estimating the timeline for building it, fixing bugs, and testing that the data was captured correctly was a struggle. Understanding the data schema and structure returned by the site's API calls was another concern, as was the level and type of information we could capture, but we now believe there is sufficient data to carry out analyses and experiments going forward.

Writing the code to scrape the necessary information was, at the beginning, a relatively easy process. However, while the information on the target sites appeared to simply be "just sitting there," there was a caveat. Beautiful Soup lets a crawler fetch data as it is presented in a page's HTML, but for the UDR site in particular, Beautiful Soup initially could not find any data on the target pages. This is because none of the information is static on the front end of the real estate company's site: the data is retrieved via an API call to the back end, which then renders it on the page. To handle this, it was necessary to abandon Beautiful Soup and use Selenium instead. With Selenium, we used the Chrome WebDriver to render each page: feeding the target URLs into the Selenium WebDriver, we instructed the code to wait for the data to be fully visible on the page before crawling it (see the sketch below). One dataframe holds the collection of apartment buildings and companies being queried; the final dataframe holds the scraped information itself, consisting of daily data centered on each apartment building and the features of its available units.
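A simplified sketch of that Selenium pattern is shown below; the URL, CSS selector, and timeout are placeholders rather than the exact values used in our crawler:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # requires ChromeDriver to be installed
try:
    # Placeholder URL standing in for a specific UDR building page.
    driver.get("https://www.udr.com/new-york-city-apartments/")

    # Wait until the unit listings rendered by the back-end API call
    # are actually present in the DOM before scraping them.
    WebDriverWait(driver, 30).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".unit-card"))
    )
    units = driver.find_elements(By.CSS_SELECTOR, ".unit-card")
    rows = [u.text for u in units]
finally:
    driver.quit()
```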

The data we collected covers luxury apartment rental units and the various KPIs/variables associated with each apartment type. We mined and crawled the UDR website (Luxury Apartments for Rent | UDR Apartments), with the initial focus and effort on New York, NY. Once the various apartment buildings in New York City were complete, we kept the approach consistent and applied it, both conceptually and computationally, to other U.S. cities throughout the country.

The Python code used APIs, Beautiful Soup, Selenium, and the ChromeDriver software for the initial crawling and mining of the UDR website for a specific apartment building. Once the initial crawl was created and a given day's variables and KPIs were successfully captured, the only update required was a daily maintenance run to capture the current day's data and push it, along with all the historical data, to the GitHub repository. The daily push consolidates the current day's data with the entire historical dataset into an output file stored in GitHub as a CSV. After each daily maintenance run, the 'master' CSV file holds all the historical data as well as the current day's data. Once the 'master' CSV file is cleaned and converted to an Excel worksheet, it is ready to produce insights, analytics, reports, and data visualization through Tableau Software.
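A minimal sketch of that daily maintenance step is shown below; the file names are assumptions for illustration, and the actual repository layout may differ:

```python
import pandas as pd
from datetime import date

# Today's crawl output, tagged with the scrape date.
today_df = pd.read_csv("daily_scrape.csv")
today_df["scrape_date"] = date.today().isoformat()

# Append today's rows to the historical 'master' file, then the updated
# CSV is committed and pushed to the GitHub repository.
master = pd.read_csv("master.csv")
master = pd.concat([master, today_df], ignore_index=True)
master.to_csv("master.csv", index=False)
```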

As we perform a daily push to generate and store the current day's data, we now know that a single luxury apartment building in New York City produces between 158–173 new data points daily; this is the range we observed over the first week of running the daily push. The matrix built and processed in each daily push consists of 14 columns (14 variables, both qualitative and quantitative), with 158–173 new rows added daily. If this daily range remains consistent, scaling to the other UDR luxury apartment buildings in New York City (6 UDR luxury apartment buildings in total) gives a daily run rate of 948–1,038 new data points added to the 'master' matrix, or 28,440–31,140 per 30-day month. If this monthly run rate for six (6) luxury apartment buildings in one city is similar in other cities, we can estimate how large the 'master' matrix would grow if the crawl were applied to five (5) other large U.S. cities: across six (6) U.S. cities in total, roughly 171,000–187,000 data points would be added per month, and over an entire calendar year the estimate would be roughly 2,048,000–2,242,000 data points. At the beginning of this process, the data can be stored in an Excel file using the 'master' matrix schema; however, if this process were to run for an entire year, storing over 2 million data points in a single file is not feasible, and a more powerful tool such as SAP, Microsoft Azure, AWS, or Google Cloud would be needed to handle the data for processing.
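The run-rate figures above can be reproduced with a quick back-of-the-envelope calculation, assuming 6 buildings per city, 6 cities, and 30-day months:

```python
# Observed daily rows per building during the first week of the push.
low, high = 158, 173
buildings, cities = 6, 6

daily_city = (low * buildings, high * buildings)        # 948 - 1,038 per day
monthly_city = tuple(30 * d for d in daily_city)        # 28,440 - 31,140 per month
monthly_all = tuple(cities * m for m in monthly_city)   # ~171,000 - 187,000 per month
yearly_all = tuple(12 * m for m in monthly_all)         # ~2.05M - 2.24M per year
print(daily_city, monthly_city, monthly_all, yearly_all)
```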

It was fascinating to see, at the conclusion of this project, the specific and thorough information that was gained. In particular, the greatest value in terms of pricing and square footage is in the Dallas, Texas area. If it is necessary to stay in the Northeast, Philadelphia is a significantly better option than Boston or New York. Boston and New York features and pricing are driven by views (Bay, Hudson River), while Philadelphia features and pricing are driven by specific finish packages (Espresso Brown, Maple White). Austin and Dallas features and pricing are driven by smart home packages, with Dallas $0.30–$0.50 cheaper per square foot than Austin. Another useful conclusion of this project is the insight it offers not only to someone seeking to relocate to a different U.S. city, but also to someone looking to stay in the same building and find the same apartment type at a lower cost. We can clearly indicate, for an individual seeking to remain in a 1-bedroom luxury apartment in New York City, that a transition to the Leonard Pointe building would yield the best cost savings. Additionally, for an individual seeking to remain in a 1-bedroom luxury apartment in the building where they currently reside, giving up a river view with a corner location would yield a cost savings of 16% per month. It was remarkable to see how features within the same apartment type in the same building can have such a significant influence on monthly rent.

The key learnings from start to completion of this project were API creation, web crawling, crawling incognito, bot creation, data collection, data engineering, data analysis, visual analytics, and project presentation. Learning how to properly perform a daily scrape using various bots, chosen at random throughout the course of a day, was challenging. Additionally, analyzing the changes occurring across each city and each apartment type required both data and visual analytics. Lastly, constructing a project presentation that allows an audience to fully comprehend the project goal and tasks took careful consideration and constant revision.
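As an illustration of the incognito-crawling idea, randomizing the browser fingerprint and request timing might look like the sketch below; the user-agent strings and delay range are assumptions, not our exact configuration:

```python
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative pool of user-agent strings to vary the crawler's fingerprint.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

options = Options()
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
driver = webdriver.Chrome(options=options)

# Random jitter so the daily scrape does not start at the same moment each run.
time.sleep(random.uniform(5, 30))
```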

To conclude, we believe this project provides the transparency individuals need to make the right decision about whether or not to move. There are further opportunities for development, including adding new cities to the daily crawl and refining the specific criteria that distinguish a "luxury" apartment from a "regular" one. Additional metrics can also be explored, including how they change with seasonality (different times of year to move).

Analytics, Insights and Appendix below:

pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language.
Selenium is an open-source, free, portable tool used to perform web testing in a seamless manner. It takes a load of the burden off testers, since it drastically reduces the time required to perform testing, especially on repeated test cases. These are among the reasons Selenium is more popular and powerful than other web-automation tools on the market.
Tableau is a broad and deep data and analytics platform and one of the fastest-growing data visualization tools. It is a business intelligence tool that helps analyze raw data in a visual form, such as a graph or report, so the raw data can be easily simplified into a format understandable to users.
This is the user interface of the View 34 luxury apartment building within New York City.
This is a zoomed-in image of the user interface of the View 34 luxury apartment building within New York City.
Avg. Monthly Rent ($) for a 1 Bedroom Luxury Apartment in New York City by building.
Avg. Monthly Rent ($) for a 1 Bedroom Luxury Apartment in New York City by building with the top feature (feature #1) being shown.
Avg. Monthly Rent ($) for a 1 Bedroom Luxury Apartment in New York City within the View 34 building with all features being shown.

APPENDIX

There was significant discussion regarding the approach for this project. After discussing other possible projects and the feasibility of each, it was decided to move forward with this one. Since we were both configuring web crawlers for the candidate projects while deciding which to pursue, we agreed to use the foundation Derek was working on; the code then had to be changed to perform a daily scrape of the specific website. It was a collaborative approach from the beginning. Additional responsibilities for each member of the group are below:

Andrew Dziedzic: Front-End Responsibilities: data analysis, visual analytics, data governance, quality assurance

Derek Leckner: Back-End Responsibilities: data crawling, bot creation, data engineering design

For further information or questions, please email ad386@njit.edu or dl489@njit.edu
