Ames Housing Price Prediction

[Figure: Ames Housing Data Visualization]

Overview

In this project, a partner and I developed predictive models of housing prices in Ames, Iowa, for Century 21 Ames. We compared several modeling approaches to find the most accurate predictor of sale price, using statistical techniques and machine learning to identify the key factors that influence home values.

Challenge

The real estate market requires accurate property valuation for both buyers and sellers. Our analysis needed to:

Dataset

We worked with the Ames Housing dataset from Kaggle, containing detailed information on 1,460 homes with 79 explanatory variables:

The dataset’s comprehensive nature allowed us to investigate numerous factors affecting housing prices and create robust predictive models.

Approach

We undertook two parallel modeling approaches:

Approach 1: Neighborhood-Specific Model

We focused on three specific neighborhoods (NAmes, Edwards, BrkSide) to investigate how square footage and location influence home prices:

  1. Data Preparation: Cleaned the dataset, removing outliers identified through diagnostic plots and Cook’s distance
  2. Linear Regression: Fit a model with the formula SalePrice ~ GrLivArea * Neighborhood, which includes both main effects and their interaction, so each neighborhood gets its own intercept and its own price-per-square-foot slope
  3. Model Validation: Checked the linear regression assumptions through residual analysis and diagnostic plots
  4. Performance Evaluation: Assessed the model using adjusted R², cross-validation PRESS, and confidence intervals
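
The four steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project code: the data frame is synthetic (the intercepts, slopes, and noise level are made up), and the 4/n cutoff for Cook's distance is a common rule of thumb assumed here, not necessarily the threshold used in the analysis.

```r
# Synthetic stand-in for the three-neighborhood subset. All numbers are
# illustrative assumptions and are not taken from the Ames data.
set.seed(42)
n_per <- 60
homes <- data.frame(
  Neighborhood = factor(rep(c("BrkSide", "Edwards", "NAmes"), each = n_per)),
  GrLivArea    = round(runif(3 * n_per, min = 600, max = 2500))
)

# Hypothetical per-neighborhood intercepts and price-per-square-foot slopes.
base_price <- c(BrkSide = 20000, Edwards = 60000, NAmes = 75000)
per_sqft   <- c(BrkSide = 85,    Edwards = 50,    NAmes = 55)
nb <- as.character(homes$Neighborhood)
homes$SalePrice <- unname(base_price[nb] + per_sqft[nb] * homes$GrLivArea) +
  rnorm(nrow(homes), sd = 15000)

# Step 1: flag influential observations with Cook's distance.
# The 4/n cutoff is an assumed rule of thumb.
first_fit <- lm(SalePrice ~ GrLivArea * Neighborhood, data = homes)
cooks     <- cooks.distance(first_fit)
cleaned   <- homes[cooks <= 4 / nrow(homes), ]

# Step 2: refit on the cleaned data. `*` expands to both main effects
# plus the GrLivArea:Neighborhood interaction.
fit <- lm(SalePrice ~ GrLivArea * Neighborhood, data = cleaned)

# Step 3: plot(fit) would draw the standard residual diagnostic plots.

# Step 4: adjusted R-squared and 95% confidence intervals for the coefficients.
print(summary(fit)$adj.r.squared)
print(confint(fit))
```

With three neighborhoods the fitted model has six coefficients: a baseline intercept and slope, plus an intercept shift and a slope shift for each of the two non-baseline neighborhoods.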

Approach 2: Comprehensive Model Selection

We created three competing models for all Ames neighborhoods:

  1. Simple Linear Regression: Used log-transformed year built as a predictor
  2. Multiple Linear Regression: Combined above-ground living area (GrLivArea) and full-bathroom count (FullBath)
  3. Custom MLR Model: Incorporated lot area, living area, land contour, and land slope with relevant interaction terms

Each model was evaluated using adjusted R², cross-validation PRESS, and Kaggle score to determine which had the best predictive performance.
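
As a rough sketch of how two of these criteria can be computed in base R, the example below compares two candidate models on synthetic data. It assumes PRESS means the leave-one-out prediction error sum of squares, which for a linear model can be obtained from the residuals and hat values without refitting. The data, coefficients, and the helper name `press` are illustrative assumptions.

```r
# Synthetic data; the coefficients and noise level are made up.
set.seed(1)
n <- 100
homes <- data.frame(
  GrLivArea = round(runif(n, min = 600, max = 3000)),
  FullBath  = sample(1:3, n, replace = TRUE)
)
homes$SalePrice <- 30000 + 70 * homes$GrLivArea + 12000 * homes$FullBath +
  rnorm(n, sd = 25000)

# Leave-one-out PRESS: each residual e_i is scaled by 1 / (1 - h_ii),
# which equals the prediction error when observation i is held out.
press <- function(fit) {
  sum((residuals(fit) / (1 - hatvalues(fit)))^2)
}

candidates <- list(
  area_only      = lm(SalePrice ~ GrLivArea, data = homes),
  area_and_baths = lm(SalePrice ~ GrLivArea + FullBath, data = homes)
)

# Higher adjusted R-squared and lower PRESS both favor a model.
comparison <- data.frame(
  model     = names(candidates),
  adj_r2    = sapply(candidates, function(f) summary(f)$adj.r.squared),
  press     = sapply(candidates, press),
  row.names = NULL
)
print(comparison)
```

PRESS values are only comparable between models fit to the same response on the same scale; a model fit to a log-transformed price will produce a PRESS many orders of magnitude smaller than one fit to raw dollars.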

Key Findings

Neighborhood-Specific Analysis

Our interactive app allows users to visualize these neighborhood differences through customizable plots.

Comprehensive Model Analysis

Our custom MLR model outperformed both the simple linear regression and the provided multiple regression model:

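| Model                      | Adjusted R² | CV PRESS  | Kaggle Score |
|----------------------------|-------------|-----------|--------------|
| Simple Linear Regression   | 0.270       | 153.891   | 0.33906      |
| Multiple Linear Regression | 0.523       | 4.43×10¹² | 0.28586      |
| Custom MLR Model           | 0.571       | 4.09×10¹² | 0.28449      |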
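
For context on the last column, the sketch below assumes the Kaggle score is the metric used by Kaggle's House Prices competition: the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price, where lower is better. The function name `rmsle` and the example prices are hypothetical.

```r
# Root mean squared error on the log scale (assumed Kaggle metric).
# Working in logs means errors are judged relative to price, so a 10% miss
# on a cheap home counts the same as a 10% miss on an expensive one.
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up sale prices and predictions, for illustration only.
actual    <- c(120000, 185000, 250000)
predicted <- c(130000, 170000, 250000)
print(rmsle(actual, predicted))
```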

Key insights from our custom model:

Data Discovery: Our analysis revealed that land characteristics, often overlooked in traditional valuation, can shift property values by as much as 30% between similar-sized homes.

Impact

Our analysis provided Century 21 Ames with:

Business Applications

Client Benefits

Long-term Value

The models can be retrained periodically on new sales data, helping Century 21 Ames keep its property valuations accurate as the market changes.

Technical Implementation

We developed an interactive R Shiny application that allows real estate professionals to:

The app provides an intuitive interface for exploring housing data and making data-driven decisions in real estate transactions.

Statistical Methods Used

Tools Used

Future Directions

This project has several potential extensions that could further enhance its value:

Collaboration and My Role

This project was completed in collaboration with Max Pagan, with whom I worked closely throughout the analysis process. My primary contributions included:

The complementary skills of our team allowed us to approach the problem from multiple angles and develop a more comprehensive solution than would have been possible individually.