Nashville Housing Data Cleaning in SQL

Data Cleaning SQL Project Image

This project executes queries to clean housing data in SQL Server.

You can find the code for this project on GitHub.

Data Cleaning

You can check out the dataset on Github. It provides information on housing data including unique ID, property address, owner address, sale date, sold as vacant or not and so on.

Let's dive right in!

Standardizing the date format

The SaleDate column has values in date time format but the time part does not hold any significance. So I converted all values of the column into date format.

Populating Property Address data

Some rows of the PropertyAddress data have null values. Upon investigation it was observed that there is a unique parcel ID for every property address present and that there are no null values in ParcelID column. This means that I can check for every ParcelId if I have a PropertyAddress value with an actual address and some null values then I can populate the null values with the address data. To do this, I performed a self join on matching the ParcelIDs AND not the UniqueIDs. I filtered the result on the rows with null values in the first table reference's PropertyAddress column and then set the null values with the address data in the second table reference's PropertyAddress column using the ISNULL() function.

Splitting address into individual columns (Address, City, State)

The PropertyAddress and OwnerAddress columns have values with full address data. The PropertyAddress has address and city mentioned so I divided the column into separate columns for Address and City. I used the SUBSTRING() function to extract the values of the PropertyAddress column from the first character index till the position before the first comma index and added them into the new column PropertySplitAddress. I did the same thing to extract the value after the first comma index till the last character index, found using the LEN() function, and added them to the new PropertySplitCity column.

The OwnerAddress column has address data with address, city and state mentioned so I split the column up into three separate columns for Address, City, and State. This time I tried a different approach for learning purpose. I used PARSENAME() function after replacing all commas (",") with periods (".") as periods are the default delimiters used by the PARSENAME() function. The PARSENAME() function splits the address into parts based on the “.” delimiter but works backwards. This means it considers the rightmost part as part 1, the second rightmost part as part 2, and so on. After extraction, these parts have been added into the three new columns- OwnerSplitAddress, OwnerSplitCity, and OwnerSplitState.

Changing 'Y' and 'N' to 'Yes' and 'No'

The SoldAsVacant column has values in both 'Y' and 'Yes' and 'N' and 'No' which requires standardizing. For this, I used a CASE statement to set the value with 'Yes' when the value is 'Y' and with 'No' when the value is 'N'.

Removing duplicates

I used Common Table Expression (CTE) named RowNumCTE which selects all columns from the table and assigns a row number to each row based on a specified partition and order criteria. The partition is defined by a combination of ParcelID, PropertyAddress, SalePrice, SaleDate, and LegalReference columns and ordered by UniqueID column. Each partition has rows with the same values for these columns and is assigned a unique sequential integer called row_num, starting at 1 for the first row in each partition. Then I used a DELETE statement to delete rows from the CTE where the row_num is greater than 1. This effectively removes duplicate rows while keeping the first occurrence.

Deleting unused columns

For learning purpose, I deleted some unnecessary columns - OwnerAddress, TaxDistrict, PropertyAddress, and SaleDate.