Project 3
Question 1
After importing charleston_ask.csv, I used the KFold method in sklearn to split the data into 10 folds. Because each fold uses most of the data for training and holds out the rest for testing, cross-validation helps guard against overfitting when building the model. The training score I produced is 0.019 and the testing score is -0.047. Both values are far from 1, so this linear regression model performs very badly. The poor performance is probably because the features are not in the same units and because the numbers of bedrooms and baths are not strongly correlated with the asking price. Therefore a model based on bedrooms, baths, and area is not a good predictor of asking price.
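A minimal sketch of this setup is shown below; the column names (beds, baths, sqft for the features and prices for the target) are assumptions and should be adjusted to the actual headers in charleston_ask.csv.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_validate

    df = pd.read_csv("charleston_ask.csv")
    X = df[["beds", "baths", "sqft"]]   # features: bedrooms, baths, area
    y = df["prices"]                    # target: asking price

    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    lin = LinearRegression()

    # cross_validate reports the R^2 score on each of the 10 folds
    scores = cross_validate(lin, X, y, cv=kf, return_train_score=True)
    print("mean training score:", np.mean(scores["train_score"]))
    print("mean testing score:", np.mean(scores["test_score"]))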
Question 2
Attempting to improve the model, I then applied StandardScaler in sklearn to standardize the features. The data are still split into 10 folds. The result shows that this standardized model also performs badly: the training score is still around 0.019 and the testing score changes to 0.031, which is still far from a perfect score. This implies that a linear regression model may not be well suited to these data, and that we should consider other models to capture the relationship between the features and the target.
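A sketch of the standardized version, reusing X, y, and kf from the previous snippet; placing the scaler in a pipeline means it is refit on each training fold rather than on the whole data set.

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # scale features to zero mean and unit variance, then fit the same model
    scaled_lin = make_pipeline(StandardScaler(), LinearRegression())

    scores = cross_validate(scaled_lin, X, y, cv=kf, return_train_score=True)
    print("mean training score:", np.mean(scores["train_score"]))
    print("mean testing score:", np.mean(scores["test_score"]))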
Question 3
The model still did not improve much after applying ridge regression. The new training score is 0.018 and the new testing score is -0.033. I then standardized the data to see if the model would improve; the training score becomes 0.017 and the testing score becomes -0.031, which does not make much difference. None of these values are close to 1, so neither the linear regression model nor the ridge regression model (whether or not the data are standardized) is adequate. This further suggests that the listed features are not strongly correlated with the target, and that other variables should be considered as features in order to make the asking price predictable.
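The ridge variants could look like the sketch below, again reusing X, y, and kf; the regularization strength alpha is left at its sklearn default, which is an assumption on my part.

    from sklearn.linear_model import Ridge

    for name, model in [
        ("ridge", Ridge()),
        ("ridge + scaling", make_pipeline(StandardScaler(), Ridge())),
    ]:
        scores = cross_validate(model, X, y, cv=kf, return_train_score=True)
        print(name,
              "train:", np.mean(scores["train_score"]),
              "test:", np.mean(scores["test_score"]))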
Question 4
With charleston_act.csv, the linear regression model gives a training score of 0.004 and a testing score of -0.010; after standardization, the training score remains the same and the testing score changes to -0.008. In ridge regression, the training score remains 0.004 and the testing score changes to -0.055. These values are even smaller than those for charleston_ask.csv, showing that these regression models fit this data file even more poorly.
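The same procedure can simply be pointed at the other file, as in the sketch below; the column names are again assumptions, and the imports and kf from the earlier snippets are reused.

    act = pd.read_csv("charleston_act.csv")
    X_act = act[["beds", "baths", "sqft"]]
    y_act = act["prices"]   # target: actual sale price

    for name, model in [
        ("linear", LinearRegression()),
        ("linear + scaling", make_pipeline(StandardScaler(), LinearRegression())),
        ("ridge", Ridge()),
    ]:
        scores = cross_validate(model, X_act, y_act, cv=kf, return_train_score=True)
        print(name,
              "train:", np.mean(scores["train_score"]),
              "test:", np.mean(scores["test_score"]))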
Question 5
Although the training and testing scores are still low, they generally increase for all the models mentioned above when zip codes are included. I am still using 10 folds. In the linear regression model, the training score is 0.339 and the testing score is 0.214. In linear regression after standardization, the training score remains 0.339. In ridge regression, the training score is 0.333 and the testing score is 0.219. Each model explains roughly 33% of the variance in the training data and roughly 21% of the variance in the testing data. The effectiveness of the linear regression model and the ridge regression model is similar, and the general increase in training score shows that the location (zip code) of a house influences its price more strongly than bedrooms, baths, and area do.
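One way to include the zip codes is to one-hot encode them and append the dummy columns to the existing features, as sketched below. The column name zip_code is an assumption, and the sketch reuses the actual-price data from Question 4.

    # turn each zip code into its own 0/1 indicator column
    zips = pd.get_dummies(act["zip_code"].astype(str), prefix="zip")
    X_zip = pd.concat([X_act, zips], axis=1)

    for name, model in [
        ("linear", LinearRegression()),
        ("linear + scaling", make_pipeline(StandardScaler(), LinearRegression())),
        ("ridge", Ridge()),
    ]:
        scores = cross_validate(model, X_zip, y_act, cv=kf, return_train_score=True)
        print(name,
              "train:", np.mean(scores["train_score"]),
              "test:", np.mean(scores["test_score"]))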
Question 6
The model that seems to produce the best result is the ridge regression including zip code data. I would consider this model to be underfitting the data because it does not perform well even on the training data. If I were working for Zillow as their chief data scientist, I would first check which of the features, bedrooms and baths or area in square feet, is less critical for pricing by checking their training and testing scores separately, and eliminate the weaker one from the features. Then I would add more potentially relevant factors (such as distance from infrastructure like hospitals, department stores, and schools) to the features. In this way the predictive power of the model should increase. The current data file is split into 10 folds (roughly 70, or 10% of the data, are used for testing), which I think is moderately reasonable, but if necessary I may adjust the number of splits so that the testing data fit the model better.
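A hypothetical sketch of the two checks described above, scoring the feature groups separately and then varying the number of folds; it reuses X_zip, y_act, zips, and kf from the earlier snippets.

    # compare bedrooms/baths against area, keeping the zip code columns in both
    for cols in (["beds", "baths"], ["sqft"]):
        scores = cross_validate(Ridge(), X_zip[cols + list(zips.columns)], y_act,
                                cv=kf, return_train_score=True)
        print(cols, "test:", np.mean(scores["test_score"]))

    # changing the number of splits changes how much data each test fold holds
    for n in (5, 10, 20):
        cv = KFold(n_splits=n, shuffle=True, random_state=0)
        scores = cross_validate(Ridge(), X_zip, y_act, cv=cv, return_train_score=True)
        print(n, "folds, test:", np.mean(scores["test_score"]))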