MANAGERIAL REPORT
INTRODUCTION
The purpose of this analysis was to develop a regression model to predict mortality. Researchers at General Motors collected data on 60 U.S. Standard Metropolitan Statistical Areas (SMSAs) in a study of whether air pollution contributes to mortality. These data were obtained and randomly divided into two equal groups of 30 cities. A regression model to predict mortality was built from the first group and validated against the second.
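The random division into a model-building group and a validation group can be sketched as follows. This is only an illustration: the report does not say how the split was performed, and the city identifiers, function name and seed below are hypothetical.

```python
import random


def split_in_half(items, seed=0):
    """Shuffle a copy of items and return two equal-sized halves."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers 0..59 stand in for the 60 SMSAs.
build_set, validation_set = split_in_half(range(60), seed=0)
print(len(build_set), len(validation_set))
```

The first half would be used to fit the model and the second half to validate it.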
BODY
The following variables were found to be the key drivers in the model:
• Mean July temperature in the city (degrees F)
• Mean relative humidity of the city
• Median education
• Percent of white collar workers
• Median income
• Sulfur dioxide pollution potential
The objective in this analysis was to find the linear model, using the variables mentioned above, for which the sum of the squared deviations between the observed and predicted values of mortality is smaller than for any other linear model, and for which the differences between the observed and predicted values of mortality sum to zero. Once found, this "least squares line" can be used to estimate mean mortality or to predict mortality for given values of the above variables. Each of the key variables was checked for a bell-shaped symmetry about the mean (normality), for a linear (straight-line) relationship when graphed, and for equal spread of the measurements about the mean (constant variance). After determining whether to exclude data points, the following model was determined to be the best model:
y = −3276.108 + 862.9355 x1 − 25.37582 x2 + 0.599213 x3 + 0.0239648 x4 + 0.01894907 x5 − 41.16529 x6 + 0.3147058 x7 + ε
See the list of independent variables on TAB #1. This model was validated against the second set of data, where it was determined, with 95% confidence, that there is sufficient evidence to conclude that the model is useful for predicting mortality.
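The least squares idea described above can be illustrated with a single predictor. The sketch below is not the report's seven-variable model. It uses made-up numbers and hypothetical function names, and only shows the two defining properties: the residuals sum to zero, and no other line gives a smaller sum of squared deviations.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sse(x, y, intercept, slope):
    """Sum of squared deviations between observed and predicted y."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustration values, not data from the study.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(sum(residuals))                                    # zero up to rounding error
print(sse(x, y, b0, b1) <= sse(x, y, b0 + 0.1, b1))      # a shifted line fits worse
```

With several predictors the same criterion applies, but the coefficients are normally obtained with statistical software rather than by hand.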
Although this model, when validated, is deemed suitable for estimation and prediction, as noted by the 5% error ratio (TAB #2), there are significant concerns about the model. First, although the percent of sample variability that can be explained by the model, as noted by the R² value on TAB #3, is 53.1%, after adjusting this value for the number of parameters in the model, the percent of explained variability is reduced to 38.2% (TAB #3). The remaining variability is not explained by the model and is treated as random error. Second, it appears that some of the independent variables are contributing redundant information because they are correlated with other independent variables, a condition known as multicollinearity. Third, it was determined that an outlying observation (a value lying more than three standard deviations from the mean) was influencing the estimated coefficients.
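The adjustment from 53.1% to 38.2% can be checked with the usual adjusted R² formula, 1 − (1 − R²)(n − 1)/[n − (k + 1)]. The sketch below assumes that this standard formula is the one behind TAB #3, and takes R² = .531026, n = 30 and k = 7 from the F-test substitution in the statistical report. The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1))


adj = adjusted_r2(0.531026, n=30, k=7)
print(round(adj, 3))  # 0.382
```

The drop is large because seven predictors are being estimated from only 30 observations.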
In addition to the problems observed above, it is unknown how the sample data were obtained. It is assumed that the values of the independent variables were uncontrolled, which makes these observational data. With observational data, a statistically significant relationship between a response y and a predictor variable x does not necessarily imply a cause-and-effect relationship. For this reason, a designed experiment would produce better results. With a designed experiment we could, for instance, control the time period to which the data correspond. Data covering a longer period of time would likely improve the consistency of the data by reducing the effect of any extreme or unusual values in the current time period. We also do not know how the cities were selected. Assuming that the percentage of white collar workers is negatively correlated with pollution, the optimal selection of cities would include an equal number of white collar cities and non-white collar cities. Furthermore, assuming a correlation between high temperature and mortality, an optimal selection of cities would include an equal number of northern cities and southern cities.
CONCLUSIONS AND RECOMMENDATIONS
The model has been tested and validated on a second set of data. Although there are some limitations to the model, it appears to provide good results at the 95% confidence level. If time had permitted, different combinations of independent variables could have been tested in order to increase the R² value and decrease the multicollinearity (mentioned above). However, until more time can be allocated to this project, the results obtained from this model can be deemed appropriate.
STATISTICAL REPORT
MODEL SELECTION
In order to select the best model, several steps were carried out. Sometimes, data transformations are performed on the y values to make them more nearly satisfy the important model assumptions listed below:
a) Linearity – the mean value of mortality, given any independent variable, is a linear function of that variable.
b) Independence – the random errors (the differences between mortality and the mean value of mortality given the values of the independent variables) are independent.
c) Normality – for any value of an independent variable, the values of mortality are normally distributed.
d) Equal Variance – for any value of an independent variable, the values of mortality have the same variance.
Sometimes transformations are performed to make the deterministic portion of the model a better approximation to the mean value of the transformed response. In order for mortality to be transformed, there must be an obvious improvement in the model.
In addition to the above tests, outlying observations (defined in the managerial report) were found for three cities. Examination of the data revealed that these three cities had an obviously lower relative humidity than the other cities. Furthermore, two of these cities had a much higher percentage of white collar workers. When these extreme data points were removed on a trial basis, model normality improved. However, linearity was negatively affected for the July temperature variable and the R² value (defined above) was reduced from 68.6% to 65%. It was therefore decided not to remove these observations from the final model.
After the above analysis was complete, a report displaying all possible regressions was run. The best model should show some combination of the following criteria:
a) Highest R² value, which indicates the percentage of sample variability that can be explained by the model.
b) Lowest Root MSE, the square root of the mean of the squared deviations of the observations about the fitted model.
c) Lowest Cp criterion. A small Cp value reflects a small estimated variance and indicates that little or no bias exists in the subset regression model. Bias occurs when the mean of the sampling distribution of an estimator does not equal the parameter being estimated.
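The report does not state the formula behind the Cp criterion. A minimal sketch, assuming the usual Mallows definition Cp = SSE_p / MSE_full + 2p − n, where p counts the parameters of the subset model including the intercept, is given below. All numeric inputs are hypothetical and are not taken from TAB #6.

```python
def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows Cp for a subset model with p parameters (intercept included)."""
    return sse_subset / mse_full + 2 * p - n


# Hypothetical values for illustration only.
n = 30
sse_full = 33000.0                # SSE of the full model with 9 parameters
mse_full = sse_full / (n - 9)     # full-model estimate of the error variance

cp_full = mallows_cp(sse_full, mse_full, n, p=9)    # equals p for the full model
cp_subset = mallows_cp(34000.0, mse_full, n, p=8)   # a subset with one fewer term
print(cp_full, cp_subset)
```

Under this definition a subset model with little bias has a Cp near or below its own p, which is why small values are preferred when comparing candidate models.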
Three models were initially chosen for comparison from TAB #6: the 6, 7 and 8 variable models (models #1, #2 and #3, respectively). After comparing these models, the 7 variable model (model #2) was chosen. As displayed on TAB #7, its R-squared and adjusted R-squared are approximately 2% less than those of model #3. This is not enough of a difference to justify the more complex model. The Root MSE is 39.7 vs. 39.3 in model #3. Model #2 has the best Cp value, .64653, of the models compared. Multicollinearity is somewhat more of a concern than in model #1 for the following reasons: a) the tests on the following individual independent variables are nonsignificant although the overall model test is significant: July temp x2, relative humidity x2 and white collar x2; b) variance inflation factors (a measure of multicollinearity, which is considered to exist when a factor is greater than 10) are > 10 for four variables, compared with two for model #1 and six for model #3; c) the estimated coefficient is negative for three variables, compared with two for model #1 and four for model #3. Although multicollinearity is greater for model #2 than for model #1, model #2 also has more variables. Multicollinearity in model #3 is the worst. Although normality is close for models #1 and #2, #2 looks better because more points are concentrated at the center of the plot. Variability is very close for models #1 and #2; however, it may be slightly better for #2. For these reasons, model #2 is chosen over the other models.
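The variance inflation factor for a variable is commonly computed as 1 / (1 − R_j²), where R_j² is the R² obtained when that variable is regressed on the other independent variables. The sketch below assumes this standard definition and applies the report's rule of thumb (VIF greater than 10). The variable names and R_j² values are hypothetical, not figures from the report's tabs.

```python
def vif(r2_j):
    """Variance inflation factor from the R-squared of x_j on the other predictors."""
    return 1.0 / (1.0 - r2_j)


def flag_collinear(r2_by_variable, threshold=10.0):
    """Return the names of variables whose VIF exceeds the threshold."""
    return [name for name, r2 in r2_by_variable.items() if vif(r2) > threshold]


# Hypothetical R_j-squared values for illustration only.
example = {"july_temp": 0.95, "humidity": 0.92, "income": 0.50}
print(flag_collinear(example))  # ['july_temp', 'humidity']
```

A VIF above 10 corresponds to an R_j² above 0.90, meaning more than 90% of that variable's variation is already carried by the other predictors.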
MODEL TESTING
The model was validated for predicting and estimating mortality with the following hypothesis test:
H0: All coefficients in the model are equal to zero (β1 = β2 = … = βk = 0).
Ha: At least one of the coefficients is not equal to zero.
Rejection Region: F > Fα (where the distribution of F depends on k numerator df and n − (k + 1) denominator df)
Test Statistic: F = (Mean Square for model) / (Mean Square for error) = (R²/k) / ((1 − R²)/[n − (k + 1)])
where n = number of observations and k = number of parameters (excluding intercept)
Substitution (TAB #3): F = (.531026 / 7) / ((1 − .531026) / [30 − (7 + 1)]) = 3.5587
Decision: Reject H0
Conclusion: There is sufficient evidence to conclude that at least one of the variables is useful for estimating mortality.
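The substitution above can be reproduced in a few lines. The sketch uses the R² value from TAB #3 with n = 30 and k = 7. The critical value is an assumption read from a standard F table (about 2.46 for 7 and 22 degrees of freedom at α = .05), since the report does not quote it, and the function name is illustrative.

```python
def overall_f(r2, n, k):
    """Overall F statistic computed from R-squared, n observations, k predictors."""
    return (r2 / k) / ((1.0 - r2) / (n - (k + 1)))


f_stat = overall_f(0.531026, n=30, k=7)
print(round(f_stat, 4))  # 3.5587

# Assumed tabled value for F(.05; 7, 22); not quoted in the report.
F_CRITICAL = 2.46
print(f_stat > F_CRITICAL)  # True, so H0 is rejected
```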
Confidence Interval:
ȳ ± t(α/2) · s(ȳ), where s(ȳ) = s/√n and t(α/2) is a t value based on (n − 1) degrees of freedom
Substitution (TAB #8): 50.53793 ± 2.074 * 6.334616 = (37.39993642, 63.67592358)
Substitution (TAB #2): 5.316607 ± 2.074 * 0.6332737 = (4.003197346, 6.630016654)
Conclusion: The mean absolute value of the residuals is 50.5 deaths and the mean percentage error is 5.3%. Therefore, with 95% confidence, we can say that the mean absolute error falls between 37 and 64 deaths, with an error ratio between 4% and 7%.
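Both intervals follow the same mean ± t × standard error arithmetic, which can be checked as below. The means, standard errors and the t value of 2.074 are taken directly from the substitutions above. The function name is illustrative.

```python
def mean_interval(mean, t_value, std_error):
    """Return (lower, upper) bounds of mean ± t_value * std_error."""
    margin = t_value * std_error
    return mean - margin, mean + margin


abs_low, abs_high = mean_interval(50.53793, 2.074, 6.334616)    # TAB #8
pct_low, pct_high = mean_interval(5.316607, 2.074, 0.6332737)   # TAB #2
print(round(abs_low, 2), round(abs_high, 2))  # 37.4 63.68
print(round(pct_low, 2), round(pct_high, 2))  # 4.0 6.63
```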
CONCLUSIONS AND RECOMMENDATIONS
Although there seem to be several problems, including a low R², severe multicollinearity, influential observations and problems with linearity and variability, the model is deemed to be a good estimator/predictor of mortality. Improvements such as better data collection (through a controlled experiment), a larger sample size, multicollinearity analysis (inclusion and exclusion of different variables) and data transformation analysis could result in better model prediction. However, analysis of this type is extremely time consuming and is recommended only if additional funds can be generated.