Austin Contrarian was kind enough to post this video of Council Member Morrison talking about regression to determine water demand in the future. I had mentioned this moment in a previous post. This moment is great. We need our public servants to be using these concepts to get better answers. It completely warmed my heart.
Unfortunately, her main point about regression, at least as it was expressed, remains problematic. Specifically, she indicates that Austin Water staff used 20 years of data to predict demand. Opponents indicated that 5-10 years was the unofficial standard. Perhaps she was tired, hurried, or had things explained to her incorrectly, since I have met Council Member Morrison and she is both smart and a good listener. Unfortunately, framing the issue as fewer vs. more data points misses an opportunity to explain to the public what regression is, and it makes me extremely suspicious about exactly what kind of regressions people are running and handing to our decision-makers.
First of all, regressions rarely provide clear answers. Instead, they provide probabilities, confidence intervals, and ranges. Sometimes, if the inputs fed into the model are extensive and there is a lot of quality data, a regression can be very confident about what is likely to happen. But that is rarely the case. That Austin Water provided projections without any intervals or probabilities suggests that they did not actually use a linear regression. If they did, then they probably overstated how certain the regression is about likely future demand. I’ve discussed my difficulty in deciphering their estimates in an earlier post.
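To make the point about ranges concrete, here is a minimal sketch of a forecast with a 95% prediction interval from a one-variable linear regression. The numbers are invented for illustration and are not Austin Water data; the function name and the hard-coded critical value are my own assumptions, not anything Austin Water has published.

```python
import math


def ols_prediction_interval(x, y, x_new, t_crit):
    """Return (lower, point, upper) for a new observation at x_new.

    t_crit is the two-sided critical value of the t distribution with
    len(x) - 2 degrees of freedom, looked up by the caller.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance, using n - 2 degrees of freedom.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_var = ss_res / (n - 2)

    # The standard error of a new observation grows as x_new moves
    # away from the mean of the observed x values.
    point = intercept + slope * x_new
    se_pred = math.sqrt(resid_var * (1 + 1 / n + (x_new - mean_x) ** 2 / sxx))
    return point - t_crit * se_pred, point, point + t_crit * se_pred


# Made-up numbers for illustration only, NOT Austin Water data.
years = [1, 2, 3, 4, 5]
demand = [2.1, 3.9, 6.2, 7.8, 10.1]

# 3.182 is the 97.5th percentile of the t distribution with 3 degrees of
# freedom, i.e. the critical value for a 95% interval with 5 points.
low, mid, high = ols_prediction_interval(years, demand, x_new=8, t_crit=3.182)
print(f"forecast {mid:.2f}, 95% interval {low:.2f} to {high:.2f}")
```

The thing to notice is that the output is a range rather than a single line on a chart, and that the range widens the further the forecast year sits from the observed data.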
Second, the idea that 5 data points is enough, let alone 20, is pretty strange. Without getting too technical, that is usually too few points to establish a meaningful statistical relationship. More points are almost always better. If you want to read a nerdier take on how one has to be cautious about the false insight that may come from big data sets, read Columbia uber-statistician Andrew Gelman here. But in general, the appeal to fewer data points is odd, and I am particularly surprised it is coming from the opponents, who I believe are more savvy about data and with whom I agree on the demand projection issues (but not necessarily on cost).
I am a bit surprised at the mystery surrounding the Austin Water estimate, if indeed it used a regression. We don’t know for certain what the independent variables in the model are. It seems that population is the sole independent variable. That’s fine, but if it is the only one, then the model is very simple: it doesn’t take into account changes in pricing and regulations, and therefore it is likely to be a poor predictive model.
At this moment, I am fairly certain no one has shared any of the key linear regression statistics with the public. A linear regression model produces several key statistics. The first thing we want to know is the r-squared. This indicates the share of the variance in the dependent variable (here, water demand) that is explained by the selected independent variables. We also want to know the p-values for the independent variable coefficients (e.g. the coefficient on population) to gauge their statistical significance, meaning how unlikely it is that the estimated relationship arose by chance. We don’t have these outputs. Nowadays, academics put their data sets on their websites to allow for validation. It is odd that this has not happened with this project yet.
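For readers who want to see where these statistics come from, here is a rough sketch of the calculation for a one-variable model. The data is invented for illustration (it is not Austin Water data), and the function name and variable names are hypothetical. The sketch stops at the t-statistic: the p-value comes from comparing that t-statistic to a t distribution with n - 2 degrees of freedom, which any statistics package will report for you.

```python
import math


def regression_summary(x, y):
    """Fit y = intercept + slope * x by least squares and report key statistics."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # r-squared: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot

    # Standard error of the slope and its t-statistic (n - 2 degrees of freedom).
    se_slope = math.sqrt(ss_res / (n - 2) / sxx)
    t_stat = slope / se_slope if se_slope > 0 else math.inf

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "se_slope": se_slope,
        "t_stat": t_stat,
    }


# Made-up numbers for illustration only, NOT Austin Water data.
population = [1, 2, 3, 4, 5]
demand = [2.1, 3.9, 6.2, 7.8, 10.1]

for name, value in regression_summary(population, demand).items():
    print(f"{name}: {value:.4f}")
```

A low r-squared or a t-statistic close to zero would be a sign that the fitted line says little about demand, which is exactly the kind of thing the public cannot check without the model outputs.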
In my requests to Council Members Riley and Spelman, they have pointed me to the original Austin Water demand chart. But I think we all need to see the actual model to provide better feedback. For all we know we are having a debate about a demand estimate based on a regression that has a pitiful r-squared and no statistically significant coefficients for independent variables. In other words, they would be presenting noise as fact.
One final point: there are countless ways of developing predictive models that go beyond something as simple and clunky as linear regression. It appears that linear regression is the direction Austin Water took, and even though it saddens the statistician in me, I understand that it is a great choice because it is intuitive and makes sense to people not versed in statistics. That said, they should at least live up to the burdens imposed by their selected tool. At this point we have no public evidence that their demand chart (or any other that might have been provided to Council Members) is anything other than a selective interpretation of data.
UPDATE: I got the data! You can read my analysis here. Hat tip to Austin Contrarian.