Tuesday, November 11, 2008

Error in Scientist Mom's Vaccine & Autism Data Analysis

Back in September there was some noise about a post by someone I'll call "Scientist Mom" (apparently she doesn't use a pseudonym at all) titled The Correlation that Does Indicate Causation. I didn't even want to read the post back then, because I had a feeling I would get drawn into analyzing the data and spend way too much time on it, time I was supposed to spend doing something else. The obvious critique of such an analysis, without knowing much about it, is that it was a pirates vs. global-warming type of correlation. Orac slammed Scientist Mom for it, and rightly so.

In my last post on the (lack of) association between rainfall and autism, I had used birth-year data from California. I thought a natural extension of that work would be to apply a detrended cross-correlation analysis to the caseload data and Scientist Mom's vaccine data.

Well, to my disappointment, the post doesn't provide any usable vaccine data. It's more of a qualitative analysis, where Scientist Mom just lists vaccines that were recommended during different time periods.

I noticed a significant error in the analysis, however. It has to do with Scientist Mom's key claim:

Most compelling of all, there was no increase in the percentage of autism cases in 2002-2004, when no vaccines were added to the childhood schedule.


I wonder if the error is obvious to some of my readers. If I mention "left-censoring" as a hint, do you see the problem now? What if I mention that in my last post I decided to left-censor the California birth-year autism caseload so that I only used data up to birth year 2000?

You see, autism prevalence-by-birth-year series always have a hook shape on the right-hand side of the graph. It doesn't matter whether the prevalence is surveyed in 2004 or in 1994; they always do. The following is an IDEA graph representing prevalence by birth year, as reported in 2001, 2002 and 2003.



Not only is there a natural decline in prevalence by birth year, because some autistics are diagnosed late; it's also the case that prevalence-by-birth-year data is not fixed in time. If we request new data from California DDS next year, the numbers can change for every birth year, and they will likely change considerably for recent birth years.
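To see why, here's a minimal simulation with entirely made-up numbers (not DDS data): every cohort gets the same true caseload, but diagnosis happens at varying ages, so cohorts born close to the report year are necessarily undercounted.

    # Illustrative simulation (made-up numbers): why prevalence-by-birth-year
    # series hook downward on the right-hand side.
    import numpy as np

    rng = np.random.default_rng(0)
    report_year = 2003
    true_cases_per_cohort = 300  # assume a flat true caseload per birth year

    for birth_year in range(1990, 2003):
        # Age at diagnosis: mostly around 3-4 years, some much later.
        dx_age = rng.gamma(shape=4.0, scale=1.0, size=true_cases_per_cohort)
        counted = np.sum(birth_year + dx_age <= report_year)
        print(birth_year, counted)  # recent cohorts come out undercounted

The counts sit near 300 for the oldest cohorts and collapse toward zero for the most recent ones, which is exactly the hook.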

This is a common mistake. Mark Blaxill has fallen for it. The Geiers have as well, assuming they didn't know what they were doing.

One way to solve the issue is to left-censor the data. Basically, you only consider the birth year data that is more likely to remain stable in the future.
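In code terms, left-censoring is just a cutoff. A minimal sketch, with a hypothetical DataFrame and column names:

    # Keep only birth years old enough that their caseload is near-final.
    import pandas as pd

    df = pd.DataFrame({"birth_year": range(1990, 2008),
                       "caseload": [0] * 18})  # placeholder values
    CUTOFF = 2000  # last birth year considered stable
    stable = df[df["birth_year"] <= CUTOFF]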

I believe a much better way to solve the issue (although this is not always feasible) is to use data on prevalence at a given age in a given year, e.g. the prevalence of 3-year-olds in the system in a given year. This data shouldn't change with time. This is the type of data that was used when there was a debate over the expected decline in the California DDS 3-5 caseload. And as you may recall, the 3-5 prevalence continued to increase.

Since California DDS provides birth-year data as reported in different years, we can estimate the caseload of autistic 3-year-olds from, say, 2000 to 2007. You basically look at each of the 32 files (8 years times 4 quarters) for the years we're interested in, and get the birth-year caseload for the report year minus 3. The resulting graph follows.



This is an approximation, of course. Consider that in the 03/2002 report, the count of children born in 1999 will not be as high as it is in the 12/2002 report. Hence the seesaw pattern.
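For anyone who wants to reproduce this, here's a rough sketch of the procedure. The file naming and column layout are assumptions on my part, based on the one file named in these posts (AUT_200703.xls); adjust to whatever the real DDS files look like.

    # Estimate the caseload of 3-year-olds from the 32 quarterly files.
    import pandas as pd

    AGE = 3
    rows = []
    for report_year in range(2000, 2008):              # 8 years...
        for quarter_end in ("03", "06", "09", "12"):   # ...times 4 quarters = 32 files
            fname = f"AUT_{report_year}{quarter_end}.xls"
            df = pd.read_excel(fname)                  # assumed columns: birth_year, caseload
            cohort = report_year - AGE                 # the cohort that is ~3 years old
            caseload = df.loc[df["birth_year"] == cohort, "caseload"].iloc[0]
            rows.append((report_year, quarter_end, caseload))

    age3 = pd.DataFrame(rows, columns=["report_year", "quarter", "caseload_age3"])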

The point is that Scientist Mom is mistaken in her finding that the prevalence of autism dropped or was stable after 2002. This completely undermines her analysis, since that was her key claim.

Thursday, November 06, 2008

Is Precipitation Associated with Autism? Now I'm Quite Sure It's Not.

In the last post I attempted to confirm whether there was a naive ecological state-level association between precipitation and IDEA autism prevalence. To my surprise, there wasn't, so there was no need to control for urbanicity.

Technically, what the result means is that, considering just this one analysis, we can't reject the null hypothesis. Of course, one could argue that state-level data is poor: the confidence interval is too wide, and a real effect could easily hide in it. (In part, this is what "not being able to prove a negative" means.)

So I couldn't leave it at that. I wanted to confirm it in some other way. I remembered I had birth-year caseload data from California DDS dating back to 1920 (continuous since 1930) that David Kirby had originally requested, a copy of which I had obtained in order to rebut one of his posts. This is data from a file called AUT_200703.xls contained in Job5028.zip, which may be requested from California DDS. Corresponding precipitation data is not difficult to obtain.

The year range I will use is 1930 to 2000. (I'm left-censoring the autism caseload at birth year 2000.) For precipitation we have to assume some sort of lag. I will use precipitation at 1 year of age. The autism and precipitation time series follow.



The time series in themselves don't look very promising, do they? But I wanted to apply some math to them in order to check whether there's at least a trend, even if not a statistically significant one.
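First, for those following along, this is roughly how the two series can be lined up. File and column names are hypothetical, but the left-censoring and the 1-year lag are as described above.

    # Align the caseload series with precipitation at age 1 (birth year + 1).
    import pandas as pd

    autism = pd.read_csv("ca_autism_by_birth_year.csv")  # assumed: birth_year, caseload
    precip = pd.read_csv("ca_precipitation.csv")         # assumed: year, inches

    autism = autism[autism["birth_year"].between(1930, 2000)]  # left-censor at 2000
    autism["precip_year"] = autism["birth_year"] + 1           # 1-year lag
    merged = autism.merge(precip, left_on="precip_year", right_on="year")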

Whenever you compare two time series, there's always a possibility that you'll end up with a pirates vs. global-warming type of association. There are different ways to control for this. One that I particularly like is called detrended cross-correlation analysis (Podobnik & Stanley, 2008). Basically, you remove the trends from the series, and then compare them. The reason I like this technique is that it's intuitive, can be illustrated graphically, and is easy for anyone with passing knowledge of Excel formula syntax to reproduce.

Now, one problem is that there is no single thing we can call the trend of a time series. There are many different ways to model trends. What we should ideally do is try several different types of trend, e.g. linear, quadratic, and cubic. For simplicity I will skip the linear and quadratic trends (they don't look adequate) and use cubic trend lines, which you can see in the graph above.
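Here's a sketch of the detrending step, in the simplified form described here (fit a cubic to each full series and keep the residuals), rather than the full box-wise procedure of Podobnik & Stanley. The demo data is synthetic; swap in the real caseload and precipitation series.

    # Cubic detrending: fit a 3rd-degree polynomial and keep the residuals.
    import numpy as np

    def cubic_residuals(x, y):
        xc = x - x.mean()                  # center x for numerical stability
        coeffs = np.polyfit(xc, y, deg=3)
        return y - np.polyval(coeffs, xc)

    rng = np.random.default_rng(1)
    years = np.arange(1930, 2001, dtype=float)
    caseload = 0.002 * (years - 1930) ** 3 + rng.normal(0, 5, years.size)  # synthetic
    inches = 20 + 5 * np.sin(years / 7) + rng.normal(0, 2, years.size)     # synthetic

    caseload_detr = cubic_residuals(years, caseload)
    inches_detr = cubic_residuals(years, inches)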

The following graph represents the cubic detrending of the original time series.



At this point we can just put the detrended data points in a scatter chart and see if there's an association.



This is the kind of scatter you'd expect to see if you compare two completely independent random variables. That is, you see a random distribution of dots and a linear regression slope that is almost completely horizontal.

Of course, we're still left with the problem of not being able to prove a negative. The slope of the linear regression is 0.11 (0.11 more California autistics for every extra inch of rain in a year), with a 95% confidence interval of -3.896 to 4.133.
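That slope and interval can be computed in a few lines. A sketch using scipy, with random stand-ins for the detrended series (linregress reports the slope's standard error, and the 95% interval is slope ± t·SE):

    # Regression slope of the detrended scatter, with a 95% confidence interval.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    inches_detr = rng.normal(0, 3, 71)     # stand-ins for the detrended series
    caseload_detr = rng.normal(0, 40, 71)

    res = stats.linregress(inches_detr, caseload_detr)
    t_crit = stats.t.ppf(0.975, df=len(inches_detr) - 2)
    lo, hi = res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr
    print(f"slope = {res.slope:.2f}, 95% CI = ({lo:.3f}, {hi:.3f})")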

But I think the scatter graph is compelling. What we see in it is entirely consistent with a complete lack of association between autism and precipitation.

Wednesday, November 05, 2008

Is Precipitation Associated with Autism? Apparently Not.

A while back I wrote a critique of the TV hypothesis by Waldman et al. I noted the likely confound is population density, which should not be considered a "fixed effect" in Waldman's methodology (an interesting statistical methodology that is apparently used frequently in Economics). When we talk about population density as a confound, we're really using it as a proxy for other confounds that are clearly not fixed in time. These more specific confounds could be things like awareness, availability of autism specialists, etc.

In general, studies like Waldman's and Palmer's likely suffer from the fundamentally incorrect assumption that regional differences in the administrative prevalence of autism reflect a real difference in actual prevalence. But I do believe it is possible to use administrative data to draw preliminary conclusions, so long as confounding factors are accounted for.

My intention in writing this post was to walk through an analysis of publicly available data, controlling for population density, to see if the rainfall effect remained. I fully expected there to be a naive ecological association between precipitation and autism. To my surprise, the effect didn't appear to exist in the first place at the US level, and there was no need to control for confounding.

The following is a scatter graph of annual precipitation by state (1971-2000) vs. the 3-5 IDEA prevalence of autism (estimated for 2006).



There's not even a trend in the expected direction. This is quite the head-scratcher, and it left me wondering what was going on. Why is it unexpected? Let's first look at a population density map of the US.



It would be reasonable to expect that counties with a higher concentration of people will have higher rates of autism diagnoses, due to increased awareness and a greater availability of autism specialists. Let's now look at a map of precipitation in the US.



The correlation between precipitation and population density is quite clear, isn't it? Why, then, didn't we see an association trend in the expected direction in the scatter graph? First, it seems that a few states bring the slope down. These would be states with a low autism prevalence but high precipitation, like Louisiana, Alabama and Mississippi.

That's a bit of bad luck for Waldman et al. Additionally, we don't have that many data points. There's unfortunately too much variability in this US-level data, which makes it pretty inadequate. Perhaps using 6-11 IDEA prevalence would be better than 3-5 prevalence. In any case, it's doubtful statistical significance would be achieved, and even if it were, it is doubtful it could withstand controlling for population density.
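For what it's worth, the naive state-level check itself is a one-liner once the data is in a table. A sketch with hypothetical file and column names:

    # Naive ecological check: precipitation vs. IDEA 3-5 autism prevalence by state.
    import pandas as pd
    from scipy import stats

    states = pd.read_csv("state_precip_idea.csv")  # assumed: state, precip_in, prev_3_5
    r, p = stats.pearsonr(states["precip_in"], states["prev_3_5"])
    print(f"Pearson r = {r:.2f}, p = {p:.3f} (n = {len(states)} states)")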

I think the association needs to be revisited in a different way. But this exercise left me wondering why Waldman et al. decided to only look at counties from certain states, namely, California, Oregon and Washington (with California not showing a clear association).

I'm going to suggest cherry-picking might have occurred when it comes to Oregon and Washington. In order to argue this point, I will simply post population density and precipitation maps of each of these states. You will see that the pattern in these two states is fairly distinctive: most people live on the west side of the state, and that's also where it rains.









To summarize: (1) It was not easy to confirm the reported association. (2) Analysis of any such associations should account for population density. (3) Cherry-picking might have occurred in this particular case.