Thursday, April 15, 2010

The Administrative Prevalence of Autism is a Bass Distribution

There's a new paper on the rise of autism diagnoses in California: Liu et al. (2010). Its findings probably won't surprise my readers: children living in close proximity to a child already diagnosed with autism are more likely to be diagnosed with autism themselves. This reminded me of an observation I once made about administrative prevalence growth curves: they look like "word of mouth" growth curves, and they are devoid of abrupt "step" changes.

[Note: Also see Dr. Novella's take on Liu et al. (2010).]

It occurred to me to try to model this "word of mouth" type of process. The idea is that a model could be helpful in making predictions and understanding the reasons for the observed rise in prevalence ascertained from passive databases. I even wrote a simulation, and had some preliminary results. As much as I like to come up with my own models to explain things, however, I'd much rather use a proven model. So I kept trying to find an existing solution to this sort of problem.

Eventually I found something that looked very promising: the Bass Diffusion Model. This is a highly successful model that has been applied to the acquisition of durable goods, the adoption of innovations, and more recently, the growth of social networks. Evidently, the model is unheard of in the autism world, and practically undiscovered in epidemiology in general. Interestingly, though, Liu et al. (2010) repeatedly use the term "diffusion of information" to explain their findings.

Mathematically, the Bass Diffusion Model can be expressed using the following formula.
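
For reference, this is the standard closed-form (cumulative) Bass curve, written with the symbols defined below and with time shifted so that the curve starts at t0. It is the textbook form rather than anything specific to this post. Plugging in the first parameter set reported further down gives roughly 60 per 10,000 around 2013, in line with the forecast quoted later.

```latex
N(t) = m \, \frac{1 - e^{-(p+q)(t - t_0)}}{1 + \dfrac{q}{p}\, e^{-(p+q)(t - t_0)}}
```

Two quick sanity checks on this form: at t = t0 the numerator is zero, so N(t0) = 0, and as t grows the exponential terms vanish, so N(t) approaches m.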

Model variables and parameters – adapted for our purposes – are defined as follows:
  • N(t) is the administrative prevalence of autism at time t.
  • t is the time, typically represented by a year.
  • t0 is the initial time, when prevalence is zero.
  • The coefficient p is called "the coefficient of innovation, external influence or advertising effect" (Wikipedia).
  • The coefficient q is called "the coefficient of imitation, internal influence or word-of-mouth effect" (Wikipedia).
  • m is the maximum administrative prevalence of autism – i.e. the prevalence value reached when the prevalence curve finally levels off.

To apply the model to real-world data, we need to estimate its parameters. This is fairly difficult because the model is non-linear in them, so I used genetic programming to search for the parameters that produce the best fit between the model and the observations. I did this with the California DDS prevalence for ages 6 to 9, and I "trained" the model on two different time ranges. I explain the rationale at the end of the post.
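
To give a feel for this kind of evolutionary parameter search, here is a minimal sketch. It is not the code behind the results in this post. It uses a plain genetic algorithm (elitism, uniform crossover, Gaussian mutation) rather than genetic programming proper, and it minimizes the sum of squared errors. The names `bass`, `sse`, `decode`, `evolve` and `BOUNDS`, the search bounds, and all tuning constants are assumptions made for illustration. The "observations" are synthetic, generated from the Bass curve itself using the first parameter set reported in this post; they are not the California DDS series.

```python
import math
import random

def bass(t, p, q, t0, m):
    """Cumulative Bass curve N(t): zero at t0, approaching m as t grows."""
    e = math.exp(-(p + q) * (t - t0))
    return m * (1.0 - e) / (1.0 + (q / p) * e)

def sse(params, years, observed):
    """Sum of squared errors between the model and the observations."""
    return sum((bass(t, *params) - y) ** 2 for t, y in zip(years, observed))

# Assumed search bounds for the genes: log10(p), q, t0, m.
BOUNDS = [(-10.0, -4.0), (0.05, 1.0), (1900.0, 1985.0), (10.0, 200.0)]

def decode(genes):
    """Map a gene vector to (p, q, t0, m); p is searched on a log scale."""
    return (10.0 ** genes[0], genes[1], genes[2], genes[3])

def evolve(years, observed, generations=300, pop_size=60, seed=0):
    """Return the best (p, q, t0, m) found and the best error per generation."""
    rng = random.Random(seed)

    def cost(genes):
        return sse(decode(genes), years, observed)

    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=cost)
        history.append(cost(pop[0]))
        parents = pop[: pop_size // 4]  # elitism: the best quarter survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(a, b)]
            # Gaussian mutation, clamped to the search bounds.
            child = [min(hi, max(lo, g + rng.gauss(0.0, 0.02 * (hi - lo))))
                     for g, (lo, hi) in zip(child, BOUNDS)]
            children.append(child)
        pop = parents + children
    pop.sort(key=cost)
    history.append(cost(pop[0]))
    return decode(pop[0]), history

# Synthetic yearly "observations" (NOT real DDS data), for illustration only.
years = list(range(1993, 2008))
observed = [bass(t, 5.959e-8, 0.253, 1943.246, 65.395) for t in years]

best, history = evolve(years, observed)
print("best (p, q, t0, m):", best)
print("initial error:", history[0], "final error:", history[-1])
```

Because the best candidates are carried over unchanged, the best error can never get worse from one generation to the next. Note that p and t0 trade off strongly against each other in this model, so quite different (p, t0) pairs can produce nearly identical curves over a short time range.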

For 6-9 prevalence data between 1993 and 2007, the correlation coefficient was 0.9994, and the parameters were:

p = 5.959·10⁻⁸
q = 0.253
t0 = 1943.246 (year)
m = 65.395 (per 10,000 population)

When trained with prevalence between 1986 and 2007, the correlation coefficient R for the model fit was 0.9991. The resulting parameters were:

p = 1.415·10⁻⁸
q = 0.237
t0 = 1934.1 (year)
m = 70.45 (per 10,000 population)

Anyone familiar with modeling and/or statistics will tell you that a correlation coefficient of 0.9994 is not only good, it's actually hard to believe. It might even be beyond law-of-physics good.

The following is a graph of the observed 6-9 prevalence in California DDS, along with the 2 derived Bass models, with forecasting all the way to 2020.

If the first model turns out to be correct, as early as Q4 2013 the 6-9 prevalence should be very close to 60 in 10,000, and a leveling-off pattern should already be evident. The first model predicts that prevalence will level off when it reaches 65.4 in 10,000. The second model predicts it will top out at about 70.5 in 10,000. I think these projections are reasonable, considering California DDS has eligibility restrictions. But we'll just have to see if they pan out.
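
The forecasts above can be checked by evaluating the closed-form Bass curve with the two parameter sets reported in this post. The sketch below assumes the standard cumulative Bass form with time measured from t0. The function name `bass`, the dictionary names, and the choice of 2013.75 as the decimal year for Q4 2013 are my own assumptions for illustration.

```python
import math

def bass(t, p, q, t0, m):
    """Cumulative Bass curve N(t): zero at t0, approaching m as t grows."""
    e = math.exp(-(p + q) * (t - t0))
    return m * (1.0 - e) / (1.0 + (q / p) * e)

# Parameter sets as reported in this post (prevalence per 10,000).
MODEL_1993_2007 = dict(p=5.959e-8, q=0.253, t0=1943.246, m=65.395)
MODEL_1986_2007 = dict(p=1.415e-8, q=0.237, t0=1934.1, m=70.45)

# 2013.75 stands in for "Q4 2013" (an assumption about the decimal year).
for year in (2007, 2010, 2013.75, 2020):
    first = bass(year, **MODEL_1993_2007)
    second = bass(year, **MODEL_1986_2007)
    print(f"{year:>8}: first model {first:5.1f}, second model {second:5.1f}")
```

With the first parameter set, the value for late 2013 comes out a little above 60 per 10,000, and for large t both curves flatten out at their respective m.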


The main limitation of the models derived in this post is that they assume m is constant. In reality m could change, not just because of possible environmental factors, but also because of changes in diagnostic criteria, and changes in eligibility policy. That's why I used a shorter time range to derive the model I actually prefer: the one based on the 1993-2007 prevalence series only.
