* Add missing data handling to the just-pass-in-a-matrix bit of the high-level API * Add parallel array handling to build_design_matrices * Add parallel array handling of some sort to high-level API... * Refactor build so that there are two stages - first stage takes a set of factor evaluators, and returns a set of evaluated columns - second stage handles interactions and categorical coding and assembles these together into design matrices use case: any model where you actually want to get categorical data out (like multinomial regression with a factor on the LHS, or CART with factors on the right-hand side) ** first stage should also handle other "parallel" data, like weights, which need to participate in the missingness calculations ** possibly also support a "subset=" argument at this stage ** and for parallel vectors and subset=, allow a string as a value, and if seen then evaluate it as python code in the same context as formula data (like R's subset=(MyCol > 10)) ** And do NaN/mask/missing data handling at this stage *** Imputation? *** numpy.ma * Better NaN/masks/missing data handling in transforms. I think the current ones will just blow up if there are any NaNs. (The previous entry is about handling the term "x" where x has NAs; this entry is about handling "center(x)" where x has NAs.) R's solution to this is that scale(x) simply unconditionally ignores NAs when computing the mean, regardless of the overall setting of na.action. That seems reasonable... * Advocacy Potential users? - statsmodels - PyMC has a (closed) ticket requesting such features: http://code.google.com/p/pymc/issues/detail?id=162 - nipy, though they have their own thing... - sklearn, which has regression (and might find it useful otherwise!) * Do something smarter with mismatched pandas indexes Right now we're conservative -- if you do ~ x + y and x and y don't have identical indexes, then that's an error. It's possible we should do something cleverer, though. Perhaps we should merge them somehow (pandas.concat(..., join="outer")?). (This of course would require that *all* items have indexes, though; right now you can mix plain ndarrays and pandas objects.) * Improve EvalFactor's stateful transform handling to follow . lookups right now it can only detect stateful transforms when they are called directly like scale(x) but not if referenced through some module like mylib.scale(x) In general we don't even want to try handling every possible function lookup syntax (see next item for a safety check for that), but we should allow for people to distribute non-builtin stateful transforms. * As a safety check for non-stateful transforms, we should always evaluate each formula on just the first row of data alone, and make sure the result matches what we got when evaluating it vectorized (i.e., confirm f(x[0]) == f(x)[0], where f is our transform. However, this is kind of tricky given that x might be pulled out of the environment, the 'data' dict might have arbitrary objects, etc. Hmm. Maybe intercept variable lookups and just munge those? This is easy to do if someone's passing in a structured array or dataframe and pulling all their data from it, or even if they use a dict with well-behaved columns. But the problem is when people do things like: In [1]: logx = np.log(data["x"]) # refers to data["y"] and logx together In [2]: lm("y ~ logx", data) * More contrast tools - Some sort of symbolic tools for user-defined contrasts -- take the comparisons that people want to compute in terms of linear combinations of level names, convert that to a matrix and do the pinv dance? We have the linear_contrast code already, but that's for describing constraints in terms of the coefficients you have -- it seems like people want to be able to describe constraints in terms of... I'm not sure what. Group means? The coefficients they could have had if they'd fit some other model? (Presumably the all-full-rank-dummy-coding-all-the-time model.) If I can ever figure out what this is (it has something to do with "estimable contrasts") then I'll implement it. - Short-hands for Type II, Type III, and "remove this term and everything marginal to it" contrast tests? Might need to figure out the trick that car::Anova uses to do efficient Type II tests with two contrast matrices. - Understand how coding matters for Type-III ANOVA. The tabs I had open last time I was looking at this: http://goanna.cs.rmit.edu.au/~fscholer/anova.php http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg69781.html https://stat.ethz.ch/pipermail/r-help/2007-October/143047.html http://www.uni-kiel.de/psychologie/dwoll/r/ssTypes.php * A good way to support magic functions like mgcv's s(). statsmodels wants this for things like y ~ arima(2, 3) y ~ garch(1, 1) the cheap trick way of doing it is: class ArimaModelType(object): __patsy_magic__ = True ... def arima(n, m): return ArimaModelType(n, m) and then in the factor type sniffing code detect these things and separate them out from "real" factors. * make sure that pickling works - And make sure that if we allow it at all, then it's sustainable! i.e. we'll be able to guarantee that if people pickle a ModelDesc or Design or whatever now, then they'll be able to get it back later. * Should EvalEnvironment.capture make a copy of the scope dictionaries? - The effect would be to prevent later changes in the enclosing scope from affecting predictions. Of course, we probably don't want to make a *deep* copy of the scope, so there's still no guarantees -- changes to mutable objects within that scope would still be visible. Perhaps we *could* get away with making a deep copy of all mutable objects that are accessed during the initial build, though... I think we'd need to special-case and ignore any READONLY ndarrays, as a safety valve for people who have a giant data-set they're referring to. of course, even a deep copy isn't enough -- they could call an immutable function which accesses mutable state. - Josef points out that in long-running REPLs people often need to del local variables to let memory be released, and if all their formulas are going and making shallow copies of the environment then this will be impossible. So making a shallow copy is probably out. - The other approach would be to extend the state dependency checking that we already want to do (to catch undeclared stateful transforms), and have it not only check that building an isolated row of data gives the same result as building the full list, but also that re-building that same row later at prediction time gives the same result as it did in the first place. * Export information on which terms are marginal to which other ones Marginality only makes sense within a numeric-interaction "bucket", so this has to be computed in patsy.build and exported as part of DesignMatrixColumnInfo. Then it can be used for Type II tests. * Some way to specify the default contrast * Support for R's magic "." term - The "y ~ everything else" form - The "what I had in this other ModelDesc" form (e.g., "y ~ . - a" to drop the 'a' predictor from an old model) - This will require that the formula->ModelDesc have access to the data or previous formula... * More stateful transforms: - Splines - 'cut': numeric->factor by quantile dichotimization - Orthogonal polynomials - 'code': takes a Categorical (or coerces to one), and optionally a contrast, and and does the standard contrast-coding. And possibly this should replace _CatFactorEvaluator... * Support for building sparse model matrices directly. (This should be pretty straightforward when it comes to exploiting the intrinsic sparsity of categorical factors; numeric factors that evaluate to a sparse matrix directly might be slightly more complicated.) * Real testing/support for formula syntax extensions The tricky part here is making sure we produce something useful. Use cases: - multinomial log-linear modelling - see below Prior art: - R package "lmer" interpets formulas like y ~ x1 + x2 + (1 | foo) + (1 + x | bar) - The R [[http://cran.r-project.org/web/packages/Formula/vignettes/Formula.pdf][Formula]] package, which has two features: - you can write multivariate responses, like y1 + y2 ~ ... (in stock R, this is interpreted as addition (!)). - you can write multiple "parts" on each side, separated by |. Basically these are treated as a list of design matrix specifications, and there are ways to pull out the first, second etc. on each side. - R package "plm" uses Formula to allow formulas like: y ~ x1 + x2 y ~ x1 + x2 | x3 y ~ x1 + x2 | . + x3 where the second part specifies "instrumental variables". I can't tell if the second part has an implicit intercept. - R package "frontier" uses Formula in a similar way, allowing formulas like y ~ x1 + x2 y ~ x1 + x2 | x3 where the first form computes a "error components frontier" and the latter computes an "efficiency effects frontier" (where the part after the | are regresses "used to explain the efficiency levels (Z variables)"). The part after the bar does have an implicit intercept. - package AER uses this in its "ivreg" command, which seems similar to plm. An example makes clear that "y ~ . | x1 + x2" works, and presumably the "." means the same thing as it would in "y ~ ." for lm. - package betareg does "beta regression", and a formula like "y ~ x1 | x2" states that "x1" should be used for the "mean submodel" and "x2" should be used for the "precision submodel". Its betatree function extends this further to "y ~ x1 | x2 | c1 + c2" where "c1", "c2" are "partitioning variables". AFAICT this means that it does basically does a CART-style tree division of the data based on c1, c2, and then fits beta regression models x1 | x2 on each subset. - package "fdaMixed" uses formulas like Y | id ~ fixed | random where Y is a response variable, id is "a factor separating the samples", and fixed and random are linear models for the fixed and random effects. The 'id' part seems to be used to match multiple samples from the same random effects group? - package "growcurves" allows "y ~ fixed | random". If there is no |, then there is a second argument (random.only) which is consulted to determine whether the sole RHS argument is fixed or random. (Maybe 'y ~ x1 + x2 + random(x3 + x3)' would be a better syntax?) - package "games" uses a syntax like "y ~ x1 + x2 | 0 | x3 | z". There is another version with 8 entries instead of 4. - package "metafor" does effect-size calculations using the syntax "outcome ~ group | study" where each entry has to be a 2-level factor. (And the 'weights' argument gives the actual numbers.) - package "mhurdle" seems to describe a kind of multi-step process via three-part formulas y ~ x1 | x2 | x3 where "the first part describes the selection process if any, the second part the regression equation, and the third part the purchase infrequency process". You can fill in 0 if you want to assume that some process doesn't actually apply (or leave out the last one altogether). - package "mlogit" uses three-part RHS formulas to specify different parts of a multinomial logit model. "the first one contains the alternative specific variables with generic coefficient, i.e. a unique coefficient for all the alternatives; the second one contains the individual specific variables for which one coefficient is estimated for all the alternatives except one of them ; the third one contains the alternative specific variables with alternative specific coefficients...If a standard formula is writen, it is assumed that there are only alternative specific variables with generic coefficients." The second RHS termlist has an intercept by default; for the other two termlists any intercept is ignored in any case. - package "polywog" does some clever polynomial basis function fitting thing, and uses formulas like y ~ x1 + x2 | z1 + z2 to mean basically the equivalent of y ~ x1*x2 + z1 + z2 i.e., the first termlist gets a super-rich non-linear interaction between all its entries, and the second is just entered linearly. * Currently we don't distinguish between ordered and unordered categorical data. Should that change? * how should redundancy elimination and explicit factor matrices interact? Example: If you do 1 + C(a, mat):C(b, mat), then currently it will expand that to 1 + C(a, mat) + C(a, mat):C(b, mat), which is going to be weird. Probably we should notice that the .contrast attribute in these cases does not give us the option of full- versus reduced-rank coding, and in redundancy.py we should note that such factors cannot be "expanded". * Profiling/optimization. There are lots of places where I use lazy quadratic algorithms (or even exponential, in the case of the non-redundant coding stuff). Perhaps worse is the heavy multiplication used unconditionally to load data into the model matrix. I'm pretty sure that at least most of the quadratic stuff doesn't matter because it's n^2 where n is something like the number of factors in an interaction term (and who has hundreds of factors interacting in one term?), but it wouldn't hurt to run some profiles to check. I think really what I mean is just, run timeit on a 10-variable interaction to make sure it isn't completely annoying. * Possible optimization: let a stateful transform's memorize_chunk function raise Stateless to indicate that actually, ha-ha, it turns out that it doesn't need to memorize anything after all (b/c the relevant data turns out to be specified explicitly in *args, **kwargs). Actually, this would be really useful for things like splines, which need to do expensive quantile estimation, but not if knots are specified. Another use case: C(something_that's_already_categorical, contrast=...). Note that this can't be detected until we do the first round of evaluation. A better interface would be memorize_needed(self, *args, **kwargs). I guess we could even have memorize_passes_needed, but eh... * Wacky idea: make factors into an actual stateful transform (one that takes a dict-like object and returns a matrix or Categorical) This would require: - adding memorize_passes support to stateful transforms - moving the factor memorization state inside an object (so it wouldn't be factors that would be stateful transforms, factors would be factories for stateful transforms)