Data Skills

Data Skills
posted on January 9, 2016

Data skills – results of discussion between Josh Vogelstein and Konrad Kording

Foundational (courtesy Josh):

use median instead of mean if you want an estimate of position (for univariate data; multivariate is more complicated)
use MAD or some other robust statistic if you want an estimate of scale (for univariate data; multivariate is more complicated)
when doing regression, consider using something robust to outliers, svm is a popular example, but other options are simpler/more intuitive, eg, change from L2 loss to huber loss, L1 loss, etc.
never conduct a statistical test without first looking at the data that goes into the function and deciding whether the assumptions of the test are approximately satisfied by the data. if the assumptions are woefully not satisfied, the p-values are worse than meaningless, they are deceiving.
always plot your data before doing any analyses, preferably in a whole bunch of ways, to get ideas of what might be going on.
if you n vectors, each d-dimensional, and n is not bigger than say 5d or 10d, you must regularize. lasso is the most popular choice these days, though it is highly problematic if data are correlated at all. elastic net is a hacky and effective solution that i would try first.
if you would like to reduce the dimensionality of your data, always try pca first. always remember that pca is terribly sensitive to outliers (simulate data with 1 strong outlier and you will see). always remember that eigenvectors are not unique, ie, sign flips and rotations are equally good. therefore, the eigenfunctions are not intreptable, without considerable additional work.
never steal code from matlab exchange to implement an algorithm if the code is not written by the dude who published the paper, or verified by that dude. lots of code is wrong or otherwise bad.
always try your code on simulated data prior to running it on real data. confirm that the code works when it should work and fails when it should fail. if your code does fail when it should, then it failed. don’t forget to try things like data corrupted with a NaN, or a few actual zeros, both of which can be highly problematic.
never write if x == y. numerical precision dictates that this will not be true many times. write if (x-y)^2 < eps, for some small choice of epsilon, eg, 10 standard deviations.
always consider the bias/variance tradeoff. where did you make unstated assumptions/simplifications? did you spike sort? did you filter? did you digitally sample? why did you make these assumptions? computational reasons? is it the standard in the field?
always remember the information processing lemma: you cannot add information via processing your data, only keep it or lose it!

1-5 derive from the fact that big data is almost always dirty. 6-7 derives from big data leads to large problems. 8-10 are important to avoid bugs.

Specialized techniques (mostly Konrad) Konrad just put together a list of techniques he would like his students to know

Elastic net glm
MCMC
Gaussian process regression, Hierarchical Regression
Variational bayes
Graphical models, e.g. hmm, kalman, learning
Extended Kalman Filters, dynamical systems, stability
Particle filters
Linear/convex programming
Sequence alignment algorithms / time warping
Regression trees
SVMs , kernel approaches
Search trees (e.g efficient nn)
Deep learning (contentious if this should be on the list)
Neural networks with regularizing (history matters)
Model comparison AIC, BIC, cross-validation, corrections
Bootstrap
Network metrics/ inference/ generative grammars
Nonparametric /infinite bayes
Standard feature sets (sift, time series, wavelets, etc)
Standard frequentist tests

@Jovo, @Kordinglab

Data Skills posted on January 9, 2016

Data Skills
posted on January 9, 2016