I apologize if this question is already answered and appreciate any pointers to existing answers. I'm not familiar with statistical or data mining terms so my search was limited to basic words used in the title of this question. I did look at some algorithms but could not decide if they serve the purpose.
I'm looking for an algorithm that can tell me, with a % probability or % confidence, which of a given set of input factors affect the output most. Take the example below, and assume that the inputs and outputs are recorded 4 times a day for a month.
Output:
- Phone's battery drainage
- Phone heating up (temperature).
Input:
- Ambient temperature
- Which apps were used most when the output was measured (multiple input points)
- Number of calls received and call length
- Screen brightness
- GPS connectivity enabled?
- Bluetooth enabled?
- Cellular signal strength
If we measure this data for a month, we will probably find that cellular signal strength and screen brightness are linked to battery drainage the most. On the other hand, resource-intensive 3D games will be linked to the phone heating up the most. Can I deduce such relationships using any existing algorithm(s)?
Asked By : MVCNinja
Answered By : D.W.
Fundamental limitations on what you can achieve
The principled answer is: No, you can't reliably deduce such causal connections. When you have only observations of the data, you can use statistical analysis to look for correlations, but you can't deduce causation. Correlation does not imply causation. You are looking for causality, but you cannot infer causality from field data (from external, uncontrolled observations of a system). To test causality you need a controlled experiment.
This is a problem not just in principle, but often in practice as well. In practice, there might be correlations among your "independent variables", or confounding factors you haven't accounted for, either of which can cause correlations that aren't an indication of causality.
For instance, you might find that ambient temperature is correlated with display brightness (at night it is cooler and darker, so phones turn down the display brightness, whereas during the day it is warmer and brighter, so phones turn up the display brightness to compensate for the daylight). Now, we'd expect that when the display brightness is higher, the battery will drain faster. Can you deduce this causal connection from statistical inference applied to some observations? Well, you might observe a correlation between display brightness and battery drain rate, which sounds good. But if display brightness is correlated with ambient temperature, then your statistical analysis might also discover a correlation between ambient temperature and battery drain rate, and that's a spurious correlation that doesn't indicate any causal connection.
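To make the confounding scenario concrete, here is a small simulation sketch. Everything in it is an assumption made up for illustration: the variable names, the coefficients, and the idea that a single hidden "daylight" factor drives both temperature and brightness. By construction, ambient temperature has no effect on battery drain, yet it still ends up strongly correlated with it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120  # e.g. 4 readings a day for 30 days

# Hypothetical hidden factor: how bright/warm it is outside (0 = night, 1 = midday).
daylight = rng.uniform(0.0, 1.0, n)

# Both inputs depend on daylight, so they are correlated with each other.
ambient_temp = 15.0 + 10.0 * daylight + rng.normal(0.0, 1.0, n)
brightness = 0.2 + 0.7 * daylight + rng.normal(0.0, 0.05, n)

# By construction, only brightness affects drain; temperature plays no role.
drain = 5.0 + 8.0 * brightness + rng.normal(0.0, 0.5, n)

# Pearson correlation of each input with the output.
corr_brightness = np.corrcoef(brightness, drain)[0, 1]
corr_temp = np.corrcoef(ambient_temp, drain)[0, 1]

print(f"corr(brightness, drain)   = {corr_brightness:.2f}")
print(f"corr(ambient_temp, drain) = {corr_temp:.2f}")
```

Both correlations come out clearly positive, and nothing in the observed numbers alone distinguishes the real effect from the spurious one.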
Pragmatic, engineering answer
That said, the pragmatic answer is to use statistical analysis of your data to look for correlations, and then hope that at least some of the correlations will represent something real. You have to go into it with eyes open knowing that some of the correlations will be spurious, for the reasons explained above, but maybe that's OK. Maybe statistical analysis can still be useful to you.
So, how should you do the statistical analysis? There are many possible ways. It's hard to say which is best; that probably depends upon what relationships between the input and output you think are plausible.
But a very simple approach that I suspect will work well for you is to use linear regression. Try to find a linear model that expresses an output variable (e.g., the battery drain rate) as a weighted linear combination of the input variables. Linear regression can find the weights that minimize the prediction error.
Then, the size of those weights tells you exactly what you want to know. Suppose you are predicting the battery drain rate, and in your linear model the weight for the screen brightness input variable is 0.7. This means that if you increase the screen brightness by adding $\Delta$, then the battery drain rate is predicted to increase by adding $0.7 \Delta$.
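As a rough sketch of what this looks like in practice, the snippet below fits a linear model by ordinary least squares using NumPy's `np.linalg.lstsq`. The data is synthetic and the input names (`brightness`, `signal_strength`, `gps_on`) and their "true" weights are assumptions chosen for illustration, not measurements from a real phone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120  # number of recorded observations

# Hypothetical inputs, each scaled to the range [0, 1].
brightness = rng.uniform(0.0, 1.0, n)
signal_strength = rng.uniform(0.0, 1.0, n)
gps_on = rng.integers(0, 2, n).astype(float)  # 0 = off, 1 = on

# Synthetic output: the weights 0.7, -0.4 and 0.1 are made up for this example.
drain = (2.0 + 0.7 * brightness - 0.4 * signal_strength + 0.1 * gps_on
         + rng.normal(0.0, 0.05, n))

# Design matrix: a column of ones for the intercept, then one column per input.
X = np.column_stack([np.ones(n), brightness, signal_strength, gps_on])

# Least-squares fit: finds the weights minimizing the squared prediction error.
weights, *_ = np.linalg.lstsq(X, drain, rcond=None)

names = ["intercept", "brightness", "signal_strength", "gps_on"]
for name, w in zip(names, weights):
    print(f"{name:>16}: {w:+.3f}")
```

The fitted weights should land close to the ones used to generate the data. One caveat when comparing weights against each other: their sizes are only directly comparable if the inputs are on comparable scales, which is why the inputs here are all scaled to [0, 1].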
Tricky details
There are a few things about your specific problem that might be a bit tricky.
One tricky aspect is the business about apps, and how to code that into a linear regression framework. One way is to pick a few categories of apps, and then have a separate indicator input variable for each category (0 if no app in that category was in use, 1 if one was). However, you don't want to have too many input variables, because that will require a lot more data to reliably form a linear model. A few dozen app categories might be fine, but thousands of input variables for the thousands of different possible apps probably would not be.
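A minimal sketch of the indicator-variable coding might look like the following. The category names and the helper `encode_apps` are hypothetical, invented for this example; the resulting 0/1 columns would be appended to the other input columns before running the regression.

```python
import numpy as np

# Hypothetical app categories; pick whatever small set suits your data.
CATEGORIES = ["game_3d", "video", "navigation", "messaging"]

def encode_apps(active_categories):
    """Return one 0/1 indicator per entry in CATEGORIES."""
    active = set(active_categories)
    return [1.0 if category in active else 0.0 for category in CATEGORIES]

# One entry per observation: the categories of apps in use at that time.
observations = [
    ["game_3d"],
    ["video", "messaging"],
    [],
    ["navigation", "messaging"],
]

# Each row is an observation, each column an indicator input variable.
app_features = np.array([encode_apps(obs) for obs in observations])
print(app_features)
```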
Another tricky aspect arises if you expect there to be non-linear dependencies, and you want to find them. For instance, you mention in a comment the possibility that low cellphone signal strength combined with running a 3D game might increase battery drain rate in a way that neither alone would. If you expect those kinds of non-linearities, one standard way to deal with them is to introduce new input variables for "interaction effects": e.g., you introduce a new input variable which is the product of the signal strength times the indicator variable that indicates whether a 3D game is running. (These new input variables are derived from the other input variables, but in a non-linear way.) You can introduce one new "interaction" input variable for each pair of original input variables that you think might exhibit a combined effect of this sort (where the effect of their combination is greater than the sum of their individual effects). This is a standard technique that you can find described in good tutorials on linear regression.
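The interaction-effect idea can be sketched as follows, again on synthetic data. The names `signal_strength` and `game_3d` and all coefficients are assumptions for illustration. The only new step compared to a plain linear regression is the extra column holding the product of the two inputs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical inputs: signal strength in [0, 1] (1 = strongest) and a
# 0/1 indicator for whether a 3D game is running.
signal_strength = rng.uniform(0.0, 1.0, n)
game_3d = rng.integers(0, 2, n).astype(float)

# Synthetic output with a combined effect: weak signal hurts more while gaming.
# The coefficients, including -0.6 for the interaction, are made up.
drain = (1.0 - 0.3 * signal_strength + 0.5 * game_3d
         - 0.6 * signal_strength * game_3d
         + rng.normal(0.0, 0.05, n))

# New derived input variable: the product of the two original inputs.
interaction = signal_strength * game_3d

# The model stays linear in its weights, so ordinary least squares still applies.
X = np.column_stack([np.ones(n), signal_strength, game_3d, interaction])
weights, *_ = np.linalg.lstsq(X, drain, rcond=None)

names = ["intercept", "signal_strength", "game_3d", "signal*game_3d"]
for name, w in zip(names, weights):
    print(f"{name:>16}: {w:+.3f}")
```

A weight on the product column that is clearly different from zero suggests the two inputs have a combined effect beyond the sum of their individual effects.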
Ultimately, there's lots more to say on statistical data analysis, linear regression, and applying it to practice -- more than I can write in a single answer. I recommend you read a good tutorial or two on linear regression and maybe the relevant chapters of some statistical textbooks. If you still have more questions after all of that, it's also worth knowing about Cross Validated, a Stack Exchange site on statistics.
Question Source : http://cs.stackexchange.com/questions/26214