Naked Science Forum

Non Life Sciences => Geek Speak => Topic started by: scientizscht on 04/06/2021 19:36:08

Title: What is the power of regression?
Post by: scientizscht on 04/06/2021 19:36:08
Hello

Regression is used to elucidate the relationship between two factors.

If we simply graph those two factors we can see the line and identify their relationship.

What is so special about regression that it is so widely used then?
Title: Re: What is the power of regression?
Post by: Bored chemist on 04/06/2021 20:08:08
If we simply graph those two factors we can see the line and identify their relationship.
Really?
What line do you draw through these?
 [ Invalid Attachment ]
Title: Re: What is the power of regression?
Post by: evan_au on 05/06/2021 10:17:46
All measurements have sources of error.
- Least-squares regression was originally used to estimate the path of comets from astronomical measurements, even though those measurements had errors.
- The recorded measurements were inconsistent with each other, and yielded no valid answer if you took them "as-is".

Quote
If we simply graph those two factors we can see the line and identify their relationship.
But two people are likely to draw different lines.
- "Least-squares" regression software is able to draw a line that minimises the errors according to the well-known "Least-squares" criterion (which has limitations that are not-so-well known by most people who use it)
- Most regression packages are able to calculate an R² measure that gives an idea of how well the data fits the line.
- Sometimes the data is not a straight line, and most regression packages allow fitting parabolas, exponentials or other functions - but you had better have a good rationale for the fitted function.
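
As a rough illustration of the points above, the least-squares line and its R² can be worked out directly from the data. This is only a sketch using plain Python; the measurements and the helper names `fit_line` and `r_squared` are invented for the example and are not taken from any particular regression package.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys that the fitted line accounts for."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented measurements: roughly y = 2x + 1 with a little scatter.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(f"y = {slope:.3f} x + {intercept:.3f}, R^2 = {r2:.4f}")
```

Two people running this on the same data get the same line, which is the point being made about drawing by eye.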

Warnings
- Beware of over-fitting the data; by using a high-degree polynomial, you can exactly fit the curve to the data. This just means you are paying more attention to the errors than the bulk of the data.
- Correlation does not prove causation. Just because two variables seem to have a relationship does not mean that one causes the other. You need more evidence than just a regression line (like experimentally varying one parameter, and see how the other one changes).
- Don't keep picking sets of variables until you find a set that meets some arbitrary criterion like R² > 0.95. If you try 100 sets of variables (like cheese consumption and risk of premature death), you will eventually find sets that seem to have a high correlation.
- The most important test is how well it predicts results that have not been observed.
      - This is often used in AI systems; present half the data as training data, then see how well it works on the half that wasn't included in the training data.
      - Extrapolate beyond the training data. How well does the regression trendline predict the results outside the original range of data?

See: https://en.wikipedia.org/wiki/Regression_analysis

Quote from: BC
What line do you draw through these?
Funny enough, there are some measurements at my work that look quite like this.
- For me, it happens when you have a uniformly distributed deviation on top of a linear measurement.
- To my eye, this one looks something close to y=x/2 + 100*RAND(0,1)
- For which a regression package would give (approximately) y = x/2 + 50
- Ironically, this data sample would have a poor R², but would still be a decent line of best fit, simply because (over the measured range), the error is of similar magnitude to the trend. The R² value would improve if you extended the data out to x=1000 (or 1 million), assuming it was defined over this range...
Title: Re: What is the power of regression?
Post by: Bored chemist on 05/06/2021 11:14:04
To my eye, this one looks something close to y=x/2 + 100*RAND(0,1)
Close, but it's got a product of 3 random numbers in it, to make it a bit more "random"- or, at least, a bit more like a normal distribution.
The Rand function gives a square  distribution.
Adding two of them gives a triangular distribution (like the sum of two dice).
If I remember a stats course I did 30 years ago,  a distribution like that is better represented by a "least linear distance" fit, rather than a "least squares" fit.

Regression analysis should be more than just "stuff it into Excel" - though that's what most people seem to do.
It helps to have an idea of what the data "should" look like, but also you shouldn't constrain the analysis too much.


Title: Re: What is the power of regression?
Post by: alancalverd on 05/06/2021 13:20:30
"If in doubt, plot log/log and draw a straight line"  Not my words, but I've heard it too often to ignore!
Title: Re: What is the power of regression?
Post by: Eternal Student on 06/06/2021 00:14:49
Hi.

   No one has mentioned that regression can do a bit more than just fit a line (or curve) to some data.
After fitting the line (or curve), the residuals have a distribution of their own.  Linear regression is most powerful when the residuals have a Normal distribution and if you're a statistician this is usually one of the most important things you want to obtain (not a line of best fit).   Once you have residuals with a Normal distribution the door is open to some powerful prediction and interpolation.
    Meanwhile, if you just draw a line of best fit all you have is an indication of some relation between two variables but very little quantitative prediction capability.
Title: Re: What is the power of regression?
Post by: Bored chemist on 06/06/2021 10:34:03
It's also important to recognise that you can do a multi variate regression line to assess how some variable  changes with a number of other factors.
You might be able to "just draw a line" through the data, but it's very hard to "just" draw a hypersurface through it.
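
As a sketch of what a multivariate fit involves, the code below solves the least-squares normal equations for two predictors in plain Python. The data are invented and noise-free so the answer is easy to check; `solve` and `fit_plane` are illustrative helpers, and a real analysis would normally lean on an established numerical library instead.

```python
def solve(a, b):
    """Solve the linear system a*w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (m[i][n] - sum(m[i][j] * w[j] for j in range(i + 1, n))) / m[i][i]
    return w


def fit_plane(rows, ys):
    """Least-squares fit of y = w0 + w1*x1 + w2*x2 + ... via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Invented data lying exactly on y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (3, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

weights = fit_plane(rows, ys)
print([round(w, 6) for w in weights])
```

The same two functions work unchanged for more predictors, which is exactly the case where drawing by eye stops being an option.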
Title: Re: What is the power of regression?
Post by: evan_au on 06/06/2021 10:48:14
Quote from: Eternal Student
Linear regression is most powerful when the residuals have a Normal distribution
The irony is that when you have a normal distribution, you have just proved to yourself that you know very little about what you are measuring.

As BC indicated, if you add a large number of distinctly non-normal distributions, you will get something that looks like a Normal distribution.
- This includes Uniform, negative exponential, and even discrete distributions
- This is a result of the "Central Limit Theorem"*
- If you want to show that you really understand the process, you need to isolate those underlying distributions, and explain how they combined to form the Normal distribution.
See: https://en.wikipedia.org/wiki/Central_limit_theorem
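
The effect is easy to see in a small simulation. This sketch sums uniform random numbers and reports what fraction of the sums lies within one standard deviation of the mean; for a Normal distribution that fraction is about 68%. The choice of 1, 2 and 12 terms and of 20000 trials is arbitrary.

```python
import random
from statistics import mean, pstdev

random.seed(2)


def fraction_within_one_sd(n_terms, trials=20000):
    """Sum n_terms uniform(0,1) draws; return the fraction of sums within 1 sd of the mean."""
    sums = [sum(random.random() for _ in range(n_terms)) for _ in range(trials)]
    mu, sigma = mean(sums), pstdev(sums)
    return sum(abs(s - mu) <= sigma for s in sums) / trials


fractions = {n_terms: fraction_within_one_sd(n_terms) for n_terms in (1, 2, 12)}
for n_terms, fraction in fractions.items():
    print(f"{n_terms:2d} uniform term(s): {fraction:.3f} of sums within 1 sd")
```

A single uniform variable gives roughly 58%, two give the triangular case at about 65%, and by twelve terms the figure is already close to the Normal value.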

*When I was doing introductory statistics at university, we were introduced to the Central Limit Theorem as being fundamental to all of statistics
- The class clown asked "When doesn't it apply?"
- The lecturer eventually conceded that there were some theoretical distributions that didn't necessarily obey the Central Limit theorem - they had strange properties like a non-finite standard deviation; but don't worry, they can never occur in real life!
- Around 1995, I discovered that internet traffic has a non-finite standard deviation, and I realised that it was time for me to learn another branch of statistics (I work in telecommunications...).
- But in most cases, using a large (but finite) standard deviation gives results that are "close enough" after you apply a bit of safety margin
Title: Re: What is the power of regression?
Post by: Bored chemist on 06/06/2021 11:32:48
One relatively simple distribution which does not have a mean and standard deviation is the ratio of two independent normally distributed variables.
You can use this knowledge to upset statisticians.

People, notably statisticians, use the normal distribution because it has well-defined properties.
They then use the CLT to "justify" using it, on the basis that, if you look at lots of sub-samples, everything looks like a normal distribution.

However, if you know that the distribution isn't normal- for example, rolling dice- you should actually use the analyses for the correct distribution.


A long time ago (before catalytic converters were common) a colleague of mine was doing some sampling for common air pollutants - benzene, toluene and xylene. They are strongly associated with vehicle emissions.

He got a bunch of us to hang samplers in our gardens- one near the road and the other at the bottom of the garden (and thus not near the road).
He ended up with ten pairs of data points.
And he analysed them on the basis that they were normally distributed.
So he got a mean and SD for "front" and "back". They overlapped considerably so he came to the conclusion that there was no statistically significant difference between the front and back garden samples.

And then I pointed out that, in 9 cases out of 10, the back garden concentration was lower than the front garden.
If he had been right about there being no difference then that would have been equivalent to tossing a coin 10 times and getting 9 heads. The odds for that are something like 100 to 1 against.
He rewrote that bit, but didn't give me the credit...
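
The coin-toss figure can be checked with the binomial distribution. Treating each garden as a fair coin toss under the "no difference" hypothesis is what is usually called a sign test; that label, and the helper name `prob_at_least`, are mine rather than part of the story.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_exactly_9 = comb(10, 9) * 0.5**10   # 10 ways out of 1024
p_at_least_9 = prob_at_least(9, 10)   # 11 ways out of 1024 (one-sided)

print(f"P(exactly 9 of 10) = {p_exactly_9:.4f}")
print(f"P(9 or more of 10) = {p_at_least_9:.4f}")
```

Either way the chance is about 1%, in line with "something like 100 to 1 against". The paired comparison picks up what the overlapping means and SDs hide, because it removes the garden-to-garden variation.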

Title: Re: What is the power of regression?
Post by: Eternal Student on 06/06/2021 21:24:27
Hi all.

The irony is that when you have a normal distribution, you have just proved to yourself that you know very little about what you are measuring.
   I know what you are trying to say.  This sentence on its own isn't particularly true or helpful to the OP but it is interesting and ironic as you say.  The OP needs to know that we have a useable statistical model.  The uncertainty or randomness has been contained and modelled.
    Explaining the uncertainty or randomness in the residuals is a separate project or problem, which you (the scientist) may or may not want to do more work on.  As you (Evan-au) have stated, there is reason to think this could actually be very difficult if the residuals are Normally distributed.

    B_C  then made some comments about using other distributions.  Yes, that's fine and (I expect you know) it is done but this produces a non-standard linear regression model.  It's again interesting and I'm happy to discuss it but it may not be useful to the OP.
    While we're on the topic, B_C mentioned using non-parametric tests.   This might be worth adding to....   Linear regression is closely related to obtaining correlation coefficients; this is easily applied to non-parametric data and gives us techniques like Spearman's rank correlation test.
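
As a sketch of that last idea, Spearman's rank correlation is simply the ordinary (Pearson) correlation computed on the ranks of the data. The helper names and the example data below are invented for illustration.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Ordinary product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [1, 2, 3, 4, 5, 6]
ys = [x**3 for x in xs]  # monotonic, but far from a straight line

print(f"Pearson  r   = {pearson(xs, ys):.3f}")
print(f"Spearman rho = {spearman(xs, ys):.3f}")
```

Because only the ordering matters, the rank version reports a perfect relationship here even though a straight line is a poor description of the data.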

   Anyway, the main point is that by formalising the procedure of finding a line of best fit, we generate statistics (numerical quantities) which can be put to many uses and for which there are well established and powerful techniques, but drawing a line of best fit (by eye) doesn't give us anything other than a rough visual guide.
Title: Re: What is the power of regression?
Post by: Colin2B on 06/06/2021 23:44:19
This sentence on its own isn't particularly true or helpful to the OP but it is interesting and ironic as you say.  The OP needs to know that we have a useable statistical model
It is very difficult to know what the OP needs. He appears here at intervals with seemingly disconnected questions, with no context, and often no response to requests for further information on his application. Who knows whether it’s helpful as we rarely get feedback.
This sort of question is typical: https://www.thenakedscientists.com/forum/index.php?topic=81458.msg625626#msg625626.

Title: Re: What is the power of regression?
Post by: Eternal Student on 07/06/2021 00:50:00
Hi Colin2B,

    Well, we've got to take the optimistic view.  The OP has become very busy and will check the responses when time allows;  or they have a limited internet connection each month;   or they have other health issues.
     So you've just helped by answering a question and engaging in conversation with a person who has poor health, limited funds and a stressful life.  You've done a good thing Colin2B, other moderators and regulars.
Title: Re: What is the power of regression?
Post by: vhfpmr on 08/06/2021 17:21:39
Anyone care to speculate what the law is here? Looks more pareidolia than parabola.  ;D ;D
Title: Re: What is the power of regression?
Post by: evan_au on 09/06/2021 11:12:52
Quote from: vhfpmr
Anyone care to speculate what the law is here?
It would help if you labelled the axes....

I can see that there is a minimum value for the Y-Axis, and a narrow band of very common values on the Y-Axis that are fairly independent of the X-Axis...
Title: Re: What is the power of regression?
Post by: vhfpmr on 09/06/2021 11:48:43
It would help if you labelled the axes....
Yes, I know, I posted it more out of humour than as a serious question, just because I happened to have a bemusing scattergram at the time when a regression thread was running.
Both axes are standard deviation, plotted because I was curious whether more variation in one parameter might be indicative of more variation in the other.
Title: Re: What is the power of regression?
Post by: charles1948 on 09/06/2021 19:48:34
On the subject of "graphs", aren't they a kind of analogue device - something like the old "slide-rules" that we had before modern digital calculators and computers?

I suppose no-one in modern Science would try to use images of a "slide-rule" to present evidence in support of a scientific theory, so why are "graphs" still employed?

Couldn't the evidence in a graph, be presented in "digital" format, ie as just tables of numbers?  From which conclusions could be arrived at.

Is it because graphs make it easier to see the underlying mathematical processes? If so, why did we abandon slide-rules so quickly?
Title: Re: What is the power of regression?
Post by: Bored chemist on 09/06/2021 19:55:42
why did we abandon slide-rules so quickly?
Because they are a bit rubbish.
Very tricky to get an accurate answer from one.
Title: Re: What is the power of regression?
Post by: charles1948 on 09/06/2021 20:26:16
Yes BC, your remark reminds me of something Arthur C Clarke wrote in one of his old books.

It went something like:

If you ask the average person: What's 8 divided by 2 - he'll instantly say: " It's 4."

If you ask a scientist, or engineer, the same question, he'll get out a slide-rule, fiddle about with it for a while, then say: "It's between 3.9 and 4.1"

BTW, apologies for not using gender-neutral pronouns, but it was an old book.
Title: Re: What is the power of regression?
Post by: Bored chemist on 09/06/2021 20:30:23
If you got enough engineers and asked them enough questions of the form "What's x divided by 2" and plotted the answers they gave vs x then you could do a regression analysis on that data and model the case of x= 8.
You should get an answer that is closer to 4 than "3.9 to 4.1".
Title: Re: What is the power of regression?
Post by: Colin2B on 09/06/2021 23:05:50
On the subject of "graphs", aren't they a kind of analogue device - something like the old "slide-rules" that we had before modern digital calculators and computers
Your world is full of functional and very effective analogue devices.
Stop trolling.
Title: Re: What is the power of regression?
Post by: evan_au on 09/06/2021 23:27:56
Quote from: charles1948
Couldn't the evidence in a graph, be presented in "digital" format, ie as just tables of numbers? Is it because graphs make it easier to see the underlying mathematical processes.
Yes, humans are a very visual species.
- We have the unusual visual quirk that we can see things that don't change, which makes the whole "reading a book" thing possible.
- As I understand it, for most species, things that don't change do not generate an image.

Graphs work well in 2D.
- They can be extended to 3D on a page for some sets of data
- but I think 3D data would be explored more easily with virtual reality headsets
- I have seen one presentation (as a movie) that worked through data in more than 3 dimensions (I think it may have been 8 dimensions?). But that was geometrical data, so it was (sort of) possible to see geometrical patterns emerge as the view switched between hyperplanes. I don't think this would work nearly so well for data that was noisy in multiple dimensions, as you would only see part of the pattern and part of the noise.

The one I saw was a bit more colorful than this one:


As BC said, regression can find patterns in multi-dimensional data that we wouldn't be able to visualise.
Title: Re: What is the power of regression?
Post by: vhfpmr on 10/06/2021 07:45:17
Couldn't the evidence in a graph, be presented in "digital" format, ie as just tables of numbers?  From which conclusions could be arrived at.
The picture on your TV is broadcast as a list of numbers, would you rather it was displayed like that?
Title: Re: What is the power of regression?
Post by: Eternal Student on 10/06/2021 11:02:24
Hi.

Is it because graphs make it easier to see the underlying mathematical processes?

  Good answers have already been made based on using graphs to visualise data and why human beings tend to want to do this.

   Here's an alternative answer:
    Through time, the use of conventional graphs, bar charts, scatter diagrams etc. has become a highly evolved and internationally utilised artifact of mathematics.  So much so that it is a useful, if conceptually abstract, form of communication in its own right.   Teaching children in school to use graphs is no different to teaching people a common language.
    We could teach children to use tabulated data and numerical statistics but it would be harder to understand and frequently misunderstood.  It would be difficult enough to use and understand that it would often fail as a mode of communication between two people.
Title: Re: What is the power of regression?
Post by: vhfpmr on 10/06/2021 11:59:29
Numeric data and graphs are not really alternatives, they're complementary, one to analyse, the other to visualise. Two different tools for two different jobs, like a hammer and a screwdriver. You wouldn't knock screws in with a hammer on the grounds that it must be the right tool because it works so well on nails, or just to prove it can be done.
Title: Re: What is the power of regression?
Post by: evan_au on 11/06/2021 10:40:37
Graphical data draws on the considerable visual processing capability in the rear of our brains, with enormous parallel-processing ability.
- You can (sometimes) identify trends in hundreds of data points at a glance
- A fraction of your brain's 25W of power consumption can be used to navigate a car (while holding a conversation with the person next to you)
- The computer in a self-driving car typically consumes about 1,000 W
- This visual processing is not very precise - maybe to within 1%, but it becomes pretty automatic and unconscious (ie "effortless") after some training

Tables of numbers utilise the number processing ability which is laboriously learned in school, using the conscious processing capabilities at the front of our brains.
- This is mostly serial processing (slower)
- Most people can only do one conscious activity at a time - try holding a conversation while mentally adding up a table of numbers!
- Most people can only remember a couple of numbers at once. Trends may not be apparent when looking at only 2 or 3 numbers at once
- A table of numbers can give enormous precision (some quantities in physics are known beyond 8 digits; many quantities in a national budget are specified to 8 digits)
- But it's hard work!
- Few people could multiply two 8 digit numbers together and get the right answer - but your laptop is probably capable of doing a billion or more such calculations per second.

I have no doubt that some skilled mathematicians and accountants can scan a column of numbers and pick the trends and the outliers, but these are the exceptions


Title: Re: What is the power of regression?
Post by: vhfpmr on 12/06/2021 17:31:32
It occurred to me that steganography is a case where numerical analysis is more likely to get you further than looking at the picture.
Title: Re: What is the power of regression?
Post by: nicephotog on 23/10/2021 06:48:42
Regression, as explained by Harvard's business article, is the use of a number of factors (named variables, or the main related variable that is consistent) to find the best relationship to an outcome, a particular "result", so that it gives a more accurate prediction of that result (think sales graphs).
That is why I've always been a user of the "mode average", adjusting the band in which the mode set falls by recalibrating the upper and lower boundary parameters that are possible (not merely the band). When the outer parameters (boundaries) of what you measure cannot be moved (that would be unreal, truth bending) to achieve a small, tight ratio with multiple results, then there is no mode in the result boundaries that would allow either prediction or accuracy. (Where are these weird typos coming from?)

shouldn't add this but it's not bad either...