Naked Science Forum

Non Life Sciences => Geek Speak => Topic started by: scientizscht on 12/06/2021 18:41:59

Title: Why the median does not change?
Post by: scientizscht on 12/06/2021 18:41:59
Hello

Let's say we have 10 airplanes going from city A to city B on the 1/1/21.
Let's say all the 10 planes take 5h each for the journey.
The median journey time will be 5h.

Let's say the same journeys happen on the 2/1/21.
Now 8 planes take 5h each and the remaining 2 planes take 8h each.
What is the median journey time?

What metric would calculate the total transit time impact of the 2 planes which were delayed for 3h each?
Especially when we have a distribution of thousands of airplanes with many outliers where the average can be skewed?

Thanks!
Title: Re: Why the median does not change?
Post by: wolfekeeper on 12/06/2021 19:02:22
The median is just one type of average, along with the mean and the mode.

But there are other ways to analyse a distribution, such as the standard deviation (the square root of the mean of the squared distances from the mean), as well as higher-order measures.

Mean and standard deviation are more commonly employed, and they would be affected by the delays.

But the best model of anything of course is the actual data.
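As a rough check of the point about mean and standard deviation, here is a minimal Python sketch using the two days from the opening post. The variable names are only illustrative, and it assumes the population form of the standard deviation (dividing by n).

```python
from statistics import mean, pstdev

day1 = [5.0] * 10               # 1/1/21: ten flights of 5 h each
day2 = [5.0] * 8 + [8.0] * 2    # 2/1/21: eight flights of 5 h, two of 8 h

mean1, mean2 = mean(day1), mean(day2)
sd1, sd2 = pstdev(day1), pstdev(day2)   # population standard deviation

print("day 1: mean", mean1, "std dev", sd1)
print("day 2: mean", round(mean2, 3), "std dev", round(sd2, 3))
```

Both numbers move on the second day, unlike the median.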
Title: Re: Why the median does not change?
Post by: vhfpmr on 12/06/2021 19:18:36
The median is either the middle number, or the average of the two middle numbers, so the answer is 5.
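A small Python sketch of that rule, using the numbers from the opening post. The helper median_with_delays is made up for illustration; it shows roughly how many planes have to be delayed before the median moves at all.

```python
from statistics import median

day1 = [5] * 10             # 1/1/21: all ten flights take 5 h
day2 = [5] * 8 + [8] * 2    # 2/1/21: two flights take 8 h

def median_with_delays(n_delayed, n_total=10, normal=5, delayed=8):
    """Median journey time when n_delayed of n_total flights take `delayed` hours."""
    times = [normal] * (n_total - n_delayed) + [delayed] * n_delayed
    return median(times)

print("day 1 median:", median(day1))
print("day 2 median:", median(day2))

# The median only shifts once the delayed flights reach the middle of the sorted list.
for k in (2, 4, 5, 6):
    print(k, "delayed ->", median_with_delays(k))
```

With ten flights the median stays at 5 until half of them are delayed.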
Title: Re: Why the median does not change?
Post by: Bored chemist on 12/06/2021 19:29:09
Quote from: scientizscht
What metric would calculate the total transit time impact of the 2 planes which were delayed for 3h each?
A sum.
Three hours for one plane and 3 hours for the other plane make 6 hours of delay in total.


Title: Re: Why the median does not change?
Post by: scientizscht on 12/06/2021 21:43:26
Thanks for your replies.

To make clearer what I am looking for, imagine a graph of number of planes vs journey time for day one, with any type of distribution, and another graph like it for day two.

We want to compare the two distributions to see how much more delay we had, in %. Obviously, this % will depend not only on the magnitude of the additional journey time for each delayed plane but also on the number of planes with additional journey time. E.g. we can have 1,000 planes delayed by 1h, or 100 planes delayed by 30h, or anything else. This dual effect is what I am trying to capture, and I am looking for the most accurate analysis to do that, one that is as representative as possible without the risk of skewed results.

1) One way to analyse this is to take the average of day one and day two and find the difference. Would that be accurate, given that the average may not be a good overall approximation for every distribution and given that the distributions of day one and two may be very different?

2) Another way to analyse this is to take the median of day one and day two and find the difference. Would that be accurate, given that the median may not be a good overall approximation for every distribution and that it may be too insensitive to outliers, distributions with two peaks, etc.?

3) A third way to analyse this is to calculate the area under the graph, by multiplying the number of planes by their journey time for day one and day two, and find the difference.

Do all the above three ways give the same result? If not, which is the most accurate especially if we want a result for almost any type of possible distribution?

Or is there any fourth better way?
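For what it is worth, the three approaches can be tried on the 10-plane example from the first post. This is only a sketch with made-up variable names, expressing each difference as a percentage of day one.

```python
from statistics import mean, median

day1 = [5.0] * 10               # ten flights of 5 h
day2 = [5.0] * 8 + [8.0] * 2    # two of them now take 8 h

# 1) difference of the averages, relative to day one
pct_mean = 100 * (mean(day2) - mean(day1)) / mean(day1)

# 2) difference of the medians, relative to day one
pct_median = 100 * (median(day2) - median(day1)) / median(day1)

# 3) "area under the graph": total plane-hours, relative to day one
pct_total = 100 * (sum(day2) - sum(day1)) / sum(day1)

print(f"mean: {pct_mean:.1f}%  median: {pct_median:.1f}%  total: {pct_total:.1f}%")
```

Ways 1 and 3 agree here because both days have the same number of planes (the total is just the mean times the count). If the counts differ, the total also picks up the extra flights. The median gives a different answer.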

Thanks!
Title: Re: Why the median does not change?
Post by: Bored chemist on 12/06/2021 21:59:22
It's not really a mathematical question.
Is it better to have one plane delayed by 10 hours or 10 planes delayed by 1 hour?
What about 2 planes 5 hours late?

What about 1 plane 8 hours late compared to 10 planes 1 hour late?
That's a matter of opinion, rather than a matter of fact.
Another thing that might help us answer is if you tell us what you plan to do with the "answer" you get.
Title: Re: Why the median does not change?
Post by: alancalverd on 13/06/2021 00:17:40
The answer is in every public transport franchise contract: "95% of trains/planes/whatever to arrive within 5 minutes of schedule, with an on-time median." You get paid as long as not more than 5% crash or fall into the sea.
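A rough Python sketch of that kind of contract test. The function franchise_check, the delay figures, and the reading of "on time median" as a median delay of zero or less are all assumptions made for illustration.

```python
from statistics import median

def franchise_check(delays_min, tolerance_min=5.0, required_share=0.95):
    """Return (share within tolerance, median delay, pass/fail) for delays in minutes."""
    on_time = sum(1 for d in delays_min if d <= tolerance_min)
    share = on_time / len(delays_min)
    med = median(delays_min)
    passed = share >= required_share and med <= 0
    return share, med, passed

# Hypothetical days: 19 punctual services plus one late one.
mild = [0] * 19 + [10]      # the late one is 10 minutes behind
severe = [0] * 19 + [600]   # the late one is 10 hours behind

print(franchise_check(mild))
print(franchise_check(severe))
```

Both days pass with identical figures, so a criterion of this shape says nothing about how bad the worst 5% were.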
Title: Re: Why the median does not change?
Post by: evan_au on 13/06/2021 10:49:42
Quote from: Wolfekeeper
But the best model of anything of course is the actual data.
That is true - but tables of data are hard to read.

One way of retaining most of the data, and yet making it (somewhat) easy to read is to plot the cumulative distribution of travel times for the two days, on the same graph.
- That would show you that one day had very consistent travel times, but the second day had a bunch of delays.
- When you have a table of 5 data points on each day, it's not too hard to compare the tables of raw numbers
- But if you have thousands of points in each series, the cumulative distribution can be read much more easily
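A cumulative distribution can be tabulated without any plotting library. The sketch below uses the two days from the opening post; share_at_or_below is a hypothetical helper name.

```python
def share_at_or_below(times, t):
    """Empirical cumulative distribution: share of flights taking at most t hours."""
    return sum(1 for x in times if x <= t) / len(times)

day1 = [5] * 10             # all flights take 5 h
day2 = [5] * 8 + [8] * 2    # two flights take 8 h

for t in (4, 5, 6, 7, 8):
    s1 = share_at_or_below(day1, t)
    s2 = share_at_or_below(day2, t)
    print(f"<= {t} h: day 1 {s1:.0%}, day 2 {s2:.0%}")
```

Day one jumps straight to 100% at 5 h, while day two sits at 80% until 8 h, which is the "bunch of delays" showing up as a shelf in the curve.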

 
Quote from: scientizscht
One way to analyse this, is to take the average of day one and day two and find the difference. Would that be accurate given than the average may not be a good overall approximation for every distribution
One factor that makes it hard to compare is that you have a 5-hour "normal" flight time mixed in with a 3-hour "abnormal" flight delay.
- Different routes will have different "normal" flight times, making it harder to compare.
- As Alan says, if you are interested in delays, subtract out the scheduled flight time from the actual flight time.
- Then you can actually examine the delays.
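In code, subtracting out the schedule is a one-liner. The routes and times below are invented purely to illustrate the idea.

```python
# Hypothetical (route, scheduled hours, actual hours) records.
flights = [
    ("A-B", 5.0, 5.0),
    ("A-B", 5.0, 8.0),
    ("A-C", 2.0, 2.5),
    ("A-C", 2.0, 2.0),
]

# Delay = actual flight time minus scheduled flight time, per flight.
delays = [actual - scheduled for _route, scheduled, actual in flights]
print(delays)
```

Once the delays are separated out, routes with different normal flight times can be compared on the same footing.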

If you don't really know what you are doing, taking the mean (=average) and the standard deviation is the safest bet.
- The mean takes into account the whole data set, and provides a compact summary of the "center" of the distribution. It converges quite quickly to the "correct" answer, as it is an "unbiased estimator".
- The standard deviation also takes into account the whole data set, and provides a compact summary of the "spread" of the distribution. It converges fairly quickly.
      - At this point you may remember from high school that there were two equations for standard deviation.
      - One divides by the sample size (n)
      - The other divides by (n-1)
      - It doesn't matter much for large sample sizes (thousands of data points), but for small sample sizes (like 5 airline flights), dividing by n underestimates the spread, so the (n-1) version is the usual choice for a sample
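Python's statistics module has both versions: pstdev divides by n and stdev divides by (n-1). The sketch below compares them on five made-up flight times and on the same values repeated 1000 times. The data are hypothetical.

```python
from statistics import pstdev, stdev

small = [2, 2, 2.5, 3, 4]   # five hypothetical flight times in hours
large = small * 1000        # the same values repeated, n = 5000

# pstdev divides by n, stdev divides by (n - 1);
# their ratio is sqrt((n - 1) / n), which approaches 1 as n grows.
ratio_small = pstdev(small) / stdev(small)
ratio_large = pstdev(large) / stdev(large)

print(f"n = {len(small)}: ratio {ratio_small:.4f}")
print(f"n = {len(large)}: ratio {ratio_large:.4f}")
```

For five flights the two formulas differ by about 10%. For thousands of data points the difference is negligible.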

Quote from: Sir Charles Dilke (maybe)
There are three kinds of lies: lies, damned lies, and statistics.
There are many traps in statistics, and it's better to understand a bit about the traps, so you can use statistics more effectively (or lie about them more effectively...)
Title: Re: Why the median does not change?
Post by: scientizscht on 13/06/2021 14:44:23
Quote from: Bored chemist
Another thing that might help us answer is if you tell us what you plan to do with the "answer" you get.

I want a KPI to measure the significance of delays in the total journey time of the fleet.

E.g. delays (taking into account the magnitude of the delays and the number of planes that were delayed) accounted for 5% of the journey time.

If this KPI goes to 10%, I will know that delays were a more significant part of the journeys in total.

My question is: mathematically, what is the most accurate way to measure that?

Average delay across the fleet divided by average whole journey time?

Median delay across the fleet divided by median whole journey time?

Total minutes of delays divided by total minutes of whole journey time?

Would the first and third give the same result?
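On the first and third: both averages divide by the same number of flights, so that number cancels and the two ratios are the same quantity. A quick Python check on the second day of the original example (the variable names are made up for the sketch):

```python
from statistics import mean, median

# Second day from the opening post: eight flights on time, two delayed by 3 h.
journey = [5.0] * 8 + [8.0] * 2   # whole journey time per flight, hours
delay = [0.0] * 8 + [3.0] * 2     # delay per flight, hours

# (sum(delay) / n) / (sum(journey) / n) == sum(delay) / sum(journey)
kpi_mean = mean(delay) / mean(journey)
kpi_total = sum(delay) / sum(journey)
kpi_median = median(delay) / median(journey)

print(f"mean-based:   {kpi_mean:.2%}")
print(f"total-based:  {kpi_total:.2%}")
print(f"median-based: {kpi_median:.2%}")
```

The mean-based and total-based figures match, while the median-based figure is zero because most flights had no delay.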
Title: Re: Why the median does not change?
Post by: Bored chemist on 13/06/2021 15:17:52
Quote from: scientizscht
I want a KPI to measure the significance of delays in the total journey time of the fleet.
The usual approach would be to find the economic cost of the delays and add them up.
Do you end up paying compensation to delayed passengers (or shippers) and how much does the delay  influence customers' repeat business?

I'm not saying that cash is the best way to calculate this, but it has the advantages of being relatively easy and also of being able to compare with many other KPI.
Title: Re: Why the median does not change?
Post by: alancalverd on 13/06/2021 17:11:55
If you are considering aviation in particular, "raw delays" is not a useful performance indicator.

Once a plane is in the air, they all travel at pretty much the published cruising speed until they descend to land. Most extended flight times are caused by intentional traffic separation, weather diversion (Heathrow fog is notorious), or diversion due to disruptive passengers (now including "government hijack") and are therefore safety buffers rather than delays. Strong headwinds can cause extended flight times but equally, a strong tailwind can put the arrival ahead of schedule. How is this relevant to "performance"?

I've experienced a 2-hour delay in departure from JFK as hailstones bounced off the runway and the plane 3 ahead of us in the queue was struck by lightning. Does that feature in the KPI of the aircraft, the operator, or North Atlantic weather systems? Same route, another day, we had an hour diversion because one of the radar displays died during climbout, and an extra 20 minutes at Heathrow before anyone could deplane because the fuselage had bent in violent clear-air turbulence and the cabin crew had to muster some large passengers to ram the door open.

Ground handling delays due to no-show passengers after the baggage has loaded can put the flight back to the next day. Been there, seen it: noise restrictions prevented some aircraft from departing Stansted after 9 pm, and promptly at 8.30 everyone had to deplane, unload and reclaim their baggage because some twit had decided not to fly or locked himself in a toilet somewhere, so the flight was rescheduled for 7 am next day.

Funniest was overnight freezing rain at Stansted. No problem, except at 0600 the deicing truck was covered in ice, so they pulled it into the sunshine with a tractor and waited for the ice to melt. All flights delayed by an hour. KPI?

One Friday night, the pilot was stuck in a traffic jam on the wrong side of Paris. By the time he arrived, the rest of the crew were out of hours so they flew a complete relief crew in from Stansted. Pilot KPI, possibly, but not attributable elsewhere. 

A friend flying a Trident 3 out of Belfast thought the plane felt more sluggish than usual. Looked at the fuel ticket and saw he had been given the fuel load for a thirstier Trident 2. Much radio chat and the guy flying the T2 diverted to Birmingham rather than run out of gas over Watford. 100 lives saved. KPI?

Famous Tower to Pilot conversation "For God's sake I directed you to stand 14 and you have gone to stand 41. There's a 747 behind you that won't fit into 14, and now I've got aircraft queued up all the way to the coast until somebody moves a 747 off a stand, and they take an hour to load...." KPI?

"If you've time to spare, go by air." If the passengers walk on and walk off, the KPI is 100%.
Title: Re: Why the median does not change?
Post by: evan_au on 14/06/2021 10:44:00
Quote from: OP
Why the median does not change?
The median is quite reliable at finding the "middle value" of a distribution, ignoring the outliers
- The median treats everything except the middle value as an "outlier"! In this sense, it ignores most of the data.
- But the example you quote has an extreme bimodal distribution, with no middle, and everything is an outlier (including the middle value).
- So the middle value is not very representative of this rather extreme distribution.
- If it were a more realistic distribution, with times of 2, 2, 2.5, 3 & 4 hours, the median would be (slightly) different between the two days
See: https://en.wikipedia.org/wiki/Median

The other common measure of the center of a distribution is the mode (the most common value).
- The mode does not converge very well.
- It mostly pays attention to a small number of values, so you need a lot of samples to get a good answer
- As more values are collected, the most common value can jump around a fair bit
- If you have a continuous distribution (eg flight times measured to the nearest second, rather than the nearest hour), you have to resort to tricks like bins or moving averages to find the mode.
https://en.wikipedia.org/wiki/Mode_(statistics)
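The binning trick can be sketched in a few lines of Python. The flight times and the helper binned_mode are hypothetical, and the bin width is an arbitrary choice.

```python
import math
from collections import Counter

def binned_mode(values, bin_width):
    """Return (bin start, bin end, count) for the fullest bin of the given width."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    index, count = bins.most_common(1)[0]
    return index * bin_width, (index + 1) * bin_width, count

# Hypothetical flight times in hours, measured precisely: no value repeats,
# so the raw "most common value" is meaningless.
times = [4.93, 5.02, 5.07, 5.11, 4.98, 8.03, 7.96, 5.04]

start, end, count = binned_mode(times, 0.5)
print(f"modal bin: {start} h to {end} h, {count} flights")
```

Notice that 4.93 and 4.98 land in the neighbouring bin even though they are close to 5, so the answer depends on where the bin edges fall. That is one reason the mode tends to jump around.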

But, overall, the mean (=average) is the safest to use for most purposes - takes into account all the data you have, and converges quickly.
Title: Re: Why the median does not change?
Post by: vhfpmr on 14/06/2021 11:54:22
A common example where the median is preferred over the mean is wealth, because the mean gets distorted by a tiny number of mega-rich people.
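A toy illustration in Python, with entirely invented wealth figures (in thousands) for ten households, one of them very rich:

```python
from statistics import mean, median

# Hypothetical household wealth in thousands; the last household is the outlier.
wealth = [30, 40, 45, 50, 55, 60, 70, 80, 90, 100_000]

print("mean:  ", mean(wealth))
print("median:", median(wealth))
```

The mean is pulled far above what nine of the ten households actually have, while the median stays near the middle of the pack.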