When you add an extra singer, the total sound power increases. It is true that at some instants, in some listening positions, at some frequencies, the sound from the second singer will cancel out the sound from the first singer... but this will be overridden by the fact that at most frequencies, in most listening locations, most of the time, the sound power will be higher.
To get a long-lasting and deep sound cancellation, you need very precise sine-wave generators, not human singers.
The effect you notice is that in many areas of human perception, the response to stimulus is logarithmic, not linear.
This is why the Decibel scale of sound perception is calculated as 10*LOG(P1/P2).
Adding a second singer increases the sound level by 3dB, which is just perceptible to the human ear under normal conditions.
Adding 9 singers increases the sound level by 10dB, which is audibly much louder.
Adding 99 singers increases the sound level by 20dB.
Adding 999 singers increases the sound level by 30dB.
There is the same perceived increase in volume in going from 1 to 10 singers as there is going from 100 to 1000.
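As a quick check of those figures, here is a minimal Python sketch. It assumes the singers are equally loud and uncorrelated, so that the power ratio P1/P2 is simply the number of singers; the function name is made up for illustration.

```python
import math

def level_increase_db(n_singers: int) -> float:
    """Increase in sound level (dB) of n equal, uncorrelated singers over one singer."""
    # Assumption: uncorrelated sources add in power, so P1/P2 = n_singers.
    return 10 * math.log10(n_singers)

for n in (1, 2, 10, 100, 1000):
    print(f"{n:5d} singers: +{level_increase_db(n):.1f} dB")
```

Each tenfold increase in the number of singers adds the same 10dB step, which is the logarithmic behaviour described above.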
At an equal number of points, the sound power is doubled through constructive interference, where the listener hears precisely equal intensity from the 2 sources, and 0 degrees phase (in-phase). The average effect, over the entire listening space, is that the sound power is doubled.
The human voice is not a pure, constant frequency of fixed phase - it consists of a number of higher-frequency resonances (or formants) which occur at different frequencies for different people because of the shape of their throat (even if they are singing the same fundamental frequency). Different frequencies will not cancel each other at all - they add in power.
If the two singers do happen to produce the same higher frequency, it will cancel at different positions in the listening space than the fundamental frequency, so those who would hear nothing from a pair of sine-wave generators would hear something from a pair of human singers.
Complete cancellation occurs only if the two sources have equal amplitude. However, the vocal resonances are excited in bursts by the opening and closing of the vocal folds (in the larynx, behind the "Adam's apple"), which means that the amplitude of sound from the two singers is not constant, and so any cancellation will be short-lived. These bursts occur asynchronously in different singers, and so the sound power adds.
When you add multiple point sources (eg 3 or more singers), the probability of getting a total cancellation drops, since there are very few points in which the amplitude and phase cancel for all 3 singers, at all of their formants.
A soloist can still be heard over a choir of a hundred voices all singing at perhaps half power - they are not fifty times louder.
And yet (under laboratory conditions) it can pick out the periodic motion of air molecules, moving with an amplitude of the width of a hydrogen atom.
So the scale is not distorted - it is actually a reasonable reflection of human perception of many kinds of stimuli.
When you measure the amplitude of a sound (eg the vowel "ee") with a microphone feeding into a computer, you are actually measuring the Sound Pressure Level. The microphone is insensitive to the constant air pressure, and only measures the variation from the average. The average variation/value of a sine wave is zero. So if you average many sine waves, the average will still be zero.
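A small Python sketch of this point. The sample count and the choice of exactly one full cycle are assumptions for illustration. The plain average of a sine wave comes out at (essentially) zero, so a loudness measurement has to use something like the root-mean-square value instead.

```python
import math

N = 1000  # samples across exactly one cycle (assumed for illustration)
wave = [math.sin(2 * math.pi * i / N) for i in range(N)]

# Plain average of the pressure variation: cancels out over a full cycle.
mean_value = sum(wave) / N
# Root-mean-square: squares the samples first, so it does not cancel.
rms_value = math.sqrt(sum(x * x for x in wave) / N)

print(f"mean = {mean_value:.6f}, rms = {rms_value:.6f}")
```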
The usual way of doing speech recognition starts by breaking the speech into short segments (eg 5-10ms) and performing a Fast Fourier Transform to identify the main frequencies in this segment. For vowel sounds, there will be several formants which can be used to distinguish the different vowels; for unvoiced sounds like "sh", "s" and "f", there is a white-noise type spectrum.
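Here is a rough sketch of that first step in Python, using only the standard library. To keep it self-contained it uses a plain (slow) discrete Fourier transform rather than a true FFT; the result is the same, only the speed differs. The 8000 Hz sample rate and the 300 Hz and 2300 Hz test components are made-up values standing in for formants, and real speech would also need a window function applied to each segment.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz (assumed for illustration)
SEGMENT_MS = 10     # segment length, from the 5-10ms range above

def dft_magnitudes(segment):
    """Magnitude of each frequency bin up to Nyquist (naive O(n^2) DFT, not an FFT)."""
    n = len(segment)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        mags.append(abs(acc))
    return mags

def main_frequencies(segment, count=2):
    """Return the `count` strongest frequencies (Hz) in the segment, lowest first."""
    mags = dft_magnitudes(segment)
    bin_hz = SAMPLE_RATE / len(segment)
    strongest = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * bin_hz for k in strongest)

# Synthetic 10ms "vowel": two hypothetical formant-like components.
n = SAMPLE_RATE * SEGMENT_MS // 1000
segment = [math.sin(2 * math.pi * 300 * i / SAMPLE_RATE)
           + 0.5 * math.sin(2 * math.pi * 2300 * i / SAMPLE_RATE)
           for i in range(n)]
print(main_frequencies(segment))
```

Note that a 10ms segment gives a frequency resolution of only 100 Hz, which is part of the trade-off in choosing the segment length.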
But the characteristics of speech change rapidly, especially around plosives like "p" and "b", so the changes in successive segments must be tracked.
These algorithms are not simple, so why not start from some freeware or open-source speech recognition software, eg: http://savedelete.com/7-best-free-speech-recognition-software.html
I see, so you could somehow have lots of sound and anti-sound where nothing is heard, but the pressure is constantly higher while it's going on.
thinking about the cochlea...
Sorry, that wasn't really what I was trying to convey - sine waves won't increase the average "DC" sound pressure, only the "AC" sound pressure. I would complete this sentence as "you could have lots of sound and anti-sound where nothing is heard, but there will be an adjacent location where the sound amplitude is increased."
The Fourier analysis can do a similar frequency analysis to the hairs of the cochlea, picking out the main frequencies in a complex sound. ... The FFT analysis measures all formants simultaneously, with a surprisingly small number of calculations. (The same argument goes for distinguishing fricatives.)
When a sound consists of a fundamental and higher-frequency components, counting the zero-crossings produces a misleadingly high answer, because the higher frequencies produce extra zero-crossings. You can reduce this impact by low-pass filtering the sound, which will help estimate the fundamental frequency - but then you lose the higher frequencies.
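A sketch of this effect in Python. All the numbers are assumptions chosen to make the point: a 100 Hz fundamental, a stronger 1000 Hz component, an 8000 Hz sample rate, and a simple moving average standing in for a proper low-pass filter.

```python
import math

SAMPLE_RATE = 8000   # Hz (assumed)
FUNDAMENTAL = 100    # Hz (assumed)
HARMONIC = 1000      # Hz, a strong higher component (assumed)

def count_zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def moving_average(samples, width):
    """Crude low-pass filter: mean of each run of `width` consecutive samples."""
    return [sum(samples[i:i + width]) / width
            for i in range(len(samples) - width + 1)]

def tone(freq, amplitude):
    # One second of sine wave; the 0.3 rad offset keeps samples off exact zero.
    return [amplitude * math.sin(2 * math.pi * freq * i / SAMPLE_RATE + 0.3)
            for i in range(SAMPLE_RATE)]

pure = tone(FUNDAMENTAL, 1.0)
mixed = [f + h for f, h in zip(pure, tone(HARMONIC, 1.5))]
# An 8-sample average spans one full 1000 Hz cycle, so it removes that component.
filtered = moving_average(mixed, SAMPLE_RATE // HARMONIC)

pure_count = count_zero_crossings(pure)
raw_count = count_zero_crossings(mixed)
filtered_count = count_zero_crossings(filtered)
print(pure_count, raw_count, filtered_count)
```

The count for the mixed signal follows the higher component rather than the fundamental, while the filtered count returns to roughly two crossings per cycle of the fundamental. The price, as noted, is that the filtered signal no longer contains the higher frequency.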
But if it's only increased to twice the volume there, the average volume will be less than twice as high as with one singer
compare +ve with -ve movement
How do you get the raw data into the right form to start with though?
The Sound Pressure Level is twice as high where there is constructive interference, which means that the power is 4 times higher, because Power is proportional to (SPL)^2. Overall, SPL is increased by a factor of 1.4, and the audio power increases by (1.4)^2, ie doubled.
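This arithmetic can be sketched in Python. The assumptions here are mine: two sources of equal amplitude at a single frequency, with the phase difference between them spread evenly over a full cycle as a stand-in for averaging over all listening positions.

```python
import cmath
import math

def relative_power(phase_difference):
    """Power of two equal sources relative to one source, for a given phase difference."""
    # Pressures add as phasors; power goes as the square of the pressure amplitude.
    amplitude = abs(1 + cmath.exp(1j * phase_difference))
    return amplitude ** 2

# Phase differences spread evenly over one cycle (assumed stand-in for listening positions).
phases = [2 * math.pi * k / 360 for k in range(360)]
average_power = sum(relative_power(p) for p in phases) / len(phases)

print(f"in phase:     {relative_power(0):.2f} x power")
print(f"out of phase: {relative_power(math.pi):.2f} x power")
print(f"average:      {average_power:.2f} x power, "
      f"{math.sqrt(average_power):.2f} x pressure")
```

The in-phase points get 4 times the power and the out-of-phase points get none, but the average over all phases is 2 times the power.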
Quote
compare +ve with -ve movement
This method of analysis is equivalent to high-pass filtering, so it will emphasise the higher formants, rather than detecting the fundamental tone. From what I have read, you get most information from the lowest formant, and successively less from each higher formant.
Quote
How do you get the raw data into the right form to start with though?
The functions of speech recognition are normally broken into different modules, which ideally operate in parallel:
Unfortunately, the frequency analysis algorithm is very processor-hungry (FFT and bandpass filtering is best done on a Digital Signal Processor, if you have one)
The equation for the amplitude of a group of singers all singing with the same volume is:
Av = Vs * SQRT(n)
where n is the number of singers, Vs is the volume of each singer, and Av is the overall volume.
This assumes that the singers' voices are not well correlated and are adding more or less randomly, and that while they may be roughly the same pitch, they're not exactly in phase. For human singing this is highly likely to be a very good approximation.
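A quick simulation of that equation in Python. It is only a sketch under simplifying assumptions: every singer is a unit-amplitude sine wave at the same frequency, each with an independent random phase, and the result is averaged over many trials. The trial count and seed are arbitrary.

```python
import cmath
import math
import random

def rms_amplitude(n_singers, trials=2000, seed=0):
    """RMS amplitude of n unit-amplitude, randomly phased singers (Vs = 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_power = 0.0
    for _ in range(trials):
        # Add the singers as phasors with independent random phases.
        resultant = sum(cmath.exp(1j * rng.uniform(0, 2 * math.pi))
                        for _ in range(n_singers))
        total_power += abs(resultant) ** 2
    return math.sqrt(total_power / trials)

for n in (1, 4, 100):
    print(f"n = {n:3d}: simulated Av = {rms_amplitude(n):.2f}, "
          f"SQRT(n) = {math.sqrt(n):.2f}")
```

The simulated amplitude should land close to SQRT(n), so the average power (amplitude squared) comes out close to n times that of one singer.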
If the cancellation idea worked when several people sang together, then to avoid violating conservation of energy, power would have to be forced back into the singers: you would find it progressively harder to sing, the more people that were singing!
It turns out the amplitude goes as the square root of the number of people.
No, no. Square wave is perfectly valid....Randomly phased/frequency square waves approach a normal distribution with enough waves added together
...and the sound power grows as the square of the sound pressure amplitude. So sound power grows proportional to the number of singers, conserving energy.
Okay, so if you remove my simplifications you'd get straight to a graph showing the square root values 1, 1.41, 2, etc., but the square waves will eventually catch up with the graph when it gets to higher values. That means that when we're dealing with a million singers, we've only got 1/1000 the volume that we'd get if all their sounds were completely in phase with each other such that they added up without any cancellation. They aren't in phase though, so it looks to me as if 99.9% of the sound they're producing must be being cancelled out.
Quote from: David Cooper on 30/04/2013 23:11:37
Okay, so if you remove my simplifications you'd get straight to a graph showing the square root values 1, 1.41, 2, etc., but the square waves will eventually catch up with the graph when it gets to higher values. That means that when we're dealing with a million singers, we've only got 1/1000 the volume that we'd get if all their sounds were completely in phase with each other such that they added up without any cancellation. They aren't in phase though, so it looks to me as if 99.9% of the sound they're producing must be being cancelled out.
No, volume is best understood as power, rather than amplitude. The power is a million times more with a million singers.
You have to square amplitude to get power. I believe that's your central mistake.
This is a bit more subtle. Basically if you're adding them electronically, there's no problem with that; you can get out more power than you put in, because you have an amplifier!
Let me simplify things further for you. Imagine a gong with ten people hitting it with their bongers (correct technical term for gong-hitting implements not known). If they all hit the gong with their bongers at exactly the same instant, the amplitude of the sound produced will be ten times as loud as if only one hits it.
If they all hit it at different points in time though, with half of them hitting it when it's coming back towards them, they will cancel out a lot of the movement of the gong and it will be a lot quieter.
It might be better to think of half of them hitting the gong from the other side, so if two bongers hit the gong at the same time from opposite sides, they will cancel each other out and create heat instead.
If they're all deaf and blind, they will hit the gong at random times and produce an amplitude of root 10 with the sound energy supposedly being 10, but if they are able to coordinate the movement of their bongers perfectly they can generate a sustained amplitude of 10 with the sound energy supposedly being 100. It doesn't add up.
Quote from: David Cooper on 05/05/2013 18:32:02
Let me simplify things further for you. Imagine a gong with ten people hitting it with their bongers (correct technical term for gong-hitting implements not known). If they all hit the gong with their bongers at exactly the same instant, the amplitude of the sound produced will be ten times as loud as if only one hits it.
There will be ten times more energy in the gong.
Quote
If they all hit it at different points in time though, with half of them hitting it when it's coming back towards them, they will cancel out a lot of the movement of the gong and it will be a lot quieter.
No, that's not right. If they hit it at different times, they will add ten times more energy to the gong at different times. (To a pretty good approximation; it does depend a bit on precisely what way it's struck.)
Quote
It might be better to think of half of them hitting the gong from the other side, so if two bongers hit the gong at the same time from opposite sides, they will cancel each other out and create heat instead.
Only if they hit it at EXACTLY the same time; then they will effectively not have hit the gong. The hammer will bounce back at exactly the same speed it was struck at, and no energy will be added to the gong, but in virtually any normal case, this perfect strike will not happen.
The thing you're continually failing to understand is that the details, really, really matter.
The way the clappers hit the gong, the shape of the clappers, the timing of the clappers, whether the particular point on the gong that it hits is moving or not.
In the normal case, where you hit the gong with ten clappers, the energy added to the gong is ten times the amount of energy from one clapper, and most of it will NOT end up at the centre; the wave energy will spread out in different directions in an inverse law from each clapper, and only a small fraction ends up in the centre at all.
"Adding up the area", which is equivalent to the mathematical operation of integration. In the audio domain, it is also equivalent to a low-pass filter, which will attenuate higher frequencies, and be better at extracting the fundamental frequency.
Adjusting the boundaries near the zero crossings, which is a form of interpolation.
Detecting zero crossings, which is a way to estimate frequency.
Separately accumulating the positive and negative areas, and then comparing them at the end. Mathematically, this is just the same as adding them all up (comparison is a form of subtraction).
The problem with using zero crossings is that it only considers the input values when the waveform is near the zero crossings - this is why interpolation is so important.
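Here is a hedged sketch of what interpolated zero-crossing detection might look like in Python. It is not David's actual code; the function names, the 8000 Hz sample rate and the 437 Hz test tone are all made up for illustration. Each upward crossing is located between two samples by drawing a straight line through them, and the frequency is estimated from the time between the first and last crossing.

```python
import math

SAMPLE_RATE = 8000  # Hz (assumed)

def upward_crossing_times(samples):
    """Times (s) of negative-to-positive crossings, linearly interpolated."""
    times = []
    for i in range(len(samples) - 1):
        a, b = samples[i], samples[i + 1]
        if a < 0 <= b:
            # Fraction of the way from sample i to i+1 where the line hits zero.
            fraction = -a / (b - a)
            times.append((i + fraction) / SAMPLE_RATE)
    return times

def estimate_frequency(samples):
    """Estimate frequency (Hz) from the spacing of upward zero crossings."""
    times = upward_crossing_times(samples)
    if len(times) < 2:
        raise ValueError("need at least two upward zero crossings")
    return (len(times) - 1) / (times[-1] - times[0])

# 0.1 s of a hypothetical 437 Hz test tone.
tone = [math.sin(2 * math.pi * 437 * i / SAMPLE_RATE + 0.3) for i in range(800)]
print(f"estimated frequency: {estimate_frequency(tone):.2f} Hz")
```

Without the interpolation step, each crossing time could be off by up to a whole sample, which matters most for short segments and high frequencies.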
The Fourier transform (or multiplying by a square/sine wave) uses the entire waveform, and so is more sensitive; interpolation is not needed for these algorithms.
altitude [amplitude] where the wave repeatedly switches direction (up/down)
Quote
altitude [amplitude] where the wave repeatedly switches direction (up/down)
Detecting changes in direction is equivalent to the mathematical function of differentiation, or in the audio domain it is a high-pass filter. But integration and differentiation are opposites of each other (an inverse function), so if you are differentiating and then integrating you will be doing a lot of processing work to end up back where you started.
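To show what I mean by "back where you started", here is a tiny Python sketch using the simplest discrete versions of the two operations: differences between neighbouring samples for differentiation, and a running sum for integration. The sample values and function names are made up for illustration.

```python
def differentiate(samples):
    """Discrete differentiation: difference between each pair of neighbours."""
    return [b - a for a, b in zip(samples, samples[1:])]

def integrate(differences, start):
    """Discrete integration: running sum, beginning from a known start value."""
    total = start
    out = [start]
    for d in differences:
        total += d
        out.append(total)
    return out

samples = [0, 3, 5, 4, 0, -4, -5, -3, 0]   # made-up waveform
slopes = differentiate(samples)
recovered = integrate(slopes, samples[0])
print(slopes)
print(recovered == samples)
```

The running sum of the differences rebuilds the original samples, so the two steps together cost processing time without adding information.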
In any computer program (especially doing signal processing, like this one), there are some central loops which may make up only 1% of the code, but take up a majority of the execution time. Focus in on these, and small changes can often make a large impact on processing time.
Have a look at a public domain MP3 encoder. This will include code for an FFT.
I'm definitely not doing and undoing anything
at the highest frequencies, and that's where noise is being generated that is hiding any real signal