Consider video. It is actually a rapidly flashed sequence of still images, like movie frames. These stills are flipped up to us at a rate just a little faster than our ability to perceive separate visual events in time, so this staccato of stills seems to the viewer to create continuous motion. The images are flashed 60 times per second (one every 1/60 second, or about 17 ms), and at that rate we cannot separate one image from the next. This is called temporal fusion: separate events that arrive within a short enough time period fuse into one continuous sensation.
We are slow to perceive the rapidly flashing visual images because we employ a slow biochemical sensor (not a speed-of-light electronic photocell) in our eye/brain system. The relative slowness of our electrochemical visual sensors results in the “visual fusion” of events that are actually separate in time. This is not a weakness; it is the nature of our biochemical being that multiple events are perceived as separate only if they are sufficiently separated in time. If separate events arrive too quickly, they are perceived as one continuous event. Without this “visual fusion” process, video as we know it today would be like watching a strobe light show: a novelty, but not an entertainment medium.
Let’s use this card flipping process to introduce the sense of sound into temporal fusion. Most of us did something like this when we were young: we used a clothesline clip to position a card in the spokes of a bike wheel and got the sound of a motorcycle. Try an experiment. Take a deck of cards in your hands, arch them back and then, with the thumb, release the entire deck in one second. What do we hear? A breathy, fluttering type of sound, but a tone nonetheless. If we flip through 50 cards in one second, we get 50 separate positive pulses of air per second. But we hear this process as if it were a breathy 50 Hz tone, which is a bass note about ten keys up from the bottom end of the piano keyboard.
If we flip one card per second, we hear distinct snaps. If we flip 50 cards per second, we do not hear 50 snaps per second, but perceive a continuous tone of 50 cycles per second. Because we are human, and our detection systems are biochemical, our experiences with sight and sound are quite similar. Rapidly flipped cartoon cards create the impression of continuous motion and rapidly snapped playing cards create the impression of continuous sound. Both effects occur because of the temporal fusion threshold (time) in our ability to detect separate events. Separate events that occur within 1/20 second are perceived as one event. Multiple events that are spaced closer than 1/20 second apart are perceived as a continuous event.
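The 1/20 second rule above can be checked with a little arithmetic. What follows is a minimal sketch in Python. It treats the fusion threshold as a sharp cutoff at exactly 1/20 second, which is a simplification, and the function name `perceived_as` is made up for illustration.

```python
# Illustrative sketch: treats the 1/20 s fusion threshold from the text
# as a hard cutoff, which is a simplifying assumption.
FUSION_THRESHOLD_S = 1 / 20  # seconds between events, figure used in the text


def perceived_as(events_per_second: float) -> str:
    """Classify a steady stream of events as 'separate' or 'continuous'."""
    interval_s = 1 / events_per_second  # time between successive events
    return "continuous" if interval_s < FUSION_THRESHOLD_S else "separate"


for label, rate in [("1 card per second", 1),
                    ("50 cards per second", 50),
                    ("60 video frames per second", 60)]:
    interval_ms = 1000 / rate
    print(f"{label}: {interval_ms:.0f} ms apart -> {perceived_as(rate)}")
```

Under this assumption, one card per second is heard as separate snaps, while 50 cards per second and 60 frames per second both fall inside the fusion period.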
Again, we consider video and find the spatial (location) version of fusion on the video screen itself. At the movies, the image is practically a continuous distribution of colors and shadings because it is a projected photograph (slide shot) of real objects. The smoothness of the image is controlled by the graininess of the film, which long ago was reduced to levels too fine to notice. Not so with video. We have pixels, dots, or blocks of colors on the screen. The size of the dot, its brightness, and its distance to neighboring dots are macroscopic. The dots become visible to the naked eye as we move closer to the screen. Of course, if we sit back far enough, these separate dots seem to merge, to fuse into a continuous image. Again, sensory fusion. Separate events in space, as well as those in time, can fuse together into a continuous event. If it were not for our susceptibility to sensory fusion in both time and space (temporal fusion, spatial fusion), we could not enjoy the film or video process as we do today.
This fine grain fusion threshold has to do with the distances between adjacent cones in our retina. As long as separate, distinct light sources on the video screen are sufficiently close together, the cones in our retina cannot separate the lights. One of the biggest equipment improvements in video has been to decrease the distance between pixels, and when the screen is viewed from the proper distance, the grainy resolution problem of TV is reduced.
There is also a graininess aspect to hearing. This isn’t the so-called “grainy sound” effect that has to do with high frequency distortion. Here we concern ourselves with listening to a sound source with one ear and how well we can detect its position changes. This is like listening to a cricket chirp while it crawls along the ceiling and you listen with one ear. We detect this kind of position change by the way sound enters over the folds in our “ear trumpet.” Sound from one direction engages one pattern of the curves of our ear, while sound from another direction engages these same curves in a different pattern. We learn to tell where sound comes from by the way it is changed by the corrugation of our ears.
A third and very important similarity between sight and sound is that we have two eyes as well as two ears. Two sensors, separated but side-to-side, allow us to easily resolve lateral positions and get a fix on depth positions. This third aspect of sensing is available to us through the coordination of our pair of sensors. With sight we can detect depth, what is in front of or behind. Also with sound, we can detect if a sound source is close or far from us. The mechanisms for these detections may be different, but the effect of stereoscopic vision and stereophonic audition is clearly due to two sensors and the coordination of their inputs.
We have considered areas of similarity in the perception of sight and sound. Each contributes, more or less, to perception in the movie process. We have temporal fusion thresholds (those due to time) illustrated by cartoon flash cards and thumbed playing cards. We also have a spatial threshold on the resolution of distinctly separate source positions in space. When it comes to the setup of an A/V room for home theater, the goal is to reduce primary distractions to the sight and sound process of the movie presentation.
Our perception of a sequence of events depends on the time between the events. If they are close enough together (within 1/20 second), the separate events seem to fuse into one single continuous event. If they are separated by more than 1/20 second, they appear as a staccato of events, a stroboscopic presence. A sensory distraction in time usually occurs because a distracting event shortly follows the main or desired event. If the distracting event arrives within the sensory fusion time period of 1/20 second, we have a fusion distraction. If it arrives later, we have a “post fusion” distraction.
Let’s consider the echo. It is a sensory distraction, a distinct, post-fusion acoustic event. Sound passes by us, hits a distant wall and bounces back. The round trip distance, from the listener to the reflecting wall and back to the listener, delays the hearing of the reflection by more than 1/20 second. Sound travels 1130 feet/second, so the distance covered in 1/20 second is 56.5 feet. A wall half that distance away, 28 1/4 feet or more, will provide a detectable echo.
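The echo arithmetic above can be written out as a short Python sketch. It uses only the two figures from the text (1130 feet/second and the 1/20 second threshold); the function names are hypothetical and the hard cutoff at 1/20 second is a simplification.

```python
SPEED_OF_SOUND_FT_S = 1130   # feet per second, figure used in the text
FUSION_THRESHOLD_S = 1 / 20  # seconds, figure used in the text


def round_trip_delay_s(wall_distance_ft: float) -> float:
    """Delay of a reflection that travels listener -> wall -> listener."""
    return 2 * wall_distance_ft / SPEED_OF_SOUND_FT_S


def is_echo(wall_distance_ft: float) -> bool:
    """True when the reflection arrives after the fusion period."""
    return round_trip_delay_s(wall_distance_ft) > FUSION_THRESHOLD_S


def nearest_echo_wall_ft() -> float:
    """Wall distance whose round trip takes exactly the fusion period."""
    return SPEED_OF_SOUND_FT_S * FUSION_THRESHOLD_S / 2


print(f"Nearest echo-producing wall: {nearest_echo_wall_ft():.2f} ft")
for distance_ft in (10, 40):  # example distances chosen for illustration
    delay_ms = 1000 * round_trip_delay_s(distance_ft)
    kind = "echo" if is_echo(distance_ft) else "fused with the direct sound"
    print(f"Wall at {distance_ft} ft: {delay_ms:.1f} ms -> {kind}")
```

The two example distances are arbitrary; they only show a wall on each side of the 28 1/4 foot limit.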
For light, the temporal threshold is roughly the same 1/20 second, but the speed of light is very fast, 186,000 miles/second. To delay a light reflection by 1/20 second, a reflecting mirror would have to add a round trip path of 186,000 miles/second × 1/20 second, or 9,300 miles. The mirror would have to be 4,650 miles away for us to detect the flicker effect of an echo of light.
If a room with an average dimension of 15 feet were covered with mirrors, light could reflect three million times and the images would still be inside our optical fusion threshold. On a practical basis, light would die out long before it could reflect three million times, so the entire optical process of distraction lies well within the visual fusion time period. We don’t have to worry about optical echo problems. With light we only experience visual fusion problems, and a little paint or wallpaper goes a long way to control them.
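The light figures can be reproduced the same way. This is a rough Python sketch that assumes each crossing of the 15 foot room counts as one reflection; the function names are invented for the example, and the feet-per-mile conversion is the only number not taken from the text.

```python
SPEED_OF_LIGHT_MI_S = 186_000  # miles per second, figure used in the text
FUSION_THRESHOLD_S = 1 / 20    # seconds, figure used in the text
ROOM_DIMENSION_FT = 15         # average room dimension, from the text
FEET_PER_MILE = 5280           # unit conversion


def light_path_miles() -> float:
    """Distance light covers during one fusion period."""
    return SPEED_OF_LIGHT_MI_S * FUSION_THRESHOLD_S


def mirror_distance_miles() -> float:
    """One-way distance to a mirror whose round trip fills that period."""
    return light_path_miles() / 2


def room_crossings() -> float:
    """Room crossings light can make inside one fusion period
    (assumes one reflection per crossing)."""
    return light_path_miles() * FEET_PER_MILE / ROOM_DIMENSION_FT


print(f"Path covered in 1/20 s: {light_path_miles():,.0f} miles")
print(f"Mirror distance:        {mirror_distance_miles():,.0f} miles")
print(f"Room crossings:         {room_crossings():,.0f}")
```

The crossing count comes out a little above three million, in line with the round figure quoted above.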
On the other hand, sound travels much more slowly, and it will cross the room no more than four times before starting to sound like an echo. In a typical room, sound decay times are easily 1 1/2 seconds. This means sound can be audible for over 100 reflections. The first four of these are inside the sound fusion threshold and the rest arrive outside the threshold, in the post fusion time period. With sound we hear four fusion distraction reflections and about 100 post fusion (echo) distraction reflections. Reflections inside the fusion time period produce an image distraction effect, while reflections that arrive outside the fusion period produce an echo effect.
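The reflection counts for sound follow from the same arithmetic. The Python sketch below assumes one reflection per crossing of a 15 foot room and treats the 1 1/2 second figure quoted above as the time the sound stays audible; the names `crossings_within` and `AUDIBLE_TIME_S` are invented for the example.

```python
SPEED_OF_SOUND_FT_S = 1130   # feet per second, figure used in the text
FUSION_THRESHOLD_S = 1 / 20  # seconds, figure used in the text
ROOM_DIMENSION_FT = 15       # average room dimension, from the text
AUDIBLE_TIME_S = 1.5         # the 1 1/2 second figure, from the text


def crossings_within(duration_s: float) -> float:
    """Room crossings sound makes in the given time
    (assumes one reflection per crossing)."""
    return SPEED_OF_SOUND_FT_S * duration_s / ROOM_DIMENSION_FT


fused = crossings_within(FUSION_THRESHOLD_S)
total = crossings_within(AUDIBLE_TIME_S)
print(f"Crossings inside the fusion period: {fused:.1f}")
print(f"Crossings while sound is audible:   {total:.0f}")
print(f"Crossings after the fusion period:  {total - fused:.0f}")
```

The first result is just under four and the last is a little over 100, which matches the rounded counts in the paragraph above.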
The home theater is set up in a residential sized room. Under no circumstances would anyone ever consider covering all the walls, ceiling, and floor with mirrors and then placing a video screen at one end of the room of mirrors. Such a reflective condition is useful only at the carnival in the “house of mirrors.” What happens in the “house of mirrors”? Disorientation — too many images from too many places. We lose track of the original image source and certainly cannot dismiss all the reflected images of the source. The images are too strong and occur well inside the fusion time period. The source and images merge into one confusing visual space. And we pay good money for this experience but only as a novelty — a game.
We would never really choose to live, work, or watch a movie in a room filled with distractions. Yet, all too often with sound, we expose ourselves to exactly this kind of acoustical house of mirrors. Normal walls, floor, and ceiling may not reflect much light, but for sound they are so acoustically smooth that they act like a polished mirror. And this time, it’s not the carnival, but your A/V presentation room.