Down With Audio Interfaces


I often get asked about the future of interfaces: “Wouldn’t it be great”, people say, “if we could just talk to our computers like in Star Trek? Aren’t voice recognition and talking computers the interface of the future?” A lot of people seem to think that all interface problems can be solved via voice. But I have a one-word answer: Voicemail.

Everyone hates voicemail and voicemail systems. And with good reason. These days voicemail is getting pretty “smart”: you can now say “Yes” and “No” instead of pressing 1 or 2 in response to questions (unless you have an accent, in which case don’t bother). You can even say a person’s name to be connected to their extension. But technical problems aside, these are patches on a fatally flawed medium.

Audio interfaces will always lack something that visual interfaces possess effortlessly: the ability to jump around at will. If you don’t care about the information in a paragraph, you skip to the next one. You don’t have to inform the piece of paper you are reading that you want to navigate, you just do it. You look here, then there. You scan. You find what’s interesting. Visual interfaces excel because they let you throw away the unneeded information-chaff and focus on what you want to know. There is no analog in the audible world. When you’re listening, it takes substantially longer to know where you are and what you’re listening to. When using an audio interface, you are forced to be linear: a word follows the word that came before it and precedes the word after it. There is no way to get to the last word without hearing the two words before it. There’s no getting around it. It sucks.

An example: imagine you are using a conventional voicemail implementation on a computer with a standard display. There would be a button for skipping, a button for replaying, a button for saving, a button for deleting, a button for hearing the time and date the message was left. In short, all of the normal voicemail actions. If the designer was ambitious, they could even have included a widget to let you scrub through the message. When you want to delete a message, you’d roam the interface with your eyes, reading each button label in a fraction of a second, find the delete button, and click it. Simple and quick. The important point is that you are effortlessly flitting your eyes past all of the information you don’t want, to find the information you do want. In fact, a very common phone system, the cell phone, has a display too, and if its voice mail interface had some hint of humanity to it, it would at least show a visual menu telling you what button performed what action.

But instead, you have to wait for the voicemail system to tell you what number to press to delete the message. Yet before it tells you how to delete the message, it will first tell you how to replay the message, skip the message, move to the previous message, save the message for later, and perhaps force you to listen to an advertisement from your service provider. And because it’s audio and linear, you can’t skip any of it. The more complex voicemail gets, the longer you’ll have to wait. With an audio interface, you have no way of moving past the information you don’t want.
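To make the waiting concrete, here is a rough sketch of how the delay adds up in a spoken menu. The option names and the seconds each prompt takes to read aloud are invented for illustration; no real voicemail system is being measured here.

```python
# Hypothetical spoken menu: (option, seconds needed to read the prompt aloud).
# All durations are made-up numbers, chosen only to illustrate the point.
MENU = [
    ("replay", 3.0),
    ("skip", 2.5),
    ("previous", 3.5),
    ("save", 3.0),
    ("delete", 3.0),
]


def audio_wait_seconds(menu, wanted):
    """Seconds of prompts heard before the wanted option has been announced."""
    elapsed = 0.0
    for option, duration in menu:
        # Audio is linear: every earlier prompt must be heard in full.
        elapsed += duration
        if option == wanted:
            return elapsed
    raise KeyError(wanted)


def full_menu_seconds(menu):
    """Time to hear the whole menu; grows with every option that is added."""
    return sum(duration for _, duration in menu)


print(audio_wait_seconds(MENU, "delete"))
print(full_menu_seconds(MENU + [("advertisement", 10.0)]))
```

With these assumed numbers, reaching “delete” costs the sum of every prompt read before it, and tacking an advertisement onto the menu lengthens the full reading by exactly the ad’s duration. A visual menu has no equivalent running total, because the eye can land on any label first.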

The Candorville cartoon shown at the top of this post is funny because it illustrates a lose-lose situation in voicemail: if you have instructions read to you before every message, listening to your voicemail takes forever; if you don’t have the instructions read at all, you’ll never know what to do. It’s a Catch-22. And that’s the crux of the problem: there doesn’t exist a good way of providing instructions in purely audio interfaces. Sometimes a balance can be struck, but it will always be the best of a bad set of solutions. It will never even be good.

I know that I’ve made egregious mistakes because I didn’t want to wait for the instructions and I thought I remembered what button to push. But, do I press 7 to save a message and 9 to delete it? Or is it 9 to save and 7 to delete? Naturally, I remembered incorrectly. The moral of the story is that voice-based interfaces can cost you a date.

That’s not fair. “Talk to our computers like in Star Trek” means to me that you would say “Delete this message” or “Computer, when is my appointment with Sara?”. The computer would understand spoken commands in context and summarize relevant information. Then the problems you point out do not arise. Talking and listening, like looking, are natural human activities.

We are at least 100 years away from such technology, but we are making progress. Please do not discredit audio interfaces in general.


Oh, I should add that a good audio interface is more fitting than a visual one in some situations, for example when asking for directions while driving, or asking your car to auto-drive you to the next whiskey bar.

I agree with Pgan on this one, and let me explain why. Audio interfaces as they are now are only useful in certain situations, but mainly in ubiquitous computing (make a computer do something for you anywhere you are, such as in your house, with “turn off all the lights and appliances”, or to your PDA with “show me the nearest pub”, and so on).

Audio interfaces excel not as systems for general computing but at doing very specific things very quickly–no menus or virtual paths, just straight-to-the-point tasks. In fact, it’s my personal belief that the apex of interfaces will be an AI that understands every word you say and does it immediately–everything else (sifting through websites, browsing virtual galleries, etc) will be done just like they are today, only because they’re virtual (using augmented reality), we can move through the information/content much faster and in a natural manner.


Oh man, someone was telling me about some research or developments going on in audio interface design right now. This research involved the acronym TTH (Time To Human) to talk about how easy or frustrating audio systems were. Apparently T-Mobile’s customer service has the lowest (quickest) TTH out of any major cell phone company. I think I’ve never actually been frustrated with T-Mobile’s customer service or anything. They’ve been surprisingly easy to deal with.

Audio interfaces seem to be inherently linear… until you realize that human-to-human conversation can be purely audio, yet be nonlinear and interactive. Perhaps this is what AI will be able to mimic some day? In a conversation, you can interrupt the person and say “but what about xyz? that’s what I want to know” or whatever, analogous to how you would skim a visual text. If audio interfaces became good enough, they would allow for this kind of interruption and be able to process your requests with greater understanding than dudeguybot or smarterchild.

Although I think you’re really on to something with this bit–
“and if its voice mail interface had some hint of humanity to it, it would at least show a visual menu telling you what button performed what action.”

This works for voice mail, but not necessarily for every audio interface. Unless this voice mail visual menu would be sent to your phone as a whole other type of information (if not, the menu would have to come with the phone, or be installed on it or whatever). In that case, you’d have an entirely different type of cell phone technology on your hands (I think?), and every call you make could send you things like menus and all other kinds of multimedia, and this will probably happen when blackberry-like devices replace cell phones.

And I just read your last sentence, ouch… was that the catalyst for making this post?

(PS. Sorry this is such a long / rambling rant.)


Thanks for all of the insightful comments. I think I wasn’t as clear as I should have been.

I do not mean to say that audio interfaces form a bad method of input–at this they excel, especially in specific domains where visual input is cumbersome or dangerous, such as in the car or on devices too small for keyboards. But they will always lack a benefit that visual interfaces give for output–audio is fundamentally linear.

I’d argue that the ability to interrupt an audio interface to ask for new information does not mean that audio output is not linear. In visual output, the same organ that perceives the information can change what information it is perceiving. That is, the eye plays an active role in information processing. The same cannot be said for audio output. In order to change what is being heard, one must first think about what one wants to be hearing, form it into a sentence, and say it (with possible modifications and corrections). That is, the ear plays a passive role in information processing.

Now, this linearity can be compensated for: if voice recognition were perfect and computers could flawlessly pass the Turing test (both challenges are scores of years away, if not more) then indeed one could interrupt the computer asking for new pieces of information as in human-to-human communication. But, for now and in the near future, the natural language processing required for such interaction is truly science fiction. Finally, remember that we can speak faster than we type, but we can read faster than we speak.

And, as a side note, humans augment their communication with gesticulations, intonation, and eyebrow squiggles. Anyone who has been stymied trying to speak a foreign language over the phone can attest to just how important those non-audio factors are to understanding.


Needless to say, audio interfaces also lack spatial information. You can’t design and you can’t plan with an audio interface. So there will always be a place for visual (even in Star Trek).
Current audio interfaces are so fallible that in order to use them effectively you need some form of feedback. As you are currently talking to the interface, that feedback really has to be visual to not be distracting. Voicemail systems and the like, while presumably necessary, are not really the ideal place to develop audio interfaces. Currently, the place to develop them is on desktop machines and the like, which can provide visual feedback.
I’d be interested to know what the current speech-to-text capabilities are for phone-line quality conversations. I would love for all my messages to be displayed in an email-style fashion with summaries I can skip through, and then listen to each as I please. But I would be willing to gamble that too is a little way off.


I like voicemail. I like it, and I don’t make more mistakes with its menus than I do with visual menus. 7 is delete, 9 is save, 1 is hear messages. Hitting 7 twice while listening ends the message and deletes it, hitting 3 jumps ahead a few seconds in the message. Sure I’ve accidentally deleted a message I wanted to keep–I’ve also accidentally clicked “Okay” on a pop-up dialogue I meant to cancel. A lot of the ideas generated here sound cool, but I’m not really with you on the original post. Not everyone hates voicemail and voicemail systems. I for one am a living exception.

More than anything I just want my voicemail to have an “undo”….


I have a desktop interface to voicemail that’s integrated into my email client. It has the visual control interface (play, skip forward, backward, delete, …) and unread messages are highlighted like new email messages. This completely eliminates login and listening to instructions so that you get straight to listening to the messages.

An indispensable feature is speed adjustment. You can speed up (or slow down) a message, which saves time.

Deleted messages go into a trash can, so they can be restored if desired.

It’s tedious using the voicemail system directly and I rarely have to anymore!
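For anyone curious how little the trash-can behavior takes, here is a minimal sketch. The `Mailbox` class and its method names are hypothetical, made up for this example, and are not taken from any actual voicemail or email product.

```python
class Mailbox:
    """Toy mailbox (hypothetical): deleting moves a message to the trash."""

    def __init__(self, messages):
        self.inbox = list(messages)
        self.trash = []

    def delete(self, message):
        # Nothing is destroyed; the message only changes lists.
        self.inbox.remove(message)
        self.trash.append(message)

    def undo_delete(self):
        """Restore the most recently deleted message, or return None."""
        if not self.trash:
            return None
        message = self.trash.pop()
        self.inbox.append(message)
        return message


box = Mailbox(["call from Mom", "dentist reminder"])
box.delete("call from Mom")
box.undo_delete()
print(box.inbox)
```

In this sketch the restored message goes to the end of the inbox rather than back to its original position; a real client would presumably keep messages sorted by date.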


That is quite nifty! I had no idea such a thing actually existed. Now if only the carriers would beam it directly to my phone…

Joseph Huang

The post’s logic is invalid.

One current audio interface is bad.
Therefore all audio interfaces now and in the future are bad.

That’s analogous to saying:
One apple in this basket is bad.
Therefore all apples in this basket are bad.


Also, audio can be quite good for output, for things that are fundamentally linear, such as music and listening to a speech. Of course, combining it with flexible audio input would help for other tasks, such as booking airline reservations where the computer would ask you where you want to go, what time, etc. over the phone. I read a book about one implemented at Stanford or something like that.


Call me old fashioned, (hum rest of Bob Seger tune here if you must) – but I will NEVER get used to a machine that speaks in first person and tries to pretend that it is a sentient being and wants me to play along!





