Voice interfaces in electronic art

Talking to computers is an old human dream, and over the past decade speech synthesis and voice recognition technologies have reached enough accuracy and reliability to help make that dream come true. Voice interfaces can be applied in a wide range of scenarios, from art to medicine. One of their most important uses relates to accessibility and usability: they open options for interaction where vision cannot be used or does not suit the environment. In the arts, the conjunction of the internet and the open VoiceXML standard creates a new context for exploration and experimentation with voice interfaces. This paper describes the voice interface scenario from the point of view of art. From the first known artwork using voice to the current state of the art, voice has increasingly been used as an interface. We describe some artworks that are part of that history and briefly present the use of VoiceXML in the construction of the voice interface of the artwork Voice Mosaic.


INTRODUCTION
Voice interfaces, whether speech-only or multimodal, are a fascinating subject. The human dream of talking to computers in a natural way is not new. Science fiction books and movies that live in our imagination present several examples of this aspiration. We could mention, for instance, television shows and movies such as: a) Star Trek, in which the Enterprise's crew talk to the ship's systems and to androids such as Commander Data; b) Lost in Space, in which Will Robinson had in his robot a loyal and trusted friend; c) Star Wars, with its conversations and human interactions with the robots C-3PO and R2-D2; d) Blade Runner, with its androids and voice-driven interfaces (Perkowitz, 2004).
Until recently, talking to computers belonged to the realm of fiction, and the web has been largely mute and deaf. At the beginning of the 21st century, however, talking to computers became possible and easy thanks to enormous advances in speech synthesis and voice recognition technologies, as well as the open standards adopted by the W3C (such as VoiceXML). The accuracy now reached by voice technologies allows us to use them widely on the web.
Talking to computers adds 'ears' and 'mouths' to the Internet organism, changing the way we interact with it and bringing new possibilities as well as new challenges. We must face the increasing complexity that voice interfaces bring to the web while we also open new channels for digital inclusion, provide more accessibility and increase mobility through voice. All these things affect the human role inside the high-tech social structure we live in, causing excitement and fear at once.
In this context, as Hendrik Willem Van Loon once said (Loon, 1937), 'The arts are an even better barometer of what is happening in our world than the stock market or the debates in congress.' We believe that artworks help people to understand and experience the emergent techno-social world that surrounds us, where convergence and hybridization have become ubiquitous and easy, and where talking to computers is going to become common.
We therefore suggest that paying attention to artistic expression can yield valuable information. In this sense, analyzing the voice technologies and the types of voice interaction used in art is a good way to study their evolution and the changes in their creative use. The first known electronic artwork dealing with voice technologies emerged at the end of the 20th century (Gabriel, 2006), and since then we have seen several other interesting creations that use state-of-the-art voice recognition and synthesis on the web and the telephone.
The objective here is to present a brief panorama of artworks related to voice technologies, focusing on Voice Mosaic, a piece that allows voice interactions on the web through the telephone, dissolving borders and amplifying pervasiveness. The work was developed on the web using a voice interface with speech synthesis and voice recognition technologies based on VoiceXML. This artwork was exhibited around the world and received several awards. VoiceXML, the technology used in its development, is the open standard recommended by the W3C for voice applications. The final part of this paper therefore presents the main aspects of VoiceXML, using the development of Voice Mosaic as a basis.

ELECTRONIC ARTWORKS & VOICE
According to Gabriel (2006), the following artworks draw a brief panorama of interesting relationships between electronic art and voice, contextualizing the artistic development from the beginning (in the 1980s) up to now.
These artworks use voice in forms ranging from whistles and blows to speech synthesis and voice recognition. In order to track the involvement of voice technologies in electronic art, we present some works that are particularly creative or innovative. The information and images for each work were extracted from its respective website, and the URL is listed in the references.

Le Pissenlit (Couchot, 1996) - art installation
Main characteristic - blowing on a digital image.
The principle of this work consists of blowing on a digital image of a dandelion flower, which dissolves in different ways depending on how it is blown (see Figure 1). Although this work does not specifically use the voice itself, it involves breath, an integral part of generating vocal sounds. A newer version of this work from 2005 uses not only the blow itself but also the sound of blowing as input.
'Whistling is a communication primitive in most human languages - it is a kind of time travel to a less articulated state' (Böhlen, 2003). Based on that idea, this artwork proposes whistling as an alternative approach to human-machine interface design by allowing whistle synthesis and recognition. Whistles can be considered non-verbal vocal manifestations, and this work is particularly interesting because it investigates new forms of phoneme-less vocal interfaces. 'Whistling is much closer to the phoneme-less signal primitives compatible with digital machinery than the domain of spoken language' (Böhlen, 2003). Furthermore, this work raises questions not only about human-computer interaction but also about human-animal communication (see Figure 2).
In this work a microphone records the subject's voice and a live camera records his/her mouth. The voice is processed by a computer, and audio is output immediately according to the phonemic content and vocal intensity of the input sound. A video of the subject's mouth is shown as a secondary aspect of the work (see Figure 3). Since this work focuses on the phonemes and on the non-verbal aspects of the audio input, its results are interesting to analyze beyond traditional word recognition.
'Talk Nice is an interactive video installation. By exploring the use of upism in young women, the piece explores the way that people, especially young women, frame themselves by speaking up at the end of a sentence. The interactor is coached to talk up in order to interact with the teens (on the video)' (Zaag, 2000). By analyzing non-verbal aspects of the interaction such as the upism, this work targets a specific social group, and the experience of interacting with it is very successful. It also raises awareness of social characteristics related to non-verbal ways of communicating with the voice.

netsong (Alexander, 2000) - web art
Main characteristic - speech synthesis creating a song based on the web links returned by a web search.
This work performs a song based on a search engine robot. 'When provided a search term, the netsong bot will search for this term in a search engine, then choose a page from the search results and begin following links from that page. (…) Gathering text from each page it visits, the netsong bot savours the lyricality and poignant narrative of the web and begins to sing it' (Alexander, 2000). Besides probably being the first artwork to use speech synthesis on the web, it is notable for its relationship with search engines, one of the most important and influential interfaces of our Digital Era.

Talking Machine (Riches, 1990) - art installation
Main characteristic - speech synthesis created by a physical system that imitates the human vocal system.
'Talking Machine is an acoustic speech synthesizer. The speech sounds are produced using a flow of air and resonators just as in natural speech. The machine has 32 pipes, each one a simplified version of the human vocal tract (see Figure 4). They reproduce the spaces which are formed in the mouth, nose and throat when we speak. (…) The valves which control the flow of air are operated by a computer' (Riches, 1990). This work is particularly interesting because it imitates the human speech system in order to synthesize the voice, using blown air as the agent of the process, as happens in the human body.

Mesa di Voce (Levin, 2003) - art performance
Main characteristic - an algorithm that analyses speech, transforming it into images.
'Mesa di Voce is an artwork concerned with the poetic implications of making the human voice visible. A computer uses a video camera in order to track the locations of the performers' heads, and also analyses the audio signals coming from the performers' microphones. In response, the computer displays various kinds of visualizations on a projection screen behind the performers' (Levin, 2003) (see Figure 5). Besides its beauty, another remarkable aspect of this work is that the synthesized visualizations are tightly coupled to the sounds spoken and sung by the performers, connecting voice input (verbal and non-verbal) and images.

Inquiry Theater (Wilson, 1991) - art installation
Main characteristic - speech recognition interpreted by a virtual navigation system.
'Inquiry Theater is an installation where participants could take a virtual walk down Mission Street in San Francisco's ethnic Mission neighborhood. Speech recognition determined direction of movement and virtual entry into the stores' (Wilson, 1991). Although this work uses verbal speech recognition, its response is given through navigation. It was also one of the first works to use speech recognition in electronic art.

Voice Mosaic (Gabriel, 2004) - web art
Main characteristic - speech recognition and synthesis in a telephone interface integrated in real time with the web.
The Voice Mosaic is a web-art application developed in three languages (Portuguese, English and Spanish) that brings together speech and image, building a visual mosaic on the web (see Figure 6) from the chosen colours and recorded voices of people who interact with it from anywhere on the globe. The voice interface, developed with open standards for speech synthesis and voice recognition (VoiceXML), works through phone calls from any telephone, mobile or not. To participate in English, call in the US: (800) 289.5570 or (407) 386-2174 / PIN number: 9991421055. The mosaic is seen and heard on the web at http://www.voicemosaic.com.br. As people make phone calls to participate, choosing colours and recording free messages, they form the mosaic spontaneously, and it changes as time goes on. The ongoing aesthetics and final result are unpredictable.
In this context, the work causes a time-space collapse, mapping onto one screen the participations that come from several different geographical places, in different languages and at different times. Furthermore, using the search field, participants can easily locate their own contribution by searching for their phone number. One can also locate all tiles in the mosaic that share a telephone area code, which maps participation geographically onto the visual work.
The work brings together several dualities that do not oppose but complement each other: speech / image, simple / complex, old / new, low-tech / high-tech, time / space, individual / community, passive / active, expected / uncertain, among others, in order to prompt reflection on and awareness of talking to the web, media convergence and the hybridization of the telephone and the web.
Besides the work being available online for interaction, there are also two videos on YouTube explaining the artwork: http://www.youtube.com/watch?v=YUURctJYckM and http://www.youtube.com/watch?v=c_5tfTg8NqY.
Next, we will focus on the development of Voice Mosaic using VoiceXML.

VOICE INTERFACES
The artwork Voice Mosaic has two interfaces: the voice interface accessed by phone and the visual/aural web interface. As the web interface uses common and well-known technologies, we focus here on the voice interface, which is the core of the system.
The voice interface works via telephone (mobile or not) interacting with the web. It is developed with VoiceXML, a structured markup language that supports building dialogs. When accessed by phone, the interface uses a Voice Gateway, which provides voice recognition and speech synthesis during the conversation. Figure 7 shows the difference between accessing the application via phone and via web. During the interaction by phone the person talks to the interface, choosing a colour and recording a free speech message.
There are seven options available for choosing the colour. This number, seven, reflects the limit on the information that a person can hold in short-term memory. According to Miller (1956), as explained in Zakia (1997), 'There is a limit to the amount of unrelated information a person can hold in short-term memory (STM), from five to nine items, averaging seven. (…) Since we are limited in the amount of information we can retain correctly in STM, one should be cautious with the amount of information included in a multimedia program if it is going to have some memorable impact'.
The free speech message is limited to 15 seconds because of the web interface where it will be listened to: recordings longer than 15 seconds would generate WAV files larger than 100 KB, which is the maximum file size that allows a comfortable user experience, so that users can click on and listen to the mosaic tiles without waiting too long for playback to start.
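The relation between message length and file size can be checked with a short calculation. The sketch below assumes uncompressed telephone-quality audio (8 kHz, 8-bit, mono) and a 44-byte header; the encoding actually used in the work is not stated here, so these parameters and the function names are illustrative assumptions only.

```python
# Assumed encoding (not taken from the paper): 8 kHz, 8-bit, mono audio.
WAV_HEADER_BYTES = 44  # size of the canonical PCM WAV header


def wav_size_bytes(seconds, sample_rate=8000, bytes_per_sample=1, channels=1):
    """Approximate size of an uncompressed WAV file of the given duration."""
    data_bytes = int(seconds * sample_rate * bytes_per_sample * channels)
    return WAV_HEADER_BYTES + data_bytes


def max_seconds(size_limit_bytes, sample_rate=8000, bytes_per_sample=1, channels=1):
    """Longest recording that fits within a given file-size budget."""
    bytes_per_second = sample_rate * bytes_per_sample * channels
    return (size_limit_bytes - WAV_HEADER_BYTES) / bytes_per_second


print(wav_size_bytes(15))              # 120044 bytes for a 15-second message
print(round(max_seconds(100_000), 1))  # duration that fits in 100,000 bytes
```

Under these assumptions a 15-second message occupies about 120 kB, the same order of magnitude as the limit quoted above; a different encoding would shift the figures.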
The voice interface was designed using both a prerecorded human voice (in the welcome message) and synthesized text-to-speech voices to instruct the user, so that participants experience the differences and similarities between the two. The interface also uses both touch-tone and speech input, placing voice recognition (a human-like feature) side by side with touch-tone recognition (a machine-like feature) to prompt reflection on the two ways of interacting by phone: talking and dialling.
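A dialog of this kind could be expressed in VoiceXML roughly as follows. The Python sketch below only generates an illustrative document and is not the code of Voice Mosaic: the colour names, prompt texts, audio file name and submit URL are hypothetical, while the elements used (form, block, prompt, audio, field, option, record, submit) belong to VoiceXML 2.0.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical values: the paper does not list the colours, prompts or URLs.
COLOURS = ["red", "orange", "yellow", "green", "blue", "purple", "white"]
SUBMIT_URL = "http://example.org/mosaic/add"
WELCOME_AUDIO = "welcome.wav"


def q(tag):
    """Qualify a tag name with the VoiceXML namespace."""
    return f"{{{VXML_NS}}}{tag}"


def build_dialog(colours, submit_url, welcome_audio, max_seconds=15):
    """Return a VoiceXML document (as a string) for the colour/message dialog."""
    ET.register_namespace("", VXML_NS)
    vxml = ET.Element(q("vxml"), {"version": "2.0"})
    form = ET.SubElement(vxml, q("form"), {"id": "mosaic"})

    # Welcome: prerecorded human voice; the text is the synthesized fallback.
    welcome = ET.SubElement(ET.SubElement(form, q("block")), q("prompt"))
    audio = ET.SubElement(welcome, q("audio"), {"src": welcome_audio})
    audio.text = "Welcome."

    # Colour choice: every option can be spoken or selected by touch tone.
    field = ET.SubElement(form, q("field"), {"name": "colour"})
    menu = " ".join(f"{c}, or press {i}." for i, c in enumerate(colours, 1))
    ET.SubElement(field, q("prompt")).text = "Say a colour. " + menu
    for i, colour in enumerate(colours, 1):
        option = ET.SubElement(
            field, q("option"), {"dtmf": str(i), "value": colour}
        )
        option.text = colour

    # Free speech message, cut off after max_seconds.
    record = ET.SubElement(
        form,
        q("record"),
        {"name": "message", "beep": "true", "maxtime": f"{max_seconds}s"},
    )
    ET.SubElement(record, q("prompt")).text = "Record your message after the beep."

    # Send the colour and the recording to the web application.
    ET.SubElement(
        ET.SubElement(form, q("block")),
        q("submit"),
        {
            "next": submit_url,
            "namelist": "colour message",
            "method": "post",
            "enctype": "multipart/form-data",
        },
    )
    return ET.tostring(vxml, encoding="unicode")


print(build_dialog(COLOURS, SUBMIT_URL, WELCOME_AUDIO))
```

Each colour is offered both as a spoken word and as a key press, which mirrors the side-by-side use of speech and touch-tone input, and the recording length is capped through the maxtime attribute of the record element.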
To allow data visualization, either by tracking or by locating the interactions in the visual mosaic, the voice interface records the Caller ID phone number. With this information we can tell where in the world the interactions come from and also locate all the interactions from a specific area code. This reveals the space collapse in the mosaic on the web.
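As an illustration of locating tiles by area, the sketch below groups mosaic entries by the leading digits of the recorded number. It assumes national numbers with a three-digit area code and no country prefix, and the tile structure and sample numbers are invented for the example; they do not describe the actual database of the work.

```python
from collections import defaultdict


def digits_only(phone_number):
    """Strip punctuation and spaces, keeping only the digits."""
    return "".join(ch for ch in phone_number if ch.isdigit())


def area_code(phone_number, length=3):
    """Leading digits of the number (assumed here to be the area code)."""
    return digits_only(phone_number)[:length]


def group_by_area(tiles):
    """Map each area code to the list of tiles recorded from it."""
    groups = defaultdict(list)
    for tile in tiles:
        groups[area_code(tile["caller_id"])].append(tile)
    return dict(groups)


def find_by_number(tiles, phone_number):
    """Return the tiles recorded from one specific phone number."""
    wanted = digits_only(phone_number)
    return [t for t in tiles if digits_only(t["caller_id"]) == wanted]


# Hypothetical sample data, invented for this example.
TILES = [
    {"caller_id": "(407) 555-0101", "colour": "blue"},
    {"caller_id": "407 555 0102", "colour": "red"},
    {"caller_id": "(212) 555-0147", "colour": "green"},
]

by_area = group_by_area(TILES)
print(sorted(by_area))       # area codes present in the sample
print(len(by_area["407"]))   # number of tiles from area 407
```

Normalizing the number to digits before comparing lets a search succeed regardless of how the caller or the visitor formats the number.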
The phone calls, through the voice interface, are the way the data (and people) enter the Voice Mosaic on the web.No data enters the work via its web interface, which is used only for purposes of data visualization, interpretation and reflection.

CONCLUSION
While the internet is becoming the most important communication channel in the world, the web has so far been practically deaf and mute: it still has not been able to deal properly with speech, be it natural language processing or simple voice commands. On the other hand, voice technologies only reached enough accuracy and reliability for large-scale use at the beginning of the 21st century, finally bringing to the surface the possibility of using them on the web.
In this context, the VoiceXML language can be used as an open standard for developing voice interfaces, and Voice Mosaic is an artwork that allows people to experiment with and reflect on the possibility of 'talking to the web', with its new benefits and complexities.
We think that, from now on, the available technologies integrating the web and the telephone will make wider and deeper experimentation with voice interfaces possible. We expect that this will allow us all to break frontiers and go further in artistic and human possibilities and developments.

Figure 7: Accessing the web by phone using VoiceXML