Category Archives: Interaction Design

Developing speech applications

Personal background
Application design
Tips on using Vxml 2.1
- Creating grammars
  - Defining the recognition result
  - Managing ambiguity
- Dynamic prompts and Web queries

Personal background

The idea of controlling technology by telling it what to do has been compelling for a long time. In fact, when I was part of a “voice portal” startup in 1999-2001 (Quack.com, which rolled into AOLbyPhone 2001-2004 or so), there was a joking acknowledgement that the tech press announces “speech recognition is ready for wide use” about every ten years, like clockwork. Our startup launched around the third such crest of press optimism. And like movies on similar topics that release the same summer, there was a new crop of voice portal startups at the time (e.g., TellMe and BeVocal). Like the browser wars of a few years earlier between Netscape and IE, in which they’d pull pranks like sneaking their logo statue into the competitor’s office yard, we spiked TellMe car antennas with rubber ducks in their parking lot. Those were crazy, fun days when the corner of University and Bryant in Palo Alto seemed like the center of the universe, long hours in our little office felt like a voyage on a Generation Ship, and pet food was bought online. And a little more than a decade later, Apple has bought Siri to do something similar, and Google and Microsoft have followed.

The idea that led to our startup was wanting to help people compare prices with other stores while viewing products at a brick-and-mortar store. Mobile phones then had poor cameras and browsers, so the most feasible interaction method was to verbally select a product type, brand, and model by first calling an Interactive Voice Response (IVR) service. But a web startup needs more traffic than once-a-week shopping, so other services were added such as movie showtimes, sports scores, stock quotes, news headlines, email reading and composing (via audio recording), and even restaurant reviews. This was before VoiceXML reached v1.0 and we used a proprietary xml variant developed in-house alongside our Microsoft C++-based servers. We were the first voice portal to launch in the US, and that was with all services except the price comparison feature that was our initial motivation. It hasn’t reappeared on any voice portal since, that I know of.

As any developer knows, building on standards often provides many advantages. Once VXML 1.0 appeared, I wanted to see if we could migrate to it, so I bought a Macintosh G4 with OS X v1 when the Apple store first opened in Palo Alto, and used the Java “jar” wrappers for its speech recognition and generation features to prototype a vxml “browser”. When it supported 40% of the vxml spec, I shared it with our startup, recently bought by AOL, but they passed. I stopped work on it and released it as open-source through the Mozilla Foundation (see vbrowse.mozdev.org).

More than a decade later, markup-based solutions like vxml still seem like the most productive way of creating speech-driven applications (compared to, say, creating a Windows-only application using Dragon NaturallySpeaking).

Application design

State-of-the-art web applications tend to adopt the Model-View-Controller design pattern, where the model is a JSON finite-state machine representation of all the states (e.g. ViewingInbox, ComposingMessage) supported, and JavaScript is used as the controller to create DOM/HTML views, handle user actions, and manage data transfers with the server. This is also the pattern of newer W3C specs like SCXML that aim to support “multi-modal” interactions such as requesting a mapped location on one’s mobile phone by speaking the location (i.e., speech is one mode) and having the map appear in the browser (i.e., browser items are another mode). As “pervasive computing” develops and is able to escape the confines of mobile phones and laptops, additional modes needing support are likely to be, first, recognizing that the user is pointing to something and resolving what the referent is, and secondly, tracking the gaze of the user and recognizing what it’s fixated upon, as a kind of augmented reality hover gesture. Implementing and integrating such modes is part of my interest in the larger topic of intention perception; if you are interested in how these modes fit into a larger theoretical context, I highly recommend the entry on Theory of Meaning (and Reference) in the Stanford Encyclopedia of Philosophy, and Herb Clark’s book “Using Language”.

Vxml is up to v3.0 now, and it might support integration with these non-speech modes. But vxml 2.0 and 2.1 are supported more widely, and creating applications with them that follow the design pattern above requires a bit of thinking. The remainder of this article will share my thoughts and discoveries about how to do that with an excellent freemium platform, Voxeo.com.

Tips on Using Vxml 2.1

Before attempting to create a vxml application, I strongly recommend getting a book on the topic or reading the specs online. But as a quick overview, think of conversations as pairs of turns in which one person already has in mind how the other person might respond to what he is about to say, he then says it, and usually allows the other person to interrupt, and as long as the other person says something and it’s intelligible, the speaker will respond with another turn. Under this description, the speaker’s turn gets most of the attention, but the respondent’s turn usually determines what happens next. Each such pair can be conceived of as a state in a finite-state machine, where all the speaker’s reactions to the respondent correspond to transitions out of those states.
To implement such a set of states in vxml 2.0 or 2.1, one can create a single text document (aka “Single Page Application (SPA)”) with this as a start,

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE vxml SYSTEM "http://www.w3.org/TR/voicexml21/vxml.dtd">
<vxml version="2.1">
</vxml>

and then for each state, insert a variant of the following between the ‘vxml’ tags:

<form id="fYourNameForTheState">
</form>

To implement each state, add a variant of the following within each ‘form’ element,

<field name="ffYourNameForTheState">
  <grammar mode="voice" xml:lang="en-US" root="gYourNameForTheState" tag-format="semantics/1.0">
    <rule id="gYourNameForTheState">
      ...All the things the speaker might expect the respondent to say that are on-task...
    </rule>
  </grammar>
  <prompt timeout="10s">...The primary thing the speaker wants to tell the respondent in this state, perhaps a question...</prompt>
  <noinput>
    <prompt>...What to say if the prompt finishes and the respondent is silent all through the timeout duration...</prompt>
  </noinput>

  <nomatch>
    <prompt>...What to say as soon as any mismatch is detected between what the respondent is saying and what the speaker was expecting in the grammar; "I didn't get that" is a good choice...</prompt>
  </nomatch>

  <filled>
    <if cond="ffYourNameForTheState.result &amp;&amp; (ffYourNameForTheState.result == 'stop')">
      <goto next="#fWaitForInstruction"/>
    <elseif cond="ffYourNameForTheState.result &amp;&amp; (ffYourNameForTheState.result == 'shutdown')" />
      <goto next="#fGetConfirmationOfShutdown"/>
    <else />
      <assign name="_oInstructionHeard" expr="ffYourNameForTheState"/> <!-- Assumes _oInstructionHeard was declared outside this form in a 'var' or 'script' -->
      <goto next="#fGetConfirmationOfInstruction"/>
    </if>
  </filled>
</field>

We’ll discuss grammars in more depth below, and the rest of the template is largely self-explanatory. But a few minor points:

If you need to recognize only something basic like yes-or-no or digits in a form, then you can remove the ‘grammar’ element and instead add one of these to the ‘field’ element:
- type="boolean"
- type="digits"
- type="number"
Grammars can appear outside ‘field’ as a child of ‘form’, but then they are active in all fields of the form. There are cases in which doing so is good design, but it’s not the usual case.
The only element that “needs” a timeout for the respondent being silent is ‘noinput’; yet, the attribute is required to be part of ‘prompt’ instead.
‘goto’s can go to other fields in the same form, or different forms, but not to a specific field of another form.

I’ve made the ‘filled’ part less generalized than the other parts to illustrate a few points:

The contents of the ‘filled’ element is where you define all of the logic about what to do in response to what the respondent has said.
Although I’ve indented if-elseif-else to highlight their usual semantic relation to each other, you can see that actually ‘if’ contains the other two, and that ‘elseif’ and ‘else’ don’t actually contain their then-parts (which is somewhat contrary to the spirit of XML).
The field name is treated as if it contains the result of speech recognition (because it does), and it does so as a JavaScript object variable that has named properties.
The field variable is lexically scoped to the containing form, so if you want to access the results of speech recognition in another form (perhaps after following a ‘goto’), then you first must have a JavaScript variable whose scope is outside either of the forms, and assign it the object held by the field variable.
A boolean AND in a condition must be written as &amp;&amp; rather than a bare && to avoid confusing the XML parser. (You might want to try wrapping the condition as CDATA if this really bugs you.)
Form id’s can be used like html anchors, so a local url for referencing a form starts with url fragment identifier ‘#’ followed by the form’s id.

Note that it’s not necessary to start form id’s with “f”, or fields with “ff”, or grammars with “g”, nor is it necessary to repeat names across them like I do here. But I find that simplifying this way helps keep the application from seeming over-complicated.

Creating grammars

To implement the grammar content indicated above by placeholder text, “…All the things the speaker might expect the respondent to say that are on-task…,” one provides a list of ‘one-of’ and ‘item’ elements. ‘one-of’ is used to indicate that exactly one of its child items must be recognized. ‘item’ has a ‘repeat’ attribute that takes such values as “0-1” (i.e., can occur zero or one times), “0-” (i.e., can occur zero or more times), “1-” (i.e., can occur one or more times), “7-10” (i.e., can occur 7 to 10 times), and so on. ‘item’ takes one or more children which can be any permutation of ‘item’ and ‘one-of’ elements, which can have their own children, and so on. The children of a ‘rule’ or ‘item’ element are implicitly treated as an ordered sequence, so all the child elements must be recognized for the parent to be recognized. (This formalism might remind you of Backus-Naur Form (BNF) for describing a context-free grammar (CFG). If you need a grammar more expressive than a CFG, you’ll have to impose the additional constraints in post-processing that follows speech recognition.)
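To make the repeat ranges concrete, here is a small JavaScript sketch of how such a value could be read as a minimum and maximum count. The function names are my own, not part of any spec or platform API, and a real recognizer handles this internally; the sketch only illustrates the notation.

```javascript
// Hypothetical helper: turn an SRGS-style repeat value into explicit bounds.
// "0-1" means 0 to 1 times, "1-" means 1 or more, "7-10" means 7 to 10,
// and a bare number such as "3" means exactly that many times.
function parseRepeat(value) {
  var match = /^(\d+)(-(\d*))?$/.exec(String(value).trim());
  if (!match) {
    throw new Error("Not a valid repeat value: " + value);
  }
  var min = Number(match[1]);
  var max;
  if (match[2] === undefined) {
    max = min;               // "3": exactly n times
  } else if (match[3] === "") {
    max = Infinity;          // "1-": open-ended upper bound
  } else {
    max = Number(match[3]);  // "7-10": closed range
  }
  if (max < min) {
    throw new Error("Upper bound is below lower bound: " + value);
  }
  return { min: min, max: max };
}

// Hypothetical helper: would 'count' occurrences satisfy the repeat value?
function countAllowed(value, count) {
  var bounds = parseRepeat(value);
  return count >= bounds.min && count <= bounds.max;
}

console.log(parseRepeat("0-1"));
console.log(countAllowed("7-10", 8));
console.log(countAllowed("1-", 0));
```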

If the contents of the grammar rule take up more than about five lines, it’s good practice like in other coding languages to modularize that content into an external file. Each such grammar module is declared within an inline ‘item’ like this,

<grammar mode="voice" xml:lang="en-US" root="gGetCommand" tag-format="semantics/1.0">
  <rule id="gGetCommand">
    <one-of>
      <item>
        <ruleref uri="myCommandLanguage.srgs.xml#SingleCommand" type="application/grammar-xml"/>
      </item>
      <item>
        <ruleref uri="myCommandStop.srgs.xml#Stop" type="application/grammar-xml"/>
      </item>
    </one-of>
  </rule>
</grammar>

and the external grammar file should have this form:

<?xml version= "1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                         "http://www.w3.org/TR/speech-grammar/grammar.dtd">
<grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" tag-format="semantics/1.0" root="SingleCommand" >
  <rule id="SingleCommand" scope="public">
    ...A sequence of 'one-of' and 'item' elements describing single commands you want to support...
  </rule>
  <rule id="SubgrammarOfSingleCommand" scope="public">
    ...Details about a particular command that would take too much space if placed inside the SingleCommand rule...
  </rule>
</grammar>

Defining the Recognition Result

Human languages usually allow any intended meaning to be phrased in several ways, so useful speech apps need to accommodate this by providing as many expected paraphrases as seem likely to be used. So, a grammar often has several ‘one-of’s to accommodate paraphrases. A naive approach for a speech app would be to provide such paraphrases in the grammar, and take recognition results in their default format of a single string, and then try to re-parse that string with JavaScript case-switch-logic similar to the ‘one-of’s in the markup — a duplication of work (ugh) with the attendant risk that the two will eventually fall out of sync (UGH!). What would be much preferred would be to retain the parse structure of what’s recognized and return that instead of a (flat) string; in fact, this is just what the “semantic interpretation” capability of vxml grammars offers. To make use of this capability, a few things are needed (these may be Voxeo-specific):

The ‘grammar’ elements in both the vxml file and the external grammar file(s) must use attributes tag-format="semantics/1.0" plus root="yourGrammarsRootRuleId"
‘tag’ elements must be placed in the grammars (details on how below), and they must assume there is a JSON object variable named ‘out’ to which you must assign properties and property-values. If instead you assign a string to ‘out’ anywhere in your grammar, then recognition results will revert to flat-string format.
If using Voxeo, ‘ruleref’ elements that refer to an external grammar must use attribute type="application/grammar-xml", which doesn’t match the type suggested by the vxml2.0 spec, “application/srgs+xml” (see http://www.w3.org/TR/speech-grammar/#S2.2.2).

To use ‘tag’ elements for paraphrases, one can do this,

<rule id="Stop" scope="public">
  <one-of>
    <item>stop</item>
    <item>quit</item>
  </one-of>
  <tag>out.result = 'stop'</tag>
</rule>

in which the ‘result’ property was chosen by me, but could have been any legal JSON property name. The only real constraint on the choice of property name is that it make self-documenting sense to you when you refer to it elsewhere to retrieve its value.

‘tag’ elements can also be children of ‘item’s, which makes them a powerful tool for structuring the recognition result. For example, a grammar rule can be configured to create a JSON object:

<rule id="ParameterizedAction" scope="public">
  <one-of>
    <item>
      <one-of>
        <item>drill</item>
        <item>bore</item>
      </one-of>
      <ruleref uri="#DrillSpec"/>
      <tag>
        out.action = 'drill';
        out.measure = rules.latest().measure;
        out.units = rules.latest().units;
      </tag>
    </item>
    ...
  </one-of>
</rule>

In this example, we rely on knowing that the “DrillSpec” rule returns a JSON object having “measure” and “units” properties, and we use those to create a JSON object that has those properties plus an “action” property.

‘tag’ elements can also be used to create a JSON array:

<rule id="ActionExpr" scope="public">
  <tag>
    out.steps = [];
    function addStep(satisfiedParameterizedActionGrammarRule) {
      var step = {};
      step.action = satisfiedParameterizedActionGrammarRule.action;
      step.measure = satisfiedParameterizedActionGrammarRule.measure;
      step.units = satisfiedParameterizedActionGrammarRule.units;
      out.steps.push(step);
    }
  </tag>
  <item>
    <ruleref uri="#ParameterizedAction"/>
    <!-- This use of rules.latest() should work according to http://www.w3.org/TR/semantic-interpretation/#SI5 -->
    <tag>addStep(rules.latest())</tag>
  </item>
  <item repeat="0-">
    <item>
      and
      <item repeat="0-1">then</item>
    </item>
    <ruleref uri="#ParameterizedAction"/>
    <tag>addStep(rules.latest())</tag>
  </item>
</rule>

These object- and array-construction techniques can be used in other rules that you reference as sub-grammars of these, allowing you to create a JSON object that captures the complete logical parse structure of what is recognized by the grammar.
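To picture the result, here is a plain JavaScript sketch that mimics what the ‘tag’ scripts in the “ActionExpr” rule build up. The buildActionExpr function is just my stand-in for the recognizer calling addStep once per matched “ParameterizedAction”, and the measure and units values are made up, since the “DrillSpec” rule isn’t shown here.

```javascript
// Stand-in for the recognizer: call the equivalent of addStep() once per
// matched ParameterizedAction and collect the results in out.steps.
function buildActionExpr(matchedActions) {
  var out = { steps: [] };
  for (var i = 0; i < matchedActions.length; i++) {
    var matched = matchedActions[i];
    out.steps.push({
      action: matched.action,
      measure: matched.measure,
      units: matched.units
    });
  }
  return out;
}

// Hypothetical matches for an utterance with two actions joined by
// "and then". Note that "bore" is reported as action 'drill', because
// the grammar maps both paraphrases to the same value.
var result = buildActionExpr([
  { action: "drill", measure: 5, units: "millimeters" },
  { action: "drill", measure: 2, units: "inches" }
]);

console.log(JSON.stringify(result, null, 2));
```

The point is that the application receives one object with an ordered “steps” array, rather than a flat string that has to be parsed again.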

By the way, if you want to use built-in support for recognizing yes-or-no, numbers, dates, etc as part of a custom grammar, then you’ll need to use a ‘ruleref’ like this,

<rule id="DepthSpec" scope="public">
  <item>
    <ruleref uri="#builtinNumber"/>
    <tag>out.measure = rules.builtinNumber</tag>
  </item>
</rule>
<rule id="builtinNumber">
  <item>
    <ruleref uri="builtin:grammar/number"/>
  </item>
</rule>

URI’s for other types can be inferred from the “grammar src” examples at http://help.voxeo.com/go/help/xml.vxml.grxmlgram.builtin (although these might be specific to the Voxeo vxml platform).

If you follow this grammar-writing approach, then you can access the JSON-structured parse result by reading property values from the field variable containing the grammar (e.g., “ffYourNameForTheState” above), just as if it were the “out” variable of your root grammar rule that you’ve been assigning to. These values can be used in ‘filled’ elements either to guide if-then-else conditions, or be sent to a remote server as we’ll see in the next major section, “Dynamic prompts and Web queries”.

Managing ambiguity

As a side note, if you’re an ambiguity nerd like me, you’ll probably be interested to know that Vxml 2.0 doesn’t specify how homophones or syntactic ambiguity must be handled. But Voxeo provides a way to get multiple candidate parses.

Dynamic prompts and Web queries

So far, we can simulate one side of a canned conversation via a network of expected conversational states. It’s similar to a Choose-Your-Own-Adventure book in that it allows variety in which branches are followed, but it’s “canned” because all the prompts are static. But often we need dynamic prompts, especially when answering a user question via a web query. JavaScript can be used to provide such dynamic content by placing a ‘value’ element as a child of a ‘prompt’ element, and placing the script as the value of ‘value’s ‘expr’ attribute, like this:

<assign name="firstNumberGiven" expr="100"/> <!-- Simulate getting a number spoken by the user -->
<assign name="secondNumberGiven" expr="2"/> <!-- Simulate getting a number spoken by the user -->
<prompt>The sum of <value expr="firstNumberGiven"/> and <value expr="secondNumberGiven"/> is <value expr="firstNumberGiven + secondNumberGiven" /> </prompt>

The script can access any variable or function in the lexical scope of the ‘value’ element; that is, any variable declared in a containing element (or its descendants that appear earlier). Also notice that, by default, adjacent digits from a ‘value’ element are read as a single number (e.g., “one hundred and two”) rather than as digits (e.g., “one zero two”). That’s convenient, because one can’t embed a ‘say-as’ element in the ‘expr’ result, although one can force pronunciation as digits by inserting a space between each digit (e.g., “1 0 2”) perhaps by writing a JavaScript function (see http://help.voxeo.com/go/help/xml.vxml.tutorials.java); otherwise, if the default were to pronounce as digits, then forcing pronunciation as a single number would require a much more complicated function.
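As a minimal sketch of such a function (the name asSpokenDigits is my own, and I’m assuming the interpreter reads space-separated digits one at a time, as described above):

```javascript
// Hypothetical helper: put a space between the digits of a value so that a
// text-to-speech engine reads "1 0 2" rather than "one hundred and two".
// Any non-digit characters (signs, decimal points, letters) are dropped.
function asSpokenDigits(value) {
  return String(value).replace(/\D/g, "").split("").join(" ");
}

console.log(asSpokenDigits(102));
```

It could then be called from a ‘value’ element, for example with expr="asSpokenDigits(firstNumberGiven + secondNumberGiven)", provided the function is declared in a ‘script’ element that is in scope.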

I’ve said little to nothing about interaction design in speech applications, although it’s very important to get right, as anyone who’s become frustrated while using a speech- or touchtone-based interface knows well. But one principle of interaction design that I will emphasize is that user commands should usually be confirmed, especially if they will change the state of the world and might be difficult to undo. When grammars are configured to return flat-string results, prompting for confirmation is easy to configure like this:

<prompt>I think you said <value expr="recResult"/> Is that correct? </prompt>

But when a grammar is configured to return JSON-structured results, the ‘value’ element above might be read as just “object object” (the result of JavaScript’s default stringify method for JSON objects, at least in Voxeo’s vxml interpreter). I believe the best solution is to write a JavaScript function (in an external file referenced with a ‘script’ element near the top of the vxml file) that is tailored to construct a string meaningful to your users from your grammar’s JSON structure, then wrap the “recResult” variable (or whatever you name it) in a call to that function. If there is any need to nudge users toward using terms that are easier for your grammar to recognize, then this custom stringify function is an opportunity to paraphrase their commands back to them using your preferred terms.
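Here is a rough sketch of what such a custom stringify function might look like. It assumes the recognition result has a “steps” array whose entries carry “action”, “measure”, and “units” properties; the function name and the joining phrase are only illustrative, and your own grammar’s structure will dictate the details.

```javascript
// Hypothetical custom stringify: build a phrase a user can confirm from a
// JSON-structured recognition result of the form { steps: [ {action, measure, units}, ... ] }.
function describeInstruction(recResult) {
  var steps = (recResult && recResult.steps) || [];
  var phrases = [];
  for (var i = 0; i < steps.length; i++) {
    // The preferred term for each action is echoed back, whatever
    // paraphrase the user actually spoke.
    phrases.push(steps[i].action + " " + steps[i].measure + " " + steps[i].units);
  }
  return phrases.join(", and then ");
}

console.log(describeInstruction({
  steps: [
    { action: "drill", measure: 5, units: "millimeters" },
    { action: "drill", measure: 2, units: "inches" }
  ]
}));
```

The confirmation prompt would then use something like expr="describeInstruction(recResult)" in its ‘value’ element.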

Now we’re ready to talk about sending JSON-structured recognition results to remote servers, which is the most exciting feature of vxml 2.1 for me, because it’s half of what we need to make vxml documents able to leverage the same RESTful web APIs that dhtml documents can (the other half, being able to digest the server’s response, will be discussed soon; “dhtml” === “Dynamic HTML”, which is a combination of html and JavaScript fortunate enough to find itself in a browser that has JavaScript enabled). Like html forms, vxml provides a way for its forms to submit to a remote server. And also like html, the response must be formatted in the markup language that was used to make the submission, because the response will be used to replace the document containing the requesting form. Html developers realized that their apps could be more responsive if less content needed to travel to and from the remote server, and that if they instead requested just the gist of what they needed, and the response encoded that gist in a markup-agnostic format like XML or JSON, then JavaScript could be used in their browser-based client to manipulate the DOM of the current document and that might usually be faster than requesting an entirely new document (even if most of its resources could be externalized into JavaScript and CSS files that can be cached). Because these markup-agnostic APIs are becoming widely available, they present an opportunity for non-html client markup languages like vxml to leverage them. Vxml developers created a way to leverage these APIs by adding a ‘data’ element alternative to vxml form submission in the vxml 2.1 spec. Here’s an example:

<var name="sInstructionHeard" expr="JSON.stringify(_oInstructionHeard)"/>
<data method="post"
      srcexpr="_sDataElementDestinationUrl + '/executeInstructionList'"
      enctype="application/x-www-form-urlencoded"
      namelist="sInstructionHeard"
      fetchhint="safe"
      name="oRemoteResponse"
      ecmaxmltype="e4x" />

The ‘data’ element isn’t as user-friendly as it might be. For example, one can’t just put the JSON-structured recognition result in it and expect it to be transferred properly; instead, one must first JSON.stringify() it (this method is native to most dhtml browsers circa 2014 and to Voxeo’s vxml interpreter). And the ‘data’ element requires that even POST bodies be url-encoded, so the remote server must decode using something like this (assuming you’re using a NodeJs server):

sBody = decodeURIComponent(sBody.replace(/\+/g, ' '));
sBody = sBody.replace('sInstructionHeard=',''); //Strip off the parameter name (it must match the 'namelist' of the 'data' element) to leave the bare value
sBody = (sBody ? JSON.parse(sBody) : sBody);
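On a NodeJs version that provides the built-in URLSearchParams class, the same decoding can be sketched without manual string surgery. The function name here is my own; the parameter name comes from the ‘namelist’ attribute of the ‘data’ element above.

```javascript
// Sketch: decode a url-encoded POST body and parse the JSON held in the
// 'sInstructionHeard' parameter. Returns null when the parameter is absent.
function parseInstructionBody(sBody) {
  // URLSearchParams handles both percent-encoding and '+' for spaces.
  const params = new URLSearchParams(sBody);
  const raw = params.get("sInstructionHeard");
  return raw ? JSON.parse(raw) : null;
}

// Simulated request body, encoded the way a form submission would be.
const exampleBody =
  "sInstructionHeard=" + encodeURIComponent(JSON.stringify({ result: "stop" }));

console.log(parseInstructionBody(exampleBody));
```

Looking the value up by name also keeps working if more variables are later added to the ‘namelist’.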

What the remote server needs to do for its response is easier:

oResponse.writeHead(200, {'Content-Type': 'text/xml'});
oResponse.end('<result><summaryCode>stubbedSuccess</summaryCode><details>detailsAsString</details></result>');

If the server is reachable and generates a response like this, then the variable above that I named “oRemoteResponse” will be JSON-structured and have a ‘result’ property, which itself will have ‘summaryCode’ and ‘details’ properties whose values, in this case, are string-formatted. You have the freedom to use any valid XML element name — which is also a valid JSON property name — in place of my choice of ‘result’. The conversion from the remote server’s XML formatted response to this JSON structure is done implicitly by the vxml interpreter due to the ecmaxmltype="e4x" attribute. (The vxml 2.1 interpreter cannot process a JSON-formatted response as dhtml browsers can.) These JSON properties from the remote server can be used to control the flow of conversation among the ‘form’s in the same way we used JSON properties from “semantic” speech recognition earlier. Coolness!

A few final comments about ‘data’ elements:

To validate the xml syntax of your app, you probably want to upload it to the W3C xml validator; however, the ecmaxmltype="e4x" attribute is apparently not part of the vxml 2.1 DTD, which the validator finds at the top of your file if you’re following my template above, and so you will get a validation error that you’ll have to assume is spurious and ignorable.
My app uses a few ‘data’ elements to send different kinds of requests, so to keep the url of the remote server in-sync across those, I have a ‘var’ element before all my forms in which I define the _sDataElementDestinationUrl url value.
fetchhint="safe" disables pre-fetching, which isn’t useful for dynamic content like the JSON responses we’re talking about
If you want to enable caching, which doesn’t make sense for dynamic JSON content like we’ve been talking about but would be reasonable for static content, you’d do that via your remote server’s response headers.
If the remote server isn’t reachable, the ‘data’ element will throw an ‘error.badfetch’ that can be caught with a ‘catch’ element to play a prompt or log error details, but unfortunately this error is required by the spec to be “fatal” which appears to mean the app must exit (in vxml terms, I believe it means the form-interpretation algorithm must exit). That’s a more severe reaction than in dhtml which allows DOM manipulation and further http requests to continue indefinitely. Requiring such errors to be fatal blocks such potential apps as a voice-driven html browser that reads html content, because it could not recover from the first request that fails. But maybe I’m interpreting “fatal” wrong; Voxeo’s vxml interpreter seems to allow interaction to continue indefinitely if this error is caught with a ‘catch’ element that precedes a ‘form’.
If the remote server is reachable but must return a non-200 response code, the ‘data’ element will throw ‘error.badfetch.DDD’ where DDD is the response code. This error is also “fatal”.

At this point, we’ve covered all that I think is core to authoring a speech application using vxml 2.1. For more details, the vxml 2.0 spec and its 2.1 update are the authoritative references. Voxeo’s help pages are also quite useful.

Up next: Test-driven development of speech applications, and Hosting a speech app using Voxeo.

A font for eco-friendly printing

So you already print double-sided or reuse single-sided prints? You can go even further in your quest for eco-friendly printing.

A font has been developed that reduces the amount of toner used while minimizing loss of readability. The download page includes tips on how to install on a variety of platforms, and here’s a tip for installing a font in Windows 7. Note that after clicking the “Install” button for a font, there is no indication of success beyond the Install button becoming disabled — although you can open the Fonts control panel to verify success.

Making games more fun with artificial stupidity

If one buys into Daniel Dennett’s proposed use of “the intentional stance” to generate explanations and predictions of human behavior (say, in an AI program that observes a person and tries to find ways of helping), then accounting for human error is a tough problem (because the stance assumes rationality and errors aren’t rational). That’s one reason I’m interested in errors.

Game AI faces a similar problem in that some games like chess and pool/billiards allow a computer player to make very good predictions many steps ahead, often beyond the ability of human players. Such near-optimal skill makes the computer players not much fun. One has to find ways of making the computers appear to play at a similar level of skill as whatever human they play against.

I just came across a very interesting article on the topic of how to make computer players have plausibly non-optimal skills. Here’s a good summarizing quote:

In order to provide an exciting and dynamic game, the AI needs to manipulate the gameplay to create situations that the player can exploit.

In pool this could mean, instead of blindly taking a shot and not caring where the cue ball ends up, the AI should deliberately fail to pot the ball and ensure that the cue ball ends up in a place where the player can make a good shot.

An interesting anecdote from the article is that the author created a pool-playing program that understood the physics of the simulated table and balls so well that it could unfailingly knock any ball it wanted to into a pocket. The program didn’t make any attempt to have the cue ball stop at a particular position after the target ball was pocketed, however. Yet naive human players interpreted the computer’s plays as trying to optimize the final position of the cue ball, apparently because they projected human abilities onto the program, and humans cannot unfailingly pocket any ball but seemingly are pretty good at having the cue ball stop where they want.

“current Web designs are three times easier to use for non-disabled users than for users with disabilities”

That quote comes from a new Nielsen Norman Group (aka Jakob Nielsen and Don Norman) study of the usability of many websites for disabled and non-disabled users in the U.S. and Japan (PDF). The report also includes 75 design tips for improving usability for the disabled.

The Enter key: Making a mess of submit buttons and textboxes

If one has a submit button in an HTML form, then pressing Enter will trigger the first of these in doc order, as though one pressed the button. At first glance, this seems like a nice feature, but in practice it leads to lots of problems. The root of the problem is that users forget or don’t know about this. (Browsers could help by giving special highlighting to such a button, as desktop apps often do.) Users might be typing in a textbox, not knowing or caring whether it’s a textbox (which will pass on any Enter to the form) or a one-line textarea (which would add a line ending to its content and not pass on the Enter). The Browse button of a file input (in some browsers) will also pass on an Enter rather than trigger a FileOpen dialog, even when it’s in focus.

To prevent such errors, one can change all submit buttons to normal buttons and use script and a hidden input to transmit to the server the name-value pair that the submit button would have provided. Or, if there are no file inputs (checkboxes and radios might also be a problem, though), then one might try changing all textboxes to textareas. However, a gotcha with the second approach occurs when a user enters a string with no spaces that’s longer than the one-line textarea; in that case, a horizontal scrollbar will appear that might hide the text. One can try making the textarea taller, but in Firefox it has to be more than 20 px or else no vertical scrollbar will appear for multi-line entries (because the scroll thumbs are quite large in FF).

Making text entry faster and easier for mobile, games, and the disabled

A Google video about MobileQWERTYâ„¢Â shows how a 3×3 button layout using letter assignments different than the usual abc, def pattern can provide lots of benefits. For example, the speaker says that an average of 2.14 key presses is needed to type each letter of a typical message using an abc layout on a standard phone, but MobileQWERTY’s layout reduces the average to 1.35 key presses.Â That’s 35% more than a full QWERTY keyboard but the abc layout is 114% more!

MobileQWERTY™ is shown to provide similar improvements for several Western European languages (and one can see a demo of Japanese at minute 40 in the video). It’s targeted not just at standard mobile keypads (a problem space dominated to this point by Tegic™, which owns the IP behind predictive spelling for abc layouts) but also game controllers and input devices for the disabled, children, and the elderly, that is, anyone having trouble managing fine finger movements.

The most impressive thing in the video to me is seeing how fluidly someone trained in MobileQWERTY™ can type typical messages. I really liked the small form factor of my freebie Sprint Samsung phone, but had to give it up for a Treo650’s fuller keypad. I think MobileQWERTY™ could turn out to be a better solution for mobile than Apple’s touch typing and predictive spelling. Let’s hope it’s an option in Google’s Android phone OS.

Understanding Remote Presence

A comment on: Understanding Remote Presence

The authors are Scandinavian, and start off with an interesting observation: “When you enter a Scandinavian home in the wintertime you will soon realize the importance of light, and how different lamps are crucial for carrying out work and daily house chores. But the use of light is also essential to show that you are at home and to manifest the presence of life. […] using light […] in some communities […] is a mutual social activity with the neighbors to show that you are doing well and even that you might welcome visits.”

The authors are also interested in the variety of sentimental artifacts in the home, “A key concern for us is to focus not only on the pattern of communication within a co-residential social unit, but also to investigate what people keep in their homes to act as a reminder of people to whom they are close.”

The authors did an ethnographic study where they visited homes and asked about the significance of objects while being given a tour. One of the more interesting findings is that some people have mixed feelings about the phone as an object, because it reminds them of painful conversations with remote loved ones. The authors also asked subjects to enact specific scenarios, such as leaving the house, in order to observe habits such as leaving lights on if the owner were staying in the neighborhood.

From the observations, the authors created several concept devices that combined the qualities of light source, keepsake item, and awareness of presence with a remote loved one. One of these concepts, called the “6th Sense” lamp, was developed into a prototype used in a followup study.

These lamps are made in pairs that are connected via GSM wireless network. When a human is near one lamp, the other lamp brightens, giving each lamp owner a sense of the other owner’s activity around the lamp. There was a 2-week study with 6 families of different kinds. Subjects were prepped with:

– a story of how in the old days in small villages, people could look out their window and see their parents’ homes, and could get a sense of how they were by how the house was lit inside

– a simple ritual once the lamps were installed where each user called the other and turned on the lamps while on the phone

– journals with prepared questions that were kept by the subjects

The authors were interested in the subjects’ perceived quality of (a) sense of presence, and (b) sense of being under surveillance. One subject (a father whose sons complained that he only called about practical matters) said the sense of presence wasn’t useful because he spoke to his sons so often by phone. And the only subjects who worried about surveillance were parents who didn’t want to intrude on their children.

These studies were very interesting in how they identified the meaningfulness of routine, practical activities like turning on lights, and how we might be able to introduce artifacts that are easily embedded in such activities and which enable even greater, yet subtle, human interactions.

Massively conferenced phone calls that mimic physical space

A comment on: The Mad Hatter’s Cocktail Party: A Social Mobile Audio Space supporting Multiple Simultaneous Conversations

Question: Imagine you have several group conversations going on via conference call, but instead of having separate calls, all the groups are lumped into a single call so that individuals can migrate in and out of conversations at will. Could an algorithm identify the different “floors” of conversation in real time, so that the volume of all conversations that a caller is not in are largely muted? That is, can an algorithm simulate the acoustics of a large meeting room, where the conversation that one hears best is the one of the group one is participating in?

Method: Create a full-duplex (i.e. you can hear while speaking) conference call using mic’d iPaq PDAs that are connected via an 802.11b wireless network to a GStreamer central audio exchange server. Create a “naive” Bayesian algorithm that is trained off-line with audio files from human conversations. The conversations are recorded during a party game that forces conversational groups to split up and reform. The human trainer then segments the audio according to which group is in the audio. People give subtle audio cues in their speech about whether they are participating in the conversation, such as not interrupting the current speaker but jumping in when that speaker indicates he is finishing. The Bayesian training should enable the algorithm to pick up on these cues and attenuate the volume of incoming audio streams that aren’t in the same conversation as a particular user, and make these decisions for all users simultaneously.
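To show the general idea of the method, here is a toy Python sketch of a naive Bayes classifier that decides whether another speaker shares a listener's floor, and sets that speaker's volume accordingly. This is my own illustration and not the authors' implementation: the two binary cues (`waits_for_pause`, `talks_over`), the hand-made training data, and the `muted_gain` value of 0.2 are all assumptions, and a real system would extract its cues from the audio streams.

```python
import math


def train_naive_bayes(samples):
    """samples: list of (cues, same_floor); cues is a tuple of 0/1 values."""
    n_cues = len(samples[0][0])
    label_counts = {True: 0, False: 0}
    cue_counts = {True: [0] * n_cues, False: [0] * n_cues}
    for cues, same_floor in samples:
        label_counts[same_floor] += 1
        for i, value in enumerate(cues):
            cue_counts[same_floor][i] += value
    model = {}
    for label in (True, False):
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        prior = (label_counts[label] + 1) / (len(samples) + 2)
        likelihoods = [
            (cue_counts[label][i] + 1) / (label_counts[label] + 2)
            for i in range(n_cues)
        ]
        model[label] = (prior, likelihoods)
    return model


def prob_same_floor(model, cues):
    """Posterior probability that the other speaker is in the listener's floor."""
    log_scores = {}
    for label, (prior, likelihoods) in model.items():
        score = math.log(prior)
        for p, value in zip(likelihoods, cues):
            score += math.log(p if value else 1.0 - p)
        log_scores[label] = score
    top = max(log_scores.values())
    same = math.exp(log_scores[True] - top)
    other = math.exp(log_scores[False] - top)
    return same / (same + other)


def gain_for(model, cues, muted_gain=0.2):
    """Full volume for the listener's own floor, attenuated volume otherwise."""
    return 1.0 if prob_same_floor(model, cues) >= 0.5 else muted_gain


# Hypothetical hand-labelled cues: (waits_for_pause, talks_over).
TRAINING = [
    ((1, 0), True), ((1, 0), True), ((1, 0), True), ((1, 1), True),
    ((0, 1), False), ((0, 1), False), ((0, 1), False), ((0, 0), False),
]
model = train_naive_bayes(TRAINING)
print(gain_for(model, (1, 0)))  # polite turn-taking: likely the same floor
print(gain_for(model, (0, 1)))  # talking over the listener: likely another floor
```

A hard threshold like this also hints at the failure mode described in the findings below: one wrong classification drops a partner's volume abruptly.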

Findings:

– When the system assigned floors correctly, users preferred it to having no floor assignment (i.e. no volume adjustment). This makes sense, since so many people were in the call at once that it was practically impossible to have group conversations without managing the call as if they were one, all-inclusive group.

– Users suggested a “maintain the current volume level” widget, since the system sometimes reassigned other people in the conversation to another group and this wasn’t noticed until the other person’s audio was so muted that they had trouble speaking normally to each other and thereby getting the system to notice that they were actually in the same group.

– Unlike in Face-To-Face (FTF) interactions, it often took participants quite a while to notice that another person had moved to another group (whether by that person’s choice or by the system’s mistake).

IMO, this could be a very useful feature in stationary and mobile conferences, but only if there is really a need for people to move silently in and out of conversations. The only situations like that I can think of are phone-based chat rooms (which don’t exist yet AFAIK) and party lines. The need for a maintain-this-volume button is also problematic with mobile use, since pressing keys on cellphones while talking is very awkward.

Using only audio to support telepresence in office environments

A comment on: Hanging on the ‘wire: a field study of an audio-only media space

Question: How effective is an audio-only connection for office-based collaborative work?

Method: The authors set up conference-call-type (but full-duplex) connectivity in the offices of a group of video editors who were already a social group, within 100 ft of each other. Each editor received a mutable open mic and speaker system that allowed all users to speak and hear each other simultaneously. In their regular routine, the editors rarely worked on the same video segments.

Findings:

– The lack of a formal way to take and release “the floor” led to overlapping speech and the need to repeat oneself and overtly manage the floor. (Face to face, this appears to be done most of the time through facial expressions and gestures.) It was also more of a problem than in phone conversations, since interactions were open-ended (hours long) and less formal.

– There was lots of joking and voice play

– What little time was devoted to work tasks involved scheduling meetings, primarily

– Similar to the problem of managing the floor, another major problem was determining who was “on” the system and thus how careful one had to be about what was said…in at least one case, catty gossip was overheard by the subject of the gossip

– A phone ring that interrupted the speaker evolved into an informal sign-off, since it typically indicated that the speaker muted the mic and took the call

– Some sudden background noises, like ringing phones, caused pain for some participants who were listening in with headphones instead of speakers. This was the only problem that the authors thought merited automated help (i.e. to monitor and squelch loud noises)

IMO, this study showed that open-ended, non-task-related exchanges aren’t served well by an audio-only Computer-Supported Collaborative Work (CSCW) system. But this result suggests a followup study of how audio-only conference connections might help limited, task-related exchanges. The mobile blue collar subjects in one of the other studies here would be an ideal group.

Using suggestive sounds to indicate state in complex systems

A comment on: Effective sounds in complex systems: the ARKOLA simulation

A prototype was created to answer the question, if you have a complex system that needs monitoring, like a soda bottling operation, and there is too much visual information to display on screen at a time, would auditory clues about the system’s behavior be helpful?

The prototype simulates the imaginary “ARKola” bottling plant, where empty glass bottles, cola nuts, and carbonated water are delivered at intervals at different ends of the plant. The nuts and water are heated, the bottles filled and then capped, and then marshalled for shipping. Each of these steps has its own box in the graphical simulation and its own sounds (such as the clinking of newly arrived bottles). The screen can show only 1/4 of the simulation at a time, so the user has to listen for the crash of broken bottles, the spill of boiling syrup, and so on, in order to adjust several controls and keep the plant running smoothly.

In the test, users were put into partner pairs where each partner was in a different room. In the control group, the partners could speak with each other and monitor the GUI. In the other group, the partners heard the auditory icons (“earcons”) also.

Conclusions:

– Earcons in general seemed very useful.
– However, more recognizable sounds, like breaking glass, could distract from less recognizable sounds that indicated more important problems.
– The stopping of a sound tended to be ignored, even if the stopping was important.
– The partners who had earcons tended to segment the work and talk with each other about the other’s problems. The partners who didn’t have earcons tended to ignore the other person’s problems and focus on their own.
– The overall performance of the simulated plant was inferrable from the earcons almost as a gestalt, similar to the way people suspect problems with their car based on the sound of the engine.

Designing to support communication on the move

A comment on: Designing to support communication on the move

This is a newer ethnographic study by Jacqueline Brodie about blue-collar mobile communication needs (this one is from 2003; her earlier one was from 2001).

She points out that there is no definitive definition of ‘mobile work’ and offers a list of example instances: “working at multiple (but stationary) locations, walking around a central location, travelling between locations, working in hotel rooms, on moving vehicles, and in remote meeting rooms.”

This was a 2-page report and its findings are fairly general:

– Information arriving in one medium tends to force the interaction to remain in that medium, because it’s so hard to switch. Mobile tools should make this easier.
– Mobile workers need to prevent interruptions while reassuring employers that they are “on the job”.
– Current tech limits the extent to which mobile workers can tap into shared resources like knowledge bases.
– Current tech depends too much on having a flat work surface available (when using a laptop) or a free hand (when using a cellphone, PDA, or laptop).

Designing for Mobility, Collaboration, and Information Use by Blue-Collar Workers

A comment on: Designing for Mobility, Collaboration, and Information Use by Blue-Collar Workers

The focus of this study was mobile blue-collar workers like copier repairmen, and how they solve problems using wireless equipment like cellphones and laptops.

The general findings were:

– The tasks most in need of better support are:
- scheduling later Face-To-Face mtgs, which is complicated by the need to find mutually available times and locations and to record these decisions while having only one or no hands free, possibly when the other person isn’t even able to talk at the moment
- delivering docs to complete an interaction. For example, during a cellphone conversation, the other participant might request that a contract be sent in advance of a mtg; since cellphones generally can’t be used to access or transmit files, the promiser has to remember to send the doc later, perhaps hours later, when he has access to a computer or fax.
– Some companies need to provide phone services to their mobile employees but are so concerned about cell minutes that they opt for phone cards instead of actual phones. Finding a phone is often a hassle, and these phones are often not close enough to the worksite to allow guidance over the phone.
– Mobile workers need a good way of indicating a ‘busy’ status, to prevent interruptions. (The article seems to suggest a software solution, but I’m not sure why taking one’s phone off-hook wouldn’t be sufficient, and more reliable than a new type of software.)
– Workers need a hands-free, out-of-car communication solution.

“Taking Email to Task” – PARC HCI research

A comment on: Taking Email to Task

This paper notes that email has morphed from simple asynchronous msg exchange to a means of managing tasks. For example, users have been observed sending themselves email as reminders, since they know that they scan their inbox several times a day. Similarly, users forward msgs to themselves to keep the msg within the visible area of the inbox. The inbox is also used as a file store — users will send themselves an email with an attachment as a way of quickly accessing the file during the run of the task.

Based on these observations, a prototype was developed that:

– Allows senders to indicate a deadline, and all recipients of this msg see a progress bar in the inbox showing how much time remains for this item

– The inbox and outbox are shown in one view, since the two together represent current tasks

– Attachments appear as their own item in the inbox, which sidesteps the problem of att’s being hidden within msgs

These features (and others) were implemented as an Outlook plugin, and a study was done across a variety of users in diff locations. The added features were found to be generally very helpful, although most users stopped using it due to the plugin not supporting the full range of Outlook features they had become dependent on.

The problem examples suggest some features that the authors didn’t discuss:

– Allow recipients to create a summary meaningful to them that would be shown in place of the subject text (in a diff color)…useful for archiving how-to info and such

– Allow senders to mark some recipients as “responsible”. The inbox would also be split into 4 bins: MyTasks, OthersTasks, Unread, and Read. Any msg where I’m marked responsible would go to MyTasks from Unread once I read it; similarly for msgs where only others are responsible, for OthersTasks. The bins help avoid important msgs falling out of the visible area (and MyTasks/OthersTasks/Unread should expand to fit new msgs until full, and then pulse).

– Items with responsible parties should also have a status: {assigned, accepted, declined, finished}…this is a lightweight form of bugtracking/task resolution.

– Changes to deadlines, those responsible, or status should cause the display of that item in everyone’s display to change
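The bin and status ideas above can be sketched in a few lines of Python. This is only my own illustration of the suggested rules, not anything from the paper or the Outlook plugin; `Message`, `Status`, and `bin_for` are hypothetical names, and I’m assuming a msg with no responsible party simply lands in Read once it has been read.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    ASSIGNED = "assigned"
    ACCEPTED = "accepted"
    DECLINED = "declined"
    FINISHED = "finished"


@dataclass
class Message:
    subject: str
    responsible: set = field(default_factory=set)  # recipients marked responsible
    read: bool = False
    status: Optional[Status] = None  # only meaningful if someone is responsible


def bin_for(msg, me):
    """Return the inbox bin that should show `msg` for the user `me`."""
    if not msg.read:
        return "Unread"
    if me in msg.responsible:
        return "MyTasks"
    if msg.responsible:
        return "OthersTasks"  # only other people are responsible
    return "Read"


contract = Message("Send contract", responsible={"me"}, status=Status.ASSIGNED)
print(bin_for(contract, "me"))  # still unread
contract.read = True
print(bin_for(contract, "me"))  # moves to my tasks once read
```

Computing the bin from the msg’s current state, rather than storing it, means a change to the responsible parties or the read flag automatically shows up the next time anyone’s inbox is redrawn.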

This paper fired my already keen interest in Personal Information Management tools (“PIMs”) even further.

Intention Perception

How do we infer what people are up to just by seeing a few of their actions?