Category Archives: DHTML

Developing speech applications


Personal background

The idea of controlling technology by telling it what to do has been compelling for a long time. In fact, when I was part of a “voice portal” startup in 1999-2001 (, which rolled into AOLbyPhone 2001-2004 or so), there was a joking acknowledgement that the tech press announces “speech recognition is ready for wide use” about every ten years, like clockwork. Our startup launched around the third such crest of press optimism. And like movies on similar topics that release the same summer, there was a new crop of voice portal startups at the time (e.g., TellMe and BeVocal). Like the browser wars of a few years earlier between Netscape and IE, in which they’d pull pranks like sneaking their logo statue into the competitor’s office yard, we spiked TellMe car antennas with rubber ducks in their parking lot. Those were crazy, fun days when the corner of University and Bryant in Palo Alto seemed like the center of the universe, long hours in our little office felt like a voyage on a Generation Ship, and pet food was bought online. And a little more than a decade later, Apple has bought Siri to do something similar, and Google and Microsoft have followed.

The idea that led to our startup was wanting to help people compare prices with other stores while viewing products at a brick-and-mortar store. Mobile phones then had poor cameras and browsers, so the most feasible interaction method was to verbally select a product type, brand, and model by first calling an Interactive Voice Response (IVR) service. But a web startup needs more traffic than once-a-week shopping, so other services were added such as movie showtimes, sports scores, stock quotes, news headlines, email reading and composing (via audio recording), and even restaurant reviews. This was before VoiceXML reached v1.0 and we used a proprietary xml variant developed in-house alongside our Microsoft C++-based servers. We were the first voice portal to launch in the US, and that was with all services except the price comparison feature that was our initial motivation. It hasn’t reappeared on any voice portal since, that I know of.

As any developer knows, building on standards often provides many advantages. Once VXML 1.0 appeared, I wanted to see if we could migrate to it, so I bought a Macintosh G4 with OS X v1 when the Apple store first opened in Palo Alto, and used the Java “jar” wrappers for its speech recognition and generation features to prototype a vxml “browser”. When it supported 40% of the vxml spec, I shared it with our startup, recently bought by AOL, but they passed. I stopped work on it and released it as open-source through the Mozilla Foundation (see

More than a decade later, markup-based solutions like vxml still seem like the most productive way of creating speech-driven applications (compared to, say, creating a Windows-only application using Dragon NaturallySpeaking).

Application design

State-of-the-art web applications tend to adopt the Model-View-Control design pattern, where the model is a JSON finite-state machine representation of all the states (e.g. ViewingInbox, ComposingMessage) supported, and JavaScript is used as controller to create DOM/HTML views, handle user actions, and manage data transfers with the server. This is also the pattern of newer W3C specs like SCXML that aim to support “multi-modal” interactions such as requesting a mapped location on one’s mobile phone by speaking the location (i.e., speech is one mode) and having the map appear in the browser (i.e., browser items are another mode). As “pervasive computing” develops and is able to escape the confines of mobile phones and laptops, additional modes needing support are likely to be, first, recognizing that the user is pointing to something and resolving what the referent is, and secondly, tracking the gaze of the user and recognizing what it’s fixated upon, as a kind of augmented reality hover gesture. Implementing and integrating such modes is part of my interest in the larger topic of intention perception; if you are interested in how these modes fit into a larger theoretical context, I highly recommend the entry on Theory of Meaning (and Reference) in the Stanford Encyclopdia of Philosophy, and Herb Clark’s book “Using Language“.

Vxml is up to v3.0 now, and it might support integration with these non-speech modes. But vxml 2.0 and 2.1 are supported more widely, and creating applications with them that follow the design pattern above requires a bit of thinking. The remainder of this article will share my thoughts and discoveries about how to do that with an excellent freemium platform,

Tips on Using Vxml 2.1

Before attempting to create a vxml application, I strongly recommend getting a book on the topic or reading the specs online. But as a quick overview, think of conversations as pairs of turns in which one person already has in mind how the other person might respond to what he is about to say, he then says it, and usually allows the other person to interrupt, and as long as the other person says something and it’s intelligible, the speaker will respond with another turn. Under this description, the speaker’s turn gets most of the attention, but the respondent’s turn usually determines what happens next. Each such pair can be conceived of as a state in a finite-state machine, where all the speaker’s reactions to the respondent correspond to transitions out of those states.
To implement such a set of states in vxml2.0 or 2.1, one can create a single text document (aka “Single Page Application (SPA)“) with this as a start,

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1">

and then for each state, insert a variant of the following between the ‘vxml’ tags:

<form id="fYourNameForTheState">

To implement each state, add a variant of the following within each ‘form’ element,

<field name="ffYourNameForTheState">
  <grammar mode="voice" xml:lang="en-US" root="gYourNameForTheState" tag-format="semantics/1.0">
    <rule id="gYourNameForTheState">
      ...All the things the speaker might expect the respondent to say that are on-task...
  <prompt timeout="10s">...The primary thing the speaker wants to tell the respondent in this state, perhaps a question...</prompt>
    <prompt>...What to say if the prompt finishes and the respondent is silent all through the timeout duration...</prompt>

    <prompt>...What to say as soon as any mismatch is detected between what the respondent is saying and what the speaker was expecting in the grammar; "I didn't get that" is a good choice...</prompt>

    <if cond="ffYourNameForTheState.result &amp;&amp; (ffYourNameForTheState.result == 'stop')">
      <goto next="#fWaitForInstruction"/>
    <elseif cond="ffYourNameForTheState.result &amp;&amp; (ffYourNameForTheState.result == 'shutdown')" />
      <goto next="#fGetConfirmationOfShutdown"/>
    <else />
      <assign name="_oInstructionHeard" expr="ffYourNameForTheState"/> <!-- Assumes _oInstructionHeard was declared outside this form in a 'var' or 'script' -->
      <goto next="#fGetConfirmationOfInstruction"/>

We’ll discuss grammars in more depth below, and the rest of the template is largely self-explanatory. But a few minor points:

  1. If you need to recognize only something basic like yes-or-no or digits in a form, then you can remove the ‘grammar’ element and instead add one of these to the ‘field’ element:
    • type="boolean"
    • type="digits"
    • type="number"
  2. Grammars can appear outside ‘field’ as a child of ‘form’, but then they are active in all fields of the form. There are cases in which doing so is good design, but it’s not the usual case.
  3. The only element that “needs” a timeout for the respondent being silent is ‘noinput’; yet, the attribute is required to be part of ‘prompt’ instead.
  4. ‘goto’s can go to other fields in the same form, or different forms, but not to a specific field of another form.

I’ve made the ‘filled’ part less generalized than the other parts to illustrate a few points:

  1. The contents of the ‘filled’ element is where you define all of the logic about what to do in response to what the respondent has said.
  2. Although I’ve indented if-elseif-else to highlight their usual semantic relation to each other, you can see that actually ‘if’ contains the other two, and that ‘elseif’ and ‘else’ don’t actually contain their then-parts (which is somewhat contrary to the spirit of XML).
  3. The field name is treated as if it contains the result of speech recognition (because it does), and it does so as a JavaScript object variable that has named properties.
  4. The field variable is lexically scoped to the containing form, so if you want to access the results of speech recognition in another form (perhaps after following a ‘goto’), then you first must have a JavaScript variable whose scope is outside either of the forms, and assign it the object held by the field variable.
  5. A boolean AND in a condition must be written as &amp;&amp; to avoid confusing the XML parser. (You might want to try wrapping the condition as CDATA if this really bugs you.)
  6. Form id’s can be used like html anchors, so a local url for referencing a form starts with url fragment identifier ‘#’ followed by the form’s id.

Note that it’s not necessary to start form id’s with “f”, or fields with “ff”, or grammars with “g”, nor is it necessary to repeat names across them like I do here. But I find that simplifying this way helps keep the application from seeming over-complicated.

Creating grammars

To implement the grammar content indicated above by placeholder text, “…All the things the speaker might expect the respondent to say that are on-task…,” one provides a list ‘one-of’ and ‘item’ elements. ‘one-of’ is used to indicate that exactly one of its child items must be recognized. ‘item’ has a ‘repeats’ attribute that takes such values as “0-1” (i.e., can occur zero or one times), “0-” (i.e., can occur zero or more times), “1-” (i.e., can occur one or more times), “7-10” (i.e., can occur 7 to 10 times), and so on. ‘item’ takes one or more children which can be any permutation of ‘item’ and ‘one-of’ elements, which can have their own children, and so on. The children of a ‘rule’ or ‘item’ element are implicitly treated as an ordered sequence, so all the child elements must be recognized for the parent to be recognized. (This formalism might remind you of Backus-Naur Form (BNF) for describing a context-free grammar (CFG). If you need a grammar more expressive than a CFG, you’ll have to impose the additional constraints in post-processing that follows speech recognition.)

If the contents of the grammar rule take up more than about five lines, it’s good practice like in other coding languages to modularize that content into an external file. Each such grammar module is declared within an inline ‘item’ like this,

<grammar mode="voice" xml:lang="en-US" root="gGetCommand" tag-format="semantics/1.0">
  <rule id="gGetCommand">
        <ruleref uri="myCommandLanguage.srgs.xml#SingleCommand" type="application/grammar-xml"/>
        <ruleref uri="myCommandStop.srgs.xml#Stop" type="application/grammar-xml"/>

and the external grammar file should have this form:

<?xml version= "1.0" encoding="UTF-8"?>
<!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
<grammar version="1.0" xmlns="" xml:lang="en-US" tag-format="semantics/1.0" root="SingleCommand" >
  <rule id="SingleCommand" scope="public">
    ...A sequence of 'one-of' and 'item' elements describing single commands you want to support...
  <rule id="SubgrammarOfSingleCommand" scope="public">
    ...Details about a particular command that would take too much space if placed inside the SingleCommand rule...

Defining the Recognition Result

Human languages usually allow any intended meaning to be phrased in several ways, so useful speech apps need to accommodate this by providing as many expected paraphrases as seem likely to be used. So, a grammar often has several ‘one-of’s to accommodate paraphrases. A naive approach for a speech app would be to provide such paraphrases in the grammar, and take recognition results in their default format of a single string, and then try to re-parse that string with JavaScript case-switch-logic similar to the ‘one-of’s in the markup — a duplication of work (ugh) with the attendant risk that the two will eventually fall out of sync (UGH!). What would be much preferred would be to retain the parse structure of what’s recognized and return that instead of a (flat) string; in fact, this is just what the “semantic interpretation” capability of vxml grammars offers. To make use of this capability, a few things are needed (these may be Voxeo-specific):

  1. The ‘grammar’ elements in both the vxml file and the external grammar file(s) must use attributes tag-format="semantics/1.0" plus root="yourGrammarsRootRuleId"
  2. ‘tag’ elements must be placed in the grammars (details on how below), and they must assume there is a JSON object variable named ‘out’ to which you must assign properties and property-values. If instead you assign a string to ‘out’ anywhere in your grammar, then recognition results will revert to flat-string format.
  3. If using Voxeo, ‘ruleref’ elements that refer to an external grammar must use attribute ‘type=”application/grammar-xml”‘, which doesn’t match the type suggested by the vxml2.0 spec, “application/srgs+xml”,

To use ‘tag’ elements for paraphrases, one can do this,

<rule id="Stop" scope="public">
  <tag>out.result = 'stop'</tag>

in which the ‘result’ property was chosen by me, but could have been any legal JSON property name. The only real constraint on the choice of property name is that it make self-documenting sense to you when you refer to it elsewhere to retrieve its value.

‘tag’ elements can also be children of ‘item’s, which makes them a powerful tool for structuring the recognition result. For example, a grammar rule can be configured to create a JSON object:

<rule id="ParameterizedAction" scope="public">
      <ruleref uri="#DrillSpec"/>
        out.action = 'drill';
        out.measure = rules.latest().measure;
        out.units = rules.latest().units;

In this example, we rely on knowing that the “DrillSpec” rule returns a JSON object having “measure” and “units” properties, and we use those to create a JSON object that has those properties plus an “action” property.

‘tag’ elements can also be used to create a JSON array:

<rule id="ActionExpr" scope="public">
    out.steps = [];
    function addStep(satisfiedParameterizedActionGrammarRule) {
      var step = {};
      step.action = satisfiedParameterizedActionGrammarRule.action;
      step.measure = satisfiedParameterizedActionGrammarRule.measure;
      step.units = satisfiedParameterizedActionGrammarRule.units;
    <ruleref uri="#ParameterizedAction"/>
    <!-- This use of rules.latest() should work according to -->
  <item repeat="0-">
      <item repeat="0-1">then</item>
    <ruleref uri="#ParameterizedAction"/>

These object- and array-construction techniques can be used in other rules that you reference as sub-grammars of these, allowing you to create a JSON object that captures the complete logical parse structure of what is recognized by the grammar.

By the way, if you want to use built-in support for recognizing yes-or-no, numbers, dates, etc as part of a custom grammar, then you’ll need to use a ‘ruleref’ like this,

<rule id="DepthSpec" scope="public">
    <ruleref uri="#builtinNumber"/>
    <tag>out.measure = rules.builtinNumber</tag>
<rule id="builtinNumber">
    <ruleref uri="builtin:grammar/number"/>

URI’s for other types can be inferred from the “grammar src” examples at (although these might be specific to the Voxeo vxml platform).

If you follow this grammar-writing approach, then you can access the JSON-structured parse result by reading property-value’s from the field variable containing the grammar (e.g., “ffYourNameForTheState” above), just as if it were the “out” variable of your root grammar rule that you’ve been assigning to. These values can be used in ‘filled’ elements either to guide if-then-else conditions, or be sent to a remote server as we’ll see in the next  major section, “Dynamic prompts and Web queries”.

Managing ambiguity

As a side note, if you’re an ambiguity nerd like me, you’ll probably be interested to know that Vxml 2.0 doesn’t specify how homophones or syntactic ambiguity must be handled. But Voxeo provides a way to get multiple candidate parses.

Dynamic prompts and Web queries

So far, we can simulate one side of a canned conversation via a network of expected conversational states. It’s similar to a Choose-Your-Own-Adventure book in that it allows variety in which branches are followed, but it’s “canned” because all the prompts are static. But often we need dynamic prompts, especially when answering a user question via a web query. JavaScript can be used to provide such dynamic content by placing a ‘value’ element as a child of a ‘prompt’ element, and placing the script as the value of ‘value’s ‘expr’ attribute, like this:

<assign name="firstNumberGiven" expr="100"/> <!-- Simulate getting a number spoken by the user -->
<assign name="secondNumberGiven" expr="2"/> <!-- Simulate getting a number spoken by the user -->
<prompt>The sum of <value expr="firstNumberGiven"/> and <value expr="secondNumberGiven"/> is <value expr="firstNumberGiven + secondNumberGiven" /> </prompt>

The script can access any variable or function in the lexical scope of the ‘value’ element; that is, any variable declared in a containing element (or its descendants that appear earlier). Also notice that, by default, adjacent digits from a ‘value’ element are read as a single number (e.g., “one hundred and two”) rather than as digits (e.g., “one zero two”). That’s convenient, because one can’t embed a ‘say-as’ element in the ‘expr’ result, although one can force pronunciation as digits by inserting a space between each digit (e.g., “1 0 2”) perhaps by writing a JavaScript function (see; otherwise, if the default were to pronounce as digits, then forcing pronunciation as a single number would require a much more complicated function.

I’ve said little to nothing about interaction design in speech applications, although it’s very important to get right, as anyone who’s become frustrated while using a speech- or touchtone-based interface knows well. But one principle of interaction design that I will emphasize is that user commands should usually be confirmed, especially if they will change the state of the world and might be difficult to undo. When grammars are configured to return flat-string results, prompting for confirmation is easy to configure like this:

<prompt>I think you said <value expr="recResult"/> Is that correct? </prompt>

But when a grammar is configured to return JSON-structured results, the ‘value’ element above might be read as just “object object” (the result of JavaScript’s default stringify method for JSON objects, at least in Voxeo’s vxml interpreter). I believe the best solution is to write a JavaScript function (in an external file referenced with a ‘script’ element near the top of the vxml file) that is tailored to construct a string meaningful to your users from your grammar’s JSON structure, then wrap the “recResult” variable (or whatever you name it) in a call to that function. If there is any need to nudge users toward using terms that are easier for your grammar to recognize, then this custom stringify function is an opportunity to paraphrase their commands back to them using your preferred terms.

Now we’re ready to talk about sending JSON-structured recognition results to remote servers, which is the most exciting feature of vxml 2.1 for me, because it’s half of what we need to make vxml documents able to leverage the same RESTful web APIs that dhtml documents can (the other half, being able to digest the server’s response, will be discussed soon; “dhtml” === “Dynamic HTML”, which is a combination of html and JavaScript fortunate enough to find itself in a browser that has JavaScript enabled). Like html forms, vxml provides a way for its forms to submit to a remote server. And also like html, the response must be formatted in the markup language that was used to make the submission, because the response will be used to replace the document containing the requesting form. Html developers realized that their apps could be more responsive if less content needed to travel to and from the remote server, and that if they instead requested just the gist of what they needed, and the response encoded that gist in a markup-agnostic format like XML or JSON, then JavaScript could be used in their browser-based client to manipulate the DOM of the current document and that might usually be faster than requesting an entirely new document (even if most of its resources could be externalized into JavaScript and CSS files that can be cached). Because these markup-agnostic APIs are becoming widely available, they present an opportunity for non-html client markup languages like vxml to leverage them. Vxml developers created a way to leverage these APIs by adding a ‘data’ element alternative to vxml form submission in the vxml 2.1 spec. Here’s an example:

<var name="sInstructionHeard" expr="JSON.stringify(_oInstructionHeard)"/>
<data method="post"
      srcexpr="_sDataElementDestinationUrl + '/executeInstructionList'"
      ecmaxmltype="e4x" />

The ‘data’ element isn’t as user-friendly as it might be. For example, one can’t just put the JSON-structured recognition result in it and expect it to be transferred properly; instead, one must first JSON.stringify() it (this method is native to most dhtml browsers circa 2014 and to Voxeo’s vxml interpreter). And the ‘data’ element requires that even POST bodies be url-encoded, so the remote server must decode using something like this (assuming you’re using a NodeJs server):

sBody = decodeURIComponent(sBody.replace(/\+/g, ' '));
sBody = sBody.replace('sInstructionToEvaluate=',''); //Strip-off query parameter name to leave bare value
sBody = (sBody ? JSON.parse(sBody) : sBody);

What the remote server needs to do for its response is easier:

oResponse.writeHead(200, {'Content-Type': 'text/xml'});

If the server is reachable and generates a response like this, then the variable above that I named “oRemoteResponse” will be JSON-structured and have a ‘result’ property, which itself will have ‘summaryCode’ and ‘details’ properties whose values, in this case, are string-formatted. You have the freedom to use any valid XML element name — which is also a valid JSON property name — in place of my choice of ‘result’. The conversion from the remote server’s XML formatted response to this JSON structure is done implicitly by the vxml interpreter due to the ecmaxmltype="e4x" attribute. (The vxml 2.1 interpreter cannot process a JSON-formatted response as dhtml browsers can.) These JSON properties from the remote server can be used to control the flow of conversation among the ‘form’s in the same way we used JSON properties from “semantic” speech recognition earlier. Coolness!

A few final comments about ‘data’ elements:

  1. To validate the xml syntax of your app, you probably want to upload it to the W3C xml validator; however, the ecmaxmltype="e4x" attribute is apparently not part of the vxml 2.1 DTD, which the validator finds at the top of your file if you’re following my template above, and so you will get a validation error that you’ll have to assume is spurious and ignorable.
  2. My app uses a few ‘data’ elements to send different kinds of requests, so to keep the url of the remote server in-sync across those, I have a ‘var’ element before all my forms in which I define the _sDataElementDestinationUrl url value.
  3. fetchhint="safe" disables pre-fetching, which isn’t useful for dynamic content like the JSON responses we’re talking about
  4. If you want to enable caching, which doesn’t make sense for dynamic JSON content like we’ve been talking about but would be reasonable for static content, you’d do that via your remote server’s response headers.
  5. If the remote server isn’t reachable, the ‘data’ element will throw an ‘error.badfetch’ that can be caught with a ‘catch’ element to play a prompt or log error details, but unfortunately this error is required by the spec to be “fatal” which appears to mean the app must exit (in vxml terms, I believe it means the form-interpretation algorithm must exit). That’s a more severe reaction than in dhtml which allows DOM manipulation and further http requests to continue indefinitely. Requiring such errors to be fatal blocks such potential apps as a voice-driven html browser that reads html content, because it could not recover from the first request that fails. But maybe I’m interpreting “fatal” wrong; Voxeo’s vxml interpreter seems to allow interaction to continue indefinitely if this error is caught with a ‘catch’ element that precedes a ‘form’.
  6. If the remote server is reachable but must return a non-200 response code, the ‘data’ element will throw ‘error.badfetch.DDD’ where DDD is the response code. This error is also “fatal”.

At this point, we’ve covered all that I think is core to authoring a speech application using vxml 2.1. For more details, the vxml 2.0 spec and its 2.1 update are the authoritative references. Voxeo’s help pages are also quite useful.

Up next: Test-driven development of speech applications, and Hosting a speech app using Voxeo.

Print Friendly, PDF & Email

Testing custom REST APIs using a NodeJs http server

When you’re developing a REST API, and it’s not entirely clear how well your configuration of the client-side is working, it can be handy to get a server up quickly to log what it receives. In the Java servlet world, this might be done by introducing a filter class that inspects and logs all requests and responses. But if you don’t have the servlet itself setup yet, a faster way is to launch a NodeJs-based server locally.

If you’ve setup Eclipse IDE for debugging standalone JavaScript using NodeJs, then you can paste-in this as a new .js file and you’re almost home:

//Derived from

http = require('http');
//fs = require('fs');

port = 51100;
host = '';

server = http.createServer( function(req, res) {

    var htmlPrefix = '<html> <body>',
    	htmlSuffix = '<form method="post" action="http://'+ host +':'+ port +'"> String value to POST (type exit to quit): <input name="formSent" type="text"/> <input type="submit" value="Submit"/> </form> </body> </html>',
    	html = null, //fs.readFileSync('index.html');
    	body = '';
    var generateResponse = function() {
        html = html || htmlPrefix +'Received '+ req.method + (body ? ': '+body : '') + htmlSuffix;
        res.writeHead(200, {'Content-Type': 'text/html'});
        if (body == "formSent=exit") {
                //console.log("Exiting process");
    if (req.method == 'POST') {
        req.on('data', function (data) {
            body += data;
            console.log("Partial body: " + body);
        req.on('end', function () {
            console.log("Body: " + body);
    } else { //GET


server.listen(port, host);
console.log('Listening at http://' + host + ':' + port);

Once you see the console entry about “Listening…”, just open a browser tab to

If you can’t kill the process by entering exit into the web form, then you can kill the node process from the commandline.

Print Friendly, PDF & Email

Debugging standalone JavaScript in Eclipse IDE

There are some nice online JavaScript debuggers like and, but if you want to edit a .js file that’s part of a version-controlled project, you’d probably prefer to do so in an IDE.

The Eclipse IDE has the JSDT plugin that provides syntax highlighting. (And it’s baked into the “Eclipse for Java EE Developers” edition, but can be installed in other versions, too.) To evaluate the code and see the output in the IDE’s Console view, some extra steps are necessary. Z0ltan’s guide is very helpful. Some extra hints:

  1. Install nodeJs first. This is an executable that wraps Google’s V8 js engine, providing a commandline interpreter (and lots of useful libraries, I believe, such as allowing the coding of a web server using javascript). As of 2014 April 29, the Windows installer doesn’t broadcast through the OS that the Path has been updated, so as a workaround you may need to open the Environment Variables dialog and then close it with the OK button.
  2. If you copy any text from Z0ltan’s page, or this one, into your IDE, make sure to re-type all the quotes; otherwise, copy/paste tends to pick up “fancy” quote characters that cause weird errors.
  3. In between Z0ltan’s steps 1 and 2, you need to select “Program” in the left pane, and then hover over the buttons above it to select the one having hover text “New launch configuration”.
  4. There’s a comment at the end of Z0ltan’s guide advising that you add the /D flag like this:
    /C "cd /D ${container_loc} && node ${resource_name}"

    if your js source file might be kept somewhere other than the C drive. But Microsoft’s documentation of cmd.exe flags just says /D disables autorun, so I’m not sure how helpful that is.

The NodeJs API pages start here, and there’s some helpful guidance on error-handling on Joyent.

Print Friendly, PDF & Email

Why Palm’s webOS is the future of Android (and desktop computing)

Do you connect these dots in the same way I do?

  1. The current practice in OSs and browsers of asking the user at install time whether to proceed with the install, as a way of avoiding security threats, just doesn’t work. Users do not have the right kind of information at that time to decide.
  2. The threat of compromised systems and data loss is severe enough that consumer and enterprise OSs will have to be designed in a different way to manage installation risks. The widespread acceptance of smartphone apps indicates that smartphones will need such protection, too.
  3. Google’s NativeClient project is a good way of handling the risk because it provides a sandbox, and it’s better than alternatives like Java and Flash because it allows apps to run faster (because the apps are compiled natively rather than into bytecode).
  4. Palm’s webOS for its new smartphones has a very similar design to NativeClient (and since NativeClient is open source, could be built on top of it, for all I know). Specifically, webOS’ plugin development kit (PDK) will allow allow apps written in C and C++, two languages which by themselves allow altering memory contents almost anywhere in RAM and thus open to abuse by malicious app coders, but the PDK will sandbox apps, apparently in much the same way that NativeClient does. WebOS’ other interface, the Mojo SDK, allows apps written in Javascript to access data on the phone in much the same way that NativeClient’s browser plugin design would allow.
  5. Thus, webOS seems to provide a glimpse into what smartphone and desktop OSs will be like in coming years, if they deal with security threats in the inspired way detailed in the NativeClient design.

And there’s another force pushing Google’s Android smartphone OS in the same direction as webOS:

  1. Google always seems to prefer keeping its apps as platform-agnostic as it can by leveraging browsers when it can. The exceptions are Google Earth, GTalk, etc which must be installed either for performance reasons or to gain access to “hooks” in the OS that browsers can’t offer.
  2. Google’s apps for Android are Java-based (i.e., not browser-based) for apparently no strong reason. In fact, it seems that if Google had had Palm’s insights about how a web-oriented OS could be made back when Android was being designed, then Android would be very much like webOS so that Google wouldn’t have to split its app-building competence and resources across so many platforms (of course, the iPhone and Blackberry platforms would still make their own demands). Google’s efforts to build ChromeOS is another strong bit of evidence of its desire that there be fewer platforms and that they resemble browsers more.
  3. Eric Schmidt has said that Android and ChromeOS will eventually merge. I’m not sure if he came to this conclusion before or after learning about the design of Palm’s webOS, but webOS seems like a good hint of what such a merge would result in.

Am I pulling too hard on thin threads, or does this paint the same strong picture for you that Palm’s webOS really is a glimpse of the future? It sure is a fun way for me to stretch my thinking about what smartphones can do and be.

If this is an accurate prediction, then two consequences come to mind:

  1. The current unspoken practice of web engineers looking into the Javascript source of their competitors, learning new tricks, and helping the craft of web engineering to improve will suffer because companies will want to shift their presentation and business logic out of Javascript and into compiled native code for greater performance and out of a misguided attempt to protect their intellectual property.
  2. Having Google compete in the same idea space will help inspire both toward even better ideas. Of course, Google won’t buy Palm (why would it need to?), and it’s unlikely that having similar platform designs will affect the market share of either of them. As long as Palm can capture a significant share of the growing global demand for smartphones, it should be able to survive. And it’s likely to always have an advantage over Android in the beauty of its UI, given the DNA of the two companies.

UPDATE: Google released an “NDK” for Android way back in June 2009, which sounds like webOS’ planned PDK and also sounds like it was built on NativeClient. So, my prediction above that webOS is the future of Android has things a bit turned around.

Also, although the NDK seems to have a very similar design to NativeClient, and might have been built on NaCl, I’m somewhat doubtful because NaCl relies heavily on a feature known as “segmented memory” in the 386 chip architecture, and I wonder if that same feature is present in mobile CPUs such as ARM.

UPDATE: Other devs are worried that we might lose the ability to view html source and thus lose one of the primary learning and innovation paths for web app devs.

    Print Friendly, PDF & Email

    Steve Souders’ 14 rules for faster-loading websites

    Within a few years, if you notice in Firebug that a site doesn’t adhere to these, you should ask yourself why…

    These rules are an excerpt from Steve Souders’ best-seller, High Performance Websites.

    Since modular techniques for building dynamic pages (such as .ascx files in ASP.NET) don’t immediately work well with some of the rules (such as placing all scripts at the bottom of the page), new patterns are sure to emerge.

    Print Friendly, PDF & Email

    Top 7 most common mistakes with Javascript syntax

    Ok, my basis for saying “most common” is not at all scientific, just a gut sense of how many times I’ve seen 15 mins or more spent trying to find a bug that ended up being just a syntax error in one’s script. Tools like JSLint and Firebug are really helpful in general, but not with these issues. (I’ve heard Aptana should be, but I couldn’t get it to work and it kept crashing on me.)

    1. Spurious comma after the last entry in a JSON object string


    2. Use of = instead of : in a JSON object string


    3. Misspelled function and object names


      onReadyStateChange instead of onreadystatechange

    4. Spurious parenthesis after a function call


    5. Using parenthesis instead of braces for access within arrays and objects


    6. Spurious : between keyword case and a switch value

      case: 1:

    7. Expecting “False” or new Boolean(“False”) to evaluate as true
    Print Friendly, PDF & Email

    The Enter key: Making a mess of submit buttons and textboxes

    If one has a submit button in an HTML form, then pressing Enter will trigger the first of these in doc order, as though one pressed the button. At first glance, this seems like a nice feature, but in practice it leads to lots of problems. The root of the problem is that users forget or don’t know about this. (Browsers could help by giving special highlighting to such a button, as desktop apps often do.) Users might be typing in a textbox, not knowing or caring whether it’s a textbox (which will pass on any Enter to the form) or a one-line textarea (which would add a line ending to its content and not pass on the Enter). The Browse button of a file input (in some browsers) will also pass on an Enter rather than trigger a FileOpen dialog, even when it’s in focus.

    To prevent such errors, one can change all submit buttons to normal buttons and use script and a hidden input to transmit to the server the name-value pair that the submit button would have provided. Or, if there are no file inputs (checkboxes and radios might also be a problem, though), then one might try changing all textboxes to textareas. However, a gotcha with the second approach occurs when a user enters a string with no spaces that’s longer than the one-line textarea; in that case, a horizontal scrollbar will appear that might hide the text. One can try making the textarea taller, but in Firefox it has to be more than 20 px or else no vertical scrollbar will appear for multi-line entries (because the scroll thumbs are quite large in FF).

    Print Friendly, PDF & Email

    Simplifying markup and CSS selectors through “semantic” tags

    In a brown bag yesterday, Kevin Lawver suggested a best practice for DHTML: Prefer “semantic” markup instead of overuse of div. The principles are:

    • If the content identifies a major section, use an h tag;
    • If the content is list-like, such as left nav or header/footer links, use ul and li;
    • If the content is text-only, use p;
    • Otherwise, one might use div or span, but one might as well use shorter tags such as b or i…where one might adopt a practice of always wrapping text inputs with i and button or dropdown inputs with b and more general kinds of content with, say, u (assuming the styles of these tags are set to “vanilla” styles globally).

    Shorter tags make responses lighter. They also allow for CSS selectors that reflect the structure of the doc; for example, a selector containing “ul li i” suggests a container for a textbox (i.e., the i) within a radio group (i.e., the “ul li”)…which is a common pattern for a radio labelled “other:”. Note that since the selector uses type/tag names, the markup does not need to include a class attribute-value pair; that makes the markup and the CSS both shorter, too.

    But the key benefit proposed was that such markup is more self-documenting than lots of nested divs because the tags indicate the kinds of things they should contain. That is, the main benefit is improved readability and hence better maintainability.

    Read more on the microformats page about “POSH” 

    Print Friendly, PDF & Email

    Memory leaks due to iframes in IE (also how to file-upload via dojo)

    1) Changing the src of an iframe in IE may cause later events to fire multiple times – once for each change of src.  See

    06-24-2006, 09:21 AM
    Yeah, I see what you mean, big time slow down in IE. No problem in FF. I didn’t test any others. I Thought it might be a memory problem so I tracked memory usage in Task Manager. No real problem with memory usage but I noticed actual CPU usage was spiking and then getting pegged a 100%. The more I loaded pages into the iframe in IE after that the longer CPU usage would remain pegged at 100% and this corresponded exactly with the amount of time that the frame was blank. I then had a look at your source code and saw that you had commented out this line:

    //currentfr.detachEvent(“onload”, readjustIframe) // Bug fix line

    Those two little red slashes at the beginning make it a comment. Why did you do that? I’m like 99% sure that this is the problem as that is an IE specific line designed to prevent multiple instances of the resizing event. Without that line, each time you load something into the iframe an event gets attached to it. After 20 loads, you have 20 events all firing at the same time. Almost has to be it. Just remove the red slashes and you should be fine.

    2) Alex Russell of DojoToolkit explains that memory problems can occur with IE in both DOM event handlers and XHR due to the browser’s reference counting mechanism not realizing that it can recover some closures after they’re no longer needed.  See (which also explains how dojo can be used for file uploads).

    Print Friendly, PDF & Email

    Trick IE into reserving a connection for Comet traffic

    “Comet” is a dhtml technique for sending updates from the server to a browser client. One way to do it is to place a hidden iframe in one’s page and have the server write a <script> to the response whenever there’s an update; the script executes as soon as the client receives it, and it might popout a window containing a new message or change styling in the parent doc. In order for the client to keep getting updates, the server purposely never indicates that it has finished writing the response.

    A webapp that uses Comet typically needs one connection for Comet and one to send requests to the server whenever the user makes some kind of edit (aka RPC = remote procedure call).  While the second connection isn’t needed all the time, IE6 permits only two connections to be used at any time, so any other windows open in IE have to fight with your service for them, which sometimes leads IE to close one of your connections prematurely.

    Although it’s not friendly to such other services, one can trick IE into reserving one of its connections for your Comet page by requesting that page from a different domain than one makes RPCs from. If the server must maintain state, the domains can even point to the same machine (although you probably need to map both domains to a switch that then uses a cookie to find the particular machine)

    To simulate such a setup on a Window dev machine, add the following line to C:WINDOWSsystem32driversetchosts       fake-domain1                       fake-domain2

    Print Friendly, PDF & Email

    IE6 doesn’t support onclick handlers for

    Instead, you have to use an onchange handler in the SELECT, like this:

    function fAlertA() { alert(‘A’); }
    function fAlertB() { alert(‘B’); }
    function fHandleActionSelection(nSelect)
    var iOptionIndex = nSelect.selectedIndex;
    if (iOptionIndex > 0) { //skip over ‘Actions’
    var sFnName = nSelect.options[iOptionIndex].value;
    <select onchange=”fHandleActionSelection(this)”>
    <option value=”fAlertA”>a</option>
    <option value=”fAlertB”>b</option>

    Print Friendly, PDF & Email

    Showing and hiding rows of a table

    Great advice here:

    In FF1.5, if a <tr>’s style is changed from display:none to display:block, the row doesn’t appear correctly; for example, a <textarea> within a <td> might appear very small.

    The poor display gets worse if one toggles back and forth; for me, vertical space where the row would normally go kept growing with each toggle.

    The external blog entry points out that FF1.5 uses display:table-row for displaying, so one might do this:

      try { = “table-row”;
      } catch (e) { //for IE, etc = “block”;

    Print Friendly, PDF & Email

    “Internet Explorer cannot open the Internet site…Operation aborted”

    There was another problem that just appeared today where IE was popping an error saying “Internet Explorer cannot open the Internet site…Operation aborted”. Web search indicated people saw a similar problem when using the Google Maps API, evidently because IE doesn’t like scripted changes to a table DOM before the rendering finishes:

    The suggested change was to wrap the JS inside a setTimeout with wait of 1ms. That worked for me.

    Print Friendly, PDF & Email

    Unspecified error for createStyleSheet()

    Our rich text editor (RTE) was working in IE just yesterday, at least from my dev server. But once installed in QA, a JS error was being popped whenever the Compose page loaded, and the error indicated M$ method createStyleSheet(), which is called on the RTE iframe’s doc object. If one takes the option to debug offered by the error popup, VisualStudio can only say “unspecified error”, but then the RTE becomes usable!

    This suggested a timing problem, and sure enough, a web search turned up problems if one calls createStyleSheet() before the doc obj’s readyState property reaches “complete”:

    To workaround this, one could just keep trying:

    function foo (sCssPath) {

    try {


    } catch() {

    setTimeout(“foo(‘”+sCssPath+”‘)”, 10);



    But that runs the risk of looping forever. A better way might be to use the M$-only onReadyStateChange handler:

    if (bIE) {

    if (eDoc.readyState == “complete”) {


    } else {

    eDoc.onreadystatechange = function() {

    if (eDoc.readyState == “complete”) {






    Details at MSDN

    Print Friendly, PDF & Email

    Window.onload and document.body.onload are the same

    I’m used to coding with window.onload but while integrating code from someone else I found references to document.body.onload, and I thought this must be some other event.  But on page 135 of “Dynamic HTML” by Goodman, he says:

    by a quirk of HTML tag structure, all window object event handlers are associated with the BODY element

    Print Friendly, PDF & Email

    Don’t rely on the text of

    Sometimes I see dropdowns marked up like this:

    <option><%= localized string %></option>

    which forces the server-side code to do checks like this:

    if (sDropdownChoice == L10nStrings(“localized string”))

    Looking up strings server-side to do comparisons like this is is inefficient. Instead, markup the page like this:

    <option value=”sender”><%= localized string %></option>

    so the server code can compare like this:

    if (sDropdownChoice.Equals(“sender”))

    Even better, use a staticly-defined string such as,

    static public readonly string s_sSender = “sender”;

    and use that in both the markup creation and the comparison, which helps avoid typos and helps with maintainability by making sure that if the string ever needs changing, the update is made everywhere it should be.

    Print Friendly, PDF & Email

    Avoid script collisions via namespaces

    If you plan to use a JS function named something generic like “update()”, then you run the risk of another engineer on your team unknowlingly writing a same-named function and killing yours.  Just as bad, some external party like an ad vendor might have script with the same name.

    Another problem is when you have to maintain someone else’s code, and figure out where “update()” is defined.

    To avoid such problems, put all your methods in object(s) that are named after your dev team and that suggest what file they are defined in.  Furthermore, make the name specific enough to indicate the specific functionality.  For example, change “update()” to

    var aol = {wsl: {
    fUpdatePagesDropdown: function(sNewDefaultId) {


    If I see a JS call that starts “aol.wsl.”, I immediately know it’s defined in lite.js.  Similarly, if I have inline script in a page like Compose, I’ll create a namespace called “aolCompose”.

    btw, I like putting the commas before object properties, because it helps me remember not to put a comma after the last property — which is a common source of hard-to-debug problems.

    Print Friendly, PDF & Email

    Avoid cross-site scripting vulnerabilities

    “Cross-site scripting” (aka XSRF = cross-site request forgery) is an evil practice where someone tries to trigger your site into sending sensitive info, such as user logins, to their own site. A typical method is to “inject” script into your pages so the user’s browser will render the script as though it came from you, and the script might send the user’s cookies to the hacker site so they could append them to their own requests, making it very difficult for your server to tell the requests aren’t coming from the real user, and thereby allowing the hacker to damage the account.

    A typical trick is to put an unbalanced quote before some evil script into a form, such as:

    ‘ <script>document.location=”document.cookie”</script>

    The bet is that, if you take this submission and paste this value into your response html to confirm what was typed in, that you will wrap your input value in the same kind of quote. This embedded quote then breaks the html at that point, and the hacker’s <script> would be executed as though it was part of your intended markup.

    One way to avoid such vulnerabilities is to “sanitize” everything you paste into your pages that wasn’t created directly by you. For example, anything from the request or from a data-layer whose data is created by some user (such as email or contact names) should be sanitized before sending it out as part of a response.

    You’ll want one sanitizing method that accepts HTML and alters it such that control over fonts, colors, etc is preserved. You might also want a stricter method that escapes markup or strips it completely. The former is useful when sanitizing html-formatted email messages, and the later, for sanitizing names, addresses, filenames, etc where you do not want to allow any idiosyncratic styling.

    Writing a good sanitizer is hard. If I come across a good open-source one, I’ll post about it. AntiSamy seems like a great solution, and is open.

    Print Friendly, PDF & Email

    JavaScript debugging

    Copied from someone’s email:
    This link has great info, tips, tools to debug Dojo and JavaScript, etc.

    I believe some of you asked how to do step-thru debugging of individual JS files that use the Dojo toolkit. You have to set “djConfig.debugAtAllCosts=true” and make sure the page has the “dojo.hostenv.writeIncludes();” statement added in a script element. (I believe it has to be in the head section.)

    Print Friendly, PDF & Email

    Firefox 1.5 ignores changing tabIndex of input type=file

    One can change the tabIndex of a file input in FF1.5 and verify the value has changed, but the browser will ignore the change and use whatever value was staticly defined at page load.

    As a workaround, see if you can define ‘tabindex’ for all items in the markup. If you’ll be dynamicly adding non-file inputs, set their tabIndex to be the same as whatever they should come after in tab order — browsers use doc order to break ties in tabIndex values. To dynamicly add file inputs, put something like this in the markup:

      <input type=file id=file0 tabindex=N onchange… /><span id=file1container></span>

    Then, when you add a new file input, assign to the innerHTML property of the span rather than using createElement. (You probably also want to style the old input as display:none, depending on what you’re trying to do.)  The tabindex of the new input can be N also.

    UPDATE: Once the file input is filled, one might want to take it out of the tab order. Setting tabIndex = -1 has no effect in FF1.5, and setting the style to display:none (or visibility:hidden) will cause Safari 2.0.3 to ignore it when sending the request. I haven’t found a solution that works for all browsers.

    Print Friendly, PDF & Email

    Nonvisible file inputs are not “Accessible”

    File inputs are pretty ugly, and can be confusing to naive users (e.g., “Should I type what I want in that box?”).  Quirksmode describes a nicer solution that (Compose) happens to use:

    But if one cares about Big-A accessibility, such a solution won’t work for users with normal vision who are limited to the keyboard (no mice), such as those with carpal tunnel syndrome.  The reason it won’t work is that such users rely on tab order, and tabbing into an invisible file input consumes 2 tabs before one sees any visible result — the user won’t know they are on top of the Browse button.

    One might suggest putting the nonfunctional button in the tab order, and when it receives focus, call click() on the nonvisible file input. But in my experiments, click() did not trigger a FileOpen dialog.

    One might also suggest faking focus on the nonfunctional button by drawing a rectangle around it when the nonvisible file input receives focus.  But remember that file inputs consume 2 tabs — it might make sense to draw the rectangle on the second tab, when focus is on the Browse button, but what about the first tab?  Even if there are separate events from the file input from the tabs, which seems doubtful, what would one display for the first tab?

    A decent compromise is to get rid of the nonfunctional button and make the file input visible.  But like the use, hide the inputs once they are used and show a checkbox with the filename instead, with focus moved to this checkbox (so blind users will be aware of the result of their file selection).

    Print Friendly, PDF & Email

    Debugging Javascript in IE when you don’t control the page

    So everything works great from your dev server, but once deployed, a script problem is discovered. Even worse, it only happens in IE.

    Something that worked for me was to install Fiddler for .Net 2.0. There’s a nice walkthrough WMV video (why do voices in walkthrough videos all sound like the famous Ruby one?). The video doesn’t say much that one couldn’t figure out from the UI, with one exception: The black bar at the lower right for setting breakpoint patterns. For example, enter “bpu” followed by a space and some substring of the filename you’d like to break on. (If “Capturing” isn’t already showing in the bottom left status bar, click there and it will begin recording.) When one visits the page of interest, Fiddler will suspend the network request and flash in the taskbar. A red ‘T‘ will show in the left panel for the breakpointed resource, and a red box will show in the right panel. Click on the yellow ‘Break on Response’ button, and if it shows “response is encoded” below that, double-click on that message. You should see a ‘Text View’ tab; click on it. Edit the resource as you like, and then click the green ‘Run to completion’ button. Note that Fiddler breakpoints even if the file is already in cache!

    This allows for inserting JS alerts and such, but it seems it could also allow one to insert a script tag for FirebugLite, and then you’d be able to debug after page load, too.

    The particular problem this helped with was: Our rich text editor (RTE) wouldn’t allow one to type in it (after the page load and as long as the page was in view). This usually happened only after the cache had been cleared. This pointed to a timing issue, and sure enough, the problem was that the RTE’s iframe wasn’t complete before we tried to enable design mode on it. This is the same issue I blogged about with IE’s createStyleSheet method, but in this case there was no script error indicated. Our code to handle this case had a typo in it — I had spelled “onreadystatechange” as “onReadyStateChange” as O’Reilly’s Dynamic HTML: The Definitive Reference spells it, which is a typographic convention for highlighting the words within handler names that will cause you grief if you follow the spelling literally. Use all lowercase!

    Print Friendly, PDF & Email

    OnBeforeUnload fires twice in IE when href!=# and onclick causes nav away

    We had been using href=foo for leftNav links instead of href=# because screenreaders read <a href=#…> as “link this page” and that may mislead SR users to tab around looking for the content. (The real action of clicking is to submit a form via an onclick.)

    But we have to use href=# because otherwise IE fires onbeforeunload once because href!=# and once because the onclick causes a nav change (because it calls submit() on a form). We can’t use setTimeout to ignore the 2nd firing because we need the form submission to happen, and we can’t skip the 1st firing because we need it when the user clicks a non-link (such as form buttons and the window ‘X’ close button).

    So, we have to live with the screenreader (JAWS) behavior of reading these links as “link this page”.

    Print Friendly, PDF & Email

    IEDeveloperToolbar almost as useful as Firebug

    I was just told that there’s an IE Developer Toolbar:

    At first glance, it provides a significant fraction of the utility of Firebug, such as showing DOM structure changes post-load. It also has a “Trace Style” feature, evidently for revealing the cascade of styles that apply to a particular element, but when I tried it all it did was open a window displaying my stylesheet.

    It does not appear to support live editing of markup or styles.

    Print Friendly, PDF & Email

    Write better JS by watching YUI Theater

    Yahoo has a great set of videos about JS:

    The ones by Douglas Crockford — “The JavaScript Programming Language”, “An Inconvenient API: The Theory of the DOM”, and “Advanced JavaScript” — are intended to be watched in that order, but I think watching the DOM one first helps to clarify what happens in the browser when JS is run.

    btw, Yahoo Video has no rewind/forward capability, so you might want to download the slides as well (linked next to each video).

    Print Friendly, PDF & Email

    IE JS doesn’t support indexOf() for arrays

    Firefox supports indexOf() for arrays, but IE doesn’t (at least IE7 doesn’t). To work around this, write your own version…maybe like this:

    function fiArrayContains(ax, xTarget) {

    var n;
    if (fbIsArray(ax)) {

    n = ax.length;
    for (i=0; i < n; i++) {

    if (ax[i] == xTarget) {

    return i;



    return -1;


    function fbIsArray(x) {

    return fbIsObject(x) && x.constructor == Array;

    fbIsObject(x) {

    return (typeof x == ‘object’ && !!x) || fbIsFunction(x);

    fbIsFunction(x) {

    return typeof x == ‘function’;


    Print Friendly, PDF & Email