AI-Enhanced AAC: DEAN

Dynamic, Expressive, Augmented Narrator (DEAN) is an AI-enhanced AAC research platform designed to help people who use speech-generating devices participate more effectively in face-to-face conversation. Rather than requiring users to compose every message from scratch, DEAN listens to a communication partner’s speech, generates possible responses, and presents them for the user to select, preserving the user’s control, identity, and agency. It integrates research on conversational timing, rapid pragmatic actions, partner engagement, personalization, and expressive speech to support communication that remains connected to the flow of interaction. As a research probe, DEAN is being used to investigate whether conversational AI can help augmented speakers stay in time, in control, and recognizably themselves during real-world conversations. You can try out the interface yourself on the Demo Tab.


For decades, speech-generating devices (SGDs) have helped people compose and speak messages, but they remain poorly suited to the temporal and sequential organization of ordinary face-to-face interaction. Conversation unfolds enchronically, that is, in time: each action is fitted to what has just happened and helps shape what can happen next. Participants manage this flow through words, gaze, gesture, facial expression, overlap, continuers, repair, agreement, disagreement, and topic shifts, often within only a few seconds.

The composition constraints of SGDs disrupt this organization. While an augmented speaker is composing, the partner may not know what action is underway or which prior turn the emerging utterance is designed to address. During that delay, partners may continue talking to fill the silence, shift topics, or comment on the delay itself. As Project Enchrony showed, by the time the SGD utterance is spoken, the interactional environment may have changed enough that its relevance is harder to recognize.

Thus, composition delay is not simply slow message production. It can change the sequential status of an utterance, making an otherwise appropriate contribution harder to place, more likely to be misunderstood, or more vulnerable to being treated as late, irrelevant, or disconnected from the ongoing conversation.

Project DEAN was developed to explore whether conversational AI could help close that gap. Rather than asking an augmented speaker to compose every utterance from scratch, DEAN listens to the communication partner, generates possible responses, and presents those options to the augmented speaker for selection. The goal is not to replace the speaker, but to reduce the burden of composition while preserving the speaker’s control, identity, and agency.

From the beginning, DEAN was designed as a research probe, not a commercial device. Its purpose is to let us study what happens when AI enters AAC-mediated conversation:

  • Can the system respond quickly enough for augmented speakers to use their devices in Now and Near Time?
  • Are the generated utterances relevant and factually correct? 
  • Do they represent the augmented speaker’s point of view and current stance in the conversation?
  • Do they represent the interactional context, i.e., setting, participants, shared history?
  • How does the augmented speaker interact with the AI during face-to-face social interaction?
  • What role does the AI take in a conversation? What points of view does it represent? 
  • How should DEAN be designed to enable the augmented speaker to maintain pragmatically effective, engaged interactions?
  • Does speech prosody change when augmented speakers can use their devices within the Now and Near Time boundaries of the enchronic temporal-sequential frame?
  • Do DEAN’s contributions support or undermine the speaker’s self-presentation? 
  • Does DEAN help the partner remain engaged? 
  • How does DEAN represent the interactional context, i.e., setting, participants, shared history?
  • Can AI help augmented speakers remain in time, in control, and recognizably themselves during face-to-face conversation?

These questions shaped the entire development process.

Conversational timing became the central design challenge for DEAN. The system needed to operate within the temporal demands of conversation, not simply generate fluent language. This meant that DEAN also had to be evaluated against other interactional criteria: whether its responses were sequentially fitted, personally appropriate, and usable by the augmented speaker.

To guide this work, we developed three social-interaction criteria for AI-AAC:

  • Sufficiency and Sincerity: Does the response provide what the partner needs to understand, while remaining truthful and acceptable to the augmented speaker?
  • Timing and Sequencing: Does the response arrive quickly enough to remain connected to the prior turn?
  • Content and Manner: Does the response fit the speaker’s style, stance, self-presentation, and desired level of disclosure?

These criteria helped us move beyond ordinary AI benchmarks. A response can be grammatical and relevant in a general sense, but still fail as an AAC utterance if it is too late, too generic, too revealing, too formal, or simply not something the user would choose to say.

DEAN did not develop in isolation. It emerged from the broader Project Converse effort to understand why current AAC technologies remain poorly suited to face-to-face conversation and which design principles might better support augmented speakers in real interaction. Over the course of our earlier research, we came to see that the central design problem was not simply how to help someone produce a message. The deeper problem was how to help them remain active, visible, responsive, expressive, and socially present within the ongoing flow of conversation. Each of the projects we engaged in addressed a facet of this larger question and contributed to DEAN’s development.  

Enchrony provides the temporal-sequential foundation for work on DEAN. It showed that conversation is organized in time and that many augmented speakers are forced to operate their technologies outside the enchronic frame in which ordinary conversational responses are expected, recognized, and understood. Move responds to that problem by developing rapid access to short, pragmatically powerful words and phrases (continuers, repair initiators, agreements, disagreements, interruptions, social responses, and topic-management moves) that allow augmented speakers to act before the interactional moment has passed. Engage addresses the partner-coordination problem by examining how interface configurations affect shared attention, visibility of message construction, and mutual engagement during AAC-mediated interaction.

If Enchrony identified the importance of timing, Move identified the value of short, pragmatically focused language, and Engage identified the importance of partner coordination, Intone addresses how AAC utterances should sound when delivered, and whether they are delivered soon enough for vocal form to matter. In ordinary conversation, intonation, word stress, loudness, rhythm, and voice quality help speakers distinguish actions such as accepting, resisting, questioning, repairing, affiliating, or closing a sequence. However, in SGD-mediated interaction, utterances are often delayed long enough that the pragmatic force of intonation and word stress may be weakened or lost. After several seconds, the prior turn may no longer be active in the same way for the partner; the sequential moment has shifted, and the expressive contour of the utterance may no longer carry the same interactional impact.

DEAN brings these strands together in an AI-enhanced AAC probe.

Now Time = 0 to a few seconds; Near Time = 2 to 10 seconds; Delay Time = 10 seconds to several minutes.
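The three temporal zones can be expressed as a small classifier for measured response latencies, which is how timing logs could be read against the enchronic frame. This is a minimal sketch under stated assumptions: Now Time is given above only as "0 to a few seconds" while Near Time starts at 2 seconds, so the 2-second cut-off between them is an assumption, as is counting exactly 10 seconds as Near Time. The enum and function names are hypothetical.

```python
from enum import Enum


class TemporalZone(Enum):
    """Zones of the enchronic frame (names are illustrative)."""
    NOW = "Now Time"
    NEAR = "Near Time"
    DELAY = "Delay Time"


# Assumed boundaries in seconds; "a few seconds" is taken to end at 2 s.
NOW_LIMIT_S = 2.0
NEAR_LIMIT_S = 10.0


def classify_latency(seconds: float) -> TemporalZone:
    """Map the delay between a partner's turn and the AAC response to a zone."""
    if seconds < 0:
        raise ValueError("latency cannot be negative")
    if seconds < NOW_LIMIT_S:
        return TemporalZone.NOW
    if seconds <= NEAR_LIMIT_S:  # assumption: exactly 10 s still counts as Near Time
        return TemporalZone.NEAR
    return TemporalZone.DELAY
```

Under these assumptions, the roughly 1-second LLM response time reported for Year 5 falls in Now Time, while the roughly 10-second delay of 2023 sits at the boundary between Near Time and Delay Time.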

DEAN is our attempt to build an interactional AAC system in which timing, language, interface design, partner coordination, and expressive speech are treated as mutually dependent. The shared design principle across Project Converse is that AAC systems should be evaluated by how well they support participation in the temporal, sequential, embodied, and social organization of face-to-face conversation. The goal is not only to help augmented speakers produce messages, but to help them remain in time, in control, and recognizably themselves within the interaction.

Initial discussions about DEAN began in 2022, and active development began in 2023 as large language models became widely available. The first phase focused on whether LLMs could generate plausible AAC response options based on a partner’s spoken turn. This early work was promising, but it also revealed how difficult the problem really was.

Todd Hutchinson played a central role in this phase. Todd is a lifelong SGD user, co-researcher, and prototype evaluator. His autobiography provided a rich source of personal material for exploring whether DEAN could generate responses grounded in his life history, experiences, and conversational identity. The team used this material to personalize the system, then evaluated how well the generated responses aligned with Todd’s perspective.

The results were mixed. Research by Sayantan Pal and members of CADL showed that the large language model (LLM) trained on Todd’s book outperformed a variety of other LLMs (Pal et al., 2024). DEAN could also generate fluent, sometimes humorous, and often topically relevant utterances. However, many responses still did not work for Todd. Some of the language used by the LLM was not the language Todd was accustomed to using. Some disclosed information in ways he would not choose. Other utterances were redundant, overly polished, or interactionally awkward. This was a major turning point: training on background materials helped, but it was not enough. DEAN needed to support not just topical relevance, but speaker-aligned utterances: responses that Todd could accept as his own in that moment.

As DEAN developed, it became clear that the system could not be understood as simply “an LLM attached to an AAC device.” DEAN is a layered AI-AAC system with four interdependent layers.

The Hardware and Communications layer manages the technical flow of the system. It includes hardware, server communication, cellular connectivity, automatic speech recognition, text-to-speech, speech output, and logging. This layer determines whether DEAN can reliably capture partner speech, process it, return responses, and speak the selected utterance in the field. This layer is discussed in greater detail in the Hardware and Communication sections.

The Language layer generates and organizes candidate utterances. It includes the LLM, prompting strategies, personalization, light RAG, persona information, action-sequence prompting, and emerging discourse memory. This layer determines whether DEAN can produce language that is relevant, personal, non-redundant, and interactionally useful. This layer is addressed more fully in the LLM and Personalization sections.
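The RAG component listed above supplies personal background material to the LLM at generation time. DEAN's retrieval method is not described in this section, so the sketch below substitutes simple content-word overlap for the embedding search that many RAG systems use; the stopword list, function names, and prompt wording are hypothetical. It shows only the basic shape: rank stored passages against the partner's turn, then place the best matches in the prompt.

```python
import re
from collections import Counter
from typing import List

# Hypothetical stopword list so that function words do not drive retrieval.
STOPWORDS = {"a", "an", "and", "are", "do", "i", "in", "is", "it", "my",
             "of", "on", "the", "to", "you", "your"}


def content_words(text: str) -> Counter:
    """Lowercased word counts with stopwords removed."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return up to k passages sharing the most content words with the query."""
    query_words = content_words(query)
    scored = [(sum((query_words & content_words(p)).values()), p) for p in passages]
    # Python's sort is stable, so equally scored passages keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def build_prompt(partner_turn: str, passages: List[str]) -> str:
    """Place retrieved background ahead of the partner's turn."""
    context = "\n".join(f"- {p}" for p in retrieve(partner_turn, passages))
    return (f"Background about the user:\n{context}\n"
            f"Partner said: {partner_turn}\n"
            "Suggest three short replies the user might choose.")
```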

The Social layer represents the prompting sent to DEAN that influences its conversational behavior. This includes details about the augmented speaker’s personality and preferences (Persona), as well as guidance that fine-tunes how DEAN offers relevant spoken actions in response to its interpretation of partner talk. This layer is examined in greater detail in the Personalization section of this report.

The Interface layer is the visible and actionable representation of DEAN (buttons, layout, object behavior, etc.) that determines how the augmented speaker – and their partner – sees, evaluates, and uses DEAN in their interactions. All the other layers interact with the interface, which makes DEAN’s features, including its AI, usable or unusable for conversation. DEAN’s development has therefore focused on aligning all four layers. A more detailed discussion of this layer appears in the Interface section.

Since January 2026, DEAN has been field-tested with two participants, each using a DEAN interface customized to their access and representation needs. This study represents much of our research design effort this year. The design, implementation, and initial results of the study are presented in the Field Testing section of this report.

Across Project Converse, DEAN developed from a diverse set of AAC-related interaction projects, through early experiments with large language models, into a field-ready AI-AAC research probe used in our AI/social interaction research. Its central contribution is not merely that it adds AI to AAC, but that it provides a way to study what AI must become in order to support real conversation.

DEAN integrates the major strands of Project Converse. Enchrony provided the temporal framework. Move provided rapid pragmatic action. Engage highlighted partner coordination and interface visibility. Intone contributed expressive speech and prosody. Discourse memory added continuity across turns. Together, these components make DEAN a functional model for studying the future of conversational AAC.

The guiding question remains: Can AI help augmented speakers remain in time, in control, and recognizably themselves during face-to-face conversation? DEAN is our research platform for answering that question.

Directors: Higginbotham, Golleru
Associate Researchers: Buckley, Agarwal
Start Date: April 2023

DEAN’s hardware and network infrastructure evolved over four years from a multi-device laboratory prototype to a web-based system that participants access through a browser link on their own devices. This progression was not linear. Building this infrastructure required solving a series of interconnected engineering problems over the grant period: developing a stable server platform, integrating ASR and speech synthesis with an AAC interface, connecting a language model to live conversational input, and reducing end-to-end latency to a level that could support real interaction. Each phase produced working components while also revealing the constraints that shaped the next design, and these cumulative efforts became the hardware and communication foundation for DEAN, our Dynamic, Expressive, Augmented Narrator.

The earliest architecture, developed in 2023, established the core processing pipeline: partner speech is captured through a microphone, transcribed via ASR, sent to a language model for response generation, and returned as synthesized speech output through the OS-DPI interface. Todd Hutchinson served as the initial prototype evaluator during this phase, and his written materials, including a 60-chapter memoir, were used to study how personal information affected the relevance of AI-generated responses. This phase confirmed that the pipeline was technically viable but left latency, portability, and system stability unresolved. It also defined the technical requirements that would guide subsequent development: the system had to receive partner speech, convert it to text, generate relevant utterance options, display those options through OS-DPI, and produce speech output fast enough to support live conversation. The system also needed to log events and user selections so that we could evaluate timing, response quality, and interactional fit during research sessions.
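The requirements listed above describe a staged pipeline with event logging. The sketch below shows that control flow with injected stand-in functions; `transcribe`, `generate_options`, `choose`, and `speak` are hypothetical placeholders for the ASR service, the LLM, the augmented speaker's selection in OS-DPI, and text-to-speech, and the log format is an assumption rather than DEAN's.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EventLog:
    """Timestamped record of pipeline events for later timing analysis."""
    events: List[tuple] = field(default_factory=list)

    def record(self, name: str, detail: str = "") -> None:
        self.events.append((time.monotonic(), name, detail))


def run_turn(audio: bytes,
             transcribe: Callable[[bytes], str],
             generate_options: Callable[[str], List[str]],
             choose: Callable[[List[str]], int],
             speak: Callable[[str], None],
             log: EventLog) -> str:
    """One partner turn: ASR, then LLM options, then user selection, then TTS."""
    log.record("audio_received")
    transcript = transcribe(audio)
    log.record("asr_done", transcript)
    options = generate_options(transcript)
    log.record("options_shown", " | ".join(options))
    selected = options[choose(options)]  # the augmented speaker, not the system, decides
    log.record("option_selected", selected)
    speak(selected)
    log.record("speech_output", selected)
    return selected


# Usage with stand-in services (all hypothetical):
log = EventLog()
said = run_turn(
    b"<audio>",
    transcribe=lambda audio: "How was your weekend?",
    generate_options=lambda text: ["It was great.", "Pretty quiet, honestly."],
    choose=lambda options: 1,
    speak=lambda text: None,
    log=log,
)
```

Because every stage writes a monotonic timestamp, the gap between `audio_received` and `options_shown` gives the latency the augmented speaker actually experiences before options appear.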

Initial version of system architecture for Conversant AI

In 2024, we purchased and installed a dedicated server for AI development and connected OS-DPI to external AI services. We also developed a Raspberry Pi-based hardware configuration with an attached microphone, battery power, and cellular connectivity. The Raspberry Pi handled speech capture and managed communication among the partner, augmented speaker, OS-DPI interface, ASR, LLM (BlenderBot, GPT-4), and TTS components. This architecture supported portable testing and let us isolate the contribution of each component to latency and reliability. It also clarified that carrying external hardware into participant sessions was not sustainable as a long-term testing strategy.

First implementation of DEAN hardware: Raspberry Pi with battery, Wi-Fi hotspot, external microphone, and Microsoft Surface Pro (Todd used the Surface Pro as a tablet).

A significant source of instability during this period was the tunneling process used to route transcripts from the Raspberry Pi to the server. We tested cellular and mobile connectivity options to reduce reliance on participant Wi-Fi, but these were unstable in practice. We then replaced tunneling with direct HTTP posting, so ASR transcripts were transmitted to the backend server as soon as they were generated. This change reduced a recurring failure point without requiring changes to other system components.

We also modified the speech recording workflow. Earlier versions required the augmented speaker to press a record button before each partner’s utterance, which introduced an unnatural interaction burden: the user had to anticipate when the partner would speak, activate recording, wait for transcription and response generation, and then select a response option. We moved recording and speech transfer into OS-DPI directly so the browser could access the device microphone, send speech to the ASR system, receive the transcript, and forward it to the server without requiring manual input between turns.

A related challenge arose when partners continued speaking after an initial utterance was recognized, causing subsequent speech to exceed the LLM input buffer’s capacity. We addressed this by giving the augmented speaker a control that lets them choose whether to incorporate the additional partner utterance into a new round of response generation or to proceed with already-generated options. This gives the augmented speaker agency over the conversational pace rather than forcing the system to resolve the timing automatically.
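The control for continued partner speech can be modeled as a small buffer that holds new transcripts until the augmented speaker decides what to do with them. This is a hypothetical sketch: the class and method names, the choice to join pending transcripts into one new prompt, and the `max_chars` limit (keeping the most recent text when the combined turn is too long) are assumptions, not DEAN's implementation.

```python
from typing import Callable, List


class ContinuationBuffer:
    """Holds partner speech that arrives after options have been generated."""

    def __init__(self, generate: Callable[[str], List[str]], max_chars: int = 2000) -> None:
        self.generate = generate      # stand-in for the LLM call
        self.max_chars = max_chars    # assumed limit on text sent to the model
        self.current_turn = ""
        self.options: List[str] = []
        self.pending: List[str] = []

    def new_partner_turn(self, transcript: str) -> List[str]:
        """A fresh partner turn replaces earlier context and generates options."""
        self.current_turn = transcript[-self.max_chars:]
        self.pending.clear()
        self.options = self.generate(self.current_turn)
        return self.options

    def partner_continues(self, transcript: str) -> None:
        """Extra speech is held; the options on screen stay until the user decides."""
        self.pending.append(transcript)

    def regenerate(self) -> List[str]:
        """User choice 1: fold the additional speech into a new round of generation."""
        combined = " ".join([self.current_turn, *self.pending])
        # Keep the most recent text if the combined turn exceeds the assumed limit.
        self.current_turn = combined[-self.max_chars:]
        self.pending.clear()
        self.options = self.generate(self.current_turn)
        return self.options

    def keep_current(self) -> List[str]:
        """User choice 2: set the extra speech aside and use the existing options."""
        self.pending.clear()
        return self.options
```

Keeping the decision in two explicit methods mirrors the design choice described above: the system never resolves the timing on its own.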

We also migrated the project from Google Cloud Platform to a dedicated on-premises server at the University at Buffalo (UB), maintained by the university’s IT department. We had initially used cloud services because they were fast to set up and flexible for early experiments. Over time, they became expensive and created limitations when we needed additional RAM or GPU support for locally serving large language models. The UB server eliminated monthly hosting costs, gave us direct control over hardware upgrades, including GPU access, and provided institutional support for maintenance, backups, and security. Hosting human-subjects data on UB infrastructure also strengthened our compliance posture. The UB server now hosts OS-DPI along with several web applications, providing a stable and sustainable platform for all ongoing development.

The Prometheus dashboard is used to monitor the DEAN system status.

As DEAN became more distributed across OS-DPI, ASR, the LLM, speech synthesis, and server-side routing, we added Prometheus and Grafana to monitor system behavior during live testing. Prometheus collected metrics at the system and service level: API status, Raspberry Pi connection state, server activity, request timing, response-generation latency, connection health, and processing behavior across sessions. Grafana rendered these metrics in a dashboard we monitored in real time during testing.
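Finding where delays accumulate comes down to recording a duration for every pass through each stage and comparing per-stage summaries, which is what latency metrics and dashboard panels expose. The sketch below illustrates that bookkeeping with the Python standard library only; it does not use the Prometheus client library, and the class name and stage names are hypothetical.

```python
import statistics
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class StageTimer:
    """Collects per-stage durations, similar in spirit to a latency metric per service."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, stage: str) -> Iterator[None]:
        """Time the enclosed block and store the duration under `stage`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[stage].append(time.perf_counter() - start)

    def observe(self, stage: str, seconds: float) -> None:
        """Record a duration measured elsewhere (e.g., reported by a remote service)."""
        self.samples[stage].append(seconds)

    def summary(self) -> Dict[str, Dict[str, float]]:
        """Median, maximum, and count per stage, ordered slowest median first."""
        rows = {
            stage: {"median": statistics.median(v), "max": max(v), "count": len(v)}
            for stage, v in self.samples.items()
        }
        return dict(sorted(rows.items(), key=lambda kv: kv[1]["median"], reverse=True))

    def slowest_stage(self) -> str:
        if not self.samples:
            raise ValueError("no samples recorded")
        return next(iter(self.summary()))
```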

This infrastructure was necessary because most of DEAN’s early failures did not involve a single component breaking outright. Instead, they arose from small delays, brief interruptions, and misattributed errors distributed across components. Prometheus captured these signals continuously. Grafana let us identify where delays accumulated, whether the server and web interface were communicating correctly, whether the Raspberry Pi remained connected, and whether the system was stable enough for participant testing. The result was a shift from post-session guesswork to real-time diagnosis, which became especially important as DEAN transitioned from a Raspberry Pi-based prototype to a web-based system accessed by participants on their own devices.

The Grafana dashboard is used to monitor the DEAN system performance.

This same focus on system visibility also shaped our work to integrate participants’ everyday AAC systems into DEAN. For Hutchinson, we connected his Minspeak device to DEAN through Bluetooth so that his own device output could enter the system’s communication pathway. This integration added a new networking and attribution challenge: DEAN needed to distinguish between partner speech captured through ASR and augmented-speaker output coming from the AAC device. This distinction was important for two reasons. First, DEAN needs to recognize the augmented speaker’s self-composed utterances as part of the ongoing conversation, not only as selections from AI-generated options. Second, the augmented speaker’s contributions must be preserved in discourse memory so the system can maintain conversational coherence across turns. We continued to refine speaker attribution so that DEAN correctly identifies Hutchinson as the source of his Minspeak output rather than misattributing it to his partner.
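The attribution rule can be made explicit by tagging each incoming utterance with the channel it arrived on. A minimal sketch, assuming two hypothetical channel names (`asr` for microphone transcripts and `aac_device` for Bluetooth output from the user's own device): only partner turns trigger new response options, while both kinds of turn are kept in the history that discourse memory draws on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str   # "partner" or "augmented_speaker"
    text: str
    channel: str   # input channel the text arrived on


# Hypothetical mapping from input channel to speaker role.
CHANNEL_TO_SPEAKER = {
    "asr": "partner",
    "aac_device": "augmented_speaker",
}


class TurnRouter:
    def __init__(self, generate: Callable[[str], List[str]]) -> None:
        self.generate = generate  # stand-in for the LLM option generator
        self.history: List[Turn] = []

    def receive(self, channel: str, text: str) -> List[str]:
        """Attribute the text by channel, store it, and return any new options."""
        if channel not in CHANNEL_TO_SPEAKER:
            raise ValueError(f"unknown input channel: {channel}")
        turn = Turn(CHANNEL_TO_SPEAKER[channel], text, channel)
        # Every turn, including self-composed AAC output, enters the history.
        self.history.append(turn)
        if turn.speaker == "partner":
            return self.generate(text)
        return []  # the user's own utterance needs no suggested reply
```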

In Year 5, we transitioned DEAN from the Raspberry Pi-dependent prototype to a fully web-based system. Participants received a URL pointing to their personalized OS-DPI interface and could open DEAN on their own device using their own internet connection, microphone, and speakers. The server no longer needed to be paired with participant-side hardware for testing to proceed. This change reduced setup time, eliminated equipment transport, and made DEAN deployable for remote and field-based sessions.

Todd independently switching from Minspeak to DEAN.

Concurrent with the shift to web-based delivery, we replaced the language model backend with Llama 70B served through Groq, moving away from the GPT-4 and Azure TTS configuration used in earlier testing. End-to-end LLM response time decreased from approximately 10 seconds in 2023 to approximately 1 second in Year 5 testing. This reduction mattered practically: a 10-second delay between partner speech and the appearance of response options is too long to support live turn-taking, while a 1-second delay is within a range that participants can use in face-to-face conversation. We also integrated ElevenLabs for speech synthesis, which produced more natural-sounding output through the participant’s own device audio compared to earlier Azure TTS solutions. The web-based DEAN setup now combines personalized OS-DPI interfaces, Deepgram ASR, Llama 70B through Groq, retrieval-augmented generation (RAG) memory, persona information, action-sequence prompting, MOVE utterances, and ElevenLabs speech synthesis into a system that participants can access from their own devices.

High-level architecture of the current DEAN.

Each infrastructure decision across the grant period, from server migration to ASR integration to latency reduction, shaped what DEAN could do in practice. By Year 5, the system no longer required lab-managed equipment or external hardware. Participants could access their personalized DEAN interface from their own devices, in their own settings, which is the practical condition for the broader participant testing that continues in Year 5 and beyond.


A “persona” represents a specific type of user or customer within a target audience, often used in marketing and design to understand their needs, behaviors, and motivations (Wikipedia, 2024).

A central goal of DEAN is to ensure that AI-generated communication reflects the individual using the AAC system rather than producing generic responses. To accomplish this, we developed a persona framework that captures characteristics such as personality, demographics, lived experiences, physical abilities, interaction style, and important relationships. These elements are used to guide the AI toward responses that better align with the user’s context-based identity, preferences, and conversational patterns.
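One way to pass such a persona to the language model is to serialize it into the system prompt. The field names below mirror the characteristics listed above, but the data structure, the prompt wording, and the example values in the usage are hypothetical, not DEAN's actual persona format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Persona:
    name: str
    personality: List[str] = field(default_factory=list)
    demographics: Dict[str, str] = field(default_factory=dict)
    lived_experiences: List[str] = field(default_factory=list)
    physical_abilities: List[str] = field(default_factory=list)
    interaction_style: List[str] = field(default_factory=list)
    relationships: Dict[str, str] = field(default_factory=dict)  # person -> role

    def to_system_prompt(self) -> str:
        """Render the persona as plain-text guidance for the LLM."""
        sections = {
            "Personality": ", ".join(self.personality),
            "Demographics": "; ".join(f"{k}: {v}" for k, v in self.demographics.items()),
            "Lived experiences": "; ".join(self.lived_experiences),
            "Physical abilities": "; ".join(self.physical_abilities),
            "Interaction style": ", ".join(self.interaction_style),
            "Important relationships": "; ".join(
                f"{who} ({role})" for who, role in self.relationships.items()),
        }
        lines = [f"Suggest short utterances that {self.name} might choose to say."]
        # Skip empty sections so the prompt stays compact.
        lines += [f"{label}: {text}" for label, text in sections.items() if text]
        lines.append(f"Write in {self.name}'s voice; {self.name} selects what is spoken.")
        return "\n".join(lines)
```

For example, `Persona("Alex", personality=["wry", "direct"], relationships={"Sam": "brother"}).to_system_prompt()` yields a four-line prompt containing only the personality and relationship sections.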

To evaluate the impact of persona information, we compared responses generated by a large language model (LLM) with and without persona-based training. By November 2025, we had assessed approximately 125 generated utterances. Responses from the persona-enhanced system were judged to be directly related to the user’s personal information 78% of the time, compared to just 19% for the standard LLM. While the non-personalized model directly answered questions more often, its responses were generic and repetitive.

However, when prompted with a persona, the LLM responded much more appropriately.

One shortcoming we noticed in DEAN’s responses to Todd during our initial testing was a homogeneous “stance” on the topic (Higginbotham et al., 2025). The responses often included reworded versions of the same idea, resulting in differences without distinction. This left Todd with only one way to respond despite having a variety of options. For example, when we asked, “Do you prefer to be alone during a snowstorm, or do you like having people around?”, DEAN responded with two semantically redundant utterances: 1) I like being alone because it’s a great time to relax and read, and 2) I prefer being alone for relaxation and solitude.

We considered how to address this problem by examining the connection between communicative purpose and language choices. In other words, we asked how a speaker’s understanding of the type of communicative action or action sequence they believe they are engaged in (e.g., opening or closing conversations, sharing information, making requests) influences their interpretation of their partner and their choice of language in replying (Drew, 1997, 2018; Levinson, 1983). For instance, we respond differently when “Hey” is used as a greeting (e.g., A: “Hey, B!” B: “Hey! How are you?”) than when it is used to summon our attention (A: “Hey, Bea!” B: “Yes?”). We wondered whether making information about action sequences available to the LLM through targeted prompting could help it create responses that reflect more varied communicative actions.

For our pilot study we developed a testing interface that allowed us to review and revise our prompt easily.

Code for system prompt in Jupyter notebook testing interface.

Initial trials revealed that a multi-step process was necessary to guide the LLM through the logic of first identifying possibly relevant action sequences for a given target utterance and then generating a response appropriate to each labeled action sequence.
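The two-step logic can be sketched as two chained prompts: the first asks the model to label the partner's utterance with candidate action sequences from a fixed list, and the second asks for one response per label. `ask_llm` stands in for any chat-completion call; the label subset (drawn from categories named in this section), the prompt wording, and the line-based parsing are assumptions for illustration, not the study's actual prompt.

```python
from typing import Callable, Dict

# Subset of the category labels discussed in this section, in one consistent format.
ACTION_SEQUENCES = [
    "Apology - acceptance/rejection",
    "Complaint - excuse/apology",
    "Disclosure - support/minimization",
    "Emotion Display - empathy/minimization",
]


def label_prompt(utterance: str) -> str:
    """Step 1 prompt: ask which listed sequences the utterance could begin."""
    options = "\n".join(f"- {label}" for label in ACTION_SEQUENCES)
    return (
        "Which of these action sequences could the utterance below begin?\n"
        f"{options}\n"
        "Answer with one label per line, copied exactly from the list.\n"
        f"Utterance: {utterance}"
    )


def response_prompt(utterance: str, label: str) -> str:
    """Step 2 prompt: ask for a second part fitted to one labeled sequence."""
    return (
        f"The partner said: {utterance}\n"
        f"Treat this as the first part of a '{label}' sequence and write one "
        "short response the user could give as the second part."
    )


def action_sequence_responses(utterance: str,
                              ask_llm: Callable[[str], str]) -> Dict[str, str]:
    """Run both steps and return one candidate response per label."""
    raw_labels = ask_llm(label_prompt(utterance))
    # Strip list bullets and skip blank lines in the model's answer.
    labels = [line.strip().lstrip("- ").strip()
              for line in raw_labels.splitlines() if line.strip()]
    return {label: ask_llm(response_prompt(utterance, label)) for label in labels}
```

Because each option is generated against a different label, the options are pushed toward different communicative actions rather than rewordings of one stance.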

First and second part sequences resulting from action sequence prompts.

When the full pilot study was completed, findings indicated that the GPT model effectively matched utterances with action sequence labels both with and without the action sequence prompt; however, the labels it assigned were not drawn entirely from the list provided in the prompt. While the ‘new’ labels were seen as qualitatively appropriate (according to the researchers’ informal assessment), the process by which the model generated these labels was unclear.

A larger-scale trial of the action sequence prompt was undertaken, and the results were presented at ASHA 2025. Responses produced under both conditions were evaluated by researchers at CADL using a questionnaire developed in accordance with the criteria identified by Higginbotham et al. (2025), published in ATOB, and the results were compared. Analysis indicated that the responses produced under the action sequences prompting condition were, overall, more varied than those produced using the LLM alone.

The data were then examined to determine which types of first utterances (i.e., belonging to which action sequence categories) generated the largest differences in the variety of responses between the prompting conditions. The slide below summarizes three levels of differentiation across the 16 action-sequence categories. 

Seven categories were strongly differentiated (≥ 0.80), four showed moderate differentiation (0.40–0.60), and six were not distinguishable from the non-prompting condition in the judges’ ratings.

When we conducted a post hoc review of the six sequences that did not show a difference between the action sequences prompt and the LLM alone, we identified several factors that may have contributed to the lack of change. Response quality appeared to suffer when two sequence labels shared the same word. This happened twice: once with the word “minimization” in Emotion Display – empathy/minimization and Disclosure – support/minimization, and again with the word “apology” in Complaint – excuse/apology and Apology – acceptance/rejection. In each case, one of the sequences was on the “No difference” list and the other was on the “Medium difference” list. The Apology – acceptance/rejection sequences also included one hallucination, in which the AI offered an apology in response to a query that itself contained an apology. For the Complaint – excuse/apology sequences, when the action sequence prompt was used, the model often offered an apology first and then added a phrase providing an excuse. Interrater differences in the interpretation of “excuse” may have affected ratings of the responses to these items. Six of the eight sequences showing no discrimination between the prompting condition and the LLM alone were found to have formatting inconsistencies within the prompt, with slashes and dashes used interchangeably. Finally, two of the labels may have been underspecified and/or lacked clear exemplars (e.g., Narrative / Response).
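Two of the issues discussed in this section, labels written with interchangeable slashes and dashes, and model-assigned labels that fall outside the prompt's list, can both be checked mechanically before and after a trial. A minimal sketch, assuming an arbitrarily chosen house style of `First part - option/option` with an ASCII hyphen and no hyphenated words inside labels; the function names and rules are hypothetical.

```python
import re
from typing import Iterable, List

# Assumed house style: "First part - second/second" with a spaced ASCII hyphen.
HOUSE_STYLE = re.compile(r"[^-/]+ - [^-/]+(/[^-/]+)*")


def normalize_label(label: str) -> str:
    """Canonical comparison form; assumes labels contain no hyphenated words."""
    text = label.strip().lower()
    # Unify hyphen, en dash, and em dash separators (with any spacing) as " - ".
    text = re.sub(r"\s*[-\u2013\u2014]\s*", " - ", text)
    # Unify spacing around slashes between alternatives, e.g. "excuse / apology".
    text = re.sub(r"\s*/\s*", "/", text)
    return re.sub(r"\s+", " ", text)


def find_off_list(assigned: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Model-assigned labels that match no allowed label after normalization."""
    allowed_norm = {normalize_label(a) for a in allowed}
    return [a for a in assigned if normalize_label(a) not in allowed_norm]


def inconsistent_separators(labels: Iterable[str]) -> List[str]:
    """Prompt labels that deviate from the house style and may need rewriting."""
    return [label for label in labels if not HOUSE_STYLE.fullmatch(label)]
```

Normalization handles comparison only; a label such as "Narrative / Response", where a slash stands in for the part separator, is flagged by `inconsistent_separators` for a human to rewrite.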

These results suggest that action sequence prompting can make a difference, at least for some categories, by providing a greater variety of responses than the non-prompting condition. Practically speaking, the results also show which areas need further work to achieve greater differentiation from the non-prompting condition. Qualitative results from our recent field test of DEAN, an AI-enhanced SGD probe (see the Field Testing section for details), indicate that the lack of responses reflecting a variety of communicative acts, especially disagreement, is one of the most pressing concerns for SGD users interacting with an AI-enhanced device. Making the literature on the role of action sequences in conversation more usable to the LLM through prompting may prove to be a viable strategy for broadening the communicative range of the contributions it generates. Trials using a prompt revised to address the limitations of this study will be conducted in the coming weeks.


AI-AAC systems require more than real-time response generation. To support coherent, relevant, and user-centered conversation, the system must be able to remember what has already been said, who said it, when and where it occurred, and how prior turns relate to the current moment. In DEAN, we are developing a conversational discourse memory structure that allows the system to track the unfolding interaction, connect current turns to prior exchanges, and integrate this discourse history with the augmented speaker’s persona and background information.
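The properties listed above (what was said, who said it, when and where, and how it relates to earlier turns) suggest a simple record type for each turn. This sketch is an assumption about how such a store might look, not DEAN's data model: `reply_to` links a turn to the earlier turn it responds to, `thread` follows those links back, and `context_for_prompt` combines recent turns with a persona summary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class MemoryTurn:
    turn_id: int
    speaker: str                    # who said it
    text: str                       # what was said
    timestamp: datetime             # when
    location: str                   # where
    reply_to: Optional[int] = None  # earlier turn this one responds to


class DiscourseMemory:
    """Append-only store of turns, paired with the user's persona summary."""

    def __init__(self, persona_summary: str) -> None:
        self.persona_summary = persona_summary
        self.turns: List[MemoryTurn] = []

    def add(self, speaker: str, text: str, location: str,
            reply_to: Optional[int] = None,
            timestamp: Optional[datetime] = None) -> MemoryTurn:
        # Links may only point backward, which rules out cycles.
        if reply_to is not None and not 0 <= reply_to < len(self.turns):
            raise ValueError(f"reply_to must refer to an earlier turn: {reply_to}")
        turn = MemoryTurn(len(self.turns), speaker, text,
                          timestamp or datetime.now(), location, reply_to)
        self.turns.append(turn)
        return turn

    def thread(self, turn_id: int) -> List[MemoryTurn]:
        """Follow reply_to links back to the turn that opened the exchange."""
        chain: List[MemoryTurn] = []
        current: Optional[int] = turn_id
        while current is not None:
            turn = self.turns[current]
            chain.append(turn)
            current = turn.reply_to
        return list(reversed(chain))

    def context_for_prompt(self, last_n: int = 5) -> str:
        """Persona summary plus the most recent turns, formatted for an LLM."""
        recent = self.turns[-last_n:] if last_n > 0 else []
        history = "\n".join(f"{t.speaker}: {t.text}" for t in recent)
        return (f"About the user: {self.persona_summary}\n"
                f"Recent conversation:\n{history}")
```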

Conversational discourse memory refers to the ability to track, store, and retrieve extended stretches of talk. In ordinary conversation, participants use this memory to understand references to earlier turns, follow the progression of ideas, avoid unnecessary repetition, and make contributions that remain relevant to the ongoing interaction. This process depends on both working memory, which supports immediate processing, and long-term memory, which supplies shared experiences, background knowledge, and broader expectations about how conversations unfold.

For augmented speakers using DEAN, conversational discourse memory presents an engineering problem. The augmented speaker brings their own memory, intentions, and background knowledge to the interaction, but DEAN also needs an internal representation of the conversation in order to generate useful candidate utterances. Without a discourse memory structure, DEAN can only respond to isolated turns. With discourse memory, the system can listen, remember, and reason across the interaction, allowing it to support more coherent, contextually grounded, and interactionally appropriate AAC output.

To guide this work, we reviewed cognitive science models of discourse comprehension and identified Walter Kintsch’s Construction-Integration model as a strong basis for DEAN’s discourse memory architecture. Kintsch’s model distinguishes among three levels of discourse representation: surface memory, textbase memory, and the situation model. Together, these levels provide a useful framework for engineering a memory system that can retain exact wording when needed, extract propositional meaning, and maintain a broader understanding of the conversational situation.

Surface memory refers to the retention of the exact wording and linguistic form of an utterance, including phrasing, syntax, and specific lexical choices. This level is important when the system needs to quote or refer back to something that was said directly.

Textbase memory captures the explicit ideas or propositions conveyed by an utterance, independent of the exact wording. This level supports summarization, tracking of claims, and recognition of what information has already been introduced.

Situation-model memory represents the broader understanding of the conversation, including events, relationships, intentions, implications, and relevant background knowledge. This level is especially important for inference, prediction, topic continuity, and contextually appropriate response generation.

For example, if a partner says, “I can’t go out tonight because my car broke down again,” the three memory levels would support different kinds of representation:

An example of different memory levels.

In conversation, all three levels operate together. Participants hear the exact words, register the main ideas, and build a broader understanding of what is happening. Over time, exact wording usually fades unless it is especially memorable, such as an insult, compliment, joke, or emotionally charged phrase. What remains most strongly is the situation model: the participant’s broader understanding of what the conversation meant and why it mattered.
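A memory system could approximate this fading with a simple retention rule. The sketch below is an assumption for illustration, not DEAN's documented policy: surface traces older than a hypothetical `surface_window` of turns are dropped unless they are flagged as memorable, while textbase and situation-model entries (not shown) are kept.

```python
from dataclasses import dataclass


@dataclass
class SurfaceTrace:
    turn_index: int
    wording: str
    memorable: bool = False  # e.g. an insult, compliment, joke, or emotionally charged phrase


def prune_surface(traces, current_turn, surface_window=5):
    """Keep exact wording only for recent turns or for turns flagged as memorable.

    Only surface traces fade here; textbase and situation-model memory are left intact.
    """
    return [
        t for t in traces
        if t.memorable or current_turn - t.turn_index <= surface_window
    ]


traces = [
    SurfaceTrace(1, "I'll be back by noon."),
    SurfaceTrace(2, "You're hopeless at this!", memorable=True),
    SurfaceTrace(9, "Pass the salt."),
]
print([t.wording for t in prune_surface(traces, current_turn=10)])
```

Turn 1 falls outside the window and is dropped, turn 2 survives because it is flagged as memorable, and turn 9 is still recent.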

A simple example illustrates the distinction:

  • Surface memory: “She literally said, ‘I’ll be back by noon.’”
  • Textbase memory: She said she would return at noon.
  • Situation model: She plans to be home around lunchtime, so I can meet her afterward.
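The same example can be held as a single record with one field per level. In this sketch the textbase is written as a predicate-argument tuple, loosely following Kintsch-style propositional notation. The `MemoryEntry` type and the `RETURN` proposition are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """One utterance held at Kintsch's three levels of representation."""
    surface: str      # exact wording and linguistic form
    textbase: tuple   # proposition as (predicate, argument, ...), independent of wording
    situation: str    # broader interpretation, including inferences and plans

    def at_level(self, level: str):
        if level not in ("surface", "textbase", "situation"):
            raise ValueError(f"unknown level: {level}")
        return getattr(self, level)


noon = MemoryEntry(
    surface="I'll be back by noon.",
    textbase=("RETURN", "she", "noon"),
    situation="She plans to be home around lunchtime, so I can meet her afterward.",
)
print(noon.at_level("textbase"))  # ('RETURN', 'she', 'noon')
```

A quotation request would read the surface field, a summary would read the textbase, and response generation would draw mainly on the situation field.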

Integration of Kintsch’s Model into DEAN

When an augmented speaker uses DEAN, the system records the typically speaking partner’s contributions through automatic speech recognition and tracks the augmented speaker’s selections from the user interface, including alphabetic entry, MOVE vocabulary, and AI-generated options. DEAN’s discourse memory structure uses these inputs to maintain multiple levels of conversational context.

At the surface level, DEAN preserves selected wording from both the partner and the augmented speaker. At the textbase level, it identifies the main ideas, propositions, and topical content of the exchange. At the situation-model level, it constructs a continuously updated representation of the interaction, including relevant participants, topics, intentions, prior commitments, unresolved issues, and background knowledge from the user’s light RAG.
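The situation-model level can be sketched as a running record that each turn updates, whichever input channel the turn came from. The input-source labels, the `SituationModel` fields, and especially the keyword-overlap retrieval are hypothetical. The retrieval function stands in for the user's light RAG and does not reproduce any RAG library's API.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the input channels named in the text.
INPUT_SOURCES = {"asr", "alphabetic", "move", "ai_option"}


@dataclass
class SituationModel:
    participants: set = field(default_factory=set)
    topics: list = field(default_factory=list)        # topic history, most recent last
    commitments: list = field(default_factory=list)   # promises or plans made in the talk
    unresolved: list = field(default_factory=list)    # questions still awaiting an answer
    background: list = field(default_factory=list)    # user facts retrieved for this talk

    def update(self, speaker, source, topic=None, commitment=None,
               question=None, resolves=None):
        """Fold one turn's extracted content into the running situation model."""
        if source not in INPUT_SOURCES:
            raise ValueError(f"unknown input source: {source}")
        self.participants.add(speaker)
        if topic and (not self.topics or self.topics[-1] != topic):
            self.topics.append(topic)
        if commitment:
            self.commitments.append(commitment)
        if question:
            self.unresolved.append(question)
        if resolves in self.unresolved:
            self.unresolved.remove(resolves)

    def add_background(self, query, knowledge, k=2):
        """Naive keyword-overlap retrieval, a stand-in for the user's RAG store."""
        words = set(query.lower().split())

        def overlap(fact):
            return len(words & set(fact.lower().split()))

        for fact in sorted(knowledge, key=overlap, reverse=True)[:k]:
            if overlap(fact) > 0 and fact not in self.background:
                self.background.append(fact)


model = SituationModel()
model.update("partner", "asr", topic="ATIA", question="When did you get there?")
model.update("augmented_speaker", "move", resolves="When did you get there?")
model.add_background("ATIA memories", ["Todd and Jenna were both at ATIA", "Linda uses eye tracking"])
print(model.unresolved, model.background)
```

In a real system, the topic, question, and commitment arguments would come from an extraction step over the textbase. That step is deliberately left out here.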

This memory structure is intended to support DEAN’s ability to generate utterances that are not merely locally responsive, but discourse-aware. The goal is for DEAN to recognize what has already been established, what remains unresolved, what the partner is likely referring to, and how the augmented speaker’s next contribution can fit the ongoing interaction.
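As an illustration of how a discourse-aware request differs from a single-turn request, the sketch below assembles a context block from what has been established, what remains unresolved, and the most recent partner turn. The function name, its parameters, and the text layout are hypothetical and do not describe DEAN's actual prompt format.

```python
def build_discourse_context(recent_turns, established, unresolved, persona, max_turns=4):
    """Assemble a discourse-aware context block for candidate generation.

    recent_turns is a list of (speaker, text) pairs, oldest first.
    """
    lines = [f"Speaker persona: {persona}"]
    if established:
        lines.append("Already established: " + "; ".join(established))
    if unresolved:
        lines.append("Still unresolved: " + "; ".join(unresolved))
    lines.append("Recent turns:")
    for speaker, text in recent_turns[-max_turns:]:
        lines.append(f"  {speaker}: {text}")
    # By default, the latest partner turn is what the next contribution should fit.
    target = next((text for speaker, text in reversed(recent_turns) if speaker == "partner"), None)
    if target:
        lines.append(f"Respond to: {target}")
    return "\n".join(lines)


context = build_discourse_context(
    [("partner", "Did you go?"), ("augmented_speaker", "Yes"), ("partner", "Was it fun?")],
    established=["Todd went to ATIA"],
    unresolved=["Who drove?"],
    persona="Todd, playful and direct",
)
print(context)
```

Listing established and unresolved material separately is meant to help the generator avoid re-asking settled questions while still giving it open threads the augmented speaker might take up.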

In this way, conversational discourse memory becomes a central component of DEAN’s AI-AAC architecture. It allows the system to move beyond isolated prompt-response generation toward a more interactionally grounded model of conversation: one that listens to the unfolding discourse, remembers relevant prior material, integrates that material with user-specific knowledge, and helps the augmented speaker produce responses that are timely, coherent, and personally meaningful.

Early Development

Designing DEAN

Designing for Different Users

Field Work Probes

From the inception of Project Converse, we were both interested in and concerned about interface development. While we wanted to explore user interfaces for our projects, we were wary that premature interface designs would constrain our subsequent research design. So we held off and designed paper-based interfaces that could be modified on the spot (written on, cut up, and so on) to inform the next interface iteration. We also built a few OSDPI prototypes to examine the feasibility of using head and eye tracking for lexical choice and prosodic control, as demonstrated below:

Using head and eye tracking with SmyleMouse for prosodic control.

Our first interface was developed to test the functioning of our trained large language model (LLM).

An early 2023 sketch of the system, interface and human-computer interaction setup.
The interface used for LLM testing in Feb 2024.

As we had feared, our attention became fixated on the large language model until Todd, our co-researcher and a participant in these experiments, raised concerns about his experience: the frequent inappropriate or inadequate utterance choices and the LLM's inability to represent his intentions and identity. Todd brought us back to reality, reminding us that we needed to design for interaction rather than fixate on improving an inadequate LLM.

Interface development for Project Engage and Move. 

During the spring and summer of 2023, we developed a series of interfaces for our Project Engage and Project Move research. Here is a preliminary design sketched out in OSDPI. For Project Engage, we wanted to test how different interfaces affected interaction.

We used OSDPI as a sketch tool for Engage.

You can read more about the next two interface iterations in the Engage section of our website, under Projects. These were eventually incorporated into the final DEAN interface.

Engage communication board interface with no composition display.
Engage Interface with a display and repair vocabulary

During this time, we also developed a WebSocket connection between OSDPI and the AI, which we demonstrated at the International Augmentative and Alternative Communication conference in 2023. This connection became a critical component of all our subsequent research:

First conversant AI demo.

After our initial paper prototype testing of the Move interface, Katrina Fulcher-Rood led the OSDPI implementation of Move. Below are some early interface designs for our 2024 research project, beginning with an interface that copied the organization of our paper-based Move interface:

Making the paper-based Move interface digital.

It became evident that displaying everything at once made the interface too complex, so we developed a version in which different pragmatic views could be accessed via border tabs. This version was used in our Move research study. You can read more about this work in the Move Project.

A revised digital interface.

Here is a video of the first integration of the Move language actions into an early DEAN prototype:

Combining Move with early DEAN prototype.

By the end of 2024, our projects were far enough along that we could begin integrating them into DEAN (Dynamic, Expressive, Augmented Narrator). The goal of DEAN was to combine the quick pragmatic language actions afforded by MOVE with the responsive utterances provided by our conversational AI. Here are two designs developed for Todd in the spring of 2025. First, we designed an AI interface for Todd with a keyboard:

AI interface with pragmatic phrases and a keyboard.

Using design concepts from Project Move, the Move vocabulary can be changed by selecting the rounded buttons on the right.

AI response selection buttons get moved down in next iteration.

In a second iteration, we kept the Move buttons, made the QWERTY keyboard a pop-up, and moved the AI selection buttons. We had initially placed each AI selection button directly below its response utterance, but Todd had difficulty accessing them, so we moved them to the bottom of the screen for faster and more accurate selection.

Our first user test of DEAN was a playful conversation between Todd and Jenna about their time at ATIA:

Todd and Jenna discuss a shared memory while using DEAN.

As our Year 5 research agenda took shape, we recruited Linda, an augmented speaker with 12 years of experience using Tobii Dynavox technologies following her ALS diagnosis. Her initial spelling-and-AI interface was adapted for eye tracking with larger buttons, and the AI-generated selections were placed along the bottom of the screen to better match her gaze patterns.

Initial AI and typing interface for eye tracking access method.

In late summer 2025, we began to build the final set of DEAN interfaces, combining Move with AI-utterance prediction. This version of the MOVE-integrated DEAN interface is organized around conversational actions. Each selection produces a ready-to-speak communicative move, allowing users to contribute without first composing a full message. To support mutual gaze and partner engagement, we removed the message display and centered the interface on direct selection. We also adapted the layout for mobility limitations: the numbered and arrow controls are positioned along the lower edges because Todd and Linda had difficulty accessing the upper quadrants of the display. This design reduces reach demands while keeping conversational actions accessible.

Optimized layout for reach.

We reduced the number of CAI-generated candidate utterances from eight to two, lowering both generation time and the user’s reading and scanning burden. The utterances displayed on the right update dynamically based on DEAN’s inference about the relevant action sequence. This served as an additional experimental manipulation, allowing us to test whether a smaller set of sequence-aware suggestions improves interactional fit, uptake, and timely participation. Rather than relying on extended message construction, DEAN supports conversation through immediately available, action-oriented responses designed to remain accessible and socially usable.
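The reduction to two sequence-aware suggestions can be illustrated with a small selection routine. The sketch assumes that each candidate arrives labeled with the act it performs and a confidence score, and it uses a hypothetical table of acts that fit the partner's inferred action. It prefers fitting candidates and avoids showing two candidates that perform the same act. None of these names come from DEAN itself.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    action: str    # communicative act the candidate performs, e.g. "agree", "disagree"
    score: float   # generator or ranker confidence, higher is better


# Hypothetical map from the partner's action to the replies that fit it in sequence.
EXPECTED_NEXT = {
    "question": {"answer", "ask for clarification"},
    "assessment": {"agree", "disagree"},
    "request": {"accept", "decline"},
}


def pick_candidates(candidates, partner_action, n=2):
    """Return up to n candidates, preferring sequence fit, then score, then distinct acts."""
    fitting = EXPECTED_NEXT.get(partner_action, set())
    ranked = sorted(candidates, key=lambda c: (c.action in fitting, c.score), reverse=True)
    chosen, seen = [], set()
    for c in ranked:                 # first pass: at most one candidate per act
        if len(chosen) < n and c.action not in seen:
            chosen.append(c)
            seen.add(c.action)
    for c in ranked:                 # second pass: fill any remaining slots by rank
        if len(chosen) < n and c not in chosen:
            chosen.append(c)
    return chosen


cands = [
    Candidate("Yes, much better.", "agree", 0.9),
    Candidate("Totally, love it.", "agree", 0.8),
    Candidate("I'm not so sure.", "disagree", 0.6),
    Candidate("Want to get lunch?", "redirect", 0.95),
]
print([c.text for c in pick_candidates(cands, "assessment")])
```

For an assessment, this returns one agreement and one disagreement rather than two agreements. That design choice targets the lack of disagreement reported in the field test, but it has not been evaluated here.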

In early 2026, we finalized the DEAN probes used to test Move and AI in our field tests:

Todd’s direct-selection Move interface. The yellow buttons allow him to manipulate which partner utterance he wants the AI to respond to.
Linda’s eye tracking interface.
Linda’s spelling interface. Note: some of the letters were moved out of typical QWERTY order because Linda had difficulty accessing the upper left area of this display. 

The initial results of our field research can be found in the DEAN Field Test section of this website, which also includes videos of Todd and Linda using DEAN to converse.

Conversation is made up of actions called moves. These include seeking information or assistance, agreeing or disagreeing, teasing, redirecting, commenting, repairing, holding the floor, giving, and describing, and over the course of a conversation we must produce fast, simple, and pragmatically powerful responses. Typically speaking partners can contribute both short moves (e.g., “wait,” “yes,” “you’re wrong”) and longer, more detailed utterances with similar efficiency within the enchronic time frame. Our goal in developing our AI-enhanced probe, DEAN, was to provide the augmented communicator with relevant responses in real time. However, even when the responses appear on the screen within a second or less, the user must still review and select them, which pushes the contribution beyond the enchronic window.

In the MOVE project, we developed and bench-tested rapid-access interaction tools: words and short phrases such as “wait,” “yes,” “no,” “I agree,” “what do you mean,” “go ahead,” and “can I say something.” That work showed that, after training, the augmented speaker could use Moves to act quickly and stay in time within an unfolding conversation.

Our work with the Move project led to the integration of MOVE with DEAN, changing DEAN from a system that mainly generated candidate sentences into a hybrid interaction system. AI-generated utterances support context-specific responses; MOVE supports immediate pragmatic action. Together, they allow DEAN to support both the “long game” of conversational content and the “short game” of moment-by-moment participation.
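The hybrid arrangement can be sketched as a single selection handler with two paths. A Move selection is spoken immediately from a stored vocabulary and never waits for generation, while an AI selection speaks one of the candidates already on screen. The vocabulary grouping, the `on_selection` function, and the `speak` callback are hypothetical. The phrases themselves are the Move examples listed above.

```python
# Hypothetical Move vocabulary, grouped by pragmatic view.
MOVE_VOCAB = {
    "floor": ["wait", "can I say something", "go ahead"],
    "stance": ["yes", "no", "I agree"],
    "repair": ["what do you mean"],
}


def on_selection(kind, value, speak, ai_candidates=()):
    """Route one user selection to speech.

    kind == "move": value is (view, index) into MOVE_VOCAB; no generation is involved.
    kind == "ai":   value is an index into the candidates currently displayed.
    """
    if kind == "move":
        view, index = value
        phrase = MOVE_VOCAB[view][index]      # short game: immediate pragmatic action
    elif kind == "ai":
        phrase = ai_candidates[value]         # long game: context-specific content
    else:
        raise ValueError(f"unknown selection kind: {kind}")
    speak(phrase)
    return phrase


spoken = []
on_selection("move", ("floor", 0), spoken.append)
on_selection("ai", 1, spoken.append, ["Sounds good.", "Not for me."])
print(spoken)
```

Keeping the Move path independent of the generator means a user can still hold the floor or repair while new AI candidates are being produced.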

Todd uses the DEAN interface during field testing.