[HOWTO] How To Use Wit.ai, Windows Speech Recognition, Google Cloud Voice Recognition, or Recognissimo in Conversations


Post by Tony Li »

(Links to example scenes are at the end of this post. This thread also includes Windows Speech Recognition, Google Cloud, and Recognissimo examples.)

A Dialogue System user developed an interesting way to run conversations. I wanted to post a description here in case it can help others.

Subtitles can be sent to a text-to-speech plugin such as RT-Voice. The Dialogue System has very easy integration for it.

This post covers the other direction: allowing the player to make responses simply by speaking.

In place of a traditional response menu UI, his solution listens for keywords, which he refers to as "intents." When editing a response dialogue entry, he puts the intent text into the Menu Text field. When the player speaks a keyword associated with a response, the Dialogue System chooses that response.
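
For example, if a response's Menu Text is "open the door", then speaking a phrase that the recognition service maps to the "open the door" intent selects that response.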

To accomplish this, he used wit.ai, an online speech recognition service. He started with the Unity / wit.ai integration at https://github.com/afauch/wit3d

He used the parts of wit3d that initiate recording, save the recording to a file, send it to wit.ai, get back the JSON, and parse the JSON for the intent. He then passes the intent to a simple subclass of UnityUIDialogueUI that he named WitDialogueUI: when he receives input from wit.ai, he calls a static method on WitDialogueUI. The basic code is here:

WitDialogueUI.cs

Code: Select all

using UnityEngine;
using System.Collections;
using PixelCrushers.DialogueSystem;

public class WitDialogueUI : UnityUIDialogueUI {

    //Use a singleton to allow access from static methods
    public static WitDialogueUI Instance;

    public override void Awake() {
        base.Awake();
        Instance = this;
    }

    public bool showMenu = true; // Show response menu. Useful for debugging.

    public bool listening = false; // True when listening for a Wit.AI voice command.

    public Response[] responses;

    public override void ShowResponses(Subtitle subtitle, Response[] responses, float timeout) {
        base.ShowResponses(subtitle, responses, timeout);
        this.responses = responses; // Remember the responses to check when wit.ai returns an intent.
    }

    public override void HideResponses() {
        base.HideResponses();
        responses = null; // Response menu is done, so no responses to check.
    }

    // Get the intent from the _Handler script (part of wit3d):
    public static void getIntentFromWit3d(string Wit3Dintent) {
        if (Instance == null || Instance.responses == null) return; // No response menu is currently open.
        if (!string.IsNullOrEmpty(Wit3Dintent)) {
            foreach (var response in Instance.responses) {
                if (string.Equals(Wit3Dintent, response.formattedText.text)) {
                    // We have a match, select the choice:
                    Instance.OnClick(response); // Simulate a click on a response button.
                    return; // Done; stop checking other responses.
                }
            }
        }
    }
}
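
For example, wherever the wit3d _Handler script finishes parsing the intent out of wit.ai's JSON, it can hand that intent string to the dialogue UI. Here's a minimal sketch; the callback and variable names below are placeholders, so hook it into whatever your copy of wit3d actually provides:

Code: Select all

// Inside the wit3d _Handler script (hypothetical names, for illustration only):
private void OnWitIntentParsed(string intentName)
{
    // intentName is the intent parsed from wit.ai's JSON response.
    // Hand it to the dialogue UI, which selects the matching response:
    WitDialogueUI.getIntentFromWit3d(intentName);
}
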
So using this technique, he could create a game like The Wayne Investigation on Amazon Echo, or a mobile game that he could play in the car without having to look at a screen, or an accessible game for visually impaired players. Pretty neat!

Example scene (Wit.ai): WitAI_Example_2017-01-09.unitypackage

Example scene (Windows Speech Recognition): DS_WindowsSpeechRecognitionExample_2020-08-20.unitypackage

Re: Using Wit.ai or Windows Speech Recognition in Conversations

Post by Tony Li »

Here's an example dialogue UI integration script for Google Cloud Speech-To-Text.

Code: Select all

using Google.Cloud.Speech.V1;
using PixelCrushers.DialogueSystem;
using System.Threading;
using System.Threading.Tasks;
using UnityEngine;

public class GoogleSpeechDialogueUI : StandardDialogueUI
{
    public bool showMenu = true; // Show response menu. Useful for debugging.

    public static bool listening = true; // True when listening for a Google STT voice command.
    public static bool foundMatch = false;

    private Response[] responses = null;
    private Response matchingResponse = null;

    private SpeechClient speech = null;
    private SpeechClient.StreamingRecognizeStream streamingCall = null;
    private Task task;
    private object writeLock;
    private bool writeMore;
    private NAudio.Wave.WaveInEvent waveIn;
    private bool didNotUnderstand = false;
    private string s = string.Empty;

    public override void ShowResponses(Subtitle subtitle, Response[] responses, float timeout)
    {
        this.responses = responses;
        if (NAudio.Wave.WaveIn.DeviceCount < 1)
        {
            Debug.Log("No microphone! Using basic response menu.");
            base.ShowResponses(subtitle, responses, timeout);
            return;
        }
        if (showMenu)
        {
            base.ShowResponses(subtitle, responses, timeout);
        }
        task = StartListening();
    }

    private Task StartListening()
    {
        listening = true;
        foundMatch = false;
        speech = SpeechClient.Create();
        streamingCall = speech.StreamingRecognize();
        matchingResponse = null;

        // Write the initial request with the config.
        streamingCall.WriteAsync(
        new StreamingRecognizeRequest()
        {
            StreamingConfig = new StreamingRecognitionConfig()
            {
                Config = new RecognitionConfig()
                {
                    Encoding =
                    RecognitionConfig.Types.AudioEncoding.Linear16,
                    SampleRateHertz = 16000,
                    LanguageCode = "en",
                },
                InterimResults = false,
            }
        });

        // Process responses as they arrive.
        Task processResponses = Task.Run(async () =>
        {
            while (await streamingCall.ResponseStream.MoveNext(
                default(CancellationToken)))
            {
                foreach (var result in streamingCall.ResponseStream
                    .Current.Results)
                {
                    foreach (var alternative in result.Alternatives)
                    {
                        string text = alternative.Transcript;
                        Debug.Log("Heard: " + alternative.Transcript);
                        CheckResponses(text.Trim());
                    }
                }
            }
        });

        // Read from the microphone and stream to API.
        writeLock = new object();
        writeMore = true;
        waveIn = new NAudio.Wave.WaveInEvent();
        waveIn.DeviceNumber = 0;
        waveIn.WaveFormat = new NAudio.Wave.WaveFormat(16000, 1);
        waveIn.DataAvailable +=
            (object sender, NAudio.Wave.WaveInEventArgs args) =>
            {
                lock (writeLock)
                {
                    if (!writeMore) return;
                    streamingCall.WriteAsync(
                        new StreamingRecognizeRequest()
                        {
                            AudioContent = Google.Protobuf.ByteString
                                .CopyFrom(args.Buffer, 0, args.BytesRecorded)
                        }).Wait();
                }
            };
        waveIn.StartRecording();
        Debug.Log("Speak now.");

        return processResponses;
    }

    public override void Update()
    {
        base.Update();
        if (listening)
        {
            if (!string.IsNullOrEmpty(s))
            {
                Debug.Log(s);
                s = string.Empty;
            }
            if (foundMatch)
            {
                listening = false;
                Debug.Log("Stopping listening.");
                waveIn.StopRecording();
                lock (writeLock) writeMore = false;
                streamingCall.WriteCompleteAsync();
                OnClick(matchingResponse); // Simulate a click on a response button
            }
            else if (didNotUnderstand)
            {
                DialogueManager.ShowAlert("Sorry, I don't understand. Say something else.");
                didNotUnderstand = false;
            }
        }
    }

    public void CheckResponses(string text)
    {
        if (string.IsNullOrEmpty(text)) Debug.Log("text is blank");
        if (!string.IsNullOrEmpty(text))
        {
            foreach (var response in responses)
            {
                var menuOption = response.formattedText.text;
                s += "Checking option: " + menuOption + " against " + text + "\n";
                if (string.Equals(text, menuOption, System.StringComparison.OrdinalIgnoreCase))
                {
                    s += "Found a match for '" + menuOption + "': " + text + "\n";
                    //we have a match, select the choice:
                    foundMatch = true;
                    matchingResponse = response;
                    didNotUnderstand = false;
                    return;
                }
            }
        }
        didNotUnderstand = true;
    }
}
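
Since GoogleSpeechDialogueUI subclasses StandardDialogueUI, you can use it in place of the StandardDialogueUI component on your dialogue UI. Note that it depends on the Google.Cloud.Speech.V1 and NAudio libraries (NAudio's WaveInEvent makes it Windows-oriented), and SpeechClient.Create() typically expects Google Cloud credentials to be available, for example via the GOOGLE_APPLICATION_CREDENTIALS environment variable pointing at a service account key file.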

Re: [HOWTO] How To Use Wit.ai, Windows Speech Recognition, Google Cloud Voice Recognition, or Recognissimo in Conversations

Post by Tony Li »

Recognissimo integration, contributed by Matias Gesche:

Code: Select all

using System.Collections.Generic;
using UnityEngine;
using Recognissimo.Components;
using PixelCrushers.DialogueSystem;
using PixelCrushers;

public class SpeechRecognissimo : StandardDialogueUI
{
    public float speechRecognitionTimeout = 5;

    private VoiceControl m_voiceControl;
    private Response[] m_responses;
    private Response m_timeoutResponse;
    private float m_timeLeft;
    private bool m_isWaitingForResponse;

    public override void ShowResponses(Subtitle subtitle, Response[] responses, float timeout)
    {
        // Remember the responses to check when we recognize a keyword:
        if (responses.Length > 1)
        {
            // If we have more than one, assume the last response is the "I don't understand"/timeout special response.
            // Record it and remove it from the array of regular responses:
            m_timeoutResponse = responses[responses.Length - 1];
            var responsesExceptLast = new Response[responses.Length - 1];
            for (int i = 0; i < responses.Length - 1; i++)
            {
                responsesExceptLast[i] = responses[i];
            }
            responses = responsesExceptLast;
        }
        else
        {
            m_timeoutResponse = null;
        }
        m_responses = responses;
        m_timeLeft = speechRecognitionTimeout;
        m_isWaitingForResponse = true;

        // Show the responses:
        base.ShowResponses(subtitle, responses, timeout);

        // Identify the keywords to recognize:
        // (Each response's menu text can have keywords separated by pipe characters.)
        var allKeywords = new List<string>();
        foreach (var response in responses)
        {
            var responseKeywords = response.formattedText.text.Split('|');
            allKeywords.AddRange(responseKeywords);
        }

        // Set up the voice control recognizer:
        m_voiceControl = GetComponent<VoiceControl>();
        if (m_voiceControl == null)
        {
            m_voiceControl = gameObject.AddComponent<VoiceControl>();
        }
        m_voiceControl.AsapMode = true;

        foreach (var keyword in allKeywords)
        {
            m_voiceControl.Commands.Add(new VoiceControlCommand(keyword, () => OnPhraseRecognized(keyword)));
        }

        m_voiceControl.InitializationFailed.AddListener(e => Debug.LogError("Voice Control initialization failed: " + e.Message));
        m_voiceControl.StartProcessing();
    }

    public override void HideResponses()
    {
        base.HideResponses();

        // Stop speech recognition when we hide the menu:
        if (m_voiceControl != null)
        {
            m_voiceControl.StopProcessing();
            m_voiceControl.Commands.Clear();
        }
        m_timeoutResponse = null;
        m_isWaitingForResponse = false;
    }

    public override void Update()
    {
        base.Update();

        // Update speech recognition timer:
        if (m_isWaitingForResponse && m_timeoutResponse != null)
        {
            m_timeLeft -= Time.deltaTime;
            if (m_timeLeft <= 0)
            {
                // If time runs out, use the timeout response:
                OnClick(m_timeoutResponse);
                m_isWaitingForResponse = false;
            }
        }
    }

    private void OnPhraseRecognized(string phrase)
    {
        Debug.Log("Recognized: '" + phrase + "'");

        // Match the user's spoken phrase with one of the responses:
        foreach (var response in m_responses)
        {
            // Match against the same Menu Text keywords that were registered above:
            var responseKeywords = response.formattedText.text.Split('|');
            foreach (var responseKeyword in responseKeywords)
            {
                if (string.Equals(phrase, responseKeyword, System.StringComparison.OrdinalIgnoreCase))
                {
                    OnClick(response);
                    m_isWaitingForResponse = false;
                    return; // Exit the loop once the response is found and clicked.
                }
            }
        }
    }
}
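
With this script in place of StandardDialogueUI, each player response's Menu Text holds the keywords to listen for, separated by pipe characters, for example "Open the door|open|door". If a conversation provides an extra final response, it's treated as the "I don't understand"/timeout fallback and is selected automatically when speechRecognitionTimeout seconds pass without a recognized keyword.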

Re: [HOWTO] How To Use Wit.ai, Windows Speech Recognition, Google Cloud Voice Recognition, or Recognissimo in Conversations

Post by Tony Li »

Here's another Recognissimo integration with an example scene, including an alternative scene that uses overhead bubble panels.

DS_RecognissimoMenuPanelWithBubble_2024-04-10.unitypackage