Written language and speech are rapidly becoming the user interface of the future. We already see that voice assistants (like Alexa or Siri) or textual chatbots are influencing the technology and the way we’re using it. In this article, we share our learnings (including challenges) of building a chatbot in a short period of time. We also share some ground rules for chatbot building and give insights into important aspects that should always be considered in the building process.
Chatbots seem to have quite a lot of benefits. Question-answering systems like Alexa have been welcomed into households because they enable residents to quickly get important information or manage their errands more easily. Textual chatbots, on the other hand, assist in various fields with automated messaging, ranging from celebrity social media account management, e-commerce site support and food ordering to tedious form filling (DoNotPay) and even mental health (WoeBot). Even so, we shouldn’t get too carried away with the chatbot’s capabilities — it’s pretty obvious to most of us when we’re dealing with a chatbot rather than a living, breathing person.
In many cases, chatbots — with their natural language interface — empower humans and aim to make their lives easier. The most important aspect we would like to point out is that chatbots shouldn’t simply make decisions or take actions for humans but assist them. As we wanted to learn more about chatbots and how to assist humans well, we decided to build a chatbot for internal vacation booking using machine learning, with a big emphasis on experience design (XD) and quality analysis (QA).
Chatbots are especially interesting when thinking about user experience. Instead of the more common visual or tactile user interface (UI) elements found on a normal webpage, the UI element for chatbots is the conversation itself. As a result, common design principles for UI elements apply only in a limited way to conversations.
Likewise, chatbots present novel challenges for quality analysts. Machine learning introduces new complications to the way we assess quality. For a chatbot, quality can only be measured by whether it communicates successfully: whether it understands users and whether users understand it. This is in contrast to more conventional, deterministic apps, where a discrete set of defined behaviors should have a discrete set of predictable outcomes.
From a technical perspective, chatbots appealed to us because we wanted to work entirely with neural networks to accomplish the features of a conversational interface. This was tricky, as language and conversations are already hard to master for humans: context greatly impacts meaning, which leads to ambiguity. This meant we would have to work with multiple models and modularize the system.
So how did we do it?
Before we even started the project, we identified the scope of the chatbot: what it would be able to handle and what it wouldn’t. This helped us stay focused and consider the different user journeys that could be taken. The scope we came up with covered company-internal booking of leave days. Employees should be able to use the bot as a self-service tool for their leave management: booking and canceling leave, and getting more information about dates already booked.
To define the persona and character of the chatbot we did a slider exercise in which we moved the different sliders for different personality traits.
Figure 1: Slider exercise to define the chatbot’s persona
Through engaging in that exercise, we were able to establish some ground rules for our chatbot:
It had to be helpful and proactive while making it clear it was a robot
It had to use a simple tone and concise sentences to make it easy to understand
We had to own the data to ensure privacy and security
And the chatbot needed to be customizable so that we could improve its performance over time
After assessing available technologies we decided to use Rasa. Rasa is an open-source framework for building chatbots. The framework provides markup files in which the datasets for the models can be stored, which helped us with our aim for data ownership. The framework also provides an abstraction of a chatbot that can be found in nearly all other chatbot applications in the industry. Based on this, we were able to set something up quickly without reinventing the wheel. Because Rasa is completely open source, we were also able to fully change the underlying source code to adapt the models to our satisfaction, which gave us the customizability we wanted.
A chatbot can be described by six components (see figure 2). These components might vary by use case, but if you want your system to be loosely coupled and extendable, it’s advisable to have them in your system.
Figure 2: The components of a chatbot
The Chat Platform component establishes an interface for the user to interact with. The main machine learning parts are the Intent Classifier, the Named Entity Recognition (NER), the Dialog Model and the Natural Language Generator.
The Intent Classifier takes the message input of the user and classifies it into an intent (a label that indicates what the user is saying). The Named Entity Recognition recognizes entities (labels such as place, time or amount) in the message input of the user and labels them accordingly.
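To illustrate the shape of what these two components consume and produce, here is a toy sketch. The keyword rules and the date regex are hypothetical stand-ins for the actual neural models, and the intent names are illustrative:

```python
import re
from dataclasses import dataclass, field

@dataclass
class ParsedMessage:
    text: str
    intent: str
    entities: dict = field(default_factory=dict)

# Hypothetical keyword rules standing in for a trained intent classifier.
INTENT_KEYWORDS = {
    "book_leave": ["book", "take", "leave"],
    "cancel_leave": ["cancel"],
    "leave_info": ["how many", "remaining", "booked"],
}

def classify_intent(text: str) -> str:
    """Toy intent classifier: pick the intent with the most keyword hits."""
    scores = {
        intent: sum(kw in text.lower() for kw in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def extract_entities(text: str) -> dict:
    """Toy NER: grab ISO-style dates as start/end entities."""
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    entities = {}
    if dates:
        entities["start_date"] = dates[0]
    if len(dates) > 1:
        entities["end_date"] = dates[1]
    return entities

def parse(text: str) -> ParsedMessage:
    return ParsedMessage(text, classify_intent(text), extract_entities(text))
```

In the real system both steps are learned models, but the interface stays the same: text in, labeled intent and entities out.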
The Dialog Model decides the next action the chatbot will take, which depends on the current state of the conversation. The current state holds entities and previously taken actions.
The predicted action is then executed by the Action Resolver. This can lead to external API calls, so the actual business logic lives in the actions inside the Action Resolver.
The result of that action is then run through a Natural Language Generator which returns the result in natural language to the user via the Chat Platform again.
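One turn through these components might be wired together as in the sketch below. Every component is a placeholder callable here, not one of the real models; the function and action names are made up for illustration:

```python
# Hypothetical wiring of the components from figure 2. In the real system,
# classify/extract/dialog are Rasa models and actions may call external APIs.

def handle_message(text, state, *, classify, extract, dialog, actions, generate):
    """Run one turn of the conversation loop."""
    intent = classify(text)                  # Intent Classifier
    entities = extract(text)                 # Named Entity Recognition
    state["entities"].update(entities)
    action = dialog(intent, state)           # Dialog Model picks the next action
    result = actions[action](state)          # Action Resolver (business logic)
    state["previous_actions"].append(action)
    return generate(action, result)          # Natural Language Generator

# Minimal stand-ins to show the flow of a single turn:
state = {"entities": {}, "previous_actions": []}
reply = handle_message(
    "book leave",
    state,
    classify=lambda t: "book_leave",
    extract=lambda t: {},
    dialog=lambda intent, s: "ask_start_date",
    actions={"ask_start_date": lambda s: None},
    generate=lambda action, result: "When would you like your leave to start?",
)
```

The value of this shape is that each component can be swapped out (or mocked in tests) without touching the others.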
Challenges: once it’s built, then what?
Almost every chatbot that aims to be useful needs to be integrated with other systems. In our case, we had to integrate with external systems so our chatbot could resolve actions. We opted for direct communication with the other teams and explicit contracts between our systems.
Because we used machine learning to accomplish most of the tasks, we also encountered challenges common to intelligent products: we didn’t have enough data to train our models, and the data we had wasn’t representative of the data users would produce. Our data therefore lacked both quantity and quality, since most of the training and testing data was generated by ourselves.
Surprisingly, we found that intent recognition could be improved more by higher-quality data than by sheer quantity. This helped us bring something to production that would work and cover most people’s conversations. With this approach, it was easier and faster to feed the system with representative data from actual users of the chatbot in order to improve it continuously.
For the dialog, we had the opposite problem: here, the quantity of data was key, and we didn’t have enough conversational data to enable the chatbot to hold a proper conversation. We created this data ourselves too, but covering every possible edge case would have taken too much time and effort. With the data we had, we weren’t able to make the model generalize sufficiently to cover all possible conversations, because there was a limited number of variations on the questions and answers our dialogues contained. What we did instead was train for several epochs on the same data so that the model was just about to overfit, making it essentially behave like a decision tree.
This could be done better by choosing a different machine learning model for the dialog, or by making the scope of the chatbot bigger, enabling more conversations and more ways of connecting sentences and dialogues. That would make better use of the generalization neural networks provide.
Figure 3: Continuous Delivery workflow for the chatbot
Another challenge was tying the training into the system and our continuous delivery approach. For this, we created a continuous delivery workflow based on Continuous Delivery for Machine Learning. This helped us keep an overview of our deployments and create a safe environment that enabled us to go to production at any time. It also significantly helped us get real-world data from users to further enhance our models.
When training the Dialog Model, we tried to bring it to the verge of overfitting without letting it overfit too much. This was tricky because of the small amount of data we had. The conversational dataset consisted of dialogs that were randomly connected to form example conversations. We split those randomly generated conversations into training and test data, so the actual size of our training and test sets varied from run to run. This made it hard to stop deterministically after a certain number of batches or epochs. We added and configured early stopping to address this issue.
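Early stopping can be sketched as a loop that tracks the best validation loss and halts once it stops improving. This is a generic sketch, assuming hypothetical `train_epoch` and `evaluate` callables; the `patience` value is illustrative, not our actual configuration:

```python
# Minimal early-stopping sketch: stop once the validation loss hasn't
# improved for `patience` consecutive epochs.

def train_with_early_stopping(train_epoch, evaluate, max_epochs=200, patience=5):
    """Return (epochs actually run, best validation loss seen)."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for _ in range(max_epochs):
        train_epoch()                       # one pass over the training data
        val_loss = evaluate()               # loss on the held-out split
        epochs_run += 1
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # no progress: stop training
    return epochs_run, best_loss
```

Because our split sizes varied per run, stopping on validation loss rather than on a fixed epoch count gave us a consistent criterion across runs.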
The quality assessment of chatbots is difficult. Conventional ways of testing and QA don’t apply fully in this area because the conversational interface is non-deterministic by design.
We created quality principles to aid us in driving the implementation.
Our priority is to cover the functionality of leave management
We did that by depicting the main conversation flows in our integration tests. Then, on each new deployment, we got fast feedback on quality changes. If a deployment breaks one of the main dialog possibilities, it’s worth investigating.
Focus on comprehension abilities and ease of use
Chatbots are tools for human communication, which is multidimensional and can be very diverse. For that reason, we made sure to get users’ feedback often by performing usability tests.
Drive improvements with strategic analysis
To aid the first two principles, we used production conversation flow logs to spot where conversations broke and how users were talking to the chatbot. Using this strategic analysis, we could refine the chatbot. After going live, the chatbot is in constant use, so quality analysis and improvements are continuous. Real data representing real users’ behavior as input for the models helps improve them over time.
To build good confidence in our delivery, we used Test-Driven Development (TDD) and created a base layer of unit tests. We also implemented integration tests for our different components and the main conversation flows, mocking the APIs we integrated with. Both unit and integration tests run each time we want to deploy new changes (see figure 3).
The flakiness of the main conversation integration tests (in which the chatbot’s answers are asserted in a conversation simulation) was an issue. Since the chatbot used machine learning, it would sometimes misinterpret a given sentence within the conversation flow. To reduce this occasional flakiness, we added more conversations to the training data. In addition, the integration tests started sessions fresh, which may not represent a real-world scenario: when text conversations happen in real life, users tend to have a conversation history, and this history also affects the dialog model and the way the chatbot interprets the conversation. To handle such conversations slightly better, we introduced slot expiration, which resets the data filled in the session after three minutes. For example, the start date of a vacation gets reset if no other date was provided and the vacation wasn’t confirmed. This handles situations where the chatbot is stuck and doesn’t allow the user to proceed with the conversation because of filled slots.
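Slot expiration can be sketched with a simple in-memory session store. The `Session` class below is a hypothetical illustration, not Rasa’s actual tracker API; the three-minute timeout mirrors the behavior described above:

```python
import time

SLOT_TTL_SECONDS = 3 * 60  # slots expire after three minutes

class Session:
    """Toy session store whose slots expire after a fixed time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable clock makes this testable
        self._slots = {}           # name -> (value, set_at)

    def set_slot(self, name, value):
        self._slots[name] = (value, self._clock())

    def get_slot(self, name):
        """Return the slot value, or None if it was never set or has expired."""
        if name not in self._slots:
            return None
        value, set_at = self._slots[name]
        if self._clock() - set_at > SLOT_TTL_SECONDS:
            del self._slots[name]  # expired: reset so the user isn't stuck
            return None
        return value
```

Injecting the clock lets the expiry behavior be verified without actually waiting three minutes.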
Metrics for Intent Classification & Conversation Flow
In the beginning, we had no metrics to determine the quality of our models. Later on, we introduced metrics that enabled us to compare models and their quality. With each training of the chatbot, a confusion matrix (representing the performance of the classification algorithm) and an intent classification report (providing precision, recall and F1 score) were generated. These metrics were useful for comparing an older version of the chatbot with a newer one whenever the definition of intents or the model changed.
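To show what such a report contains, here is a sketch that computes per-intent precision, recall and F1 by hand from a confusion count. The intent labels and predictions are a made-up toy example:

```python
from collections import Counter

def intent_report(y_true, y_pred, labels):
    """Per-intent precision, recall and F1 from true/predicted intent labels."""
    confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    report = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(t, label)] for t in labels if t != label)
        fn = sum(confusion[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Toy evaluation: one "book" message was misclassified as "cancel".
report = intent_report(
    y_true=["book", "book", "cancel", "info"],
    y_pred=["book", "cancel", "cancel", "info"],
    labels=["book", "cancel", "info"],
)
```

In practice libraries generate these numbers for you; the point is that a per-intent breakdown makes regressions visible when intents or models change between versions.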
Statistical approaches to conversational flow metrics in machine learning are still limited. To understand the quality aspects better, we prioritized quality attributes using the Analytic Hierarchy Process (AHP), which resulted in a hierarchy of attributes across three areas (efficiency, effectiveness and satisfaction).
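The AHP step can be sketched as deriving priority weights from a pairwise comparison matrix. The sketch below uses the normalized-column-average approximation; the example matrix is illustrative, not our actual comparisons:

```python
# AHP priority weights via the normalized-column-average approximation:
# normalize each column of the reciprocal comparison matrix, then average rows.

def ahp_weights(matrix):
    n = len(matrix)
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    normalized = [[matrix[i][j] / col_sums[j] for j in range(n)]
                  for i in range(n)]
    return [sum(normalized[i]) / n for i in range(n)]

# Example: effectiveness vs efficiency vs satisfaction.
# matrix[i][j] says how much more important area i is than area j.
pairwise = [
    [1,     3,     5],    # effectiveness
    [1 / 3, 1,     3],    # efficiency
    [1 / 5, 1 / 3, 1],    # satisfaction
]
weights = ahp_weights(pairwise)
```

The resulting weights sum to one and give a defensible ordering of the areas instead of a gut-feeling ranking.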
The main quality attribute for our chatbot was functionality: in this case, the ability to execute requested tasks around annual leave, combined with the chatbot’s ease of use. Having a limited goal (vacation handling) makes the chatbot less error-prone. Other quality attributes that we extracted were:
performance (avoid inappropriate utterances, good response time, error handling)
humanity (maintain themed discussion, be transparent by disclosing the fact that it’s a chatbot)
affect (provide greetings, have conversational cues)
accessibility (detect meanings and intents)
Using these prioritized attributes, we performed a thorough analysis of user data to expand our training sets.
When it comes to experience design, chatbots are quite different from the usual tasks XD professionals work with. With chatbots, only the text is available. So how do you actually design the text?
As mentioned before, the first step is to define the persona of the chatbot. When creating the chatbot’s answers and text samples, we leaned towards our communication style. Having a certain persona in mind helped us to challenge ourselves by creating answers for the user that resemble the defined persona.
The persona influences the answers the chatbot gives, as well as the flow of conversations and the definition of possible behavior. We decided to make the chatbot appear more human. This can be accomplished, for example, by making the user wait a bit before sending out the response, as if somebody were typing. This can create a more natural conversation experience.
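This typing delay can be sketched in a few lines. The characters-per-second rate and the cap below are illustrative choices, not values from our implementation:

```python
import time

def typing_delay(reply, chars_per_second=30, max_delay=2.0):
    """Delay roughly proportional to how long a person would take to type."""
    return min(len(reply) / chars_per_second, max_delay)

def send_with_typing(send, reply):
    """Pause as if typing, then deliver the reply via the chat platform."""
    time.sleep(typing_delay(reply))
    send(reply)
```

Capping the delay matters: long answers shouldn’t make the user wait many seconds just to keep up the illusion.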
And what about the continuity of the conversation? A conversation initiated by the user shouldn’t end in an open-ended response from the bot; that would confuse the user and leave their request unfulfilled. For instance, if a user asks the chatbot about sick leave, the chatbot replies with a message highlighting that this is out of its scope: it explains that it’s an annual-leave chatbot which does not handle sick leave booking, and tells the user who to contact in this situation.
Another aspect we made sure to cover was the fact that context matters. An example of this is the process of booking leave. To book time off work, certain information is required — the start and end date, for example. These are usually asked for conversationally. But what if the user has no leave days left? Instead of walking users through the flow of collecting start and end dates, the chatbot immediately lets the user know that, unfortunately, they don’t have any leave left. In this case, the user gets immediate feedback before even trying to use unavailable functionality.
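That context check can be sketched as a guard at the start of the booking flow. Here `get_leave_balance` is a hypothetical lookup against the leave-management system, and the wording is illustrative:

```python
# Guard clause before entering the booking flow: check the leave balance
# first, so the user isn't walked through questions for nothing.

def start_booking_flow(user_id, get_leave_balance):
    remaining = get_leave_balance(user_id)
    if remaining <= 0:
        return "Unfortunately, you don't have any leave days left this year."
    return (f"You have {remaining} leave days left. "
            "When should your leave start?")
```

The same pattern applies to any precondition of a flow: fail early with a helpful message rather than at the end of a long conversation.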
One great aspect of building a chatbot is that it’s relatively easy to test it often and early with real users. In the very initial phase, a real person could even pretend to be the chatbot and react with predefined phrases (Wizard-of-Oz model). Later on, it is really helpful to give early users access and check the logs to follow the conversations.
Our chatbot is live and has been working for quite a while now. We were successful in getting it live fast. As we’ve seen, there were quite a few challenges along the way, but we’re happy with what was achieved. The intent recognition works well, the architecture of the chatbot seems well suited for extension, and Continuous Delivery for Machine Learning was a great addition to the delivery process.
The test pyramid still very much applies to “intelligent products” like this chatbot. The non-deterministic nature of the chatbot makes integration and functional tests a bit tricky to do. Adding data from production from actual users can be a great tool to solve this. To determine the quality of the actual models, the use of the right metrics is important. Having these metrics available helps make people aware of the chatbot’s overall quality and room for improvement.
Another important lesson we learned was the importance of frequent and early user feedback. By doing lots of user testing sessions we could test out the bot before adding new features or assessing current ones to see if our assumptions were right. Analyzing how people use the chatbot in production can also help coming up with new features or optimizing its user experience. Subtleties in how dialogues are structured and designed can greatly influence a chatbot’s persona and its user experience.
Overall, we learned that human communication is tricky, and to make a high-quality chatbot, we have to understand humans better first. The biggest enabler for this could be the use of open-source data sets and open-source tools for machine learning models. The open-source machine learning models help spread knowledge and integrate research in the area of natural language.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.