This is the documentation for boteval, a library that helps with chatbot evaluation.
The source code is available at github.com/isi-nlp/boteval
- Pre release
1. Setup
Python 3.9 or newer is required to run this project.
conda create -n boteval python=3.9
conda activate boteval
pip install boteval
boteval -h
If you plan to edit boteval's code or want to use the latest unreleased version, clone the source code from GitHub and install it in development mode:
# create a Python 3.9 environment (if necessary) and activate it
git clone git@github.com:isi-nlp/boteval.git
cd boteval
pip install -e .
boteval -h
The -e flag enables editable mode, i.e., you can keep editing files without having to reinstall.
2. Quick Start
boteval is structured around a task directory, which contains a conf.yml file.
These concepts are introduced in later sections; for now, let's see if we can run a demo task as-is.
python -m boteval -h
usage: boteval [-h] [-c FILE] [-b /prefix] [-d] [-a ADDR] [-p PORT] [-v] DIR
Deploy chat bot evaluation
positional arguments:
DIR Path to task dir. See "example-chat-task"
optional arguments:
-h, --help show this help message and exit
-c FILE, --config FILE
Path to config file. Default is <task-dir>/conf.yml (default: None)
-b /prefix, --base /prefix
Base path prefix for all url routes. Eg: /boteval (default: None)
-d, --debug Run Flask server in debug mode (default: False)
-a ADDR, --addr ADDR Address to bind to (default: 0.0.0.0)
-p PORT, --port PORT port to run server on (default: 7070)
-v, --version show version number and exit
We create an example chat task directory using boteval-quickstart:
# create an example task directory
boteval-quickstart example-chat-task/
# run demo task
boteval -d example-chat-task/
It should print 127.0.0.1:7070 on a successful launch. Now open this URL in a web browser. You will be greeted with a login screen. Creating a user account is not necessary for development, as two accounts are preloaded by default (user IDs are shown):
- dev: a user account for you to test as a worker
- admin: an account with administrator privileges
Where are the passwords for these accounts? See boteval/constants.py#L44-L45. You may export the DEV_USER_SECRET and ADMIN_USER_SECRET environment variables (before the initial launch) to set your own passwords.
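As a minimal sketch of the environment-variable route, the snippet below generates random passwords and builds the environment for the first launch. The helper name `build_launch_env` is hypothetical and not part of boteval; only the two variable names and the launch command come from the text above.

```python
import os
import secrets


def build_launch_env(base_env=None):
    """Return a copy of the environment with both account passwords set.

    Hypothetical helper for illustration; not part of boteval.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Keep values that were already exported; otherwise generate random ones.
    env.setdefault("DEV_USER_SECRET", secrets.token_urlsafe(16))
    env.setdefault("ADMIN_USER_SECRET", secrets.token_urlsafe(16))
    return env


if __name__ == "__main__":
    env = build_launch_env()
    print("dev password:", env["DEV_USER_SECRET"])
    print("admin password:", env["ADMIN_USER_SECRET"])
    # To launch with these passwords (before the initial launch):
    # import subprocess
    # subprocess.run(["boteval", "-d", "example-chat-task/"], env=env, check=True)
```

Values that are already exported are left untouched, so the same helper can be reused on later launches.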
Now log in as the dev user in one browser and as the admin user in another browser. Then play around with the web UI.
TODO: We should make a video demo.
The chat data and user annotations are stored in two places:
1. Inside the task-dir/data directory as JSON files, one per chat thread
2. Inside the database file that you have configured
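To inspect the JSON copies, something like the following could be used. It is only a sketch: it assumes the files sit somewhere under `<task-dir>/data` with a `.json` suffix, and it makes no assumption about the fields inside each file. The function name `load_chat_threads` is made up for this example.

```python
import json
from pathlib import Path


def load_chat_threads(task_dir):
    """Load every JSON file found under <task_dir>/data.

    Returns a dict mapping file path -> parsed JSON content.
    """
    data_dir = Path(task_dir) / "data"
    threads = {}
    if not data_dir.is_dir():
        return threads  # nothing stored yet
    for path in sorted(data_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            threads[str(path)] = json.load(f)
    return threads


if __name__ == "__main__":
    threads = load_chat_threads("example-chat-task")
    print(f"found {len(threads)} chat thread file(s)")
```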
The following sections describe how to customize this setup and deploy it in production.
3. Task Directory
Here is the structure of an example chat task directory:
example-chat-task/
├── conf.yml (1)
├── chat_topics.json
├── __init__.py (2)
├── instructions.html
└── user-agreement.html
4. Config file
conf.yml file
chatbot:
  display_name: 'Moderator'
  topics_file: chat_topics.json
  bot_name: hf-transformers  # (1)
  bot_args:  # (1)
    model_name: facebook/blenderbot_small-90M

onboarding:  # (2)
  agreement_file: user-agreement.html
  instructions_file: instructions.html
  checkboxes:
    instructions_read: I have read the instructions.
    iam_adult: I am 18 years or older and I understand that I may have to read and write using toxic language.

ratings:  # (3)
  - question: 'How Coherent was the conversation?'
    choices: &choices
      - Not at all
      - Mostly not
      - So-so
      - Somewhat
      - Very
  - question: 'How likely are you going to continue the conversation with the bot?'
    choices: *choices
  - question: 'To what degree did the bot convince you to change your behavior?'
    choices: *choices

limits:  # (4)
  max_threads_per_user: 10
  max_threads_per_topic: &max_assignments 3
  max_turns_per_thread: 4
  reward: &reward '0.01'  # dollars

flask_config:  # (5)
  # sqlalchemy settings https://flask-sqlalchemy.palletsprojects.com/en/2.x/config/
  DATABASE_FILE_NAME: 'sqlite-dev-01.db'  # this will be placed in task dir
  SQLALCHEMY_TRACK_MODIFICATIONS: false

mturk:  # (6)
  client:
    profile: default  # the [default] profile in ~/.aws/credentials file
    sandbox: true  # sandbox: false to go live
  seamless_login: true
  hit_settings:
    # https://boto3.amazonaws.com/v1/documentation/api/1.11.9/reference/services/mturk.html#MTurk.Client.create_hit
    MaxAssignments: *max_assignments
    AutoApprovalDelayInSeconds: 604800  # 7 days = 604k sec
    LifetimeInSeconds: 1209600  # 14 days = 1.2M sec
    AssignmentDurationInSeconds: 3600  # 1 hour = 3.6k sec
    Reward: *reward
    Title: 'Evaluate a chatbot'
    Keywords: 'chatbot,chat,research'
    Description: 'Evaluate a chat bot by talking to it for a while and receive a reward'
4.1. Bot Settings
- bot_name (str) and bot_args (dict) are required to enable the chatbot backend.
- bot_name is a string, whereas bot_args is a dictionary whose entries are passed to the bot as keyword arguments. bot_args is optional (i.e., may be omitted) for bots that require no arguments.
from boteval import log, C, registry as R
from boteval.bots import BotAgent

BLENDERBOT_90M = "facebook/blenderbot_small-90M"

@R.register(R.BOT, 'hf-transformers')
class TransformerBot(BotAgent):

    NAME = 'transformers'

    def __init__(self, model_name=BLENDERBOT_90M, **kwargs) -> None:
        super().__init__(name=f'{self.NAME}:{model_name}')
        self.model_name = model_name
Here, with bot_name='hf-transformers', bot_args is optional because no argument of the __init__ method requires a value at runtime. However, if we want to change model_name, here is an example of how to provide it:
chatbot:
  # (other args)
  bot_name: hf-transformers
  bot_args:
    model_name: facebook/blenderbot_small-90M
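The link between `bot_name`, `bot_args` and the bot class can be pictured with a small stand-alone registry. This is an illustrative sketch, not boteval's actual `registry` module: the names `BOTS`, `register_bot`, `EchoBot` and `create_bot` are all hypothetical. It shows a decorator storing a class under a name, and the `bot_args` from the config being passed as keyword arguments.

```python
# Stand-alone illustration; boteval's real registry may work differently.
BOTS = {}


def register_bot(name):
    """Decorator that stores a bot class under the given name."""
    def decorator(cls):
        BOTS[name] = cls
        return cls
    return decorator


@register_bot("echo-bot")
class EchoBot:
    def __init__(self, model_name="default-model", **kwargs):
        self.model_name = model_name
        self.extra = kwargs


def create_bot(chatbot_conf):
    """Instantiate the bot named in a `chatbot` config section."""
    bot_cls = BOTS[chatbot_conf["bot_name"]]
    # bot_args may be missing for bots that need no arguments
    bot_args = chatbot_conf.get("bot_args") or {}
    return bot_cls(**bot_args)


bot = create_bot({"bot_name": "echo-bot"})
print(bot.model_name)  # default-model
bot = create_bot({"bot_name": "echo-bot", "bot_args": {"model_name": "my-model"}})
print(bot.model_name)  # my-model
```

Because every argument of the constructor has a default, the first call works without any `bot_args`, which mirrors the optional case described above.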
Seed Conversation
- chatbot.topics_file is required to provide the seed conversation. See example-chat-task/chat_topics.json in the source repository for an example.
Bot Name
- Set the chatbot.display_name property.
4.2. Onboarding Settings
onboarding:
  agreement_file: user-agreement.html
  instructions_file: instructions.html
  checkboxes:
    instructions_read: I have read the instructions.
    iam_adult: I am 18 years or older and I understand that I may have to read and write using toxic language.
The agreement_file and instructions_file may contain arbitrary HTML/CSS/JS content.
The contents of agreement_file are shown during user signup (account creation), while users can access the contents of instructions_file from a chat window.
The items under onboarding.checkboxes are shown on the signup page, where the user is asked to provide agreement/consent.
4.3. Ratings Settings
Ratings is the place to configure the input requested from the user after a chat task is done. Currently, only multiple-choice questions are supported (TODO: we probably need to extend this to support other kinds of input).
For a multiple-choice question, specify the question text as question: str and its choices as choices: List[str].
ratings:
  - question: 'How Coherent was the conversation?'
    choices: &choices  # (1)
      - Not at all
      - Mostly not
      - So-so
      - Somewhat
      - Very
  - question: 'How likely are you going to continue the conversation with the bot?'
    choices: *choices  # (1)
  - question: 'To what degree did the bot convince you to change your behavior?'
    choices: *choices
(1) &choices defines a reference/pointer variable (a YAML anchor), and *choices refers to the previously defined variable (a YAML alias). This is an elegant way of reusing previously defined config values instead of repeating them.
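As a rough sketch of what the ratings section amounts to once the YAML is loaded, the list below mirrors the config in plain Python; each `*choices` alias resolves to the same list of options. The `validate_answers` helper and the shape of `answers` are hypothetical and only illustrate checking a submitted answer against the configured choices; this is not boteval's code.

```python
CHOICES = ["Not at all", "Mostly not", "So-so", "Somewhat", "Very"]

# Equivalent of the `ratings` section after loading the YAML
RATINGS = [
    {"question": "How Coherent was the conversation?", "choices": CHOICES},
    {"question": "How likely are you going to continue the conversation with the bot?",
     "choices": CHOICES},
    {"question": "To what degree did the bot convince you to change your behavior?",
     "choices": CHOICES},
]


def validate_answers(ratings, answers):
    """Return a list of problems; an empty list means all answers are valid.

    `answers` maps question text -> chosen option (hypothetical format).
    """
    problems = []
    for item in ratings:
        answer = answers.get(item["question"])
        if answer is None:
            problems.append(f"missing answer: {item['question']}")
        elif answer not in item["choices"]:
            problems.append(f"invalid choice {answer!r}: {item['question']}")
    return problems


answers = {item["question"]: "Somewhat" for item in RATINGS}
print(validate_answers(RATINGS, answers))  # []
```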
4.4. Limits Settings
limits:
  max_threads_per_user: 10  # (1)
  max_threads_per_topic: 3  # (2)
  max_turns_per_thread: 4  # (3)
  reward: '0.01'  # (4)
(1) Maximum number of threads (MTurk assignments) a worker can do
(2) Maximum number of threads (MTurk assignments) we need for a topic (MTurk HIT)
(3) Maximum number of worker replies required in a thread (assignment) to consider it complete
(4) Reward amount. Note: currently, payment can be provided to MTurk workers only; we do not have our own payment processing backend.
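One way to read these limits is as simple checks. The functions below are a hypothetical sketch of that logic, based only on the descriptions above; boteval's own enforcement may be implemented differently.

```python
LIMITS = {
    "max_threads_per_user": 10,
    "max_threads_per_topic": 3,
    "max_turns_per_thread": 4,
    "reward": "0.01",
}


def can_assign(user_thread_count, topic_thread_count, limits=LIMITS):
    """True if a worker may start another thread on a topic."""
    return (user_thread_count < limits["max_threads_per_user"]
            and topic_thread_count < limits["max_threads_per_topic"])


def is_thread_complete(worker_reply_count, limits=LIMITS):
    """True once the worker has sent the required number of replies."""
    return worker_reply_count >= limits["max_turns_per_thread"]


print(can_assign(user_thread_count=2, topic_thread_count=3))  # False: topic is full
print(is_thread_complete(worker_reply_count=4))               # True
```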
4.5. Flask Server Settings
As you may have figured out already, the server-side code is powered by Python Flask. Flask is a very powerful and flexible system, which is why we chose it. For Flask’s configuration options, please refer to flask.palletsprojects.com/en/2.0.x/config/#configuration-basics
We expose Flask’s app.config data structure: anything set under the flask_config key at the root of the YAML file is copied into app.config.
flask_config:
  DATABASE_FILE_NAME: 'sqlite-dev-01.db'  # (1)
  SQLALCHEMY_TRACK_MODIFICATIONS: false
(1) DATABASE_FILE_NAME is the filename of the sqlite3 database.
Use a different DATABASE_FILE_NAME for development and production. When you want a fresh start, simply change the database filename.
You may also configure SQLAlchemy here; see flask-sqlalchemy.palletsprojects.com/en/2.x/config/
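The effect of flask_config can be sketched with a plain dictionary standing in for `app.config` (Flask's config object is a `dict` subclass, so `update` behaves the same way). The helper name `apply_flask_config` is hypothetical, and resolving `DATABASE_FILE_NAME` against the task directory follows the "placed in task dir" comment in the config rather than boteval's actual code.

```python
from pathlib import Path


def apply_flask_config(app_config, conf, task_dir):
    """Copy the `flask_config` section of conf.yml into app_config.

    Returns the path where the SQLite file would live (inside task_dir).
    """
    flask_conf = conf.get("flask_config") or {}
    app_config.update(flask_conf)
    return Path(task_dir) / app_config["DATABASE_FILE_NAME"]


conf = {
    "flask_config": {
        "DATABASE_FILE_NAME": "sqlite-dev-01.db",
        "SQLALCHEMY_TRACK_MODIFICATIONS": False,
    }
}
app_config = {}  # stand-in for Flask's app.config
db_path = apply_flask_config(app_config, conf, "example-chat-task")
print(db_path.as_posix())  # example-chat-task/sqlite-dev-01.db
```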
4.6. Crowd: MTurk Settings
limits:
  max_threads_per_user: 10
  max_threads_per_topic: &max_assignments 3  # (5)
  max_turns_per_thread: 4
  reward: &reward '0.01'  # (5)

mturk:
  client:
    profile: default  # (1)
    sandbox: true  # (2)
  seamless_login: true  # (3)
  hit_settings:  # (4)
    MaxAssignments: *max_assignments  # (5)
    AutoApprovalDelayInSeconds: 604800  # 7 days = 604k sec
    LifetimeInSeconds: 1209600  # 14 days = 1.2M sec
    AssignmentDurationInSeconds: 3600  # 1 hour = 3.6k sec
    Reward: *reward  # (5)
    Title: 'Evaluate a chatbot'
    Keywords: 'chatbot,chat,research'
    Description: 'Evaluate a chat bot by talking to it for a while and receive a reward'
(1) The profile name should match one of those in the $HOME/.aws/credentials file
(2) Set sandbox: true to use the sandbox, and sandbox: false to go live
(3) MTurk users are automatically logged into our system. An account is created if required. MTurk workers do not need to remember a user ID or password for logging into our system.
(4) All these key-values are sent to the MTurk API. Refer to boto3.amazonaws.com/v1/documentation/api/1.11.9/reference/services/mturk.html#MTurk.Client.create_hit for a full list of options.
(5) Cross-references using & and * for reusing previously defined limits
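The duration values in hit_settings are plain seconds. A short sketch, using only the standard library, shows where the numbers come from and how the shared limits values end up in the HIT settings. The `seconds` and `build_hit_settings` helpers are hypothetical; the keys and values are the ones listed in the config above.

```python
from datetime import timedelta


def seconds(**kwargs):
    """Convert a duration such as days=7 into whole seconds."""
    return int(timedelta(**kwargs).total_seconds())


def build_hit_settings(limits):
    """Assemble HIT settings, reusing values from the `limits` section."""
    return {
        "MaxAssignments": limits["max_threads_per_topic"],
        "AutoApprovalDelayInSeconds": seconds(days=7),
        "LifetimeInSeconds": seconds(days=14),
        "AssignmentDurationInSeconds": seconds(hours=1),
        "Reward": limits["reward"],
        "Title": "Evaluate a chatbot",
        "Keywords": "chatbot,chat,research",
        "Description": "Evaluate a chat bot by talking to it for a while and receive a reward",
    }


settings = build_hit_settings({"max_threads_per_topic": 3, "reward": "0.01"})
print(settings["AutoApprovalDelayInSeconds"])   # 604800
print(settings["LifetimeInSeconds"])            # 1209600
print(settings["AssignmentDurationInSeconds"])  # 3600
```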
MTurk integration is achieved via ExternalQuestion. However, ExternalQuestion requires hosting our web service over HTTPS, which requires an SSL certificate. See Section 6, “HTTPS with Nginx Reverse Proxy”.
When the task is done, we need to submit a form back to MTurk to report its completion. So, the MTurk worker gets an additional screen at the end of the task, where they click a button to notify MTurk of the task completion.
In the current version of this system, we do not launch HITs on MTurk automatically; instead, we provide an option to the admin user.
To launch HITs on MTurk, follow these steps:
- Log in as the admin user (see Section 2, “Quick Start”)
- Go to Admin Dashboard > Topics > Launch on Mturk or Mturk Sandbox (depending on config)
5. Adding Bots and Transforms
If an __init__.py file is found at the root of the task directory, the directory is treated as a Python module and imported.
Refer to the example-chat-task directory for an example.
You may have to install additional requirements/libraries for your code.
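The import step can be approximated with the standard library's `importlib`. This is a sketch of the general technique, not boteval's actual loader; the function name `import_task_dir` and the module name `my_task` are hypothetical.

```python
import importlib.util
import sys
from pathlib import Path


def import_task_dir(task_dir, module_name="my_task"):
    """Import <task_dir>/__init__.py as a module, if the file exists.

    Returns the module, or None when there is no __init__.py.
    """
    init_file = Path(task_dir) / "__init__.py"
    if not init_file.is_file():
        return None
    spec = importlib.util.spec_from_file_location(
        module_name, init_file,
        submodule_search_locations=[str(init_file.parent)])
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)    # runs the file, including any decorators in it
    return module


if __name__ == "__main__":
    module = import_task_dir("example-chat-task")
    print("imported" if module else "no __init__.py found")
```

Executing the file is what makes registration decorators inside it take effect, which is why a plain import is enough to make custom bots and transforms available.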
from typing import Any

from boteval import log, C, registry as R
from boteval.bots import BotAgent
from boteval.transforms import BaseTransform, SpacySplitter
from boteval.model import ChatMessage


@R.register(R.BOT, name="my-dummy-bot")
class MyDummpyBot(BotAgent):

    def __init__(self, **kwargs):
        super().__init__(name="dummybot", **kwargs)
        self.args = kwargs
        log.info(f"{self.name} initialized; args={self.args}")

    def talk(self) -> dict[str, Any]:
        if not self.last_msg:
            reply = f"Hello there! I am {C.BOT_DISPLAY_NAME}."
        elif 'ping' in self.last_msg['text'].lower():
            reply = 'pong'
        else:
            reply = f"Dummy reply for -- {self.last_msg['text']}"
        return dict(text=reply)

    def hear(self, msg: dict[str, Any]):
        self.last_msg = msg


@R.register(R.TRANSFORM, name='my-transform')
class MyDummyTransform(BaseTransform):

    def __init__(self, **kwargs) -> None:
        super().__init__()
        self.args = kwargs
        self.splitter = SpacySplitter.get_instance()

    def transform(self, msg: ChatMessage) -> ChatMessage:
        try:
            text_orig = msg.text
            text = '\n'.join(self.splitter(text_orig)) + ' (transformed)'  # sample transform
            msg.text = text
            msg.data['text_orig'] = text_orig
        except Exception as e:
            log.error(f'{e}')
        return msg
Suppose the code above is placed in <task-dir>/__init__.py; boteval then imports it as a Python module.
The @R.register(R.BOT, name="my-dummy-bot") and @R.register(R.TRANSFORM, name='my-transform') statements register a custom bot and a custom transform, respectively. With this, the following config should be self-explanatory (otherwise, revisit Section 4.1, “Bot Settings”):
chatbot:
  display_name: 'Moderator'
  topics_file: chat_topics.json
  bot_name: my-dummy-bot
  bot_args:
    key1: val1
    key2: val2
    some_flag: true
  transforms:
    human:
      - name: my-transform
        args:
          arg1: val1
          arg2: [val2, val3]
    bot:
      - name: my-transform
        args:
          arg1: val1
          arg2: [val2, val3]
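The hear/talk pair in the dummy bot suggests a simple turn-taking loop. The sketch below uses a stand-in class with the same two methods so that it runs without boteval installed; how boteval itself drives a bot and applies transforms is not shown here, and the names `PingPongBot` and `chat_turn` are hypothetical.

```python
from typing import Any, Dict, Optional


class PingPongBot:
    """Stand-in with the same hear/talk shape as the dummy bot."""

    def __init__(self) -> None:
        self.last_msg: Optional[Dict[str, Any]] = None

    def hear(self, msg: Dict[str, Any]) -> None:
        # Remember the latest incoming message
        self.last_msg = msg

    def talk(self) -> Dict[str, Any]:
        # Produce a reply based on the last message heard, if any
        if not self.last_msg:
            reply = "Hello there!"
        elif "ping" in self.last_msg["text"].lower():
            reply = "pong"
        else:
            reply = f"Dummy reply for -- {self.last_msg['text']}"
        return dict(text=reply)


def chat_turn(bot, user_text: str) -> str:
    """Pass one user message to the bot and return its reply text."""
    bot.hear({"text": user_text})
    return bot.talk()["text"]


bot = PingPongBot()
print(bot.talk()["text"])       # Hello there!
print(chat_turn(bot, "Ping?"))  # pong
print(chat_turn(bot, "hi"))     # Dummy reply for -- hi
```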
6. HTTPS with Nginx Reverse Proxy
- Running Certbot generated a sample nginx config at /etc/nginx/sites-enabled/default with the ssl_* fields configured.
- I added a reverse proxy for location /boteval → 127.0.0.1:7070/boteval, along with the proxy headers needed to make sessions/logins work.
The nginx settings below were tested and found to work when the Flask app is bound to 127.0.0.1:7070/boteval on an AWS EC2 instance whose IP is mapped (via a DNS A record) to dev.gowda.ai, with ports 80 (HTTP) and 443 (HTTPS) open to the public.
# Default server configuration
server {
    listen 80 default_server;
    listen [::]:80 default_server;

    root /var/www/html;
    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;
    server_name _;

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        try_files $uri $uri/ =404;
    }

    location /boteval {
        proxy_pass http://127.0.0.1:7070/boteval;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

server {
    root /var/www/html;
    # Add index.php to the list if you are using PHP
    index index.html index.htm index.nginx-debian.html;
    server_name dev.gowda.ai; # managed by Certbot

    location / {
        # First attempt to serve request as file, then
        # as directory, then fall back to displaying a 404.
        try_files $uri $uri/ =404;
    }

    location /boteval {
        proxy_pass http://127.0.0.1:7070/boteval;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect http://$http_host/ https://$http_host/;
    }

    listen [::]:443 ssl ipv6only=on; # managed by Certbot
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/dev.gowda.ai/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/dev.gowda.ai/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = dev.gowda.ai) {
        return 301 https://$host$request_uri;
    } # managed by Certbot

    listen 80;
    listen [::]:80;
    server_name dev.gowda.ai;
    return 404; # managed by Certbot
}
7. Development
- Model-view-controller pattern: en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller
- Flask for server-side programming: flask.palletsprojects.com/en/2.2.x/api/
- Server-side templating uses Jinja2: jinja.palletsprojects.com/en/3.1.x/templates/
- Login and user session manager: flask-login.readthedocs.io/en/latest/
- Database backend and ORM (persistence): flask-sqlalchemy.palletsprojects.com/en/2.x/models/
  - If you do not know how databases and ORMs work or how to use them, this could be the most complex piece of the system. The good news is that we build on top of the battle-tested SQLAlchemy. See its documentation at docs.sqlalchemy.org/en/14/
  - Some examples of CRUD operations: flask-sqlalchemy.palletsprojects.com/en/2.x/queries/
- Themes and styles via Bootstrap: getbootstrap.com/docs/4.5/getting-started/introduction/
- jQuery: for DOM manipulation and client-side templating
Remember to start the server with the -d option, i.e., python -m boteval -d <taskdir>, to enable hot reload.
For VS Code users, we recommend these extensions:
- Jinja: for working with Jinja templates
- SQLite Viewer: for inspecting database contents
- Remote development: useful for remote deployment with MTurk integration
- Asciidoctor: for editing docs
8. Acknowledgements
- This project was initially developed at USC ISI for the DARPA program named Civil Sanctuary.
- We learned from Mephisto. We used Mephisto in our pilot study, and at some point we decided to implement our own system with some differences in tech stack, architecture, and control flow. However, some portions of this system still resemble Mephisto.
- For a list of contributors to the source code, please refer to github.com/isi-nlp/boteval/graphs/contributors