BotEval Docs

This is the documentation for boteval, a library for chatbot evaluation. The source code is available at github.com/isi-nlp/boteval

Boteval Versions

Pre release
- v0.1

1. Setup

1. Python 3.9 or newer is required to run this project.
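
To confirm that the active interpreter meets this requirement, a short check like the sketch below can be run before installing. The helper name python_is_supported is made up for illustration and is not part of boteval.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version stated in this guide


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True when the given interpreter version is 3.9 or newer."""
    return tuple(version_info[:2]) >= MIN_PYTHON


print("Python OK" if python_is_supported() else "Python 3.9 or newer is required")
```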

Install

conda create -n boteval python=3.9
conda activate boteval
pip install boteval
boteval -h

If you plan to edit the boteval code or want to use the latest unreleased version, clone the source code from GitHub and install it in development mode.

Development setup

#create a python-3.9 environment (if necessary) and activate
git clone git@github.com:isi-nlp/boteval.git
cd boteval
pip install -e .
boteval -h

The -e flag enables editable mode i.e., you can keep editing files without having to reinstall.

2. Quick Start

boteval is structured around a task directory, which contains a conf.yml file. These concepts are introduced in later sections; for now, let's see if we can run a demo task as-is.

python -m boteval -h
usage: boteval [-h] [-c FILE] [-b /prefix] [-d] [-a ADDR] [-p PORT] [-v] DIR

Deploy chat bot evaluation

positional arguments:
  DIR                   Path to task dir. See "example-chat-task"

optional arguments:
  -h, --help            show this help message and exit
  -c FILE, --config FILE
                        Path to config file. Default is <task-dir>/conf.yml (default: None)
  -b /prefix, --base /prefix
                        Base path prefix for all url routes. Eg: /boteval (default: None)
  -d, --debug           Run Flask server in debug mode (default: False)
  -a ADDR, --addr ADDR  Address to bind to (default: 0.0.0.0)
  -p PORT, --port PORT  port to run server on (default: 7070)
  -v, --version         show version number and exit

We create an example chat task directory using boteval-quickstart

# create an example task directory
boteval-quickstart example-chat-task/

# run demo task
boteval -d example-chat-task/

It should print 127.0.0.1:7070 on a successful launch. Now open this URL in a web browser. You will be greeted with a login screen. You do not need to create a user account for development, because two accounts are preloaded by default (user IDs are shown):

dev: a user account for you to test as a worker
admin: an account with administrator privileges.

Where are the passwords for these accounts? See boteval/constants.py#L44-L45
You may export DEV_USER_SECRET and ADMIN_USER_SECRET environment variables (before the initial launch) to set your own passwords
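
As a minimal sketch of the environment-variable approach, the helper below builds an environment mapping with both passwords set, which can then be handed to the process that launches the server. The function name env_with_secrets is hypothetical; only the two variable names come from this guide.

```python
import os


def env_with_secrets(dev_secret: str, admin_secret: str, base_env=None) -> dict:
    """Copy an environment mapping and add the two password variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEV_USER_SECRET"] = dev_secret
    env["ADMIN_USER_SECRET"] = admin_secret
    return env


# Usage (not executed here): pass the mapping to the first launch, e.g.
#   subprocess.run(["boteval", "-d", "example-chat-task/"],
#                  env=env_with_secrets("my-dev-pw", "my-admin-pw"))
```

Because the variables must be set before the initial launch, exporting them after the accounts already exist is not expected to change the passwords.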

Now log in as the dev user in one browser and as the admin user in another browser. Then play around with the web UI.

TODO: We should make a video demo.

The chat data and user annotations are stored in two places:
1. Inside task-dir/data directory as JSON files — one per chat thread
2. Inside database file that you have configured
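
The JSON copies can be inspected with a few lines of Python. The sketch below only assumes that the files end in .json and live somewhere under the data directory; it makes no assumption about the fields inside each thread, and load_threads is an illustrative name.

```python
import json
from pathlib import Path


def load_threads(task_dir):
    """Return {relative file path: parsed JSON} for each *.json under <task_dir>/data."""
    data_dir = Path(task_dir) / "data"
    threads = {}
    for path in sorted(data_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            threads[str(path.relative_to(data_dir))] = json.load(f)
    return threads
```

For example, load_threads("example-chat-task") returns one dictionary entry per chat thread file found.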

The following sections describe how to customize, and deploy this on production.

3. Task Directory

Here is the structure of an example chat task directory:

example-chat-task/
├── conf.yml             (1)
├── chat_topics.json
├── __init__.py          (2)
├── instructions.html
└── user-agreement.html

1	See Section 4, “Config file”
2	See Section 5, “Adding Bots and Transforms”
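
Before launching, it can help to check that a task directory contains the files listed above. This is a sketch only: the file names are those of the example task, other tasks may name their topic and HTML files differently in conf.yml, and missing_files is not a boteval function.

```python
from pathlib import Path

# File names taken from the example task directory shown above.
EXAMPLE_TASK_FILES = (
    "conf.yml",
    "chat_topics.json",
    "__init__.py",
    "instructions.html",
    "user-agreement.html",
)


def missing_files(task_dir, names=EXAMPLE_TASK_FILES):
    """Return the expected file names that are absent from task_dir."""
    root = Path(task_dir)
    return [name for name in names if not (root / name).is_file()]
```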

4. Config file

Sample conf.yml file

chatbot:
  display_name: 'Moderator'
  topics_file: chat_topics.json
  bot_name: hf-transformers   (1)
  bot_args:                   (1)
    model_name: facebook/blenderbot_small-90M

onboarding: (2)
  agreement_file: user-agreement.html
  instructions_file: instructions.html
  checkboxes:
    instructions_read: I have read the instructions.
    iam_adult: I am 18 years or older and I understand that I may have to read and write using toxic language.

ratings: (3)
  - question: 'How Coherent was the conversation?'
    choices: &choices
      - Not at all
      - Mostly not
      - So-so
      - Somewhat
      - Very
  - question: 'How likely are you going to continue the conversation with the bot?'
    choices: *choices
  - question: 'To what degree did the bot convince you to change your behavior?'
    choices: *choices

limits:  (4)
  max_threads_per_user: 10
  max_threads_per_topic: &max_assignments 3
  max_turns_per_thread: 4
  reward: &reward '0.01' # dollars

flask_config: (5)
  # sqlalchemy settings https://flask-sqlalchemy.palletsprojects.com/en/2.x/config/
  DATABASE_FILE_NAME: 'sqlite-dev-01.db'   # this will be placed in task dir
  SQLALCHEMY_TRACK_MODIFICATIONS: false

mturk: (6)
  client:
    profile: default # the [default] profile in ~/.aws/credentials file
    sandbox: true  # sandbox: false to go live
  seamless_login: true
  hit_settings:
    # https://boto3.amazonaws.com/v1/documentation/api/1.11.9/reference/services/mturk.html#MTurk.Client.create_hit
    MaxAssignments: *max_assignments
    AutoApprovalDelayInSeconds: 604800      # 7 days  = 604k sec
    LifetimeInSeconds: 1209600              # 14 days = 1.2M sec
    AssignmentDurationInSeconds: 3600       # 1 hour  = 3.6k sec
    Reward: *reward
    Title: 'Evaluate a chatbot'
    Keywords: 'chatbot,chat,research'
    Description: 'Evaluate a chat bot by talking to it for a while and receive a reward'

1	See Section 4.1, “Bot Settings”
2	See Section 4.2, “Onboarding Settings”
3	See Section 4.3, “Ratings Settings”
4	See Section 4.4, “Limits Settings”
5	See Section 4.5, “Flask Server Settings”
6	See Section 4.6, “Crowd: MTurk Settings”

4.1. Bot Settings

bot_name (str) selects the chatbot backend and is required; bot_args (dict) supplies its arguments.
bot_name is a string, whereas bot_args is a dictionary that is passed to the bot as keyword arguments. bot_args may be omitted for bots that require no arguments.

Example Bot

from boteval import log, C, registry as R
from boteval.bots import BotAgent

BLENDERBOT_90M = "facebook/blenderbot_small-90M"

@R.register(R.BOT, 'hf-transformers')
class TransformerBot(BotAgent):

    NAME = 'transformers'

    def __init__(self, model_name=BLENDERBOT_90M, **kwargs) -> None:
        super().__init__(name=f'{self.NAME}:{model_name}')
        self.model_name = model_name

Here, with bot_name='hf-transformers', bot_args is optional because every argument of the __init__ method has a default value. However, if we want to change model_name, here is an example of how to provide it:

chatbot:
  #(other args)
  bot_name: hf-transformers
  bot_args:
    model_name: facebook/blenderbot_small-90M
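
To make the relation between bot_name, bot_args, and the constructor concrete, here is a self-contained sketch of a name-to-class registry. It is a simplified stand-in and not boteval's actual registry code: BOTS, register_bot, make_bot, and FakeTransformerBot are all hypothetical names.

```python
from typing import Callable, Dict

BOTS: Dict[str, type] = {}  # hypothetical stand-in for boteval's registry


def register_bot(name: str) -> Callable[[type], type]:
    """Class decorator that files a bot class under the given name."""
    def decorator(cls: type) -> type:
        BOTS[name] = cls
        return cls
    return decorator


@register_bot("hf-transformers")
class FakeTransformerBot:
    def __init__(self, model_name: str = "facebook/blenderbot_small-90M", **kwargs):
        self.model_name = model_name


def make_bot(chatbot_conf: dict):
    """Look up bot_name and pass bot_args (if present) as keyword arguments."""
    bot_cls = BOTS[chatbot_conf["bot_name"]]
    bot_args = chatbot_conf.get("bot_args") or {}
    return bot_cls(**bot_args)


# bot_args omitted: the default model_name is used.
default_bot = make_bot({"bot_name": "hf-transformers"})
# bot_args given: its keys become keyword arguments of __init__.
custom_bot = make_bot({"bot_name": "hf-transformers",
                       "bot_args": {"model_name": "my-org/my-model"}})
```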

Seed Conversation: chatbot.topics_file is required to provide seed conversations. See example-chat-task/chat_topics.json in the source repository for an example.
Bot Name: Set the chatbot.display_name property

4.2. Onboarding Settings

Example onboarding config

onboarding:
  agreement_file: user-agreement.html
  instructions_file: instructions.html
  checkboxes:
    instructions_read: I have read the instructions.
    iam_adult: I am 18 years or older and I understand that I may have to read and write using toxic language.

The agreement_file and instructions_file may contain arbitrary HTML/CSS/JS content. The contents of agreement_file are shown during user signup (account creation), while users can access the contents of instructions_file from a chat window.

The items under onboarding.checkboxes are shown on the signup page, where users are asked to provide agreement/consent.
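
One way to think about the checkboxes is as a set of keys that must all be ticked before signup proceeds. The sketch below assumes, purely for illustration, that the signup form submits one field per checkbox key; unchecked_boxes is a hypothetical helper, not part of boteval.

```python
def unchecked_boxes(checkboxes: dict, submitted: dict) -> list:
    """Return the keys from onboarding.checkboxes that the user did not tick."""
    return [key for key in checkboxes if not submitted.get(key)]


checkboxes = {
    "instructions_read": "I have read the instructions.",
    "iam_adult": "I am 18 years or older and I understand that "
                 "I may have to read and write using toxic language.",
}
# A form submission in which only the first box was ticked.
pending = unchecked_boxes(checkboxes, {"instructions_read": "on"})
```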

4.3. Ratings Settings

The ratings section configures the questions a user answers after a chat task is done. Currently, only multiple choice questions are supported (TODO: we probably need to extend this to support other kinds of input).

For each multiple choice question, specify the question text as question: str and its choices as choices: List[str]

Example rating questions

ratings:
  - question: 'How Coherent was the conversation?'
    choices: &choices  (1)
      - Not at all
      - Mostly not
      - So-so
      - Somewhat
      - Very
  - question: 'How likely are you going to continue the conversation with the bot?'
    choices: *choices (1)
  - question: 'To what degree did the bot convince you to change your behavior?'
    choices: *choices

1	`&choices` defines a YAML anchor (a named reference) and `*choices` is an alias that refers to the previously defined anchor. This is an elegant way of reusing previously defined config values instead of repeating them.
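
After parsing, the anchor and its aliases resolve to the same list of choices, so the example above is equivalent to the Python structure sketched below. The invalid_answers helper is hypothetical and only illustrates how a submitted answer could be checked against the configured choices.

```python
# The alias reuses the list defined at the anchor, so all questions share it.
choices = ["Not at all", "Mostly not", "So-so", "Somewhat", "Very"]
ratings = [
    {"question": "How Coherent was the conversation?", "choices": choices},
    {"question": "How likely are you going to continue the conversation with the bot?",
     "choices": choices},
    {"question": "To what degree did the bot convince you to change your behavior?",
     "choices": choices},
]


def invalid_answers(ratings: list, answers: dict) -> list:
    """Return the questions whose answer is missing or not among the choices."""
    return [r["question"] for r in ratings
            if answers.get(r["question"]) not in r["choices"]]
```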

4.4. Limits Settings

Example limits config

limits:
  max_threads_per_user: 10   (1)
  max_threads_per_topic: 3  (2)
  max_turns_per_thread: 4   (3)
  reward: '0.01'           (4)

1	Maximum number of threads (MTurk assignments) a worker can do
2	Maximum number of threads (MTurk assignments) we need for a topic (MTurk HIT)
3	Maximum number of worker replies required in a thread (assignment) to consider it complete
4	Reward amount. Note: currently, payment can be provided to MTurk workers only; we don't have our own payment processing backend.
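
The arithmetic implied by these limits can be sketched as follows. The helper names are illustrative, the completion rule is an interpretation of callout 3 rather than boteval's actual code, and the totals exclude any platform fees.

```python
from decimal import Decimal

limits = {
    "max_threads_per_user": 10,
    "max_threads_per_topic": 3,
    "max_turns_per_thread": 4,
    "reward": "0.01",  # dollars, kept as a string as in conf.yml
}


def thread_is_complete(worker_replies: int, limits: dict) -> bool:
    """A thread counts as complete once the worker has replied enough times."""
    return worker_replies >= limits["max_turns_per_thread"]


def max_reward_per_topic(limits: dict) -> Decimal:
    """Upper bound on rewards paid for one topic (one reward per thread)."""
    return Decimal(limits["reward"]) * limits["max_threads_per_topic"]


def max_reward_per_worker(limits: dict) -> Decimal:
    """Upper bound on rewards a single worker can earn."""
    return Decimal(limits["reward"]) * limits["max_threads_per_user"]
```

Decimal is used instead of float so that small currency amounts such as 0.01 are multiplied exactly.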

4.5. Flask Server Settings

As you may have figured out already, the server-side code is powered by Python Flask. Flask is a very powerful and flexible system (the reason why we chose to use it!). For Flask's configuration options, please refer to flask.palletsprojects.com/en/2.0.x/config/#configuration-basics

We expose access to Flask's app.config data structure: anything set under the flask_config key at the root of the YAML file is copied into app.config

Example flask config

flask_config:
  DATABASE_FILE_NAME: 'sqlite-dev-01.db' (1)
  SQLALCHEMY_TRACK_MODIFICATIONS: false

1	`DATABASE_FILE_NAME` is the filename for the sqlite3 database.

Use a different DATABASE_FILE_NAME for development vs production. When you want to have a fresh start, simply change db filename.

You may also configure SQLAlchemy here: flask-sqlalchemy.palletsprojects.com/en/2.x/config/
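
The effect of the flask_config section can be pictured with plain dictionaries. Flask's app.config is a dict subclass, so the same update call applies to it; the sketch uses an ordinary dict as a stand-in to stay dependency-free, and the initial DEBUG entry is just a placeholder for Flask's own defaults.

```python
# Parsed conf.yml (only the relevant section is shown).
conf = {
    "flask_config": {
        "DATABASE_FILE_NAME": "sqlite-dev-01.db",
        "SQLALCHEMY_TRACK_MODIFICATIONS": False,
    }
}

# Stand-in for app.config, pre-populated with one placeholder default.
app_config = {"DEBUG": False}

# Every key under flask_config is copied into the application config.
app_config.update(conf.get("flask_config", {}))
```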

4.6. Crowd: MTurk Settings

limits:
  max_threads_per_user: 10
  max_threads_per_topic: &max_assignments 3    (5)
  max_turns_per_thread: 4
  reward: &reward '0.01'  (5)

mturk:
  client:
    profile: default  (1)
    sandbox: true    (2)
  seamless_login: true (3)
  hit_settings: (4)
    MaxAssignments: *max_assignments  (5)
    AutoApprovalDelayInSeconds: 604800      # 7 days  = 604k sec
    LifetimeInSeconds: 1209600              # 14 days = 1.2M sec
    AssignmentDurationInSeconds: 3600       # 1 hour  = 3.6k sec
    Reward: *reward  (5)
    Title: 'Evaluate a chatbot'
    Keywords: 'chatbot,chat,research'
    Description: 'Evaluate a chat bot by talking to it for a while and receive a reward'

1	profile name should match the ones in `$HOME/.aws/credentials` file
2	Set `sandbox: true` to use the sandbox, and `sandbox: false` to go live
3	MTurk users will be automatically logged into our system. An account is created if required. MTurk workers do not need to remember a user ID or password for logging into our system.
4	All these key-values are sent to mturk API; Refer to boto3.amazonaws.com/v1/documentation/api/1.11.9/reference/services/mturk.html#MTurk.Client.create_hit for a full list of options.
5	cross references using `&` and `*` for reusing previously defined limits
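
The second-based values in hit_settings are easier to get right when derived from days and hours. A small sketch, where seconds is a hypothetical helper name:

```python
from datetime import timedelta


def seconds(**duration) -> int:
    """Convert a duration given as timedelta keywords into whole seconds."""
    return int(timedelta(**duration).total_seconds())


hit_durations = {
    "AutoApprovalDelayInSeconds": seconds(days=7),
    "LifetimeInSeconds": seconds(days=14),
    "AssignmentDurationInSeconds": seconds(hours=1),
}
```

These reproduce the values used in the example config: 604800, 1209600, and 3600.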

MTurk integration is achieved via ExternalQuestion. However, ExternalQuestion requires hosting our web service over HTTPS, which requires an SSL certificate. See Section 6, “HTTPS with Nginx Reverse Proxy”.
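
For orientation, an ExternalQuestion is a small XML document that tells MTurk which URL to show to the worker in a frame. The sketch below builds such a document and rejects non-HTTPS URLs. It is not boteval's actual code: the function name, the default frame height, and the example URL are assumptions, and the schema namespace should be verified against the current MTurk documentation.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape

# Namespace as published in the MTurk ExternalQuestion schema (verify before use).
XMLNS = "http://mturk.com/mturk/externalquestion/2006-07-14/ExternalQuestion.xsd"


def external_question_xml(url: str, frame_height: int = 600) -> str:
    """Build ExternalQuestion XML for an HTTPS URL."""
    if urlparse(url).scheme != "https":
        raise ValueError(f"ExternalQuestion needs an HTTPS URL, got: {url}")
    return (
        f'<ExternalQuestion xmlns="{XMLNS}">'
        f"<ExternalURL>{escape(url)}</ExternalURL>"
        f"<FrameHeight>{frame_height}</FrameHeight>"
        "</ExternalQuestion>"
    )


question = external_question_xml("https://example.com/boteval/?topic=1&x=2")
```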

When the task is done, we need to submit a form back to MTurk to report the completion. So, the MTurk worker gets an additional screen at the end of the task, where they click a button to notify MTurk of the task completion.

In the current version of this system, we do not automatically launch HITs on MTurk; instead, we provide an option to the admin user.

To launch HITs on MTurk, follow these steps:

1. Log in as the admin user (See Section 2, “Quick Start”)
2. Go to Admin Dashboard > Topics > Launch on MTurk or MTurk Sandbox (depending on config)

5. Adding Bots and Transforms

If an __init__.py file is found at the root of the task directory, the directory is treated as a Python module and imported.
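
The general technique of importing a directory as a module can be sketched with the standard importlib machinery. This illustrates the idea only; boteval's own import code may differ, and import_task_dir and the default module name are hypothetical.

```python
import importlib.util
import sys
from pathlib import Path


def import_task_dir(task_dir, module_name="task_module"):
    """Import <task_dir>/__init__.py as a package; return None if it is absent."""
    init_file = Path(task_dir) / "__init__.py"
    if not init_file.is_file():
        return None
    spec = importlib.util.spec_from_file_location(
        module_name,
        init_file,
        # Treat the directory as a package so relative imports can resolve.
        submodule_search_locations=[str(init_file.parent)],
    )
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module  # register before executing, as import does
    spec.loader.exec_module(module)    # runs the file, including any decorators
    return module
```

Executing the file is what triggers registration decorators placed on classes inside it.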

Refer to example-chat-task directory for an example.

You may have to install additional requirements/libraries for your code.

Example bot and transform

from typing import Any

from boteval import log, C, registry as R
from boteval.bots import BotAgent
from boteval.transforms import BaseTransform, SpacySplitter
from boteval.model import ChatMessage


@R.register(R.BOT, name="my-dummy-bot")
class MyDummpyBot(BotAgent):

    def __init__(self, **kwargs):
        super().__init__(name="dummybot", **kwargs)
        self.args = kwargs
        log.info(f"{self.name} initialized; args={self.args}")

    def talk(self) -> dict[str, Any]:
        if not self.last_msg:
            reply = f"Hello there! I am {C.BOT_DISPLAY_NAME}."
        elif 'ping' in self.last_msg['text'].lower():
            reply = 'pong'
        else:
            reply = f"Dummy reply for -- {self.last_msg['text']}"
        return dict(text=reply)

    def hear(self, msg: dict[str, Any]):
        self.last_msg = msg


@R.register(R.TRANSFORM, name='my-transform')
class MyDummyTransform(BaseTransform):

    def __init__(self, **kwargs) -> None:
        super().__init__()
        self.args = kwargs
        self.splitter = SpacySplitter.get_instance()

    def transform(self, msg: ChatMessage) -> ChatMessage:
        try:
            text_orig = msg.text
            text = '\n'.join(self.splitter(text_orig))  + ' (transformed)' # sample transform
            msg.text = text
            msg.data['text_orig'] = text_orig
        except Exception as e:
            log.error(f'{e}')
        return msg

Suppose the above code is placed in <task-dir>/__init__.py; boteval then imports it as a Python module. The

@R.register(R.BOT, name="my-dummy-bot")

and

@R.register(R.TRANSFORM, name='my-transform')

statements register a custom bot and a custom transform, respectively. With this, the following config should be self-explanatory (otherwise, revisit Section 4.1, “Bot Settings”)

chatbot:
  display_name: 'Moderator'
  topics_file: chat_topics.json
  bot_name: my-dummy-bot
  bot_args:
    key1: val1
    key2: val2
    some_flag: true

  transforms:
    human:
      - name: my-transform
        args:
          arg1: val1
          arg2: [val2, val3]
    bot:
      - name: my-transform
        args:
          arg1: val1
          arg2: [val2, val3]
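
To illustrate how such a config could drive message processing, the sketch below builds a chain of transforms per speaker from name/args entries and applies them in the listed order. Everything here is a simplified assumption: messages are plain dicts rather than ChatMessage objects, the suffix argument is invented, and the order of application in boteval itself is not confirmed by this guide.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]
Transform = Callable[[Message], Message]

TRANSFORMS: Dict[str, Callable[..., Transform]] = {}  # hypothetical registry


def make_suffix_transform(suffix: str = " (transformed)", **kwargs) -> Transform:
    """Factory for a toy transform that appends a suffix to the text."""
    def transform(msg: Message) -> Message:
        return {**msg, "text_orig": msg["text"], "text": msg["text"] + suffix}
    return transform


TRANSFORMS["my-transform"] = make_suffix_transform


def build_chain(specs: List[dict]) -> List[Transform]:
    """Instantiate each configured transform with its args as keyword arguments."""
    return [TRANSFORMS[spec["name"]](**spec.get("args", {})) for spec in specs]


def apply_chain(chain: List[Transform], msg: Message) -> Message:
    """Pass the message through each transform in order."""
    for transform in chain:
        msg = transform(msg)
    return msg


transforms_conf = {
    "human": [{"name": "my-transform", "args": {"suffix": " [from human]"}}],
    "bot": [{"name": "my-transform", "args": {"suffix": " [from bot]"}}],
}
chains = {who: build_chain(specs) for who, specs in transforms_conf.items()}
human_msg = apply_chain(chains["human"], {"text": "ping"})
bot_msg = apply_chain(chains["bot"], {"text": "pong"})
```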

6. HTTPS with Nginx Reverse Proxy

Summary:

Running Certbot generated a sample nginx config at /etc/nginx/sites-enabled/default with ssl_* fields configured.
I added a reverse proxy for location /boteval → 127.0.0.1:7070/boteval, along with the proxy headers needed to make sessions/logins work.

The nginx configuration below was tested with the Flask app bound to 127.0.0.1:7070/boteval on an AWS EC2 instance whose IP is mapped (via a DNS A record) to dev.gowda.ai, with ports 80 (HTTP) and 443 (HTTPS) open to the public.

# Default server configuration
server {
        listen 80 default_server;
        listen [::]:80 default_server;

        root /var/www/html;

        # Add index.php to the list if you are using PHP
        index index.html index.htm index.nginx-debian.html;

        server_name _;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
        }

        location /boteval {
                proxy_pass http://127.0.0.1:7070/boteval ;
                proxy_set_header Host $http_host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-Proto $scheme;
        }

}


server {
        root /var/www/html;
        # Add index.php to the list if you are using PHP
        index index.html index.htm index.nginx-debian.html;
        server_name dev.gowda.ai; # managed by Certbot

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
        }

        location /boteval {
                proxy_pass http://127.0.0.1:7070/boteval;
                proxy_set_header Host $http_host;
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_set_header X-Forwarded-Proto $scheme;
                proxy_redirect http://$http_host/ https://$http_host/;
        }


    listen [::]:443 ssl ipv6only=on; # managed by Certbot
    listen 443 ssl; # managed by Certbot
    ssl_certificate /etc/letsencrypt/live/dev.gowda.ai/fullchain.pem; # managed by Certbot
    ssl_certificate_key /etc/letsencrypt/live/dev.gowda.ai/privkey.pem; # managed by Certbot
    include /etc/letsencrypt/options-ssl-nginx.conf; # managed by Certbot
    ssl_dhparam /etc/letsencrypt/ssl-dhparams.pem; # managed by Certbot
}

server {
    if ($host = dev.gowda.ai) {
        return 301 https://$host$request_uri;
    } # managed by Certbot

        listen 80 ;
        listen [::]:80 ;
    server_name dev.gowda.ai;
    return 404; # managed by Certbot

}
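
The X-Forwarded-Proto header matters because the application only sees plain HTTP from nginx and would otherwise generate http:// links and redirects. The sketch below shows the idea as a tiny WSGI middleware; it is an illustration and not how boteval handles the header, and the names are made up. Werkzeug ships a ProxyFix middleware for the same purpose, and such headers should only be trusted when the app is reachable solely through the proxy.

```python
def forwarded_proto_middleware(app):
    """Wrap a WSGI app so wsgi.url_scheme follows the X-Forwarded-Proto header."""
    def wrapper(environ, start_response):
        proto = environ.get("HTTP_X_FORWARDED_PROTO")
        if proto in ("http", "https"):
            environ["wsgi.url_scheme"] = proto
        return app(environ, start_response)
    return wrapper


def demo_app(environ, start_response):
    """Toy app that reports which scheme it believes the request used."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [environ["wsgi.url_scheme"].encode()]


wrapped_app = forwarded_proto_middleware(demo_app)
```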

7. Development

Docs for tools and concepts that power this system

Model view controller pattern: en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller
Flask for server-side programming: flask.palletsprojects.com/en/2.2.x/api/
Server side templating is using Jinja2: jinja.palletsprojects.com/en/3.1.x/templates/
Login and user session manager: flask-login.readthedocs.io/en/latest/
Database backend and ORM (persistence): flask-sqlalchemy.palletsprojects.com/en/2.x/models/
- If you do not know how databases and ORMs work or how you could use them, this could be the most complex piece of the system. The good news is that we build on top of the battle-tested SQLAlchemy. See its documentation at docs.sqlalchemy.org/en/14/
- Some examples for CRUD ops flask-sqlalchemy.palletsprojects.com/en/2.x/queries/
Themes and styles via Bootstrap : getbootstrap.com/docs/4.5/getting-started/introduction/
jQuery: for DOM manipulation and client side templating

Remember to start the server with -d option i.e., python -m boteval -d <taskdir> to enable hot reload.

For VS Code users, we recommend these extensions:
* Jinja: for working with Jinja templates
* SQLite Viewer: for inspecting database contents
* Remote development: useful for remote deployment with MTurk integration
* Asciidoctor: for editing docs

8. Acknowledgements

This project was initially developed at USC ISI for the DARPA program named Civil Sanctuary.
We learned from Mephisto. We used Mephisto in our pilot study, and at some point we decided to implement this system with some differences in tech stack, architecture, and control flow. However, some portions of this system still resemble Mephisto.
For a list of contributors to source code, please refer to github.com/isi-nlp/boteval/graphs/contributors