Can ChatGPT write a proper usability test script?

A 3D rendering of two robots in profile, one showing the other a tablet device. (Credit: author via Stable Diffusion under CC0 1.0 License)

Conducting usability tests is a UX researcher’s bread and butter.

It’s a critical part of a user-centered product development process, helping to identify and address any issues or challenges that users may encounter. But preparing for a usability test can be a time-consuming and resource-intensive process, requiring the development of screeners, tasks, follow-up questions, and other materials.

One tool that might help lighten the load is generative AI, a type of artificial intelligence that can automatically generate a wide range of content, such as artwork, movies, and music. Recently, OpenAI unveiled ChatGPT, a generative AI with a chatbot interface that can produce code, screenplays, and haiku, among many other things.

But can it write a moderator’s guide for a usability test?

To find out, I asked ChatGPT: “Write an outline for a usability test moderator’s guide. The product is an online brokerage platform for trading stocks and investing. The tasks are for prospects considering the service and for existing customers managing their accounts.”

Here’s what it returned:

1. Introduction

- Purpose of the usability test: To evaluate the usability and user experience of the online brokerage platform for trading stocks and investing.
- Overview of the moderator's role: The moderator will introduce the tasks, ask questions, and observe and record the participants' interactions with the platform.
- Information about the test environment and participants: The test will be conducted remotely, using screen sharing and video conferencing software. The participants will include prospects considering the service and existing customers managing their accounts.

2. Preparation

- Familiarization with the product: The moderator should review the platform's features and functionality, as well as any relevant documentation or user guides.
- Review of the test goals and questions: The moderator should identify the specific goals and questions for each task, and consider how to best observe and measure the participants' performance and satisfaction.
- Creation of a test plan and script: The moderator should create a detailed plan outlining the tasks, questions, and observations for each part of the test. The script should include prompts for introducing the tasks and asking open-ended questions, as well as specific questions to measure performance and satisfaction.
- Selection and setup of equipment: The moderator should ensure that the necessary recording software, cameras, and microphones are set up and working properly.
- Recruitment and scheduling of test participants: The moderator should identify and recruit appropriate test participants, and schedule the test sessions at a convenient time for each participant.

3. Conducting the test

- Welcoming and orienting the participant: The moderator should greet the participant and introduce themselves, explaining the purpose and process of the usability test.
- Introducing the task and goals: The moderator should describe the task and goals, and provide any necessary context or instructions.
- Asking open-ended questions and encouraging participants to think aloud: The moderator should encourage the participant to speak their thoughts and share their reactions as they interact with the platform.
- Observing and taking notes on participant behavior and feedback: The moderator should observe and record the participant's actions and feedback, paying attention to areas of difficulty or confusion.
- Providing assistance as needed, but avoiding influencing the participant's actions: If the participant appears to be struggling with a task, the moderator can offer help or clarification, but should avoid giving direct guidance or influencing the participant's actions.
- Following the script and asking the prepared questions: The moderator should follow the script and ask the prepared questions to gather additional data and insights.

4. Debrief and wrap-up

- Thanking the participant for their time and feedback: The moderator should thank the participant for their participation and feedback.
- Gathering any final thoughts or comments: The moderator should ask the participant if they have any final thoughts or comments about their experience.
- Stopping the recording and completing any necessary post-test tasks: The moderator should stop the recording and complete any necessary post-test tasks, such as transcribing notes or uploading recordings.
- Analyzing the data and sharing the results with relevant team members: The moderator should analyze the data and share the results with relevant team members, including recommendations for improvements or changes based on the findings.

Of course, any real usability test should be defined in close cooperation with the invested stakeholders to ensure that the results gathered are useful for their goals. But since I formerly worked as a UX researcher in the finance space, I could at least compare the output with my prior experience on similar studies.

And so far, at least… not bad!

I’ve never created a document exactly like this — many of these materials usually end up living in different places — but one could imagine a scenario where it might be appropriate to keep everything related to a study in one place.

Outline: A-

Next, I asked for guidance on the sample size: “How many participants should we recruit?”

The number of participants to recruit for a usability study will depend on a variety of factors, including the goals and objectives of the study, the complexity and scope of the product or service being tested, and the resources available for conducting the study.

In general, usability studies typically aim to gather data from a relatively small number of participants, as the goal is to identify and understand common patterns and issues rather than to estimate population-level statistics. For this reason, usability studies are often designed as formative or exploratory studies rather than summative or confirmatory studies.

As a rough guide, usability studies may aim to recruit between 5 and 15 participants, depending on the complexity and scope of the product or service being tested. This is often referred to as the "rule of five," as it is generally believed that recruiting at least 5 participants can help identify the majority of usability issues and provide a sufficient sample size for analysis.

However, it is important to note that the number of participants may vary depending on the specific goals and objectives of the study, as well as the resources available for conducting the study. It may be helpful to consult with usability experts or review relevant literature to determine the most appropriate sample size for a particular study.

So just like a real UX researcher, ChatGPT said: “It depends.”

It’s a bit cagey on the nuances, and the distinction between formative and summative work is something of a non sequitur here. But it did give 5 to 15 participants as a rule of thumb, which is in line with industry best practices for a small qualitative study. So this answer is just passing.
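For what it’s worth, the “5 to 15” rule of thumb traces back to the classic problem-discovery model, which estimates the share of usability issues uncovered by n participants as 1 − (1 − p)^n, where p is the chance that any one participant hits a given issue. Here’s a minimal sketch of that model in Python, assuming the commonly cited p ≈ 0.31; the exact value of p varies by product and task, so treat the numbers as illustrative.

```python
# Classic problem-discovery model: share of issues found = 1 - (1 - p)^n
# p is the probability that a single participant encounters a given issue;
# 0.31 is a commonly cited assumption, not a universal constant.

def proportion_of_issues_found(n_participants: int, p: float = 0.31) -> float:
    """Estimated share of usability issues uncovered by n participants."""
    return 1 - (1 - p) ** n_participants

for n in (3, 5, 8, 15):
    print(f"{n:>2} participants: ~{proportion_of_issues_found(n):.0%} of issues found")
```

With p = 0.31, five participants surface roughly 84% of issues, which is why the “rule of five” keeps coming up.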

Sample Size: C-

Next, I asked: “Can you draft a short screener to have potential participants take so we can ensure we have the right people in our study?”

1. Do you currently have an online brokerage account?
   - Yes
   - No
2. If you answered "Yes" to the previous question, which of the following firms do you have an account with? (Select all that apply)
   - Vanguard
   - Charles Schwab
   - Fidelity
   - E*TRADE
   - Other (please specify)
3. Have you used an online brokerage platform in the past?
   - Yes
   - No
4. How frequently do you trade stocks or invest online?
   - Daily
   - Weekly
   - Monthly
   - Rarely
   - Never
5. Do you have any experience with investing or trading stocks in general?
   - Yes, I have a lot of experience
   - Yes, I have some experience
   - No, I am new to investing
6. Are you currently considering opening an online brokerage account?
   - Yes
   - No

It’s a start.

The best practice is to avoid asking direct questions related to the nature of the study, so that respondents don’t get through by answering dishonestly. Thus, it’d be better to ask the respondent to select all of the types of accounts they have, with one of the many options being brokerage accounts.

There are other quibbles, too. Questions 3 and 6 aren’t bad in themselves, but would make more sense if they were conditional on, and closer to, question 1. And the response options for question 4 may make the question difficult for participants to know how to answer.
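To make the masking and branching concrete, here’s a rough sketch of how that qualification logic might look, with a “select all that apply” account-type question hiding the study’s focus and the prospect question asked conditionally. The question wording and segment labels are my own illustration, not ChatGPT’s output.

```python
# Illustrative screener logic (hypothetical wording and segments).
# The masked multi-select question keeps respondents from guessing that
# a brokerage account is the qualifying answer.

ACCOUNT_TYPES = ["Checking", "Savings", "Credit card", "Brokerage", "Mortgage"]

def classify_respondent(accounts_held: set[str], considering_brokerage: bool) -> str:
    """Return a segment label, or 'disqualify' if the respondent fits neither group."""
    if "Brokerage" in accounts_held:
        return "existing customer"   # route to account-management tasks
    if considering_brokerage:
        return "prospect"            # route to prospect tasks
    return "disqualify"              # screen out

print(classify_respondent({"Checking", "Brokerage"}, considering_brokerage=False))  # existing customer
print(classify_respondent({"Savings"}, considering_brokerage=True))                 # prospect
```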

But worst of all, the final list of questions took several prompts from me, unlike the other outputs shown in this article.

The first version didn’t include specific firms that participants have an account with — an important detail for separating prospects and customers. The second version had placeholder names (e.g., “Firm A,” “Firm B”…) instead of actual firms. Finally, I asked it to list the top four firms by assets under management. So ChatGPT failed this one without human assistance.

Screener: F

Moving on to the test itself, I skipped past the typical introductory speech that researchers give at the beginning of a study, as many of those points (e.g. welcoming and orienting the participant, reminding them to think aloud) were already in the outline.

So I asked ChatGPT to “draft 5 open-ended questions to ask participants. These should be ice-breakers or elicit broader context about the study topic.”

1. Can you tell me a little bit about your experience with online brokerage platforms or investing in general?
2. What attracted you to this platform or service?
3. What do you hope to achieve with this platform or service?
4. Are there any specific features or functionality that you are particularly interested in or looking for in an online brokerage platform?
5. Is there anything else you'd like to share about your expectations or motivations for participating in this study?

Though question 5 is a bit odd, it’s good to make sure the participant doesn’t have any burning questions before moving on. And I never specified which platform we were testing, so I can forgive the awkward repetition of “platform or service.”

Icebreakers: B

Then I asked ChatGPT to “prepare a list of 12 potential tasks to test in this study. 6 should be for prospects and 6 should be for current customers.”

Prospects:

1. Searching for and comparing different stock options
2. Setting up a new account and completing the registration process
3. Adding funds to the account and making a first trade
4. Viewing account activity and portfolio performance
5. Using the platform's research and analysis tools
6. Contacting customer support for assistance

Current customers:

1. Viewing account activity and portfolio performance
2. Making a trade
3. Viewing and managing alerts and notifications
4. Accessing and using research and analysis tools
5. Updating account settings and preferences
6. Contacting customer support for assistance or questions

The tasks for current customers are pretty sensible. But onboarding flow aside, the prospect tasks might as well be customer tasks. Four are exactly the same!

They also might not be feasible, depending on the stimuli used in the study. For example, registering for a brokerage account may take hours or days pending manual reviews sometimes required by regulations. Plus, signing up requires a lot of sensitive personal information that may not be appropriate to collect in this kind of study. So these tasks could only be done with a prototype.

Tasks: C-

Nevertheless, I asked, “Can you draft scenario prompts to read to the participant for the 12 tasks you listed?” I’m reproducing just the first three here, as all were of similar quality:

1. Searching for and comparing different stock options: "Imagine that you are a beginner investor and are looking to start building a portfolio. You are interested in a range of stocks, but you want to compare different options and make an informed decision. Using the platform, please search for and compare different stock options, and consider factors such as price, performance, and risk."

2. Setting up a new account and completing the registration process: "Imagine that you are new to this platform and are interested in setting up a new account. Using the platform, please complete the registration process and set up a new account, including any necessary steps such as verifying your identity and adding personal or financial information."

3. Adding funds to the account and making a first trade: "Imagine that you have just set up a new account on this platform and are ready to start investing. Using the platform, please add funds to your account and make your first trade, considering factors such as the amount of funds you want to add, the type of asset you want to trade, and any associated fees or risks."

For open-ended tasks, these are all right.

But if we want to collect comparable data on task success rates across participants, the instructions need to be specific enough that participants know whether they’ve completed the task. And it may be useful to constrain the scenario further so that participants don’t vary as wildly in time spent and the depth of the product explored.

Test Scenarios: B

I then asked it to “create a set of 2 to 5 follow-up questions to ask after tasks. Include some open-ended questions and rating scales.”

Open-ended questions:

- How did you feel about the task overall?
- What did you find most helpful or useful about the platform during this task?
- Were there any areas of the platform that you found confusing or frustrating?
- Is there anything else you'd like to share about your experience with the task?

Rating scales:

- On a scale of 1 to 5, how satisfied were you with the task? (1=Very Unsatisfied, 5=Very Satisfied)
- On a scale of 1 to 5, how easy or difficult was the task? (1=Very Difficult, 5=Very Easy)
- On a scale of 1 to 5, how likely are you to use this feature in the future? (1=Very Unlikely, 5=Very Likely)

The second question is somewhat leading: it assumes that participants will find something helpful or useful in the platform, which may not be the case. But the others are more-or-less neutral and open-ended, and similar to ones I would use.

I’ve used variations on each of the rating scale questions in usability tests before, with perceived ease being the most valuable in my experience. Satisfaction ratings tend to track closely with ease ratings, so they seldom provide additional information. And asking participants to conjecture about future usage is unreliable, though following up on why they gave the rating can sometimes produce useful insights.

Task Follow-ups: C+

To wrap things up, I asked ChatGPT to “conclude the study with 4 or 5 questions for the participant. This should include some open-ended questions, and some rating scales intended to assess the overall experience.”

Open-ended questions:

- What did you like most about the platform or service?
- Were there any features or functionality that you found particularly useful or beneficial?
- What could be improved about the platform or service in your opinion?
- Is there anything else you'd like to share about your experience with the platform or service?
- Would you recommend this platform or service to a friend or colleague? Why or why not?

Rating scales:

- On a scale of 1 to 5, how satisfied were you with the platform or service overall? (1=Very Unsatisfied, 5=Very Satisfied)
- On a scale of 1 to 5, how likely are you to continue using this platform or service? (1=Very Unlikely, 5=Very Likely)

Again, question 1 assumes the participant liked something about the platform, but it could be made more neutral by rephrasing it as: “What, if anything…”

I’ve never asked an NPS-style question without the rating scale, but have found that the follow-up question often provides the most useful data. Overall CSAT is a common metric inserted at the end of studies, but the other rating scale once again asks participants to predict the future. Aside from that though, I have used versions of the other open-ended questions in studies.

Overall Study Questions: B

As a final test, I asked ChatGPT, “Are there any standard validated scales that are appropriate to include in this kind of usability study?”

Some examples of standard validated scales that may be appropriate to include in a usability study include:

- The System Usability Scale (SUS): This is a widely used scale that measures overall usability and user satisfaction. It consists of 10 items, each rated on a 5-point scale, with higher scores indicating higher usability and satisfaction.
- The Task-Specific Satisfaction (TSS) scale: This scale measures satisfaction with specific tasks or features within a product or service. It consists of 5 items, each rated on a 5-point scale, with higher scores indicating higher satisfaction.
- The Single Ease Question (SEQ): This is a single item scale that measures overall ease of use and satisfaction. It consists of a single question, rated on a 5-point scale, with higher scores indicating higher ease of use and satisfaction.
- The User Experience Questionnaire (UEQ): This scale measures various aspects of the user experience, including usability, aesthetics, and emotional engagement. It consists of 16 items, each rated on a 7-point scale, with higher scores indicating higher user experience.

Interesting! Both the SUS and SEQ are common and appropriate measures here, but I’d never heard of the other two. I was interested to learn about the UEQ, but so far as I can tell, ChatGPT made up the TSS from whole cloth.
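As an aside, ChatGPT’s description of the SUS glosses over how it’s actually scored: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the total is multiplied by 2.5 to give a 0–100 score. Here’s a quick sketch of that standard scoring; the example responses are invented.

```python
# Standard SUS scoring: ten items rated 1-5, with odd items worded
# positively and even items negatively. Example responses are made up.

def sus_score(responses: list[int]) -> float:
    """Compute a 0-100 System Usability Scale score from ten 1-5 ratings."""
    assert len(responses) == 10, "SUS has exactly ten items"
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

example = [4, 2, 5, 1, 4, 2, 4, 2, 5, 1]  # one hypothetical participant
print(sus_score(example))                 # 85.0
```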

UX Scales: F (for dishonesty)

Generative AI is rapidly disrupting how art is made and how code, blog posts, and student essays are written.

But is it ready to take on UX research? I asked ChatGPT to produce a usability test script to find out.

- The results were impressive. At the very least, ChatGPT produced serviceable first drafts of the screener, task scenarios, open-ended questions, follow-ups, rating scales, and metrics. When prompted, it made adjustments and corrections, like adding specific firms to the screener.
- But they were still imperfect, and at times way off. It recommended a non-existent UX metric, worded some things awkwardly or too directly, and didn’t always order questions in the most natural or coherent way. It also proposed tasks that might not be feasible without a prototype.
- ChatGPT never asked the question any human researcher would: what are the goals of the study? UX researchers weigh every decision about study design and execution against how the stakeholders will use the results.

Perhaps in the future, generative AI tools will become sophisticated enough to develop complete, polished test plans, which would give researchers more time to focus on other parts of their role. But as it stands, anything they produce needs to be carefully reviewed.

Nevertheless, all of the materials presented here were produced in less than 10 minutes. That speed and efficiency may make ChatGPT a useful tool for ideation and first drafts, so long as a human researcher is experienced enough to critically judge the output.

For now, it’s exciting to think of generative AI as a tool to help researchers do their work more efficiently. But I wouldn’t fear ChatGPT taking our jobs any time soon.
