
Tommie Experts: Using AI to Fight Human Bias at Work

Liz Melton is a Stanford alumna interested in AI, product-led growth and corporate social responsibility.

Grant Riewe is on faculty at the University of St. Thomas Opus College of Business.

Artificial intelligence has emerged as the focus of most new technological applications and promises to revolutionize how we interact with systems, data and other people. Many dream big of solutions that can make better decisions, see problems more clearly, and chart paths through noisy, unknown waters. These are, however, still dreams; AI that can solve broad problems on par with human cognition remains a long way off.

But narrowly defined problems (e.g., sales pattern insights) or problems with discrete boundaries (e.g., processing language or visual imagery) are well within the reach of AI. These are cognitive solutions in that we have trained systems to identify patterns as well as (or better than) humans, but they are not necessarily decision-makers beyond that point. Taking advantage of the inherent abilities of these solutions should be the obvious next step. What can machines natively do that humans struggle with?


What if natural language processing (NLP) could help identify bias?

No one can deny that humans’ ability to maintain intense focus is limited. While we know this to be true, we have done little to address how these limitations affect our work, particularly during live interactions like conference calls, chat-based collaborative work, or, most critically, performance evaluations.

Reviews are intended to be objective, but all humans experience bias. Many companies opt for group reviews as a way to de-bias decisions and challenge the status quo, but for that structure to be as effective as intended, participants have to pay attention to what is said in those meetings, how it is said, and the context for those remarks. At the same time, most people’s attention span is shorter than the review itself. Promotion also depends on what bosses remember about their direct reports, their subjective measure of employee success, and their ability to convince others that an employee’s accomplishments deserve a reward. As these factors compound, meta-bias patterns emerge in company culture.

Combine those limitations with the fact that reviews are often a breeding ground for subtle – and not so subtle – bias, and it begs the question: Why aren’t we using technology to help?

Where human attention fatigues, machine systems excel: computers do not tire. With developments in natural language processing (NLP) and conversational AI (CAI), computers can identify biased phrases in real time. These technologies have a long way to go to match human nuance, but even now we can at least flag problematic phrases during something as significant as a performance review. And with the right inputs, rooted in social science and normalized for geography, contextual relationships, and culture, we could surface insidious bias throughout organizations.

In the next several sections, we’ll take a look at how a future conversational AI tool could reduce bias and, eventually, teach people to reevaluate and reframe their thinking.


A hypothetical review process

Before examining our hypothetical example, we feel it’s important to define what we mean by “performance review.” When we refer to performance reviews, we are talking about a common review structure used by large consulting firms and high-functioning Fortune 500 companies. This method of review consists of several elements:

  • A panel of peer reviewers who present cases for their direct reports’ promotion. Each presenter only gets a few minutes to speak to each case. Peers then ask clarifying questions to understand the case, before determining outcomes (performance rating, promotion, bonus/compensation, etc.)
  • A moderator to chair the room and guide the process structure
  • Designated bias monitors who challenge based on context, language and other factors
  • Long sessions in which panels hear dozens of cases over many hours and subsequently experience attention fatigue

These review processes are designed to encourage de-biased conversation and equalize outcomes, but with so many opportunities to inject biased commentary, most review cycles still end up having some unfair results.


What could this technology look like?

So, how would this work? We envision applying NLP and natural language understanding (NLU) to monitor and alert within conversations. Examples of similar services already exist – take Alexa, Siri, and Cogito – so we know how to monitor and engage with conversations. The difference is that this technology would actively detect comments or statements that imply bias.

Many organizations have already done significant pre-work during anti-bias training, identifying biased phrases that are specific to company culture and contextually relevant to the reviewer/reviewee relationship. To apply this pre-work, we’d need to understand some demographic information about the reviewer and reviewee: age, race and gender are strong contextual indicators. With an understanding of company-specific problematic phrases, who is presenting, and who is being presented about, we have all the data elements needed to build a real-time alert system.
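
To make this concrete, below is a minimal sketch (in Python) of what such an alert system could look like. It assumes a small, hand-built lexicon of company-specific phrases – the two entries reuse phrases from the example later in this article – and simple pattern matching; a real system would rely on trained NLP/NLU models and a far richer treatment of reviewer/reviewee context. All names and fields here are hypothetical.

```python
# Minimal, hypothetical sketch of a real-time bias alert system.
# A production system would use NLP/NLU models, not regex matching.
import re
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    role: str     # "reviewer" or "reviewee"
    gender: str
    race: str


@dataclass
class Alert:
    phrase: str   # canonical problematic phrase
    reason: str   # why it is flagged
    context: str  # the social-science context behind the flag


# Each lexicon entry: a trigger pattern, the alert it raises, and a simple
# check on reviewee demographics standing in for "contextual relevance."
LEXICON = [
    (re.compile(r"\b(starting to show|showing) more independence\b", re.I),
     Alert("starting to show more independence", "confirmation bias",
           "Women and minorities are commonly asked to 'prove it again' "
           "where other colleagues are not."),
     lambda reviewee: reviewee.gender == "female" or reviewee.race != "white"),
    (re.compile(r"\b(speak|spoke|speaking) up\b", re.I),
     Alert("speak up more", "vague statement",
           "Speaking is not the sole measure of participation; individuals "
           "have different participation styles."),
     lambda reviewee: True),
]


def flag_bias(utterance: str, reviewee: Participant) -> list[Alert]:
    """Return alerts triggered by one utterance in a review conversation."""
    return [alert for pattern, alert, applies in LEXICON
            if pattern.search(utterance) and applies(reviewee)]


if __name__ == "__main__":
    tiffany = Participant("Tiffany", "reviewee", "female", "Black")
    comment = ("Tiffany has done well this year, she's starting to show more "
               "independence with clients, but it would be great if she spoke "
               "up in the team room more.")
    for alert in flag_bias(comment, tiffany):
        print(f'Potential problematic phrase: "{alert.phrase}". '
              f"Reason for alert: {alert.reason}. Context: {alert.context}")
```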

In a performance review setting, as we will see in an example below, the system would flag problematic phrases as they are said and the committee chair would stop the conversation. The committee would then evaluate the comment, ask the presenter for further information, and only continue once there is sufficient clarity. Once the discussion concludes, the review cycle would continue until another phrase is identified. The system remains persistently aware throughout every conversation and highlights potential bias for everyone to learn from.

To be clear, this bias detection system isn’t independently deciding what statements are biased. It’s simply looking for what we tell it to look for. In other words, this system is an assistant, not a decider. Only humans can effectively balance statement, individual and culture to understand if a phrase was truly biased; there is too much nuance in culture and relationships for a machine to take into account.

We would expect biased and problematic phrases to shift over time, so companies will have to update the solution with new terms and phrases year over year. While this sounds tedious, it can function as a productive exercise to engage with indicators of bias, data on outcomes, and how to constructively engage with each other.

Bias detection in action

Let’s run through a performance review example to show how a bias detection tool might work. The example below is modeled on a generic process in which reviewers present reviewee annual performance to a committee of peers and senior leaders as described earlier.

Key participants:

  • Sundar (partner, male, naturalized Indian American, lives in New Jersey), reviewer
  • Tiffany (associate, female, black, American, lives in California), reviewed by Sundar
  • Preston (associate, male, white, American, lives in New York), reviewed by Sundar

Sundar: “Tiffany has done well this year, she’s starting to show more independence with clients, but it would be great if she spoke up in the team room more.”

Committee member: “What specific feedback do you have for her on speaking up? Does she not represent her workstream?”

Sundar: “Well, no, she is always on top of her workstream. But she doesn’t contribute to group discussions like my other reviewee on the team, Preston, does.”

Committee chair: “Comparative comments are not necessarily evaluative. Where is the gap in her participation?”

Sundar: “I think she just needs another cycle or two to develop more confidence.”

Although this is a short exchange, there are several issues with how Tiffany is represented in her review. Below, we’ve highlighted which statements could be problematic and why.

Sundar: “Tiffany has done well this year, she’s starting to show more independence with clients, but it would be great if she spoke up in the team room more.”

There are two points of contention in Sundar’s comment. The first is the use of the phrase “starting to show more independence.”

It’s not abundantly clear what this means – is Tiffany running meetings on her own? Is she forming relationships with clients individually? There is an expectation that is assumed to be understood but never explicitly stated, so the comment becomes vague.

Let’s assume there is a cultural understanding within the company of the expectation here. Even so, the comment is still problematic based on what we know about how minority women are asked to prove themselves repeatedly before advancement. Both facets require further clarification and specificity from Sundar.

The second problem is related to “speaking up.”

We don’t have much context around why Tiffany may not be speaking up, what she’s not speaking up about, or how the project itself is going. Perhaps Tiffany has a reason for staying quiet, or someone else is stealing her thunder whenever she tries to speak. We also know that it is common in consulting cultures for a brash, dominant, aggressive style to be considered the norm, ignoring other effective communication styles. Additionally, comments that someone’s participation falls outside that norm correlate strongly with gender and race differences. As such, it is unclear whether Sundar is providing truly evaluative feedback or simply describing Tiffany.

Committee member: “What specific feedback do you have for her on speaking up? Does she not represent her workstream?”

While the committee member is doing a good job asking for feedback, they overlook the “independence” phrase, so the committee misses an opportunity to engage and clarify.

Sundar: “Well, no, she is always on top of her workstream. But she doesn’t contribute to group discussions like my other reviewee on the team, Preston, does.”

At this point, Sundar is not actually answering the committee member’s question. We don’t get any more specificity about Tiffany’s speaking up, and Sundar compares her to a white male associate. Comparative evaluation can be beneficial, but in this case, Sundar has not expressed any elements of that comparison that the committee can evaluate. In essence, he is excluding the committee from participating in the evaluation process.

Committee chair: “Comparative comments are not necessarily evaluative. Where is the gap in her participation?”

Sundar: “I think she just needs another cycle or two to develop more confidence.”

Once more, Sundar doesn’t give a concrete justification for why Tiffany needs another cycle before being promoted. This form of bias is common against minority women, who have to prove their worth over and over again to earn the same performance review results as their white male colleagues. He continues to exclude the committee from the evaluation of Tiffany. As a consequence, Tiffany’s evaluation has the appearance of objectivity, but there is insufficient awareness of, and pressure on, Sundar to provide evaluative evidence for his commentary.

In this example, the committee chair still has sufficient awareness to challenge and inquire. However, Sundar’s responses exclude the committee from participating in Tiffany’s evaluation. Additionally, only the committee chair is engaged in the challenge function – the rest of the committee does not demonstrate any awareness of the problematic phrasing. All of this is to the detriment of Tiffany, whose results presumably suffer.

Now, let’s examine how this conversation might go with a bias detection tool.

Sundar: “Tiffany has done well this year, she’s starting to show more independence with clients, but it would be great if she spoke up in the team room more.”

Bias detection tool: Potential problematic phrase: “starting to show more independence.” Reason for alert: confirmation bias. Context: Women and minorities are commonly asked to prove it again where other colleagues are not.

Bias detection tool: Potential problematic phrase: “speak up more.” Reason for alert: Vague statement. Context: Speaking is not the sole measure of participation; individuals have different participation styles.

Committee chair: “OK, pause. Two problematic statements have been identified. Before we continue, we need to clarify and understand what you are saying that is truly evaluative. Sundar, can you further explain your statement on ‘starting to show more independence’ and provide specific examples of what Tiffany was asked to do and how she has done it? When you have done that, we can then discuss more precisely the expectations and demonstrated performance in the team room.”

Sundar: “OK, great flags, I was not precise in my description. Let me do better.”

The bias detection tool has identified the two problematic statements and contextualized them. Ideally, this information would be made available to the whole committee. The intention is not to shame Sundar, but to create awareness that there are unclear statements, explain why that lack of clarity is relevant, and give the committee the opportunity to determine whether they need to intercede and discuss. The tool simply informs based on the context of its training – it neither decides outcomes nor evaluates presenters – it merely provides a consistent and unwearying alert function.
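
As a rough illustration of that “assistant, not decider” boundary, here is a hypothetical continuation of the earlier sketch: the tool only appends alerts to a shared session record and notifies the committee, while outcome fields such as the rating are filled in exclusively by humans. The ReviewRecord structure and handle_utterance function are invented for illustration.

```python
# Hypothetical continuation of the earlier sketch: the tool records and
# broadcasts alerts but never touches outcome fields.
from dataclasses import dataclass, field


@dataclass
class ReviewRecord:
    reviewee: str
    alerts: list = field(default_factory=list)  # written by the tool
    chair_pauses: int = 0                       # human actions, merely logged
    rating: str = ""                            # decided and entered by humans only


def handle_utterance(record: ReviewRecord, speaker: str, utterance: str,
                     reviewee: Participant) -> None:
    """Flag an utterance and notify the whole committee; never score anyone."""
    for alert in flag_bias(utterance, reviewee):  # flag_bias from the earlier sketch
        record.alerts.append((speaker, alert))
        print(f'[to all committee members] "{alert.phrase}" flagged '
              f"({alert.reason}). {alert.context}")
```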

Over the course of many reviews in a session (sessions can last 6–8 hours and cover dozens of reviewees), attention and the ability to intercede will wane as fatigue sets in. Humans are simply not able to remain continuously alert for this long. This is where technology can help us identify intercession points, because it does not fatigue.

Other applications of bias detection tools

You might be asking yourself if this tool is good for anything other than pointing out biased phrases. Although that’s our primary reason for introducing this system, that’s not all it can do. Remember, this NLP/CAI combination is tracking every conversation – who is speaking, what they said, how others reacted, and so on. This data, therefore, is an incredible foundation for companywide analytics.

With a bias detection tool, you could see who does most of the talking in performance review meetings. You could understand who challenges biased phrases and who doesn’t. You could even determine whether certain groups of people are misrepresented in reviews more frequently than others. All this information gives us a fundamentally new picture of what’s happening in conversation, allowing us to engage not just based on performance review outcomes, but on the execution of each performance review itself.
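
As a hint of what that analytics layer could look like, here is a small hypothetical sketch that aggregates a flat event log of who spoke, whose comments were flagged, and who challenged a flag. The log format and the example rows are invented placeholders.

```python
# Hypothetical companywide analytics over bias-detection session logs.
from collections import Counter

# Each event: (session_id, speaker, utterance_was_flagged, speaker_challenged_a_flag)
events = [
    ("2024-Q4-A", "Sundar", True,  False),
    ("2024-Q4-A", "Chair",  False, True),
    ("2024-Q4-A", "Sundar", True,  False),
    ("2024-Q4-B", "Priya",  False, False),
]

talk_share = Counter(speaker for _, speaker, _, _ in events)  # who does the talking
flagged = Counter(speaker for _, speaker, was_flagged, _ in events if was_flagged)
challenges = Counter(speaker for _, speaker, _, challenged in events if challenged)

print("Utterances per speaker:", talk_share.most_common())
print("Flagged comments per speaker:", flagged.most_common())
print("Challenges raised per speaker:", challenges.most_common())
```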

Perhaps you’re curious about external factors affecting performance reviews. Maybe over the course of the day, people’s attention spans wane and cause more biased phrases to crop up. Or, maybe after proper bias training the total number of identified phrases goes down. Out of the box, this tool lays the groundwork for HR-related metrics that individuals (and the company as a whole) can improve over time.
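
A brief sketch of what those out-of-the-box metrics might look like, using invented placeholder numbers: flags per hour of a session as a fatigue signal, and total flags before and after bias training.

```python
# Hypothetical HR metrics derived from bias-detection logs (placeholder data).
flags_per_session_hour = {1: 2, 2: 3, 3: 5, 4: 6, 5: 9, 6: 11}  # fatigue shows up late in the session
print("Flags per session hour:", flags_per_session_hour)

flags_by_cycle = {"pre-training 2023": 41, "post-training 2024": 23}
reduction = 1 - flags_by_cycle["post-training 2024"] / flags_by_cycle["pre-training 2023"]
print(f"Reduction in flagged phrases after training: {reduction:.0%}")
```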


Now what?

The creation of indefatigable assistants is a logical next step in the evolution of AI. The proliferation of customer service chatbots is a similar development, though we envision engagement strategies beyond the simple inquiry-and-response pattern that is common today. We view this sort of intercession as necessary to shift behaviors within organizations. While there are post hoc solutions, and consulting organizations that perform this sort of analysis, the technology exists to make this a real-time, and consequently more valuable, tool.

With respect to our proposed bias detection and monitoring solution, a set of evolutions will be necessary for it to remain relevant within an organization. The first will be extending the solution to capture new expressions or behaviors that emerge as a consequence of what is detected. Organizations will need to build the capacity to evolve the solution first, and then quickly pursue upstream intercession, coaching and engagement.

Eventually, we would want to step back from maintaining the solution and let it maintain and grow itself. This requires grappling with ethical questions and questions of organizational purpose, and creating challenge pathways to ensure that the solution does not stifle or overly dictate human behavior.

Science fiction has taught us to be wary of AI, creating a mythos of machines gone evil that has evolved into an underlying anxiety about releasing control. In all of these stories, the evil or rogue element is rooted in a lack of deep intentionality and thoughtfulness in the creation of the machine. When we do align the goals and purpose of the machine organization with those of the human organization, we create benevolent assistants that allow us to challenge ourselves and grow in the most desirable ways.
