Categorization profiles

A categorization profile contains categories and subcategories for the Conversation Analyzer feature. Conversation Analyzer uses the profile to categorize transcripts of call recordings. The profile also contains any substitution and redaction rules you provide. Using the substitution and redaction rules, Conversation Analyzer refines the transcribed text.

The categorization profile applies to the associated account. For information about where you can view the categorized recordings and refined transcripts, see Listening to and commenting on a call recording.

Categorization profiles are written in JavaScript Object Notation (JSON). For information about JSON, see https://www.json.org/.

Categorization profile structure

A categorization profile consists of the following top-level elements:

- name (a name/value pair)
- language (a name/value pair)
- categories (an array of category objects).
  Each category consists of the following:
  - name (a name/value pair)
  - rules (an array of one or more categorization rule objects).
    Each categorization rule object consists of the following:
    - party (a name/value pair)
    - expression (a name/value pair)
  - subcategories (an array of one or more subcategory objects).
    Each subcategory consists of the following:
    - name (a name/value pair)
    - rules (an array of one or more categorization rule objects).
      Each categorization rule object consists of the following:
      - party (a name/value pair)
      - expression (a name/value pair)
  For more information about categorization rules, see Categorization rules.
- substitution (an array of substitution rule objects).
  Each substitution rule consists of the following:
  - party (a name/value pair)
  - find (a name/value pair)
  - replace (a name/value pair)
  For more information about substitution rules, see Substitution and redaction rules.
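
As a rough illustration of the structure above, the following Python sketch parses a profile and reports any of the listed elements that are missing. The function names (`check_profile`, `check_category`, `check_rules`) and the wording of the messages are hypothetical and not part of Conversation Analyzer. Treating all four top-level elements as required is also an assumption.

```python
import json


def check_rules(rules, keys, where):
    """Return a list of problems found in a list of rule objects."""
    problems = []
    for number, rule in enumerate(rules, start=1):
        for key in keys:
            if key not in rule:
                problems.append(f"{where}: rule {number} is missing '{key}'")
    return problems


def check_category(category, where):
    """Check one category or subcategory, then recurse into its subcategories."""
    problems = []
    if "name" not in category:
        problems.append(f"{where}: category is missing 'name'")
    here = f"{where} > {category.get('name', '?')}"
    problems += check_rules(category.get("rules", []), ("party", "expression"), here)
    for sub in category.get("subcategories", []):
        problems += check_category(sub, here)
    return problems


def check_profile(text):
    """Parse profile JSON and list structural problems (empty list: none found)."""
    profile = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    problems = []
    # Assumption: all four top-level elements must be present.
    for key in ("name", "language", "categories", "substitution"):
        if key not in profile:
            problems.append(f"missing top-level '{key}'")
    for category in profile.get("categories", []):
        problems += check_category(category, "categories")
    problems += check_rules(
        profile.get("substitution", []), ("party", "find", "replace"), "substitution"
    )
    return problems
```

Because `json.loads` rejects malformed JSON, a check like this also catches syntax slips such as a missing comma between the categories and substitution arrays.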

Analyzing transcripts

Conversation Analyzer analyzes transcripts in several steps:

1. Conversation Analyzer identifies characters in the transcripts. Characters are either word or non-word characters. Characters from the Unicode categories (see expression and find value validation), plus apostrophes, are word characters. Other characters are non-word characters and act as word separators. Non-word characters include !, £, $, %, ^, &, *, (, ), and -.
2. Conversation Analyzer uses findings from step 1 to identify the individual words in the transcripts.
3. Conversation Analyzer looks for words in the transcripts that match the rules in the categorization profile:
   1. Conversation Analyzer applies substitution rules first, replacing text if found.
   2. Conversation Analyzer tags the processed transcripts with the corresponding categories if found.
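
The first two steps can be pictured with a short Python sketch. It is an approximation only: the function name `split_words` is hypothetical, and using `str.isalnum()` as a stand-in for the permitted Unicode categories is an assumption, because the exact categories are not reproduced here.

```python
def split_words(text):
    """Split transcript text into words on non-word characters."""
    words, current = [], []
    for ch in text:
        # Assumption: letters and digits approximate the word-character
        # Unicode categories; apostrophes also count as word characters.
        if ch.isalnum() or ch == "'":
            current.append(ch)
        elif current:
            # Any other character acts as a word separator.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words
```

With this sketch, a hyphen separates words in the same way as a space, so "123-456" yields the two words "123" and "456".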

`name` and `language`

Conversation Analyzer uses a call's Language and ConversationAnalyzerProfile data source values to identify the categorization profile to use to categorize and refine the call recording. Both language and name need to match the Language and ConversationAnalyzerProfile data source values to identify the profile. For information about how the Language and ConversationAnalyzerProfile data sources get their values, see Overview of Conversation Analyzer.
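
The lookup can be sketched in Python as follows. The function `select_profile` and its parameters are hypothetical, and comparing the values as exact strings is an assumption; the product may normalize case or whitespace differently.

```python
def select_profile(profiles, language, profile_name):
    """Return the first profile whose language and name both match, else None."""
    for profile in profiles:
        # Both values must match; matching on only one is not enough.
        if profile["language"] == language and profile["name"] == profile_name:
            return profile
    return None
```

Here `language` and `profile_name` stand for the call's Language and ConversationAnalyzerProfile data source values.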

Categorization rules

As part of transcribing recordings, Conversation Analyzer categorizes the textual contents of the transcript, by identifying specific words and phrases that correspond to defined categories. A category is a collection of rules, with each rule consisting of a word or phrase and the party who said that word or phrase. If the transcript contains the word or phrase and was spoken by the specified party, the transcript matches the category.

For example, you may want to track how polite your agents are when speaking with customers. Create a category of 'Politeness' that looks for phrases such as 'Please', 'Thank you' and 'You're welcome'. You may also want to ensure that agents are promoting a new product or service. To do so, create a category that identifies instances of the agent saying the name of the product or service.

Conversation Analyzer applies categorization rules to processed transcripts—text that Conversation Analyzer has applied substitution rules to—rather than the original text. Keep this in mind when you create your categories.

Example categorization profile (category rules only)

In the following example, the categorization profile—Cat_example—contains one category—Cat details. Cat details contains two rules—one rule for each party—and one subcategory—Cat position. Cat position also contains two rules—one rule for each party. The substitution array contains no rules. If more than one rule applies to some text in the transcript, that text will appear in multiple categories.

{
    "name": "Cat_example",
    "language": "en-us",
    "categories": [
        {
            "name": "Cat details",
            "rules": [
                {
                    "party": "customer",
                    "expression": "cat is ## years old"
                },
                {
                    "party": "agent",
                    "expression": "your cat?"
                }
            ],
            "subcategories": [
                {
                    "name": "Cat position",
                    "rules": [
                        {
                            "party": "customer",
                            "expression": "cat * sat mat ~2"
                        },
                        {
                            "party": "agent",
                            "expression": "cat mat ~3"
                        }
                    ],
                    "subcategories": []
                }
            ]
        }
    ],
    "substitution": []
}

The following sections describe the party and expression name/value pairs.

`party`

party defines the party who must say the word or phrase defined by the rule expression for the transcript to match the category. Party can be customer or agent.

The format of party is "party": "value" where value can be:

customer
agent

`expression`

The expression name/value pair in a rule defines the text that must appear in the transcript to match the category.

The categorization expression language describes the format of an expression. The language supports simple expressions where the presence of the exact word or phrase would result in a match. For information about the categorization expression language, see Categorization expression language.

Substitution and redaction rules

Along with applying categorization rules to a conversation transcript, Conversation Analyzer applies substitution and redaction rules to refine the output:

- Substitution rules replace commonly mis-transcribed words and improve the spelling of words. You will most likely require these rules for proper nouns, such as place, company or product names. For example, Conversation Analyzer may transcribe 'Basingstoke' as 'Beijing spoke', or 'NewVoiceMedia' as 'new voice media'. Create rules that replace the incorrect word or words.
- Redaction rules replace sensitive information such as credit card details. Redaction rules are a specific type of substitution rule: instead of using them to refine and clarify phrases in the transcript output, you use them to obscure the content. Use a redaction rule to replace specified text with text such as '(redacted)', '(removed)', or 'xxxxxxxxxxxxxx'.

Example categorization profile (substitution and redaction rules only)

In the following example, the categorization profile—Subs_example—contains three substitution rules. The categories array contains no rules.

{
    "name": "Subs_example",
    "language": "en-us",
    "categories": [ ],
    "substitution": [
        {
            "party": "agent",
            "find": "new voice media",
            "replace": "NewVoiceMedia"
        },
        {
            "party": "customer",
            "find": "Beijing spoke",
            "replace": "Basingstoke"
        },
        {
            "party": "customer",
            "find": "my card number is *",
            "replace": "xxxx xxxx xxxx xxxx"
        }
    ]
}

The following sections describe the find and replace name/value pairs. For information about the party name/value pair, see party.

`find`

The find name/value pair in a rule defines the text that must appear in the transcript to match the substitution rule.

The categorization expression language describes the format of the value in the find name/value pair. The language supports simple values where the presence of the exact word or phrase would result in a match. For information about the categorization expression language, see Categorization expression language.

`replace`

The replace name/value pair in a rule defines the text that will replace the found text.

Applying substitution and redaction rules results in Conversation Analyzer modifying transcript text. Because of this, you must take extra care when writing your rules. For more information about substitution rules, see Substitution and redaction rules continued.

Categorization expression language

The categorization expression language describes the required format of the values you provide in the expression and find name/value pairs. Conversation Analyzer can then use these values to locate matching text in the transcripts.

Use the categorization expression language to define the categorization, substitution and redaction rules.

`expression` and `find` value validation

Valid expression and find values contain only alphanumeric, apostrophe and space characters; that is, values can contain spaces (U+0020), apostrophes (U+0027), and characters from the following Unicode categories:

Values can be no more than 100 characters long.

Wildcards in values

The categorization expression language supports the following wildcards within the values. Examples refer to the expression name/value pair, but exactly the same rules apply to find name/value pairs.

Wildcard	Description	Example expressions	Details
`?`	Wildcard representing one character		Each `?` represents one character.
		`wh?`	The following words will match the example expression: "who" and "why". For an example of an expression using the `?` wildcard, see Example 2. Expression using the ? character wildcard.
		`wh??`	The following words will match the example expression: "what", "when", "whom". For an example of an expression using the `??` wildcard, see Example 5. Expression using the ?? wildcard.
`*`	Wildcard representing zero to many characters	`sit*`	The following words will match the example expression: "sit", "sits", "sitting". For an example of an expression using the `*` wildcard, see Example 3. Expression using the * character wildcard. To use `*` to represent a character or characters, ensure that the `*` is contiguous with the characters in the containing word. You can also use `*` to represent a word or words. For information, see Wildcard representing zero to many words.
`#`	Wildcard representing one numeric character	`###`	Only digits will match the example expression, not text. Text containing "123" will match the example expression but text containing "one two three" will not. For an example of an expression using the `#` wildcard, see Example 4. Expression using the # character wildcard.
`*`	Wildcard representing zero to many words	`cat * mat`	The following phrases will match the example expression: "cat mat", "cat sits on the mat", and "cat always sits happily on the mat". For an example of an expression using the `*` word wildcard, see Example 6. Expression using the * word wildcard. To use `*` to represent a word or words, type a space between the `*` and any other characters in the expression. You can also use `*` to represent a character or characters. For information, see Wildcard representing zero to many characters.
`~N`	Represents the number of words that can appear between the specified words in a phrase	`cat mat ~4`	A phrase that contains N or fewer words between the specified words will match the example expression. The following phrases will match the example expression: "cat mat", "cat sits on the mat", and "cat always sits on the mat". For an example of an expression using the `~N` wildcard, see Example 7. Expression using the ~N wildcard. If the expression contains more than two words, `~N` applies to the number of words between each pair of consecutive specified words. For an example of an expression using the `~N` wildcard with more than two words, see Example 8. Expression using the ~N wildcard.
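
One way to picture the character wildcards is as a translation of each word in an expression into a regular expression. The Python sketch below is illustrative only and is not how Conversation Analyzer is known to work. The names `word_pattern` and `word_matches` are hypothetical, and three behaviours are assumptions: matching ignores case, `?` accepts any word character (including a digit), and `#` accepts only the digits 0 to 9.

```python
import re

# A word character: any Unicode letter or digit, or an apostrophe.
WORD_CHAR = r"(?:[^\W_]|')"

# Regular-expression equivalent of each character wildcard.
WILDCARDS = {
    "?": WORD_CHAR,        # exactly one character
    "*": WORD_CHAR + "*",  # zero to many characters
    "#": "[0-9]",          # exactly one numeric character
}


def word_pattern(token):
    """Compile one word of an expression into a regular expression."""
    regex = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in token)
    return re.compile(regex, re.IGNORECASE)  # assumption: case is ignored


def word_matches(token, word):
    """Return True if the whole transcript word matches the expression word."""
    return word_pattern(token).fullmatch(word) is not None
```

Under these assumptions, `word_matches("wh??", "what")` is true and `word_matches("wh??", "who")` is false, in line with the table above.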

`expression` examples

Example 1. Simple expression

"expression": "the cat sat"

With a simple expression, only the exact word or phrase will satisfy the rule.

Example 2. Expression using the `?` character wildcard

"expression": "the cat? sat"

The ? in the expression represents a single character that must appear after "cat" but before "sat" in matching text.

Text	Does it match?	Explanation
the cat sat	No	The `?` in the expression requires a character in its place.
the cats sat	Yes	The `?` in the expression represents the "s" in the text.
their cats sat	No	The expression does not allow any additional characters after "the".

Example 3. Expression using the `*` character wildcard

"expression": "sit*"

The * in the expression represents zero to many characters that can appear after "sit" in matching text.

Text	Does it match?	Explanation
sit	Yes	The `*` in the expression requires zero to many characters in its place.
sits	Yes	The `*` in the expression represents the "s" in the text.
sitting	Yes	The `*` in the expression represents the "ting" in the text.
sat	No	The expression requires that "sit" appears in the text.

Example 4. Expression using the `#` character wildcard

"expression": "### ###"

Matching text must contain two sets of three digits, separated by a non-word character and no other characters.

Text	Does it match?	Explanation
123 456	Yes	The expression matches two sets of three digits.
123-456	Yes	The expression matches two sets of three digits. The hyphen is a non-word character and separates the two sets of three digits.
123456	No	The expression requires two sets of three digits, not one set of six.
123 abc 456	No	The expression requires two consecutive sets of three digits, not two sets separated by any other characters.

Example 5. Expression using the `??` wildcard

"expression": "wh?? cat"

The ?? in the expression represents two characters that must appear after "wh" and before "cat" in matching text.

Text	Does it match?	Why
what cat	Yes	The `??` in the expression represents the "at" in the text.
when cat	Yes	The `??` in the expression represents the "en" in the text.
who cat	No	The `??` in the expression requires two characters after "wh" not one.
which cat	No	The `??` in the expression only represents two characters after "wh" not three.

Example 6. Expression using the * word wildcard

"expression": "the cat sits * on the mat"

The text must contain the phrase "the cat sits on the mat" with zero to many words between "sits" and "on".

Text	Does it match?	Why
the cat sits on the mat	Yes	The `*` in the expression requires zero to many words in its place.
the cat sits happily on the mat	Yes	The `*` in the expression represents "happily" in the text.
the cat always sits on the mat	No	The `*` in the expression appears after "sits", not before.

Example 7. Expression using the ~N wildcard

"expression": "cat mat ~3"

The text must contain the words "cat" and "mat" with up to three words between them.

Text	Does it match?	Why
the cat mat	Yes	The text contains no words between "cat" and "mat" and the expression allows up to three.
the cat likes mat	Yes	The text contains one word between "cat" and "mat", and the expression allows up to three.
the cat sits on the mat	Yes	The text contains three words between "cat" and "mat", and the expression allows up to three.
the cat always sits happily on the mat	No	The text contains five words between "cat" and "mat", but the expression only allows up to three.

Example 8. Expression using the ~N wildcard

"expression": "cat sat mat ~3"

The text must contain the words "cat", "sat" and "mat" with up to three words between each of them. In this example, matching text may contain three words between "cat" and "sat" and also three words between "sat" and "mat".

Text	Does it match?	Why
the cat eagerly sat on the mat	Yes	The text contains one word between "cat" and "sat", and two words between "sat" and "mat"; the expression allows up to three.
the cat eagerly and promptly sat on the green mat	Yes	The text contains three words between "cat" and "sat", and three words between "sat" and "mat"; the expression allows up to three.
the cat sat on the green and blue mat	No	The text contains too many words (five) between "sat" and "mat".

Example 9. Expression using the ~N and * word wildcards

"expression": "cat * sat mat ~2"

Even when used with a ~N wildcard in an expression, a * word wildcard can represent any number of words. In this example, matching text can contain any number of words between "cat" and "sat", but a maximum of two words between "sat" and "mat".

Text	Does it match?	Why
the cat sat on the mat	Yes	The text contains no words between "cat" and "sat", and two words between "sat" and "mat".
the cat waited calmly whilst the mouse ran around and then sat on the mat	Yes	The text contains nine words between "cat" and "sat", and two words between "sat" and "mat".
the cat always sat on the green mat	No	The text contains too many words (three) between "sat" and "mat".
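
The behaviour shown in Examples 1 to 9 can be approximated with the following Python sketch. It is a simplified model written to reproduce the tables above, not the product's implementation. All function names are hypothetical, and case-insensitive matching is an assumption. A trailing `*` word wildcard is ignored by this sketch.

```python
import re

WORD_CHAR = r"(?:[^\W_]|')"  # letters, digits and apostrophes
WILDCARDS = {"?": WORD_CHAR, "*": WORD_CHAR + "*", "#": "[0-9]"}


def split_words(text):
    """Split text into words; every other character is a separator."""
    return re.findall(WORD_CHAR + "+", text)


def word_matches(token, word):
    """Match one expression word, with character wildcards, against one word."""
    regex = "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in token)
    return re.fullmatch(regex, word, re.IGNORECASE) is not None


def parse(expression):
    """Return (token, max_gap) pairs; max_gap None means any number of words."""
    tokens = expression.split()
    proximity = 0  # without ~N, specified words must be adjacent
    if tokens and re.fullmatch(r"~\d+", tokens[-1]):
        proximity = int(tokens.pop()[1:])
    steps, gap = [], proximity
    for token in tokens:
        if token == "*":
            gap = None  # a free-standing * allows any number of words
        else:
            steps.append((token, gap))
            gap = proximity
    return steps


def matches(expression, text):
    """Return True if the expression matches anywhere in the text."""
    words, steps = split_words(text), parse(expression)
    if not steps:
        return False

    def match_from(step, pos):
        # pos is the index of the word that follows the previous match.
        if step == len(steps):
            return True
        token, gap = steps[step]
        last = len(words) - 1 if gap is None else min(pos + gap, len(words) - 1)
        return any(
            word_matches(token, words[i]) and match_from(step + 1, i + 1)
            for i in range(pos, last + 1)
        )

    # The first specified word may appear anywhere in the transcript.
    return any(
        word_matches(steps[0][0], words[i]) and match_from(1, i + 1)
        for i in range(len(words))
    )
```

For instance, `matches("cat * sat mat ~2", "the cat always sat on the green mat")` returns `False`, because three words separate "sat" and "mat".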

Substitution and redaction rules continued

Overlapping substitution and redaction rules

Overlapping occurs when more than one rule matches the same transcript text. Because substitution and redaction rules actually modify the transcript text, overlapping rules can cause a conflict whereby multiple rules try to replace text with different values. To handle overlapping, Conversation Analyzer uses the following logic when applying the rules:

- The order of the rules in the profile determines their priority; the first rule has the highest priority.
- If rules overlap, the higher priority rule takes precedence over the lower priority rule. The lower priority rule is discarded.
- A discarded rule does not block any other lower priority rules.
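
The logic above can be sketched in Python as follows. This is a simplified model, not the product's implementation. The function `apply_rules` is hypothetical, and it assumes that the matches have already been found and are supplied as tuples of rule number, start word index, end word index (exclusive), and replacement text.

```python
def apply_rules(words, candidates):
    """Resolve overlapping matches by rule priority, then apply replacements.

    candidates: (rule_number, start, end, replacement) tuples, where start
    and end are word indexes (end exclusive) and rule 1 has highest priority.
    """
    accepted = []
    for rule, start, end, replacement in sorted(candidates, key=lambda c: c[0]):
        # A match is discarded only if it overlaps an already accepted match;
        # a discarded match never blocks lower priority matches.
        overlaps = any(
            start < a_end and a_start < end for _, a_start, a_end, _ in accepted
        )
        if not overlaps:
            accepted.append((rule, start, end, replacement))

    result = list(words)
    # Replace from the end of the text so earlier word indexes stay valid.
    for _, start, end, replacement in sorted(
        accepted, key=lambda c: c[1], reverse=True
    ):
        result[start:end] = [replacement]
    return " ".join(result)
```

Supplying the three matches from Example 4 below produces "I look after a hound": rule 2 is discarded because it overlaps rule 1, and rule 3 is still applied because it overlaps only the discarded rule.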

Examples of overlapping rules

In all the examples, party has been removed for simplicity.

Example 1. We want to replace "credit card" with "payment method" and remove the credit card number.

Transcription text:

"My credit card is 1234567890123456"

Substitution rules:

Rule 1:

"find": "credit card","replace": "payment method"

Rule 2:

"find": "credit card #* ~5","replace": "(credit card information redacted)"

Intended text:

"My (credit card information redacted)"

Processed text:

"My payment method is 1234567890123456"

Why:

Rules 1 and 2 overlap. In this scenario, Conversation Analyzer applies rule 1, because rule 1 has higher priority, and discards rule 2. The result is that the credit card number is still exposed.

Solution:

Write your redaction rules first, followed by your substitution rules.

Example 2. We want to remove all strings of three or more numbers because they can contain sensitive information. However, we want to label PIN numbers differently to credit card numbers.

Transcription text:

"My PIN is 1234"

Substitution rules:

Rule 1:

"find": "###*","replace": "(redacted)"

Rule 2:

"find": "credit card ################ ~5","replace": "(credit card has been redacted)"

Rule 3:

"find": "PIN #### ~5","replace": "(PIN has been redacted)"

Intended text:

"My (PIN has been redacted)"

Processed text:

"My PIN is (redacted)"

Why:

Rules 1 and 3 overlap. In this scenario, Conversation Analyzer applies rule 1—because rule 1 has higher priority—and discards rule 3. The result is that instead of applying the more specific rule "(PIN has been redacted)", we applied the more general one.

Solution:

Write more specific rules first, followed by more general—catch-all—rules later.

Example 3. Due to the highly sensitive nature of passwords, we want to remove user account names, and wipe out the whole text containing password.

Transcription text:

"My account name is administrator and my password is Jupiter, with upper case J"

Substitution rules:

Rule 1:

"find": "account name is * ","replace": "(account name redacted)"

Rule 2:

"find": "* password *","replace": "(password redacted)"

Intended text:

"My (account name redacted) and (password redacted)"

Processed text:

"My (account name redacted)"

Why:

In this scenario, Conversation Analyzer applies rule 1, because rule 1 has higher priority than rule 2. In removing the account name, the whole of the password text is removed too. Rule 2 does not match the remaining text.

Solution:

Write your rules in order of most sensitive to least sensitive. Avoid using operators like * and ~ as much as possible, as these can remove more text than you intend.

Example 4. For a dogwalking service, we want to improve the transcription with more accurate, business-related words.

Transcription text:

"I have a big hunting dog"

Substitution rules:

Rule 1:

"find": "big hunting dog","replace": "hound"

Rule 2:

"find": "I have * dog","replace": "I am a dog owner"

Rule 3:

"find": "have","replace": "look after"

Processed text:

"I look after a hound"

Why:

In this scenario, Conversation Analyzer applies rule 1. Rule 2 overlaps rule 1 so Conversation Analyzer discards rule 2. Rule 3 overlaps rule 2 only, but because Conversation Analyzer has discarded rule 2, rule 3 can be applied.

Solution:

Write your substitution rules in order of importance.

Chaining substitution and redaction rules

Chaining occurs when one rule matches the output of another rule. Chaining only occurs when you re-analyze a recording. For information about re-analyzing recordings, see Configuring Conversation Analyzer.

Each time Conversation Analyzer applies substitution rules to a transcript, Conversation Analyzer overwrites the original transcript with the processed text. Rerunning the substitution rules can therefore further refine the text.

Example of chaining rules

In the example, party has been removed for simplicity.

Example: Simple case to illustrate chaining.

Original transcript text:

"I have a dog"

Substitution rules:

Rule 1:

"find": "dog","replace": "big cat"

Rule 2:

"find": "cat","replace": "mouse"

Processed text:

"I have a big cat"

Reprocessed text:

"I have a big mouse"

Why:

Rule 2 matches part of the output of rule 1. On the initial processing, Conversation Analyzer applies rule 1. Conversation Analyzer overwrites the original text with the replaced text. On reprocessing, Conversation Analyzer applies rule 2.

Solution:

Write rules so that they don't apply to the output of each other to avoid chaining.
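
To spot potential chaining before you upload a profile, you could run a simple check over the substitution array. The Python sketch below is one possible approach, under stated assumptions: the function `find_chains` is hypothetical, it compares whole words separated by whitespace, it ignores case, and it treats find values as literal text, so it does not understand wildcards.

```python
def find_chains(rules):
    """Return (i, j) pairs where rule j's find text appears in rule i's replace text.

    Rule numbers are 1-based and follow the order of the substitution array.
    A rule can also match its own output, in which case i equals j.
    """
    chains = []
    for i, producer in enumerate(rules, start=1):
        output = producer["replace"].lower().split()
        for j, consumer in enumerate(rules, start=1):
            target = consumer["find"].lower().split()
            if not target:
                continue
            # Look for the find words as a contiguous run in the output words.
            last_start = len(output) - len(target)
            if any(
                output[k:k + len(target)] == target for k in range(last_start + 1)
            ):
                chains.append((i, j))
    return chains
```

For the two rules in the example above, the check reports the pair (1, 2): rule 2 would match the output of rule 1 on re-analysis.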

Highlighting replaced text

After Conversation Analyzer has processed a transcript, substituting or redacting text as your rules require, you are unable to see what has changed. If you want to see where in the transcript Conversation Analyzer, for example, removed text, create a category that highlights the replaced text.

If you substitute text with characters that are not valid in expression values, you will not be able to create a categorization rule to highlight the text. For example, if you create a substitution rule that replaces account numbers with "*********", a categorization rule with "expression": "*********" will be invalid.

Example of highlighting replaced text

In the example, party has been removed for simplicity.

Example: We want to see where account numbers have been removed from the transcript.

Original transcript text:

"My account number is 1234567890123456"

Substitution rule:

"find": "################","replace": "xxxx xxxx xxxx xxxx"

Processed text:

"My account number is xxxx xxxx xxxx xxxx"

Categorization rule:

"name": "Replaced text" [...] "expression": "xxxx xxxx xxxx xxxx"

In your transcript, the replaced text is highlighted within the Replaced text category.
