Linguistic Analysis Explained

Editor’s note: This post was originally published on Ascribe and has been updated to reflect the latest data.

Figuring out what humans are saying in written language is a difficult task.  There is a huge body of literature on the subject, and much capable software that attempts to achieve this goal.  The bottom line is that we are a long way off from having computers truly understand real-world human language.  Still, computers can do a pretty good job at what we are after: gathering concepts and sentiment from text.

The term linguistic analysis covers a lot of territory.  Branches of linguistic analysis correspond to phenomena found in human linguistic systems, such as discourse analysis, syntax, semantics, stylistics, semiotics, morphology, phonetics, phonology, and pragmatics. We will use it in the narrow sense of a computer’s attempt to extract meaning from text – or computational linguistics.

Linguistic analysis is the theory behind what the computer is doing.  We say that the computer is performing Natural Language Processing (NLP) when it is doing an analysis based on the theory.  Linguistic analysis is the basis for Text Analytics.

There are steps in linguistic analysis that are used in nearly all attempts to have computers understand text.  It’s good to know some of these terms.

Here are some common steps, often performed in this order:

1. Sentence detection

Here, the computer tries to find the sentences in the text. Many linguistic analysis tools confine themselves to an analysis of one sentence at a time, independent of the other sentences in the text.  This makes the problem more tractable for the computer but introduces problems. 

“John was my service technician.  He did a super job.”

Considering the second sentence on its own, the computer may determine that there is a strong, positive sentiment around the job.  But if the computer considers only one sentence at a time, it will not figure out that it was John who did the super job.
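
To make this concrete, here is a minimal sketch of sentence detection using the NLTK library (one toolkit among many; it assumes NLTK is installed and its punkt sentence model has been downloaded):

# A minimal sketch of sentence detection, assuming NLTK is installed
# and its "punkt" model has been downloaded (nltk.download("punkt")).
import nltk

text = "John was my service technician. He did a super job."

# Split the text into sentences; many tools then analyze each one in isolation.
sentences = nltk.sent_tokenize(text)
print(sentences)
# ['John was my service technician.', 'He did a super job.']
# Taken on its own, the second sentence never mentions John, which is why
# sentence-at-a-time analysis loses the connection.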

2. Tokenization

Here the computer breaks the sentence into words. Again, there are many ways to do this, each with its strengths and weaknesses.  The quality of the text matters a lot here. 

“I really gotmad when the tech told me *your tires are flat*heck I knew that.”

Lots of problems arise here for the computer.  Humans see “gotmad” and know instantly that there should have been a space.  Computers are not very good at this.  Simple tokenizers simply take successive “word” characters and throw away everything else.  Here that would do an OK job with “flat*heck”, splitting it into “flat” and “heck”, but it would remove the information that “your tires are flat” is a quote and not really part of the surrounding sentence.  When the quality of the text, syntax, or sentence structure is poor, the computer can get very confused.

This can also pose a problem when new words are introduced, or there are multiple meanings of words in one response or group of responses.
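
As an illustration of the simple tokenizer described above (not how any particular product does it), here is a minimal sketch in Python:

# A minimal sketch of a naive tokenizer that keeps runs of "word"
# characters and throws everything else away (illustrative only).
import re

response = "I really gotmad when the tech told me *your tires are flat*heck I knew that."

tokens = re.findall(r"\w+", response)
print(tokens)
# ['I', 'really', 'gotmad', 'when', 'the', 'tech', 'told', 'me',
#  'your', 'tires', 'are', 'flat', 'heck', 'I', 'knew', 'that']
# "flat*heck" splits cleanly into "flat" and "heck", but the asterisks marking
# "your tires are flat" as a quote are gone, and "gotmad" survives as a single
# unknown token.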

3. Lemmatization and cleaning

Most languages allow for multiple forms of the same word, particularly with verbs.  The lemma is the base form of a word.  So, in English, was, is, are, and were are all forms of the verb to be.  The lemma for all these words is be.

There is a related technique called stemming, which tries to find the base part of a word; for example, ponies stems to poni.  Lemmatization normally uses lookup tables, whereas stemming normally uses an algorithm to do things like discard possessives and plurals.  Lemmatization is usually preferred over stemming.

Some linguistic analysis tools attempt to “clean up” the tokens.  The computer might try to correct common misspellings or convert emoticons to their corresponding words.
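
To make the distinction concrete, here is a minimal sketch using NLTK’s WordNet lemmatizer and Porter stemmer (one possible toolkit among many; it assumes the WordNet data has been downloaded):

# A minimal sketch contrasting lemmatization and stemming with NLTK.
from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatization uses a lookup (WordNet) plus the part of speech ("v" = verb).
print([lemmatizer.lemmatize(w, pos="v") for w in ["was", "is", "are", "were"]])
# ['be', 'be', 'be', 'be']

# Stemming applies suffix-stripping rules, so the result need not be a real word.
print(stemmer.stem("ponies"))
# 'poni'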

4. Part of speech tagging

Once we have the tokens (words), we can try to figure out the part of speech for each of them, such as noun, verb, or adjective.  Simple lookup tables let the computer get a start at this, but the job is really much more difficult than that.  Many words in the English language can be both nouns and verbs (and other parts of speech).  To get this right, the words cannot simply be considered one at a time.  The use of language varies widely, and mistakes in part of speech tagging often lead to embarrassing errors by the computer.
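
Here is a minimal sketch of part of speech tagging with NLTK’s default tagger (again, one toolkit among many), showing why the surrounding words matter:

# A minimal sketch of part of speech tagging; assumes NLTK and its
# tokenizer and tagger models are installed.
import nltk

for sentence in ["I will book a flight", "I read a good book"]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))

# "book" should come back as a verb (VB) in the first sentence and as a
# noun (NN) in the second; the tagger has to use context to decide.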

Common Linguistic Analysis Techniques Explained

Most linguistic analysis tools perform the above steps before tackling the job of figuring out what the tokenized sentences mean.  At this point, the various approaches to linguistic analysis diverge.  We will describe in brief the three most common techniques.

Approach #1: Sentence parsing

Noam Chomsky is a key figure in linguistic theory.  He conceived the idea of a “universal grammar”: a way of constructing speech that is somehow understood by all humans and used in all cultures.  This leads to the idea that if a computer can figure out the rules, it can thereby understand human speech and text.  The sentence parsing approach to linguistic analysis has its roots in this idea.

A parser takes a sentence and turns it into something akin to the sentence diagrams you probably did in elementary school:

[Parse tree diagram]

At the bottom, we have the tokens, and above them classifications that group the tokens.  V = verb, PP = prepositional phrase, S = sentence, and so on.

Once the sentence is parsed, the computer can do things like give us all the noun phrases.  Sentence parsing does a good job of finding concepts in this way.  But parsers expect well-formed sentences to work on.  They do a poor job when the quality of the text is low.  They are also poor at sentiment analysis.

Bitext is an example of a commercial tool that uses sentence parsing.  More low-level tools include Apache OpenNLP, Stanford CoreNLP, and GATE.
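
To give a feel for what a parser produces, here is a minimal sketch using a toy grammar with NLTK’s chart parser (illustrative only; real parsers rely on much larger grammars or statistical models):

# A minimal sketch of sentence parsing with a toy context-free grammar.
import nltk

grammar = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> DT N | NP PP
  VP -> V NP | VP PP
  PP -> P NP
  DT -> 'the'
  N  -> 'technician' | 'tires' | 'car'
  V  -> 'checked'
  P  -> 'on'
""")

parser = nltk.ChartParser(grammar)
tokens = "the technician checked the tires on the car".split()

for tree in parser.parse(tokens):
    tree.pretty_print()
    # Noun phrases fall straight out of the tree.
    print([" ".join(st.leaves()) for st in tree.subtrees() if st.label() == "NP"])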

Approach #2: Rules-Based Analysis

Rules-based linguistic analysis takes a more pragmatic approach.  In a rule-based approach, the focus is simply on getting the desired results without attempting to really understand the semantics of the human language.  Rules-based analysis always focuses on a single objective, say concept extraction.  We write a set of rules that perform concept extraction and nothing else.  Contrast this with a parsing approach, where the parsed sentence may yield concepts (nouns and noun phrases) or entities (proper nouns) equally well.

Rules-based linguistic analysis usually has an accompanying computer language used to write the rules.  This may be augmented with the ability to use a general-purpose programming language for certain parts of the analysis.  The GATE platform, for example, provides a rule language called JAPE, used by its ANNIE information extraction components, along with the ability to drop down into the Java programming language.

Rules-based analysis also uses lists of words called gazetteers.  These are lists of nouns, verbs, pronouns, and so on.  A gazetteer also provides something akin to lemmatization.  Hence the verbs gazetteer may group all forms of the verb to be under the verb be.  But the gazetteer can take a more direct approach.  For sentiment analysis the gazetteer may have an entry for awful, with sub-entries horrible, terrible, nasty.  Therefore, the gazetteer can do both lemmatization and synonym grouping.

The text analytics engines offered by SAP are rules-based.  They make use of a rule language called CGUL (Custom Grouper User Language).  Working with CGUL can be very challenging.

Here is an example of what a rule in the CGUL language looks like:

#subgroup VerbClause: {
  (
    [CC]
    ( %(Nouns)*%(NonBeVerbs)+)
    |([OD VB]%(NonBeVerbs)+|%(BeVerbs) [/OD])
    |([OD VB]%(BeVerbs)+|%(NonBeVerbs)+ [/OD])
    [/CC]
  )
  | ( [OD VB]%(NonBeVerbs)[/OD]   )
}

At its heart, CGUL uses regular expressions and gazetteers to form increasingly complex groupings of words.  The final output of the rules is the finished groups, for example, concepts.

Many rules-based tools expect the user to become fluent in the rule language.  Giving the user access to the rule language empowers the user to create highly customized analyses, at the expense of training and rule authoring.
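
The sketch below is not CGUL; it is a hand-rolled Python illustration of the same idea, with a small gazetteer and a couple of simple rules for concept extraction and sentiment grouping:

# A minimal, hand-rolled sketch of the rules-based idea (not CGUL itself).
# Gazetteers are just word lists; the rules scan tokens against them.
import re

NOUNS = {"service", "technician", "tires", "job"}                 # concept gazetteer
ADJECTIVES = {"flat", "spare", "super", "slow"}                   # modifier gazetteer
NEGATIVE = {"awful": {"awful", "horrible", "terrible", "nasty"}}  # head word -> forms

def extract(text):
    tokens = re.findall(r"\w+", text.lower())
    concepts, sentiment = [], []
    for i, tok in enumerate(tokens):
        # Rule 1: a known noun, optionally preceded by a known adjective, is a concept.
        if tok in NOUNS:
            if i > 0 and tokens[i - 1] in ADJECTIVES:
                concepts.append(tokens[i - 1] + " " + tok)
            else:
                concepts.append(tok)
        # Rule 2: any surface form in the sentiment gazetteer maps to its head word.
        for head, forms in NEGATIVE.items():
            if tok in forms:
                sentiment.append(head)
    return concepts, sentiment

print(extract("The service was horrible and the spare tires were nasty."))
# (['service', 'spare tires'], ['awful', 'awful'])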

Approach #3: Deep learning and neural networks

The third approach we will discuss is machine learning.  The basic idea of machine learning is to give the computer a bunch of examples of what you want it to do, and let it figure out the rules for how to do it.  This basic idea has been around for a long time and has gone through several evolutions.  The current hot topic is neural networks.  This approach to natural language machine learning is based loosely on the way our brains work.  IBM has been giving this a lot of publicity with its Watson technology.  You will recall that Watson beat the best human players of the game of Jeopardy.  We can get insight into machine learning techniques from this example.

The idea of deep learning is to build neural networks in layers, each working on progressively broader sections of the problem.  Deep learning is another buzzword that is often applied outside of the area intended by linguistic researchers.

We won’t try to dig into the details of these techniques, but will instead focus on the fundamental requirement they share.  To work, machine learning and artificial intelligence need examples.  Lots of examples.  One area in which machine learning has excelled is image recognition.  You may have used a camera that can find the faces in the picture you are taking.  It’s not hard to see how machine learning could do this.  Give the computer many thousands of pictures and tell it where the faces are.  It can then figure out the rules to find faces.  This works really well.

Back to Watson.  It did a great job at Jeopardy.  Can you see why?  The game is set up perfectly for machine learning.  First, the computer is given an answer.  The computer’s job is to give back the correct question (in Jeopardy you are given the answer and must respond with the correct question).  Since Jeopardy has been played for many years, the computer has just what it needs to work with: a ton of examples, all set up just the way needed by the computer.

Now, what if we want to use deep learning to perform sentiment and language analysis?  Where are we going to get the examples?  It’s not so easy.  People have tried to build data sets to help machines learn things like sentiment, but the results to date have been disappointing.  The Stanford CoreNLP project has a sentiment analysis tool that uses machine learning, but it is not well regarded.  Machine learning today can deliver great results for concept extraction, but less impressive results for sentiment analysis.
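
To make the “learn from examples” idea concrete, here is a minimal sketch using scikit-learn; the training set here is absurdly small, and a real system would need vastly more labeled examples:

# A minimal sketch of learning sentiment from labeled examples
# (illustrative only; real systems need far more training data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The "examples": texts paired with the answers we want the model to learn.
texts = [
    "John did a super job",
    "the technician was great",
    "awful service, never again",
    "the tires were still flat, terrible",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the service was terrible"]))  # likely ['negative']
print(model.predict(["she did a great job"]))       # likely ['positive']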

BERT

Recent advances in machine learning language models have added exciting new tools for text analysis.  At the forefront of these is BERT, which can be used to determine whether two phrases have similar meanings.


BERT stands for Bidirectional Encoder Representations from Transformers.  This technique has been used to create language models from several very large data sets, including the text of all of Wikipedia.  To train a BERT model, a percentage of the words in the training data set are masked, and BERT is trained to predict the masked words from the surrounding text.  Once the BERT model has been trained, we can present two phrases to it and ask how similar in meaning they are.  Given the phrases, BERT gives us back a similarity score between 0 and 1, where 0 means very dissimilar and 1 means very similar.


Given the phrase “I love cats”, BERT will tell us the phrase “felines make great pets” is similar, but “it is raining today” is very dissimilar.  This is very useful when the computer is trying to tell us the main themes in a body of text.  We can use tools such as sentence parsing to partition the text into phrases, determine the similarity between phrases using BERT, and then construct clusters of phrases with similar meanings.  The largest clusters give us hints as to what the main themes are in the text.  Word frequencies in the clusters and the parse trees for the phrases in the clusters allow us to extract meaningful names for each cluster.  We can then categorize the sentences in the text by tagging them with the names of the clusters to which they belong.
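
Here is a minimal sketch of this kind of phrase comparison using the sentence-transformers library, a popular BERT-derived toolkit (the model name below is one common choice rather than the only option, and cosine scores are not strictly confined to the 0 to 1 range):

# A minimal sketch of phrase similarity with a BERT-derived sentence encoder;
# assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common pretrained model

anchor = "I love cats"
candidates = ["felines make great pets", "it is raining today"]

embeddings = model.encode([anchor] + candidates)
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)
# Expect a noticeably higher score for "felines make great pets"
# than for "it is raining today".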

Summary

Linguistic analysis is a complex and rapidly developing science.  Several approaches to linguistic analysis have been developed, each with its own strengths and weaknesses.  To obtain the best results you should choose the approach that gives superior performance for the type of analysis you need.  For example, you may choose a machine learning approach to identify topics, a rules-based approach for sentiment analysis, and a sentence parsing approach to identify parts of speech and their interrelationships.

If you’re not sure where to start on your linguistic and semantic analysis endeavors, the Ascribe team is here to help. With CXI, you can analyze open-ended responses quickly with the visualization tool – helping to uncover key topics, sentiments, and insights to assist you in making more informed business decisions. By utilizing textual comments to analyze customer experience measurement, CXI brings unparalleled sentiment analysis to your customer experience feedback database.

Demo our Linguistic Analysis Software today!