Natural Language Processing

Introduction & History of NLP

Natural Language Processing (NLP) is a discipline at the intersection of linguistics and computer science, and in particular artificial intelligence. It enables interaction between humans and machines through programs that analyze large amounts of language data, trying to understand its content in order to extract the information the user expects.

Figure 1. Relationships between AI, ML, DL and NLP (1)

Research began in 1950 with Alan Turing’s paper on the Turing test, which attempts to assess a machine’s ability to imitate human conversation (2). Afterwards, machine translation applications were developed, based on rules over a restricted vocabulary. In the 1970s, conceptual ontologies appeared and made structured information understandable by a machine. Until the late 1980s, most systems used complex sets of hand-written rules. In the 1990s, statistical models appeared and revolutionized this field of research. More recently, deep learning models and the arrival of new GPUs on the market have greatly improved the results obtained.

Initially, systems were based on hand-written rules combined with a dictionary. They had the advantage of being simple to implement, but the complexity of languages meant that very large numbers of rules had to be written. Statistical methods, on the other hand, focus on the most common cases, which are not easy to determine by hand. They are also more resilient to errors in the input data (misspellings, omitted words, unusual turns of phrase, etc.), and increasing the amount of training data yields richer statistical models. However, when training data is insufficient, rule-based systems can still perform better.

Families of Tasks

Many tasks have been addressed; they can be grouped into four main families: text and speech processing, syntax, semantics and information extraction.

Text and speech processing analyses a speech signal, or handwritten or printed text, to convert it into machine-encoded text. These tasks can be preliminary steps for other tasks such as machine translation or information retrieval.

  • Optical character recognition (OCR) is an image processing and analysis system that can be coupled with linguistic rules to assess the probability of appearance of decoded letters and words.
  • Speech recognition analyses an audio recording and associates basic sound segments with common words or sequences of words that appear frequently.
  • Speech segmentation separates an audio signal into words or speech turns.
  • Text-to-speech produces artificial human speech from a text.

Syntax is the study of how words and morphemes combine to form larger units such as sentences.

  • Lemmatization finds the basic form of words.
  • Morphological segmentation separates a word into morphemes and identifies their classes.
  • Part-of-speech tagging determines the grammatical category of each word in a given sentence (see the sketch after this list).
  • Tokenization separates text into tokens; tokens can be sentences, words, sub-words, etc.
  • Parsing determines the parse tree of a sentence to represent its syntactic structure.
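
Several of these syntactic tasks can be chained in a few lines with NLTK (introduced in the Library section below). The sketch that follows is only an illustration: the example sentence and the expected tags are assumptions, and the required NLTK resources are assumed to have been downloaded.

    # Minimal sketch of tokenization, part-of-speech tagging and lemmatization with NLTK.
    # Assumes the 'punkt', 'averaged_perceptron_tagger' and 'wordnet' resources
    # have been downloaded beforehand with nltk.download().
    import nltk
    from nltk.stem import WordNetLemmatizer

    sentence = "The children were playing in the gardens."

    # Tokenization: split the sentence into word tokens.
    tokens = nltk.word_tokenize(sentence)

    # Part-of-speech tagging: assign a grammatical category to each token.
    print(nltk.pos_tag(tokens))  # e.g. [('The', 'DT'), ('children', 'NNS'), ('were', 'VBD'), ...]

    # Lemmatization: reduce each word to its base form.
    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("children", pos="n"))  # child
    print(lemmatizer.lemmatize("playing", pos="v"))   # play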

Semantics is the study of the meaning of natural language. Due to the high complexity and subjectivity of human language, interpreting texts is quite a complicated task for machines.

  • Machine translation converts a text from a source language into a target language. This is one of the most complex problems, as it requires extensive knowledge of the linguistics and culture of each language.
  • Text generation writes syntactically and semantically correct texts.
  • Summarization, rephrasing and paraphrasing detect the most important information in a text and generate a coherent text that a human would find credible.
  • Disambiguation identifies the meaning of a word in a given sentence (see the sketch after this list).
  • Question answering and chatbots combine a language-understanding step and a text-generation step to produce a conversational system.
  • Coreference resolution detects the connection between several words in a sentence that refer to the same subject, e.g. "The music was so loud that it couldn’t be enjoyed."
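
Word-sense disambiguation, for instance, can be approximated with NLTK’s implementation of the Lesk algorithm. The sketch below is only an illustration under stated assumptions: the sentence is invented, the result depends heavily on the context provided, and the WordNet resource must have been downloaded.

    # Rough sketch of word-sense disambiguation using NLTK's Lesk implementation.
    # Assumes the 'punkt' and 'wordnet' resources have been downloaded.
    from nltk.tokenize import word_tokenize
    from nltk.wsd import lesk

    sentence = "I went to the bank to deposit my money"
    tokens = word_tokenize(sentence)

    # lesk() chooses the WordNet sense of 'bank' whose dictionary definition
    # overlaps most with the other words of the sentence.
    sense = lesk(tokens, "bank")
    if sense is not None:
        print(sense.name(), "-", sense.definition())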

Information extraction consists of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents.

  • Text mining extracts high-quality information from text.
  • Information retrieval finds the documents most relevant to a query.
  • Named entity recognition (NER) identifies the category of a word or group of words in a text, such as "person" or "city" (see the sketch after this list).
  • Entity linking assigns a unique identity to a word or group of words.
  • Document classification assigns a document to one or more classes or categories.
  • Sentiment analysis identifies, extracts and quantifies affective states and subjective information.
  • Recommender systems provide suggestions for the items most relevant to a particular user.
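
Named entity recognition, for example, is available out of the box in SpaCy (see the Library section below). The snippet below is a minimal sketch: it assumes the small English model en_core_web_sm has been installed, and the example sentence and labels are purely illustrative.

    # Minimal sketch of named entity recognition with SpaCy.
    # Assumes the small English model has been installed beforehand:
    #   python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Alan Turing was born in London in 1912.")

    # Each detected entity exposes its text span and its category label.
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typical output: 'Alan Turing' PERSON, 'London' GPE, '1912' DATE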

The complexity of each task depends on the analyzed language. Some tasks are simple in English, such as separating text into words, while they are more complex in other languages such as Thai, where words are not separated by spaces. Some tasks are necessary preliminaries to others, and tasks can also be cascaded to obtain better results.

Libraries

Many libraries are available to perform some of the tasks mentioned above.

NLTK (3) and SpaCy (4) mainly perform low-level tasks at the syntactic level for several languages.

Gensim (5), TensorFlow (6) and PyTorch (7) offer pre-trained models or provide tools to train your own models for higher-level tasks.
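
As an illustration, the sketch below trains a small Word2Vec embedding model with Gensim. The toy corpus and hyper-parameters are assumptions made for the example, not values recommended by the library; a real corpus would be far larger.

    # Toy sketch of training word embeddings with Gensim's Word2Vec (Gensim 4.x API).
    # The corpus and hyper-parameters are illustrative only.
    from gensim.models import Word2Vec

    corpus = [
        ["natural", "language", "processing", "analyzes", "text"],
        ["speech", "recognition", "converts", "audio", "into", "text"],
        ["machine", "translation", "converts", "text", "between", "languages"],
    ]

    model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

    # Words appearing in similar contexts end up with similar vectors.
    print(model.wv.most_similar("text", topn=3))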

References

  1. Mehra, S., Hasanuzzaman, M. Detection of Offensive Language in Social Media Posts. (2020). doi:10.13140/RG.2.2.23097.80485
  2. Turing, A. M. Computing Machinery and Intelligence. Mind LIX(236), 433-460 (1950). doi:10.1093/mind/LIX.236.433
  3. https://www.nltk.org/
  4. https://spacy.io/
  5. https://github.com/RaRe-Technologies/gensim
  6. https://www.tensorflow.org/
  7. https://pytorch.org/