R, Reticulate, and Hugging Face Models

Join me to get your feet wet with thousands of models available on Hugging Face! Hugging Face is like a CRAN of pre-trained AI/ML models. There are thousands of pre-trained models that can be imported and used within seconds at no charge to achieve tasks like text generation, text classification, translation, speech recognition, image classification, object detection, etc. In this post, I am exploring how to access these pre-trained models without leaving the comfort of RStudio using the reticulate package.

Cengiz Zopluoglu (University of Oregon)
01-30-2022

The reticulate package

The reticulate package provides an interface for calling and running Python from R. There is an excellent website with many details about the package, so I will not repeat that information here; I recommend spending some time there and installing reticulate following its instructions. In my case, I also had Anaconda previously installed on my computer, which comes with its own Python.

First, install the reticulate package and Anaconda. Then, load reticulate and check the default Python configuration.

library(reticulate)

py_config()
python:         C:/Users/cengiz/Anaconda3/python.exe
libpython:      C:/Users/cengiz/Anaconda3/python38.dll
pythonhome:     C:/Users/cengiz/Anaconda3
version:        3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/cengiz/Anaconda3/Lib/site-packages/numpy
numpy_version:  1.19.2

NOTE: Python version was forced by RETICULATE_PYTHON

Currently, it is set to use Python 3.8, which came with Anaconda. If you see nothing when you run py_config(), you need to install Python. If you don’t have Anaconda or Python installed on your computer, check the ?install_miniconda function. You can install Python directly with this function, and it creates a default virtual Python environment (r-reticulate) that you can use to import Python modules moving forward. If you have never done this, a set of useful functions to check is listed below.

?install_miniconda
?conda_list
?conda_install
?use_condaenv

conda_list() returns the Python environments previously created on your computer.

 conda_list()
          name
1    Anaconda3
2 r-reticulate
                                                                          python
1                                       C:\\Users\\cengiz\\Anaconda3\\python.exe
2 C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe

The output indicates that there are two Python environments on my computer. The first one is the base environment created when I installed Anaconda. The second one is the r-reticulate environment created when I installed r-miniconda along with the reticulate package. I will be using the base environment that comes with Anaconda, so I declare it below using the use_condaenv function.

use_condaenv('Anaconda3')

All Python modules I will need moving forward will be installed in this environment. You can install Python modules using the ?conda_install function, similar to the install.packages() command used for installing a new R package. Below is an example of installing the transformers module, followed by a list of all the other modules needed for the rest of this post. These modules should all be installed so that the rest of the code in this post works.

# Install the python module to your specified Python environment

conda_install(envname = 'Anaconda3',
              packages = 'transformers',
              pip=TRUE)

# List of Python modules needed in this post

  # transformers
  # torch
  # torchvision
  # numpy
  # PIL
  # librosa
  # requests
  # timm
  # detoxify
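
If it helps, all of these can be installed with a single call. This is just a sketch, assuming the same 'Anaconda3' environment as above; note that the PIL module is distributed under the package name Pillow on pip:

# Install every Python module needed for this post in one call
# (the PIL module is installed via the 'Pillow' package)

conda_install(envname  = 'Anaconda3',
              packages = c('transformers', 'torch', 'torchvision',
                           'numpy', 'Pillow', 'librosa', 'requests',
                           'timm', 'detoxify'),
              pip = TRUE)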

I will also use the magick and kableExtra packages in R later in this post; kbl() and kable_styling() come from kableExtra, and magick is used at the end for the object detection task.

# Install packages

install.packages('magick')
install.packages('kableExtra')

Hugging Face Models

If you go to the Models tab of Hugging Face, there are more than 27,000 models available to use. I find it similar to the CRAN repository for R packages, except it is for pre-trained AI/ML models. Some of these models probably cost tens of thousands of dollars to train, if not more.

These models are currently grouped into three major areas (Natural Language Processing, Audio, and Computer Vision), and each area has multiple tags, each for a different type of task.

When you click a tag, you can filter the subset of models developed for that specific task. In this post, I checked the most downloaded model for each tag and tried to reproduce an example of using it to accomplish the task. Some were straightforward, but others required extra web searching due to a lack of documentation; I abandoned the tasks for which I couldn’t find enough information. At the end of each demo, I provide links to the pages I learned from while trying to reproduce the examples.

I am not an expert in any of these models, and I am very much a beginner Python user. I will try to explain things as much as possible, but please take my explanations with a grain of salt. When I started this post, my original intention was only to reproduce some examples.

Natural Language Processing

Fill-Mask

In the Fill-Mask task, we provide a text with a missing piece and ask the NLP model to complete the sentence for us. For instance, you can write a sentence like the following:

Istanbul is the _____ of Turkey.

There are many pre-trained NLP models for this task, and I will use roberta-base. First, we import the Python modules and then define the tokenizer and the model as two objects in the R environment. The model files are downloaded to your computer the first time you use them. Note that some of these models are very big, so make sure you have enough space on your computer.

transformers <- import('transformers')
torch        <- import('torch')

tokenizer <- transformers$AutoTokenizer$from_pretrained('roberta-base')

model     <- transformers$AutoModelForMaskedLM$from_pretrained('roberta-base')

Each NLP model may have a different format for the masked word. So, it is a good idea to check the mask token.

tokenizer$mask_token
[1] "<mask>"
tokenizer$mask_token_id
[1] 50264

Now, we can prepare the input text accordingly and tokenize it.

txt <- 'Istanbul is the <mask> of Turkey.'

input <- tokenizer$encode(text = txt,return_tensors="pt")
input
tensor([[    0,   100, 46770,    16,     5, 50264,     9,  2769,     4,     2]])
input$shape
torch.Size([1, 10])

This process encodes the words in our sentence (and some special tokens) into their numeric representations. For instance, the numeric code for the mask token is 50264, and the numeric code for “ the” (with a leading space) is 5. The returned object is a tensor of length 10.

tokenizer$encode('<mask>')
[1]     0 50264     2
tokenizer$encode(' the')
[1] 0 5 2

I will locate the position of the mask token in the input tensor, because I will need it later.

loc <- which(input$tolist()[[1]] == tokenizer$mask_token_id)
loc
[1] 6
  # <mask> is the 6th token in my input tensor

I will submit the input tensor to the model.

token_logits <- model(input)$logits$detach()
token_logits
tensor([[[33.1248, -4.0028, 18.4557,  ...,  2.9945,  5.8644, 11.4540],
         [ 8.9463, -3.9279, 20.4999,  ...,  2.2690,  2.1061,  4.4710],
         [ 9.1800, -3.5469,  8.1455,  ...,  1.8752,  3.0497,  3.4087],
         ...,
         [ 5.8352, -3.7452,  6.6445,  ...,  0.6672,  0.3624,  2.9578],
         [18.0251, -4.6203, 19.6356,  ...,  0.8772,  3.5524,  7.1917],
         [12.1654, -3.9563, 31.6049,  ...,  1.1393, -0.8462,  9.7371]]])
token_logits$shape
torch.Size([1, 10, 50265])

The output returns logits for all 50,265 tokens in the roberta-base vocabulary at each token position (there are ten tokens in my sentence). These logits are unnormalized scores reflecting how likely each word in the vocabulary is at that specific token position. In this case, my only interest is the token in the 6th position. The code below finds the three words with the highest score for the 6th token position.

# Note that Python indices start from 0
# So, we ask for loc-1 below

  masked_token_logits <- token_logits[0][loc-1]
  masked_token_logits
tensor([-3.5709, -3.8466,  3.3165,  ..., -5.0920, -4.6620, -1.2226])
  masked_token_logits$shape
torch.Size([50265])
# Find the top three words based on probability calculated from the model
  
top_3 <- torch$topk(masked_token_logits, k = 3L)
top_3
torch.return_types.topk(
values=tensor([24.5527, 20.6755, 19.7865]),
indices=tensor([ 812, 1867, 1312]))
# Decode these indices to find the corresponding words 
# from the roberta-base dictionary

unmask <- tokenizer$decode(token_ids = top_3['indices'])
unmask
[1] " capital Capital center"

The model says the three words that would most likely be appropriate for the missing word are capital, Capital, and center. Below is some formatting to produce the sentence with each of these three words.

unmask_ <- strsplit(unmask,' ')[[1]][-1]

gsub(pattern = '<mask>', replacement = unmask_[1], x = txt)
[1] "Istanbul is the capital of Turkey."
gsub(pattern = '<mask>', replacement = unmask_[2], x = txt)
[1] "Istanbul is the Capital of Turkey."
gsub(pattern = '<mask>', replacement = unmask_[3], x = txt)
[1] "Istanbul is the center of Turkey."

We can use a different NLP model, and the code would be almost identical. The only difference would be the mask token. For instance, if we use the bert-base-uncased model, the mask token is defined as [MASK]. The code below runs the same task with the bert-base-uncased model. As you will see, the three words this model predicted for the missing piece are capital, heart, and birthplace.

transformers <- import('transformers')
torch        <- import('torch')

tokenizer <- transformers$AutoTokenizer$from_pretrained('bert-base-uncased')

model     <- transformers$AutoModelForMaskedLM$from_pretrained('bert-base-uncased')

tokenizer$mask_token
[1] "[MASK]"
tokenizer$mask_token_id
[1] 103
txt <- 'Istanbul is the [MASK] of Turkey.'

input <- tokenizer$encode(text = txt,return_tensors="pt")
input
tensor([[ 101, 9960, 2003, 1996,  103, 1997, 4977, 1012,  102]])
loc <- which(input$tolist()[[1]] == tokenizer$mask_token_id)

token_logits <- model(input)$logits

masked_token_logits <- token_logits[0][loc-1]

top_3 <- torch$topk(masked_token_logits, k = as.integer(3))

unmask <- tokenizer$decode(token_ids = top_3['indices'])

unmask_ <- strsplit(unmask,' ')[[1]]

gsub(pattern = '\\[MASK]', replacement = unmask_[1], x = txt)
[1] "Istanbul is the capital of Turkey."
gsub(pattern = '\\[MASK]', replacement = unmask_[2], x = txt)
[1] "Istanbul is the heart of Turkey."
gsub(pattern = '\\[MASK]', replacement = unmask_[3], x = txt)
[1] "Istanbul is the birthplace of Turkey."

Resources:

Text Classification

A given text can be classified in many different ways. The most popular is sentiment analysis, predicting whether a text carries a negative, neutral, or positive sentiment. Or we can try to predict the emotion (sadness, joy, love, anger, fear, surprise). We can also try to predict whether or not a given text is toxic. The subset of models on Hugging Face for text classification offers a variety of analyses. I will reproduce examples from three models predicting slightly different things for a given text.

Sentiment Analysis

The first example is the most downloaded model in this category, cardiffnlp/twitter-roberta-base-sentiment. Let’s import the modules we will need, and load the tokenizer and model objects for this specific model.

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

model     <- transformers$AutoModelForSequenceClassification$from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

The input is plain text as a character string, and we tokenize it using the tokenizer object. Note that I tried to trick the model by putting some negative words in the text, even though the sentence’s overall sentiment is positive.

txt <- "Dr. Z's class was not very boring and not disorganized. I would definitely take it again."


input <- tokenizer$encode(text = txt,return_tensors="pt")
input
tensor([[    0, 14043,     4,   525,    18,  1380,    21,    45,   182, 15305,
             8,    45,  2982, 29835,     4,    38,    74,  2299,   185,    24,
           456,     4,     2]])
input$shape
torch.Size([1, 23])

After tokenization, we submit the input tensor to the model object, producing logits.

output <- model(input)$logits
output
tensor([[-2.1241, -0.1090,  2.6949]], grad_fn=<AddmmBackward0>)
output$shape
torch.Size([1, 3])

The final output is a tensor that includes three numbers. These numbers correspond to three categories: Negative (the first element), Neutral (the second element), and Positive (the third element). To transform the logits into probabilities, we apply a softmax transformation.

scores <- output$detach()$numpy()
scores
       [,1]   [,2]  [,3]
[1,] -2.124 -0.109 2.695
# Softmax transformation to get probabilities for Negative, Neutral, and Positive

 exp(scores)/sum(exp(scores))
         [,1]    [,2]   [,3]
[1,] 0.007556 0.05668 0.9358
  # the first probability is for Negative
  # the second category is Neutral
  # the third category is Positive

The model predicts that the probability of this text having a positive sentiment is 0.936, a neutral sentiment is 0.057, and a negative sentiment is 0.008. (Nice job!)
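
If you want to score many texts, it may be convenient to wrap these steps into a small helper function. Below is a minimal sketch of my own (the function name is mine; the label order follows the comments above):

# A small helper: tokenize a text, run the model, and apply the
# softmax transformation to get Negative/Neutral/Positive probabilities

predict_sentiment <- function(txt){
  
  input  <- tokenizer$encode(text = txt, return_tensors = "pt")
  scores <- model(input)$logits$detach()$numpy()
  probs  <- exp(scores)/sum(exp(scores))
  
  data.frame(class = c('Negative', 'Neutral', 'Positive'),
             prob  = as.numeric(probs))
}

predict_sentiment("Dr. Z's class was a complete waste of my time.")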

Resources:

Detecting Emotions

The second example is another model in the Text Classification category, bhadresh-savani/distilbert-base-uncased-emotion, specifically developed for detecting emotions such as sadness, joy, love, anger, fear, and surprise.

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('bhadresh-savani/distilbert-base-uncased-emotion')

model     <- transformers$AutoModelForSequenceClassification$from_pretrained('bhadresh-savani/distilbert-base-uncased-emotion')

I will use the same input text, tokenize it using the tokenizer object, and then submit the input tensor to the model to obtain the logits.

txt <- "Dr. Z's class was not very boring and not disorganized. I would definitely take it again."

input <- tokenizer$encode(text = txt,return_tensors="pt")

output <- model(input)$logits
output
tensor([[ 6.2330, -0.8139, -2.2702,  0.0108, -2.2436, -2.2626]],
       grad_fn=<AddmmBackward0>)
output$shape
torch.Size([1, 6])

The output returns six numbers corresponding to six categories. We can check the order of the categories to match these numbers to the labels.

model$config$id2label
$`0`
[1] "sadness"

$`1`
[1] "joy"

$`2`
[1] "love"

$`3`
[1] "anger"

$`4`
[1] "fear"

$`5`
[1] "surprise"

Let’s apply the softmax transformation to obtain the probabilities for each category.

scores <- output$detach()$numpy()

prob <- exp(scores)/sum(exp(scores))

data.frame(class = unlist(model$config$id2label),
           prob  = as.numeric(prob))
     class      prob
0  sadness 0.9965414
1      joy 0.0008671
2     love 0.0002021
3    anger 0.0019782
4     fear 0.0002076
5 surprise 0.0002037

The model-predicted emotion for this text is sadness, with a probability estimate of 0.996 (!!!)

Resources:

Toxicity

The final example in this category is the Detoxify module. This Python module provides models trained to predict toxic comments based on the datasets from three Jigsaw challenges on Kaggle: Toxic Comment Classification, Unintended Bias in Toxic Comments, and Multilingual Toxic Comment Classification.

For a given sentence, the model returns a probability estimate for six different areas: toxicity, severe toxicity, obscene, threat, insult, and identity attack. Let’s take the following text and get predictions from this model.

Immigrants are stealing our jobs. Send them back to where they come from! THEY DON’T DESERVE TO LIVE IN AMERICA!
# Load the Python module

detoxify <- import('detoxify')

# Input text

txt <- "Immigrants are stealing our jobs. Send them back to where they come from! THEY DON'T DESERVE TO LIVE IN AMERICA!"

# Predict

pred <- detoxify$Detoxify('original')$predict(txt)
unlist(pred)
       toxicity severe_toxicity         obscene          threat 
       0.832002        0.009733        0.024375        0.018156 
         insult identity_attack 
       0.091395        0.403692 

Note that the text input can be a vector of texts. For instance, we can take 5 text strings as a vector and return the probability estimates in these six areas as a 5 x 6 matrix.

txt <- c("No, he is an arrogant, self serving, immature idiot. Get it right.",
         "Simple. You are stupid!",
         "The overall organization and text are good.",
         "Who is the man in the high castle?",
         "This is worse than I thought. This user has a sockpuppet account!!")
         
pred <- detoxify$Detoxify('original')$predict(txt)

unlist(pred)
       toxicity1        toxicity2        toxicity3        toxicity4 
      0.98233998       0.98080528       0.00055082       0.00116317 
       toxicity5 severe_toxicity1 severe_toxicity2 severe_toxicity3 
      0.31779119       0.02491754       0.02204690       0.00014005 
severe_toxicity4 severe_toxicity5         obscene1         obscene2 
      0.00009857       0.00031965       0.68105346       0.62617028 
        obscene3         obscene4         obscene5          threat1 
      0.00020516       0.00016786       0.00766684       0.00100577 
         threat2          threat3          threat4          threat5 
      0.00104363       0.00014156       0.00010783       0.00043173 
         insult1          insult2          insult3          insult4 
      0.92518520       0.93162781       0.00018154       0.00018590 
         insult5 identity_attack1 identity_attack2 identity_attack3 
      0.01330768       0.01747648       0.00959109       0.00014455 
identity_attack4 identity_attack5 
      0.00014807       0.00074668 
# Some reorganization of the output

probs <- matrix(unlist(pred),
                nrow = 5,
                ncol=6,
                byrow=FALSE)

data.frame(txt             = txt,
           toxicity        = round(probs[,1],2),
           severe_toxicity = round(probs[,2],2),
           obscene         = round(probs[,3],2),
           threat          = round(probs[,4],2),
           insult          = round(probs[,5],2),
           identity_attack = round(probs[,6],2))
                                                                 txt
1 No, he is an arrogant, self serving, immature idiot. Get it right.
2                                            Simple. You are stupid!
3                        The overall organization and text are good.
4                                 Who is the man in the high castle?
5 This is worse than I thought. This user has a sockpuppet account!!
  toxicity severe_toxicity obscene threat insult identity_attack
1     0.98            0.02    0.68      0   0.93            0.02
2     0.98            0.02    0.63      0   0.93            0.01
3     0.00            0.00    0.00      0   0.00            0.00
4     0.00            0.00    0.00      0   0.00            0.00
5     0.32            0.00    0.01      0   0.01            0.00

Note that the probabilities within a row do not necessarily add up to one. If I am not mistaken, the model makes an independent binary prediction for each category rather than choosing a single category with a softmax, as the quick check below illustrates.
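
Using the 5 x 6 probability matrix computed above (a quick sketch; run it to see the actual sums):

# If the six categories came from a single softmax, every row would
# sum to 1; with independent per-label predictions, the row sums can
# fall anywhere between 0 and 6
rowSums(probs)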

Extractive Question Answering

Extractive Question Answering is the task of extracting an answer from a text given a question. We provide two inputs as text strings: the first input is a question, and the second is a context. The model extracts the answer to the question from the given context.

For instance, let’s say we have the following text as a context.

In probability theory, a normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a type of continuous probability distribution for a real-valued random variable. The parameter mu is the mean or expectation of the distribution (and also its median and mode), while the parameter sigma is its standard deviation. A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal.

Then, we ask the following question.

What is the parameter mu in a normal distribution?

The model should find the relevant part of the text and extract the span that answers this question. Below is the code for this example using the most popular model in this category, deepset/roberta-base-squad2.

# Load the python libraries

  transformers <- import('transformers')
  torchvision  <- import('torchvision')
  torch        <- import('torch')

# Load the tokenizer and model for roberta-base-squad2

  tokenizer <- transformers$AutoTokenizer$from_pretrained('deepset/roberta-base-squad2')
  
  model     <- transformers$AutoModelForQuestionAnswering$from_pretrained('deepset/roberta-base-squad2')

# Text inputs for the question and context
  
  question <- "What is the parameter mu in a normal distribution?"
  
  # Copy/paste the same text from context as written above
  # For some reason RMarkdown doesn't display it in the code below when I knit
  # the document
  
  context  <- "In probability theory, a normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a type of continuous probability distribution for a real-valued random variable. The parameter mu is the mean or expectation of the distribution (and also its median and mode), while the parameter sigma is its standard deviation. A random variable with a Gaussian distribution is said to be normally distributed, and is called a normal deviate. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known. Their importance is partly due to the central limit theorem. It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases. Therefore, physical quantities that are expected to be the sum of many independent processes, such as measurement errors, often have distributions that are nearly normal."

# Tokenize the inputs
    
  input <- tokenizer$encode(text      = question,
                            text_pair = context,
                            return_tensors="pt")
  
  input$shape
torch.Size([1, 219])
  # there are 219 tokens in the combined question + context input

# Submit the input tensor to the model
  
  output <- model(input)

  # the model returns two elements
  
  # the first element includes the logits for the starting position of the answer
    # output$start_logits
  
  # the second element includes the logits for the ending position of the answer
  
# Extract the most likely token position to start the answer
  
  start <- output$start_logits$argmax(-1L)$item()
  start
[1] 53
# Extract the most likely token position to end the answer
  
  end <- output$end_logits$argmax(-1L)$item()
  end
[1] 59
# Decode the words between starting position and ending position
# This is the answer the model predicts for the given question
  
  tokenizer$decode(input[0][start:end])
[1] " the mean or expectation of the distribution"

Resources:

Summarization

In Summarization, the model takes a longer text and generates a shorter text as a summary.

For instance, let’s say we have the following text.

New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared ‘I do’ five more times, sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her ‘first and only’ marriage. Barrientos, now 39, is facing two criminal counts of ‘offering a false instrument for filing in the first degree,’ referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam. On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. The case was referred to the Bronx District Attorney’s Office by Immigration and Customs Enforcement and the Department of Homeland Security’s Investigation Division. Seven of the men are from so-called ‘red-flagged’ countries, including Egypt, Turkey, Georgia, Pakistan and Mali. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.

The code below generates a model-predicted summary for this text using the most downloaded model in this category, facebook/bart-large-cnn.

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('facebook/bart-large-cnn')

model     <- transformers$AutoModelForSeq2SeqLM$from_pretrained('facebook/bart-large-cnn')


# Copy/paste the same text above using double quotes
# For some reason RMarkdown doesn't display it in the code below when I knit
# the document
  
txt <- "New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared 'I do' five more times, sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her 'first and only' marriage. Barrientos, now 39, is facing two criminal counts of 'offering a false instrument for filing in the first degree,' referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam. On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. The case was referred to the Bronx District Attorney's Office by Immigration and Customs Enforcement and the Department of Homeland Security's Investigation Division. Seven of the men are from so-called 'red-flagged' countries, including Egypt, Turkey, Georgia, Pakistan and Mali. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18."

# Tokenize the input text

input <- tokenizer$encode(text = txt,
                          return_tensors="pt")

# Generate the predicted tokens for the summary text

output <- model$generate(input)

# Decode the output tokens

tokenizer$batch_decode(output)
[1] "</s><s>Liana Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men, and at one time, she was married to eight men at once. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation.</s>"
# Too long to print, so I wrap it in kable

summary_txt <- as.matrix(tokenizer$batch_decode(output))

require(kableExtra)

summary_txt %>%
  kbl() %>%
  kable_styling()
</s><s>Liana Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men, and at one time, she was married to eight men at once. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation.</s>
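
By default, generate() uses the length settings stored in the model configuration. If you want a shorter or longer summary, you can set the bounds yourself; the values below are arbitrary choices for illustration:

# Bound the length of the generated summary
# (min_length and max_length count tokens, not words)

output <- model$generate(input,
                         min_length = 30L,
                         max_length = 60L,
                         num_beams  = 4L)

tokenizer$batch_decode(output)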

Resources:

Text Generation

The purpose of the text generation task is to create a meaningful continuation of a text. For instance, suppose I start with the following sentence.

My name is Cengiz and I am from Turkey.

How would an NLP model continue this sentence? Below is the code to generate a continuation for this sentence using the GPT2 model.

# Load the module

transformers <- import('transformers')

# Load the tokenizer and model

tokenizer <- transformers$AutoTokenizer$from_pretrained('gpt2')

model     <- transformers$GPT2LMHeadModel$from_pretrained('gpt2')

# Input text

txt <- 'My name is Cengiz and I am from Turkey.'

# Tokenize the input

input <- tokenizer$encode(txt,return_tensors='pt')
input
tensor([[3666, 1438,  318,  327, 1516,  528,  290,  314,  716,  422, 7137,   13]])
# Generate the continuation text
# you can play with arguments like max_length, num_beams,
# no_repeat_ngram_size, num_return_sequences, etc.
# I don't know what some of these mean or how they impact the generated text

  # 100L is just 100 as an integer
  # If you use plain 100, it is numeric (double) and the function throws an
  # error because it expects integers
  
output <- model$generate(input, 
                         max_length           = 100L,
                         num_beams            = 20L,
                         no_repeat_ngram_size = 2L,
                         num_return_sequences = 1L,
                         early_stopping       =TRUE)


# Too long to print, so I wrap it in kable

new_txt <- as.matrix(tokenizer$batch_decode(output))

new_txt %>%
  kbl() %>%
  kable_styling()

My name is Cengiz and I am from Turkey. I was born and raised in Turkey and have been living in the United States for over 20 years.

I have always been interested in learning about the world and what it is like to live in a country where you are not allowed to speak your mind. It is very difficult for me to understand what is going on in this country, but I do know that I have a lot to learn and that is what I want to do.

Note: I didn’t write this. This is what GPT2 wrote, and it is hilarious!
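
One more note: beam search, as used above, is deterministic, so rerunning the code returns the same continuation every time. generate() also supports sampling, which produces a different continuation on each run. A sketch with arbitrary parameter values:

# Sample from the model instead of using beam search;
# top_k / top_p restrict sampling to the most likely tokens,
# and temperature controls how adventurous the sampling is

output <- model$generate(input,
                         max_length  = 100L,
                         do_sample   = TRUE,
                         top_k       = 50L,
                         top_p       = 0.95,
                         temperature = 0.8)

tokenizer$batch_decode(output)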

Resources:

Text2Text Generation

Like Text Classification, there are different tasks that can be considered under the Text2Text Generation category. I will reproduce examples for Generating a Headline, Generating a Question with and without Supervision, and Paraphrasing.

Generate a Headline

The purpose of this task is to generate a headline for a given text. For instance, let’s consider the following text.

Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady Melania Trump tested positive for COVID-19. Officials said the Trumps’ 14-year-old son Barron tested negative as did First Family and Senior Advisors Jared Kushner and Ivanka Trump. Trump took to social media, posting at 12:54 am local time (0454 UTC) on Twitter, ‘Tonight, [Melania] and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!’ Yesterday afternoon Marine One landed on the White House’s South Lawn flying Trump to Walter Reed National Military Medical Center (WRNMMC) in Bethesda, Maryland. Reports said both were showing ‘mild symptoms’. Senior administration officials were tested as people were informed of the positive test. Senior advisor Hope Hicks had tested positive on Thursday. Presidential physician Sean Conley issued a statement saying Trump has been given zinc, vitamin D, Pepcid and a daily Aspirin. Conley also gave a single dose of the experimental polyclonal antibodies drug from Regeneron Pharmaceuticals. According to official statements, Trump, now operating from the WRNMMC, is to continue performing his duties as president during a 14-day quarantine. In the event of Trump becoming incapacitated, Vice President Mike Pence could take over the duties of president via the 25th Amendment of the US Constitution. The Pence family all tested negative as of yesterday and there were no changes regarding Pence’s campaign events.

Below is the code generating a headline for this text, using a T5 language model fine-tuned for this task.

Note that the input text string should start with headline:. For instance, the input string for the text above should be formatted as

headline: Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady …
# Load the modules
  
  transformers <- import('transformers')
  torch        <- import('torch')

# Load the tokenizer and model

  tokenizer <- transformers$AutoTokenizer$from_pretrained('Michau/t5-base-en-generate-headline')
  model     <- transformers$AutoModelForSeq2SeqLM$from_pretrained('Michau/t5-base-en-generate-headline')

# Input text string

  # Copy/paste the article above
  # Do not forget to put headline: at the beginning

  article <- "headline: Very early yesterday morning, the United States President Donald Trump reported he and his wife First Lady Melania Trump tested positive for COVID-19. Officials said the Trumps' 14-year-old son Barron tested negative as did First Family and Senior Advisors Jared Kushner and Ivanka Trump. Trump took to social media, posting at 12:54 am local time (0454 UTC) on Twitter, 'Tonight, [Melania] and I tested positive for COVID-19. We will begin our quarantine and recovery process immediately. We will get through this TOGETHER!' Yesterday afternoon Marine One landed on the White House's South Lawn flying Trump to Walter Reed National Military Medical Center (WRNMMC) in Bethesda, Maryland. Reports said both were showing 'mild symptoms'. Senior administration officials were tested as people were informed of the positive test. Senior advisor Hope Hicks had tested positive on Thursday.
  Presidential physician Sean Conley issued a statement saying Trump has been given zinc, vitamin D, Pepcid and a daily Aspirin. Conley also gave a single dose of the experimental polyclonal antibodies drug from Regeneron Pharmaceuticals. According to official statements, Trump, now operating from the WRNMMC, is to continue performing his duties as president during a 14-day quarantine. In the event of Trump becoming incapacitated, Vice President Mike Pence could take over the duties of president via the 25th Amendment of the US Constitution. The Pence family all tested negative as of yesterday and there were no changes regarding Pence's campaign events."

# Tokenize the input string
    
  input <- tokenizer$encode_plus(text = article, 
                                 return_tensors = 'pt')

  input_ids <- input['input_ids']

# Model predicted tokens for the headline
  
  output <- model$generate(input_ids = input_ids)
  output
tensor([[    0,  2523,    11,  1485,  8571,  5049, 11219,  2300, 24972,    21,
          2847,  7765,   308,  4481,     1]])
# Decode the model predicted tokens

  tokenizer$batch_decode(output)
[1] "<pad> Trump and First Lady Melania Test Positive for COVID-19</s>"

Resources:

Generate a Question with Answer Supervision

There are models that can generate a question given a text string for an answer and another text string for the context. Below is an example using a T5 language model fine-tuned for this task.

# Load the module
  
  transformers <- import('transformers')

# Load the tokenizer and model
  
  tokenizer <- transformers$AutoTokenizer$from_pretrained('mrm8488/t5-base-finetuned-question-generation-ap')
    
  model     <- transformers$AutoModelWithLMHead$from_pretrained('mrm8488/t5-base-finetuned-question-generation-ap')

# Input string should be formatted as below
# 'answer: ..... context: .....'

  txt <- 'answer: 12 context: Apples'

# Tokenize the input and generate the question
  
  input <- tokenizer$encode(text = txt,return_tensors = 'pt')
  output <- model$generate(input)
  tokenizer$batch_decode(output)
[1] "<pad> question: How many apples are there?</s>"
# Another question with a different answer
  
  txt <- 'answer: red context: Apples'
  
  input <- tokenizer$encode(text = txt,return_tensors = 'pt')
  
  output <- model$generate(input)
  
  tokenizer$batch_decode(output)
[1] "<pad> question: What color are apples?</s>"
# Another one
    
  txt <- 'answer: decay context: Apples'
  
  input <- tokenizer$encode(text = txt,return_tensors = 'pt')
  
  output <- model$generate(input)
  
  tokenizer$batch_decode(output)
[1] "<pad> question: What do apples do?</s>"

Resources:

Generate a Question with No Supervision

This is similar to the previous task, but there is no supervision. In other words, we don’t provide an answer; we only provide a context, and the model generates questions from it.

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('valhalla/t5-base-e2e-qg')
  
model     <- transformers$AutoModelForSeq2SeqLM$from_pretrained('valhalla/t5-base-e2e-qg')


txt = "I had twelwe apples. I ate two apples. Then, I gave five apples to my daughter. I didn't give any apple to my son."

input <- tokenizer$encode(text = txt,
                          return_tensors = 'pt')

output <- model$generate(input,
                         max_length = 50L)

qs <- tokenizer$batch_decode(output)

qs %>%
  as.matrix() %>%
  kbl() %>%
  kable_styling()
<pad> How many apples did I have?<sep> How many apples did I eat?<sep> How many apples did I give to my daughter?<sep> How many apples did I give to my son?<sep></s>

Resources:

Paraphrasing

For a given sentence, these models generate a paraphrase.

transformers <- import('transformers')
torch        <- import('torch')

tokenizer <- transformers$AutoTokenizer$from_pretrained('tuner007/pegasus_paraphrase')
model     <- transformers$PegasusForConditionalGeneration$from_pretrained('tuner007/pegasus_paraphrase')

txt1 <- "Her life spanned years of incredible change for women as they gained more rights than ever before."
txt2 <- "Giraffes like Acacia leaves and hay, and they can consume 75 pounds of food a day."


# Paraphrase the first text

input <- tokenizer$encode(text = txt1, return_tensors = 'pt')

output <- model$generate(input)

tokenizer$batch_decode(output)
[1] "<pad> Her life was filled with change for women as they gained more rights than ever before.</s>"
# Paraphrase the second text

input <- tokenizer$encode(text = txt2, return_tensors = 'pt')

output <- model$generate(input)

tokenizer$batch_decode(output)
[1] "<pad> Giraffes can eat 75 pounds of food a day.</s>"

Resources:

Translation

There are models fine-tuned for the task of translating a text from one language to another. Check the following page for the available translation pairs.

https://huggingface.co/Helsinki-NLP

I provide two examples. The first one is translating from Turkish to English, and the second one is translating from English to Spanish.

Turkish to English

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('Helsinki-NLP/opus-mt-tr-en')
model     <- transformers$AutoModelForSeq2SeqLM$from_pretrained('Helsinki-NLP/opus-mt-tr-en')


txt <- "Merhaba, ben Cengiz. Istanbul'da dogdum."


input <- tokenizer$encode(txt,return_tensors='pt')

output <- model$generate(input)

tokenizer$batch_decode(output)
[1] "<pad> Hi, I'm Genghis, born in Istanbul."

Resources:

English to Spanish

transformers <- import('transformers')

tokenizer <- transformers$AutoTokenizer$from_pretrained('Helsinki-NLP/opus-mt-en-es')
model     <- transformers$AutoModelForSeq2SeqLM$from_pretrained('Helsinki-NLP/opus-mt-en-es')


txt <- "Hi, my name is Cengiz and I live in Eugene."


input <- tokenizer$encode(txt,return_tensors='pt')


output <- model$generate(input)

tokenizer$batch_decode(output)
[1] "<pad> Hola, mi nombre es Cengiz y vivo en Eugene."

Resources:

Zero-shot Classification

In Zero-shot classification, we provide a text string as an input and then provide a class label, which could be anything. Then, the model predicts the probability that the given text belongs to this class label.

For instance, let’s consider the following sentence.

I will see the world one day.

We can say that this sentence is related to, for instance, traveling or exploration, but not related to cooking. We can use a model to test each hypothesis.

# Load the modules, tokenizer, and model

  transformers <- import('transformers')
  
  tokenizer <- transformers$AutoTokenizer$from_pretrained('facebook/bart-large-mnli')
  
  model     <- transformers$AutoModelForSequenceClassification$from_pretrained('facebook/bart-large-mnli')

# Input text
  
txt               = "I will see the world one day."

# Is this text related to cooking?

  input <- tokenizer$encode(text           = txt,
                            text_pair      = 'cooking',
                            return_tensors = 'pt',
                            truncation     = TRUE)

  output <- model(input)

  scores <- output$logits$detach()$tolist()[[1]]

  prob <- exp(scores)/sum(exp(scores))

  prob
[1] 0.906399 0.087287 0.006314
  # the first element is FALSE, not related, prob = 0.906
  # the second element is Neutral, prob = 0.087
  # the third element is TRUE, related, prob = 0.006
  

# Is this text related to travel?

  
  input <- tokenizer$encode(text           = txt,
                            text_pair      = 'travel',
                            return_tensors = 'pt',
                            truncation     = TRUE)
  
  
  output <- model(input)
  
  scores <- output$logits$detach()$tolist()[[1]]
  
  prob <- exp(scores)/sum(exp(scores))
  
  prob
[1] 0.001291 0.228867 0.769842
  # the first element is FALSE, not related, prob = 0.001
  # the second element is Neutral, prob = 0.229
  # the third element is TRUE, related, prob = 0.770
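
To compare several candidate labels at once, we can loop over them and keep the probability of the “related” (entailment) class for each. This is a sketch of my own; note that the official zero-shot pipeline additionally drops the neutral class before re-normalizing, which I skip here:

# Score a text against a vector of candidate labels and return the
# probability that the text is related to (entails) each label

zero_shot <- function(txt, labels){
  
  sapply(labels, function(lab){
    
    input  <- tokenizer$encode(text           = txt,
                               text_pair      = lab,
                               return_tensors = 'pt',
                               truncation     = TRUE)
    
    scores <- model(input)$logits$detach()$tolist()[[1]]
    prob   <- exp(scores)/sum(exp(scores))
    
    prob[3]   # the third element is the 'related' (entailment) probability
  })
}

zero_shot("I will see the world one day.", c('cooking', 'travel', 'exploration'))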

Resources:

Sentence Similarity

The purpose of this task is to compare two texts and produce a similarity score.

For instance, consider the following three sentences:

  1. Today is a sunny day.

  2. That is a happy person.

  3. Weather is really nice.

We can say that Sentences 1 and 3 are more similar to each other, as both are about the weather.

Let’s compute the similarity scores for these three sentences.

transformers <- import('transformers')
torch        <- import('torch')

txt1 <- 'Today is a sunny day'
txt2 <- 'That is a happy person'
txt3 <- 'Weather is really nice'

tokenizer <- transformers$AutoTokenizer$from_pretrained('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
model     <- transformers$AutoModel$from_pretrained('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')


# Create a function to compute sentence embeddings for a given text.
# This function takes a text and generates a numeric representation of it
# in 384 dimensions. The input is plain text and the output is the sentence
# embedding as a vector of 384 numbers

  encode_ <- function(txt){
    
    input       <- tokenizer(txt,padding=TRUE,truncation=TRUE,return_tensors = 'pt')
    output      <- model(input['input_ids'],return_dict=TRUE)
    embeddings  <- output$last_hidden_state$detach()
    input_mask_expanded <- input['attention_mask']$unsqueeze(as.integer(-1))$expand(embeddings$size())$float()
  
    num  <- torch$sum(torch$multiply(embeddings,input_mask_expanded),as.integer(1))
    den  <- torch$clamp(input_mask_expanded$sum(as.integer(1)),min = 1e-9)
    
    emb <- torch$nn$functional$normalize(torch$divide(num,den),
                                p = as.integer(2),
                                dim = as.integer(1))
    
    emb
  }

# Embeddings for the texts

emb1 <- encode_(txt1)
emb1$shape
torch.Size([1, 384])
emb2 <- encode_(txt2)
emb2$shape
torch.Size([1, 384])
emb3 <- encode_(txt3)
emb3$shape
torch.Size([1, 384])
# Compute the similarity between Sentence 1 and Sentence 2

torch$mm(emb1,
         emb2$transpose(0L,1L))$tolist()[[1]]
[1] 0.2588
# Compute the similarity between Sentence 1 and Sentence 3

torch$mm(emb1,
         emb3$transpose(0L,1L))$tolist()[[1]]
[1] 0.5501

The model-predicted similarity score between Sentence 1 and Sentence 2 is 0.259, while the score between Sentence 1 and Sentence 3 is 0.550. So, the model agrees that Sentences 1 and 3 are more similar to each other.
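
Because encode_() returns L2-normalized embeddings, the matrix product above is exactly the cosine similarity. With the three embeddings in hand, we can also build the full pairwise similarity matrix on the R side (a small sketch):

# Stack the three embeddings into a 3 x 384 matrix; since they are
# normalized, the cross product gives all pairwise cosine similarities

embs <- rbind(emb1$numpy(), emb2$numpy(), emb3$numpy())

sim <- embs %*% t(embs)
rownames(sim) <- colnames(sim) <- c('txt1', 'txt2', 'txt3')

round(sim, 3)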

Resources:

Audio

Automatic Speech Recognition

The speech recognition task takes an audio file and transcribes the audio to text. To produce an example of this task, I will use this audio file. You will need to download this file to your local drive.

The following code transcribes the speech in this audio file.

# Load the Python Modules

  torch        <- import('torch')
  librosa      <- import('librosa')
  requests     <- import('requests')
  transformers <- import('transformers')

# Load the Wav2Vec2 tokenizer and the model

  tokenizer <- transformers$Wav2Vec2Tokenizer$from_pretrained('facebook/wav2vec2-base-960h')
  
  model     <- transformers$Wav2Vec2ForCTC$from_pretrained('facebook/wav2vec2-base-960h')

# Load the audio file from local drive
    
  sound.file <- librosa$load(here('_posts/huggingface/Welcome.WAV'),
                             sr = 16000)

# Tokenize the input audio file
  
  input_values = tokenizer(sound.file[1], 
                           return_tensors = "pt")

# Model prediction
  
  logits = model(input_values['input_values'])$logits

  prediction = torch$argmax(logits, dim = -1L)

# Decoding the prediction
  
  transcription = tokenizer$batch_decode(prediction)[[1]]
  
  transcription %>%
    as.matrix() %>%
    kbl() %>%
    kable_styling()
THANK YOU FOR CHOUSING THE OLYMPUS DICTATION MANAGEMENT SYSTEM THE OLYMPUS DICTATION MANAGEMENT SYSTEM GIVES YOU THE POWER TO MANAGE YOUR DICTATIONS TRANSCRIPTIONS AND DOCUMENTS SEEMLESSLY AND TO IMPROVE THE PRODUCTIVITY OF YOUR DAILY WORK FOR EXAMPLE YOU CAN AUTOMATICALLY SENT THE DICTATION FILES OR TRANSCRIBED DOCUMENTS TO YOUR ASSISTANT OR THE AUTHOR VIRE EMALE OR F T P IF YOURE USING THE SPEECH RECOGNITION SOFTWARE THE SPEECH RECOGNITION ENGINE WORKS IN THE BACKGROUND TO SUPPORT YOUR DOCUMENT CREATION WE HOPE YOU ENJOY THE SIMPLE FLEXIBLE RELIABLE AND SECURE SOLUTIONS FROM OLYMPUS

Resources:

Audio Classification

Similar to Text Classification, you can use models to classify audio. Hubert-Large for Emotion Recognition classifies audio into four different emotions: happy, angry, sad, and neutral.

The code below uses the same audio file and predicts the emotion.

torch        <- import('torch')
librosa      <- import('librosa')
transformers <- import('transformers')

tokenizer    <- transformers$Wav2Vec2FeatureExtractor$from_pretrained('superb/hubert-large-superb-er')
model <- transformers$HubertForSequenceClassification$from_pretrained('superb/hubert-large-superb-er')


sound.file <- librosa$load(here('_posts/huggingface/Welcome.WAV'),
                             sr = 16000)


input_values = tokenizer(sound.file[1], 
                         sampling_rate = 16000,
                         padding = TRUE,
                         return_tensors = "pt")


logits = model(input_values['input_values'])$logits
logits <- logits$detach()$tolist()[[1]]
logits
[1] -0.06883  0.21034  0.07505 -2.43493
probs <- exp(logits)/sum(exp(logits))

# Labels

data.frame(labels = unlist(model$config$id2label),
           probs  = probs)
  labels   probs
0    neu 0.28006
1    hap 0.37025
2    ang 0.32340
3    sad 0.02628

Resources:

Computer Vision

Image Classification

The purpose of the Image Classification task is to predict a label for the objects in an image. For this task, I will use Google’s Vision Transformer (ViT) model, the most popular model in this category. This model considers 1,000 different class labels, which you can list via model$config$id2label. Given an image file, the model predicts a probability for each of these labels.

For the demonstration, I will use this image from the COCO dataset (the URL appears in the code below).

[Image: two cats lying on a couch with two remote controls]
# Load the Python modules
  
  transformers <- import('transformers')
  pil          <- import('PIL')
  requests     <- import('requests')
  torch        <- import('torch')
  
# Read the image file from url
  
  url   <- 'http://images.cocodataset.org/val2017/000000039769.jpg'
  
  image <- pil$Image$open(requests$get(url, stream=T)$raw)

# Load the feature extractor and model
  
  feature_extractor = transformers$ViTFeatureExtractor$from_pretrained('google/vit-base-patch16-224')

  model  =  transformers$ViTForImageClassification$from_pretrained('google/vit-base-patch16-224')

# Extract the features from the given image
  
  inputs  <- feature_extractor(images=image,return_tensors='pt')
  inputs['pixel_values']$shape
torch.Size([1, 3, 224, 224])
# Model predictions

  outputs = model(inputs['pixel_values'])
  logits  = outputs$logits
  
  # Softmax transformation of the logits to probabilities
  
  logits_ <- as.numeric(logits$detach()$numpy())
  
  probs   <- exp(logits_)/sum(exp(logits_))
  
  labels  <- as.matrix(unlist(model$config$id2label))
  
# Find the top 5 predicted categories 
  
  locs <- order(probs,decreasing=T)[1:5]
  
  data.frame(class = labels[locs],
            prob  = probs[locs])
                 class      prob
1         Egyptian cat 0.9374417
2     tabby, tabby cat 0.0384426
3            tiger cat 0.0144114
4      lynx, catamount 0.0032743
5 Siamese cat, Siamese 0.0006796
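
The same steps work for any image URL, so here is a small wrapper (a sketch of my own; the function name and the top_n argument are mine):

# Classify an image from a URL and return the top_n most likely labels

classify_image <- function(url, top_n = 5){
  
  image   <- pil$Image$open(requests$get(url, stream = TRUE)$raw)
  inputs  <- feature_extractor(images = image, return_tensors = 'pt')
  
  logits_ <- as.numeric(model(inputs['pixel_values'])$logits$detach()$numpy())
  probs   <- exp(logits_)/sum(exp(logits_))
  
  locs    <- order(probs, decreasing = TRUE)[1:top_n]
  
  data.frame(class = unlist(model$config$id2label)[locs],
             prob  = probs[locs])
}

classify_image('http://images.cocodataset.org/val2017/000000039769.jpg')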

Resources:

Object Detection

The purpose of the Object Detection task is to identify different objects in an image and provide the locations of these objects in the image. For this task, I will use Facebook’s DEtection TRansformer (DETR) model, the most popular model in this category.

This model considers 91 different class labels (the first being N/A), which you can list via model$config$id2label. Given an image file, the model identifies whether any of these classes exist in the image and also provides the locations of these objects.

For the demonstration, I will again use the same image. We can tell that there are two cats, two remotes, one blanket, and one couch in this image. Let’s see if we can recover this information using the DETR model.

# Load the Python modules

  transformers <- import('transformers')
  pil          <- import('PIL')
  requests     <- import('requests')
  torchvision  <- import('torchvision')
  timm         <- import('timm')

# Read the image
    
  url   <- 'http://images.cocodataset.org/val2017/000000039769.jpg'
  
  image <- pil$Image$open(requests$get(url, stream=T)$raw)

# Load the feature extractor and model
  
  feat_ext  <- transformers$DetrFeatureExtractor$from_pretrained('facebook/detr-resnet-50')
  model     <- transformers$DetrForObjectDetection$from_pretrained('facebook/detr-resnet-50')

# Extract the features

  inputs  <- feat_ext(images=image,
                      return_tensors='pt')

# Model predictions
  
  outputs = model(inputs['pixel_values'])
  logits  = outputs$logits$detach()
  bboxes  = outputs$pred_boxes$detach()
  
  logits$shape
torch.Size([1, 100, 92])
  bboxes$shape
torch.Size([1, 100, 4])

This returns two important objects. This part was really the most frustrating, because I couldn’t find any good documentation about what to do with these objects. Luckily, I found this discussion post, and the code included in the original question helped.

https://stackoverflow.com/questions/68350133/facebook-detr-resnet-50-in-huggingface-hub

I will try to explain as much as I understand. The first object includes the predicted logits for 92 classes for each of 100 candidate detections (DETR’s object queries). We first transform these logits into probabilities using a softmax transformation and then format them as a 100 x 92 matrix.

logit_mat <- matrix(nrow = 100, ncol = 92)

for(i in 0:99){
  logit_mat[i+1,] = logits$softmax(-1L)[0][i]$tolist()
}
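
The same matrix can be built without the explicit loop, since tolist() on the softmaxed tensor already returns one row per candidate detection (an equivalent sketch):

# Loop-free construction of the same 100 x 92 probability matrix

logit_mat <- do.call(rbind, lapply(logits$softmax(-1L)[0]$tolist(), unlist))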

As far as I can tell, each of the 100 rows corresponds to one candidate detection, and the 92nd column corresponds to a special “no object” class, which is why it is discarded. That leaves the probabilities for the 91 class labels stored in model$config$id2label.

logit_mat <- logit_mat[,-92]

Finally, we search row by row for any row with a probability higher than a threshold. If we find such a row, we flag it and identify the label corresponding to the column whose probability exceeds the threshold.

threshold <- 0.7
labels    <- unlist(model$config$id2label)

class <- c()
prob  <- c()
id    <- c()

for(i in 1:100){
  
  loc <- which(logit_mat[i,]>threshold)
  
  if(length(loc) !=0){
    
    cl <- labels[as.numeric(names(labels) )==(loc-1)]
    
    class <- c(class,cl)
    prob  <- c(prob,logit_mat[i,loc])
    id    <- c(id,i)
  }
}


data.frame(id    = id,
           class = class,
           prob  = prob)
  id  class   prob
1 38 remote 0.9982
2 58 remote 0.9960
3 60  couch 0.9955
4 62    cat 0.9988
5 99    cat 0.9987

Superb! So, it seems the model correctly identified the cats, remotes, and couch.

What else? The model also returned boundary locations for these objects. Let’s check what they look like.

coord <- bboxes[0][id-1]$numpy()
coord
       [,1]   [,2]    [,3]    [,4]
[1,] 0.1685 0.1967 0.21154 0.09828
[2,] 0.5481 0.2711 0.05482 0.23982
[3,] 0.4998 0.4947 0.99961 0.98461
[4,] 0.2557 0.5448 0.46997 0.87265
[5,] 0.7701 0.4089 0.46089 0.71847

What are these numbers? The four columns are normalized coordinates for (X center, Y center, Width, Height). Also, I noticed that the model imagines the coordinate system as shown below: the origin is at the top-left corner, so the range of the Y-axis is reversed relative to R’s plotting coordinates.

[Image: diagram of the model’s coordinate system, with the origin at the top-left and the Y-axis reversed]

So, these coordinates should be rescaled and rearranged given the actual dimensions of the image. The code below draws a rectangular box around each identified object using these coordinates and labels it.

# Read the image

  library(magick)
Linking to ImageMagick 6.9.12.3
Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
Disabled features: fontconfig, x11
  pic <- image_read('http://images.cocodataset.org/val2017/000000039769.jpg')
  image_info(pic)
# A tibble: 1 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 JPEG     640    480 sRGB       FALSE   173131 72x72  
# Set the actual width and height of the image

  width  <- image_info(pic)$width
  height <- image_info(pic)$height

# Rescale the coordinates according to the actual dimensions
    
  coord[,1] <- coord[,1]*width
  coord[,2] <- coord[,2]*height
  coord[,3] <- coord[,3]*width
  coord[,4] <- coord[,4]*height

# Reverse the Y scale
  
  coord[,2] <- height - coord[,2]

# Finalized coordinates 
# (X center, Y center, Width, Height)

  coord
      [,1]  [,2]   [,3]   [,4]
[1,] 107.9 385.6 135.38  47.17
[2,] 350.8 349.9  35.09 115.11
[3,] 319.9 242.5 639.75 472.61
[4,] 163.6 218.5 300.78 418.87
[5,] 492.9 283.7 294.97 344.87
# Plot the picture in the plot object

  plot(pic)

# For each of the identified object, draw a rectangular box around and label
    
  for(i in 1:nrow(coord)){
    
    rect(xleft   = coord[i,1] - coord[i,3]/2,
         ybottom = coord[i,2] - coord[i,4]/2,
         xright  = coord[i,1] + coord[i,3]/2,
         ytop    = coord[i,2] + coord[i,4]/2,
         border = 'blue',
         lwd    = 4)
    
    text(coord[i,1],coord[i,2],class[i])
  }

Resources:

Citation

For attribution, please cite this work as

Zopluoglu (2022, Jan. 30). Cengiz Zopluoglu: R, Reticulate, and Hugging Face Models. Retrieved from https://github.com/czopluoglu/website/tree/master/docs/posts/huggingface/

BibTeX citation

@misc{zopluoglu2022r,
  author = {Zopluoglu, Cengiz},
  title = {Cengiz Zopluoglu: R, Reticulate, and Hugging Face Models},
  url = {https://github.com/czopluoglu/website/tree/master/docs/posts/huggingface/},
  year = {2022}
}