YubiBERT is a language model for FinTechs, based on a customised micro version of the RoBERTa architecture. It has been pre-trained on English and various Indian regional language corpora collected at Yubi.
We evaluated YubiBERT on six different downstream tasks that are specific to Yubi's use cases. On most of them it improved accuracy over our previous approaches built on SOTA models, which confirms the effectiveness of pretrained language models built specifically for FinTech.
Limitations With Current SOTA Models
Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have been trained either on English data or on a concatenation of data in multiple languages.
The contribution of other languages, especially Indian regional languages, to most of these models is almost negligible.
Apart from this, a finance-specific BERT model, FinBERT, already exists. It is a very good model to start with, but it has a few issues.
The model is trained on relatively little data and does not include finance-related pages from Wikipedia.
Customer communication (feedback, calls, reviews), which is essential data for FinTech, is missing, as are financial books and reports. But the major gap for us is that it does not support Indian regional languages.
For this reason, we have had to rely on other SOTA multilingual models, such as BERT and RoBERTa, and fine-tune them for our use cases.
We also regularly try IndicBERT and MuRIL in our projects but, surprisingly, they have given weak results in many experiments compared to BERT and RoBERTa models. We have yet to investigate why.
Why Did We Pretrain?
At Yubi, we collected more than 220 GB of text data by scraping public fintech-related websites in English and Indian regional languages. We also had a lot of financial reports and PDFs, along with customer communication data in the form of telephonic communication, reviews, and comments.
We also added English Wikipedia data to the mix to make sure we did not miss anything during scraping. With this much domain data, it made more sense to train from scratch than to fine-tune previous SOTA models.
Apart from that, we wanted to be practical and wanted the model to run on CPUs with as little inference latency as possible. So we reduced the original RoBERTa architecture from 12 encoders to just 4 and also changed many other parameters along the way, including the number of attention heads, the attention-layer size, and the peak learning rate.
Hence, we call the current version of the model YubiBERT-micro-encoder4. We have also trained the language model with 8 encoders and call it YubiBERT-small-encoder8.
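To get a feel for how much the depth reduction alone shrinks a model, the sketch below counts the parameters of a RoBERTa-style encoder at 12, 8, and 4 encoders. The vocabulary size, hidden size, feed-forward size, and position count are the public RoBERTa-base defaults and are used here purely as assumptions; YubiBERT's actual values for these are not stated in this post. Only the encoder counts come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EncoderConfig:
    """Shape of a RoBERTa-style encoder (illustrative, not YubiBERT's real config)."""
    vocab_size: int = 50_265        # RoBERTa-base default (assumption)
    hidden_size: int = 768          # RoBERTa-base default (assumption)
    intermediate_size: int = 3_072  # feed-forward width (assumption)
    max_positions: int = 514        # RoBERTa-base default (assumption)
    num_layers: int = 12


def layer_params(cfg: EncoderConfig) -> int:
    """Weights and biases in one transformer encoder layer."""
    h, f = cfg.hidden_size, cfg.intermediate_size
    attention = 4 * (h * h + h)               # query, key, value, output projections
    feed_forward = (h * f + f) + (f * h + h)  # two dense layers
    layer_norms = 2 * (2 * h)                 # two LayerNorms, each with scale and bias
    return attention + feed_forward + layer_norms


def total_params(cfg: EncoderConfig) -> int:
    """Embeddings + encoder stack + pooler."""
    h = cfg.hidden_size
    # word, position and (single) token-type embeddings, plus the embedding LayerNorm
    embeddings = (cfg.vocab_size + cfg.max_positions + 1) * h + 2 * h
    pooler = h * h + h
    return embeddings + cfg.num_layers * layer_params(cfg) + pooler


base = EncoderConfig()
for layers in (12, 8, 4):
    cfg = replace(base, num_layers=layers)
    print(f"{layers:>2} encoders: {total_params(cfg):,} parameters")
```

In this sketch the head count does not appear, because for a fixed hidden size it only changes how the hidden dimension is split across heads, not the number of weights. Under the same assumed hidden size, a ~100k-token vocabulary would make the embedding table a larger share of a 4-encoder model than the encoder stack itself.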
In creating the language model, we built our own tokenizer using the SentencePiece library rather than reusing an existing SOTA model's tokenizer. The current vocabulary size is ~100k tokens and it supports the following languages: English, Indian-English, Hindi, Assamese, Bengali, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Tamil, Telugu, Urdu, Nepali, and Marathi.
Please note that both the YubiBERT and YubiTokenizer models are trained on lower-cased text. We also found that SentencePiece models cannot be used directly in HuggingFace training modules because of portability issues. Hence, we had to create HuggingFace-supported tokenizer versions so that anyone can use them in a model training pipeline.
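As a rough illustration of what a byte-pair-encoding tokenizer over lower-cased text does, here is a toy, standard-library-only sketch. It is not the SentencePiece library and not Yubi's training code: the function names, the three-sentence corpus, and the merge budget are all made up for the example. It lower-cases the input, marks word starts with `▁` in the SentencePiece style, and greedily merges the most frequent adjacent symbol pairs.

```python
from collections import Counter


def to_symbols(text):
    """Lower-case the text and split each word into characters, marking word starts with '▁'."""
    return [tuple("▁" + word) for word in text.lower().split()]


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_merges(corpus, num_merges):
    """Greedily learn up to `num_merges` merge rules from an iterable of sentences."""
    words = Counter(symbols for line in corpus for symbols in to_symbols(line))
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges


def tokenize(text, merges):
    """Apply the learned merges, in order, to new text."""
    tokens = []
    for symbols in to_symbols(text):
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "Debt solution for any debt",
    "A debt collection system",
    "Investment and collection",
]
merges = learn_merges(corpus, num_merges=30)
print(tokenize("Debt Collection", merges))
```

Because the text is lower-cased before any symbols are formed, "Debt", "DEBT", and "debt" all map to the same tokens, which is why input to a lower-cased model should be lower-cased the same way at inference time.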
Results on Downstream Fintech Task
As we focus on solving problems for Yubi and the FinTech industry, evaluating on GLUE tasks does not make sense. So we decided to test on our own problems/projects and compare against the best approach we previously had for each.
Some of these are sample tasks created just to validate models. Each fine-tuning run with YubiBERT took 30 minutes on a g4dn.xlarge instance, including data pre-processing steps. Inference times are just a few milliseconds.
Also, for each project we compared against the best approach we had found, rather than against just one or a few selected SOTA models.
|Task||Description||YubiBERT fine-tuned model validation accuracy||Previous best model validation accuracy|
|Intent Detection Multilingual||A task of detecting intent after speech-2-text.||~91% (after 8 epochs)||~87% (using Multilingual SentenceBERT and XLMR)|
|Intent Detection (English + Tamil)||A task of detecting intent after speech-2-text||~90% (after 8 epochs)||~87% (using Language agnostic BERT embeddings and following smaller neural networks), ~54% (using IndicBERT, MuRIL)|
|Sentiment Analysis - news||Detecting positive v/s negative sentiment news texts - with very limited data||~96% (after 16 epochs)||86% (using BERT-base-uncased stopped after 4 epochs)|
|Sentiment Analysis - reviews||Detecting positive v/s negative sentiment reviews by professionals, with very limited data||~98% (after 8 epochs)||98% (after 3 epochs, used Bi-LSTM networks)|
|Document Classification||The task is to classify fintech documents after the OCR step||~90% (after 8 epochs)||78% (with BERT embedding and 5-layer ANN)|
|Name Matching||The task is to compare newly scraped company-name text and check whether it matches any previous entry||~89% (fine-tuned YubiBERT as a siamese network)||84% (String Matching algorithms)|
|Search||Text search on the Yubi platform||76% top-5 using yubi-embeddings, 84% using yubi-tokens||84% top-5 using word search, 83% top-5 using Bert-embeddings|
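The Search row reports top-5 accuracy. The post does not spell out how that figure was computed, so the following is only a minimal sketch of one common definition: the fraction of queries whose expected result appears among the first five returned. The function name and the document ids are hypothetical, and it assumes exactly one expected result per query.

```python
def top_k_accuracy(ranked_results, relevant, k=5):
    """Fraction of queries whose expected item appears in the first k results.

    ranked_results: list of ranked result-id lists, one per query
    relevant: the single expected result id for each query
    """
    if len(ranked_results) != len(relevant):
        raise ValueError("need exactly one expected id per query")
    if not relevant:
        return 0.0
    hits = sum(
        expected in results[:k]
        for results, expected in zip(ranked_results, relevant)
    )
    return hits / len(relevant)


# Hypothetical document ids, for illustration only.
ranked = [
    ["d3", "d7", "d1", "d9", "d2", "d4"],  # expected "d1" is at rank 3: a hit
    ["d5", "d6", "d8", "d2", "d3", "d1"],  # expected "d1" is at rank 6: a miss for k=5
]
expected = ["d1", "d1"]
print(top_k_accuracy(ranked, expected, k=5))  # 0.5
```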
What Did Not Work?
- As the table above shows, using Yubi embeddings directly for text search does not give the best result; even simple word search performed better. The reasons for this are discussed here: [link1] [link2]
- Based on those papers and on various discussions across the DS community, our understanding is that when the architecture is reduced to only 4 or 8 encoders, the embeddings occupy a smaller region of the vector space and different vectors end up close together, which makes them very hard to tell apart. The model works well after fine-tuning, but not without it.
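One simple way to check whether a set of embeddings is crowded together in this sense is to look at the average cosine similarity between all pairs. This is a generic diagnostic sketched under our own assumptions, not the analysis from the linked discussions; the helper names and the toy 3-dimensional vectors are illustrative only. In practice the vectors would be embeddings of unrelated texts.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy vectors standing in for sentence embeddings.
crowded = [[1.0, 0.9, 1.1], [1.1, 1.0, 0.9], [0.9, 1.1, 1.0]]
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_pairwise_cosine(crowded))  # close to 1: vectors are hard to tell apart
print(mean_pairwise_cosine(spread))   # 0.0: vectors are easy to tell apart
```

If unrelated texts score close to 1 on this measure, ranking search results by raw cosine similarity has very little signal to work with, which is consistent with word search beating raw embeddings in the table.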
We are pre-training more versions of YubiBERT with different model architectures and encoder depths, and plan to use more data in future. Our team is also working on various NLP and Vision-related models. We will release them to our public Git repository.
|Yubi-Tokenizer||Sentencepiece byte-pair-encodings, ~100k tokens in 14 languages|
|Yubi-Tokenizer-HF||HuggingFace supported tokenizer, ~100k tokens in 14 languages|
|YubiBERT-micro-e4||4 transformer encoders, 16 attention heads, pre-trained until batch perplexity (ppl) reached ~4.5|
|YubiBERT-small-e8||8 transformer encoders, 12 attention heads, pre-trained until batch perplexity (ppl) reached ~3.5|
We will be sharing these YubiBERT models, content about how to use them, and upcoming projects in NLP and Vision at this public GitHub repository. We hope to make this repository a one-stop resource for data scientists working in the FinTech domain.
YubiBERT-specific models are available here with relevant information and code.
Environment setup and model files download
    conda create -n test python=3.7  ### Or python version > 3.7
    conda activate test
    pip install -r requirements.txt
Load and use YubiTokenizer
    import sys
    sys.path.append("/parent/directory/path/of/yubiai/github/unzipped/")

    from YubiAI.nlp.tokenizer.yubiTokenizer import YubiTokenizer

    tokenizer = YubiTokenizer()
    text = "CredAvenue is building India's first and largest de-facto operating system for the discovery, investment, fulfilment, and collection of any debt solution."
    tokenizer.get_tokens(text)
    ### ['▁cred', 'aven', 'ue', '▁is', '▁building', '▁india', "'", 's', '▁first', '▁and', '▁largest', '▁de', '-', 'fact', 'o', '▁operating', '▁system', '▁for', '▁the', '▁discovery', ',', '▁investment', ',', '▁fulfilment', ',', '▁and', '▁collection', '▁of', '▁any', '▁debt', '▁solution', '.']
Load and use HuggingFace supported YubiTokenizer
    import sys
    sys.path.append("/parent/directory/path/of/yubiai/github/unzipped/")

    from YubiAI.nlp.tokenizer.yubiTokenizer import YubiTokenizerHF

    tokenizer = YubiTokenizerHF()
    text = "CredAvenue is building India's first and largest de-facto operating system for any debt solution."
    tokenizer.get_tokens(text)
    tokenizer.get_tokens_transformer(text)
    ### ['▁cred', 'aven', 'ue', '▁is', '▁building', '▁india', '’', 's', '▁first', '▁and', '▁largest', '▁de', '-', 'fact', 'o', '▁operating', '▁system', '▁for', '▁any', '▁debt', '▁solution', '.']

    type(tokenizer.model)
    ### <class 'tokenizers.implementations.sentencepiece_bpe.SentencePieceBPETokenizer'>
    type(tokenizer.transformer_model)
    ### <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
Load and use YubiBERT
    import sys
    sys.path.append("/parent/directory/path/of/yubiai/github/unzipped/")

    from YubiAI.nlp.yubiEmbeddings.yubibert import YubiBERT

    ybert = YubiBERT(use_gpu=False, model_type='yubibert_e4_micro')  ## OR yubibert_e8_small

    input_text = "CredAvenue is India's fastest Fintech unicorn."
    print(ybert.getEmbeddings(input_text))
    ### Or use below function to get last-n encoder layers
    ### print(ybert.getEmbeddings_last_n_layers(input_text, last_n_layers=2))

    masked_text = "CredAvenue is <mask>'s fastest Fintech unicorn."
    print(ybert.roberta_fill_in_the_blank_task(masked_text))
    ### [("credavenue is india's fastest fintech unicorn.", 0.73644, ' india'),
    ###  ("credavenue is world's fastest fintech unicorn.", 0.17053, ' world'),
    ###  ("credavenue is america's fastest fintech unicorn.", 0.03784, ' america'),
    ###  ("credavenue is asia's fastest fintech unicorn.", 0.01167, ' asia'),
    ###  ("credavenue is australia's fastest fintech unicorn.", 0.00319, ' australia'),
    ###  ("credavenue is europe's fastest fintech unicorn.", 0.00307, ' europe'),
    ###  ("credavenue is china's fastest fintech unicorn.", 0.00305, ' china'),
    ###  ("credavenue is singapore's fastest fintech unicorn.", 0.00193, ' singapore'),
    ###  ("credavenue is canada's fastest fintech unicorn.", 0.00161, ' canada'),
    ###  ("credavenue is britain's fastest fintech unicorn.", 0.00148, ' britain')]
Want to know more?
Check out more details on how to pretrain your own model here
Understand how to fine-tune these pre-trained models here
Learn how to fine-tune YubiBERT and use its corresponding code here