

Introduction
A few months ago, we released YubiBERT, a tiny BERT model for fintech that also supports 13 Indian regional languages. We built several downstream NLP models on top of it for internal use and released a few publicly.
Then ChatGPT arrived, first with GPT-3.5 and now with GPT-4, and the world seems to have gone crazy over it. The general feeling in the DS/AI community is that such generative large language models can solve any problem in any project through a chat interface alone.
Traditionally, using a generative model when discriminative models are available was not considered good practice, and model hallucination was, and still is, a bigger problem: generative models often produce information that is untrue or only partially true. With newer reinforcement learning techniques, OpenAI has tried to control hallucination and preserve factual content.
So we thought: why not check how YubiBERT stands against a behemoth like ChatGPT on our specific tasks, and determine where it works better and where it fails? This study should also help DS teams that do not have budgets of hundreds of millions for training LLMs and are actually solving problems in smaller domains or contexts. For paid services like OpenAI's, inference cost exceeds training cost, and data has to be shared over the public internet, which many companies like ours might not want to do.
Please note: the datasets were prepared in the context of Indian regional languages and the Indian fintech domain, and expectations are set accordingly. With your own datasets and domains, you might see different results.
Sentiment Analysis on Fintech News
Fintech Sentiment Analysis with Two Classes (Positive, Negative)
- The task is to determine if the given text is positive or negative news for a given company.
- The YubiBERT-e8-small model was fine-tuned with 2 classes.
- Samples for each class were provided to ChatGPT in the initial interactions.
- Prompt Used – Can you detect positive or negative sentiment for the following text —– <TEXT>
- ChatGPT performs well; performance could likely be improved further with better prompts and more examples. A minimal API sketch follows the results table below.
| Model | Sample Count | Matched Classes | Accuracy |
| --- | --- | --- | --- |
| YubiBERT Finetuned model | 498 | 479 | 96.18% |
| ChatGPT with prompts | 498 | 473 | 94.97% |
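For reference, the sketch below shows how the ChatGPT side of this comparison can be queried programmatically. It assumes the openai Python package (v1+) with an API key in the environment; the model name and zero temperature are illustrative choices, not part of the original evaluation setup.

```python
# Minimal sketch: asking ChatGPT for 2-class sentiment with the prompt above.
# Assumes `pip install openai` (v1+) and OPENAI_API_KEY set in the environment;
# the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def chatgpt_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": "Can you detect positive or negative sentiment "
                       f"for the following text ----- {text}",
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(chatgpt_sentiment("XYZ Finance reports record quarterly profits."))
```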
Fintech Sentiment Analysis with Three Classes (Positive, Negative, Neutral)
- The YubiBERT-e8-small model was fine-tuned with 3 classes. An inference sketch for the fine-tuned classifier follows the results table below.
- Samples for each class were provided to ChatGPT in the initial interactions.
- Prompt Used – Can you detect positive or negative or neutral sentiment for the following text —– <TEXT>
| Model | Sample Count | Matched Classes | Accuracy |
| --- | --- | --- | --- |
| YubiBERT Finetuned model | 701 | 577 | 82.31% |
| ChatGPT with prompts | 701 | 343 | 48.93% |
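For the YubiBERT side, a fine-tuned checkpoint is used like any Hugging Face sequence classifier. The sketch below assumes a hypothetical local path, since the fine-tuned YubiBERT-e8-small checkpoints are internal; any public text-classification checkpoint will run the same code.

```python
# Minimal sketch: sentiment inference with a fine-tuned YubiBERT classifier via
# Hugging Face transformers. The model path is a placeholder for our internal
# checkpoint; device=-1 runs the model on CPU.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="path/to/yubibert-e8-small-sentiment",
    device=-1,
)

print(classifier("XYZ Finance defaults on bond repayments."))
# e.g. [{'label': 'negative', 'score': 0.97}]
```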
Text Language Detection
- The task is to determine the language of a given text, and also to check whether Roman transliterations are detected correctly.
- We used the YubiBERT fine-tuned model yulan-e4-v2.
- With ChatGPT, the following prompt was used – Find the language of each of the given texts and please also mention the script in bracket —- [1] TEXT1 [2] TEXT2 [3] TEXT3 [4] TEXT4 (a batching sketch follows the results tables below).
- ChatGPT confuses languages with similar scripts when they appear in Roman transliteration:
  - Bengali & Assamese
  - Bengali & Odia
  - Gujarati & Marathi & Hindi
  - Kannada & Telugu
- ChatGPT struggles to report the script of transliterated text: for Romanised input it returns the language's native script instead of the Roman script. Sample output for Roman-transliterated text looks like -> [0] Bengali (Bengali script) [1] Hindi (Devanagari script) [2] Bengali (Bengali script) [3] Telugu (Telugu script)
- Passing additional samples did not improve the performance.
Indian Regional Text Set
| Model | Match | Total | Accuracy |
| --- | --- | --- | --- |
| ChatGPT | 178 | 224 | 79.46% |
| YuLan | 214 | 224 | 95.54% |
Indian Regional Transliterated Text Set
| Model | Match | Total | Accuracy |
| --- | --- | --- | --- |
| ChatGPT | 13 | 126 | 10.32% |
| YuLan | 118 | 126 | 93.65% |
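The sketch below shows how the batched language-detection prompt above can be assembled and sent; the numbering mirrors the prompt, while the model name and request settings are assumptions on our part.

```python
# Minimal sketch: batching texts into the language-detection prompt used above.
# Assumes openai (v1+) with OPENAI_API_KEY set; model name is illustrative.
from openai import OpenAI

client = OpenAI()

def detect_languages(texts: list[str]) -> str:
    numbered = " ".join(f"[{i + 1}] {text}" for i, text in enumerate(texts))
    prompt = (
        "Find the language of each of the given texts and please also "
        f"mention the script in bracket ---- {numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Roman-transliterated Hindi and Bengali samples.
print(detect_languages(["aap kaise hain", "tumi kemon acho"]))
```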
Fill In The Blanks
- We masked one word in each text and passed it to both the YubiBERT-e8-small and ChatGPT models; the task is to fill in the blank. A masked-fill sketch follows the results table below.
- Prompt used with ChatGPT → Fill in the blanks —– TEXT with <mask>
- Example → Fill in the blanks —– Ratan Tata and <mask> Ambani are pillars of the Indian economic system.
- ChatGPT completely outperformed YubiBERT at generating factual content. To be fair, the RoBERTa architecture on which YubiBERT is based was never meant to be factual in the first place.
- If we only require a semantically and syntactically correct fill, both models did well.
- YubiBERT often fills in pronouns instead of nouns; the sentences look correct but are not fully sensible or factual.
| Model | Match | Total | Accuracy | Factual Match | Factual Accuracy |
| --- | --- | --- | --- | --- | --- |
| ChatGPT | 139 | 140 | 99.28% | 135 | 96.42% |
| YubiBERT | 133 | 140 | 95% | 57 | 40.71% |
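The masked-fill task on the YubiBERT side maps directly onto the standard fill-mask pipeline, sketched below. The model path is a placeholder for the internal YubiBERT-e8-small checkpoint; any public RoBERTa-style checkpoint will run the same code.

```python
# Minimal sketch: masked-word fill with a RoBERTa-style model (YubiBERT's base
# architecture) via Hugging Face transformers. The model path is a placeholder;
# <mask> is the RoBERTa mask token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="path/to/yubibert-e8-small")

text = "Ratan Tata and <mask> Ambani are pillars of the Indian economic system."
for candidate in fill_mask(text, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```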
Document Type Detection
- Given an uploaded document, we have to detect the type of each page.
- This is a typical task the CV team performs before further downstream tasks, and hence very important.
- We chose ~200 random samples for the test and had to anonymise the data (names and numbers) for privacy reasons.
- The YubiBERT-e8-small model was fine-tuned, while ChatGPT was given context before the examples to categorise were passed in.
- The prompt used for ChatGPT is – Return matching class label from the following list to the text following it [“aoa”, “auditor_report”, “balance_sheet”, “bank_statement”, “cashflow”, “certificate_of_incorporation”, “entity_pan”, “gst_certificate”, “gst_return”, “guarantee_deed”, “itr”, “loan_agreement”, “moa”, “others”, “partnership_deed”, “profit_loss”, “promoter_aadhar”, “promoter_pan”, “schedules”, “sub_class”, “udyam_registration_certificate”, “utility_bill”, “vendor_term_sheet”] and text is —– TEXT CONTENT
- ChatGPT performs worst when the true category is others: it does not propose new classes, and it keeps selecting a label from the list but never the others class.
- It also confuses document types when few text sentences are available and lots of numbers are present.
- I guess that with fine-tuning ChatGPT can do better; I was not even expecting it to reach 80% or more zero-shot. A prompt sketch follows the results table below.
| Model | Match | Total | Accuracy |
| --- | --- | --- | --- |
| ChatGPT | 161 | 200 | 80.5% |
| YubiBERT Finetuned model | 178 | 200 | 89% |
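A rough sketch of the zero-shot page classification call is shown below; the label list is taken from the prompt above, while the model name, temperature, and normalisation of the reply are assumptions.

```python
# Minimal sketch: zero-shot page-type classification with the label-list prompt
# above. Assumes openai (v1+) with OPENAI_API_KEY set; model name is illustrative.
from openai import OpenAI

LABELS = [
    "aoa", "auditor_report", "balance_sheet", "bank_statement", "cashflow",
    "certificate_of_incorporation", "entity_pan", "gst_certificate", "gst_return",
    "guarantee_deed", "itr", "loan_agreement", "moa", "others", "partnership_deed",
    "profit_loss", "promoter_aadhar", "promoter_pan", "schedules", "sub_class",
    "udyam_registration_certificate", "utility_bill", "vendor_term_sheet",
]

client = OpenAI()

def classify_page(page_text: str) -> str:
    prompt = (
        "Return matching class label from the following list to the text "
        f"following it {LABELS} and text is ----- {page_text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()
```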
Final Thoughts
YubiBERT vs ChatGPT
- Not a fair fight: these are two different styles of models. One is meant for a specific domain with limited language knowledge, while the other is a one-for-all, jack-of-all-trades model.
- On Yubi-related tasks YubiBERT seems to be winning, but I am sure that if we fine-tune ChatGPT or GPT-4 with more examples through the OpenAI Playground, those models can equal or surpass our performance.
- Still, given our initial findings with YubiBERT, there is merit in fine-tuning and descoping models. From that perspective, even for generation use cases, a model fine-tuned for a localized use case could fit the problem better.
Cost Analysis
- YubiBERT was trained on a budget of only 2,000 USD. We can also run inference on the fine-tuned models directly on CPU instances; it is dirt-cheap even with c5 and g4dn instances!
- At first glance, 20 USD per month for a ChatGPT Plus account looks like a cheap deal. But is it? It has a limit of 25 messages per 3 hours; assuming we work 8 hours a day, we cannot even send 100 queries to ChatGPT. [reference link]
- Coming back to fine-tuning: training looks cheap, but inference is not.
- Inference on existing models
- A call to GPT-4, which is a base model, would cost $0.06 to $0.12 per 1,000 tokens, or roughly per request. Among the fine-tunable models I am considering only the top two, Curie and Davinci. This is the minimum cost per inference call.
- Example – Summarisation tasks
- Let's say we have a document with 500 words, roughly ~1,000 tokens. It would cost $0.06 to $0.12 per such request.
- If we summarise 10,000 documents, the cost would be $600 to $1,200.
- Training & Inference custom models
- Fine-tuning is done on the InstructGPT-style architectures; the better models are Curie and Davinci.
- Example – News Sentiment Analysis Task
- Let's say we have 10,000 text samples tagged with our own sentiment classes, at roughly 150 words / ~300 tokens per sample. The assumption is that we need less data to train now, since GPT already knows the context at some level (otherwise, 10k samples would be considered too little for training).
- The ~3,000,000 training tokens (10,000 samples x ~300 tokens) would cost from $9 to $90 for good models like Curie and Davinci. That is for one training run; we might try it 3-4 times, so roughly $400. Training costs are still very low!
- After that, each inference on the fine-tuned model costs $0.012 to $0.12 per 1,000 tokens, so 10,000 inference calls would cost $36 to $360 depending on the model type. This is a recurring cost, by the way, applicable to all NLP-related projects. A back-of-the-envelope check of these figures follows this list.
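As a sanity check on the arithmetic above, the short script below reproduces the quoted figures. The per-1K-token rates are the ones implied by the bullets, not official quotes, and should be verified against OpenAI's current pricing.

```python
# Back-of-the-envelope check of the fine-tuning cost example above. The per-1K
# token rates are implied by the figures quoted in this post (Curie vs Davinci);
# verify against OpenAI's current pricing before relying on them.
SAMPLES = 10_000
TOKENS_PER_SAMPLE = 300

TRAIN_RATE = {"curie": 0.003, "davinci": 0.03}   # $ per 1K training tokens
INFER_RATE = {"curie": 0.012, "davinci": 0.12}   # $ per 1K inference tokens

train_tokens = SAMPLES * TOKENS_PER_SAMPLE       # ~3,000,000 tokens
for model in ("curie", "davinci"):
    train_cost = train_tokens / 1000 * TRAIN_RATE[model]
    infer_cost = SAMPLES * TOKENS_PER_SAMPLE / 1000 * INFER_RATE[model]
    print(f"{model}: one training run ~${train_cost:.0f}, "
          f"10k inference calls ~${infer_cost:.0f}")

# curie: one training run ~$9, 10k inference calls ~$36
# davinci: one training run ~$90, 10k inference calls ~$360
```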
Data Privacy
- As a fintech company, we would not want to share textual data with an external chatbot.
- One concern is the privacy breach and the loss of user trust at Yubi; the second is that ChatGPT may keep learning from our data, treating it as just another chat history.
- It could then surface this information in other chats when asked about it in a carefully crafted question.
- So companies for which data privacy is important might want to avoid online generative models like ChatGPT.
- We cannot download the models; the APIs can only be accessed through the cloud platform provided by OpenAI.
- More remains to be explored as OpenAI releases further documentation of its stack.