As interns at the Khalis Foundation, my teammate Simarjot Singh and I (Dilraj Singh) were assigned the project of implementing a voice search feature in the apps developed by the foundation, which include Sikhi To The Max and Sunder Gutka. Our task was to find the transcription API that gives the most accurate results for the Punjabi language.
This documentation explains all the steps we followed, so that fellow developers can more easily find the best transcription API and implement it in their own apps.
The most obvious solution that comes to mind is webkitSpeechRecognition, part of the Web Speech API. It is very easy to implement: no API key is required, it is fast, it generates accurate results most of the time, and it supports Punjabi. The only problem is that it is browser-based. It can easily be used in web applications, but using it in a native environment requires workarounds. It is also not supported by all browsers; it primarily works in Chrome and some Chromium-based browsers. Here is a very simple example you can use to implement this feature.
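This is a minimal sketch, assuming a Chromium-based browser that exposes webkitSpeechRecognition and that pa-IN is the locale tag for Punjabi (India):

```js
// Minimal sketch: Punjabi speech recognition with the Web Speech API.
// Works in Chrome and some Chromium-based browsers; no API key needed.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

recognition.lang = "pa-IN";         // Punjabi (India)
recognition.interimResults = false; // only report final results
recognition.maxAlternatives = 1;

recognition.onresult = (event) => {
  // Transcript of the best alternative of the first result.
  const transcript = event.results[0][0].transcript;
  console.log("Recognized:", transcript);
};

recognition.onerror = (event) => {
  console.error("Speech recognition error:", event.error);
};

// Starts listening; the browser will prompt for microphone permission.
recognition.start();
```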
You can read its complete documentation here.
Bhashini is the Indian government's official AI platform for Indian languages. It supports most Indian languages and provides several services such as ASR (Automatic Speech Recognition) and text-to-speech. You can see the complete list of services and their supported languages.
First, you have to register with Bhashini. Follow the steps on this page to register and get an API key and user ID; both are required for API calls. See the language codes here.
The available pipelines are:

| Pipeline Name | Pipeline ID | Available Services |
|---|---|---|
| MeitY | 64392f96daac500b55c543cd | |
| AI4Bharat | 643930aa521a4b1ba0f4c41d | |
Pass your credentials as request headers:

| Key | Value |
|---|---|
| userID | Your user ID |
| ulcaApiKey | Your API key |
Now you should have an API key, a user ID, and a pipeline ID. Next you need the endpoint and the request format. Send a POST request (with the headers above) to the following endpoint:

https://meity-auth.ulcacontrib.org/ulca/apis/v0/model/getModelsPipeline

with this request body:
```json
{
"pipelineTasks": [
{
"taskType": "asr",
"config": {
"language": {
"sourceLanguage": "xx"
}
}
},
{
"taskType": "translation",
"config": {
"language": {
"sourceLanguage": "xx",
"targetLanguage": "yy"
}
}
},
{
"taskType": "tts",
"config": {
"language": {
"sourceLanguage": "yy"
}
}
}
],
"pipelineRequestConfig": {
"pipelineId" : "xxxx8d51ae52cxxxxxxxx"
}
}
```
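As a minimal sketch, assuming fetch is available (Node 18+ or a browser) and that "pa" is the Punjabi language code from the list linked above, the config call might look like this:

```js
// Sketch: query the pipeline for an ASR config for Punjabi.
// USER_ID and API_KEY come from your Bhashini registration.
async function getPipelineConfig(USER_ID, API_KEY) {
  const response = await fetch(
    "https://meity-auth.ulcacontrib.org/ulca/apis/v0/model/getModelsPipeline",
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        userID: USER_ID,
        ulcaApiKey: API_KEY,
      },
      body: JSON.stringify({
        pipelineTasks: [
          { taskType: "asr", config: { language: { sourceLanguage: "pa" } } },
        ],
        // MeitY pipeline ID from the table above
        pipelineRequestConfig: { pipelineId: "64392f96daac500b55c543cd" },
      }),
    }
  );
  if (!response.ok) throw new Error(`Pipeline config failed: ${response.status}`);
  return response.json();
}
```

A successful request returns a long response like this one (this example shows a Gujarati source with a Bengali translation target):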
```json
{
"languages": [
{
"sourceLanguage": "gu",
"targetLanguageList": [
"bn"
]
}
],
"pipelineResponseConfig": [
{
"taskType": "asr",
"config": [
{
"serviceId": "ai4bharat/conformer-multilingual-indo_aryan-gpu--t4",
"modelId": "6411746056e9de23f65b5425",
"language": {
"sourceLanguage": "gu"
},
"domain": [
"general"
]
}
]
},
{
"taskType": "translation",
"config": [
{
"serviceId": "ai4bharat/indictrans-fairseq-i2i-gpu--t4",
"modelId": "62023eeb3fc51c3fe32b8c5b",
"language": {
"sourceLanguage": "gu",
"targetLanguage": "bn"
}
}
]
},
{
"taskType": "tts",
"config": [
{
"serviceId": "ai4bharat/indic-tts-coqui-indo_aryan-gpu--t4",
"modelId": "636e60e586369150cb00432a",
"language": {
"sourceLanguage": "bn"
},
"supportedVoices": [
"male",
"female"
]
}
]
}
],
"pipelineInferenceAPIEndPoint": {
"callbackUrl": "https://dhruva-api.bhashini.gov.in/services/inference/pipeline",
"inferenceApiKey": {
"name": "Authorization",
"value": "m-LTAzxQVp6jjznmSR5RgKM"
},
"isMultilingualEnabled": true,
"isSyncApi": true
}
}
```
Now, I know this looks scary and there is a lot to keep track of, but let's understand it in a simple way. We basically asked the pipeline, "Can we use service X for language Y?"; in other words, we configured the pipeline for a specific task. If that task is not available in that language, or in that specific order, you will receive a "400 BAD REQUEST" error response. Normally the pipeline responds with a very long response, but we are interested in only 4 things in it:

- the serviceId of the ASR task (under pipelineResponseConfig)
- the callbackUrl (under pipelineInferenceAPIEndPoint)
- the name of the inferenceApiKey (the auth parameter key)
- the value of the inferenceApiKey (the auth parameter value)
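As a sketch, assuming config holds the parsed pipeline response shown above, these four values can be pulled out like this:

```js
// Sketch: extract the four values we need from the pipeline response.
const asrTask = config.pipelineResponseConfig.find((t) => t.taskType === "asr");
const serviceId = asrTask.config[0].serviceId;
const callbackUrl = config.pipelineInferenceAPIEndPoint.callbackUrl;
const authName = config.pipelineInferenceAPIEndPoint.inferenceApiKey.name;
const authValue = config.pipelineInferenceAPIEndPoint.inferenceApiKey.value;
```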
The callbackUrl is the new endpoint where we will send the final API call. For this we need the three other things from the list above, plus some audio data. You need to convert the audio into Base64 and send that Base64 string to the endpoint; a conversion sketch follows.
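Here is a minimal sketch of that conversion in the browser, assuming the audio is available as a Blob (for example from MediaRecorder); in Node you could use Buffer.from(arrayBuffer).toString("base64") instead:

```js
// Sketch: convert a recorded audio Blob to a Base64 string in the browser.
async function blobToBase64(blob) {
  const buffer = await blob.arrayBuffer();
  const bytes = new Uint8Array(buffer);
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary); // this string goes into "audioContent" below
}
```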
| Key | Value |
|---|---|
| auth parameter key | the key (name) obtained from the pipeline response |
| auth parameter value | the value obtained from the pipeline response |

The endpoint is the callbackUrl obtained from the pipeline response, and the request body looks like this:
```json
{
"pipelineTasks": [
{
"taskType": "asr",
"config": {
"language": {
"sourceLanguage": "xx"
},
"serviceId": "xxxxx--ssssss-d-ddd--dddd",
"audioFormat": "wav",
"samplingRate": 16000
}
}
],
"inputData": {
"input": [
{
"source": null
}
],
"audio": [
{
"audioContent": "{{generated_base64_content}}"
}
]
}
}
```
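Putting it together, a minimal sketch of the final call, assuming fetch and the values extracted earlier; the path to the transcript in the response below is an assumption, so inspect the response you actually get to confirm it:

```js
// Sketch: send the final ASR request to the callbackUrl, using the auth
// header name/value and serviceId obtained earlier. requestBody is the
// JSON shown above, with your serviceId and Base64 audio filled in.
async function transcribe(callbackUrl, authName, authValue, requestBody) {
  const response = await fetch(callbackUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json", [authName]: authValue },
    body: JSON.stringify(requestBody),
  });
  if (!response.ok) throw new Error(`ASR request failed: ${response.status}`);
  const data = await response.json();
  // Assumption: the transcript comes back at pipelineResponse[0].output[0].source.
  return data.pipelineResponse?.[0]?.output?.[0]?.source;
}
```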
Finally, after doing all this, the API's response is neither fast nor accurate. You can check its implementation in this repo. Obviously this was very complicated; I tried my best to keep it as simple as possible and left out some details that are not important for beginners.
If you hate your life, you decided to implement this API anyway, and you are getting an error that is not mentioned in my brief documentation, you can read the complete documentation on this website.