In the second week, our task was to find custom models that take audio input and produce a Punjabi transcript as output. For this, I started reading research papers on Punjabi ASR. Here is the list of papers that I read.

I read the documentation of Llama 3.1, Wav2Vec 2.0, and IndicWav2Vec. Llama 3.1 is a large language model, while Wav2Vec 2.0 and IndicWav2Vec are speech models meant for ASR. We could use such models in our app, and all of them can be downloaded from the Hugging Face Hub. Llama 3.1, developed by Meta, is one of the largest and most capable open LLMs; its biggest variant has 405 billion parameters. IndicWav2Vec, like Bhashini AI, is trained specifically on Indian languages, but it covers only 9 of them. None of these models supported Punjabi, so this line of work came to a dead end.
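As a rough illustration of how such a model would be used from the Hugging Face Hub, here is a minimal sketch with the transformers ASR pipeline. The checkpoint name is only a placeholder (an English Wav2Vec 2.0 model); a Punjabi-capable checkpoint would have to be substituted if one were found.

```python
# Minimal sketch: run an ASR model downloaded from the Hugging Face Hub.
# The model id is illustrative only, not a Punjabi model.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",  # placeholder English checkpoint
)

# Transcribe a local audio file (16 kHz mono WAV works best for Wav2Vec 2.0).
result = asr("sample_clip.wav")
print(result["text"])
```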

However, I did find a dataset with almost 40,000 examples, each containing an audio clip, a Punjabi transcript, and an English translation. The dataset is about 10.9 GB. Here is the link to the dataset. What makes it special is that its audio samples come from old Punjabi newscasts, so these are real audio clips by real people.
In another dataset I found more than 125,000 examples with audio input and a Punjabi transcript; it is about 60 GB. This is the link to that dataset. Its audio clips are AI-generated, collected using text-to-speech, so the downside is that a model trained on it may overfit to the synthetic voice and generalize poorly to real speech.
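For datasets of this size, it helps to stream instead of downloading everything up front. The sketch below shows how a speech dataset hosted on the Hugging Face Hub could be inspected that way; the dataset id and the column names ("audio", "transcript") are placeholders, not the actual datasets mentioned above.

```python
# Sketch: stream a large speech dataset from the Hugging Face Hub
# so the full 60 GB does not have to be downloaded just to inspect it.
from datasets import load_dataset

ds = load_dataset("some-org/punjabi-asr-dataset", split="train", streaming=True)

# Look at the first few examples without pulling the whole dataset.
for example in ds.take(3):
    audio = example["audio"]  # dict with "array" and "sampling_rate"
    print(example["transcript"], audio["sampling_rate"])
```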

I also tried to find custom models available on GitHub and did find one. This is the link to one of the articles. However, it was not relevant to our project.

We also tried to implement OpenAI Whisper and Google's speech-to-text API, but at the time we did not have API keys, as these are paid services. They will be explained further in Week 3.
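As a preview of that direction, this is a small sketch of what the Whisper API call would look like once a key is available. It assumes the paid OpenAI API with OPENAI_API_KEY set in the environment; the full setup is covered in Week 3.

```python
# Sketch: transcribe a Punjabi audio clip with the OpenAI Whisper API.
# Assumes OPENAI_API_KEY is set in the environment (paid service).
from openai import OpenAI

client = OpenAI()

with open("sample_clip.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="pa",  # ISO 639-1 code for Punjabi
    )

print(transcript.text)
```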