Navigating Azure AI: An Extensive Guide to Language, Vision, Speech, Decision, and Azure Open AI Services — Part 1

13 min read · May 26, 2023

Introduction

Back in 2013, when I first delved into AI, the landscape was vastly different. Today, it has evolved to an extent that no one had anticipated. Tasks that used to take months or days to complete are now accomplished within minutes. My peers and mentees often ask me a common question — “Is AI taking people’s jobs?” To be honest, I don’t have a definitive answer to this question. At times, I believe it’s creating job opportunities on a large scale. However, when I manage to complete tasks in areas I’m unfamiliar with, it makes me wonder if AI is indeed taking over people’s jobs. Hence, this remains an unanswered question for me.

Let me share a short story with you. Back in 2014, I was working for a startup where my manager, who was also my mentor, had a commendable practice of conducting mandatory code reviews every week (I remember it was every Thursday), irrespective of feature releases. He was serious about code quality and emphasized the importance of writing good, quality code. During these code reviews, we spent a significant amount of time not just improving the quality, but also rectifying numerous errors. We had a large monitor in the conference room, and a larger team would sit together to identify where we were going wrong. Imagine a group of 4–5 people doing this job together.

Fast forward to today, and we have AI assistants like ChatGPT and many more that help us to a certain extent (they still need time to reach that level of correctness, but it is a good start). They can identify bugs, explain errors, and even correct them. I believe they will become even smarter in the future with more Reinforcement Learning from Human Feedback (RLHF). The point of this story is to illustrate the shift from a group of 4–5 people working on a task to individuals accomplishing the same task with the help of Large Language Models (LLMs). I’m not sure if this is taking jobs or creating new ones, because we also see amazing startups entering this space and leveraging Generative AI technologies in their backends.

Today, we have many AI-assisted coding tools in the market. One of them is Microsoft’s GitHub Copilot, and Microsoft has expanded its Copilot offerings across various domains. It’s interesting to see how our customers are leveraging these tools today. In this blog, we will discuss the Azure AI stack end-to-end, covering Vision, Language, Decision, and Open AI services in Azure. This is an excellent time to learn about AI at Microsoft, as we are releasing top-notch services and updates to existing AI services in Azure.

So, readers, fasten your seatbelts as we embark on this series of learning blogs through the rapid-paced world of AI. Let’s get started!

In the first part of the blog, we will discuss most of the Azure AI services, including Vision, Cognitive Services for Natural Language Processing (NLP), and much more. In the second part, we will delve into Azure AI Decision Services, Azure Open AI Services (OAI), and some interesting work done by Microsoft Research in the field of Large Language Models (LLMs) over the years (Research Papers, Semantic Kernel and much more). The third and final part of the blog will focus on how customers are leveraging these services in Azure, featuring real-time customer case studies on GPT-4, ChatGPT, and much more.

Before starting, let me introduce you to Suyash who collaborated with me on writing and contributing to this blog.

Azure AI Services:

Computer Vision Services

1. Image Moderation (SafeSearch, Adult/Offensive Content Detection): Azure Content Moderator uses advanced algorithms and machine learning to analyze images and detect potentially objectionable or unsafe content. It allows you to define your moderation policies, tailoring them to your specific requirements and content guidelines. Here are some key features of Azure Content Moderator’s image moderation capabilities:

Explicit Content Detection
Text Detection and Profanity Filtering
Customizable Moderation Policies
Human-in-the-Loop Review
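
To make the idea of a customizable moderation policy concrete, here is a minimal Python sketch. It assumes the image evaluation result arrives as a JSON object carrying AdultClassificationScore and RacyClassificationScore values between 0 and 1. The ModerationPolicy class, its threshold values, and the decide function are hypothetical names made up for illustration and are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModerationPolicy:
    """Hypothetical thresholds; tune them to your own content guidelines."""
    block_adult_at: float = 0.5
    review_racy_at: float = 0.4


def decide(evaluation: dict, policy: ModerationPolicy = ModerationPolicy()) -> str:
    """Map an image evaluation result to 'block', 'review', or 'allow'."""
    # Missing scores are treated as 0.0 (an assumption of this sketch).
    adult = float(evaluation.get("AdultClassificationScore", 0.0))
    racy = float(evaluation.get("RacyClassificationScore", 0.0))
    if adult >= policy.block_adult_at:
        return "block"
    if racy >= policy.review_racy_at:
        return "review"  # hand the image to a human reviewer
    return "allow"


# Illustrative values only, not real service output.
sample = {"AdultClassificationScore": 0.02, "RacyClassificationScore": 0.61}
print(decide(sample))  # prints "review"
```

Sending borderline images to "review" rather than blocking them outright is one way to combine automated scoring with the human-in-the-loop step listed above.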

2. Custom Vision — If you have specific image recognition needs, Azure Custom Vision allows you to create and train custom models. With this service, you can teach your application to recognize unique objects or classify images according to your specific requirements. It offers a user-friendly interface and supports both image classification and object detection scenarios. It allows you to select classification types and domains. Take a look at the image below:

3. Form Recognizer — Extracting information from forms and documents can be a time-consuming task. Azure Form Recognizer simplifies this process by automatically extracting key-value pairs, tables, and other relevant data from various types of forms. Whether it’s invoices, receipts, or purchase orders, this service can save you significant time and effort in data extraction. There are primarily three types of models/offerings:

Pre-built models: Azure Form Recognizer provides pre-built models that are trained on a wide range of document types. These models are ready to use out of the box and can extract key information from common document formats like invoices, receipts, business cards, and more. The pre-built models are trained on a large dataset, ensuring high accuracy and performance for these specific document types.
Custom models: If you have specific document types or formats that are unique to your organization, Azure Form Recognizer allows you to create custom models. With custom models, you can train the service to understand and extract relevant information from your specific documents. This is particularly useful when dealing with documents that have a distinct structure or layout, such as internal forms or proprietary formats.
Layout models: In addition to extracting structured data, Azure Form Recognizer also offers layout models. These models focus on understanding the layout and structure of documents without extracting specific data fields. Layout models can be useful when you need to analyze the positioning and arrangement of different elements within a document, such as tables, headers, footers, or sections.
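
The choice between the three model types can be sketched as a small lookup. This is a rough illustration, assuming the v3.0 REST route (api-version 2022-08-31) and the documented prebuilt model IDs. The pick_model and analyze_url helpers, the document-type keys, and the example endpoint are hypothetical.

```python
from typing import Optional
from urllib.parse import quote

# Prebuilt model IDs for common document types.
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "business_card": "prebuilt-businessCard",
    "layout": "prebuilt-layout",
}


def pick_model(document_type: str, custom_model_id: Optional[str] = None) -> str:
    """Prefer a custom model, then a prebuilt one, then fall back to layout."""
    if custom_model_id:
        return custom_model_id
    return PREBUILT_MODELS.get(document_type, PREBUILT_MODELS["layout"])


def analyze_url(endpoint: str, document_type: str,
                custom_model_id: Optional[str] = None,
                api_version: str = "2022-08-31") -> str:
    """Build the analyze URL for the chosen model (no request is sent)."""
    model_id = quote(pick_model(document_type, custom_model_id))
    base = endpoint.rstrip("/")
    return f"{base}/formrecognizer/documentModels/{model_id}:analyze?api-version={api_version}"


# The endpoint below is a placeholder, not a real resource.
print(analyze_url("https://example.cognitiveservices.azure.com/", "receipt"))
```

Falling back to the layout model for unknown document types is a design choice of this sketch: you still get the structure of the document even when no field-level model applies.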

Take a look at this table on what type of models can be used for your domain.

Speech Services

1. Speech-to-Text: Azure Speech-to-Text allows you to convert spoken language into written text. It offers high-quality, real-time speech recognition capabilities and supports multiple languages and audio formats. This service can transcribe audio recordings, enable voice commands in applications, or generate subtitles for multimedia content. The following are key features:

Real-time speech-to-text: With real-time speech-to-text, audio is transcribed as speech is recognized from a microphone or file. Use it for applications that need to transcribe audio in real time, such as:

Transcriptions, captions, or subtitles for live meetings
Contact center agent assist
Dictation
Voice agents
Pronunciation assessment

Batch transcription: Batch transcription is used to transcribe a large amount of audio in storage. You can point to audio files with a shared access signature (SAS) URI and asynchronously receive transcription results. Use batch transcription for applications that need to transcribe audio in bulk, such as:

Transcriptions, captions, or subtitles for pre-recorded audio
Contact center post-call analytics
Diarization
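
As a rough sketch of what a batch transcription job looks like, the snippet below builds the JSON body you would POST to the /speechtotext/v3.1/transcriptions route of your Speech resource. The batch_transcription_body helper, the display name, and the SAS URL are placeholders I made up, so check the field names against the API version you actually use.

```python
import json


def batch_transcription_body(content_urls, locale="en-US",
                             display_name="Contact center batch",
                             diarization=False):
    """Build the request body for a batch transcription job (nothing is sent)."""
    urls = list(content_urls)
    if not urls:
        raise ValueError("at least one audio URL (for example a SAS URI) is required")
    return {
        "contentUrls": urls,
        "locale": locale,
        "displayName": display_name,
        "properties": {
            # Separates speakers in the transcript when enabled.
            "diarizationEnabled": diarization,
            "wordLevelTimestampsEnabled": True,
        },
    }


# Placeholder SAS URI; a real one carries a signed query string.
body = batch_transcription_body(
    ["https://example.blob.core.windows.net/audio/call1.wav?sv=PLACEHOLDER"],
    diarization=True,
)
print(json.dumps(body, indent=2))
```

Because the job runs asynchronously, the service responds with a transcription resource that you poll until it completes, and the results are then fetched separately.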

Custom Speech — With Custom Speech, you can evaluate and improve the accuracy of speech recognition for your applications and products. A custom speech model can be used for real-time speech-to-text, speech translation, and batch transcription.

Take a look at these services in the snapshot below:

There are several common scenarios where you can use speech-to-text, and Azure AI gives you the option to use them directly.

As mentioned in the previous section, we have a custom speech option along with some other services that you see below:

2. Text-to-Speech — Azure Text-to-Speech converts written text into natural-sounding speech. This service provides a variety of voices with different styles and languages, allowing you to customize the generated speech to match your application’s needs. You can utilize this service to add voice interfaces, create audio content, or improve accessibility by providing spoken information.

The following are key features:

Neural Text-to-Speech: Azure Text-to-Speech utilizes state-of-the-art neural network models to generate high-quality speech. These models are trained on extensive datasets to capture the nuances of human speech, resulting in more natural and expressive output.
Wide Range of Voices: Azure Text-to-Speech offers a diverse set of voices with various styles, accents, and languages. You can choose from a catalog of pre-built voices that represent different genders, ages, and regions. This allows you to create unique and engaging voice experiences tailored to your application’s requirements.

Custom Voice Creation: In addition to the pre-built voices, Azure Text-to-Speech provides the capability to create custom voices. With the Custom Voice service, you can train a unique voice model using your recordings. This enables you to generate speech that closely resembles a specific person or brand, enhancing personalization and brand consistency.

Integration Options: Azure Text-to-Speech can be easily integrated into various applications and platforms. It provides APIs and SDKs for different programming languages, making it straightforward to incorporate text-to-speech capabilities into your software. Azure Text-to-Speech is also compatible with other Azure services, allowing seamless integration with chatbots, virtual assistants, or other AI-powered applications.
Audio Output Formats: The service offers flexibility in terms of audio output formats. You can choose from a range of formats, including WAV, MP3, and PCM, to suit your specific needs. This allows you to optimize the audio quality, file size, and compatibility with different devices or platforms.
Speech Synthesis Markup Language (SSML) Integration: Azure Text-to-Speech supports the integration of SSML, enabling developers to fine-tune the generated speech by controlling aspects such as pronunciation, prosody, and speech rate. This capability enhances the naturalness and expressiveness of the output.
Multilingual and Multiregional Support: Azure Text-to-Speech provides voices in various languages and accents, making it suitable for global applications. Whether you need speech synthesis in English, Spanish, French, German, or other languages, Azure Text-to-Speech offers a wide range of options to cater to diverse linguistic requirements.
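
To give a feel for the SSML integration mentioned above, here is a small Python sketch that builds an SSML document with a voice and a speaking-rate adjustment. The build_ssml helper is hypothetical, and en-US-JennyNeural is used only as an example voice name. The snippet does not call the service; it only produces the markup you would send to it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US", rate="0%"):
    """Wrap plain text in SSML with a voice and a relative speaking rate."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        # escape() keeps characters such as & and < from breaking the XML.
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</voice></speak>"
    )


ssml = build_ssml("Welcome back, Tom & Jerry fans!", rate="+10%")
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

Escaping the text before embedding it matters in practice, since user-supplied strings often contain ampersands or angle brackets that would otherwise produce invalid SSML.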

3. Azure Translation — Azure Speech Translation combines speech recognition and translation capabilities to enable real-time multilingual communication. It can recognize and translate speech from one language to another, allowing users to have conversations even if they speak different languages. This service is valuable for building applications that require language translation during speech interactions.

The following are key features:

Neural Machine Translation: Azure Translation utilizes state-of-the-art neural machine translation models, which have been trained on vast amounts of multilingual data. These models are designed to capture the complexities of language and provide accurate and natural translations.

Wide Range of Languages: Azure Translation supports a broad set of languages, enabling translation between various language pairs. It includes widely spoken languages such as English, Spanish, French, German, Chinese, and many more. This allows you to localize your applications, websites, or content for a global audience. Below is a subset of the supported languages.

Translation API: The Azure Translation service offers an API that allows developers to easily integrate translation capabilities into their applications. The API provides simple and comprehensive methods for translating text, detecting languages, and retrieving language information.

Text Adaptation: Azure Translation provides the ability to customize and adapt translations according to your specific domain or terminology. You can create translation models using your bilingual data, improving translation accuracy and consistency for industry-specific or domain-specific content.
Language Detection: The service includes language detection capabilities, which automatically identify the language of a given text. This is particularly useful when dealing with multilingual content or when the language is unknown. It helps streamline the translation process by automatically detecting the source language.
Integration with Other Azure Services: Azure Translation seamlessly integrates with other Azure services, allowing you to build end-to-end solutions. For example, you can combine Azure Translation with Azure Cognitive Services, Azure Speech Services, or Azure Bot Service to create intelligent and multilingual applications, chatbots, or voice-enabled interfaces.
Scalability and Performance: Azure Translation is built on Azure’s robust and scalable infrastructure, ensuring high availability, reliability, and performance. It can handle large volumes of translation requests, making it suitable for applications with high traffic or demanding translation needs.
Translator Text Widget: Azure Translation provides a Translator Text Widget that enables easy integration of translation capabilities into websites or applications. The widget allows users to translate content on the fly without leaving the webpage, enhancing user experience and accessibility.
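
The Translation API and language detection features above can be sketched together. The snippet below assumes the Translator v3.0 REST shape: a /translate route on the global endpoint, one "to" parameter per target language, and a JSON array of {"Text": ...} items. The helper names are hypothetical and the sample response is illustrative rather than real service output.

```python
import json
from urllib.parse import urlencode

TRANSLATOR_ENDPOINT = "https://api.cognitive.microsofttranslator.com"


def translate_request(texts, to_languages, source_language=None):
    """Return (url, json_body) for a translate call; nothing is sent."""
    params = [("api-version", "3.0")]
    if source_language:
        # Omit "from" to let the service detect the source language.
        params.append(("from", source_language))
    params += [("to", lang) for lang in to_languages]
    url = f"{TRANSLATOR_ENDPOINT}/translate?{urlencode(params)}"
    body = json.dumps([{"Text": text} for text in texts])
    return url, body


def parse_translations(response):
    """Collect the detected language and the translations per input text."""
    results = []
    for item in response:
        detected = item.get("detectedLanguage", {}).get("language")
        by_language = {t["to"]: t["text"] for t in item["translations"]}
        results.append({"detected": detected, "translations": by_language})
    return results


url, body = translate_request(["Hello"], ["fr", "de"])
print(url)

# Illustrative response in the v3.0 shape.
sample_response = [{
    "detectedLanguage": {"language": "en", "score": 1.0},
    "translations": [{"text": "Bonjour", "to": "fr"}, {"text": "Hallo", "to": "de"}],
}]
print(parse_translations(sample_response))
```

A real call would also need the Ocp-Apim-Subscription-Key header and, for regional resources, the Ocp-Apim-Subscription-Region header.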

Language Services

Text Analytics (Sentiment Analysis, Key Phrase Extraction, Language Detection) — Azure Text Analytics service allows you to extract insights from unstructured text data. It includes a set of APIs for sentiment analysis, key phrase extraction, named entity recognition, language detection, and more. These services leverage machine learning models to process text and provide valuable information for tasks such as social media monitoring, customer feedback analysis, and content categorization. Let me show you what each service looks like when you access it.
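
For readers who prefer code to screenshots, here is a minimal sketch of a sentiment analysis exchange. It assumes the Text Analytics v3.1 REST shape, with a "documents" array in the request and a sentiment label plus confidenceScores in the response. The helper names and the sample response are made up for illustration.

```python
def sentiment_body(texts, language="en"):
    """Build the request body for a sentiment call (nothing is sent)."""
    return {
        "documents": [
            {"id": str(i), "language": language, "text": text}
            for i, text in enumerate(texts, start=1)
        ]
    }


def summarize_sentiment(response):
    """Map each document id to (sentiment label, highest confidence score)."""
    summary = {}
    for doc in response.get("documents", []):
        # max() is used because a "mixed" label has no score of its own.
        top_score = max(doc["confidenceScores"].values())
        summary[doc["id"]] = (doc["sentiment"], top_score)
    return summary


body = sentiment_body(["The support team was great.", "Delivery took forever."])
print(body)

# Illustrative response, not real service output.
sample_response = {
    "documents": [
        {"id": "1", "sentiment": "positive",
         "confidenceScores": {"positive": 0.98, "neutral": 0.01, "negative": 0.01}},
        {"id": "2", "sentiment": "negative",
         "confidenceScores": {"positive": 0.03, "neutral": 0.07, "negative": 0.9}},
    ],
    "errors": [],
}
print(summarize_sentiment(sample_response))
```

The same request body shape works for key phrase extraction and entity recognition in this API family, which makes it easy to run several analyses over the same batch of customer feedback.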

Language Understanding (LUIS): Azure Language Understanding Intelligent Service (LUIS) is an AI service that enables the creation of natural language understanding models for applications. LUIS allows developers to define intents, entities, and utterances and then uses machine learning to train the model to understand and interpret user inputs. LUIS can be used in various scenarios such as chatbots, voice assistants, and smart applications that require understanding and processing of natural language.
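
Here is a small sketch of how an application might act on a LUIS prediction. It assumes a v3-style prediction response with a topIntent name and a per-intent score. The route_intent helper, the BookFlight intent, the 0.6 threshold, and the sample response are hypothetical.

```python
def route_intent(prediction_response, min_score=0.6, fallback="None"):
    """Return the top intent if the model is confident enough, else a fallback."""
    prediction = prediction_response["prediction"]
    name = prediction["topIntent"]
    score = prediction["intents"][name]["score"]
    return name if score >= min_score else fallback


# Illustrative response for the utterance below, not real service output.
sample_response = {
    "query": "book a flight to Paris",
    "prediction": {
        "topIntent": "BookFlight",
        "intents": {"BookFlight": {"score": 0.92}, "None": {"score": 0.03}},
        "entities": {"destination": ["Paris"]},
    },
}

print(route_intent(sample_response))                  # prints "BookFlight"
print(route_intent(sample_response, min_score=0.95))  # prints "None"
```

Routing low-confidence predictions to the "None" intent lets a chatbot ask a clarifying question instead of acting on a guess.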

Named Entity Recognition (NER), Text Translation & Summarizer: Several options for custom NER and custom Text Translation are also available in Azure. Take a look at these below.

QnA Maker: Azure QnA Maker is a service that allows you to easily create question-and-answer systems from your existing content. It enables you to transform FAQs, product manuals, support documents, and other textual information into a knowledge base that can provide accurate answers to user queries. QnA Maker leverages machine learning to understand user questions and retrieve the most relevant answers from the knowledge base. The QnA Maker service is being retired on 31st March 2025. A newer version of this capability, called question answering, is now available as part of Azure Cognitive Service for Language, which we read about in the previous section.

Personalizer: Azure Personalizer is an AI service that helps your applications make smarter decisions at scale. It uses reinforcement learning to analyze information about your application, the situation, and users to determine the best decision to make. By receiving feedback from your application, Personalizer learns and improves its decision-making ability in near-real time.

Personalizer can help in various scenarios to determine the best actions to take:

E-commerce: It can decide which product to show to customers to increase the chances of a purchase.
Content recommendation: It can recommend the most suitable article to improve the click-through rate.
Content design: It can determine the optimal placement of advertisements to enhance user engagement on a website.
Communication: It can suggest the right timing and method to send notifications for maximizing the likelihood of a response.

Azure Personalizer has two primary APIs: Rank API and Reward API.

The Rank API helps you make decisions by obtaining the best action, while the Reward API allows you to provide feedback on the suggested action’s performance, using examples like article clicks or product purchases.

For example, the Rank API can recommend relevant news articles to the user, and the Reward API then records whether the user clicked on the suggested articles.
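
The Rank and Reward loop for the news example can be sketched as two request builders. This assumes the v1.0 REST shape, with a rank body containing contextFeatures and actions, and a reward posted per event ID. The helper names, feature names, and article IDs are hypothetical, and no request is actually sent.

```python
import uuid


def rank_request(context_features, actions, event_id=None):
    """Build the Rank body: who the user is now, and which actions to choose from."""
    return {
        # The event ID ties a later reward back to this ranking decision.
        "eventId": event_id or str(uuid.uuid4()),
        "contextFeatures": context_features,
        "actions": [
            {"id": action_id, "features": features}
            for action_id, features in actions.items()
        ],
    }


def reward_request(event_id, clicked):
    """Build the Reward path and body: 1.0 for a click, 0.0 otherwise."""
    path = f"/personalizer/v1.0/events/{event_id}/reward"
    return path, {"value": 1.0 if clicked else 0.0}


rank_body = rank_request(
    context_features=[{"timeOfDay": "morning"}, {"device": "mobile"}],
    actions={
        "article-sports": [{"topic": "sports"}],
        "article-tech": [{"topic": "technology"}],
    },
)
print(rank_body)
print(reward_request(rank_body["eventId"], clicked=True))
```

A binary click reward is the simplest choice. You could also send fractional values, for example to weight a purchase higher than a click, as long as the scale stays consistent across events.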

Conclusion

In conclusion, Part 1 of this guide has provided an overview of Azure AI services, covering language, vision, and speech. Azure offers a powerful suite of tools for diverse AI challenges, from text analytics to image recognition and more. With scalable and customizable services, Azure empowers businesses to drive innovation and gain a competitive edge. As AI continues to evolve, Azure keeps adding new possibilities for creating intelligent solutions. Explore the resources of Azure AI and unleash the potential of AI in your projects. Stay tuned for Part 2 and Part 3 with interesting services and Azure Open AI demos.

Just before I go, I thought of asking GPT-4 whether it’s stealing our jobs or not 😂

Oh, no worries! As an AI language model, I’m more of a helpful sidekick than a job thief. Think of me as your trusty AI companion, here to assist and make your life easier. Together, we can conquer language-related tasks and bring a touch of AI magic to your work. So, rest assured, I’m not here to steal your job. I’m just here to lend a helping “AI-d” hand!

Signing off,

Hitesh Hinduja & Suyash Dongare
