- Microsoft introduces three in-house AI foundation models for transcription, voice generation and image creation to reduce reliance on OpenAI.
- MAI-Transcribe-1 supports 25 languages and runs around 2.5x faster than Microsoft’s current Azure Fast transcription offering.
- MAI-Voice-1 can generate 60 seconds of customisable audio in about one second, while MAI-Image-2 targets advanced image and video generation.
- The models integrate into Microsoft Foundry, MAI Playground, Teams and Azure, with aggressive pricing and a roadmap toward large frontier models by 2027.

Microsoft is taking a clear step toward greater autonomy in artificial intelligence by introducing three of its own foundation models aimed at transcription, speech generation and image creation. The move signals that the company wants a deeper, multimodal AI stack it fully controls, even while it keeps a close commercial alliance with OpenAI in place.
These new systems, developed under the Microsoft AI / MAI Superintelligence teams, are designed to plug directly into products like Teams and Azure as well as into internal experimentation platforms. In practice, Microsoft is laying the groundwork for a long-term strategy where its own models cover a growing share of everyday workloads, reserving external models like those from OpenAI for cases where they bring clear, differentiated value.
Three Microsoft-built foundation models for transcription, voice and images
The launch revolves around three core models: MAI-Transcribe-1 for speech-to-text, MAI-Voice-1 for text-to-speech and MAI-Image-2 for visual generation. Together they form a first, very visible layer of an in-house multimodal AI stack that can handle text, audio and images inside the Microsoft ecosystem.
Rather than relying solely on large, general-purpose models, Microsoft is betting on task-focused systems that are cheaper and faster for common enterprise use cases. That approach is especially relevant as the number of Copilot users and AI-powered features in Office, Teams and Azure keeps climbing, with costs that would otherwise scale almost linearly with API usage.
Foundation models of this kind are trained on large and diverse datasets so they can later be adapted to a wide range of scenarios. Here, that means powering everything from call-centre transcription and meeting summaries to synthetic voices, accessibility tools and automated content creation pipelines.
MAI-Transcribe-1: faster, multilingual speech-to-text for 25 languages
MAI-Transcribe-1 is Microsoft’s new speech-to-text engine and one of the central pieces of this rollout. The model supports transcription in 25 different languages and has been benchmarked internally as roughly 2.5 times faster than the company’s existing Azure Fast transcription offering, the current reference point in its speech-to-text portfolio.
This performance bump matters because transcription workloads are highly sensitive to latency, especially in real-time scenarios like live captions, customer support or hybrid meetings. The broader language coverage also aligns with Microsoft’s global footprint, making it easier for multinational customers to standardise on a single provider instead of mixing regional tools.
From a product standpoint, Microsoft plans to wire MAI-Transcribe-1 straight into Microsoft Teams to handle meeting transcripts and live captions. Over time, the same engine is expected to appear under the hood of other productivity tools, so that users see better speed and lower costs without necessarily noticing a branding change.
Pricing has been positioned aggressively: MAI-Transcribe-1 starts at around $0.36 per hour of processed audio, a figure aimed at undercutting comparable offers from both Google and OpenAI while still running on Microsoft’s own cloud infrastructure.
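To put that rate in perspective, here is a minimal back-of-the-envelope sketch in Python. The $0.36 per audio hour figure is the one cited in the announcement; the meeting volumes are purely illustrative assumptions, not Microsoft data.

```python
# Rough monthly cost estimate at the cited $0.36 per processed audio hour.
# Only the rate comes from the announcement; the workload is assumed.

RATE_PER_AUDIO_HOUR = 0.36  # USD, as cited for MAI-Transcribe-1

def monthly_transcription_cost(meetings_per_day: int,
                               avg_meeting_hours: float,
                               workdays_per_month: int = 22) -> float:
    """Estimate monthly spend for transcribing recorded meetings."""
    audio_hours = meetings_per_day * avg_meeting_hours * workdays_per_month
    return audio_hours * RATE_PER_AUDIO_HOUR

# Example: an organisation transcribing 200 one-hour meetings per workday.
cost = monthly_transcription_cost(meetings_per_day=200, avg_meeting_hours=1.0)
print(f"Estimated monthly cost: ${cost:,.2f}")  # 200 * 1.0 * 22 * 0.36 = $1,584.00
```

At that scale, the per-hour price rather than raw model quality quickly becomes the number procurement teams compare across providers.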
MAI-Voice-1: ultra-fast text-to-speech with custom voices
On the audio generation side, MAI-Voice-1 is Microsoft’s new model for turning text into speech. According to the company, it can produce approximately 60 seconds of audio in about one second of processing time, which is a notable jump for use cases where responsiveness is critical.
Beyond raw speed, a key promise is support for custom, brand-aligned voices. Organisations will be able to define voices that match their identity or specific use cases, from support hotlines and conversational agents to training material, podcasts and accessibility features. That level of control is increasingly important as synthetic speech becomes more common and listeners grow more demanding about tone and clarity.
Microsoft is aiming MAI-Voice-1 squarely at developers and enterprises building voice-heavy products: call centres, in-app assistants, language learning tools, media platforms or any service that needs scalable narration. With pricing starting around $22 per one million characters, the model is meant to be financially viable at both small and very large volumes.
From an infrastructure angle, MAI-Voice-1 is offered through Azure APIs, Microsoft Foundry and MAI Playground, letting teams test voices quickly and then move to production without switching environments. The idea is to streamline the full path from experimentation to deployment within Microsoft’s stack.
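For developers, the integration pattern is likely to look like an ordinary authenticated HTTP call from application code. The sketch below is a hypothetical illustration only: the endpoint URL, payload fields and response shape are assumptions and do not reflect the documented Azure or Foundry API for MAI-Voice-1.

```python
# Hypothetical sketch of calling a text-to-speech endpoint from application code.
# The URL, payload schema and credential variable below are illustrative
# assumptions, NOT the documented API for MAI-Voice-1.
import os
import requests

ENDPOINT = "https://example-foundry-endpoint/voice/generate"  # placeholder URL
API_KEY = os.environ["MAI_API_KEY"]  # hypothetical credential variable

def synthesize(text: str, voice: str = "brand-narrator") -> bytes:
    """Send text to the (assumed) TTS endpoint and return raw audio bytes."""
    response = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": text, "voice": voice, "format": "mp3"},  # assumed schema
        timeout=30,
    )
    response.raise_for_status()
    return response.content

if __name__ == "__main__":
    audio = synthesize("Thanks for calling. How can I help you today?")
    with open("greeting.mp3", "wb") as f:
        f.write(audio)
```

The point of keeping experimentation (MAI Playground) and production (Azure APIs, Foundry) in the same stack is that a snippet like this can move from prototype to deployed service without changing providers.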
MAI-Image-2: image and video generation integrated into Microsoft’s stack
The third model, MAI-Image-2, focuses on generating images from text prompts, with some descriptions extending it to video as well. While the company has not disclosed every technical detail, it is positioning the model as a visual counterpart to its text and audio systems, aimed at automating the creation of marketing assets, product visuals, storyboards and other media.
Interestingly, MAI-Image-2 first appeared more quietly in MAI Playground, Microsoft’s experimentation environment for large models, back in mid-March. The current announcement formalises its role as part of the broader Foundry and Azure ecosystem, where businesses can access it as a standard component rather than as a pure research demo.
Pricing is again structured to compete: the company cites an entry point of about $5 per one million input tokens for text and around $33 per one million output tokens for generated images. These numbers are framed as being on par with, or below, similar tiers from rival providers while benefiting from Microsoft’s enterprise security and compliance stack.
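A rough per-image estimate can be derived from those two rates, as in the short sketch below. Only the $5 and $33 per-million-token prices come from the announcement; the token counts per prompt and per generated image are illustrative assumptions, since the billing units have not been detailed.

```python
# Per-image cost sketch using the cited $5 per million input tokens and
# $33 per million output tokens. Token counts below are assumptions.

INPUT_RATE = 5.0 / 1_000_000    # USD per input (text) token
OUTPUT_RATE = 33.0 / 1_000_000  # USD per output (image) token

def image_cost(prompt_tokens: int, image_output_tokens: int) -> float:
    """Estimate the cost of one generated image."""
    return prompt_tokens * INPUT_RATE + image_output_tokens * OUTPUT_RATE

# Example: a 100-token prompt producing an image billed as 4,000 output tokens.
print(f"${image_cost(100, 4_000):.4f} per image")  # ≈ $0.1325
```

Under those assumptions the output side dominates the bill, which is why the $33 output-token rate is the figure most worth comparing against rival image APIs.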
Use cases range from automated creative workflows and personalised marketing visuals to rapid prototyping for product design. For many customers already standardised on Azure, the key selling point is that they can experiment with image generation without bringing in an additional external vendor.
Integration across Azure, Foundry, MAI Playground and Microsoft 365
A defining aspect of this launch is how tightly the new models are woven into Microsoft’s existing cloud and productivity platforms. All three systems – MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2 – are being rolled out through Microsoft Foundry, the company’s environment for accessing and scaling foundation models.
Developers can start with MAI Playground, where the same models are exposed in a more experimental interface. That setup is meant to lower the barrier for teams that want to try out capabilities like transcription, synthetic voices or visual generation without committing to full integration straight away.
On the product side, Microsoft is already pointing to Microsoft Teams as an early beneficiary. MAI-Transcribe-1 is set to power meeting transcripts and captions, while MAI-Voice-1 and MAI-Image-2 are expected to surface over time in various Copilot and Microsoft 365 experiences, even if end users may not see explicit model branding.
For companies, the promise is a single, coherent stack where transcription, voice and images live next to language models, data services and analytics in Azure. That could simplify compliance, security reviews and vendor management compared with stitching together multiple external AI providers.
Pricing strategy and competition with OpenAI and Google
Alongside technical specifications, Microsoft is placing a lot of emphasis on pricing competitiveness. The company openly positions these models as alternatives that can match or undercut similar offerings from OpenAI and Google, especially for sustained, high-volume use.
The published price points – $0.36 per audio hour for MAI-Transcribe-1, $22 per million characters for MAI-Voice-1 and the $5 / $33 per million token structure for MAI-Image-2 – are not just technical details. They are part of a broader message that Microsoft wants to be seen as a cost-efficient, end-to-end provider of generative AI rather than only a reseller of partner models.
In a market where more organisations are embedding AI into daily operations, cost per request can quickly become a strategic variable. By owning its own models, Microsoft can fine-tune the trade-off between compute expenses, model complexity and user pricing instead of paying large markups to external providers.
There is also a signalling effect: by highlighting its own benchmarks and price tables, Microsoft is effectively telling customers that they no longer need to default to third-party models for core workloads such as transcription, speech and images if they are already committed to Azure.
Mustafa Suleyman and the “human-centred” AI vision
The three new models come from teams grouped under Microsoft AI / MAI Superintelligence, led by Mustafa Suleyman, who now heads Microsoft AI. Suleyman, who co-founded DeepMind and later Inflection AI before joining Microsoft, has been publicly outlining a vision that he describes as “humanist AI” or human-centred artificial intelligence.
In Microsoft’s communications around the launch, Suleyman emphasises that these models are designed to reflect how people actually communicate, prioritising practical usefulness and safety. The goal, in his words, is to create systems that are less abstract research projects and more tools that fit into everyday workflows at work and at home.
He has also suggested that the current trio of models is only the beginning of a broader portfolio. Microsoft plans to roll out additional foundation models through Foundry and directly inside products, gradually expanding its in-house capabilities beyond speech and images to cover more modalities and more specialised tasks.
That roadmap underscores Microsoft’s intent to be seen not just as a platform for other people’s AI, but as a builder of its own advanced models that can sit alongside offerings from long-time partners like OpenAI.
A recalibrated relationship with OpenAI and a 2027 frontier-model goal
One of the most delicate aspects of this strategy is how it relates to Microsoft’s high-profile partnership with OpenAI. The companies remain closely tied: Microsoft has invested over $13 billion in OpenAI, hosts its models on Azure and integrates systems like GPT into products such as Copilot.
However, recent reports point to a renegotiation of the relationship that gives Microsoft more room to run its own AI research and product lines in parallel. Suleyman has framed this shift as a natural evolution, not a rupture – more akin to the company designing some of its own chips while still buying from external suppliers.
According to Bloomberg and other outlets, Microsoft is aiming to have its own large-scale, frontier-level models up and running by around 2027. The newly announced systems sit slightly upstream of that ambition: they are not yet positioned as general-purpose, cutting-edge language models, but rather as specialised components that reduce dependence on partner APIs for everyday workloads.
In practice, this means Microsoft can keep using OpenAI models like GPT-5.4 where they make sense, while gradually swapping in its own models wherever the cost-performance ratio or strategic considerations favour internal technology. Users may simply notice that features become quicker or cheaper as these transitions happen in the background.
For the broader AI market, this dual track underscores a clear trend: large tech companies are seeking a balance between collaboration and self-sufficiency, using alliances to move fast but building their own capabilities to avoid being locked into a single supplier over the long term.
With these three models, Microsoft is effectively planting a flag: it wants to compete at multiple levels of the AI stack – from infrastructure and tooling to the foundational models themselves – while still leaving space for partners like OpenAI where they bring unique strengths. For customers, that could translate into more options, sharper pricing and a gradual shift toward Microsoft-branded AI underpinning familiar products and services.
