December 10, 2025 · 3 min read

Introducing VoiceFrom

We are building real-time speech-to-speech translation at the level of meaning, not words. From former Google audio AI leads Dominik Roblek and Hassan Rom.

What to listen for

Translation technology has existed for decades. And for most of that time, it has solved the wrong problem.

The dominant measure of quality has been accuracy: does the output say the same things the speaker said? This is a reasonable starting point. But in live speech, it is not enough. A keynote speaker builds to a peak, slows for emphasis, lets a pause carry weight. A trainer’s warmth signals that a learner should feel safe asking a question. A negotiator’s measured calm communicates confidence that the words alone do not. When you strip all of that out and produce a flat, accurate transcript of what was said, you have not translated the communication. You have translated the surface of it.

This is the problem we are building to solve.

Why the pipeline matters

The reason most AI translation loses meaning is architectural. The standard pipeline converts speech to text, translates the text, and generates new audio from the translation. This approach is efficient, and it produces readable output. But it discards the audio signal at step one, and meaning lives in the audio. Prosody, rhythm, emphasis, emotional register: these are not recoverable from text. Any system that reduces speech to text before translating it cannot preserve the things that make spoken communication distinct from written communication.
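
To make that loss concrete, here is a toy sketch in Python of the three-stage cascade described above. It is an illustration under stated assumptions, not a description of any real system: the `Utterance` type, its prosody fields, the lookup-table "translator", and the function names are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Utterance:
    """Toy stand-in for a stretch of speech: words plus prosodic cues."""
    words: str
    pitch_contour: Tuple[float, ...]  # hypothetical relative pitch per word
    pause_after_s: float              # hypothetical trailing pause, in seconds


def speech_to_text(utt: Utterance) -> str:
    # Step one keeps only the words; the prosodic fields are dropped here.
    return utt.words


def translate_text(text: str) -> str:
    # A one-entry lookup table standing in for a text translation model.
    table = {"thank you all": "gracias a todos"}
    return table.get(text, text)


def text_to_speech(text: str) -> Utterance:
    # Synthesis sees only text, so it can only assign default prosody.
    word_count = len(text.split())
    return Utterance(words=text,
                     pitch_contour=(1.0,) * word_count,
                     pause_after_s=0.0)


def cascaded_pipeline(utt: Utterance) -> Utterance:
    return text_to_speech(translate_text(speech_to_text(utt)))


# A speaker who rises on the last word and then lets a pause carry weight.
source = Utterance("thank you all", (0.8, 1.0, 1.6), 1.2)
output = cascaded_pipeline(source)
```

In this sketch the words arrive intact in `output`, but the rising pitch and the 1.2-second pause from `source` never reach the synthesis stage, because `speech_to_text` returns a plain string. No later stage can restore information that was never passed to it.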

We are building directly on the speech signal. This is harder. It requires solving the problem of preserving speaker characteristics (tone, pace, emphasis) across language boundaries in real time, with the latency constraints that live events demand. It is also, we believe, the only approach that can actually make the language barrier invisible, rather than just smaller.

Who we are

We are Dominik Roblek and Hassan Rom. Before VoiceFrom, we spent more than a decade at Google working on audio AI: the systems behind Google Meet, Google Assistant, Pixel Buds, and Waymo. We were inside the infrastructure that hundreds of millions of multilingual conversations depend on, and we watched the same meaning problem go unsolved. Solving it is why we started VoiceFrom.

VoiceFrom Pro today

VoiceFrom Pro is live. It delivers real-time speech-to-speech translation in the browser for conferences, enterprise events, and live communications, with no hardware, no interpreter booths, and no AV setup. It supports English, Spanish, French, German, Italian, and Portuguese. We were recognized in the Slator 2025 Language AI 50 Under 50.

This is the beginning, not the finished state. We are actively building toward broader language coverage, more robust handling of multi-speaker environments, and deeper integration with the workflows our customers depend on.

Why we’re writing

We are starting this blog because we want to share how we think about the hard problems in real-time speech translation: not just what VoiceFrom does, but why we built it the way we did. We’ll publish technical posts on model architecture and the decisions behind it, field observations from production deployments, and research we find genuinely useful. We’ll also be direct about what is hard, what we have not solved yet, and what we are working on next.

The language barrier is a solvable problem. We are solving it.

Dominik Roblek & Hassan Rom
Co-founders, VoiceFrom

Dominik Roblek

Co-founder

Dominik is Co-founder at VoiceFrom and previously led audio AI work at Google across products including Meet and Assistant. He focuses on speech-native translation quality and real-time product execution.

Hassan Rom

Co-founder

Hassan is Co-founder at VoiceFrom and a former Google audio AI leader. He works on low-latency multilingual speech systems that preserve meaning, tone, and listener experience in live settings.