Rubato is a small audio language model that listens to music and writes out the score — note by note, beat by beat, synchronized to the audio.

It is powered by InterMo, a new symbolic music language that encodes notation and timing in a single sequence, so the model can generate both at once.

The result: you can follow the score in sync with the recording and feel the rubato — the expressive timing that gives each performance its character.

Piano transcription

Rubato is trained on piano audio. The scores you see here are generated entirely from the audio signal — no human editing, no MIDI input.

Zero-shot piano reduction

For orchestral, ensemble, and pop recordings, Rubato produces a piano reduction without any additional training on non-piano audio. It maps what it hears onto a grand staff, giving you a readable two-hand score of music it was never trained on.

© Authors of “Rubato: Transcribing Piano Music with Timestamps.” This interface demonstrates one of many possible applications of time-aligned score generation. It is designed to support all future open models that generate time-aligned InterMo notation, and to collect user preference data for improving those models. Unauthorized commercial use is prohibited and may be traced through unique identifiers embedded in each deployment. Free for open science and non-commercial use in education.

Rubato: Transcribing Piano Music with Timestamps

Western staff notation is one of the most sophisticated human languages. It captures melody, rhythm, phrasing, and structure in a form musicians can read, interpret, and discuss. However, generating this abstract notation, especially in a way that supports human creative interaction with audio, has been a pain point for modern AI. Existing notation formats are too verbose for a model to generate efficiently. Worse, they force the musician to give up direct interaction with the audio just to obtain the notational abstractions they can read.

Intervals and Moments (InterMo) merges human-readable notation and audio grounding in a single representation designed for autoregressive generation. The representation requires the model to decode the audio moment by moment, and it lets users follow along with the sheet music and visually inspect the rubato (where a phrase leans forward, where it lingers, and where expressive timing reshapes the written pulse) directly against the recording.
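To make "inspecting the rubato" concrete, here is a minimal sketch of how expressive timing could be read off time-aligned beat annotations. It assumes the beat positions are already available as a plain list of timestamps in seconds. That list, and the function names `local_tempo_bpm` and `tempo_deviation`, are illustrative assumptions, not InterMo's actual encoding or any part of Rubato.

```python
from typing import List


def local_tempo_bpm(beat_times: List[float]) -> List[float]:
    """Return one tempo estimate (beats per minute) per inter-beat interval.

    beat_times is a hypothetical input: beat timestamps in seconds,
    strictly increasing. It is not the InterMo format itself.
    """
    tempi = []
    for prev, curr in zip(beat_times, beat_times[1:]):
        interval = curr - prev
        if interval <= 0:
            raise ValueError("beat times must be strictly increasing")
        tempi.append(60.0 / interval)
    return tempi


def tempo_deviation(beat_times: List[float]) -> List[float]:
    """Return each local tempo as a ratio of the excerpt's mean tempo.

    Values above 1.0 mark beats where the performer pushes ahead;
    values below 1.0 mark beats where the performer lingers.
    """
    tempi = local_tempo_bpm(beat_times)
    if not tempi:
        return []
    total_duration = beat_times[-1] - beat_times[0]
    mean_bpm = 60.0 * len(tempi) / total_duration
    return [t / mean_bpm for t in tempi]


if __name__ == "__main__":
    # Two steady beats followed by a stretched one (a small ritardando).
    beats = [0.0, 0.5, 1.0, 1.75]
    print(local_tempo_bpm(beats))
    print(tempo_deviation(beats))
```

In this toy input the last beat arrives late, so its local tempo drops below the mean. Plotting such a curve under the score is one possible way to show where a phrase leans forward or lingers.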

Rubato is a prompt-conditioned encoder–decoder that generates InterMo sequences at different abstraction layers (time-aligned notation, performed note events, or beat/downbeat annotations). Unlike prior systems, Rubato requires no music-specific modeling: a standard speech recognition architecture, repurposed as-is, turns out to be all you need for generating readable, inspectable, audio-grounded sheet music. If the architecture doesn't distinguish speech from music, should we?