Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks in this area primarily focus on audio–video synchronization, while largely overlooking explicit evaluation of audio–physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark designed to systematically evaluate the audio–physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset comprising 25.5 hours of 11,605 audible videos recorded from 184 participants in controlled environments, ensuring high fidelity and preventing data leakage. It includes 337 groups of paired prompts with carefully controlled physical variables that induce sound variations, each grounded with 2–42 videos, spanning 6 audio–physics dimensions and 41 fine-grained test points, ranging from fundamental phenomena (e.g., collision) to complex effects (e.g., Helmholtz resonance). Each video is densely annotated with step-by-step audio–physics reasoning that describes how the sound is produced. In addition, each prompt pair is annotated with the underlying physical factors responsible for the differences in sound. Importantly, PhyAVBench leverages paired text prompts to assess models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art (SOTA) models across T2AV, I2AV, and V2A tasks, along with human studies involving 74 participants.
Human evaluation of physical realism shows a strong positive correlation with the CPRS metric. Our results reveal that even leading commercial models struggle with fundamental audio–physical phenomena, exposing a critical gap beyond audio–visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation.
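
To make the paired-prompt idea behind APST concrete, the following is a minimal sketch of a contrastive sensitivity check. It is not the CPRS definition, which this section does not spell out. It assumes that every clip has already been mapped to a fixed-length audio feature vector by some encoder (hypothetical here), and it asks whether the change a model produces between prompt A and prompt B points in the same direction as the change observed in the real recordings. The function names `cosine` and `contrastive_response` are illustrative only.

```python
import numpy as np


def cosine(u, v, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def contrastive_response(gt_a, gt_b, gen_a, gen_b):
    """Illustrative score in [-1, 1], NOT the paper's CPRS.

    Each argument is an array of shape (n_clips, dim) or (dim,) holding
    audio features (assumed precomputed) for one side of a prompt pair.
    Features are averaged per side, then the real A->B shift is compared
    with the generated A->B shift.
    """
    mean = lambda x: np.mean(np.atleast_2d(np.asarray(x, dtype=float)), axis=0)
    real_shift = mean(gt_b) - mean(gt_a)    # how real audio changes from A to B
    gen_shift = mean(gen_b) - mean(gen_a)   # how generated audio changes
    return cosine(real_shift, gen_shift)


# Toy 2-D features: the second coordinate rises when the physical variable changes.
gt_a = [[1.0, 0.0], [1.0, 0.0]]
gt_b = [[1.0, 1.0], [1.0, 1.0]]

sensitive = contrastive_response(gt_a, gt_b, [0.5, 0.2], [0.5, 0.9])
insensitive = contrastive_response(gt_a, gt_b, [0.5, 0.2], [0.5, 0.2])
inverted = contrastive_response(gt_a, gt_b, [0.5, 0.9], [0.5, 0.2])
```

Under these assumptions, a model that responds to the prompt change in the physically expected direction scores near 1, a model that ignores the change scores 0, and a model that responds in the opposite direction scores near -1.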
TABLE I: Comparison of unified audio-video generation benchmarks across audio-physics coverage, controlled setting, acoustic scenario coverage, data origin, ground-truth video numbers, and evaluation metrics.
SAVGBench evaluates unconditional audio-video generation. VABench contains only text prompts and conducts evaluation using an MLLM.
| Benchmark | Audio-Physics Coverage | Controlled Setting with Paired Samples | Acoustic Scenario: Music | Acoustic Scenario: SFX | Acoustic Scenario: Speech | Acoustic Scenario: Mix | Newly Collected | #GT Videos per Prompt | Evaluation Metric |
|---|---|---|---|---|---|---|---|---|---|
| TAVGBench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | 1 | AV-Align |
| SAVGBench | 1 Test Point | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | - | AV&Spatial-Align |
| Verse-Bench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | 1 | AV-Align |
| JavisBench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ (partial) | 1 | AV-Align |
| VABench | 4 Test Points | ✗ | ✓ | ✓ | ✓ | ✓ | - | 0 | AV&Stereo Align |
| PhyAVBench (Ours) | 6 Dimensions & 50 Test Points | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ≥ 20 | AV-Align & Physics Sensitivity Test |
Each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. The following sample video pairs from PhyAVBench show the diversity of the data.
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up, static camera. An index finger slowly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.<br>m01_c03_t08_s02_g011_a01 | (video) | (video) | (video) | (video) |
| Close-up, static camera. An index finger quickly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.<br>m01_c03_t08_s02_g011_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up, static camera. Water flows into a cup at a slow, gentle rate. Indoor.<br>m02_c05_t14_s02_g004_a01 | (video) | (video) | (video) | (video) |
| Close-up, static camera. Water flows into a cup at a fast, strong rate. Indoor.<br>m02_c05_t14_s02_g004_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A close-up, static shot of a transparent plastic bottle. The bottle contains no water. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.<br>m02_c06_t16_s02_g001_a01 | (video) | (video) | (video) | (video) |
| A close-up, static shot of a transparent plastic bottle. The bottle is filled with water to about four-fifths of its capacity. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.<br>m02_c06_t16_s02_g001_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Static medium shot of an empty plastic bottle being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.<br>m02_c07_t18_s02_g001_a01 | (video) | (video) | (video) | (video) |
| Static medium shot of a plastic bottle filled with water to about four-fifths of its capacity being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.<br>m02_c07_t18_s02_g001_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it rapidly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.<br>m02_c08_t20_s02_g001_a01 | (video) | (video) | (video) | (video) |
| A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it slowly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.<br>m02_c08_t20_s02_g001_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on low airflow, no nozzle attachment about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.<br>m02_c08_t20_s02_g018_a01 | (video) | (video) | (video) | (video) |
| Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on high airflow with a narrow concentrator nozzle attached about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.<br>m02_c08_t20_s02_g018_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium close-up shot facing the doorway. The door is open. In a quiet indoor environment, music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.<br>m02_c12_t33_s01_g001_a01 | (video) | (video) | (video) | (video) |
| A static, medium close-up shot facing the doorway. The door is closed. In a quiet indoor environment, the same music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.<br>m02_c12_t33_s01_g001_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium close-up shot of a smartphone placed on a table. The phone is not covered. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone and the surrounding tabletop.<br>m03_c12_t33_s03_g001_a01 | (video) | (video) | (video) | (video) |
| A static, medium close-up shot of a smartphone placed on a table. The phone is completely covered by an upside-down transparent plastic box. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone, the transparent plastic box, and the surrounding tabletop.<br>m03_c12_t33_s03_g001_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with aperiodic, irregular timing; indoor, quiet.<br>m05_c18_t45_s02_g006_a01 | (video) | (video) | (video) | (video) |
| Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with periodic, regular timing; indoor, quiet.<br>m05_c18_t45_s02_g006_b01 | (video) | (video) | (video) | (video) |

| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is boiling vigorously, producing continuous bubbling and steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the active boiling water.<br>m06_c19_t46_s02_g001_a01 | (video) | (video) | (video) | (video) |
| A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is gently simmering, with occasional small bubbles forming and minimal steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the lightly boiling water.<br>m06_c19_t46_s02_g001_b01 | (video) | (video) | (video) | (video) |
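
The sample identifiers shown with the prompts above differ only in their final token within a pair (for example `..._g011_a01` versus `..._g011_b01`). The sketch below groups identifiers into A/B prompt pairs on that basis alone. The meaning of the other fields (`m`, `c`, `t`, `s`, `g`) is not stated in this section, so the code deliberately does not interpret them, and the helper name `pair_samples` is hypothetical.

```python
from collections import defaultdict


def pair_samples(sample_ids):
    """Group sample IDs into (a, b) pairs keyed by their shared prefix.

    Assumption: the last underscore-separated token starts with 'a' or 'b'
    and marks which side of the prompt pair a sample belongs to, as in the
    identifiers displayed in the sample tables.
    """
    groups = defaultdict(dict)
    for sid in sample_ids:
        prefix, variant = sid.rsplit("_", 1)   # e.g. ("m01_..._g011", "a01")
        groups[prefix][variant[0]] = sid       # keyed by 'a' or 'b'
    # Keep only groups where both sides of the pair are present.
    return {
        prefix: (sides["a"], sides["b"])
        for prefix, sides in groups.items()
        if "a" in sides and "b" in sides
    }


ids = [
    "m01_c03_t08_s02_g011_a01",
    "m01_c03_t08_s02_g011_b01",
    "m02_c05_t14_s02_g004_a01",
    "m02_c05_t14_s02_g004_b01",
    "m06_c19_t46_s02_g001_a01",   # no matching 'b' sample in this list
]
pairs = pair_samples(ids)
```

Here `pairs` holds two complete pairs, and the unpaired identifier is dropped rather than matched with a guess.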