PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Anonymous Authors

Abstract

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks in this area primarily focus on audio–video synchronization, while largely overlooking explicit evaluation of audio–physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark designed to systematically evaluate the audio–physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset comprising 25.5 hours of 11,605 audible videos recorded from 184 participants in controlled environments, ensuring high fidelity and preventing data leakage. It includes 337 groups of paired prompts with carefully controlled physical variables that induce sound variations, each grounded with 2–42 videos, spanning 6 audio–physics dimensions and 41 fine-grained test points, ranging from fundamental phenomena (e.g., collision) to complex effects (e.g., Helmholtz resonance). Each video is densely annotated with step-by-step audio–physics reasoning that describes how the sound is produced. In addition, each prompt pair is annotated with the underlying physical factors responsible for the differences in sound. Importantly, PhyAVBench leverages paired text prompts to assess models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art (SOTA) models across T2AV, I2AV, and V2A tasks, along with human studies involving 74 participants. Human evaluation of physical realism shows a strong positive correlation with the CPRS metric. Our results reveal that even leading commercial models struggle with fundamental audio–physical phenomena, exposing a critical gap beyond audio–visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation.

Data Distribution

Fig. 1: The data distribution of PhyAVBench.

Audio-Physics Sensitivity Test

Fig. 2: Overview of the PhyAVBench evaluation framework. The Audio-Physics Sensitivity Test (APST) uses paired prompts that differ by a single physical variable (e.g., material). By comparing the directional trends of generated audio features against ground-truth physical laws, we calculate the Contrastive Physical Response Score (CPRS) to assess the model's understanding of real-world physics.
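The idea of comparing directional trends across a prompt pair can be illustrated with a small sketch. This is not the CPRS definition, which is not given in this section. It is a simplified stand-in that assumes each clip has already been reduced to a fixed-length vector of normalized audio features (for example loudness and spectral centroid, both hypothetical choices here). It measures whether the generated pair shifts in the same direction as the real pair when the physical variable changes. The function name `directional_agreement` and all feature values are illustrative assumptions.

```python
import math
from typing import Sequence


def directional_agreement(
    real_a: Sequence[float],
    real_b: Sequence[float],
    gen_a: Sequence[float],
    gen_b: Sequence[float],
) -> float:
    """Cosine similarity between the real and generated feature shifts (a -> b).

    Returns a value in [-1, 1]; 0.0 if either pair shows no shift at all.
    """
    if not (len(real_a) == len(real_b) == len(gen_a) == len(gen_b)):
        raise ValueError("all feature vectors must have the same length")
    # Shift induced by changing the physical variable, per feature.
    delta_real = [b - a for a, b in zip(real_a, real_b)]
    delta_gen = [b - a for a, b in zip(gen_a, gen_b)]
    dot = sum(r * g for r, g in zip(delta_real, delta_gen))
    norm_real = math.sqrt(sum(r * r for r in delta_real))
    norm_gen = math.sqrt(sum(g * g for g in delta_gen))
    if norm_real == 0.0 or norm_gen == 0.0:
        return 0.0
    return dot / (norm_real * norm_gen)


# Hypothetical normalized features: [loudness, spectral centroid].
real_slow, real_fast = [0.2, 0.3], [0.6, 0.5]
# A generated pair that follows the real trend.
same_trend = directional_agreement(real_slow, real_fast, [0.1, 0.2], [0.5, 0.4])
# A generated pair that moves in the opposite direction.
opposite_trend = directional_agreement(real_slow, real_fast, [0.5, 0.4], [0.1, 0.2])
```

A score near 1 means the generated pair responds to the physical change in the same direction as the recordings, and a score near -1 means it responds in the opposite direction. Because the cosine is scale-sensitive across features, this sketch only makes sense if the features are normalized to comparable ranges first.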

Comparison with Existing Benchmarks

TABLE I: Comparison of unified audio-video generation benchmarks across audio-physics coverage, controlled setting, acoustic scenario coverage, data origin, ground-truth video numbers, and evaluation metrics.

SAVGBench evaluates unconditioned audio-video generation. VABench contains only text prompts and conducts evaluation using an MLLM.

Benchmark Audio-Physics Coverage Controlled Setting with Paired Samples Acoustic Scenario Coverage Newly Collected #GT Videos per Prompt Evaluation Metric
Music SFX Speech Mix
TAVGBench 1 AV-Align
SAVGBench 1 Test Point - AV&Spatial-Align
Verse-Bench 1 AV-Align
JavisBench ✓ (partial) 1 AV-Align
VABench 4 Test Points - 0 AV&Stereo Align
PhyAVBench (Ours) 6 Dimensions & 50 Test Points ≥ 20 AV-Align & Physics Sensitivity Test

Data Curation Pipeline

Fig. 3: The data curation pipeline of PhyAVBench.

Sample Video Pairs in PhyAVBench

Each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. The following are some sample video pairs in PhyAVBench, showing the diversity of the data.

Prompt GT Sora2 Veo3.1 OVI

Close-up, static camera. An index finger slowly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_a01

Close-up, static camera. An index finger quickly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_b01

Prompt GT Sora2 Veo3.1 OVI

Close-up, static camera. Water flows into a cup at a slow, gentle rate. Indoor.

m02_c05_t14_s02_g004_a01

Close-up, static camera. Water flows into a cup at a fast, strong rate. Indoor.

m02_c05_t14_s02_g004_b01

Prompt GT Sora2 Veo3.1 OVI

A close-up, static shot of a transparent plastic bottle. The bottle contains no water. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_a01

A close-up, static shot of a transparent plastic bottle. The bottle is filled with water to about four-fifths of its capacity. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_b01
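The expected direction of change for this bottle pair can be worked out from the textbook ideal Helmholtz resonator, f = (c / 2π) · sqrt(A / (V · L)), where V is the air volume in the cavity, A the neck cross-section, and L the neck length. The sketch below is a rough illustration only: the bottle dimensions are hypothetical, the neck end correction is ignored, and it is not part of the benchmark's evaluation code.

```python
import math


def helmholtz_frequency(
    cavity_volume_m3: float,
    neck_area_m2: float,
    neck_length_m: float,
    speed_of_sound_m_s: float = 343.0,
) -> float:
    """Ideal Helmholtz resonator: f = (c / (2*pi)) * sqrt(A / (V * L))."""
    if min(cavity_volume_m3, neck_area_m2, neck_length_m) <= 0:
        raise ValueError("geometry values must be positive")
    return (speed_of_sound_m_s / (2 * math.pi)) * math.sqrt(
        neck_area_m2 / (cavity_volume_m3 * neck_length_m)
    )


# Hypothetical 500 mL bottle; every dimension below is illustrative only.
NECK_AREA = math.pi * 0.011 ** 2  # neck radius of 1.1 cm
NECK_LENGTH = 0.03                # 3 cm neck

# Empty bottle: the whole 500 mL is air.
empty = helmholtz_frequency(500e-6, NECK_AREA, NECK_LENGTH)
# Four-fifths full of water: only 100 mL of air remains.
four_fifths_full = helmholtz_frequency(100e-6, NECK_AREA, NECK_LENGTH)

# In this idealized model the ratio depends only on the air volumes.
ratio = four_fifths_full / empty
```

Since the frequency scales with 1/sqrt(V), shrinking the air volume to one fifth raises the resonance by a factor of sqrt(5), roughly 2.2, in this idealized model. A physically grounded generation for the second prompt should therefore sound clearly higher in pitch than for the first.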

Prompt GT Sora2 Veo3.1 OVI

Static medium shot of an empty plastic bottle being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_a01

Static medium shot of a plastic bottle filled with water to about four-fifths of its capacity being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_b01

Prompt GT Sora2 Veo3.1 OVI

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it rapidly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_a01

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it slowly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_b01

Prompt GT Sora2 Veo3.1 OVI

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on low airflow, no nozzle attachment about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_a01

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on high airflow with a narrow concentrator nozzle attached about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_b01

Prompt GT Sora2 Veo3.1 OVI

A static, medium close-up shot facing the doorway. The door is open. In a quiet indoor environment, music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_a01

A static, medium close-up shot facing the doorway. The door is closed. In a quiet indoor environment, the same music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_b01

Prompt GT Sora2 Veo3.1 OVI

A static, medium close-up shot of a smartphone placed on a table. The phone is not covered. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone and the surrounding tabletop.

m03_c12_t33_s03_g001_a01

A static, medium close-up shot of a smartphone placed on a table. The phone is completely covered by an upside-down transparent plastic box. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone, the transparent plastic box, and the surrounding tabletop.

m03_c12_t33_s03_g001_b01

Prompt GT Sora2 Veo3.1 OVI

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with aperiodic, irregular timing; indoor, quiet.

m05_c18_t45_s02_g006_a01

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with periodic, regular timing; indoor, quiet.

m05_c18_t45_s02_g006_b01

Prompt GT Sora2 Veo3.1 OVI

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is boiling vigorously, producing continuous bubbling and steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the active boiling water.

m06_c19_t46_s02_g001_a01

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is gently simmering, with occasional small bubbles forming and minimal steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the lightly boiling water.

m06_c19_t46_s02_g001_b01