Generic Event Boundary Detection via Denoising Diffusion

1POSTECH, 2GenGenAI
ICCV 2025

*Indicates Equal Contribution
Teaser Image

Our method generates diverse and plausible boundary predictions for generic events via denoising diffusion.

Abstract

Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.

DiffGEBD: Generic Event Boundary Detection via Denoising Diffusion

Overview of Proposed Method
Input video V is given to the backbone network g, producing visual features F. The extracted features F are then fed to the encoder f, which generates the encoded features E. During training, Gaussian noise ε is added to the ground-truth label y0 following the diffusion forward process, and the decoder h predicts boundaries from the noisy label yt at time step t, conditioned on E. During inference, the decoder iteratively denoises, starting from random Gaussian noise yT, to generate predictions, i.e., yT → yT−Δ → ⋯ → y0, following the DDIM inference step [40]. By initializing the NP random Gaussian noises differently, we can generate NP diverse predictions with a single model.
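The inference procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the decoder here is a toy stand-in, the update rule is a simplified denoising step rather than the exact DDIM formula, and all names and hyperparameters (n_preds, n_steps, w) are hypothetical.

```python
import numpy as np

# Toy stand-in for the decoder h; the real model is a trained network.
def toy_decoder(y, t, cond):
    return 0.1 * y if cond is None else 0.1 * y + 0.01 * cond

def generate_boundaries(E, decoder, n_preds=3, n_steps=4, w=1.5, seq_len=8, seed=0):
    """Denoise N_P independently sampled Gaussian noises y_T -> ... -> y_0,
    conditioned on the encoded features E via classifier-free guidance."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_preds):
        y = rng.standard_normal(seq_len)          # y_T ~ N(0, I)
        for t in range(n_steps, 0, -1):
            eps_cond = decoder(y, t, E)           # prediction conditioned on E
            eps_uncond = decoder(y, t, None)      # null-conditioned prediction
            # classifier-free guidance: steer toward the conditional estimate
            eps = eps_uncond + w * (eps_cond - eps_uncond)
            y = y - eps / n_steps                 # simplified update, not exact DDIM
        preds.append(y)
    return preds
```

Because each of the N_P chains starts from a different initial noise, a single model yields N_P distinct boundary predictions.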

Diversity-aware Evaluation of GEBD

Diversity-aware evaluation protocol for GEBD
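A plausible sketch of such a protocol is given below, assuming F1p2g averages each prediction's best F1 over the annotators, F1g2p averages each annotator's best F1 over the predictions, and F1sym is their harmonic mean; the paper's exact definitions and matching rule may differ. Boundaries are taken as fractions of video length, matched greedily within a relative tolerance.

```python
def boundary_f1(pred, gt, tol=0.05):
    """F1 between two boundary lists (fractions of video length),
    greedily matching within relative tolerance `tol`."""
    if not pred or not gt:
        return 0.0
    matched, used = 0, set()
    for p in pred:
        for i, g in enumerate(gt):
            if i not in used and abs(p - g) <= tol:
                used.add(i)
                matched += 1
                break
    prec, rec = matched / len(pred), matched / len(gt)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def diversity_aware_f1(preds, gts, tol=0.05):
    # Pred-to-GT: each prediction is scored against its best-matching annotation.
    f1_p2g = sum(max(boundary_f1(p, g, tol) for g in gts) for p in preds) / len(preds)
    # GT-to-Pred: each annotation is scored against its best-matching prediction.
    f1_g2p = sum(max(boundary_f1(p, g, tol) for p in preds) for g in gts) / len(gts)
    f1_sym = 0.0 if f1_p2g + f1_g2p == 0 else 2 * f1_p2g * f1_g2p / (f1_p2g + f1_g2p)
    return f1_sym, f1_p2g, f1_g2p
```

Under these assumed definitions, a model is rewarded both for producing predictions that each match some annotator (fidelity) and for covering the full spread of annotations (diversity).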

Quantitative Results

Diversity-aware Evaluation

Method F1sym F1p2g F1g2p Diversity
Temporal Perceiver 69.4 72.2 67.4 14.6
SC-Transformer 72.9 74.9 71.6 18.9
BasicGEBD 72.2 74.5 70.6 18.6
EfficientGEBD 72.6 76.0 70.2 14.9
DiffGEBD (ours) 74.0 75.6 72.9 20.4
Diversity-aware evaluation results on Kinetics-GEBD dataset. We achieve the best performance considering both diversity and fidelity.

Conventional Evaluation

Method F1@0.05 (Kinetics-GEBD) F1@0.05 (TAPOS)
BMN 18.6 -
BMN-StartEnd 49.1 -
ISBA - 10.6
TCN 58.8 23.7
CTM - 24.4
TransParser - 23.9
PC 62.5 52.2
SBoCo 73.2 -
Temporal Perceiver 74.8 55.2
DDM-Net 76.4 60.4
CVRL 74.3 -
LCVS 76.8 -
SC-Transformer 77.7 61.8
BasicGEBD 76.8 60.0
EfficientGEBD 78.3 63.1
DyBDet 79.6 62.5
DiffGEBD (ours) 78.4 65.8
Conventional evaluation results on the Kinetics-GEBD and TAPOS datasets. We achieve competitive performance on both datasets.

Effect of CFG weight w

Effect of CFG weight
Performance variation with different CFG weight values. A moderate guidance weight effectively balances the trade-off between Pred-to-GT and GT-to-Pred scores, maximizing the symmetric F1 score by preserving alignment with the ground truth while retaining sufficient diversity.
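In standard classifier-free guidance (the paper may use an equivalent parameterization), the guided noise estimate interpolates between the decoder's unconditional and conditional predictions:

```latex
\hat{\epsilon}_\theta(y_t, t, E) =
  \epsilon_\theta(y_t, t, \varnothing)
  + w \, \bigl( \epsilon_\theta(y_t, t, E) - \epsilon_\theta(y_t, t, \varnothing) \bigr)
```

With w = 1 this reduces to the purely conditional prediction; larger w pulls samples toward the condition E (more consistent, less diverse), while smaller w leaves more influence to the initial noise (more diverse).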

Qualitative Results

Example results on Kinetics-GEBD. All outputs were generated by the same model with different initial noise and CFG weights. We observe that lower guidance weights allow for diverse predictions, while higher weights lead to more consistent predictions.

BibTeX


        @article{hwang2025generic,
          title={Generic Event Boundary Detection via Denoising Diffusion},
          author={Hwang, Jaejun and Gong, Dayoung and Kim, Manjin and Cho, Minsu},
          journal={arXiv preprint arXiv:2508.12084},
          year={2025}
        }