Generic Event Boundary Detection via Denoising Diffusion

1POSTECH, 2GenGenAI
ICCV 2025

*Indicates Equal Contribution
Teaser Image

Our method generates diverse and plausible boundary predictions for generic events via denoising diffusion.

Abstract

Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.

DiffGEBD: Generic Event Boundary Detection via Denoising Diffusion

Overview of Proposed Method
Input video V is given to the backbone network g, producing visual features F. The extracted features F are then fed to the encoder f, which generates the encoded features E. During training, Gaussian noise ε is added to the ground-truth label y0 following the diffusion forward process, and the decoder h predicts boundaries from the noisy label yt at time step t, conditioned on E. During inference, the decoder iteratively denoises, starting from random Gaussian noise yT, to generate predictions, i.e., yT → yT−Δ → ⋯ → y0, following the DDIM inference step [40]. By initializing the NP random Gaussian noises differently, we can generate NP diverse predictions with a single model.
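The inference procedure above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the decoder here is a toy stand-in, the update rule is a simplified denoising step rather than the exact DDIM formula, and all names and hyperparameters (n_preds, n_steps, w) are hypothetical.

```python
import numpy as np

# Toy stand-in for the decoder h; the real model is a trained network.
def toy_decoder(y, t, cond):
    return 0.1 * y if cond is None else 0.1 * y + 0.01 * cond

def generate_boundaries(E, decoder, n_preds=3, n_steps=4, w=1.5, seq_len=8, seed=0):
    """Denoise N_P independently sampled Gaussian noises y_T -> ... -> y_0,
    conditioned on the encoded features E via classifier-free guidance."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_preds):
        y = rng.standard_normal(seq_len)          # y_T ~ N(0, I)
        for t in range(n_steps, 0, -1):
            eps_cond = decoder(y, t, E)           # prediction conditioned on E
            eps_uncond = decoder(y, t, None)      # null-conditioned prediction
            # classifier-free guidance: steer toward the conditional estimate
            eps = eps_uncond + w * (eps_cond - eps_uncond)
            y = y - eps / n_steps                 # simplified update, not exact DDIM
        preds.append(y)
    return preds
```

Because each of the N_P chains starts from a different initial noise, a single model yields N_P distinct boundary predictions.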

Diversity-aware Evaluation of GEBD

Diversity-aware evaluation protocol for GEBD
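A plausible sketch of such a protocol is given below, assuming F1p2g averages each prediction's best F1 over the annotators, F1g2p averages each annotator's best F1 over the predictions, and F1sym is their harmonic mean; the paper's exact definitions and matching rule may differ. Boundaries are taken as fractions of video length, matched greedily within a relative tolerance.

```python
def boundary_f1(pred, gt, tol=0.05):
    """F1 between two boundary lists (fractions of video length),
    greedily matching within relative tolerance `tol`."""
    if not pred or not gt:
        return 0.0
    matched, used = 0, set()
    for p in pred:
        for i, g in enumerate(gt):
            if i not in used and abs(p - g) <= tol:
                used.add(i)
                matched += 1
                break
    prec, rec = matched / len(pred), matched / len(gt)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def diversity_aware_f1(preds, gts, tol=0.05):
    # Pred-to-GT: each prediction is scored against its best-matching annotation.
    f1_p2g = sum(max(boundary_f1(p, g, tol) for g in gts) for p in preds) / len(preds)
    # GT-to-Pred: each annotation is scored against its best-matching prediction.
    f1_g2p = sum(max(boundary_f1(p, g, tol) for p in preds) for g in gts) / len(gts)
    f1_sym = 0.0 if f1_p2g + f1_g2p == 0 else 2 * f1_p2g * f1_g2p / (f1_p2g + f1_g2p)
    return f1_sym, f1_p2g, f1_g2p
```

Under these assumed definitions, a model is rewarded both for producing predictions that each match some annotator (fidelity) and for covering the full spread of annotations (diversity).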

Quantitative Results

Diversity-aware Evaluation

Method F1sym F1p2g F1g2p Diversity
Temporal Perceiver 69.4 72.2 67.4 14.6
SC-Transformer 72.9 74.9 71.6 18.9
BasicGEBD 72.2 74.5 70.6 18.6
EfficientGEBD 72.6 76.0 70.2 14.9
DiffGEBD (ours) 74.0 75.6 72.9 20.4
Diversity-aware evaluation results on Kinetics-GEBD dataset. We achieve the best performance considering both diversity and fidelity.

Conventional Evaluation

Method F1@0.05 (Kinetics-GEBD) F1@0.05 (TAPOS)
BMN 18.6 -
BMN-StartEnd 49.1 -
ISBA - 10.6
TCN 58.8 23.7
CTM - 24.4
TransParser - 23.9
PC 62.5 52.2
SBoCo 73.2 -
Temporal Perceiver 74.8 55.2
DDM-Net 76.4 60.4
CVRL 74.3 -
LCVS 76.8 -
SC-Transformer 77.7 61.8
BasicGEBD 76.8 60.0
EfficientGEBD 78.3 63.1
DyBDet 79.6 62.5
DiffGEBD (ours) 78.4 65.8
Conventional evaluation results on the Kinetics-GEBD and TAPOS datasets. We achieve competitive performance on both datasets.

Effect of CFG weight w

Effect of CFG weight
Performance variation with different CFG weight values. A moderate guidance weight effectively balances the trade-off between Pred-to-GT and GT-to-Pred scores, maximizing the symmetric F1 score by preserving alignment with the ground truth while retaining sufficient diversity.
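In standard classifier-free guidance (the paper may use an equivalent parameterization), the guided noise estimate interpolates between the decoder's unconditional and conditional predictions:

```latex
\hat{\epsilon}_\theta(y_t, t, E) =
  \epsilon_\theta(y_t, t, \varnothing)
  + w \, \bigl( \epsilon_\theta(y_t, t, E) - \epsilon_\theta(y_t, t, \varnothing) \bigr)
```

With w = 1 this reduces to the purely conditional prediction; larger w pulls samples toward the condition E (more consistent, less diverse), while smaller w leaves more influence to the initial noise (more diverse).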

Qualitative Results

Example results on Kinetics-GEBD. All outputs were generated by the same model with different initial noise and CFG weights. We observe that lower guidance weights allow for diverse predictions, while higher weights lead to more consistent predictions.

BibTeX


        @article{hwang2025generic,
          title={Generic Event Boundary Detection via Denoising Diffusion},
          author={Hwang, Jaejun and Gong, Dayoung and Kim, Manjin and Cho, Minsu},
          journal={arXiv preprint arXiv:2508.12084},
          year={2025}
        }