TRACE: Temporal Grounding Video LLM via Causal Event Modeling

If our project helps you, please give us a star ⭐ on GitHub and cite our paper!

📰 News

  • [2024.11.01] 🔥 We are excited to announce the release of trace-uni, which is trained with additional general video understanding data from a subset of LLaVA-Video-178k. Our results indicate that trace-uni outperforms trace on both video temporal grounding (VTG) tasks and general video understanding tasks.
  • [2024.10.19] 🔥 We release trace-retrieval, which forces the predicted timestamps to align with the input frame timestamps. Results show that trace-retrieval achieves better performance on dense video captioning tasks.
  • [2024.10.10] 🔥 Our code and paper are released!
  • [2024.10.10] 🔥 Our checkpoints are available now!

Overview

In this work:

  • We model videos as a series of events and propose a causal event modeling framework to capture the inherent structure of videos.
  • We present TRACE, a novel task-interleaved video LLM that implements the causal event modeling framework through the sequential encoding/decoding of timestamps, salient scores, and textual captions.
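
The event-by-event ordering described above can be illustrated with a small sketch. This is not the TRACE implementation: the `Event` class, its field names, the `"time"`/`"score"`/`"text"` tags, and the sort by start time are all hypothetical, chosen only to show what "timestamps, then salient score, then caption, one event after another" looks like as a single sequence. How TRACE actually encodes and decodes each part is not covered here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    """Hypothetical container for one video event (not a TRACE class)."""
    start: float    # event start time, in seconds
    end: float      # event end time, in seconds
    score: float    # salient score of the event
    caption: str    # textual description of the event


def interleave(events: List[Event]) -> List[Tuple[str, object]]:
    """Flatten events into one sequence, event by event,
    in the order timestamps -> salient score -> caption."""
    sequence: List[Tuple[str, object]] = []
    # Assumption: events are emitted in order of their start time.
    for event in sorted(events, key=lambda e: e.start):
        sequence.append(("time", (event.start, event.end)))
        sequence.append(("score", event.score))
        sequence.append(("text", event.caption))
    return sequence


events = [
    Event(12.0, 20.5, 3.0, "add the onions"),
    Event(0.0, 8.0, 4.5, "heat oil in a pan"),
]

for kind, value in interleave(events):
    print(kind, value)
```

Because each event is emitted after all earlier ones, a model that generates this sequence left to right conditions every event on the events before it, which is the sense in which the modeling is causal.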

Model Zoo

| Checkpoints | Description | URL |
|---|---|---|
| Initialization | Weights initialized from VideoLLaMA2 | trace-init |
| Stage-1 | Model checkpoints trained after stage-1 | trace-stage1 |
| Stage-2 | Model checkpoints trained after stage-2 | trace |
| FT-Charades | Fine-tuned on Charades-STA dataset | trace-ft-charades |
| FT-Youcook2 | Fine-tuned on Youcook2 dataset | trace-ft-youcook2 |
| FT-QVHighlights | Fine-tuned on QVHighlights dataset | trace-ft-qvhighlights |
| TRACE-retrieval | Forces the predicted timestamps to align with the input timestamps | trace-retrieval |
| TRACE-uni | Incorporates additional general video understanding data from a subset of LLaVA-Video-178k | trace-uni |
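
To make the TRACE-retrieval description concrete, here is a minimal sketch of aligning a predicted timestamp with the input frame timestamps by snapping it to the nearest one. This is an assumption about what "align" means, written as a post-processing step for illustration; the released checkpoint may enforce the alignment inside the model during decoding instead. The function name `snap_to_frame` and the example values are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def snap_to_frame(t: float, frame_times: Sequence[float]) -> float:
    """Return the input frame timestamp closest to t.

    frame_times must be non-empty and sorted in ascending order.
    Ties are resolved in favor of the earlier frame.
    """
    i = bisect_left(frame_times, t)
    if i == 0:
        return frame_times[0]        # t is before the first frame
    if i == len(frame_times):
        return frame_times[-1]       # t is after the last frame
    before, after = frame_times[i - 1], frame_times[i]
    return before if t - before <= after - t else after


# Hypothetical example: frames sampled every 2 seconds.
frame_times = [0.0, 2.0, 4.0, 6.0, 8.0]
predicted = (2.7, 6.9)
aligned = tuple(snap_to_frame(t, frame_times) for t in predicted)
print(aligned)  # (2.0, 6.0)
```

With alignment of this kind, every predicted boundary coincides with a frame the model actually saw, at the cost of a resolution limited by the frame sampling interval.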

Results

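| Youcook2 (Zero-Shot) | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 8.1 | 2.8 | 2.2 | 22.4 |
| TRACE-retrieval | 8.3 | 2.9 | 2.3 | 24.1 |
| TRACE-uni | 8.6 | 2.9 | 2.3 | 22.4 |

| Charades-STA (Zero-Shot) | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| TRACE | 58.6 | 40.3 | 19.4 | 38.7 |
| TRACE-retrieval | 57.9 | 37.4 | 17.3 | 37.4 |
| TRACE-uni | 63.7 | 43.7 | 21.0 | 41.5 |

| QVHighlights (Zero-Shot) | mAP | Hit@1 |
|---|---|---|
| TRACE | 26.8 | 42.7 |
| TRACE-retrieval | 27.9 | 44.3 |
| TRACE-uni | 27.5 | 43.9 |

| ActivityNet-DVC | CIDEr | METEOR | SODA_c | F1 |
|---|---|---|---|---|
| TRACE | 25.9 | 6.0 | 6.4 | 39.3 |
| TRACE-retrieval | 25.7 | 5.9 | 6.5 | 40.1 |
| TRACE-uni | 29.2 | 6.9 | 6.4 | 40.4 |

| ActivityNet-MR | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| TRACE | 54.0 | 37.7 | 24.0 | 39.0 |
| TRACE-retrieval | 54.4 | 39.8 | 24.9 | 40.2 |
| TRACE-uni | 53.2 | 38.2 | 24.7 | 39.4 |

| MVBench | Avg | AS | AP | AA | FA | UA | OE | OI | OS | MD | AL | ST | AC | MC | MA | SC | FP | CO | EN | ER | CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRACE | 48.1 | 61.2 | 56.5 | 72.5 | 46.5 | 61.0 | 48.0 | 69.5 | 40.0 | 22.0 | 31.0 | 86.5 | 37.5 | 37.0 | 51.0 | 45.0 | 40.5 | 39.0 | 31.0 | 43.5 | 44.5 |
| TRACE-uni | 53.8 | 68.1 | 58.5 | 72.5 | 41.5 | 73.5 | 55.1 | 71.5 | 40.5 | 25.0 | 53.0 | 88.5 | 63.5 | 38.5 | 51.0 | 52.5 | 49.0 | 59.5 | 33.5 | 49.5 | 32.5 |

| VideoMME (w/o subtitle) | Short | Medium | Long | Avg |
|---|---|---|---|---|
| TRACE | 49.5 | 42.5 | 39.3 | 43.8 |
| TRACE-uni | 58.2 | 48.1 | 42.3 | 49.6 |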
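
For readers unfamiliar with the 0.3, 0.5, 0.7, and mIoU columns of the Charades-STA and ActivityNet-MR tables, the sketch below shows how such numbers are commonly computed for moment retrieval: the temporal IoU between a predicted and a ground-truth span, the percentage of queries whose prediction reaches each IoU threshold, and the mean IoU. This follows the standard definitions and is not the authors' evaluation script; the function names, the use of `>=` at the threshold, and the example spans are assumptions.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Percentage of queries at or above each IoU threshold, plus mean IoU.

    Assumes one prediction per query and a non-empty list of queries.
    """
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    result = {
        f"IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    result["mIoU"] = 100.0 * sum(ious) / n
    return result


# Hypothetical predictions and ground-truth spans for three queries.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (40.0, 50.0)]
metrics = summarize(preds, gts)
print(metrics)
```

In this example the three IoUs are 1.0, 1/3, and 0.0, so two of three queries pass the 0.3 threshold and only one passes 0.5 and 0.7.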

Bibliography

If you find this repository helpful for your project, please consider citing:

@misc{guo2024tracetemporalgroundingvideo,
      title={TRACE: Temporal Grounding Video LLM via Causal Event Modeling}, 
      author={Yongxin Guo and Jingyu Liu and Mingda Li and Xiaoying Tang and Qingbin Liu and Xi Chen},
      year={2024},
      eprint={2410.05643},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2410.05643}, 
}