Draft:RWKV
RWKV
Stable release | RWKV-6 |
---|---|
Repository | github |
Written in | Python |
Type | |
License | Apache License 2.0[1] |
Website | www |
RWKV is a deep learning architecture based on the recurrent neural network (RNN). It was developed by Bo Peng and RWKV community developers to combine the efficient parallelizable training of Transformers with the efficient inference of RNNs. It performs on par with similarly sized Transformers.[2] RWKV is now an incubation-stage project of the LF AI & Data Foundation.[3]
Architecture
To address the vanishing gradient problem and the lack of parallelism along the time dimension in conventional RNNs, RWKV can be formulated as either a Transformer or an RNN. This allows it to parallelize computation during training while maintaining constant computational and memory complexity during inference.[2]
The RWKV model architecture is defined by four fundamental elements that are intrinsic to the time-mixing and channel-mixing blocks:
- R: The Receptance vector acts as the receiver of past information.
- W: The Weight signifies the positional weight decay vector, a trainable parameter within the model.
- K: The Key vector performs a role analogous to K in traditional attention mechanisms.
- V: The Value vector functions similarly to V in conventional attention processes.
Each block of the RWKV model consists of a time-mixing and a channel-mixing sub-block, each of which uses a recurrent structure to leverage past information.[2]
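The dual formulation can be illustrated with a minimal NumPy sketch of the WKV computation, assuming the RWKV-4 form in which a per-channel decay `w` (non-negative) weights past tokens and a per-channel bonus `u` weights the current token. The bonus `u` and all function names here are illustrative assumptions rather than names taken from this article, and the sketch omits token shift, the output projection, and the numerical stabilization that practical implementations need. The first function sums over the whole history, which is the form that can be parallelized over time; the second carries only two running vectors, so its memory use does not grow with sequence length.

```python
import numpy as np


def wkv_attention_form(w, u, k, v):
    """WKV at each position t as an explicit weighted average over tokens 0..t."""
    T, C = k.shape
    out = np.zeros((T, C))
    for t in range(T):
        # The current token is weighted with the bonus u instead of a decay.
        num = np.exp(u + k[t]) * v[t]
        den = np.exp(u + k[t])
        for i in range(t):
            # An earlier token i is decayed by w once per step of distance.
            weight = np.exp(-(t - 1 - i) * w + k[i])
            num = num + weight * v[i]
            den = den + weight
        out[t] = num / den
    return out


def wkv_recurrent_form(w, u, k, v):
    """Same quantity, computed with a fixed-size state (two vectors of size C)."""
    T, C = k.shape
    num_state = np.zeros(C)  # decayed sum of exp(k_i) * v_i over past tokens
    den_state = np.zeros(C)  # decayed sum of exp(k_i) over past tokens
    out = np.zeros((T, C))
    for t in range(T):
        cur = np.exp(u + k[t])
        out[t] = (num_state + cur * v[t]) / (den_state + cur)
        # Fold the current token into the state and decay everything by w.
        num_state = np.exp(-w) * num_state + np.exp(k[t]) * v[t]
        den_state = np.exp(-w) * den_state + np.exp(k[t])
    return out


def gate_with_receptance(r, wkv):
    """Receptance gate: sigmoid(r) scales how much of the WKV result is let through."""
    return wkv / (1.0 + np.exp(-r))


rng = np.random.default_rng(0)
T, C = 6, 4
w = np.abs(rng.normal(size=C))  # decay must be non-negative
u = rng.normal(size=C)
r, k, v = (rng.normal(size=(T, C)) for _ in range(3))

full = wkv_attention_form(w, u, k, v)
step = wkv_recurrent_form(w, u, k, v)
print(np.allclose(full, step))
gated = gate_with_receptance(r, step)
```

The two forms should agree up to floating-point error, which is the sense in which the same model can be trained like a Transformer and run like an RNN.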
Version History
RWKV-4
RWKV-4 is the first official open-source version of RWKV. It was scaled to models of up to 14 billion parameters and performs on par with similarly sized Transformers.[2] The paper describing it was accepted to EMNLP 2023 (Findings).[4]
RWKV-5/6
RWKV-5 (Eagle) and RWKV-6 (Finch) are sequence models that improve upon the RWKV-4 architecture. Their architectural advancements include multi-headed matrix-valued states and a dynamic recurrence mechanism.[5]
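As a rough illustration of what a matrix-valued state with dynamic recurrence means, the following NumPy sketch models a single head, assuming the state is a D×D matrix updated by a rank-one outer product of key and value and decayed per key channel at every step. The decay `w` is given per time step, so constant rows correspond to a fixed decay and varying rows to a data-dependent one. The bonus `u`, the shapes, and the function names are illustrative assumptions. How the decay is derived from the input, and the normalization, gating, and projection layers around the head, are left out.

```python
import numpy as np


def head_recurrent(w, u, r, k, v):
    """One head in RNN mode. w, r, k, v: (T, D); u: (D,). State S: (D, D)."""
    T, D = k.shape
    S = np.zeros((D, D))
    out = np.zeros((T, D))
    for t in range(T):
        kv = np.outer(k[t], v[t])  # rank-one update from the current token
        # Read out: current token weighted by the bonus u, plus the past state.
        out[t] = r[t] @ (u[:, None] * kv + S)
        # Decay each key channel by w[t], then add the current token.
        S = w[t][:, None] * S + kv
    return out


def head_sum_form(w, u, r, k, v):
    """Same output written as an explicit sum over all earlier tokens."""
    T, D = k.shape
    out = np.zeros((T, D))
    for t in range(T):
        wkv = u[:, None] * np.outer(k[t], v[t])
        for i in range(t):
            # Token i has been decayed by w[i+1], ..., w[t-1] (empty product = 1).
            decay = np.prod(w[i + 1:t], axis=0)
            wkv = wkv + decay[:, None] * np.outer(k[i], v[i])
        out[t] = r[t] @ wkv
    return out


rng = np.random.default_rng(0)
T, D = 5, 3
w = rng.uniform(0.5, 1.0, size=(T, D))  # per-step decay in (0, 1)
u = rng.normal(size=D)
r, k, v = (rng.normal(size=(T, D)) for _ in range(3))
print(np.allclose(head_recurrent(w, u, r, k, v), head_sum_form(w, u, r, k, v)))
```

In this sketch the recurrent form keeps only the D×D state per head regardless of sequence length, while the sum form makes the per-step decay products explicit.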
Variants
RWKV-CLIP
RWKV-CLIP is the first RWKV-driven vision-language representation learning model. It combines the effective parallel training of Transformers with the efficient inference of RNNs, and it improves performance on various vision-language tasks by expanding its dataset with image-text pairs obtained from websites. It achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval.[6]
VisualRWKV
VisualRWKV is the first application of a linear RNN model to multimodal learning tasks, leveraging the pre-trained RWKV language model. It uses a data-dependent recurrence and sandwich prompts to enhance modeling capabilities, along with a 2D image scanning mechanism to enrich the processing of visual sequences. VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks.[7]
Vision-RWKV
Vision-RWKV adapts the RWKV model from the NLP field with the modifications necessary for vision tasks. It is designed to handle sparse inputs efficiently and to provide robust global processing, and it scales to both large parameter counts and extensive datasets. Its distinctive advantage is reduced spatial aggregation complexity, which lets it process high-resolution images without windowing operations. Vision-RWKV surpasses ViT in image classification and is significantly faster and uses less memory on high-resolution inputs. In dense prediction tasks, it outperforms window-based models while maintaining comparable speeds.[8]
Diffusion-RWKV
Diffusion-RWKV is a diffusion model built on an RWKV-like architecture, with the requisite modifications for image generation tasks. It is designed to efficiently handle patchified inputs in a sequence with extra conditions, and it scales to both large parameter counts and extensive datasets. Its distinctive advantage is reduced spatial aggregation complexity, which lets it process high-resolution images without windowing or group cached operations. Experiments on both conditional and unconditional image generation show that Diffusion-RWKV performs on par with or surpasses existing CNN-based and Transformer-based diffusion models on the FID and IS metrics while significantly reducing total computation (FLOPs).[9]
RWKV-SAM
RWKV-SAM is a mixed backbone that combines convolution and RWKV operations, which gives it efficient inference on high-resolution images. It also has an efficient decoder that uses multiscale tokens to obtain high-quality masks. Compared with a Transformer model of the same scale, RWKV-SAM achieves more than a 2x speedup and better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models in classification and semantic segmentation.[10]
Restore-RWKV
Restore-RWKV is the first RWKV-based model for medical image restoration. The adaptations it makes to RWKV are intended to make it both efficient and effective for this task. Restore-RWKV achieves superior performance across various medical image restoration tasks, including MRI image super-resolution, CT image denoising, PET image synthesis, and all-in-one medical image restoration.[11]
PointRWKV
PointRWKV is a linear-complexity model derived from the RWKV model in the NLP field, with the modifications necessary for point cloud learning tasks. PointRWKV outperforms its Transformer-based and Mamba-based counterparts while saving about 46% of FLOPs.[12]
References
1. "RWKV-LM/LICENSE at main · BlinkDL/RWKV-LM". GitHub.
2. Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 [cs.CL].
3. "RWKV – LFAI & Data". lfaidata.foundation. Retrieved 2024-07-18.
4. "EMNLP 2023 - Findings".
5. Peng, Bo; Goldstein, Daniel; Anthony, Quentin; Albalak, Alon; Alcaide, Eric; Biderman, Stella; Cheah, Eugene; Du, Xingjian; Ferdinan, Teddy (2024-04-10). "Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence". arXiv:2404.05892 [cs.CL].
6. Gu, Tiancheng; Yang, Kaicheng; An, Xiang; Feng, Ziyong; Liu, Dongnan; Cai, Weidong; Deng, Jiankang (2024-06-11). "RWKV-CLIP: A Robust Vision-Language Representation Learner". arXiv:2406.06973 [cs.CV].
7. Hou, Haowen; Zeng, Peigen; Ma, Fei; Yu, Fei Richard (2024-06-19). "VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models". arXiv:2406.13362 [cs.CV].
8. Duan, Yuchen; Wang, Weiyun; Chen, Zhe; Zhu, Xizhou; Lu, Lewei; Lu, Tong; Qiao, Yu; Li, Hongsheng; Dai, Jifeng (2024-03-07). "Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures". arXiv:2403.02308 [cs.CV].
9. Fei, Zhengcong; Fan, Mingyuan; Yu, Changqian; Li, Debang; Huang, Junshi (2024-04-05). "Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models". arXiv:2404.04478 [cs.CV].
10. Yuan, Haobo; Li, Xiangtai; Qi, Lu; Zhang, Tao; Yang, Ming-Hsuan; Yan, Shuicheng; Loy, Chen Change (2024-06-27). "Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model". arXiv:2406.19369 [cs.CV].
11. Yang, Zhiwen; Zhang, Hui; Zhao, Dan; Wei, Bingzheng; Xu, Yan (2024-07-14). "Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV". arXiv:2407.11087 [eess.IV].
12. He, Qingdong; Zhang, Jiangning; Peng, Jinlong; He, Haoyang; Wang, Yabiao; Wang, Chengjie (2024-05-24). "PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning". arXiv:2405.15214 [cs.CV].