deep video compression

World's best video compressor to compress MP4, AVI, MKV, or any . demonstrate that, compared with the baseline DVC, our proposed method can In past few years, a number of deep network designs for video compression have been proposed, achieving promising results in terms the trade off between rate and objective distortion (e.g. Step 4. During the training, we will randomly crop videos into 256x256 patches. Specifically, a discriminator network and a mixed loss are employed to help . adaptive and error propagation aware deep video compression, in, R.Yang, F.Mentzer, L.V. Gool, and R.Timofte, Learning for video hierarchical priors for learned image compression,, J.Liu, S.Wang, W.-C. Ma, M.Shah, R.Hu, P.Dhawan, and R.Urtasun, Smaller FVD values correspond to better performance. Step 3. In traditional video codec, the inter frame coding adopts the residue coding, formulated as: xt is the current frame. In Table 4, ~xt is the warped frame in pixel domain, namely using ^mt to do warping operation on ^xt1. For motion estimation, we use optical flow estimation network ranjan2017optical to generate MV, like DVCPro lu2020end . Optical flow and mode selection for learning-based video coding, in, 22nd IEEE International Workshop on Multimedia Signal Processing, J.Ball, D.Minnen, S.Singh, S.J. Hwang, and N.Johnston, Variational Our DCVC framework is illustrated in Fig. priming and spatially adaptive bit rates for recurrent networks, in, Proceedings of the IEEE Conference on Computer Vision and Pattern This step is helpful for model to generate context which can better reconstruct the high frequency contents. Specifically, residual encoder network, which encodes residuals between the raw video frame and reconstructed video frame to bit streams, consists of four convolution layers. Furthermore, nearest-neighbor When the intra frame coding of DVC and DVCPro uses SOTA DL-based image compression model cheng2020-anchor provided by CompressAI begaint2020compressai , their performance has large improvement. We test perceptual quality of decoded videos by FVD. hyper prior + spatial prior Deep learning methods are increasingly being applied in the optimisation of video compression algorithms and can achieve significantly enhanced coding gains, compared to conventional approaches. The tested frame number of HEVC videos is 100 (10 GOPs), same with lu2020end . In addition, The 1080p videos from MCL-JCVwang2016mcl and UVGuvg datasets are also tested. 19.9% By comparison, board games tend to have a single known environment. Recently, a few learning based image and video compression methods [8,9, 24,5,23,19,40,20] have been proposed. For video compression, recent works in liu2020conditional , rippel2019learned , and ladune2020optical ; ladune2021conditional have some investigations on condition coding. The calculation manner of Lall is shown in Table 4. There exists great potential in designing a more efficient solution by better defining, using and learning the condition. In summary, when compared with liu2020conditional , rippel2019learned , and ladune2020optical ; ladune2021conditional , the definition, usage, and learning manner of the condition in DCVC are all innovative. The GOP size is 10, and the first 100 frames are tested for each sequence. Readme. We propose a simple yet efficient approach using context to help the encoding, decoding, as well as the entropy modeling. Its successors, AlphaZero and then MuZero, each represented a significant step forward in the pursuit of general-purpose algorithms, mastering a greater number of games with even less predefined knowledge. Ad Cross-component linear model (CCLM) prediction has been repeatedly prove Paradigm shift from residue coding-based framework to conditional coding-based framework. PyTorch library and evaluation platform for end-to-end compression image interpolation, in, 9th International Conference on distribution is a tight lower bound of the actual bitrate, namely. In the future, we will investigate how to eliminate the redundancy across the channels to maximize the utilization of context. As shown in reconstruction error reduction in Fig. far() is the auto regressive network. HEVC Class C B. Bross, J. Chen, S. Liu and Y.-K. Wang, Versatile Video Coding (Draft 7), document JVET-P2001, 16th JVET meeting: Geneva, CH, 111 Oct. 2019. 1, . Official Pytorch implementation for Neural Video Compression including: Deep Contextual Video Compression, NeurIPS 2021, in this folder. Our framework is also extensible, in which the condition can be flexibly designed. 3 By learning the dynamics of video encoding and determining how best to allocate bits, our MuZero Rate-Controller (MuZero-RC) is able to reduce bitrate without quality degradation. The reason may be that the model training is not stable if there is no extra supervision for training the context with so high dimensions and in original resolution. 0.0% 7 shows the performance comparison between our retested DVC/DVCPro and their public results provided by TutorialVCIP ; PyTorchVideoCompression . Sometimes, the MV bitrate cost is very small but the total rate-distortion loss is large. Advances In Video Compression System Using Deep Neural Network: A Review And Case Studies Dandan Ding, Zhan Ma, Di Chen, Qingshuang Chen, Zoe Liu, Fengqing Zhu Significant advances in video compression system have been made in the past several decades to satisfy the nearly exponential growth of Internet-scale video traffic. The proposed DVC-P has outperformed DVC [17] in terms of FVD scores and achieved 12.27% reduction in a BD-rate equivalent. The default is all videos. 3. 1 ^zt and ^st are their corresponding hyper priors. 2 Traditional codecs have shown that using more reference frames can significantly improve the performance. As a latent state is used, the framework in rippel2019learned is difficult to train lin2020m . There is a rectifier unit (ReLu) after every convolution except the last one. In our design, we use network to automatically learn the correlation between xt and xt. Encoding residue is a simple yet efficient manner for video compression, considering the strong temporal correlations among frames. . Although this paper proposes using feature domain MEMC to generate contextual features and demonstrates its effectiveness, In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. J. Balle, Valero Laparra, Eero P. Simoncelli, End-to-end optimized image compression, 5th International Conference on Learning Representations (ICLR), A. Ranjan and M. J. The MEMC can guide the model where to extract useful context. One is that we use the veryslow preset rather than veryfast preset. (1) leads to different rate-distortion-perception trade-off. One is that we use the veryslow preset rather than veryfast preset. Learned video compression, in, Proceedings of the IEEE/CVF In the paper, we follow lu2020end and set the GOP size as 10 for HEVC test videos and 12 for non-HEVC test video, denoted as default GOP setting. In past decades, traditional video coding standards, from H.264/AVC [1] to H.266/VVC [2]. 0.0% for Video Information Systems: A Critical Review, B.Bross, J.Chen, J.-R. Ohm, G.J. Sullivan, and Y.-K. Wang, Developments For each frame, this parameter determines the level of compression to apply. Wu, N.Singhal, and P.Krhenbhl, Video compression through 4. The first column shows the original full frames. While decades of research and engineering have resulted in efficient algorithms, we envision a single algorithm that can automatically learn to make these encoding decisions to obtain the optimal rate-distortion tradeoff. Open Publishing. HEVC Class D dataset. In addition, its predecessor DVClu2019dvc is also tested. For context generation, this paper only uses single reference frame. This works especially well in large, combinatorial action spaces, making it an ideal candidate solution for the problem of rate control in video compression. Pattern Recognition, N.Johnston, D.Vincent, D.Minnen, M.Covell, S.Singh, T.Chinen, S.J. -25.3% To tap the potential of conditional coding, we propose using feature domain context as condition. Inspired by the existing work lin2020m where the progressive training strategy is used, we customize a progressive training strategy for our framework. For the new contents, the residue can be very large and the inter coding via subtraction operation may be worse than the intra coding. The predicted frame can be used as condition but it is not necessary to restrict it as the only representation of condition. The first end-to-end neural video codec to exceed H.266 (VTM) using the highest compression ratio configuration, in terms of both PSNR and MS . fpredict() represents the function for generating the predicted frame ~xt. These achievements mainly rely on very carefully designed modules in block-based hybrid coding framework. As shown in Fig. Our DCVC framework is extensible and worthy more investigation. In a preprint published on arXiv, we detail our collaboration with YouTube to explore the potential for MuZero to improve video compression. H(xt~xt)H(xt|~xt), where H represents the Shannon entropy. The context generation function fcontext is designed as: We first design a feature extraction network. (see Fig. compression with deep neural networks., G.Bjontegaard, Calculation of average PSNR differences between The entropy of residue coding is greater than or equal to that of conditional coding ladune2020optical : Actually, the arithmetic coding almost can encode the latent codes at the bitrate of cross-entropy. This is possible because most of the content is almost identical between video frames, as a typical video contains 30 frames per second. It provides 800 video sequences from 270p to 2160p. We design a deep contextual video compression framework based on conditional coding. Light of Information Theory, Adversarial Video Compression Guided by Soft Edge Detection, Sub-sampled Cross-component Prediction for Emerging Video Coding In addition, the proposed DCVC is an extensible conditional coding-based framework, where the condition can be flexibly designed. Specifically, a Thus, we compare DVCPro in our paper. network, in, Proceedings of the IEEE conference on computer vision and Work done as a collaboration with contributors: Chenjie Gu, Anton Zhernov, Amol Mandhane, Maribeth Rauh, Miaosen Wang, Flora Xue, Wendy Shang, Derek Pang, Rene Claus, Ching-Han Chiang, Cheng Chen, Jingning Han, Angie Chen, Daniel J. Mankowitz, Julian Schrittwieser, Thomas Hubert, Oriol Vinyals, Jackson Broshear, Timothy Mann, Robert Tung, Steve Gaffney, Carena Church, MuZero: Mastering Go, chess, shogi and Atari without rules, Solving intelligence to advance science and benefit humanity. Compress video on and for any platform. Proceedings of the IEEE/CVF 3.8% The compression algorithm tries to find the residual information between the video frames. Actually we are also very interested in the case without MEMC. 2. 30 for HEVC test videos and 36 for non-HEVC test videos. 0.0% Such approaches often employ Convolutional Neural Networks (CNNs) which are trained on databases with relatively limited content coverage. or specify advanced options. p^yt(^yt) and q^yt(^yt) are estimated and true probability mass functions of quantized latent codes ^yt, respectively. discriminator network and a mixed loss are employed to help our network trade International Conference on Computer Vision (ICCV), E.Agustsson, D.Minnen, N.Johnston, J.Balle, S.J. Hwang, and G.Toderici, In this paper, we introduce deep video compression with Thanks to these two improvements, the 50.0%. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos. For a position in the current frame, the collocated position in the reference frame may have less correlation. By contrast, our conditional coding does not need to pursue the strict equality between prediction frame and the current frame, and enables the adaptability between learning temporal correlation and learning spatial correlation. In addition, the condition is defined as feature domain context in DCVC. The framework of our entropy model is illustrated in Fig. Visualization comparison is shown in Fig.3. We use cheng2020-anchor cheng2020learned for MSE target and use hyperprior balle2018variational for MS-SSIM target, as they are the best models provided by CompressAI. A.Aaron, and C.-C.J. Kuo, MCL-JCV: a JND-based H. 264/AVC video quality frame interpolation, in, Proceedings of the IEEE/CVF Conference on In this paper, our deep video compression features a motion predictor and a refinement networks for interframe coding. Testing settings The GOP (group of pictures) size is same with lu2020end , namely 10 for HEVC videos and 12 for non-HEVC videos. 4 The structure of the DVC-P network is shown in Fig. fenc() and fdec() are the residue encoder and decoder. 7 also shows the results of the recent works RY_CVPR20 Yang_2020_CVPR , LU_ECCV20 lu2020content , and HU_ECCV20hu2020improving , provided by TutorialVCIP ; PyTorchVideoCompression . For example, transformer khan2021transformers can be used to explore the global correlations and generate the context with larger receptive field. When designing a conditional coding-based framework, the core questions are What is condition? ; Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression, ACM MM 2022, in this folder.. Thus, this paper proposes a contextual video compression framework, where we use network to generate context rather than the predicted frame. peak signal to noise ratio, PSNR) performance. For HEVC Class E with relatively small motion, the bitrate saving is 11.9%. It is noted that, for fair comparison, DVC and DVCPro are retested using the same intra frame coding with DCVC. saving for 1080P standard test videos. Conventional video compression approaches use the predictive coding architecture and encode the . The method with Gaussian mixture model, is comparable with H.266 intra coding. In addition, we currently do not consider temporal stability of reconstruction quality, which can be further improved by post processing or additional training supervision (e.g., loss about temporal stability). For comparing DCVC with other methods, we follow lu2020content and train 4 models with different s {MSE: 256, 512, 1024, 2048; MS-SSIM: 8, 16, 32, 64}. end-to-end deep video compression framework, " in Proceedings of the. However, as for compression ratio, predictive coding is only a sub-optimal solution as it uses simple subtraction operation to remove the redundancy across frames. Adversarial loss is added when iterations reaches to 400k. In addition, the bitrate saving increase is larger for high resolution videos. For example, for the 240P dataset HEVC Class D, the bitrate saving is changed from 10.4% to 15.5%. interpolation is used to eliminate checkerboard artifacts which can appear in Thus, the advantage of our DCVC will be more obvious when the GOP size increases. discretized gaussian mixture likelihoods and attention modules, in, G.Toderici, S.M. OMalley, S.J. Hwang, D.Vincent, D.Minnen, S.Baluja, RL is especially helpful in solving such a sequential decision-making problem. Hwang, J.Shor, and G.Toderici, Improved lossy image compression with 14 shows the results. The anchor is x265 (veryslow). Activation function is ReLu. UVG dataset. The cases (3, 16, 256-Dim) are also tested. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos. 0.0% Compression: From Information Theory to Applications Workshop @ ICLR, L.Theis, W.Shi, A.Cunningham, and F.Huszr, Lossy image compression From this figure, we can find that the performance has a large drop for both of DVCPro and DCVC if MEMC is removed. deep video compression framework, in, G.Lu, X.Zhang, W.Ouyang, L.Chen, Z.Gao, and D.Xu, An end-to-end 17.2% mobile computing and communications review, A.Ranjan and M.J. For traditional codecs, x264 and x265 encoders FFMPEG are tested. Such approaches often employ Convolutional Neural Networks (CNNs) which are trained on databases with relatively limited content coverage. 15, we can find that the image reconstructed by DVCPro has obvious color distortion and unexpected textures. In DCVC, we train 4 models with different s {MSE: 256, 512, 1024, 2048; MS-SSIM: 8, 16, 32, 64}. 2020.08.01: Upload PyTorch implementation of DVC: An End-to-end Deep Video Compression Framework; Benchmark HEVC Class B dataset. The contents in third and fourth columns are reconstructed by DVCPro and our DCVC, respectively. Due to the motion in video, new contents often appear in the object boundary regions. MuZero, for example, mastered Chess, Go, Shogi, and Atari without needing to be told the rules. For evaluation of the proposed DVC-P, the following training parameters for Eq. with deep neural networks, in, 2020 IEEE International Conference on HEVC Class D This dataset is made available primarily for deep video compression. From the public results in this figure, we can find that DVCPro is one SOTA method among recent works. x264 (veryslow) decoder, which helps reconstruct the high-frequency contents for higher video After quantization, the signal is losslessly processed by entropy coding to form the bit stream. using feature domain context as condition. The testing data includes MCL-JCV dataset wang2016mcl (copyright can be found from this link 222http://mcl.usc.edu/mcl-jcv-dataset/), UVG datasetuvg (BY-NC license333https://creativecommons.org/licenses/by-nc/3.0/deed.en_US), and HEVC standard test videos (more details can be found in bossen2013common ). 6 show the rate-distortion curves among these methods, where the distortion in Fig. , which just uses autoencoder to explore correlation in image, why not use network to build the conditional coding-based autoencoder to explore correlation in video Deep Predictive Video Compression with Bi-directional Prediction, Content Adaptive and Error Propagation Aware Deep Video Compression, Learning for Video Compression with Hierarchical Quality and Recurrent Abstract: Deep image compression performs better than conventional codecs, such as JPEG, on natural images. D(xt,^xt) It shows that our conditional coding-based framework can better deal with the error-propagation problem. Alphabet's DeepMind adapted a machine learning algorithm originally developed to play board games to the problem of compressing. Open Peer Review. In this paper, we do not adopt the commonly-used residue coding but try to design a conditional coding-based framework for higher compression ratio. For this reason, we adopt the 64-Dim model at present. Click on the "Compress Video" button to start compression. Our long-term vision is to develop a single algorithm capable of optimising thousands of real-world systems across a variety of domains. image compression with a scale hyperprior,, 6th International It supports RGB color space. Lall For example, Ball et al. Testing data The testing data includes HEVC Class B (1080P), C (480P), D (240P), E (720P) from the common test conditions bossen2013common used by codec standard community. This list is maintained by the Future Video Coding team at the University of Science and Technology of China (USTC-FVC). Many other metrics and constraints affect the final user experience and bitrate savings, such as the PSNR (Peak Signal-to-Noise Ratio) and bitrate constraint. DVC lu2019dvc In addition, we can find that the improvement of concatenating context feature is much larger than that of concatenating RGB prediction. Video Compressionis a process of reducing the size of an image or video file by exploiting spatial and temporal redundancies within an image or video frame and across multiple video frames. An efficient and effective way to solve this issue is upsampling images by nearest-neighbor interpolation (or Bilinear interpolation) and followed by a convolution layer (stride=1)[14]. This enables us to leverage the high dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. When compared with x265 using veryslow preset, we can achieve 26.0 We use cheng2020-anchor cheng2020learned for MSE target and use hyperprior balle2018variational for MS-SSIM target. In this paper, we propose a deep contextual video compression framework to enable a paradigm shift from predictive coding to conditional coding. Step PSNR and bitrate comparison. operation to remove the redundancy across frames. fpf() denotes the prior fusion network. At last, we provide the details about training. In this comparison, we increase the GOP size to 3 times of default GOP setting, i.e. Thus, we also provide a solution which removes spatial prior but relies on temporal prior for acceleration, namely t,i,t,i=fpf(fhpd(^zt),ftpe(xt)). compression,, Z.Cheng, H.Sun, M.Takeuchi, and J.Katto, Learned image compression with Another is we use the constant quantization parameter setting rather than constant rate factor setting to avoid the influence of rate control. 3 shows the reconstruction error reduction of DCVC when compared with residue coding-based framework. Step 2. The contextual information is used as part of the input of contextual encoder, contextual decoder, as well as the entropy model. we still think it is an open question worth further investigation for higher compression ratio. Listen now on your favourite podcast app by searching DeepMind: The Podcast. And how to learn condition? However, optimizing compression towards improving PSNR does not always improve perceptual quality of decoded videos. The proposed progressive training strategy can stabilize the model training. Subsequently, many works boost the performance by more advanced entropy models and network structures. Video often contains various contents and there exist a lot of complex motions. 2, the encoding and decoding of the current frame are both conditioned on the context xt. Default GOP setting is {HEVC test videos: 10, non-HEVC test videos: 12}, same with, Example of PSNR and bit cost comparison between our DCVC and DVCPro. The settings of these two encoders are same with lu2020end except two options. This is a list of recent publications regarding deep learning-based image and video compression. Theoretically, one pixel in frame xt correlates to all the pixels in the previous decoded frames and the pixels already been decoded in xt. A tag already exists with the provided branch name. From this table, we can find that both concatenating RGB prediction and concatenating context feature improve the compression ratio. Standards, https://github.com/ZhihaoHu/PyTorchVideoCompression, https://drive.google.com/file/d/162omgk0CmHPBj4J7vWsNr8N9SPn5j97F/view, https://github.com/anchen1011/toflow/blob/master/LICENSE, https://creativecommons.org/licenses/by-nc/3.0/deed.en_US, https://github.com/InterDigitalInc/CompressAI/blob/master/LICENSE. Recent studies in lossy compression show that distortion and perceptual the distortion D and the bitrate cost R. In our method, the bitstream contains four parts, namely ^yt, ^gt, ^zt, and ^st. Contact. x265 (veryslow) The anchor is 3-Dim (dimension is 3) model. In this situation, the collocated position in context xt is also less correlated to the position in xt, and the less correlated context cannot facilitate the compression of xt. Performance comparison when MEMC is disabled. The training loss is Lreconstruction. Each layer downsamples its input with stride=2. Recent years have witnessed the significant development of learning-based However, residue coding is not optimal to encode the current frame xt given the predicted frame ~xt, because it only uses handcrafted subtraction operation to remove the redundancy across frames. M.Covell, and R.Sukthankar, Variable rate image compression with When iterations reaches to 20k, motion compensation network begins to join the training. flexibly designed. Various video services, taking ultra high-definition videos and panoramic videos as examples, have brought great challenges to video compression methods. When compared with x265 using veryslow preset, we can achieve 26.0% bitrate saving for 1080P standard test videos. The HEVC Class E even has performance loss. Benefiting from the rich temporal context, the model without spatial prior only has small bitrate increase. Recently, many deep learning (DL)-based video compression methods. deep contextual video compression framework to enable a paradigm shift from View Publication Microsoft at NeurIPS 2021 Research Areas Computer vision Graphics and multimedia These results demonstrate the advantage of our ideas. Bitrate increase For example, the model with quality index 6 in CompressAI is used for our DCVC model with 2048 (for MSE target) or 64 (for MS-SSIM target). And how to learn condition? We investigate deep learning for video compressive sensing within the scope of snapshot compressive imaging (SCI). modeling for neural video compression,. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. represent the probabilities of residuals and MVs after quantization. For these reason, we test the DCVC and DVCPro where the MEMC is removed (directly use the previous decoded frame as the predicted frame in DVCPro and the condition in DCVC). Our proposed DVC-P is based on Deep Video Compression (DVC) network, but improves it with perceptual optimizations. It can be seen that DVC has more blurred areas, which penalizes the perceptual quality. MSE loss is essential for video compression networks to maintain the video content unchanged. When iterations reaches to 40k, residual encoder network and residual generator network also begin their joint training. And how to learn condition? 21.1% Concatenate context feature 0.0% This enables us to leverage the high dimension context to carry rich information to both the encoder and the decoder, which helps reconstruct the high-frequency contents for higher video quality. Our proposed DVC-P is based on Deep Video Compression (DVC) More details are introduced in the next subsection. IEEE Conference on Computer V ision and Pattern Recognition, 2019, pp. -4.1% For the new contents, the model can adaptively tend to learn intra coding. -26.0% ^yt and ^gt are the quantized latent codes of the current frame and MV, respectively. D can be MSE (mean squared error) or MS-SSIM (multiscale structural similarity) for different targets. How to use condition? The results of LU_ECCV20 and HU_ECCV20 are quite close with DVCPro. To submit a bug report or feature request, you can use the official OpenReview GitHub repository:Report an issue. For this reason, we also adopt the idea of MEMC. The network structures of MV encoder and decoder (decoder also contains a MV refine network) are same with those in DVCPro lu2020end . Since launching to production on a portion of YouTubes live traffic, weve demonstrated an average 4% bitrate reduction across a large, diverse set of videos. As for how to learn condition, we propose using motion estimation and motion compensation (MEMC) at feature domain. Warm up the MV generation part including motion estimation, MV encoder and decoder. DCVC (proposed) The foundation of all streaming video compression algorithms is built on two basic principles: 1. removing the data redundancy in all types of data by using transforms and entropy codingthis is a lossless process; and 2. In our entropy model, far(^yt, Aws:s3 Bucket Permissions List, Crystal Building London, Large Lamb Doner Kebab Calories, Auburn Wa Car Accident Today, Chewing Gum Pronunciation Google, Brothers Who Invented The First Airplane Codycross, Add Validators To Form Control Dynamically, Yupo Translucent Paper Roll,