I'm wondering whether it's worth trying to implement inter prediction given my RAM limitations.
I don't know much about FPGAs, but I don't think you should worry about RAM. Less than 3 MB might suffice.
Do I understand correctly that the buffer size will be equal to the resolution multiplied by the bits per pixel, given that I am working with the YUV 4:2:0 format? Or do I only need the Y (luma) component?
I'm not sure I have understood this correctly. With YUV 4:2:0 chroma subsampling, the buffer size in bytes (assuming 8 bits per sample) is:
width * height * 3 / 2
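To make the formula above concrete, here is a minimal sketch in Python. The helper name `yuv420_frame_bytes` is made up for illustration, and it assumes even frame dimensions and 8 bits (1 byte) per sample by default:

```python
def yuv420_frame_bytes(width: int, height: int, bytes_per_sample: int = 1) -> int:
    """Size in bytes of one YUV 4:2:0 frame (hypothetical helper).

    Assumes width and height are even and every sample occupies
    `bytes_per_sample` bytes (1 for 8-bit video).
    """
    luma = width * height
    # In 4:2:0, each chroma plane (U and V) is subsampled by 2
    # both horizontally and vertically.
    chroma = (width // 2) * (height // 2)
    return (luma + 2 * chroma) * bytes_per_sample


print(yuv420_frame_bytes(1280, 720))  # 1382400
```

The luma plane accounts for `width * height` samples and the two chroma planes together add half of that, which is where the factor 3 / 2 comes from.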
And is there any way to use an already compressed picture, one that has already passed through entropy coding (CAVLC or CABAC), for inter-frame prediction? Otherwise, if you use the picture after reconstruction, it takes up as much memory as the original, even though its quality is already lower. This requires a lot of RAM, which the FPGA does not have enough of.
As far as I know, it's not really possible.
It's called transcoding, but it's usually done by decoding and re-encoding.
I have not found any open-source examples that use less RAM than the full size of the uncompressed picture. For YUV at 720p, this is 1350 KB.
This is correct:
720 × 1280 × 3 ÷ 2 = 1382400 bytes, i.e. 1350 KiB
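As a rough illustration of where those bytes go, the sketch below splits one 720p frame into its three planes. It assumes a planar layout (Y, then U, then V, often called I420) and 8 bits per sample; the function name `i420_plane_layout` is hypothetical, and a real design may store the planes differently:

```python
def i420_plane_layout(width: int, height: int) -> dict:
    """Return {plane: (offset, size)} in bytes for a planar 4:2:0 frame.

    Hypothetical helper: assumes 8-bit samples, even dimensions,
    and planes stored back to back in Y, U, V order.
    """
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    return {
        "Y": (0, y_size),
        "U": (y_size, c_size),
        "V": (y_size + c_size, c_size),
    }


layout = i420_plane_layout(1280, 720)
total = sum(size for _, size in layout.values())

for plane, (offset, size) in layout.items():
    print(plane, offset, size)
print(total, total / 1024)  # 1382400 1350.0
```

For 720p the luma plane alone is 921600 bytes (900 KiB), and each chroma plane is 230400 bytes (225 KiB).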
If you do run into a genuine memory issue, you could use an SD card (here's a similar question). Access time would be much slower, but it would solve your problem.