I had a similar problem recently and tried several strategies. The most accurate turned out to be sampling a subset of the data, compressing that sample, and extrapolating from its compression ratio to the full data size. I wrote a small program to do this, zip-sizer. The error was about 2-3% in my testing. Hope it helps.
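
For anyone who wants to see the idea in code, here is a minimal Python sketch of the sample, compress, extrapolate approach. It is not zip-sizer's actual implementation: the function name `estimate_compressed_size`, its parameters, and their defaults are assumptions made up for illustration, and it uses `zlib` (DEFLATE) as a stand-in for whichever compressor you actually care about.

```python
import os
import zlib


def estimate_compressed_size(path, sample_count=20, chunk_size=64 * 1024, level=6):
    """Estimate the zlib-compressed size of a file from evenly spaced samples.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if sample_count < 2 or chunk_size < 1:
        raise ValueError("sample_count must be >= 2 and chunk_size >= 1")

    total = os.path.getsize(path)
    if total == 0:
        return 0

    # If the samples would cover the whole file anyway, compress it exactly.
    if total <= sample_count * chunk_size:
        with open(path, "rb") as f:
            return len(zlib.compress(f.read(), level))

    sampled = 0      # uncompressed bytes read across all samples
    compressed = 0   # compressed bytes produced across all samples

    # Evenly spaced offsets so the samples span the file from start to end.
    stride = (total - chunk_size) // (sample_count - 1)
    with open(path, "rb") as f:
        for i in range(sample_count):
            f.seek(i * stride)
            chunk = f.read(chunk_size)
            sampled += len(chunk)
            compressed += len(zlib.compress(chunk, level))

    # Extrapolate the sample's compression ratio to the full file size.
    return int(total * compressed / sampled)
```

Each sample is compressed independently here, so it pays the stream overhead again and starts with an empty dictionary. That tends to push the estimate slightly high; larger chunks reduce the effect, and more samples help when the file's content varies a lot from one region to another.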