Your question is primarily about performance, so I will focus on some performance comparisons between FlatBuffers and various JSON implementations for Python.
First, I should note that if you want to benefit from the performance of FlatBuffers, you have to use them in the intended way. What do I mean by this? The generated Python code exposes accessors of the form `GetXXXAsNumpy()` for vector fields. You should be using these interfaces, because they return data as numpy arrays, which are vectorized objects whose operations are typically implemented in specialized C code, which is fast.

On to some performance comparisons.
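Before the numbers, here is a minimal sketch of why those numpy accessors matter (plain numpy, no FlatBuffers involved): element-by-element access from Python, which is what calling a scalar accessor in a loop amounts to, versus one vectorized operation over the whole array.

```python
import numpy as np

# Stand-in for the array a FlatBuffers GetXXXAsNumpy() accessor would return.
values = np.arange(100_000, dtype=np.int64)

# Element-by-element access: one Python-level operation per element (slow).
total_loop = 0
for i in range(len(values)):
    total_loop += int(values[i])

# Vectorized access: the whole loop runs in compiled C code (fast).
total_vec = int(values.sum())
```

The two totals are identical, but the vectorized version is typically one to two orders of magnitude faster; that is the gap you give up if you read FlatBuffers vectors one element at a time.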
For this performance assessment I chose to use some data which I am already working with. Unfortunately you cannot have a copy of this data, but you can generate random data with suitable properties to validate these results, if you wish to do so.

The data is a financial timeseries containing 3 arrays: "bid", "ask" and "timestamp". "bid" and "ask" are floating point values (64 bits wide) and "timestamp" is effectively a 64 bit integer.
Note that there are two common formats for serializing tabular data as JSON: the Object Table and the Array Table.
This is an example of an Object Table.
{"col1":[1,2,3],"col2":[1.0,2.0,3.0],"col3":["1","2","3"]}
This is an example of an Array Table.
[{"col1":1,"col2":1.0,"col3":"1"},{"col1":2,"col2":2.0,"col3":"2"},{"col1":3,"col2":3.0,"col3":"3"}]
Both are representations of the same table of data:

pandas.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1.0, 2.0, 3.0],
        "col3": ["1", "2", "3"],
    }
)
The Object Table is almost certain to be faster, because it is the more compact representation: each column name is serialized once, rather than once per row. I raise this before continuing because it is relevant here, since my example is based around tabular data.
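As a concrete sketch, the two layouts correspond to two `orient` values of `pandas.DataFrame.to_dict` (the same method the benchmark code below uses for the Object Table form):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1.0, 2.0, 3.0],
    "col3": ["1", "2", "3"],
})

# Object Table: one key per column, values are column arrays.
object_table = df.to_dict(orient="list")

# Array Table: one dict per row; the column names repeat for every row.
array_table = df.to_dict(orient="records")
```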
There are a truckload of JSON packages on PyPI. I am familiar with:

- the `jsons` package
- `orjson`, which is supposed to be very fast

Therefore, I will present results for each of these. I use the standard library `json` package as the baseline case to compare against. I compare the performance of read and write operations separately.
Each value in the table is the scaling factor relative to the baseline case. In other words, the value 1.39 for `orjson` read means that `orjson` was 1.39x faster than `json` for read performance.
| | json | jsons | orjson |
|---|---|---|---|
| read | 1 | 0.02 | 1.39 |
| write | 1 | 0.63 | 2.79 |
Here are all the results for comparison.
| | json | jsons | orjson | flatbuf |
|---|---|---|---|---|
| read | 1 | 0.02 | 1.39 | 45.8 |
| write | 1 | 0.63 | 2.79 | 40.7 |
You noted that you did not see good performance when serializing in FlatBuffers format compared with JSON. Two possible reasons for this are that your objects either: (1) are described by a schema which does not suit them well, or (2) are not array-based, so they cannot be read through the fast numpy interfaces.

In regards to (1), you may wish to re-work your `fbs` (FlatBuffers schema) file. You did not provide the schema, so I cannot comment on it. In regards to (2), you may wish to re-work the objects to be serialized: try to write them as array-based types where possible, so that you can interface with them using `numpy` arrays.
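For reference, a schema along these lines would describe the bid/ask/timestamp data used below. This is a sketch of my own; the namespace and field names are assumptions chosen to match the generated `BidAskTimeseriesFBS` code in the examples, not a schema taken from the question.

```
// BidAskTimeseries.fbs -- assumed schema, adjust names to match your data
namespace BidAskTimeseriesFBS;

table BidAskTimeseries {
  bid:[double];
  ask:[double];
  timestamp:[long];
}

root_type BidAskTimeseries;
```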
In all of the following example code, I use a `pandas.DataFrame` object as the data transport type. (Meaning that the read and write functions present an API which uses a `DataFrame` to exchange data.)
The serialized output from `json` and `jsons` is the same. `orjson` removes some spaces from the serialized output.

import time

import flatbuffers
import pandas

import json
import jsons
import orjson

# Generated by flatc from the schema; adjust the import path to match
# your generated module.
import BidAskTimeseriesFBS.BidAskTimeseries as BidAskTimeseriesFBS
def read_json():
    with open('df.objecttable.python_json.json', 'r') as ifile:
        data = json.load(ifile)
    return pandas.DataFrame(data)

def write_json(df):
    with open('df.objecttable.python_json.json', 'w') as ofile:
        data = df.to_dict(orient='list')
        json.dump(data, ofile)

def read_jsons():
    with open('df.objecttable.python_jsons.json', 'r') as ifile:
        data = jsons.loads(ifile.read())
    return pandas.DataFrame(data)

def write_jsons(df):
    with open('df.objecttable.python_jsons.json', 'w') as ofile:
        data = df.to_dict(orient='list')
        ofile.write(jsons.dumps(data, strict=True))  # strip_privates???

def read_orjson():
    with open('df.objecttable.python_orjson.json', 'rb') as ifile:
        data = orjson.loads(ifile.read())
    return pandas.DataFrame(data)

def write_orjson(df):
    with open('df.objecttable.python_orjson.json', 'wb') as ofile:
        data = df.to_dict(orient='list')
        ofile.write(orjson.dumps(data))
def read_flatbuf():
    with open('df.python.flatbuf', 'rb') as ifile:
        bid_ask_timeseries = BidAskTimeseriesFBS.BidAskTimeseries.GetRootAs(ifile.read(), 0)
    bid = bid_ask_timeseries.BidAsNumpy()
    ask = bid_ask_timeseries.AskAsNumpy()
    timestamp = bid_ask_timeseries.TimestampAsNumpy()
    data_dict = {
        'bid': bid,
        'ask': ask,
        'ts': timestamp,
    }
    df = pandas.DataFrame(data_dict)
    return df

def write_flatbuf(df):
    builder = flatbuffers.Builder(1024)
    bid = df['bid'].values
    ask = df['ask'].values
    ts = df['ts'].values
    bid_flatbuffer = builder.CreateNumpyVector(bid)
    ask_flatbuffer = builder.CreateNumpyVector(ask)
    ts_flatbuffer = builder.CreateNumpyVector(ts)
    BidAskTimeseriesFBS.Start(builder)
    BidAskTimeseriesFBS.AddBid(builder, bid_flatbuffer)
    BidAskTimeseriesFBS.AddAsk(builder, ask_flatbuffer)
    BidAskTimeseriesFBS.AddTimestamp(builder, ts_flatbuffer)
    bid_ask_timeseries = BidAskTimeseriesFBS.End(builder)
    builder.Finish(bid_ask_timeseries)
    with open('df.python.flatbuf', 'wb') as ofile:
        ofile.write(builder.Output())
# returns elapsed time in seconds as float
def elapsed(function, *args):
    time_start = time.perf_counter()  # monotonic, higher resolution than time.time()
    function(*args)
    time_end = time.perf_counter()
    return time_end - time_start
def profile_function(f, count, args):
    #print(f'profile_function called with f={f}, count={count}, args={args}')
    if args is None:
        return [elapsed(f) for _ in range(count)]
    else:
        return [elapsed(f, args) for _ in range(count)]
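The timing harness can be exercised on its own; a self-contained sketch (using `time.sleep` as a stand-in workload):

```python
import time

# returns elapsed time in seconds as float
def elapsed(function, *args):
    time_start = time.perf_counter()
    function(*args)
    return time.perf_counter() - time_start

def profile_function(f, count, args):
    if args is None:
        return [elapsed(f) for _ in range(count)]
    else:
        return [elapsed(f, args) for _ in range(count)]

# Time a 10 ms sleep three times; each measurement should be about 0.01 s.
timings = profile_function(time.sleep, 3, 0.01)
```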
count = 10

instruction_dict = {
    1: {
        "human_name": "JSON (json) read",
        "key": "json_read",
        "function": read_json,
        "profile_count": count,
        "args": None,
    },
    2: {
        "human_name": "JSON (json) write",
        "key": "json_write",
        "function": write_json,
        "profile_count": count,
        "args": read_json(),
    },
    3: {
        "human_name": "JSON (jsons) read",
        "key": "jsons_read",
        "function": read_jsons,
        "profile_count": count,
        "args": None,
    },
    4: {
        "human_name": "JSON (jsons) write",
        "key": "jsons_write",
        "function": write_jsons,
        "profile_count": count,
        "args": read_jsons(),
    },
    5: {
        "human_name": "JSON (orjson) read",
        "key": "orjson_read",
        "function": read_orjson,
        "profile_count": count,
        "args": None,
    },
    6: {
        "human_name": "JSON (orjson) write",
        "key": "orjson_write",
        "function": write_orjson,
        "profile_count": count,
        "args": read_orjson(),
    },
    7: {
        "human_name": "Flatbuffer read",
        "key": "flatbuf_read",
        "function": read_flatbuf,
        "profile_count": count,
        "args": None,
    },
    8: {
        "human_name": "Flatbuffer write",
        "key": "flatbuf_write",
        "function": write_flatbuf,
        "profile_count": count,
        "args": read_flatbuf(),
    },
}
df_results = pandas.DataFrame()

for k in sorted(instruction_dict.keys()):
    v = instruction_dict[k]
    key = v['key']
    human_name = v['human_name']
    function = v['function']
    args = v['args']
    if key not in df_results.columns:
        print(f'profiling {human_name}')
        results = profile_function(function, count, args)
        df_results[key] = results
    else:
        print(f'skipping profiling {human_name}')

df_results.to_csv('df_results.csv')
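The scaling factors in the tables above are then derived from the recorded timings as baseline time divided by candidate time. A sketch with made-up numbers (in practice the columns come from `df_results`):

```python
import pandas as pd

# Hypothetical timings in seconds; real values come from df_results.csv.
df_results = pd.DataFrame({
    "json_read": [0.50, 0.52, 0.51],
    "orjson_read": [0.36, 0.37, 0.365],
})

mean_times = df_results.mean()
# A value > 1 means the candidate is faster than the json baseline.
speedup = mean_times["json_read"] / mean_times["orjson_read"]
```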
You will have to supply your own data.