
Date: 2025-04-29 14:18:13

Your question is primarily about performance, so I will focus on some performance comparisons between Flatbuffers and various JSON implementations for Python.

First, I should note that if you want to benefit from the performance of Flatbuffers, you have to use them in the intended way. What do I mean by this? Flatbuffers are fast when whole arrays are serialized and deserialized with the vectorized calls (CreateNumpyVector on write, the ...AsNumpy accessors on read). If you instead walk your records and write scalar fields one at a time, you pay Python-level overhead for every element and lose most of the benefit.

On to some performance comparisons.

Candidate Data

For this performance assessment I chose to use some data I am already working with. Unfortunately I cannot share this data, but you can generate random data with suitable properties to validate these results, if you wish to do so.

The data is a financial timeseries containing three arrays: "bid", "ask" and "timestamp". "bid" and "ask" are 64 bit floating point values, and "timestamp" is effectively a 64 bit integer.
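If you want to reproduce this, one way to generate synthetic data with similar properties is a random-walk price series. (The sizes and constants below are my own assumptions, not the real dataset.)

```python
import numpy as np
import pandas

# Hypothetical stand-in for the real dataset: a random-walk mid price
# split into bid/ask, plus monotonically increasing int64 timestamps.
n = 1_000_000
rng = np.random.default_rng(0)
mid = 1.10 + rng.normal(0.0, 1e-4, n).cumsum()
spread = 1e-4
df = pandas.DataFrame({
    "bid": mid - spread / 2,                          # float64
    "ask": mid + spread / 2,                          # float64
    "ts": np.arange(n, dtype=np.int64) * 1_000_000,   # int64, nanoseconds
})
```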

JSON Object Table vs JSON Array Table

Note that there are two common layouts for JSON serialization of tabular data: Object Table and Array Table.

This is an example of an Object Table.

{"col1":[1,2,3],"col2":[1.0,2.0,3.0],"col3":["1","2","3"]}

This is an example of an Array Table.

[{"col1":1,"col2":1.0,"col3":"1"},{"col1":2,"col2":2.0,"col3":"2"},{"col1":3,"col2":3.0,"col3":"3"}]

Both are representations of the same table of data.

pandas.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1.0, 2.0, 3.0],
        "col3": ["1", "2", "3"]
    }
)

The Object Table is almost certain to serialize faster, because its representation is more compact: each column name appears once, rather than being repeated for every row.

I raise this before continuing because it is relevant here: the example below is based around tabular data.
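Both layouts can be produced directly from a DataFrame via to_dict; a quick sketch which also shows why the Object Table is the smaller of the two:

```python
import json
import pandas

df = pandas.DataFrame({
    "col1": [1, 2, 3],
    "col2": [1.0, 2.0, 3.0],
    "col3": ["1", "2", "3"],
})

object_table = json.dumps(df.to_dict(orient="list"))     # {"col1": [...], ...}
array_table = json.dumps(df.to_dict(orient="records"))   # [{"col1": 1, ...}, ...]

# Column names appear once in the object table but once per row in the
# array table, so the object table is more compact.
assert len(object_table) < len(array_table)
```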

JSON Performance

There are a truck load of JSON packages on Pypi.

I am familiar with:

  • json (the standard library)
  • jsons
  • orjson

Therefore, I will present results for each of these. I use the standard library json package as the baseline case to compare against. I compare the performance of read and write operations separately.

Each value in the table is the scaling factor to compare with the baseline case. In other words, the value 1.39 for orjson read means that orjson was 1.39x faster than json for read performance.

        json   jsons   orjson
read    1      0.02    1.39
write   1      0.63    2.79

Flatbuffer Performance

Here are all the results for comparison.

        json   jsons   orjson   flatbuf
read    1      0.02    1.39     45.8
write   1      0.63    2.79     40.7

Comments in relation to your results

You noted that you did not see good performance when serializing in Flatbuffer format compared with JSON. Two possible reasons for this are:

  1. your objects do not lend themselves favorably to being serialized using vectorized function calls, or
  2. you did not write the serialization code to make use of these vectorized functions.

Regarding (1), you may wish to re-work your .fbs (FlatBuffers schema) file. You did not provide the schema, so I cannot comment on it specifically. Alternatively, re-work the objects to be serialized: try to structure them as array based types where possible, so that you can interface with them using numpy arrays.
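For illustration, a schema along these lines would support the vectorized path used in the code below. This is my guess at a suitable .fbs file, not the actual schema:

```
// BidAskTimeseries.fbs -- a guess at a suitable schema, not the
// asker's actual file. Array-valued fields are what enable the
// vectorized CreateNumpyVector / ...AsNumpy calls in Python.
namespace BidAskTimeseriesFBS;

table BidAskTimeseries {
  bid:[double];
  ask:[double];
  timestamp:[long];
}

root_type BidAskTimeseries;
```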

Code

In all of the following example codes, I use a pandas.DataFrame object as the data transport type. (Meaning that the read and write functions present an API which uses a DataFrame to exchange data.)

Example functions to read and write JSON data

import pandas
import json
import jsons
import orjson

def read_json():
    with open('df.objecttable.python_json.json', 'r') as ifile:
        data = json.load(ifile)
        return pandas.DataFrame(data)

def write_json(df):
    with open('df.objecttable.python_json.json', 'w') as ofile:
        data = df.to_dict(orient='list')
        json.dump(data, ofile)

def read_jsons():
    with open('df.objecttable.python_jsons.json', 'r') as ifile:
        data = jsons.loads(ifile.read())
        return pandas.DataFrame(data)

def write_jsons(df):
    with open('df.objecttable.python_jsons.json', 'w') as ofile:
        data = df.to_dict(orient='list')
        ofile.write(jsons.dumps(data, strict=True))

def read_orjson():
    with open('df.objecttable.python_orjson.json', 'rb') as ifile:
        data = orjson.loads(ifile.read())
        return pandas.DataFrame(data)

def write_orjson(df):
    with open('df.objecttable.python_orjson.json', 'wb') as ofile:
        data = df.to_dict(orient='list')
        ofile.write(orjson.dumps(data))
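A quick round-trip sanity check for the json pair above, restated here so it is self-contained (the frame is a small synthetic stand-in, and I write to the system temp directory):

```python
import json
import os
import tempfile
import pandas

def write_json(df, path):
    # serialize as an Object Table (column name -> list of values)
    with open(path, 'w') as ofile:
        json.dump(df.to_dict(orient='list'), ofile)

def read_json(path):
    with open(path, 'r') as ifile:
        return pandas.DataFrame(json.load(ifile))

path = os.path.join(tempfile.gettempdir(), 'df.objecttable.python_json.json')
df = pandas.DataFrame({'bid': [1.0, 1.1], 'ask': [1.2, 1.3], 'ts': [1, 2]})
write_json(df, path)
assert read_json(path).equals(df)  # the object table survives the round trip
```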

Example functions to read and write Flatbuffer data

import flatbuffers
import BidAskTimeseriesFBS  # module generated by flatc from the schema

def read_flatbuf():
    with open('df.python.flatbuf', 'rb') as ifile:
        bid_ask_timeseries = BidAskTimeseriesFBS.BidAskTimeseries.GetRootAs(ifile.read(), 0)
        bid = bid_ask_timeseries.BidAsNumpy()
        ask = bid_ask_timeseries.AskAsNumpy()
        timestamp = bid_ask_timeseries.TimestampAsNumpy()
        data_dict = {
            'bid': bid,
            'ask': ask,
            'ts': timestamp,
        }
        df = pandas.DataFrame(data_dict)
        return df

def write_flatbuf(df):
    builder = flatbuffers.Builder(1024)
    bid = df['bid'].values
    ask = df['ask'].values
    ts = df['ts'].values
    bid_flatbuffer = builder.CreateNumpyVector(bid)
    ask_flatbuffer = builder.CreateNumpyVector(ask)
    ts_flatbuffer = builder.CreateNumpyVector(ts)
    BidAskTimeseriesFBS.Start(builder)
    BidAskTimeseriesFBS.AddBid(builder, bid_flatbuffer)
    BidAskTimeseriesFBS.AddAsk(builder, ask_flatbuffer)
    BidAskTimeseriesFBS.AddTimestamp(builder, ts_flatbuffer)
    bid_ask_timeseries = BidAskTimeseriesFBS.End(builder)
    builder.Finish(bid_ask_timeseries)
    with open('df.python.flatbuf', 'wb') as ofile:
        ofile.write(builder.Output())

Utility Functions for profiling


import time

# returns elapsed time in seconds as float
def elapsed(function, *args):
    time_start = time.time()
    function(*args)
    time_end = time.time()
    time_diff = time_end - time_start
    return time_diff

def profile_function(f, count, args):
    #print(f'profile_function called with f={f}, count={count}, args={args}')
    if args is None:
        return [elapsed(f) for _ in range(count)]
    else:
        return [elapsed(f, args) for _ in range(count)]
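Example usage of the profiling helpers, restated self-contained. Note that time.perf_counter is generally a better clock than time.time for short intervals, since it is monotonic and higher resolution:

```python
import time

def elapsed(function, *args):
    # perf_counter is monotonic and higher-resolution than time.time
    time_start = time.perf_counter()
    function(*args)
    return time.perf_counter() - time_start

def profile_function(f, count, args):
    if args is None:
        return [elapsed(f) for _ in range(count)]
    return [elapsed(f, args) for _ in range(count)]

# time a trivial function three times
samples = profile_function(time.sleep, 3, 0.01)
```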

Example usage

count = 10

instruction_dict = {
    1: {
        "human_name": "JSON (json) read",
        "key": "json_read",
        "function": read_json,
        "profile_count": count,
        "args": None,
    },

    2: {
        "human_name": "JSON (json) write",
        "key": "json_write",
        "function": write_json,
        "profile_count": count,
        "args": read_json(),
    },

    3: {
        "human_name": "JSON (jsons) read",
        "key": "jsons_read",
        "function": read_jsons, 
        "profile_count": count,
        "args": None,
    },

    4: {
        "human_name": "JSON (jsons) write",
        "key": "jsons_write",
        "function": write_jsons, 
        "profile_count": count,
        "args": read_jsons(),
    },

    5: {
        "human_name": "JSON (orjson) read",
        "key": "orjson_read",
        "function": read_orjson,
        "profile_count": count,
        "args": None,
    },

    6: {
        "human_name": "JSON (orjson) write",
        "key": "orjson_write",
        "function": write_orjson,
        "profile_count": count,
        "args": read_orjson(),
    },

    7: {
        "human_name": "Flatbuffer read",
        "key": "flatbuf_read",
        "function": read_flatbuf,
        "profile_count": count,
        "args": None,
    },

    8: {
        "human_name": "Flatbuffer write",
        "key": "flatbuf_write",
        "function": write_flatbuf,
        "profile_count": count,
        "args": read_flatbuf(),
    },
}

df_results = pandas.DataFrame()

for k in sorted(instruction_dict.keys()):
    v = instruction_dict[k]

    print(f'k={k}')
    human_name = v['human_name']
    print(f'v[human_name]={human_name}')

    key = v['key']
    human_name = v['human_name']
    function = v['function']
    args = v['args']

    if key not in df_results.columns:
        print(f'profiling {human_name}')
        results = profile_function(function, count, args)

        df_results[key] = results
    else:
        print(f'skipping profiling {human_name}')

df_results.to_csv('df_results.csv')
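The scaling-factor tables above are then just the baseline's mean time divided by each candidate's mean time. A sketch with made-up timings (the column names match the keys used above, but the numbers are invented):

```python
import pandas

# Hypothetical raw timings in seconds per run; in practice these come
# from the df_results frame saved to df_results.csv above.
df_results = pandas.DataFrame({
    'json_read':   [0.50, 0.52],
    'orjson_read': [0.36, 0.37],
})

# Scaling factor = baseline mean time / candidate mean time,
# so values greater than 1 mean faster than the json baseline.
baseline = df_results['json_read'].mean()
factors = baseline / df_results.mean()
```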

You will have to supply your own data.

Posted by: user2138149