
Although it has been seven months since this question was asked, I am answering it anyway in case you or someone else still needs it.

As Christoph Rackwit mentioned, the answer is to sum over the instances. I use that same approach to get the total pixel count per class, building on the code from your question that finds the pixel count of each instance from the instance segmentation output.

First, extract the predicted classes (pred_classes, which are index values) and the predicted masks (pred_masks); both tensors are in the same order.
Then iterate over pred_classes and pred_masks and, using a dict, add each instance's pixel count to the running total for its class.
Then store the class names used for training in a list.
Lastly, iterate over the dict and use the class indices to look up the class names in that list, which gives the total number of pixels for each class (across all of its instances).
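To see the accumulation logic on its own, here is a minimal sketch on toy data that needs neither Detectron2 nor PyTorch. The masks, class indices, class names, and the helper name pixels_per_class are all made up for illustration; in the real code these values come from the model predictions.

```python
from collections import defaultdict

# Toy stand-ins for the model output (hypothetical data, not real predictions).
class_names = ["cat", "dog"]   # index 0 -> cat, index 1 -> dog
pred_classes = [1, 0, 1]       # one class index per instance
pred_masks = [                 # one binary mask per instance, same order
    [[1, 1, 0], [0, 0, 0]],    # dog instance, 2 pixels
    [[0, 0, 0], [1, 1, 1]],    # cat instance, 3 pixels
    [[0, 0, 1], [0, 0, 0]],    # dog instance, 1 pixel
]


def pixels_per_class(classes, masks):
    """Sum the mask pixels of every instance into a per-class total."""
    totals = defaultdict(int)
    for cls, mask in zip(classes, masks):
        totals[cls] += sum(sum(row) for row in mask)
    return dict(totals)


totals = pixels_per_class(pred_classes, pred_masks)
for idx, n in totals.items():
    print(f"{class_names[idx]} class contains {n} pixels")
```

Both dog instances end up in one entry, which is the whole point of summing over the instances.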

import locale

import torch
from detectron2.data import MetadataCatalog

# class names in the same order used for training; replace with your custom dataset's class names
# (self.cfg assumes this runs inside a class that holds the config; otherwise use your own cfg object)
pre_classes = MetadataCatalog.get(self.cfg.DATASETS.TRAIN[0]).thing_classes
instances = predictions["instances"].to("cpu")  # move to CPU so .numpy() also works after GPU inference
masks = instances.pred_masks  # boolean masks, one per instance, shape (N, H, W)
classes = instances.pred_classes.numpy().tolist()  # class indices (into pre_classes) as a plain list

# total number of pixels of each instance: flatten every mask and sum its True values
results = torch.sum(torch.flatten(masks, start_dim=1), dim=1).numpy().tolist()

count = dict()  # maps each class index to its total pixel count
for i in range(len(classes)):  # iterate over the predicted instances
    # add this instance's pixels to the running total of its class (starting from 0 for a new class)
    count[classes[i]] = count.get(classes[i], 0) + results[i]

# use Indian digit grouping; the en_IN.UTF-8 locale must be installed on the system
locale.setlocale(locale.LC_ALL, 'en_IN.UTF-8')
for k, v in count.items():  # iterate over the per-class totals
    # pre_classes[k] maps the class index to its name; locale.format_string applies the grouping
    print(f"{pre_classes[k]} class contains {locale.format_string('%d', v, grouping=True)} pixels")
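The locale.setlocale call above raises locale.Error when the en_IN.UTF-8 locale is not installed. If that happens, a small helper can produce the same grouping by hand. This is only a sketch; format_indian is a hypothetical name, not part of locale or Detectron2.

```python
def format_indian(n: int) -> str:
    """Group digits Indian style: last three digits, then groups of two."""
    sign = "-" if n < 0 else ""
    digits = str(abs(n))
    if len(digits) <= 3:
        return sign + digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel two-digit groups off the right end of the remaining digits.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return sign + ",".join(groups + [tail])


print(format_indian(139454))  # 1,39,454
print(format_indian(95975))   # 95,975
```

In the loop above you would then print format_indian(v) instead of calling locale.format_string.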

I used the predefined model to perform instance segmentation on the following image,

which resulted in the following image,

and which also produced the results you asked for:

dog class contains 1,39,454 pixels
cat class contains 95,975 pixels

I have no hands-on experience with semantic segmentation, so this solution uses instance segmentation. If you do need the same result from semantic segmentation, please provide the weights, the inference method, and a test dataset so that I can help you out.

Anyway, I hope this is what you were looking for. If you have any questions about the code, the logic, or how it works, feel free to ask.
