though its been 7 months since the question has been asked, but still answering it you or someone else might need it
as mentioned by Christoph Rackwit, to sum over the instances, i would be using the same method to calculate the total number of pixels, along with the code mentioned by you to find the total pixels for each instance, dervied from instance segmentation
import locale
pre_classes = MetadataCatalog.get(self.cfg.DATASETS.TRAIN[0]).thing_classes # this contains the class names in the same order used for training, replace it with your custom dataset class names
masks = predictions["instances"].pred_masks # this extracts the pred_masks from the predicitons
classes = predictions["instances"].pred_classes.numpy().tolist() # this extracts the pred_classes (contains index values corresponding to pre_classes) to a simple from the predicitons
results = torch.sum(torch.flatten(masks, start_dim=1), dim=1).numpy().tolist() # this calculates the total pixels of each instance
count = dict() # create a dict to store unique classes and their total pixel
for i in range(len(classes)): # itearte over the predicted classes
count[classes[i]] = count.get(classes[i], 0) + results[i] # add the current sum of pixel of particular class and instance to the previous sum of the same class, adds 0 if the class didnt already exist
locale.setlocale(locale.LC_ALL, 'en_IN.UTF-8') # set the locale to Indian format
for k, v in count.items(): # itearte over the dict
print(f"{pre_classes[k]} class contains {locale.format_string('%d', v, grouping=True)} pixels") # printing each class and its total pixel, pres_classes[k] for accessing corresponding class names for class index, locale.format_string for formating the raw number to Indian format
i used the predefined model to perform instance segmentation on the following image
which resulted in the following image
which also produced your required results
dog class contains 1,39,454 pixels
cat class contains 95,975 pixels
as i havent had any hands on experience on semantic segmentation, so the solution is provided using instance segmentation,but if you insist on achieving the same using semantic segmentation please provide the weights, and the inference methods and test dataset, so that i can help you out
anyways i hope that this is what you were looking for, any questions related to the code, logic or working, feel free to contact