It appears that you are experiencing resource contention on your Databricks cluster, which is degrading the performance of your pandas aggregations. This comes down to how Databricks shares compute resources among users and tasks: when another user runs intensive Spark I/O operations, those jobs can consume a large share of the cluster's CPU, memory, and I/O bandwidth, leaving little for your work. Because pandas aggregations run entirely on the driver node rather than being distributed across the cluster, they are especially sensitive to this contention, which explains the extended run times, the dying kernel, and the notebook detaching.
Here are a few suggestions to mitigate this issue:
Resource Allocation and Quotas: Ensure that your cluster is configured with appropriate resource quotas and limits, and request a higher quota for your cluster or namespace if needed. Refer to the Tuning Pod Resources document for guidance on how to request and tune these resources.
Cluster Configuration: Consider configuring your cluster to have dedicated resources for different types of workloads. For example, you can set up separate clusters for Spark and pandas operations to avoid resource contention.
Job Scheduling: Schedule your pandas aggregations to run during off-peak hours, when the cluster is less likely to be under heavy load from other users (see the scheduling sketch after this list).
Monitoring and Optimization: Use the Container Resource Allocation Tuning dashboard to monitor CPU and memory usage, and adjust resource requests and limits based on the observed usage patterns so that your application has the headroom it needs. A quick in-notebook check of driver utilization is sketched after this list.
Cluster Scaling: If your workload requires more resources than are currently available, scale up your cluster by adding nodes or choosing larger node types. For pandas specifically, the driver node's size matters most, since that is where the computation runs (see the configuration sketch after this list).
Alternative Approaches: If rewriting the code in PySpark is not feasible due to the migration overhead, you might explore lighter-weight frameworks that keep a pandas-like API while partitioning the work, such as Dask (a short sketch follows this list).
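As a sketch of the scheduling suggestion, the snippet below creates a Databricks job that runs an aggregation notebook nightly at 02:00 UTC through the Jobs 2.1 API. The workspace URL, token, notebook path, and cluster ID are placeholders you would replace with your own values.

```python
import requests

# Placeholders: substitute your own workspace URL and access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Job definition: run the aggregation notebook nightly at 02:00 UTC,
# outside the hours when the shared cluster is busiest.
job_spec = {
    "name": "nightly-pandas-aggregation",
    "schedule": {
        # Quartz cron: seconds, minutes, hours, day-of-month, month, day-of-week
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "aggregate",
            "notebook_task": {"notebook_path": "/Users/me/pandas_aggregations"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```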
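To complement the dashboard, you can also check the driver's headroom directly from the notebook. This minimal sketch uses psutil (assumed to be available on the driver, as it is preinstalled on most Databricks runtimes) to report memory and CPU on the node where pandas does its work.

```python
import psutil  # assumption: available on the driver (preinstalled on most runtimes)

# pandas executes on the driver node, so the driver's free memory is the
# limit that matters for large aggregations.
mem = psutil.virtual_memory()
print(
    f"Driver memory: {mem.total / 1e9:.1f} GB total, "
    f"{mem.available / 1e9:.1f} GB available ({mem.percent}% used)"
)

# Sample CPU utilization over one second; sustained values near 100%
# suggest contention from other workloads on the same machine.
print(f"Driver CPU utilization: {psutil.cpu_percent(interval=1)}%")
```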
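The next sketch illustrates both the dedicated-cluster and scaling suggestions: it creates a separate analytics cluster through the Clusters 2.0 API with autoscaling workers and a larger driver node. The runtime version and node types are examples only, not recommendations.

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
TOKEN = "<personal-access-token>"                        # placeholder

# Example spec for a separate analytics cluster: autoscaling workers plus a
# larger driver. Because pandas runs on the driver, driver_node_type_id is
# the setting that most directly buys headroom for pandas aggregations.
cluster_spec = {
    "cluster_name": "pandas-analytics",
    "spark_version": "13.3.x-scala2.12",   # example runtime version
    "node_type_id": "i3.xlarge",           # example worker node type
    "driver_node_type_id": "i3.2xlarge",   # larger driver for pandas work
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])
```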
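Finally, here is a minimal sketch of one lighter-weight alternative, Dask, whose dataframe API mirrors pandas while splitting the data into partitions. The file path and column names below are hypothetical.

```python
import dask.dataframe as dd

# Hypothetical path and column names. Dask reads the data in partitions and
# aggregates partition-by-partition, so the full dataset never has to fit in
# driver memory at once.
ddf = dd.read_parquet("/dbfs/path/to/events.parquet")

result = (
    ddf.groupby("customer_id")["purchase_amount"]
    .agg(["sum", "mean", "count"])
    .compute()  # runs the computation and returns an ordinary pandas DataFrame
)
print(result.head())
```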
By implementing these strategies, you can improve the performance and reliability of your pandas aggregations on Databricks, even when other users are running intensive operations.