This is the solution I ended up developing:
For each operation that I want to classify, I take 25 samples and count the relevant low-level operations each time. For example, I add 25 elements, one at a time, to the middle of an array list and record how many memory writes each insertion causes. Then I look at the relationship between X (the sample number, 1 to 25), Y (the operation count), delta Y (the change in Y between consecutive samples), and delta delta Y (the change in delta Y).
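Concretely, the sampling step looks roughly like the sketch below. This is not my actual harness, and CountingArrayList / MiddleInsertSampler are made-up names for illustration, but with an initial capacity of 8 and doubling growth this toy list produces exactly the write counts in the table that follows.

    // Hypothetical instrumented array list (not a library class): it counts every
    // raw array write so the harness can read off the cost of each insertion.
    class CountingArrayList {
        private Object[] data = new Object[8];   // initial capacity 8, doubled when full
        private int size = 0;
        long writeCount = 0;                     // total memory writes observed so far

        int size() { return size; }

        void add(int index, Object value) {
            if (size == data.length) {           // grow: each copied element is one write
                Object[] bigger = new Object[data.length * 2];
                for (int i = 0; i < size; i++) {
                    bigger[i] = data[i];
                    writeCount++;
                }
                data = bigger;
            }
            for (int i = size; i > index; i--) { // shift the tail right: one write per element
                data[i] = data[i - 1];
                writeCount++;
            }
            data[index] = value;                 // the insertion itself
            writeCount++;
            size++;
        }
    }

    class MiddleInsertSampler {
        // y[x - 1] = writes caused by the x-th insertion into the middle, x = 1..samples
        static long[] sample(int samples) {
            CountingArrayList list = new CountingArrayList();
            long[] y = new long[samples];
            for (int x = 1; x <= samples; x++) {
                long before = list.writeCount;
                list.add(list.size() / 2, x);
                y[x - 1] = list.writeCount - before;
            }
            return y;
        }
    }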
For example:
Memory writes when adding an element to the middle of an array list:
+----+----+---------+---------+
| x  | y  |   dy    |   ddy   |
+----+----+---------+---------+
|  1 |  1 |         |         |
|  2 |  2 |   1.000 |         |
|  3 |  2 |   0.000 |  -1.000 |
|  4 |  3 |   1.000 |   1.000 |
|  5 |  3 |   0.000 |  -1.000 |
|  6 |  4 |   1.000 |   1.000 |
|  7 |  4 |   0.000 |  -1.000 |
|  8 |  5 |   1.000 |   1.000 |
|  9 | 13 |   8.000 |   7.000 |
| 10 |  6 |  -7.000 | -15.000 |
| 11 |  6 |   0.000 |   7.000 |
| 12 |  7 |   1.000 |   1.000 |
| 13 |  7 |   0.000 |  -1.000 |
| 14 |  8 |   1.000 |   1.000 |
| 15 |  8 |   0.000 |  -1.000 |
| 16 |  9 |   1.000 |   1.000 |
| 17 | 25 |  16.000 |  15.000 |
| 18 | 10 | -15.000 | -31.000 |
| 19 | 10 |   0.000 |  15.000 |
| 20 | 11 |   1.000 |   1.000 |
| 21 | 11 |   0.000 |  -1.000 |
| 22 | 12 |   1.000 |   1.000 |
| 23 | 12 |   0.000 |  -1.000 |
| 24 | 13 |   1.000 |   1.000 |
| 25 | 13 |   0.000 |  -1.000 |
+----+----+---------+---------+
slope of y = 0.525 (increasing)
slope of dy = -0.026 (about constant)
slope of ddy = 0.000 (about constant)
category: O(n)
The only parameter that has to be tuned in this approach is what it means for a slope to be "about 0." I have to use different values for different data structures and algorithms. For example, all of my tests with the various lists use a tolerance of ±0.05, meaning that slopes between -0.05 and +0.05 are treated as "about 0." For trees I had to use ±0.03, and for hash maps I had to use ±0.2.
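The slope-and-classify step itself is short. Here is a simplified sketch of the idea; the names and the exact decision rule are illustrative rather than copied from my real harness, which distinguishes more cases. It builds the first and second differences from the raw counts, fits an ordinary least-squares slope to each series, and compares each slope against the tolerance just described.

    import java.util.Arrays;

    // Simplified slope-based classifier (illustrative decision rule, not my exact one).
    class ComplexityClassifier {

        // Ordinary least-squares slope of the points (1, v[0]), (2, v[1]), ...
        static double slope(double[] v) {
            int n = v.length;
            double meanX = (n + 1) / 2.0;
            double meanY = Arrays.stream(v).average().orElse(0);
            double num = 0, den = 0;
            for (int i = 0; i < n; i++) {
                double dx = (i + 1) - meanX;
                num += dx * (v[i] - meanY);
                den += dx * dx;
            }
            return num / den;
        }

        // First differences: d[i] = v[i + 1] - v[i]
        static double[] diff(double[] v) {
            double[] d = new double[v.length - 1];
            for (int i = 0; i < d.length; i++) d[i] = v[i + 1] - v[i];
            return d;
        }

        static String classify(long[] samples, double tolerance) {
            double[] y = Arrays.stream(samples).asDoubleStream().toArray();
            double[] dy = diff(y);
            double[] ddy = diff(dy);

            if (Math.abs(slope(y)) <= tolerance)   return "O(1)";    // y is about constant
            if (Math.abs(slope(dy)) <= tolerance)  return "O(n)";    // y grows at a roughly constant rate
            if (Math.abs(slope(ddy)) <= tolerance) return "O(n^2)";  // the growth rate itself grows linearly
            return "worse than O(n^2), or too noisy to tell";
        }
    }

Running the 25 counts from the table above through classify(samples, 0.05) reproduces the slopes shown earlier (0.525 for y, -0.026 for dy) and the O(n) verdict.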
Obviously this method is not perfect. As others have pointed out, asymptotic complexity is a theoretical problem, not an empirical one.
However, I was able to successfully unit test the performance of most common operations on common data structures. For example, I can show that insert, find, and remove on array lists and linked lists all have the expected constant or linear time behavior. I've also tested binary search trees, AVL trees, hash maps, binary heaps, adjacency list and adjacency matrix graphs, disjoint set forests, minimum spanning tree algorithms, and sorting algorithms (though the method proved very brittle for sorting).
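In case it helps to see the shape of a test, here is roughly what one looks like with JUnit 5 and the illustrative classes from the sketches above:

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import org.junit.jupiter.api.Test;

    class ArrayListComplexityTest {

        @Test
        void insertIntoMiddleIsLinear() {
            long[] writes = MiddleInsertSampler.sample(25);                     // 25 samples, as above
            assertEquals("O(n)", ComplexityClassifier.classify(writes, 0.05));  // list tolerance of 0.05
        }
    }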
This method also works when the theoretical performance relies on amortized analysis. For example, adding to the end of an array list still shows constant time behavior even though the backing array needs to be re-allocated periodically. The same is true for hash tables that re-hash, though I had to widen the "about 0" bounds on the slope quite a bit before that worked properly.
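As a concrete example of the amortized case, the same toy counter from the first sketch can sample appends instead of middle inserts; most appends cost a single write, with an occasional burst of n writes when the backing array grows:

    // Appending to the end of the toy CountingArrayList: no shifting, but a
    // size-n burst of writes whenever the backing array is re-allocated.
    class AppendSampler {
        static long[] sample(int samples) {
            CountingArrayList list = new CountingArrayList();
            long[] y = new long[samples];
            for (int x = 1; x <= samples; x++) {
                long before = list.writeCount;
                list.add(list.size(), x);        // append at the end
                y[x - 1] = list.writeCount - before;
            }
            return y;
        }
    }

In this toy setup the re-allocation spikes at x = 9 and x = 17 are rare enough that the slope of y stays inside the ±0.05 band, so the classifier still reports O(1).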
Hope this approach helps somebody else!