You are on the right path if you are already thinking about optimizing your code. I must point out, however, that writing good-quality code comes at the cost of spending a lot of time learning your tools, in this case the pandas
library. This video is how I was introduced to the topic, and I personally believe it helped me a lot.
If I understand correctly, you want to: filter specific crime types, group them by month and add up the occurrences, and finally plot the monthly crime evolution for each type.
Running your code three times back to back, I got execution times of 4.4346, 3.6758 and 3.9400 s -> mean 4.0168 s (not counting the time taken to load the dataset; measured with time.perf_counter()
). The data were taken from the NYPD database (please include your data source when posting questions).
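For reference, this is roughly how the dataset can be loaded; the file name is an assumption on my side (use whatever export you downloaded), and only the two columns used below are read in:
import time
import pandas as pd
import matplotlib.pyplot as plt

# Load the NYPD arrests export; the file name is an assumption, adjust it to your copy.
# Reading only the two columns needed for this analysis keeps the DataFrame small.
df = pd.read_csv("NYPD_Arrests_Data__Historic_.csv",
                 usecols=["ARREST_DATE", "OFNS_DESC"])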
crime_counts
is what we call a pivot table: it handles what you did separately for each crime type in a single pass, while also storing the results in an analysis-friendly pd.DataFrame
format.
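If you prefer, the same table can also be built in one call with DataFrame.pivot_table; this is just a sketch of an equivalent formulation, using the filtered DataFrame that is created in the code below:
# Equivalent to the groupby/size/unstack chain below:
# aggfunc="size" counts rows per (month, crime type) cell, fill_value=0 fills missing combinations
crime_counts = filtered.pivot_table(index="ARREST_MONTH",
                                    columns="OFNS_DESC",
                                    aggfunc="size",
                                    fill_value=0)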
t1 = time.perf_counter()
# convert the string-based dates to datetime objects
df["ARREST_DATE"] = pd.to_datetime(df["ARREST_DATE"], format='%m/%d/%Y')
# create a pd.Series of monthly periods (same length as df)
df["ARREST_MONTH"] = df["ARREST_DATE"].dt.to_period('M') # no one's stopping you from adding new columns
# Filter the specific crime types
crime_select = ["DANGEROUS DRUGS", "ASSAULT 3 & RELATED OFFENSES", "PETIT LARCENY", "FELONY ASSAULT", "DANGEROUS WEAPONS"]
filtered = df.loc[df["OFNS_DESC"].isin(crime_select), ["ARREST_MONTH", "OFNS_DESC"]]
crime_counts = (filtered
.groupby(["ARREST_MONTH", "OFNS_DESC"])
.size()
.unstack(fill_value=0)) # Converts grouped data into a DataFrame
# Plot results
crime_counts.plot(figsize=(12,6), title="Monthly Crime Evolution")
plt.xlabel("Arrest Month")
plt.ylabel("Number of Arrests")
plt.legend(title="Crime Type")
plt.grid(True)
t2 = time.perf_counter()
print(f"Time taken to complete operations: {t2 - t1:0.4f} s")
plt.show()
The above code completed three runs in 2.5432, 2.6067 and 2.4947 s -> mean 2.5482 s, which works out to a ~36.56% reduction in run time.
Note: Did you include the dataset loading time in your execution time measurements? I found that keeping df
loaded and running only the calculations yields about 3.35 s for your code and 1.85 s for mine.
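If you want to report both numbers, a minimal sketch is to give the load and the calculations their own perf_counter pairs (csv_path below is a placeholder for your dataset file):
t0 = time.perf_counter()
df = pd.read_csv(csv_path)  # csv_path is a placeholder for your dataset file
t_load = time.perf_counter()
# ... run the filtering / grouping / plotting code from above here ...
t_calc = time.perf_counter()
print(f"Loading: {t_load - t0:0.4f} s | calculations: {t_calc - t_load:0.4f} s")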