For building large data sets in Bash, I'd strongly advise using an array in general.
It will scale much better than concatenated strings, and also offers much better handling and filtering performance, especially for very large sets of more than about 5,000 elements (depending on your machine's resources).
An exception to that rule of thumb would be to invoke a very efficient filtering tool like grep once over the whole data set, rather than calling it inside a large loop.
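As a minimal sketch of that idea (the array name bigArr and the pattern are hypothetical placeholders, and it assumes the elements contain no embedded newlines): instead of running grep once per element inside a loop, expand the whole array once and pipe it through a single grep call.
bigArr=( "alpha-1" "beta-2" "alpha-3" "gamma-4" )
pattern='^alpha-'
# Avoid: one grep process per element inside a loop
# for e in "${bigArr[@]}"; do grep -E "$pattern" <<< "$e"; done
# Prefer: expand the array once and filter it with a single grep call
mapfile -t filtered < <(printf '%s\n' "${bigArr[@]}" | grep -E "$pattern")
printf '%s\n' "${filtered[@]}"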
Also, I've found while loops to be a bit more efficient than for loops, and more flexible in many cases, especially if indices/counters are involved.
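To illustrate the counter-driven pattern (purely a sketch, with a hypothetical items count), here is the while form used in the benchmark below next to the equivalent C-style for loop:
items=5
i=0
# Counter-driven while loop: the condition increments i on every test
while [[ $((i++)) -lt $items ]]; do
    echo "while iteration $i"
done
# Equivalent C-style for loop
for (( i = 0; i < items; i++ )); do
    echo "for iteration $((i + 1))"
done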
The best-scaling way I've found to build a large concatenated string is to build an array and then convert it natively.
A comparative example for the URL concatenation problem above:
#!/bin/bash
# Requires Bash >= 5.0 for EPOCHREALTIME.

url="http://www.$(tr -dc 'a-zA-Z0-9' < /dev/urandom | fold -w $((4096-15)) | head -n 1).com"
items=2000

# Build array of items and measure time
i=0
startTime="$EPOCHREALTIME"
while [[ $((i++)) -lt $items ]]; do
    tmpArr+=("$url")
done
stopTime="$EPOCHREALTIME"
awk 'BEGIN { printf "%s %f\n", "SECONDS ARRAY building from loop:", '"$stopTime"' - '"$startTime"'; }'

# Bash/native expansion of array to string and measure time
# Mind the temporary IFS restriction and '[*]' to join the elements on newlines
startTime="$EPOCHREALTIME"
_IFS="$IFS"; IFS=$'\n'
tmpVar="${tmpArr[*]}"
IFS="$_IFS"
stopTime="$EPOCHREALTIME"
awk 'BEGIN { printf "%s %f\n", "SECONDS VAR from array expansion:", '"$stopTime"' - '"$startTime"'; }'

# Directly concatenate strings (reset the counter first)
tmpVar=
i=0
startTime="$EPOCHREALTIME"
while [[ $((i++)) -lt $items ]]; do
    tmpVar+="$url"$'\n'
done
stopTime="$EPOCHREALTIME"
awk 'BEGIN { printf "%s %f\n", "SECONDS VAR direct building:", '"$stopTime"' - '"$startTime"'; }'

# Looped printf output concatenation
startTime="$EPOCHREALTIME"
tmpVar="$(
    for (( i=0; i < items; i++ )); do
        printf '%s\n' "$url"
    done
)"
stopTime="$EPOCHREALTIME"
awk 'BEGIN { printf "%s %f\n", "SECONDS VAR from printf loop:", '"$stopTime"' - '"$startTime"'; }'

# Print tmpVar concatenated string
#echo -e "\nVAR ELEMENTS:"
#printf '%s\n' "$tmpVar"
For 2,000 items, directly concatenating a string may still work fast enough.
This changes quickly, though, for a higher number of items (depending on your system's resources).
While the effort for building an array and converting it natively is minimal and grows only linearly, the time for directly concatenating strings grows roughly quadratically, because the ever-growing string has to be copied again on each append, up to the point of stalling your script; the solution chosen above shows even worse performance:
Items: 2,000
SECONDS ARRAY building from loop: 0.073598
SECONDS VAR from array expansion: 0.129333
SECONDS VAR direct building: 0.641954
SECONDS VAR from printf loop: 1.196127
For 5,000 items:
SECONDS ARRAY building from loop: 0.201801
SECONDS VAR from array expansion: 0.326866
SECONDS VAR direct building: 4.797142
SECONDS VAR from printf loop: 6.986455
For 10,000 items:
SECONDS ARRAY building from loop: 0.378829
SECONDS VAR from array expansion: 0.674607
SECONDS VAR direct building: 18.280629
SECONDS VAR from printf loop: 27.756085
For 100,000 items:
SECONDS ARRAY building from loop: 3.748199
SECONDS VAR from array expansion: 6.427268
SECONDS VAR direct building: ??? >20 minutes
SECONDS VAR from printf loop: ???