Post by Bruce Hohl
@Greg, it is an interesting happenstance that you replied, as my question
arose from my pass at completing your duplicate file finder "exercise" at "If you want to "fix" this
"problem", you might suppress all the printing until the end, and then
iterate over the whole array and print only those values that contain a
newline. (This is left as an exercise.)" So with your suggestion to use
=== Duplicate file finder exercise === (NO comments)
while read -r md5_hash file; do
  declare -n ind_var_hash=$var_hash
  declare -a ${!ind_var_hash}+="('$file')"
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)
declare -n e
echo ${!e}
So your approach was to experiment with bash commands until you found
something that would approximate giving you the ability to have a hash
of lists (associative array of indexed arrays).
And what you came up with was using the entire bash variable namespace
as your hash, and storing each list as a separate indexed array within
that namespace.
That's... definitely not how I would have done it. ;-)
You're also missing some quotes.
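For anyone trying to picture that approach, here is a rough sketch of what a fully quoted version could look like. This is an illustration only, not Bruce's actual code: the add_file function, the hash_ variable prefix and the sample names are all made up, and it assumes bash 4.3 or later for namerefs.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: one global indexed array per hash, named hash_<md5>.
add_file() {
  local md5=$1 file=$2
  local -n list="hash_$md5"   # nameref to the per-hash array (bash 4.3+)
  list+=("$file")             # quoted, so spaces in names survive
}

# Sample data standing in for md5sum output (the hashes are fake).
add_file aaa 'one.txt'
add_file aaa 'two words.txt'
add_file bbb 'three.txt'

# Walk every variable whose name starts with hash_ and report groups of 2+.
declare -n group
for group in "${!hash_@}"; do
  if (( ${#group[@]} > 1 )); then
    name=${!group}            # name of the array the nameref points at
    printf 'Matching MD5 %s:\n' "${name#hash_}"
    printf '%s\n' "${group[@]}"
    echo
  fi
done
```

It can be made to work, but every hash becomes a global variable, and anything else in the shell whose name happens to start with hash_ gets swept up by the loop too.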
Anyway, here is the solution that I had in mind for that:
declare -A seen
while read -r md5 file; do
  if [[ ${seen[$md5]} ]]; then
    # Already seen: append to the newline-delimited pseudo-list.
    seen[$md5]+=$'\n'$file
  else
    seen[$md5]=$file
  fi
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)
for i in "${!seen[@]}"; do
  if [[ ${seen[$i]} = *$'\n'* ]]; then
    printf 'Matching MD5:\n%s\n\n' "${seen[$i]}"
  fi
done
The stuff I wrote in the text was really quite literal: "store multiple
filenames for each MD5 value (in a newline-delimited pseudo-list)" and
"iterate over the whole array and print only those values that contain
a newline". That's what I'm doing here.
This is also a hack, using newlines to store multiple elements of a list
in a string variable, and this only works because we're already excluding
filenames that have a newline in them. This frees up the newline character
to act as a list delimiter.
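To make the pseudo-list idea concrete, here is a minimal sketch, separate from the script above, of building a newline-delimited string and then splitting it back into a real indexed array. The variable names list and files and the sample filenames are made up for illustration, and it assumes bash 4 or later for mapfile.

```bash
#!/usr/bin/env bash
# Build a newline-delimited pseudo-list in a plain string variable.
list=
for f in 'a.txt' 'dir/b c.txt' 'd.txt'; do
  if [[ $list ]]; then
    list+=$'\n'$f
  else
    list=$f
  fi
done

# Split it back into a real indexed array, one element per line.
# This is only safe because no element can itself contain a newline.
mapfile -t files <<< "$list"

printf 'count: %s\n' "${#files[@]}"
printf '<%s>\n' "${files[@]}"
```

Spaces inside an element survive the round trip; a newline inside an element would not, which is exactly why the find command has to prune those filenames first.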
In the absence of that opening, I would simply have written the program
in a different language -- one that allows you to create a hash of lists
without needing special hacks and tricks.
For example, a relatively straight conversion to Tcl:
#!/usr/bin/env tclsh
if {[llength $argv]} {set start [lindex $argv 0]} else {set start .}
foreach line [split \
  [exec find $start -name "*\n*" -prune -o -type f -exec md5sum "{}" +] \
  \n] {
    set md5 [string range $line 0 31]
    set file [string range $line 34 end]
    lappend seen($md5) $file
}
foreach i [array names seen] {
    if {[llength $seen($i)] < 2} continue
    puts [format "Matching MD5: %s" [join $seen($i) { }]]
}
The output format is slightly different, but of course that can
be adjusted. The elements of "seen" are simply lists of filenames,
as this language supports this directly. I'm sure a similar solution
could be written in Python (which I don't know well enough to write in).
The only reason this solution is excluding filenames with newlines is
because of the md5sum command's output format.
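That format, as far as I know for GNU md5sum, is 32 hex digits, two separator characters, then the name, one line per file. If a name contains a newline or a backslash, the line starts with a backslash and the offending characters are escaped, so the output no longer carries the raw name. Here is a small sketch of fixed-offset parsing in bash, using the same offsets as the Tcl version. The function name parse_md5_line is made up and the sample line is invented.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split one line of md5sum output at fixed offsets.
# Sets the global variables md5 and file; returns 1 for escaped lines.
parse_md5_line() {
  local line=$1
  # A leading backslash means md5sum escaped something in the filename.
  [[ $line == \\* ]] && return 1
  md5=${line:0:32}    # the 32 hex digits
  file=${line:34}     # skip the two separator characters
}

# Made-up sample line; the filename starts with a space on purpose.
if parse_md5_line 'd41d8cd98f00b204e9800998ecf8427e   notes.txt'; then
  printf 'md5=<%s> file=<%s>\n' "$md5" "$file"
fi
```

Unlike read -r md5 file, slicing by offset keeps leading and trailing spaces in the name intact.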