Post by Bruce Hohl
@Greg, it is an interesting happenstance that you replied, as my question
arose from my pass at completing your duplicate file finder "exercise" at
mywiki.wooledge.org/BashProgramming/04: "If you want to "fix" this
"problem", you might suppress all the printing until the end, and then
iterate over the whole array and print only those values that contain a
newline. (This is left as an exercise.)" So with your suggestion to use
=== Duplicate file finder exercise === (NO comments)
#!/bin/bash
while read -r md5_hash file; do
var_hash=md5_$md5_hash
declare -n ind_var_hash=$var_hash
declare -a ${!ind_var_hash}+="('$file')"
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)
declare -n e
echo ${!e}
done
So your approach was to experiment with bash commands until you found
something that would approximate giving you the ability to have a hash
of lists (associative array of indexed arrays).
And what you came up with was using the entire bash variable namespace
as your hash, and storing each list as a separate indexed array within
that namespace.
That's... definitely not how I would have done it. ;-)
You're also missing some quotes.
Anyway, here is the solution that I had in mind for that:
=====================================================
#!/bin/bash
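declare -A seen    # MD5 hash -> newline-delimited list of filenames
while read -r md5 file; do
    if [[ ${seen[$md5]} ]]; then
        # Hash already seen: append this filename to its pseudo-list.
        seen[$md5]+=$'\n'$file
    else
        seen[$md5]=$file
    fi
done < <(find "${1:-.}" -name $'*\n*' -prune -o -type f -exec md5sum {} +)

# Print only the values that hold more than one filename.
for i in "${!seen[@]}"; do
    if [[ ${seen[$i]} = *$'\n'* ]]; then
        printf 'Matching MD5:\n%s\n\n' "${seen[$i]}"
    fi
done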
=====================================================
The stuff I wrote in the text was really quite literal: "store multiple
filenames for each MD5 value (in a newline-delimited pseudo-list)" and
"iterate over the whole array and print only those values that contain
a newline". That's what I'm doing here.
This is also a hack, using newlines to store multiple elements of a list
in a string variable, and this only works because we're already excluding
filenames that have a newline in them. This frees up the newline character
to act as a list delimiter.
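
If the individual filenames are needed again later, the newline-delimited value can be split back into a real indexed array. The following is only a minimal sketch, assuming bash 4 or newer; the function name split_list, the array name files, and the sample key and filenames are all made up for illustration and are not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper: split a newline-delimited pseudo-list into the
# indexed array "files". This only works if no element contains a newline.
split_list() {
    # -t drops the trailing newline from each element.
    mapfile -t files <<< "$1"
}

declare -A seen
seen[abc]=$'./a.txt\n./dir/b c.txt'    # made-up key and filenames

split_list "${seen[abc]}"
printf '%d files\n' "${#files[@]}"
printf '<%s>\n' "${files[@]}"
```

Spaces inside a filename survive, because only the newline is treated as a separator.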
In the absence of that opening, I would simply have written the program
in a different language -- one that allows you to create a hash of lists
without needing special hacks and tricks.
For example, a relatively straight conversion to Tcl:
=====================================================
#!/usr/bin/env tclsh
if {[llength $argv]} {set start [lindex $argv 0]} else {set start .}
foreach line [split \
    [exec find $start -name "*\n*" -prune -o -type f -exec md5sum "{}" +] \
    \n] {
    set md5 [string range $line 0 31]
    set file [string range $line 34 end]
    lappend seen($md5) $file
}
foreach i [array names seen] {
    if {[llength $seen($i)] < 2} continue
    puts [format "Matching MD5: %s" [join $seen($i) { }]]
}
=====================================================
The output format is slightly different, but of course that can
be adjusted. The elements of "seen" are simply lists of filenames,
as this language supports this directly. I'm sure a similar solution
could be written in Python (which I don't know well enough to write in).
The only reason this solution is excluding filenames with newlines is
because of the md5sum command's output format.
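
To see what that output format does with such a name, a quick experiment along these lines may help. This is a sketch that assumes GNU coreutils md5sum, which (as far as I know) marks the line with a leading backslash and escapes the newline in the name; other md5sum implementations may behave differently. The temporary directory and the filename are throwaway values chosen for the demonstration.

```bash
#!/bin/bash
# Sketch: show how md5sum prints a filename that contains a newline.
# Assumes GNU coreutils md5sum; the file below is created only for the test.
tmp=$(mktemp -d) || exit 1
printf 'hello\n' > "$tmp"/$'two\nlines'

# Run md5sum from inside the directory so the output holds just the name.
out=$(cd "$tmp" && md5sum -- *)
printf '%s\n' "$out"

rm -rf -- "$tmp"
```

With GNU md5sum the result should be a single line that starts with a backslash, so a simple "hash, two spaces, name" parse would no longer give back the real filename.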