Discussion:
[Help-bash] Is there a way to read the first empty field in a TSV input?
Peng Yu
2017-09-28 15:56:14 UTC
Hi,

The following example shows that the variable "a" gets the value of
"x" which is the 2nd field of the input. I'd like "a" to always get
the first field even if it is empty. Is this possible with bash?

~$ IFS=$'\t' read -r a b c <<< $'\t'x$'\t'y
~$ echo $a
x
~$ echo $b
y
~$ echo $c
--
Regards,
Peng
Greg Wooledge
2017-09-28 16:05:10 UTC
Post by Peng Yu
Hi,
The following example shows that the variable "a" gets the value of
"x" which is the 2nd field of the input. I'd like "a" to always get
the first field even it may be empty. Is possible with bash?
~$ IFS=$'\t' read -r a b c <<< $'\t'x$'\t'y
Whitespace characters in IFS are treated differently than non-whitespace
characters in IFS.

If your input file has whitespace characters as separators but you want
them to be treated the same way that, say, colons are treated in
/etc/passwd then you have a couple choices:

1) Use a language other than bash.
2) Convert the separators into some other character that bash treats
the way you want.

For example:

IFS=$'\005' read -r a b c < <(tr $'\t' $'\005' <<< $'\tx\ty')

Here, I arbitrarily chose ASCII character 0x05 as the new separator,
on the assumption that it cannot appear in your data. Use tr to
convert all the tabs to 0x05 upon input, and set IFS to that char.

If your data has no "safe" characters like 0x05 or DEL that you can
use as a separator, then there are other scripting languages.
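
The same conversion can be applied to a whole multi-line stream rather than a single record. The following is only a minimal sketch under the same assumption as above (byte 0x05 never occurs in the data); the sample input and the names `sep`, `rows`, and `input` are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: read a multi-line TSV stream while keeping empty fields.
# Assumption: byte 0x05 never appears in the data.

sep=$'\005'
rows=()

# Hypothetical sample input: three fields per line, some of them empty.
input=$'\tx\ty\np\t\tr\n'

# tr rewrites every tab to 0x05; read then splits on 0x05 only,
# so each separator is significant and empty fields survive.
while IFS=$sep read -r a b c; do
    rows+=("[$a][$b][$c]")
done < <(printf '%s' "$input" | tr '\t' "$sep")

printf '%s\n' "${rows[@]}"
```

Because the loop is fed by process substitution rather than a pipe, it runs in the current shell and `rows` is still populated after the loop ends.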
Evan Gates
2017-09-28 16:39:34 UTC
Post by Greg Wooledge
Post by Peng Yu
Hi,
The following example shows that the variable "a" gets the value of
"x" which is the 2nd field of the input. I'd like "a" to always get
the first field even it may be empty. Is possible with bash?
~$ IFS=$'\t' read -r a b c <<< $'\t'x$'\t'y
Whitespace characters in IFS are treated differently than non-whitespace
characters in IFS.
If your input file has whitespace characters as separators but you want
them to be treated the same way that, say, colons are treated in
/etc/passwd then you have a couple choices:
1) Use a language other than bash.
2) Convert the separators into some other character that bash treats
the way you want.
A third option, manually loop. I ran into this recently writing a gopher
client in bash where the separator is tab and all other characters are
allowed IIRC. My solution was:


pack() {
    printf '%s\t' "$ft" "$disp" "$sel" "$host" "$port" "$search"
}

unpack() {
    local line
    IFS= read -r line
    for k in ft disp sel host port search; do
        printf -v "$k" %s "${line%%$'\t'*}"
        line=${line#*$'\t'}
    done
}
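
A round trip through these two functions might look like the sketch below. The field values are invented for the example, and `k` is declared local here, which the version above does not do. Note that pack writes a tab after every field, including the last one, which is what lets unpack treat every field the same way.

```bash
#!/usr/bin/env bash
# Sketch: round-trip six fields, two of them empty, through pack/unpack.

pack() {
    # Every field is followed by a tab, including the last one.
    printf '%s\t' "$ft" "$disp" "$sel" "$host" "$port" "$search"
}

unpack() {
    local line k
    IFS= read -r line                          # empty IFS: keep leading/trailing tabs
    for k in ft disp sel host port search; do
        printf -v "$k" %s "${line%%$'\t'*}"    # text before the first tab
        line=${line#*$'\t'}                    # drop that field and its tab
    done
}

# Hypothetical values; disp and search are deliberately empty.
ft=1 disp='' sel=/docs host=example.org port=70 search=''
packed=$(pack)

# Overwrite everything so the result can only come from unpack.
ft=x disp=x sel=x host=x port=x search=x
unpack <<< "$packed"

declare -p ft disp sel host port search
```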

emg
Greg Wooledge
2017-09-28 17:47:01 UTC
Post by Evan Gates
A third option, manually loop. I ran into this recently writing a gopher
client in bash where the separator is tab and all other characters are
pack() {
printf '%s\t' "$ft" "$disp" "$sel" "$host" "$port" "$search"
}
unpack() {
local line
IFS= read -r line
for k in ft disp sel host port search; do
printf -v "$k" %s "${line%%$'\t'*}"
line=${line#*$'\t'}
done
}
OK, yeah. Another unpack implementation would look something like:

for k in ft disp sel host port search; do
    IFS= read -r -d $'\t' "$k"
done <<< "$line"
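
As a small self-contained sketch of that loop, with a made-up three-field record and hypothetical variable names a, b and c:

```bash
#!/usr/bin/env bash
# Sketch: consume one tab-terminated field per read call.
# Assumption: every field, including the last, is followed by a tab
# (which is what pack produces).

line=$'\tx\ty\t'      # hypothetical record: empty first field, then x, then y

for k in a b c; do
    # -d $'\t' makes tab the record terminator; the empty IFS
    # prevents any trimming of the field's contents.
    IFS= read -r -d $'\t' "$k"
done <<< "$line"

declare -p a b c
```

This depends on the trailing tab. If the last field were not tab-terminated, the final read would run to end of input and the variable would also pick up the newline that the here-string appends.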
Evan Gates
2017-09-29 16:26:03 UTC
Post by Greg Wooledge
Post by Evan Gates
A third option, manually loop. I ran into this recently writing a gopher
client in bash where the separator is tab and all other characters are
pack() {
printf '%s\t' "$ft" "$disp" "$sel" "$host" "$port" "$search"
}
unpack() {
local line
IFS= read -r line
for k in ft disp sel host port search; do
printf -v "$k" %s "${line%%$'\t'*}"
line=${line#*$'\t'}
done
}
for k in ft disp sel host port search; do
IFS= read -r -d $'\t' "$k"
done <<< "$line"
Ah, much better. Thanks!
Felipe Salvador
2017-10-05 21:37:35 UTC
Post by Greg Wooledge
Post by Peng Yu
Hi,
The following example shows that the variable "a" gets the value of
"x" which is the 2nd field of the input. I'd like "a" to always get
the first field even it may be empty. Is possible with bash?
~$ IFS=$'\t' read -r a b c <<< $'\t'x$'\t'y
2) Convert the separators into some other character that bash treats
the way you want.
IFS=$'\005' read -r a b c < <(tr $'\t' $'\005' <<< $'\tx\ty')
IFS=$'\005' read -r A B C D < <(tr $'\t' $'\005' <<< $'\t2\t4')

Hi,
I'm a bit confused, is IFS=$'\005' treated as a field?
If so I would expect $A= ,$B=2,$C= ,$D=4 and so on...

But I get $A= ,$B=2,$C=4,$D= , while echoing \"$'\t2\t4'\"
returns " 2 4" correctly.

Regards
--
Felipe Salvador
Greg Wooledge
2017-10-06 12:51:34 UTC
Post by Felipe Salvador
Post by Greg Wooledge
2) Convert the separators into some other character that bash treats
the way you want.
IFS=$'\005' read -r a b c < <(tr $'\t' $'\005' <<< $'\tx\ty')
IFS=$'\005' read -r A B C D < <(tr $'\t' $'\005' <<< $'\t2\t4')
Hi,
I'm a bit confused, is IFS=$'\005' treated as a field?
If so I would expect $A= ,$B=2,$C= ,$D=4 and so on...
IFS is a list of characters that may act as field separators/terminators.
In this example, IFS has been set to the single character 0x05 (Ctrl-E,
or ASCII "ENQ"). 0x05 is the only character that 'read' will use to
separate input fields.

The input that's sent to read is a stream of 5 characters:

0x05 2 0x05 4 \n

Therefore 'read' will split it into fields as follows:

first field empty
second field '2'
third field '4'
Post by Felipe Salvador
But I get $A= ,$B=2,$C=4,$D=
That's correct.

Remember, the entire PURPOSE of this example was to take a tab-separated-
value input file and parse it in such a way that there can be empty
fields before/between the tabs.

The way bash normally handles tabs in IFS is to consider a sequence of
multiple tabs as a single field separator, and to ignore leading and
trailing tabs.

The OP wanted each tab to be significant, as if they were commas or
pipe signs or colons.

The proposed workaround is to transform the tabs into something that
isn't treated as whitespace by bash.
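
To make the contrast concrete, here is a rough side-by-side sketch. The record and the variable names w1..w4 and n1..n4 are invented for illustration, and it again assumes 0x05 does not occur in the data.

```bash
#!/usr/bin/env bash
# Sketch: the same record split with a whitespace separator (tab)
# and with a non-whitespace separator (0x05).

record=$'\t2\t\t4'    # four fields: empty, 2, empty, 4

# Tab is IFS whitespace: the leading tab is ignored and the
# run of two tabs counts as a single separator.
IFS=$'\t' read -r w1 w2 w3 w4 <<< "$record"

# 0x05 is not whitespace: every separator is significant.
IFS=$'\005' read -r n1 n2 n3 n4 < <(tr $'\t' $'\005' <<< "$record")

declare -p w1 w2 w3 w4
declare -p n1 n2 n3 n4
```

With tab as the separator the two values slide into w1 and w2 and the empty fields are lost; with 0x05 they stay in the second and fourth positions.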

If this example isn't clicking for you, then let's try colons instead
of 0x05 characters. (This can't be used if there are possibly colons
in the actual input data.)

wooledg:~$ tr '\t' : <<< $'\t2\t4'
:2:4
wooledg:~$ IFS=: read -r a b c d < <(tr '\t' : <<< $'\t2\t4')
wooledg:~$ declare -p a b c d
declare -- a=""
declare -- b="2"
declare -- c="4"
declare -- d=""

Using 0x05 instead of : is exactly the same, except that 0x05 is a bit
less likely to appear in the actual input.
Post by Felipe Salvador
, while echoing \"$'\t2\t4'\"
return " 2 4" correctly.
What are you trying to do, exactly?
Felipe Salvador
2017-10-07 14:39:34 UTC
Post by Greg Wooledge
Post by Felipe Salvador
Post by Greg Wooledge
2) Convert the separators into some other character that bash treats
the way you want.
IFS=$'\005' read -r a b c < <(tr $'\t' $'\005' <<< $'\tx\ty')
IFS=$'\005' read -r A B C D < <(tr $'\t' $'\005' <<< $'\t2\t4')
Hi,
I'm a bit confused, is IFS=$'\005' treated as a field?
If so I would expect $A= ,$B=2,$C= ,$D=4 and so on...
IFS is a list of characters that may act as field separators/terminators.
In this example, IFS has been set to the single character 0x05 (Ctrl-E,
or ASCII "ENQ"). 0x05 is the only character that 'read' will use to
separate input fields.
0x05 2 0x05 4 \n
first field empty
second field '2'
third field '4'
Post by Felipe Salvador
But I get $A= ,$B=2,$C=4,$D=
That's correct.
Remember, the entire PURPOSE of this example was to take a tab-separated-
value input file and parse it in such a way that there can be empty
fields before/between the tabs.
The way bash normally handles tabs in IFS is to consider a sequence of
multiple tabs as a single field separator, and to ignore leading and
trailing tabs.
I was wrongly considering $'\005' as a field rather than a separator,
as below:

$'\005'| 2 |$'\005'| 4

Now I get it:

$ IFS=$'\005' read -r A B C D E F G < <(tr $'\t' $'\005' <<< $'\t2\t\t4\t\t6\t\t')

$ declare -p A B C D E F G
declare -- A=""
declare -- B="2"
declare -- C=""
declare -- D="4"
declare -- E=""
declare -- F="6"
declare -- G=""
Post by Greg Wooledge
The OP wanted each tab to be significant, as if they were commas or
pipe signs or colons.
The proposed workaround is to transform the tabs into something that
isn't treated as whitespace by bash.
If this example isn't clicking for you, then let's try colons instead
of 0x05 characters. (This can't be used if there are possibly colons
in the actual input data.)
wooledg:~$ tr '\t' : <<< $'\t2\t4'
:2:4
wooledg:~$ IFS=: read -r a b c d < <(tr '\t' : <<< $'\t2\t4')
wooledg:~$ declare -p a b c d
declare -- a=""
declare -- b="2"
declare -- c="4"
declare -- d=""
Using 0x05 instead of : is exactly the same, except that 0x05 is a bit
less likely to appear in the actual input.
Post by Felipe Salvador
, while echoing \"$'\t2\t4'\"
return " 2 4" correctly.
What are you trying to do, exactly?
I'm trying to learn something more, for the sake of knowledge.

Thank you very much Greg, for your patience and your thorough
explanation.
--
Felipe Salvador