Using gawk interactively to query a database

In my previous post I showed you how to install a JBDC client to access a MaxDB database. Since I got that running I ran into an issue where I needed to retrieve a filepath from my database. This filepath was broken up into it individual components with an ID and PARENT_ID value. The idea is that you keep querying the database and prepending the filepath until you reach a PARENT_ID of 0. I’ll illustrate:

IDPARENT_IDNAMETYPE
958657my_file.txtfile
657 832Documentsdirectory
832 12Ivodirectory
12 0Usersdirectory

I want to retrieve the filepath for the file with ID 958. I copy the value from the database and do another query, looking up 657 this time. Now I prepend ‘Documents’ to what I already have. Repeat until PARENT_ID is 0. I end up with ‘/Users/Ivo/Documents/my_file.txt’.

The advantages is that you have to store way less information in the database if you don’t have to keep a full path for each file and it’s easy to find all the items that are in one directory, you just select everything with the same PARENT_ID. The disadvantage is that you have to do multiple queries. Normally this is not too much of a problem when you open a database connection, can do multiple queries in a row and then close the connection. But how do I do this from a bash script? The best I can do is restart the JDBC client for each part of the filepath. This means starting up a whole Java VM, opening and closing a connection every time. This means the example above takes about 7 seconds to complete. This gets worse when you have multiple files to look up.

So we look for alternatives. Unfortunately the PHP connector for MaxDB is ancient so that wasn’t an option. If there only was a way to make the JDBC client scriptable. Luckily, there is a way! We can use (g)awk to run the client as a ‘coprocess’ and send and receive data to/from it. But first we start of with some awk basics, namely pipes. These provide ways to get output from a program or send data to a program. I’ll give some examples:

awk 'BEGIN{"date" | getline x; print x; exit}'
Tue Dec 14 11:21:28 CET 2021

This example starts the ‘date’ program, captures a line of output (we’re expecting only one anyway), prints it and then exits. If we expect more output we can loop getline until it returns 0.

awk 'BEGIN{print "banana\norange\napple" | "sort"; exit}'
apple
banana
orange

This example throws three pieces of fruit at sort, which (unsurprisingly) sorts them.

The thing to take away from these examples is that they are one direction only and they are ‘one-shot’, meaning the external program is executed and then finishes, it doesn’t stay around to interact with. We can keep the external program alive by defining a pipe. An example:

awk 'BEGIN{
s="sort";
print "banana" | s;
print "orange" | s;
print "apple" | s;

print "-1-";
close(s);
print "-2-";
exit
}'
-1-
apple
banana
orange
-2-

The first time you put a line of text into ‘s’ it turns from a string to a pipe, or a handle to a running process. You can then feed more data into it. Once you are done, you can close the pipe, thereby closing the input to the program, which then finishes. But it’s still one-way communication. To fix that last step, we use a special token, namely ‘|&’. We’ll use the program ‘rev’ for this example. Any input to it just reverses it, so ‘123’ becomes ‘321’:

awk 'BEGIN{
pipe="rev"

print "Hello" |& pipe;
pipe |& getline t;
print "-" t "-";

exit
}'

Euhm.. ok.. so this should work. Why doesn’t it? It just hangs there, doing nothing.

There’s one more thing we have to learn about before we can use this, namely: buffering. When a program sends some output, the computer likes to wait until it has a sizeable chunk of data before actually writing it to the screen. It’s way faster to write 10 lines of text to the screen at once, then writing each letter individually for example. However this bites us in the butt right now because getline is waiting for ‘rev’ to send output but the operating system (it’s actually a part of libc) is holding off on sending the output to awk until it has more data. So we have to tell the operating system to stop buffering. We can use a program for this called ‘stdbuf’ (part of the coreutils package on Ubuntu). This allows us to change how the three datastreams of a program (input, output and error) are handled. Let’s retry our example:


awk 'BEGIN{
pipe="stdbuf -i0 -o0 -e0 rev"

print "Hello" |& pipe;
pipe |& getline t;
print "-" t "-";

print "world" |& pipe;
pipe |& getline t;
print "-" t "-";

exit
}'
-olleH-
-dlrow-

Now the example works! We output to, and input from the program. While we don’t do anything intelligent with this program, it does show how we can now communicate interactively with an external program running aside awk.

Ok, since I’m pretty sure you don’t have a MaxDB database lying around, I’ll just show you the script that I ended up making to look up a filepath. But with what I explained to you, you should be able to follow along easily:

awk 'BEGIN{
FS=OFS="\t";
pipe="sqlline -u 'jdbc:sapdb://my_server/my_db' -n my_user -p my_password --silent=true 2>/dev/null";
id=958;
print "!outputformat tsv"      |& pipe;
print "!set showHeader false"  |& pipe;

name="";
do{
  print "SELECT ID, PARENT_ID, NAME, TYPE FROM OBJECT WHERE ID = " id ";" |& pipe
  pipe |& getline
  stripQuotes();
  name = $3 ($4 != "file" ? "/" : "") name;
  id = $2;
  } while($2 != 0);
print name;

exit
}

function stripQuotes()
{
sub(/^"/, "");
gsub(/"\t"/, "\t");
gsub(/\t"/, "\t");
sub(/"$/, "");

gsub(/""/, "\"");      # Quotes are escaped inside values. Undo this
}
'
/Users/Ivo/Documents/my_file.txt

Leave a Comment

Your email address will not be published. Required fields are marked *