
Solving ETL Challenges: 5 Command-Line Tools You Need to Know

Author: Gajanan Gaidhane | 4 min read | May 14, 2020

ETL (extract, transform, load) can be as essential to your business as it can be painful. Without the necessary experience and technical skill set, dealing with ETL issues such as inefficient queries and poor system design can occupy far too much of your time and effort.

The good news: in many cases, you can diagnose and solve some of the most pressing ETL challenges using nothing more than the command-line tools that you already have at your disposal. Below, we’ll go over 5 scripts and programs that can help you immensely in your quest to optimize ETL.

1. netcat

Network latency can be a silent killer for your ETL workflow, causing data transfers to execute unbearably slowly. Although caching may help reduce the amount of data flowing through the network, it can only do so much, given the vast quantities of information taken up by ETL processes.

netcat is a Unix and Microsoft Windows utility that reads from and writes to network connections. The use cases of netcat include diagnosing network latency issues by testing the network’s speed.

There are multiple tutorials for testing network speeds in netcat (e.g. by Wayne E. Goodrich, James Bowes, and Rui Coelho), but each one involves just a few basic steps.

First, on the server machine, run the following terminal command:

nc -v -l 2222 > /dev/null

Next, on the client machine, run the dd command for copying files:

dd if=/dev/zero bs=1024K count=512 | nc -v $IP_OF_FIRST_MACHINE 2222

The results of this command will give you the network speed in megabytes per second, e.g.:

536870912 bytes (537 MB) copied, 4.87526 s, 117 MB/s

For the most accurate picture of your network speeds, it’s best to run the netcat test several times and take readings under different conditions: for example, while an ETL job is running, and again outside normal business hours.
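If you are collecting several readings, you don’t need to eyeball dd’s summary each time: the throughput figure can be pulled out with standard text tools. A minimal sketch, using the sample summary line shown above (dd prints its statistics as comma-separated fields, with the speed as the last one):

```shell
# Extract the throughput figure from dd's summary line.
# dd reports "<bytes> bytes (<size>) copied, <seconds> s, <speed>",
# so splitting on ", " and taking the last field yields the speed.
summary='536870912 bytes (537 MB) copied, 4.87526 s, 117 MB/s'
speed=$(printf '%s\n' "$summary" | awk -F', ' '{print $NF}')
echo "$speed"   # prints "117 MB/s"
```

In practice you would capture dd’s stderr (where it writes the summary) and feed it through the same awk filter, appending each reading to a log for comparison.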

2. iperf

Like netcat, iperf is a command-line tool for measuring and tuning network performance. Linode provides a simple tutorial for testing network throughput with iperf.

As before, you will need to designate one machine as the server and the other as the client. On the server machine, run the following terminal command:

iperf -s

Then, on the client machine, use iperf to connect to the server (replacing 192.168.1.1 with the server’s IP address):

iperf -c 192.168.1.1

The results of this command will include the network speed under the “Bandwidth” heading, e.g.:

------------------------------------------------------------
Client connecting to 192.168.1.1, TCP port 5001
TCP window size: 45.0 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.2 port 50616 connected with 192.168.1.1 port 5001
[ ID] Interval    Transfer   Bandwidth
[ 3] 0.0-10.1 sec 1.27 GBytes 1.08 Gbits/sec
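As with the netcat test, the bandwidth figure can be extracted from iperf’s result line for logging or scripted comparisons. A minimal sketch, using the sample interval line above (the value and its unit are the last two whitespace-separated fields):

```shell
# Extract the bandwidth figure from iperf's interval line.
# awk splits on whitespace by default, so the value is the
# second-to-last field and the unit is the last.
result='[  3] 0.0-10.1 sec 1.27 GBytes 1.08 Gbits/sec'
bandwidth=$(printf '%s\n' "$result" | awk '{print $(NF-1), $NF}')
echo "$bandwidth"   # prints "1.08 Gbits/sec"
```

A loop around `iperf -c` piped through this filter gives you a quick series of readings to compare across different times of day.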

3. PuTTY

PuTTY is an SSH and Telnet client for Windows and Unix-like systems that is used to execute remote sessions on a computer over a network. Running PuTTY enables you to use a computer via terminal commands, even if the computer is physically distant from your location.

Users most commonly interact with PuTTY using the graphical interface, which can be easily invoked from the command line:

putty

From here, you can specify the destination to which you want to connect, as well as load previously saved sessions.

4. top

The PuTTY client is especially useful for ETL when used with command-line tools such as top, which displays the running processes in Linux. You simply have to execute the terminal command:

top

PuTTY and top are invaluable for ETL performance monitoring and optimization on remote machines. What’s more, we can use pipes to search through the output of top for the ETL processes that we most care about. Note that top must run in batch mode (the -b flag) for its output to be piped; otherwise it refuses to write to anything but a terminal. For example, the following command takes a single snapshot (-n 1) and returns any processes that match the sequence “java”:

top -b -n 1 | grep java

With the right search query, you can easily use top to oversee the activity of various Oracle ETL components such as Java, OBIEE, and Informatica.
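The snapshot-and-filter pattern can be wrapped up as a one-liner you run over PuTTY. A minimal sketch, assuming procps top on Linux (macOS top uses -l 1 instead of -b -n 1); “java” is a placeholder for whichever ETL worker process you want to watch:

```shell
# One non-interactive snapshot of top (-b batch mode, -n 1 one iteration),
# filtered for a process name. The fallback message keeps the pipeline
# from exiting silently when nothing matches.
top -b -n 1 | grep -i 'java' || echo "no matching processes"
```

Swapping the pattern for “obiee” or “informatica” (or piping through `grep -E 'java|obiee'`) lets one command watch several components at once.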

5. df

Last but not least, the df tool displays the amount of disk space available and in use on the current system. Lack of space is one of the biggest ETL challenges that many organizations face. As your ETL processes grow and scale, you’ll need more space for logging, data transformations, and staging areas.

The df command is easily invoked as follows:

df

This displays disk usage in 1 KB blocks by default. For a more human-readable format (MB, GB), add the -h flag:

df -h
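Because df’s output is machine-parsable, it is easy to script a guardrail around your staging areas. A minimal sketch, assuming a hypothetical staging directory (/tmp here as a stand-in) and a 1 GB threshold; the -P flag forces POSIX output so the “Available” column is reliably the fourth field of the second line:

```shell
# Warn when the filesystem backing an ETL staging area runs low on space.
# /tmp and the 1 GB threshold are placeholders for your own setup.
staging_dir=/tmp
threshold_kb=1048576   # 1 GB, expressed in df's default 1 KB blocks
avail_kb=$(df -P "$staging_dir" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$threshold_kb" ]; then
  echo "WARNING: only ${avail_kb} KB free under ${staging_dir}"
else
  echo "OK: ${avail_kb} KB free under ${staging_dir}"
fi
```

Run from cron before large ETL jobs kick off, a check like this catches a full staging disk before it fails a load halfway through.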

Conclusion

The five command-line tools above are essential for anyone looking to get started with ETL performance tuning and optimization. For more complicated issues, however, you may need to seek the help of ETL experts like Datavail.

Want to see how Datavail helps solve our clients’ most pressing ETL challenges? Download our white paper “The Top 5 Challenges of ETL (And How to Solve Them).”
