Allow me to preface this article by saying that I’m not a terminal person. I don’t use Vim. I find sed, grep, and awk convoluted and counter-intuitive. I prefer seeing my files in a nice UI. Despite all that, I got into the habit of reaching for command-line interfaces (CLIs) when I had small, dedicated tasks to complete. Why? I’ll explain all of that below. In this article, you’ll also learn how to use a CLI tool named Miller to manipulate data from CSV, TSV and/or JSON files.
Why Use The Command Line?
Everything that I’m showing here can be done with regular code. You can load the file, parse the CSV data, and then transform it using regular JavaScript, Python, or any other language. But there are a few reasons why I reach out for command-line interfaces (CLIs) whenever I need to transform data:
Easier to read.
It is faster (for me) to write a script in JavaScript or Python for my usual data processing. But, a script can be confusing to come back to. In my experience, command-line manipulations are harder to write initially but easier to read afterward.
Easier to reproduce.
Thanks to package managers like Homebrew, CLIs are much easier to install than they used to be. No need to figure out the correct version of Node.js or Python, the package manager takes care of that for you.
Ages well.
Compared to modern programming languages, CLIs are old. They change a lot more slowly than languages and frameworks.
What Is Miller?
The main reason I love Miller is that it’s a standalone tool. There are many great tools for data manipulation, but every other tool I found was part of a specific ecosystem. The tools written in Python required knowing how to use pip and virtual environments; for those written in Rust, it was cargo, and so on.
On top of that, it’s fast. The data files are streamed, not held in memory, which means that you can perform operations on large files without freezing your computer.
As a bonus, Miller is actively maintained, John Kerl really keeps on top of PRs and issues. As a developer, I always get a satisfying feeling when I see a neat and maintained open-source project with great documentation.
Installation
Linux: apt-get install miller or Homebrew.
macOS: brew install miller using Homebrew.
Windows: choco install miller using Chocolatey.
That’s it, and you should now have the mlr command available in your terminal.
Run mlr help topics to see if it worked. This will give you instructions to navigate the built-in documentation. You shouldn’t need it, though; that’s what this tutorial is for!
How mlr Works
Miller commands work the following way:
mlr [input/output file formats] [verbs] [file]
Example: mlr –csv filter ‘$color != “red”‘ example.csv
Let’s deconstruct:
–csv specifies the input file format. It’s a CSV file.
filter is what we’re doing on the file, called a “verb” in the documentation. In this case, we’re filtering every row that doesn’t have the field color set to “red”. There are many other verbs like sort and cut that we’ll explore later.
example.csv is the file that we’re manipulating.
Operations Overview
We can use those verbs to run specific operations on your data. There’s a lot we can do. Let’s explore.
I’ll be using a dataset of IMDb ratings for American TV dramas created by The Economist. You can download it here or find it in the repo for this article.
Note: For the sake of brevity, I’ve renamed the file from mlr –csv head ./IMDb_Economist_tv_ratings.csv to tv_ratings.csv.
Above, I mentioned that every command contains a specific operation or verb. Let’s learn our first one, called head. What it does is show you the beginning of the file (the “head”) rather than print the entire file in the console.
You can run the following command:
`mlr –csv head ./tv_ratings.csv`
And this is the output you’ll see:
This is a bit hard to read, so let’s make it easier on the eye by adding –opprint.
mlr –csv –opprint head ./tv_ratings.csv
The resulting output will be the following:
Much better, isn’t it?
Note: Rather than typing –csv –opprint every time, we can use the –c2p option, which is a shortcut.
That’s where the fun begins. Rather than run multiple commands, we can chain the verbs together by using the then keyword.
You can see that there’s a titleId column that isn’t very useful. Let’s get rid of it using the cut verb.
mlr –c2p cut -x -f titleId then head ./tv_ratings.csv
It gives you the following output:
Fun Fact
This is how I first learned about Miller! I was playing with a CSV dataset for https://details.town/ that had a useless column, and I looked up “how to remove a column from CSV command line.” I discovered Miller, loved it, and then pitched an article to Smashing magazine. Now here we are!
This is the verb that I first showed earlier. We can remove all the rows that don’t match a specific expression, letting us clean our data with only a few characters.
If we only want the rating of the first seasons of every series in the dataset, this is how you do it:
mlr –c2p filter ‘$seasonNumber == 1’ then head ./tv_ratings.csv
We can sort our data based on a specific column like it would be in a UI like Excel or macOS Numbers. Here’s how you would sort your data based on the series with the highest rating:
mlr –c2p sort -nr av_rating then head ./tv_ratings.csv
The resulting output will be the following:
titleId seasonNumber title date av_rating share genres
tt0098887 1 Parenthood 1990-11-13 9.6824 1.68 Comedy,Drama
tt0106028 6 Homicide: Life on the Street 1997-12-05 9.6 0.13 Crime,Drama,Mystery
tt0108968 5 Touched by an Angel 1998-11-15 9.6 0.08 Drama,Family,Fantasy
tt0903747 5 Breaking Bad 2013-02-20 9.554 18.95 Crime,Drama,Thriller
tt0944947 6 Game of Thrones 2016-05-25 9.4943 15.18 Action,Adventure,Drama
tt3398228 5 BoJack Horseman 2018-09-14 9.4738 0.45 Animation,Comedy,Drama
tt0103352 3 Are You Afraid of the Dark? 1994-02-23 9.4349 2.6 Drama,Family,Fantasy
tt0944947 4 Game of Thrones 2014-05-09 9.4282 11.07 Action,Adventure,Drama
tt0976014 4 Greek 2011-03-07 9.4 0.01 Comedy,Drama
tt0090466 4 L.A. Law 1990-04-05 9.4 0.1 Drama
We can see that Parenthood, from 1990, has the highest rating on IMDb — who knew!
By default, Miller only prints your processed data to the console. If we want to save it to another CSV file, we can use the > operator.
If we wanted to save our sorted data to a new CSV file, this is what the command would look like:
mlr –csv sort -nr av_rating ./tv_ratings.csv > sorted.csv
Most of the time, you don’t use CSV data directly in your application. You convert it to a format that is easier to read or doesn’t require additional dependencies, like JSON.
Miller gives you the –c2j option to convert your data from CSV to JSON. Here’s how to do this for our sorted data:
mlr –c2j sort -nr av_rating ./tv_ratings.csv > sorted.json
Case study: Top 5 Athletes With Highest Number Of Medals In Rio 2016
Let’s apply everything we learned above to a real-world use case. Let’s say that you have a detailed dataset of every athlete who participated in the 2016 Olympic games in Rio, and you want to know who the 5 with the highest number of medals are.
First, download the athlete data as a CSV, then save it in a file named athletes.csv.
Let’s open up the following file:
mlr –c2p head ./athletes.csv
The resulting output will be something like the following:
The CSV file has a few fields we don’t need. Let’s clean it up by removing the info , id , weight, and date_of_birth columns.
mlr –csv -I cut -x -f id,info,weight,date_of_birth athletes.csv
Now we can move to our original problem: we want to find who won the highest number of medals. We have how many of each medal (bronze, silver, and gold) the athletes won, but not the total number of medals per athlete.
Let’s compute a new value called medals which corresponds to this total number (bronze, silver, and gold added together).
mlr –c2p put ‘$medals=$bronze+$silver+$gold’ then head ./athletes.csv
It gives you the following output:
Sort by the highest number of medals by adding a sort.
mlr –c2p put ‘$medals=$bronze+$silver+$gold’
then sort -nr medals
then head ./athletes.csv
Respectively, the resulting output will be the following:
Restrict to the top 5 by adding -n 5 to your head operation.
mlr –c2p put ‘$medals=$bronze+$silver+$gold’
then sort -nr medals
then head -n 5 ./athletes.csv
You will end up with the following file:
As a final step, let’s convert this into a JSON file with the –c2j option.
Here is our final command:
mlr –c2j put ‘$medals=$bronze+$silver+$gold’
then sort -nr medals
then head -n 5 ./athletes.csv > top5.json
With a single command, we’ve computed new data, sorted the result, truncated it, and converted it to JSON.
[
{
“name”: “Michael Phelps”,
“nationality”: “USA”,
“sex”: “male”,
“height”: 1.94,
“weight”: 90,
“sport”: “aquatics”,
“gold”: 5,
“silver”: 1,
“bronze”: 0,
“medals”: 6
}
// Other entries omitted for brevity.
]
Bonus: If you wanted to show the top 5 women, you could add a filter.
Respectively, you would end up with the following output:
Conclusion
I hope this article showed you how versatile Miller is and gave you a taste of the power of command-line tools. Feel free to scourge the internet for the best CLI next time you find yourself writing yet another random script.
Miller
Dataset of IMDb ratings for American TV dramas
Powerful Terminal And Command-Line (CLI) Tools For Modern Web Development
How Should Designers Learn To Code? The Terminal And Text Editors (Part 1)
How To Develop An Interactive Command Line Application Using Node.js
A Deep Dive Into Serverless UI With TypeScript