NTTS 2021 – Spotlight session: Data analytics revolution
Download & Presentation: https://coms.events/NTTS2021/data/abstracts/en/abstract_0076.html

Abstract

We present a use case of Apache Parquet, a column-oriented data storage format, for a large database of international trade data. Parquet offers a fast and efficient way to provide data, as an alternative to databases built on CSV or Stata files. Using Apache Parquet, which provides efficient data compression and, together with Apache Arrow, enhanced performance, we reduced the size of the UN Comtrade trade database for all classifications from ~900 GB to 55 GB (a compression ratio of ~16). Extracting subsets of data from the database also became faster: speed improved by 63.4% on average, with a median improvement of 66%.