FinanceLane

Optimizing Parquet String Data Compression with RAPIDS

July 17, 2024

Jessie A Ellis Jul 17, 2024 17:53

Discover how to optimize encoding and compression for Parquet string data using RAPIDS, leading to significant performance improvements.

Parquet writers offer various encoding and compression options that are turned off by default. Enabling these options can provide better lossless compression for your data, but understanding which options to use is crucial for optimal performance, according to the NVIDIA Technical Blog.

Understanding Parquet Encoding and Compression

Parquet’s encoding step reorganizes data to reduce its size while preserving access to each data point. The compression step further reduces the total size in bytes but requires decompression before accessing the data again. The Parquet format includes two delta encodings designed to optimize string data storage: DELTA_LENGTH_BYTE_ARRAY (DLBA) and DELTA_BYTE_ARRAY (DBA).
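To make the two delta encodings concrete, here is a minimal pure-Python sketch of the ideas behind them. It is an illustration, not the Parquet wire format: real writers additionally pack the integer streams with DELTA_BINARY_PACKED, and the function names below are hypothetical helpers, not part of libcudf or any Parquet library.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def dlba_encode(values):
    """DLBA idea: keep all lengths together, then all bytes back to back."""
    lengths = [len(v) for v in values]
    # Store the first length, then the difference between neighbouring lengths.
    deltas = lengths[:1] + [b - a for a, b in zip(lengths, lengths[1:])]
    return deltas, b"".join(values)


def dlba_decode(deltas, data):
    """Rebuild the values by accumulating the length deltas."""
    values, pos, length = [], 0, 0
    for d in deltas:
        length += d
        values.append(data[pos:pos + length])
        pos += length
    return values


def dba_encode(values):
    """DBA idea: store only the suffix that differs from the previous value."""
    prefix_lengths, suffixes, prev = [], [], b""
    for v in values:
        p = common_prefix_len(prev, v)
        prefix_lengths.append(p)
        suffixes.append(v[p:])
        prev = v
    return prefix_lengths, suffixes


def dba_decode(prefix_lengths, suffixes):
    """Rebuild each value from the previous value's prefix plus its suffix."""
    values, prev = [], b""
    for p, s in zip(prefix_lengths, suffixes):
        prev = prev[:p] + s
        values.append(prev)
    return values


values = [b"apple", b"applesauce", b"apply", b"banana"]

deltas, data = dlba_encode(values)
print(deltas)  # [5, 5, -5, 1]

prefix_lengths, suffixes = dba_encode(values)
print(prefix_lengths)  # [0, 5, 4, 0]
print(suffixes)        # [b'apple', b'sauce', b'y', b'banana']
```

Both encodings keep every value recoverable on its own terms: DLBA separates lengths from bytes so each stream compresses well, while DBA removes bytes repeated from the preceding value, which is why it benefits from sorted data.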

RAPIDS libcudf and cudf.pandas

RAPIDS is a suite of open-source accelerated data science libraries. In this context, libcudf is the CUDA C++ library for columnar data processing. It supports GPU-accelerated readers, writers, relational algebra functions, and column transformations. The Python cudf.pandas library accelerates existing pandas code by up to 150x.

Benchmarking with Kaggle String Data

A dataset of 149 string columns, totaling 4.6 GB in file size and 12 billion characters, was used to compare encoding and compression methods. The study found less than a 1% difference in encoded size between libcudf and arrow-cpp, and a 3-8% increase in file size when using the ZSTD implementation in nvCOMP 3.0.6 compared with libzstd 1.4.8+dfsg-3build1.

String Encodings in Parquet

String data in Parquet is represented using the byte array physical type. Most writers default to RLE_DICTIONARY encoding for string data, which uses a dictionary page to map string values to integers. If the dictionary page grows too large, the writer falls back to PLAIN encoding.
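The dictionary-with-fallback behavior can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions: the `max_dict_bytes` limit is a hypothetical parameter, real writers make the decision per page or column chunk rather than re-encoding the whole column, and real dictionary indices are stored in an RLE/bit-packed form rather than as a Python list.

```python
def encode_column(values, max_dict_bytes=1024 * 1024):
    """Return (encoding_name, dictionary_or_None, payload).

    Tries dictionary encoding first and falls back to PLAIN when the
    dictionary would exceed max_dict_bytes (a hypothetical limit).
    """
    dictionary = {}   # value -> integer index, in first-seen order
    indices = []
    dict_bytes = 0
    for v in values:
        idx = dictionary.get(v)
        if idx is None:
            dict_bytes += len(v)
            if dict_bytes > max_dict_bytes:
                # Dictionary grew too large: give up and store raw values.
                return ("PLAIN", None, list(values))
            idx = len(dictionary)
            dictionary[v] = idx
        indices.append(idx)
    return ("RLE_DICTIONARY", list(dictionary), indices)


# Low cardinality: a small dictionary plus integer indices.
print(encode_column([b"red", b"green", b"red", b"red"]))

# High cardinality with a tiny limit: falls back to PLAIN.
print(encode_column([b"aaaa", b"bbbb", b"cccc"], max_dict_bytes=8)[0])
```

The sketch shows why dictionary encoding suits repetitive columns and why columns with many distinct or long values tend to end up PLAIN-encoded under default settings.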

Total File Size by Encoding and Compression

For the 149 string columns in the dataset, the default setting of dictionary encoding with SNAPPY compression yields a total file size of 4.6 GB. ZSTD compression outperforms SNAPPY, and both outperform the uncompressed options. The best single setting for the dataset is default encoding with ZSTD compression (default-ZSTD), with further reductions possible from delta encoding under specific conditions.

When to Choose Delta Encoding

Delta encoding is beneficial for data with high cardinality or long string lengths, where it generally achieves smaller file sizes. For string columns with fewer than 50 characters, DBA encoding can provide significant file size reductions, especially for sorted or semi-sorted data.
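One way to check whether a column matches these conditions is to profile it before choosing an encoding. The sketch below is a hypothetical helper, not something taken from the NVIDIA study: it reports the share of distinct values, the mean string length, and the fraction of bytes that repeat the prefix of the preceding value (the bytes a DBA-style encoding could skip). It deliberately sets no thresholds.

```python
import random


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def profile_column(values):
    """Summarize properties relevant to choosing a string encoding."""
    total_bytes = sum(len(v) for v in values)
    shared = sum(common_prefix_len(a, b) for a, b in zip(values, values[1:]))
    return {
        "distinct_ratio": len(set(values)) / len(values),
        "mean_length": total_bytes / len(values),
        # Fraction of bytes that repeat the previous value's prefix.
        "shared_prefix_fraction": shared / total_bytes,
    }


# Synthetic example data (an assumption for illustration only).
sorted_values = [f"user/{i:05d}/profile".encode() for i in range(1000)]
shuffled_values = list(sorted_values)
random.Random(0).shuffle(shuffled_values)

sorted_profile = profile_column(sorted_values)
shuffled_profile = profile_column(shuffled_values)
print(sorted_profile)
print(shuffled_profile)
```

On this synthetic column every value is distinct, so dictionary encoding has little to offer, and the sorted ordering shares more prefix bytes between neighbors than the shuffled one, which is the situation where DBA is described as most effective.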

Reader and Writer Performance

The GPU-accelerated cudf.pandas library read Parquet files 17-25x faster than pandas. Using cudf.pandas with an RMM pool further improved throughput, reaching 552 MB/s for reads and 263 MB/s for writes.
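Throughput figures like these are simply bytes processed divided by elapsed time. The following is a minimal timing sketch, assuming 1 MB = 10^6 bytes (the source does not state which convention it used); `mb_per_s` and `measure_throughput` are hypothetical helper names, and the `time.sleep` call only stands in for a real read or write.

```python
import time


def mb_per_s(num_bytes: int, seconds: float) -> float:
    """Throughput in MB/s, assuming 1 MB = 1_000_000 bytes."""
    return num_bytes / 1_000_000 / seconds


def measure_throughput(func, num_bytes: int, repeats: int = 3) -> float:
    """Call func several times and report MB/s for the fastest run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very coarse timer.
    return mb_per_s(num_bytes, max(best, 1e-9))


def fake_read():
    # Stand-in workload; replace with the actual read or write call.
    time.sleep(0.01)


rate = measure_throughput(fake_read, num_bytes=1_000_000)
print(f"{rate:.1f} MB/s")
```

In practice, `func` could wrap a call such as `pandas.read_parquet` on a file of known size, run once under plain pandas and once with cudf.pandas enabled, so that the two rates can be compared on the same data.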

Conclusion

RAPIDS libcudf offers flexible, GPU-accelerated tools for reading and writing columnar data in formats such as Parquet, ORC, JSON, and CSV. For those looking to leverage GPU acceleration for Parquet processing, RAPIDS cudf.pandas and libcudf provide significant performance benefits.

Read the original article on Blockchain.News.
