Analyzing YAML Data with Pandas in Python

This example demonstrates the use of Python's Pandas library to extract and analyze YAML data provided as a string.

Introduction

Backstory

Imagine you're working on a project that processes and analyzes data from several sources, one of which delivers its data in YAML format. Your goal is to write a data processing script in Python that uses the Pandas library to read and manipulate this YAML data.

Statement

The input contains a YAML-formatted string with the following structure:

- name: Alice
  age: 28
  city: New York
- name: Bob
  age: 24
  city: Los Angeles
- name: Carol
  age: 31
  city: Chicago

The YAML string is passed in the "data" field of the first input record, which has the following structure:

"data": string

import pandas as pd
import yaml

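# INPUT_DATA and log are not defined in this script; they are assumed to be
# supplied by the environment that runs it.
# Convert YAML string to a Python list of dictionaries
data_list = yaml.safe_load(INPUT_DATA[0]["data"])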

# Create a Pandas DataFrame from the data
df = pd.DataFrame(data_list)

# Print the original DataFrame
log.info("Original DataFrame:")
log.info(df)

# Perform some data manipulations
df["age_group"] = pd.cut(df["age"], bins=[20, 30, 40], labels=["20-30", "30-40"])

# Print the DataFrame after adding the 'age_group' column
log.info("DataFrame with Age Groups:")
log.info(df)
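The script above relies on two names it does not define, INPUT_DATA and log. To try it outside the environment that provides them, a minimal sketch could supply stand-ins. The assumptions here are that INPUT_DATA is a list of records with a "data" key (as the indexing above suggests) and that log behaves like a standard logging.Logger; the logger name and the YAML_TEXT variable are hypothetical.

```python
import logging

import pandas as pd
import yaml

# Stand-ins for the names the original script expects from its environment
# (assumption: `log` is a standard logger, `INPUT_DATA` a list of records).
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("yaml_example")  # hypothetical logger name

YAML_TEXT = """\
- name: Alice
  age: 28
  city: New York
- name: Bob
  age: 24
  city: Los Angeles
- name: Carol
  age: 31
  city: Chicago
"""
INPUT_DATA = [{"data": YAML_TEXT}]

# Same steps as the original script: parse, build the DataFrame, bin the ages.
data_list = yaml.safe_load(INPUT_DATA[0]["data"])
df = pd.DataFrame(data_list)
df["age_group"] = pd.cut(df["age"], bins=[20, 30, 40], labels=["20-30", "30-40"])

log.info("DataFrame with Age Groups:\n%s", df.to_string(index=False))
```

Passing the DataFrame as a logging argument (rather than calling log.info(df) directly) keeps the heading and the table in a single log record; either form works.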

Explanation

  • Converting YAML to List of Dictionaries: We use the yaml.safe_load() function from the PyYAML library to convert the YAML string into a Python list of dictionaries. safe_load() only constructs plain Python objects, which makes it the appropriate choice for data from external sources.
  • Creating Pandas DataFrame: We create a Pandas DataFrame from the list of dictionaries. Each dictionary becomes a row, and the keys (name, age, city) become columns.
  • Displaying Data: We log the original DataFrame to inspect the structured data.
  • Data Manipulation: We add an "age_group" column to the DataFrame using Pandas' cut function, which assigns each age to a labeled bin. With bins=[20, 30, 40] and the default right=True, the intervals are (20, 30] and (30, 40], so Alice (28) and Bob (24) fall into "20-30" and Carol (31) into "30-40".
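Because the bin edges decide where boundary ages land, it can help to see how pd.cut treats them. The following is a small sketch with made-up ages (not part of the original input) that compares the default right-closed bins with right=False; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([20, 24, 30, 31, 40, 45])

# Default right=True: the bins are (20, 30] and (30, 40].
# 20 and 45 fall outside both bins and become NaN; 30 lands in "20-30".
default_groups = pd.cut(ages, bins=[20, 30, 40], labels=["20-30", "30-40"])

# right=False: the bins are [20, 30) and [30, 40).
# Now 20 is included, 30 lands in "30-40", and 40 becomes NaN.
left_closed = pd.cut(
    ages, bins=[20, 30, 40], labels=["20-30", "30-40"], right=False
)

print(pd.DataFrame({"age": ages, "default": default_groups, "right=False": left_closed}))
```

If the real data can contain ages outside 20 to 40, the bins would need to be widened, or the resulting NaN values handled explicitly.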

Conclusion

Parsing the YAML string with yaml.safe_load() and passing the result to pd.DataFrame turns YAML records into a table in two steps. From there, standard Pandas tools such as pd.cut are available for manipulating and analyzing the data, which makes this approach useful for data analysts who receive structured data in YAML format from different sources.