Skip to content
background-image background-image

Analyzing XML Data with Pandas in Python

This example demonstrates the use of Python's Pandas library to extract and analyze XML data provided as a string.

Introduction

Backstory

You work as a data analyst at "TechTreasure," an e-commerce startup specializing in electronic devices. Your company sources product information from various suppliers. One of the suppliers, "ElectroTech," sends product data in XML format. You need to extract and analyze this data to gain insights into the average prices of different product categories. Your goal is to assist the pricing team in setting competitive prices for TechTreasure's products.

Motivation

By analyzing the average prices of different product categories, you can provide valuable insights to the pricing team. This data will help them understand the competitive landscape and make informed decisions about pricing TechTreasure's products. Using pandas, you can efficiently process the XML data, calculate average prices, and generate a report for the pricing team.

Statement

On input data, we have a XML-formatted string with the following structure:

<products>
    <product>
        <name>Laptop A</name>
        <category>Laptops</category>
        <price>1200</price>
    </product>
    <product>
        <name>Phone B</name>
        <category>Phones</category>
        <price>800</price>
    </product>
    <product>
        <name>Tablet C</name>
        <category>Tablets</category>
        <price>500</price>
    </product>
</products>

Input data has the following structure:

"data": string

import pandas as pd
from xml import etree
from io import StringIO

# Parse the XML data
root = etree.ElementTree.fromstring(INPUT_DATA[0]["data"])

# Create a list to store extracted data
data = []

# Extract data from XML and store in a list of dictionaries
for product in root.findall("product"):
    name = product.find("name").text
    category = product.find("category").text
    price = float(product.find("price").text)
    data.append({"Name": name, "Category": category, "Price": price})

# Create a DataFrame using the extracted data
df = pd.DataFrame(data)

# Calculate the average prices by category
avg_prices = df.groupby("Category")["Price"].mean()

# Display the average prices
log.info(avg_prices)

# Generate the average price report as a CSV string
csv_string_io = StringIO()
avg_prices.to_csv(csv_string_io, header=True)

# Get the CSV string value
csv_string = csv_string_io.getvalue()

return [{ "csv": csv_string }]

Explanation

  • Parsing XML Data: We use Python's xml.etree.ElementTree to parse the XML data string and create an XML element tree.
  • Extracting Data: We extract data from the XML element tree by iterating through each "product" element and retrieving the "name," "category," and "price" values.
  • Creating Pandas DataFrame: We create a Pandas DataFrame from the extracted data, allowing us to manipulate and analyze the data efficiently.
  • Calculating Average Prices: We calculate the average prices of products by category using Pandas' grouping and aggregation capabilities.
  • Generating CSV Report: We generate a CSV report by converting the average prices DataFrame into a CSV-formatted string using Pandas.

Conclusion

This Pandas-based analysis of XML data simplifies the process of handling structured data received in XML format. By leveraging Pandas, you can quickly convert, analyze, and generate reports from XML data, making it a valuable tool for data analysts working with various types of structured data.