Skip to content

sm373373/spark-dataframe-to-pandas

Repository files navigation

Spark DataFrame to Pandas Converter

Overview

This Visual Studio Code extension helps developers working with PySpark DataFrames to easily convert them to Pandas DataFrames during debugging sessions. With this extension, you can right-click a PySpark DataFrame in the Locals pane and instantly convert it to a Pandas DataFrame for easier inspection and manipulation. The newly converted Pandas DataFrame can also be viewed using the VSCode Data Viewer (with the Data Wrangler extension).

Features

  • Convert Spark DataFrame to Pandas DataFrame: Simply right-click on any PySpark DataFrame in the debugger and select "Convert Spark DataFrame to Pandas".
  • Automatic Variable Creation: A new variable is created for the Pandas DataFrame, so it can be accessed and explored during your debug session.
  • View Pandas DataFrame in Data Viewer: After conversion, you can inspect the Pandas DataFrame using the VSCode Data Wrangler extension.

How It Works

  1. During a debug session, pause the execution where a PySpark DataFrame is present.
  2. In the Debugger’s Variables window, right-click on the Spark DataFrame you want to convert.
  3. Select the option Convert Spark DataFrame to Pandas from the context menu.
  4. The extension converts the Spark DataFrame and assigns it to a new variable with the suffix _pandas.
  5. Optionally, you can view the DataFrame using the Data Wrangler's Data Viewer.

Installation

  1. Install the extension via VSCode by building and packaging it:
    vsce package
  2. Open the VSIX file generated and install it in your VSCode environment.
  3. Install the Data Wrangler extension if you want to use the View in Data Viewer feature.

Requirements

  • VSCode: Ensure you have Visual Studio Code installed.
  • VSCode Debugger: A working debugger setup in VSCode for your PySpark projects.
  • Data Wrangler: For viewing DataFrames, install the Data Wrangler extension to enable the Data Viewer feature.

Extension Settings

This extension currently doesn't require any configuration settings.

Usage

  1. Start a debug session in VSCode with a PySpark script.
  2. Pause the execution at a breakpoint where a Spark DataFrame is present.
  3. Right-click the DataFrame in the Locals window.
  4. Select Convert Spark DataFrame to Pandas.
  5. The converted DataFrame will appear in the Locals window with a _pandas suffix.
  6. Right-click the new variable and choose View Value in Data Viewer to inspect it (requires Data Wrangler).

Contributing

Feel free to open issues and submit pull requests if you have suggestions for improvements or new features.

Known Issues

  • Refreshing the variables window may take a few moments after converting the DataFrame. In some cases, manually stepping in the debugger might help update the view.
  • The View in Data Viewer feature requires the Data Wrangler extension to be installed.

License

MIT License. See LICENSE for more information.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published