A PySpark notebook that converts XML files to CSV. I needed the Stack Exchange Data Dump in CSV format for my project. The converters available online work well for small files, but they took too long on XML files ranging from hundreds of MB to several GB.
This one works in parallel using Spark's RDDs and completes the conversion in a few minutes on a minimal Spark setup with 2 cores and 2 GB of RAM.
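
The following is a minimal sketch of the general idea, not the notebook's actual code. It assumes the Stack Exchange dump layout, where each record is a single `<row ... />` element on its own line, so the file can be read line by line as an RDD and each line parsed independently. The names `COLUMNS`, `row_to_csv`, and `convert` and the chosen column list are hypothetical and only serve the illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column list; replace it with the attributes of the table you convert.
COLUMNS = ["Id", "PostTypeId", "Score", "Title"]


def row_to_csv(line, columns=COLUMNS):
    """Turn one '<row ... />' line into one CSV record, or None for any other line."""
    line = line.strip()
    if not line.startswith("<row"):
        return None  # XML declaration, opening/closing root tag, blank line
    attrs = ET.fromstring(line).attrib
    buf = io.StringIO()
    # The csv module handles quoting of commas, quotes, and newlines in values.
    csv.writer(buf, lineterminator="").writerow(
        [attrs.get(col, "") for col in columns]
    )
    return buf.getvalue()


def convert(input_path, output_dir, columns=COLUMNS):
    """Parse the XML file in parallel and write CSV part files to output_dir."""
    # Imported here so row_to_csv can be tried without a Spark installation.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    (
        sc.textFile(input_path)                      # one RDD element per line
        .map(lambda line: row_to_csv(line, columns)) # parse each line independently
        .filter(lambda record: record is not None)   # drop non-row lines
        .saveAsTextFile(output_dir)                  # one part file per partition
    )
```

Because each line is parsed on its own, Spark can split the input across partitions without any worker needing the whole document in memory. Note that values containing encoded line breaks produce quoted multi-line CSV records, so the reader of the output needs multi-line support enabled.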
adich23/XmltoCsv_StackExchange
XML to CSV converter for large files using Apache Spark