by Faisal Dosani
We just open sourced two projects, datacompy and locopy, with roots in Data Science and Engineering, which we will showcase. While it is exciting and rewarding to share your ideas with the world, it isn't always easy. Thinking about licenses, copyrights, and protecting confidential information is a must!
Working in a large organization that is embracing the mantra 'open source first' is really exciting. Part of this journey is making sure we give back to the open source community when we can. Two of our projects had gained traction internally: datacompy and locopy. As part of our commitment, we wanted to make sure we could open source these projects for others to use and contribute back to.
DataComPy is a package to compare two Pandas DataFrames. It started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, with more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame): it prints out some stats and lets you tweak how accurate matches have to be. It was then extended to carry that functionality over to Spark DataFrames.
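To make the difference from a plain equals() check concrete, here is a minimal sketch of a key-based, tolerance-aware comparison. It uses only pandas and NumPy and is not DataComPy's implementation or API; the helper compare_on_key, the column names, and the sample data are all hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def compare_on_key(df1, df2, key, abs_tol=0.0, rel_tol=0.0):
    """Hypothetical helper: join two frames on `key` and count, for each
    shared numeric column, the rows that differ beyond the tolerances."""
    merged = df1.merge(
        df2, on=key, how="outer", suffixes=("_df1", "_df2"), indicator=True
    )
    both = merged[merged["_merge"] == "both"]
    summary = {
        "rows_only_in_df1": int((merged["_merge"] == "left_only").sum()),
        "rows_only_in_df2": int((merged["_merge"] == "right_only").sum()),
        "rows_in_both": len(both),
        "mismatches": {},
    }
    # Only numeric columns present in both frames are compared in this sketch.
    shared = [
        c for c in df1.columns
        if c != key and c in df2.columns
        and is_numeric_dtype(df1[c]) and is_numeric_dtype(df2[c])
    ]
    for col in shared:
        close = np.isclose(
            both[f"{col}_df1"].to_numpy(),
            both[f"{col}_df2"].to_numpy(),
            rtol=rel_tol,
            atol=abs_tol,
            equal_nan=True,
        )
        summary["mismatches"][col] = int((~close).sum())
    return summary


# Made-up sample data: one tiny float difference and one unmatched row per side.
df1 = pd.DataFrame({"acct_id": [1, 2, 3], "balance": [100.0, 200.0, 300.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 4], "balance": [100.0, 200.00001, 400.0]})

print(df1.equals(df2))  # a single yes/no answer with no detail

strict = compare_on_key(df1, df2, "acct_id")
loose = compare_on_key(df1, df2, "acct_id", abs_tol=1e-4)
print(strict)  # counts the tiny difference as a mismatch
print(loose)   # the same difference falls within tolerance
```

The point of the sketch is the shape of the answer: a summary of unmatched rows and per-column mismatches, with a tolerance you can adjust, rather than a single boolean.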
Locopy helps load flat files to S3 and then to Amazon Redshift, and assists with ETL processing. It is DB driver (adapter) agnostic and provides basic functionality to move data to S3 buckets, execute COPY commands to load data from S3 into Redshift, and execute UNLOAD commands to unload data from Redshift into S3.
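For readers unfamiliar with the Redshift side of this flow, the sketch below builds the kind of COPY and UNLOAD statements involved. It is an illustration only and does not use Locopy's API; the helper names, table, bucket, and IAM role ARN are hypothetical placeholders, and a real tool would also handle the S3 upload, credentials, and statement execution through a database driver.

```python
def copy_statement(table, s3_uri, iam_role, delimiter="|"):
    """Hypothetical helper: build a COPY that loads a file from S3 into a table.
    Inputs are assumed to be trusted; no identifier sanitizing is done here."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '{delimiter}';"
    )


def unload_statement(query, s3_uri, iam_role, delimiter="|"):
    """Hypothetical helper: build an UNLOAD that writes query results to S3.
    Single quotes inside the query are doubled, since UNLOAD takes the
    query as a quoted string."""
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') TO '{s3_uri}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '{delimiter}';"
    )


# Placeholder values for illustration only.
role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"

copy_sql = copy_statement("public.venue", "s3://example-bucket/venue.txt", role)
unload_sql = unload_statement(
    "SELECT * FROM public.venue WHERE venuestate = 'NV'",
    "s3://example-bucket/unload/venue_",
    role,
)

print(copy_sql)
print(unload_sql)
```

Either string would then be executed over a regular database connection, which is where a driver-agnostic design comes in: the statements do not depend on which adapter sends them.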
While building these products was exciting and fun, some of the legal considerations were just as interesting and complex, and they required collaboration between many teams, including security, licensing, brand, and IP/copyright. We'll explore the projects and some of these other considerations, which can make or break a decision to release a project into the wild, along with the roadblocks we faced in these areas.