Here are the steps to follow to run Python, Pandas and PySpark on a Mac:
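A minimal sketch of what such a setup can look like, assuming Homebrew for Python and a JDK, and pip for Pandas and PySpark (the exact commands and versions are my assumptions, not prescriptions):

```python
# Assumed setup (adjust to your own environment):
#   brew install python openjdk@17          # PySpark needs a JDK
#   python3 -m pip install pandas pyspark
import pandas as pd
from pyspark.sql import SparkSession

# Start a local Spark session using all cores on this Mac.
spark = SparkSession.builder.master("local[*]").appName("mac-setup-check").getOrCreate()

# A small Pandas DataFrame to confirm Pandas works.
pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Round-trip it through Spark to confirm PySpark works end to end.
sdf = spark.createDataFrame(pdf)
print(sdf.count())            # expect 3
print(sdf.toPandas().head())

spark.stop()
```

If the script prints the row count and the small table without errors, Python, Pandas and PySpark are all talking to each other.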
Do we need more people … or a better workflow? This is a question we all face once in a while. You have the “official” development workflow in place, which may consist of a mix of old-school waterfall techniques, newer agile techniques and various layers of quality control and approval. That process often relies on corporate tools, spreadsheets and analysts to define the software to be built. Behind it, to actually deliver software, you have the “real” workflow, where efficient teams build a strong relationship with the data consumers.
So much has been written about agility in software development that I really wondered what I could add that would bring a little value or clarity about the process and its application to Business Intelligence and Data Science. Here is a little story …
Data modelers are sometimes introverted people who like sifting through mountains of database schemas and documentation. Data modeling is, to some extent, an intellectual undertaking where you almost have to reach a level of connection to the domain you study that resembles a Zen master’s connection to the universe.
Sometimes, trying to save money on the salary of the people building your foundational data architecture can have repercussions that are a lot more costly than the money you “save” by going cheap.
You are building a new data system using the latest cool technologies? This is great, but don’t forget that it is all about the users. Users and data engineers perceive success differently, so let’s talk about the success factors from the point of view of your users.
In data warehousing, temporal data models and data flows have a real tendency to become complex very quickly. On top of this, you may have to handle multiple disparate data sources that do not merge very well. You may want to load the same type of business events from multiple sources and run into missing attributes that create blanks in your final serving tables.
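As a small, hypothetical illustration (the column names are invented), here is what those blanks look like when two feeds of the same event type are stacked into one serving table, and one common way to make the gap explicit:

```python
import pandas as pd

# Source A carries a customer_segment attribute, source B does not.
source_a = pd.DataFrame({"order_id": [1, 2], "amount": [100.0, 250.0],
                         "customer_segment": ["retail", "wholesale"]})
source_b = pd.DataFrame({"order_id": [3, 4], "amount": [75.0, 310.0]})

# Stacking both feeds into one serving table: rows from source B get NaN
# for customer_segment -- the blanks mentioned above.
serving = pd.concat([source_a, source_b], ignore_index=True)
print(serving)

# One mitigation: fill the gap with an explicit "unknown" member so
# downstream reports do not silently drop or miscount those rows.
serving["customer_segment"] = serving["customer_segment"].fillna("unknown")
```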
As experienced data architects processing data for data warehouses and business intelligence solutions, we are used to thinking “incrementally”. We have been creating complex data models and incremental load processes that are effectively required to work around the limitations in storage and speed of our databases and ETL tools.
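For readers less familiar with the habit, a toy sketch of that incremental thinking (table and key names are invented) is an upsert: merge only the new or changed rows into an existing target instead of reloading everything.

```python
import pandas as pd

target = pd.DataFrame({"customer_id": [1, 2], "status": ["active", "active"]})
incoming = pd.DataFrame({"customer_id": [2, 3], "status": ["churned", "active"]})

# Keep target rows that are not in the increment, then append the increment:
# an upsert keyed on customer_id.
merged = pd.concat(
    [target[~target["customer_id"].isin(incoming["customer_id"])], incoming],
    ignore_index=True,
)
print(merged)  # customer 2 updated, customer 3 added
```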
In my current data systems, I usually create a persistent staging area (PSA) that receives raw, untransformed data and keeps all of it (the full history) in the format we receive it in (zip, json, csv, etc.). That first layer is very important because anything else you generate from it implies human decisions that transform the data … and those decisions can (will) be wrong and will need to be changed. The PSA is the key to maintaining an agile data workflow where prototyping and experiments are possible.
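As a rough sketch (the directory layout, function name and paths are assumptions for illustration, not a fixed standard), a PSA landing step can be as simple as copying each incoming file, untouched, into a folder partitioned by source and load date:

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def land_in_psa(source_file: str, source_name: str, psa_root: str = "/data/psa") -> Path:
    """Copy a raw file into the PSA untouched, partitioned by source and load date."""
    now = datetime.now(timezone.utc)
    target_dir = Path(psa_root) / source_name / now.strftime("%Y/%m/%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original name and format (zip, json, csv, ...); no parsing and
    # no transformation, so every later layer can always be rebuilt from here.
    target = target_dir / Path(source_file).name
    shutil.copy2(source_file, target)
    return target

# Example: land_in_psa("/tmp/orders_2024-01-15.csv", "crm_orders")
```

Because nothing is interpreted at this stage, changing your mind about a transformation later only means reprocessing from the PSA, not re-extracting from the source systems.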