Here are the steps to follow to run Python, Pandas and PySpark on a Mac:
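The actual setup steps live in the full post; as a companion, here is a minimal sanity-check sketch that verifies the two libraries are importable after installation (the module names `pandas` and `pyspark` are their standard pip import names, installed with `pip install pandas pyspark`):

```python
# Quick sanity check that Pandas and PySpark are importable.
# A minimal sketch, not the post's actual setup steps.
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if the module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    for name in ("pandas", "pyspark"):
        status = "OK" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

Using `find_spec` avoids actually importing the packages, so the check stays fast and fails gracefully when something is missing.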
Do we need more people … or a better workflow? It is a question we all face once in a while. You have the “official” development workflow in place, which may be a mix of old-school waterfall techniques, newer agile techniques, and several layers of quality control and approval. That process often relies on corporate tools, spreadsheets, and analysts to define the software to build. Behind it, to actually deliver software, you have the “real” workflow, where efficient teams build strong relationships with their data consumers.
The evolution happening in distributed storage, distributed processing, and in-memory processing, opens the door to new ways of serving data for analytics. Instead of using complex incremental processes to serve data to your consumers, a Recomputable Data System (RCDS) re-computes your analytics datasets by reading ALL the raw data every time it runs. It is also capable of handling batch and real-time processing, presenting a current and consolidated view of your business whenever you need it.
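The recompute-everything idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the RCDS implementation from the article: every run scans ALL raw events and rebuilds the serving dataset from scratch, so there is no incremental state to reconcile (the event fields `customer` and `amount` are assumed for the example):

```python
# A minimal sketch of the recompute idea behind an RCDS: each run reads
# the full raw history and rebuilds the result from scratch.
from collections import defaultdict

def recompute_totals(raw_events):
    """Rebuild per-customer totals by scanning every raw event."""
    totals = defaultdict(float)
    for event in raw_events:
        totals[event["customer"]] += event["amount"]
    return dict(totals)

# Every run starts from the complete raw history, so a late-arriving
# event is no special case — it is simply part of the next recompute.
raw = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "a", "amount": 2.5},  # late arrival, handled for free
]
print(recompute_totals(raw))  # {'a': 12.5, 'b': 5.0}
```

The trade-off is raw compute for simplicity: there is no incremental merge logic to debug, because the whole answer is derived from the raw data on every run.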
So much has been written about agility in software development that I was really wondering what I could add that brings a little value or clarity about the process and its application to Business Intelligence and Data Science. Here is a little story …
Data Modelers are sometimes introverted people who like sifting through mountains of database schemas and documentation. Data modeling is to some extent an intellectual undertaking where you almost have to reach a level of connection to the domain you study that resembles a Zen master’s connection to the universe.
Sometimes, trying to save money on the salary of the people building your foundational data architecture can have repercussions that are a lot more costly than the money you “save” by going cheap.
In Data Warehousing, the perception of success is different between users and engineers. In part 2, let’s talk about the success factors from the point of view of the data engineer.
In Data Warehousing, the perception of success is different between users and engineers. In part 1, let’s talk about the success factors from the point of view of our users.
In data warehousing, temporal data models and data flows have a real tendency to become complex very quickly. On top of this, you may have to handle multiple disparate data sources that do not merge very well. You may want to load the same type of business events from multiple sources and run into missing attributes that create blanks in your final serving tables.
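One common way to fill those blanks is to coalesce the same business event across sources, letting the first non-null value per attribute win in source-priority order. A small sketch of that idea (the field and source names `crm`, `erp`, `order_id`, etc. are illustrative assumptions, not from the article):

```python
# Coalesce the same business event loaded from two sources, where each
# source is missing some attributes. Earlier records take priority.
def coalesce_records(*records):
    """Merge records for one event; first non-None value per key wins."""
    merged = {}
    for record in records:
        for key, value in record.items():
            if merged.get(key) is None and value is not None:
                merged[key] = value
    return merged

crm = {"order_id": 42, "customer": "acme", "region": None}
erp = {"order_id": 42, "customer": None, "region": "EMEA", "amount": 99.0}
print(coalesce_records(crm, erp))
# {'order_id': 42, 'customer': 'acme', 'region': 'EMEA', 'amount': 99.0}
```

The same pattern exists at scale as `COALESCE` in SQL or `coalesce` on PySpark columns; the point is that source priority is an explicit, reviewable decision rather than an accident of load order.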
As experienced data architects building data warehouses and business intelligence solutions, we have been used to thinking “incrementally”. We have been creating complex data models and incremental load processes that are effectively required to work around the limitations in storage and speed of our databases and ETL tools.
In my current data systems, I usually create a persistent staging area (PSA) that receives raw/untransformed data and keeps all of it (full history) in the format we receive it (zip, json, csv, etc.). That first layer is very important because anything else you generate from it implies human decisions that transform the data … and those decisions can (will) be wrong and will need to be changed. The PSA is the key to maintaining an agile data workflow where prototyping and experiments are possible.
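An ingest routine for such a layer can be very small. This is a sketch under assumptions of my own (the per-source, per-arrival-time directory layout and function names are illustrative, not a prescribed convention): files land exactly as received and are never rewritten.

```python
# A sketch of a persistent staging area (PSA) ingest: raw files are
# stored as received, never modified, under a per-source /
# per-arrival-time layout. The layout itself is an assumption.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def ingest_raw(source_name: str, incoming_file: Path, psa_root: Path) -> Path:
    """Copy a raw file into the PSA untouched, keyed by arrival time."""
    arrived = datetime.now(timezone.utc).strftime("%Y-%m-%d/%H%M%S%f")
    target_dir = psa_root / source_name / arrived
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / incoming_file.name  # original name and format kept
    shutil.copy2(incoming_file, target)       # copy, never move or rewrite
    return target
```

Because every downstream dataset is derived from this immutable layer, a bad transformation decision costs a recompute, not a data loss.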
I discovered computers in the ’80s, when I was around 12 years old. It’s been an amazing ride through all the different evolutions of information computing, from those early “DATA” lines hard-coded in BASIC to today’s Big Data and NoSQL solutions, often hosted in the Cloud.