Dataflows for Customer Segmentation
Customer segmentation is the division of potential customers in a given market into discrete groups. This aids in attaining a deeper understanding of customers' preferences and needs with the idea of discovering what each segment finds most valuable to more accurately tailor marketing materials toward that segment. RFM (Recency-Frequency-Monetary) segmentation is one of the most popular, easy-to-use, and effective segmentation methods to analyze customer behavior. Here, ‘Recency’ denotes the time since the last order, ‘Frequency’ denotes the number of transactions over the time period being considered and ‘Monetary’ refers to the net monetary value of the customer’s purchases.
The basic workflow for RFM Calculation is shown below:
Figure 1: RFM Calculation workflow
The Recency, Frequency, and Monetary values are first calculated for all the customers from sales/transaction data and then converted to scores ranging from 1 to 5, denoting the percentile of the data that it belongs to. For example, the most recent customer would be given an R score of 5, while the least recent customer would be assigned an R score of 1. The R, F, and M scores are combined to create an RFM score. Here, for instance, 111 would be the worst customer (Low R score: Purchased a very long time back, low F score: Very low purchase frequency and low M score: Spent very less) while 555 would be the best customer (Purchased very recently, purchases very often and spends a lot of money).
Based on the business requirement, multiple RFM scores can be assigned to an RFM ‘Category’. Assigning a category in this manner helps drive further marketing actions for specific customer segments.
The steps involved to carry out RFM segmentation with Dataflows are detailed below. Though this could be achieved using SQL code, it would be much simpler to execute in Python.
Step 1: Loading transaction data into the Dataflow
- Ensure required input data is available in DWC (either through file upload or remote connections) – ‘Transactions’ table in this case (Sample dataset is available here)
- Drag and drop the table into the Dataflow
Step 2: Adding a ‘Script’ operator to the Dataflow
- Drag and drop a Script operator into the Dataflow
- Connect the output node of the ‘Transactions’ table to the Script operator
- In the ‘Properties’ panel, click on the ‘Pencil’ icon to open the script editor. Add the required Python code to the transform(data) function – Sample code detailed below (Figure 2)
- Add/remove required columns from the output Script node, corresponding to the output expected from the python code (In this example, CustomerID from the source table has been retained and new columns – R, F, M, RFM Score, rfsegment have been added. All other columns from the source table have been removed)
Figure 2: Python code for RFM Segmentation – to be added to the ‘transform’ function in the script node
Step 3: Adding an output table to the Dataflow to store the results of RFM segmentation
- Either create a table with the same structure as expected from the output of RFM processing in the Script node or drag and drop an empty output table to the Dataflow (Choose the ‘Create and Deploy’ option in case of latter) – the data insertion mode for the target table (Append/Truncate) can also be configured in the Properties tab
- Connect the output of the Script node to the output table
- Create a mapping between the Script node output fields and the output table fields (This should happen automatically – if not, providing the necessary mapping involves a simple drag and drop procedure)
NOTE: The output table should match exactly the columns of the Script node (column names and datatypes), or else it will result in a Dataflow execution error.
Step 4: Save and execute the Dataflow
Figure 3: (A) Data Flow monitor view (B) Data flow ‘Completed’ status in the Properties tab
Figure 4: RFM Result table
The result of RFM Segmentation can be seen here – each customer has a calculated RFM Score, based on which they have been assigned to an RFM Category.
- Click the ‘Save’ icon to save the dataflow and the ‘Execute’ icon to begin execution
- The status of the dataflow execution can be monitored in the Data Flow Monitor, within the Data Integrator Monitor (Figure 3(A))
- Once Dataflow execution is completed successfully, the status of the Dataflow will change to ‘completed’ in the properties pane of the Dataflow (Figure 3(B))
- The output table can be previewed from the Dataflow builder by clicking on the output table ‘Preview’ option
- It is also possible to schedule a Dataflow from the Data Integration monitor by clicking on the ‘Create Schedule’ option from the Dataflow dropdown (Figure 5(A)). Required frequency, start, and end dates, etc. can be filled in the 'Create schedule' popup (Figure 5(B)).
Figure 5: (A) Data Flow monitor - Scheduling (B) Create Schedule popup
Conclusion
Dataflows, on the whole, provide an easy-to-use interface enabling users to manipulate and persist data from different sources. The addition of Python as a scripting language opens up many new avenues for complex transformations. There are still many limitations in how Python can be used as of now. Python is currently executed in the sandbox mode. This means accessing the file system or network and importing other Python modules is restricted (Pandas and NumPy are the only libraries that can be used), as much as building classes and using coroutines. There is also no way to debug the Python code, with any error in the code itself or even a data mismatch between the Script node output and the target table leading to a Dataflow execution failure – without any details on the cause of failure itself. The limitations of Python scripting in DWC will be explored further in an upcoming blog.