Table of Contents
- Brief about Rattle
- Primer Installation of R and RStudio
- Base Rattle Installation
- Understanding the Rattle Interface
- Data Ingestion in Rattle - I
- Data Ingestion in Rattle - II
- Explore Data in Rattle I
- Explore Data in Rattle II
- Explore Data in Rattle III
- Transform Data in Rattle I
- Transform Data in Rattle II
- Clustering in Rattle I
- Clustering in Rattle II
- Linear Regression in Rattle
- Logistic Regression Model in Rattle I
- Logistic Regression Model in Rattle II
- Decision Trees in Rattle
- Random Forest in Rattle
- Boosted Trees in Rattle
- Support Vector Machines in Rattle
- Artificial Neural Network in Rattle
- Ensemble Learning in Rattle
There are a number of analytical tools for handling data and building models. Most of these tools, contributed by Microsoft, Google, SAS, and the R and Python communities, demand strong programming skills. A few, however, let users explore data science without writing code. While I will discuss those as we go ahead in this series, let's talk about R right now. Python has evolved into the go-to platform for analytics, machine learning and artificial intelligence, but our old friend R is still in the race. Already embedded in plenty of applications and services, it has a lot of juice left, and its community support is unmatched. For a non-programmer, or someone who has just started executing data processes for machine learning, Rattle is a good fit. Business users typically understand their data process flow but are held back by a lack of programming skills. Rattle lets them explore their data without coding, and perhaps helps them pick up coding gradually.
Rattle is an open-source package available on CRAN, which can be installed in R or RStudio. For ease of use we will be deploying it in RStudio. The rattle library gives us access to a host of R functions via a graphical user interface. The tutorial is a sequenced series of explainer videos showcasing the use of Rattle for data mining; below each video you will find a brief explanation. The tutorial assumes that learners are already familiar with data science concepts. If you want to go through those concepts, you can enroll in this quick open-source learning video series over here. That course will help you prepare the conceptual aspects of data science.
In the video tutorial above, we cover the base R installation followed by the RStudio installation. Depending on your OS, you may have to select the appropriate installer. I have shown the installation on a Windows setup, but using the same approach and links you can install on Mac or Linux as well.
In the video tutorial above, before we start the analytics, we install the Rattle package from CRAN; the steps are shown over here. During installation a couple of supporting packages will also be needed. The base dependency packages are installed right away, while auxiliary packages are prompted for by the system as they are required, so you may need to confirm the install when prompted. Rattle makes the R layer almost effortless to use, much like Microsoft Revolution R or SAS University Edition. The Rattle graphical user interface makes it quite intuitive to process, analyze and visualize your data.
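For reference, the steps shown in the video boil down to a few commands typed at the R console in RStudio; this is a minimal sketch, and depending on your setup CRAN may still prompt for extra packages:

```r
# Install Rattle and its dependency/suggested packages from CRAN;
# answer "y" if the console prompts you to install auxiliary packages.
install.packages("rattle", dependencies = c("Depends", "Suggests"))

# Load the package and launch the GUI
library(rattle)
rattle()
```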
Here I give an overview of the Rattle interface. The video walks through the various tabs an analyst can use: it starts with the Data tab, then the Explore, Test and Transform tabs, and further on the Cluster, Model and Evaluate tabs. Last but not least is the Log tab, where we can see the execution logs. On a typical project we may use around 70% of these tabs. If you are doing a supervised learning process, you may not use the Associate tab; explore it only if you need to implement association rule mining.
Data ingestion is the first tab on your Rattle GUI. Here you can load data from a flat file, an Excel file, a SQL or Oracle database, or other formats. You also have the option of reading data from R packages installed in your environment. Connecting through ODBC is possible as well, but since we are not using a database, I haven't shown that option. In data analytics, the first step is data staging: you may have to apply transformations, and validating and updating your data dictionary typically happens at this step.
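Under the hood, Rattle's flat-file option runs a plain `read.csv` call, which you can see in the Log tab. A rough sketch, where `"weather.csv"` is a hypothetical file path:

```r
# Roughly what Rattle's Spreadsheet (CSV) option logs; the na.strings
# values mirror Rattle's defaults for treating blanks and "?" as missing.
dataset <- read.csv("weather.csv",
                    na.strings = c(".", "NA", "", "?"),
                    strip.white = TRUE)
str(dataset)   # inspect column types against your data dictionary
```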
Here we look at the steps to partition the data into training and testing sets. This continues the project phase in which the business user may want to address specific problems in the data. As analysts, we need to make sure that the complexity of information, the field mapping and the data architecture are sound. While this tool doesn't set up infrastructure for AI, it certainly gives insights in that direction.
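The partition Rattle builds is just a random index split. A minimal sketch of a 70/15/15 train/validate/test split, assuming a data frame named `dataset` is already loaded:

```r
set.seed(42)                                  # make the split reproducible
n        <- nrow(dataset)
train    <- sample(n, round(0.70 * n))        # 70% training rows
rest     <- setdiff(seq_len(n), train)
validate <- sample(rest, round(0.15 * n))     # 15% validation rows
test     <- setdiff(rest, validate)           # remaining 15% test rows
```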
The Explore tab in Rattle helps us perform exploratory data analysis. Its Summary option lets us analyze the data, draw distribution plots and gather insights, helping us describe the data for the business use-case requirement. You get an insight into the data distribution in both visual and tabulated formats, and you can view the basic statistics from this view.
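The Summary option is essentially wrapping base R summaries. A sketch using the built-in `mtcars` data as a stand-in for your own:

```r
summary(mtcars)     # min/max, quartiles, mean for each column
str(mtcars)         # structure: column types and sample values
sapply(mtcars, sd)  # standard deviation of each numeric column
```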
I have further explored the visualization aspect using histograms and box plots. The data used here is the standard Boston Housing dataset; you can explore other forms and shapes of data as per your requirement, such as the health, bank and IoT data sets also available in Rattle. Every time you change the configuration, you need to click Execute to refresh the results. We also explore correlations in this part of the tutorial.
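The same histogram, box plot and correlation views can be reproduced in base R with the Boston data from the MASS package (which ships with R):

```r
library(MASS)                         # contains the Boston housing data

hist(Boston$medv, breaks = 30,
     main = "Median home value", xlab = "medv ($1000s)")

boxplot(medv ~ chas, data = Boston,   # grouped by Charles River flag
        names = c("off river", "on river"),
        main = "Home value by river adjacency")

cor(Boston$rm, Boston$medv)           # correlation: rooms vs. value
```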
In the last part of Explore Data in this tutorial, we learn how to quickly create interactive charts served on localhost. We deploy a visual, interactive, responsive chart, executed using the ggplot2 code the GUI generates while we work in the WYSIWYG layout.
In this video tutorial we use Rattle to apply Principal Component Analysis to the independent variables. This helps us create new features that eliminate multicollinearity from the data. We also quickly run a few statistical tests in this module.
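Rattle's Principal Components option is a front end for `prcomp` with scaling. A sketch on a few numeric `mtcars` columns, chosen here purely for illustration:

```r
# PCA on scaled numeric predictors; the rotated scores in pca$x are
# uncorrelated, which is what removes the multicollinearity.
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)
summary(pca)   # proportion of variance explained per component
head(pca$x)    # the new, uncorrelated features (scores)
```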
Here we try some of the transform data options including Rescale, Impute and Recode. The intention is to keep the information intact but change the perspective for ease of processing. Transformation of data also helps us to visualize the data better.
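The three transforms can be sketched in a few lines of base R on a toy column; the breakpoints and labels below are illustrative choices, not Rattle defaults:

```r
x <- c(2, 5, NA, 9, 4)                 # toy column with a missing value

# Rescale to [0, 1]
rng <- range(x, na.rm = TRUE)
x01 <- (x - rng[1]) / diff(rng)

# Impute: replace NA with the median
x_imp <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Recode: bin a numeric column into categories
x_bin <- cut(x_imp, breaks = 3, labels = c("low", "mid", "high"))
```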
Clustering is an important unsupervised learning technique applied to business data such as customers, products and markets, and we explore this option in Rattle in this tutorial. Note that this run is just for prototyping and in no way a replacement for a production environment. Clustering matters for artificial intelligence implementations because grouping is what humans do organically: from customer segmentation and market analysis to finding decongested traffic routes and spotting product or market growth, we use this technique. We also use a bit of the Transform tab to get sensible clusters.
We execute some more variations of k-means. I have also run clustering using the entropy-weighted k-means (Ewkm), hierarchical and BiCluster methods. You can try varying the number of centers, the distance measures and the iterations for variations.
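For reference, the k-means and hierarchical runs map onto base R calls; a sketch on the iris measurements, with `centers` and `iter.max` as the knobs to vary:

```r
set.seed(1)
km <- kmeans(scale(iris[, 1:4]),   # scale first so no column dominates
             centers = 3, iter.max = 20, nstart = 10)
table(km$cluster, iris$Species)    # compare clusters to known species

# Hierarchical alternative: Ward linkage on Euclidean distances
hc <- hclust(dist(scale(iris[, 1:4])), method = "ward.D2")
cutree(hc, k = 3)                  # cut the tree into 3 clusters
```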
In the above video tutorial, we execute a point-prediction case using linear regression, accessible in the Model tab. We start with data ingestion, then model type selection, and finally apply the predictions. We use the random train/validate/test splits provided by Rattle, and finally evaluate the model in the Evaluate tab. Rattle, as a visual analytics tool, makes it very easy for non-programmers to build machine learning models. While we do visualize our data, it shouldn't be expected to be on par with a visualization tool like Tableau or Spotfire. Reading the anomalies in the data is an important aspect of modelling, so make sure you use the Explore tab to find abnormalities in your data. The Evaluate tab is where we produce our output and results in report format, which can later be presented to business users before the prototype moves to development.
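The model-then-evaluate flow above corresponds to an `lm` fit plus a held-out error check. A sketch on `mtcars`, with the predictors chosen only for illustration:

```r
set.seed(7)
idx   <- sample(nrow(mtcars), round(0.7 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)   # fit on the training split
summary(fit)                              # coefficients, R-squared

pred <- predict(fit, newdata = test)      # score the held-out rows
sqrt(mean((test$mpg - pred)^2))           # RMSE on unseen data
```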
We step into building a logistic regression model with Rattle. Here we use a data set related to an HR attrition case study. Since it is in Microsoft Excel format and has multiple sheets, we use the RStudio data import option; the resulting object in RStudio is then accessible in Rattle as an R dataset. We use Rattle both as a data analytics tool and as a BI tool. The classification model is meant to serve as a decision system for functional HR, identifying opportunities to reward or adjust incentive schemes. Assessing these results helps HR identify good employees who are on the verge of separation, and it can act as a dashboard metric during appraisal-related employee assessments. This tutorial gives a functional HR business person the opportunity to mine his/her data from a business perspective, with the ability to map rewards depending on the attrition score. We do the data ingestion and data analysis in this part.
We build the model and create the Receiver Operating Characteristic (ROC) curve. I have also created the sensitivity, specificity and recall curves. We rebuild the model on a hold-out data set to check the model fit, covering different ways of using Rattle for this variation. As a process, we check the classification metrics using Microsoft Excel.
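The sensitivity/specificity numbers we export to Excel can also be computed directly in R. A sketch using `glm` with a 0.5 cut-off, where `mtcars$am` (transmission type) stands in for the attrition flag since the HR data set isn't bundled here:

```r
fit  <- glm(am ~ wt + hp, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")     # predicted probabilities
pred <- ifelse(prob > 0.5, 1, 0)            # classify at the 0.5 cut-off

tab  <- table(actual = mtcars$am, predicted = pred)
sens <- tab["1", "1"] / sum(tab["1", ])     # sensitivity (recall, TPR)
spec <- tab["0", "0"] / sum(tab["0", ])     # specificity (TNR)
```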
Next we implement decision trees on our data set, our first pure ML modelling exercise. Both recursive partitioning and conditional partitioning options are available. We apply the decision tree to the HR attrition case using Rattle and visualize the classification model. The graph shows us the relationships ranked by variable importance, which we use to select the terms, topics and features that give us a higher classification fit. The tutorial will give you clarity on the different options for building the decision tree model. You can tune the minsplit and maxdepth parameters, which decide the levels in the tree. The structure of your tree, broader vs. deeper, matters from the under-fitting vs. over-fitting perspective.
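Rattle's Tree option wraps the rpart package (which ships with R), and `minsplit`/`maxdepth` are passed through `rpart.control`. A sketch on rpart's bundled `kyphosis` data:

```r
library(rpart)   # recursive partitioning; includes the kyphosis data

# minsplit: minimum rows needed before a split is attempted
# maxdepth: cap on tree depth, controlling under- vs over-fitting
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(minsplit = 20, maxdepth = 4))

printcp(fit)             # complexity table for pruning decisions
fit$variable.importance  # importance scores used for feature selection
```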
The next machine learning algorithm we implement in our graphical user interface for RStudio and R is Random Forest. The purpose of a random forest is to reduce the variance of a model. It's a framework that combines randomness in variable selection with randomness in data sampling.
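Rattle's Forest option wraps the randomForest package (install it with `install.packages("randomForest")` if missing). The two sources of randomness show up as `ntree` bootstrap samples and `mtry` variables per split:

```r
library(randomForest)
set.seed(123)
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,   # trees averaged to cut variance
                   mtry  = 2)     # variables randomly tried per split
rf                                # OOB error estimate, confusion matrix
importance(rf)                    # per-variable importance scores
```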
Next we implement boosted trees. Here we can choose the algorithm as Extreme or Adaptive on the dashboard: the Extreme option represents the XGBoost algorithm, while the Adaptive option represents the AdaBoost algorithm. Iteratively, these algorithms converge toward minimal error. We check the classification metrics using Microsoft Excel.
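A minimal sketch of the Extreme (XGBoost) path, assuming the xgboost package is installed; `mtcars$am` again stands in for a 0/1 attrition label:

```r
library(xgboost)

x <- as.matrix(mtcars[, c("wt", "hp", "disp")])  # numeric feature matrix
y <- mtcars$am                                    # 0/1 label

# nrounds boosting iterations; each round fits trees to the
# remaining error, which is how the algorithm converges.
bst <- xgboost(data = x, label = y, nrounds = 20,
               objective = "binary:logistic", verbose = 0)

head(predict(bst, x))   # predicted probabilities
```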
Support vector machines are among the most widely used algorithms, and we implement one over here. When doing this programmatically you would have to think about the margin, the gap between the support vectors; here we simply use the template and make the best use of our data. You can refer to this programmatic implementation of support vector machines for handwritten text classification.
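Rattle's SVM option calls `ksvm` from the kernlab package (install kernlab if needed). A sketch on iris, where `C` softens or hardens the margin:

```r
library(kernlab)

svm <- ksvm(Species ~ ., data = iris,
            kernel = "rbfdot",   # radial basis kernel
            C = 1)               # cost: larger C = narrower margin

table(predict(svm, iris), iris$Species)  # in-sample confusion matrix
```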
Although Rattle provides a framework and an implementation for ANNs, it doesn't give us the flexibility to specify the topology or structure of the network. At most, we can increase or decrease the number of nodes; we don't have the flexibility to add input, output or hidden layers.
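This limitation comes from the underlying nnet package (which ships with R): it fits a single hidden layer, and `size` is the only topology knob, exactly as described above:

```r
library(nnet)
set.seed(1)
nn <- nnet(Species ~ ., data = iris,
           size  = 5,       # nodes in the one and only hidden layer
           decay = 1e-3,    # weight decay regularization
           maxit = 200, trace = FALSE)

table(predict(nn, iris, type = "class"), iris$Species)
```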
Ensemble learning typically means combining the learnings of multiple models to improve prediction or classification strength; think of it as unity is strength. In Rattle, we don't have a pure-play ensemble: we get to execute the models together, but it doesn't really combine their weighted outputs.
About the Author: