## Introduction to multivariate data analysis in chemical engineering

Multivariate data analysis (MVA) is the simultaneous investigation of many variables in order to understand the relationships that may exist between them. MVA methods have been around for decades but, until recently, were used primarily in laboratories and specialist technical groups and were rarely applied to production processes.

Most chemical manufacturing processes are highly multivariate in nature because of the complex reactions involved, i.e. there are a large number of variables that typically interact strongly. Complex systems require multiple measurements to be fully understood.

However, the Statistical Process Control (SPC) tools used in chemical engineering still rely largely on univariate (i.e. one variable at a time) methods. Although instruments and control systems collect masses of data, these methods do not show the full picture of a complex chemical process. They use traditional statistical measures such as the mean, the standard deviation and Student’s t-test, which look at each variable individually.

While univariate statistics can be useful for investigating simple systems, they tend to fail when more complex systems are analyzed. This is because they cannot detect relationships that may exist between the variables being studied, as they treat all such variables as being independent of each other.

This relationship among variables is described by covariance or correlation, and is a central theme in MVA. Covariance measures how two variables vary together; it does not by itself show that one variable causes changes in another. Process upsets are typically caused by several variables acting together.
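As a minimal sketch of what covariance and correlation measure, the plain-Python snippet below computes both for a handful of temperature and pH readings. The readings are invented for illustration, and the helper names `covariance` and `correlation` are illustrative rather than taken from any particular SPC package.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical readings, invented for illustration only.
temperature = [70.0, 71.0, 72.0, 73.0, 74.0]   # degrees C
ph = [6.8, 6.9, 7.1, 7.2, 7.5]

cov_tp = covariance(temperature, ph)
corr_tp = correlation(temperature, ph)
print(f"covariance = {cov_tp:.3f}, correlation = {corr_tp:.3f}")
```

Covariance depends on the units of both variables, whereas correlation is scaled to lie between -1 and 1, which is why correlation is usually the easier of the two to compare across variable pairs.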

A common example is the relationship between temperature and pH, as shown below. Suppose I am the Plant Manager at a chemical plant. My process has been running smoothly until, suddenly, the product quality starts to deteriorate and I have to decide what to do.

I have two control charts at my disposal for the measurements performed on the system. These are supposed to be related to product quality and are meant to serve as an indicator that the process is in control.

The ‘pH’ and ‘Temperature’ control charts indicate that nothing is wrong, but in fact there is. The point marked with a red dot in both charts is an abnormal situation, but how can I detect this?

Obviously univariate statistics have failed me! What if I plot the points of both control charts against each other to form a simple multivariate control chart? The plot labeled ‘Multivariate view’ shows this.

The rectangular region formed by the dotted red lines in the ‘Multivariate view’ plot shows the univariate limits within which my process is allowed to operate. However, this simple multivariate graph shows that temperature and pH are related to each other, i.e. higher temperature corresponds to higher pH. Accordingly, the multivariate limits, indicated by the ellipse, are very different from the two univariate limits. The red highlighted point is well inside both univariate limits but outside the multivariate control limits. If multivariate control charts had been applied, the process operator would have been able to detect the deviation.
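The difference between the two kinds of limit can be sketched numerically. The example below assumes the elliptical limit is set with Hotelling's T² statistic (the squared Mahalanobis distance from the in-control mean), which is one common choice; the text does not state which statistic was actually used. The reference data, the suspect point and the false-alarm rate `alpha` are all invented for illustration, and the chi-square limit is a large-sample approximation.

```python
from math import log, sqrt

# Hypothetical in-control reference data (temperature in deg C, pH),
# invented for illustration; higher temperature goes with higher pH.
reference = [(70.0, 6.6), (71.0, 6.9), (71.0, 6.7), (72.0, 7.1),
             (72.0, 6.9), (73.0, 7.3), (73.0, 7.1), (74.0, 7.4)]

n = len(reference)
mean_t = sum(t for t, _ in reference) / n
mean_p = sum(p for _, p in reference) / n

# Sample variances and covariance (n - 1 denominator).
s_tt = sum((t - mean_t) ** 2 for t, _ in reference) / (n - 1)
s_pp = sum((p - mean_p) ** 2 for _, p in reference) / (n - 1)
s_tp = sum((t - mean_t) * (p - mean_p) for t, p in reference) / (n - 1)


def within_univariate_limits(t, p, k=3.0):
    """True if both variables lie inside their own mean +/- k sigma band."""
    return (abs(t - mean_t) <= k * sqrt(s_tt)
            and abs(p - mean_p) <= k * sqrt(s_pp))


def t_squared(t, p):
    """Hotelling's T^2: squared Mahalanobis distance from the reference mean."""
    dt, dp = t - mean_t, p - mean_p
    det = s_tt * s_pp - s_tp ** 2      # determinant of the 2x2 covariance
    return (s_pp * dt ** 2 - 2.0 * s_tp * dt * dp + s_tt * dp ** 2) / det


# Large-sample chi-square limit with 2 degrees of freedom; for two
# variables the quantile has the closed form -2 * ln(alpha).
alpha = 0.0027                          # assumed false-alarm rate
t2_limit = -2.0 * log(alpha)

typical = (74.0, 7.4)                   # high temperature, high pH
suspect = (74.5, 6.5)                   # high temperature, low pH
for name, point in (("typical", typical), ("suspect", suspect)):
    print(name,
          "inside univariate limits:", within_univariate_limits(*point),
          "| outside T^2 limit:", t_squared(*point) > t2_limit)
```

Both points pass the individual three-sigma checks, but only the suspect point breaks the correlation pattern of the reference data, so only it exceeds the T² limit. With a small reference set, an F-distribution-based limit would normally be preferred over the chi-square approximation used here.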

This is a simple example of how multivariate methods enable superior Early Event Detection capabilities compared to univariate control charts, especially when systems are complex and the number of input variables becomes large (for example, greater than 10).

**Typical multivariate techniques**

The main multivariate techniques are Exploratory Data Analysis, Regression/Prediction methods, and Classification methods.

Exploratory data analysis (EDA) attempts to find the hidden structure or underlying patterns in large, complex data sets. This gives a better understanding of the process and can lead to insights that would not have been observed otherwise. EDA methods include Cluster analysis and Principal Component Analysis (PCA). An example application of exploratory data analysis is checking for contaminants in a process or feedstock, or identifying by-products caused by incorrect process settings.
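To make the idea of finding "hidden structure" concrete, here is a minimal PCA sketch in plain Python. It autoscales three variables, builds their correlation matrix and extracts the first principal component by power iteration. The data set (temperature, pH and flow) is invented for illustration, and power iteration is used only to keep the example dependency-free; production work would normally rely on a numerical library.

```python
from math import sqrt

# Hypothetical process data, invented for illustration:
# columns are temperature (deg C), pH and flow (m3/h).
data = [
    (70.0, 6.6, 5.2), (71.0, 6.9, 4.9), (71.0, 6.7, 4.9), (72.0, 7.1, 5.1),
    (72.0, 6.9, 4.9), (73.0, 7.3, 4.9), (73.0, 7.1, 4.9), (74.0, 7.4, 5.2),
]
n, p = len(data), len(data[0])
columns = list(zip(*data))


def autoscale(col):
    """Mean-centre a column and scale it to unit sample standard deviation."""
    m = sum(col) / len(col)
    s = sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
    return [(v - m) / s for v in col]


scaled = [autoscale(col) for col in columns]

# Correlation matrix of the autoscaled variables.
corr = [[sum(a * b for a, b in zip(scaled[i], scaled[j])) / (n - 1)
         for j in range(p)] for i in range(p)]

# Power iteration: repeated multiplication converges to the dominant
# eigenvector, i.e. the loadings of the first principal component (PC1).
loadings = [1.0 / sqrt(p)] * p
for _ in range(200):
    w = [sum(corr[i][j] * loadings[j] for j in range(p)) for i in range(p)]
    norm = sqrt(sum(v * v for v in w))
    loadings = [v / norm for v in w]

# The Rayleigh quotient gives the eigenvalue, i.e. the variance along PC1.
eigenvalue = sum(loadings[i] * corr[i][j] * loadings[j]
                 for i in range(p) for j in range(p))
explained = eigenvalue / p   # total variance of p autoscaled variables is p

# Scores: each sample projected onto PC1.
scores = [sum(scaled[j][i] * loadings[j] for j in range(p)) for i in range(n)]

print("PC1 loadings:", [round(v, 2) for v in loadings])
print("fraction of variance explained by PC1:", round(explained, 2))
```

In this made-up data set temperature and pH load almost equally on the first component while flow contributes very little, which is the kind of pattern a loadings plot is meant to reveal.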

Regression analysis involves developing a model from available data to predict a desired response or responses for future measurements. Multivariate regression extends the simple straight-line model to the case where there are many independent variables and at least one dependent variable. Regression methods include Multiple Linear Regression (MLR), Principal Component Regression (PCR) and Partial Least Squares Regression (PLSR). Common applications include predicting purity, yield or end product quality from input raw material quality.
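The simplest of these methods, MLR, can be sketched in a few lines. The example below fits purity against two raw-material properties by solving the least-squares normal equations. The variable names and values are hypothetical, and the purity values were generated without noise from a known linear relation (purity = 99.0 - 0.5 × moisture - 2.0 × impurity) so that the fitted coefficients can be checked. When predictors are strongly collinear, as process variables often are, plain MLR tends to become unstable, which is the usual motivation for PCR and PLSR.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    x = [[1.0] + list(row) for row in rows]          # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    """Predicted response for one sample of predictor values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Hypothetical raw-material data: (moisture %, impurity %) and product purity %.
raw_material = [(1.0, 0.2), (2.0, 0.1), (3.0, 0.4),
                (4.0, 0.3), (5.0, 0.5), (6.0, 0.2)]
purity = [98.1, 97.8, 96.7, 96.4, 95.5, 95.6]

coef = fit_mlr(raw_material, purity)
print("coefficients:", [round(c, 3) for c in coef])
print("predicted purity for a new batch:", round(predict(coef, (3.5, 0.3)), 2))
```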

Classification is the separation (or sorting) of a group of objects into one or more classes based on distinctive features of the objects. Classification methods include Linear Discriminant Analysis (LDA), Soft Independent Modelling of Class Analogy (SIMCA), and Support Vector Machine Classification (SVM-C). Example applications include grouping products according to similar characteristics or quality grades.
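As an illustration of the first of these methods, the sketch below implements two-class LDA for two measured features in plain Python. It assumes equal class priors and a shared (pooled) covariance matrix, which are the standard LDA assumptions. The training samples and the grade labels "A" and "B" are invented for illustration.

```python
# Hypothetical training samples (two measured features per product),
# invented for illustration and labelled by quality grade.
grade_a = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.4), (1.5, 2.2)]
grade_b = [(3.0, 3.5), (3.5, 3.1), (4.0, 3.9), (3.5, 3.5)]


def mean_vector(samples):
    n = len(samples)
    return (sum(s[0] for s in samples) / n, sum(s[1] for s in samples) / n)


def scatter(samples, centre):
    """Sums of squares and cross-products about the class mean."""
    sxx = sum((x - centre[0]) ** 2 for x, _ in samples)
    syy = sum((y - centre[1]) ** 2 for _, y in samples)
    sxy = sum((x - centre[0]) * (y - centre[1]) for x, y in samples)
    return sxx, syy, sxy


mean_a, mean_b = mean_vector(grade_a), mean_vector(grade_b)

# Pooled within-class covariance: LDA assumes both classes share it.
dof = len(grade_a) + len(grade_b) - 2
sxx, syy, sxy = (sum(pair) / dof
                 for pair in zip(scatter(grade_a, mean_a),
                                 scatter(grade_b, mean_b)))

# Discriminant direction w = S^-1 (mean_b - mean_a), using the 2x2 inverse.
det = sxx * syy - sxy ** 2
dx, dy = mean_b[0] - mean_a[0], mean_b[1] - mean_a[1]
w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)

# With equal priors the decision threshold is the projected midpoint
# between the two class means.
threshold = (w[0] * (mean_a[0] + mean_b[0]) / 2
             + w[1] * (mean_a[1] + mean_b[1]) / 2)


def classify(sample):
    """Assign grade 'B' if the discriminant score exceeds the threshold."""
    score = w[0] * sample[0] + w[1] * sample[1]
    return "B" if score > threshold else "A"


print(classify((1.2, 2.0)), classify((3.8, 3.6)))
```

New samples are reduced to a single discriminant score, so the decision rule is a straight line in the two-feature plane; methods such as SIMCA and SVM-C draw class boundaries differently and are not covered by this sketch.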

**Barriers to use of multivariate methods**

Multivariate methods are used today in the chemical, pharmaceutical, oil and gas, petroleum refining, mining and metals, pulp and paper, agriculture and food industries, to name a few.

However, because of its sophisticated nature, multivariate analysis has predominantly been used by scientists in R&D or technical departments. Applying these techniques requires knowing which methods suit different data types, how to develop models, how to interpret plots and so on. Historically, these skills have not been a focus for chemical engineers, who have tended to use first-principles models.

Collecting data systematically and being able to get it into a format suitable for analysis was another common obstacle to using multivariate methods in the past. This is less of an issue today because most processes are instrumented and sophisticated control systems are widely used. In fact, the challenge now is that so much data is collected that it is increasingly difficult to cut through the huge amount of information to find the underlying patterns, which is driving the use of multivariate methods.

**Why is the situation changing?**

While control systems and analytical instruments have improved greatly in recent years, the software and techniques used to analyze increasingly large, complex data sets have not evolved at the same pace.

Today, however, leading companies are looking for new sources of competitive advantage and realizing that the enormous amount of data collected during their production operations offers great insights for improving product development and process performance.

Additionally, companies are under increasing pressure to improve the sustainability of their products and processes, which can be achieved with greater insight offered by more powerful analytical tools.

Similarly, thinner margins mean companies are constantly looking to drive out costs by running processes closer to limits, using lower cost components where possible, reducing energy use and trying to minimize waste and rework costs.

Recent technology changes have enabled multivariate models to be developed by specialist groups and then applied to real-time process data. These tools can be integrated directly with instrumentation or as part of a larger control system, with results displayed in a choice of formats ranging from operator to expert view.

**Advantages of multivariate data analysis**

Multivariate data analysis provides a simpler yet more accurate view of the overall process health. It allows users to identify variables that contribute most to the overall variability in the data, essential for understanding complex data or processes. It helps isolate those variables that are related i.e. co-vary with each other, which can then be taken into account in further model development and analysis.

Multivariate data analysis is highly visual in nature. Rather than simply displaying statistics, it shows data plotted in a variety of forms, which makes patterns easier to see and aids interpretation.

Traditional univariate control charts show each variable on its own chart, so an operator has to watch many charts simultaneously, which makes it extremely difficult to get a clear, complete picture. Multivariate control charts condense all of this data into one or two plots, taking into account the complex interactions between variables. If the process begins to drift, it is possible to ‘drill down’ into the specific samples or outliers to quickly identify the root cause of the problem using a combination of multivariate and univariate diagnostics.

**Benefits of applying multivariate models to chemical process monitoring**

Multivariate data analysis can be used across the value chain of chemical engineering, from product development and scale-up or scale-down through to process engineering and process optimization. Manufacturers who have adopted these tools can quickly see improvements in their operations and on their bottom line.

*Increased yields:* Identify which combination of process variables produces the highest yield, and improve process understanding generally to find the optimum settings.

*Improved quality:* Root causes of quality problems can often be difficult to identify. Multivariate analysis gives deeper insights and can pinpoint the variables or their interactions causing problems. When combined with spectroscopy, it can enable cost-effective 100% quality testing.

*Improved product safety and sustainability:* Reduce the use of hazardous chemicals, minimize scrap and optimize processes for more sustainable products.

*Reduced process failures:* Identify issues in a process before they become problems that cause the process to fail, and use powerful multivariate diagnostics to drill down into the cause of the problem.

*Reduced variability:* Maintain consistent end-product quality more effectively by using multivariate predictive models to correct process drifts early.

*Cheaper product development:* Design of experiments (DoE) combined with multivariate analysis lets you develop products faster and more cost-effectively by reducing the number of tests and experiments needed.

*Faster time to market:* Move products from the pilot plant to full production scale faster. Multivariate analysis provides the product and process insights needed to make scale-up smoother and shorten the time to market.

**How to get started with multivariate data analysis**

Virtually all manufacturers collect a large amount of data, but in many cases it remains unused in ‘silos’ or is analyzed with inadequate tools. Implementing multivariate methods in a process environment starts with answering fundamental questions such as:

- What data are being collected?
- How are the data available, i.e. are they in paper reports, on local disks or accessible in a control system?
- What is the organization or department trying to achieve, i.e. what are the pressing issues?

Based on the answers to these questions, a problem statement can be defined and plans can be made for better use of the data to address the relevant issues. The speed and affordability of developing and implementing multivariate monitoring and control strategies are comparable to those of univariate programs, but a multivariate approach gives manufacturers a significantly better understanding of their processes, as well as more robust and sensitive control strategies.