
Opening the Black Box: Machine Learning Methods
Individuals, consumers and businesses across a broad range of sectors frequently face the consequences of decisions based on evidence collected from data. Prominent examples include decisions on the availability of credit, whether or not a surgical intervention should be undertaken, the price of insurance premiums, and parole hearings.
The resulting quantification of human life through digital information, often for economic value, has produced datasets that not only include many thousands of records but are also wide, in the sense of holding more information about each record.


Diebold (2003) refers to Big Data as “the explosion in the quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and unprecedented advancements in data recording and storage technology.” In this note we use the term big data to refer to datasets where the width of the data, namely the number of variables (or features), is large relative to the depth, that is, the number of observations.
Complementary to these developments have been significant advances in high-dimensional statistics and machine learning (ML) methods more generally. As Lawrence (2019) notes, machine learning seeks to “emulate cognitive processes through the use of data.” Machine learning algorithms mimic this process using mathematical functions, differentiated by the type and number of parameters that control their behaviour. In this context “learning” is the process of taking a set of inputs and tuning a function so that its outputs are representative of the observed outcomes.
ML methods have been employed across a wide range of sectors. One prominent example is the detection of complex genomic interactions that can lead to diseases like diabetes, cancer, or Alzheimer’s. Here the challenge of big and “wide” data is especially pronounced given that although small individual differences in the gene sequence may have little effect, complex interactions among genes and environmental factors can result in significant effects. As McKinney et al (2021) note,


In the presence of high-dimensional data, traditional statistical methods are not well suited to uncovering these types of interactions.
The Challenges of Big Data
Sparsity of data occurs when the volume of the data space grows so quickly with the number of dimensions that the available data cannot keep up.
Figure 1 demonstrates this phenomenon. In moving from a) to c), the data space grows from one to three dimensions, and the given data fill less and less of it. To maintain an accurate representation of the space, the amount of data for analysis needs to grow exponentially with the number of dimensions (3).
Figure 1: Sparsity and the Curse of Dimensionality
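The effect described above can be reproduced numerically. The following is a minimal sketch, not taken from the original note: it assumes points drawn uniformly at random in the unit cube, splits each axis into ten bins, and reports the share of cells that contain at least one point. The function name `occupied_fraction` and its parameters are illustrative assumptions.

```python
import random


def occupied_fraction(n_points, n_dims, bins_per_dim=10, seed=0):
    """Share of grid cells in the unit hypercube holding at least one point.

    Assumption: points are uniform on [0, 1)^n_dims; each axis is cut
    into `bins_per_dim` equal-width bins.
    """
    rng = random.Random(seed)
    occupied = set()
    for _ in range(n_points):
        # Map each coordinate to the index of the bin it falls into.
        cell = tuple(int(rng.random() * bins_per_dim) for _ in range(n_dims))
        occupied.add(cell)
    return len(occupied) / bins_per_dim ** n_dims


# The same 100 observations cover less of the space as dimensions are added.
for d in (1, 2, 3):
    print(d, occupied_fraction(100, d))
```

With 100 observations, at most 100 of the 1,000 cells in three dimensions can be occupied, so coverage cannot exceed 10% however the points fall. Keeping coverage constant would require the sample to grow by a factor of ten with each added dimension.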


A model built upon sparse data will learn from the frequently occurring combinations of the attributes and will predict training outcomes accordingly. However, when the model is confronted with less frequently occurring combinations, overfitting can result in a fall in prediction accuracy.
Unlocking the Value in Data
Economists and policymakers have also witnessed an exponential growth in both the depth and type of datasets. In this context data might constitute a high-dimensional array of numeric data or data in the form of text. In competition proceedings (4), questions such as whether merging parties are close competitors, or the extent to which customers of the merging parties support the proposed merger, cannot be addressed using standard quantitative data in tabular form. Here Natural Language Processing is able to extract standardised qualitative data from textual information.
Speaking at the 2017 launch of Ofwat’s report - unlocking the value in customer data - Cathryn Ross, the then Chief Executive of Ofwat, emphasised that for companies to make the most of their data “they must understand the potential of the data they hold, as well as the role of econometrics and artificial-intelligence techniques such as machine learning (ML) in extracting more value from data.”
As an example, a regulator might want to know the extent to which demographic variables, such as income, household size, and socio-economic status, are informative of the impact of energy policies on households. This is a difficult problem because post hoc analysis runs into the well-known multiplicity problem as the set of demographic variables increases. In simple terms, when mining the data for significant effects, the more one searches over a large set of possible effects, the more likely it is that something will be found, with consequences for overfitting.
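A short calculation makes the multiplicity problem concrete. The sketch below is an illustration rather than part of the original analysis: it assumes that every true effect is zero and that the tests are independent, in which case the chance of at least one spuriously "significant" result across m tests at level α is 1 - (1 - α)^m. It also shows the Bonferroni per-test threshold as one standard correction. The function names are hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent tests,
    assuming no true effect exists in any of them."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise error
    rate at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


for m in (1, 5, 20, 100):
    print(m, round(prob_any_false_positive(m), 3), bonferroni_threshold(m))
```

Under these assumptions, searching over 20 candidate effects already gives roughly a 64% chance of finding at least one "significant" result by chance alone.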
In assessing whether demographic variables are informative in terms of the impact of Time-of-Use tariffs on load profiles, the Customer-Led Network Revolution project noted ... a surprisingly consistent average demand profile across the different demographic groups, with much higher variability within groups than between them. This high variability is seen both in total consumption and in peak demand.
One reason for this finding might be that it is the (unknown) combination of low income (how low), household size (how big) and education (how little) that describes vulnerable customers. Or in other words, ex-ante segmentation based on a coarse set of demographics might not be informative.
In addition, unlocking value in data is not confined to simply determining which demographic variables are important.
Such an approach ignores the fact that many of these variables should be considered together, in a multiplicative fashion. However, finding such interacting sets of variables is challenging for many statistical models due to the “combinatorial space they need to interrogate, relative to the depth of the dataset” (5). This problem represents an accentuated form of the curse of dimensionality where the data available for building the model may not capture all possible combinations of the available information; in this sense the data space becomes sparse.
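To give a sense of the size of this combinatorial space, the sketch below counts the candidate interaction terms among p variables up to a given order, using binomial coefficients. It is an illustration only; the function name `n_interaction_terms` and the chosen values of p are assumptions, not figures from the cited work.

```python
import math


def n_interaction_terms(n_vars, max_order):
    """Number of distinct interaction terms of order 2..max_order
    that can be formed from n_vars variables."""
    return sum(math.comb(n_vars, k) for k in range(2, max_order + 1))


# Two-way and three-way interactions for increasingly wide datasets.
for p in (10, 50, 200):
    print(p, n_interaction_terms(p, 2), n_interaction_terms(p, 3))
```

With 50 variables there are 1,225 pairwise interactions and 20,825 terms once three-way interactions are included, which can easily exceed the number of observations available to estimate them.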
Machine Learning Methods
Machine learning methods developed in statistics and computer science have proven particularly powerful for predictive tasks. For example, hedonic pricing models, including models of house and car prices, seek to understand the impact of a large number of attributes on prices. In the interests of parsimony, addressing related problems of collinearity, and controlling the bias-variance trade-off, analysts have deployed methods from high-dimensional statistics (e.g. ridge regression and the Least Absolute Shrinkage and Selection Operator (LASSO)) and machine learning (e.g. random forests and generalised random forests) to identify the most important variables.
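The difference between ridge regression and the LASSO can be seen in a simplified special case. The sketch below assumes an orthonormal design (X'X = I), under which both estimators have closed forms in terms of the ordinary least squares coefficient b: minimising ½‖y - Xβ‖² + λ‖β‖₁ gives the soft-thresholded value sign(b)·max(|b| - λ, 0), while minimising ½‖y - Xβ‖² + (λ/2)‖β‖² gives b / (1 + λ). The attribute names and coefficient values are hypothetical and purely illustrative.

```python
def soft_threshold(b_ols, lam):
    """LASSO coefficient under an orthonormal design: shrink towards
    zero by lam and set to exactly zero if |b_ols| <= lam."""
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0


def ridge_shrink(b_ols, lam):
    """Ridge coefficient under an orthonormal design: proportional
    shrinkage that never reaches exactly zero."""
    return b_ols / (1.0 + lam)


# Hypothetical OLS coefficients from a hedonic house-price model.
ols = {"size": 3.0, "age": -1.2, "garage": 0.4, "colour": -0.1}
lam = 0.5

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
ridge = {name: ridge_shrink(b, lam) for name, b in ols.items()}

# Variables the LASSO keeps in the model (non-zero coefficients).
selected = [name for name, b in lasso.items() if b != 0.0]
print(lasso)
print(ridge)
print(selected)
```

In this stylised setting the LASSO drops the two weakest attributes entirely, which is why it is often used for variable selection, whereas ridge keeps every attribute with a smaller coefficient. Real designs are rarely orthonormal, so in practice both estimators are computed numerically.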


In this context the critical observation is that the impact of a policy also depends on how effective it is in selecting its targets. Examples include hiring decisions based on predictions of an employee’s productivity, the allocation of program services prioritized on predictions of who might benefit the most, and pre-trial bail decisions informed by predictions about recidivism.


Dr Melvyn Weeks, University of Cambridge
Dr Melvyn Weeks is a senior lecturer and fellow of Clare College, Cambridge University. Dr Weeks is an assistant editor of the Journal of Applied Econometrics, as well as an associate at Cambridge Econometrics. His work has been published in The Economic Journal, Journal of the American Statistical Association, Journal of Applied Econometrics, European Economic Review, Computational & Economics.
- (1) The Economist, November 21st, 2019.
- (2) NYT, August 12, 2012 - http://www.nytimes.com/2012/08/12/business/how-big-data-became-so-big-unboxed.html
- (3) See https://deepai.org/machine-learning-glossary-and-terms/curse-of-dimensionality.
- (4) See https://www.compasslexecon.com/the-analysis/using-natural-language-processing-in-competition-cases/03-22-2022/
- (5) See Bauer et al (2017).