Stories by JOHN CHAN on Medium

AI and the Future of Personalized Education: A Paradigm Shift in Learning

JOHN CHAN — Sun, 18 May 2025 03:54:45 GMT

Recently, I’ve been exploring the theory of computation. With the rapid advancement of artificial intelligence — essentially a vast collection of algorithms and computational instructions designed to process inputs and generate outputs — I find myself increasingly curious about the fundamental capabilities and limitations of computation itself. Concepts such as automata, Turing machines, computability, and complexity frequently appear in discussions about AI, yet my understanding of these topics is still developing. I recently encountered fascinating articles by Stephen Wolfram, including Observer Theory and A New Kind of Science: A 15-Year View. Wolfram presents intriguing ideas, such as the claim that beyond a certain minimal threshold, nearly all processes — natural or artificial — are computationally equivalent in sophistication, and that even the simplest rules (like cellular automaton Rule 30) can produce irreducible, unpredictable complexity.

Before the advent of AI tools, my approach to learning involved selecting a relevant book, reading through it, and working diligently on exercises. A significant challenge in self-directed learning is the absence of immediate guidance when encountering difficulties. To overcome this, I would synthesize information from various sources — books, online resources, and Q&A platforms like Stack Overflow — to clarify my doubts. Although rewarding, as it encourages the brain to form connections and build new knowledge, this process is undeniably time-consuming. Imagine if we could directly converse with the author of a textbook — transforming the author into our personal teacher would greatly enhance learning efficiency.

In my view, an effective teacher should possess the following qualities:

Expertise in the subject matter, with a depth of knowledge significantly greater than that of the student, and familiarity with related disciplines to provide a comprehensive understanding.
A Socratic teaching style, where the teacher guides students through questions, encourages active participation, corrects misconceptions, and provides constructive feedback. The emphasis should be on the learning process rather than merely arriving at the correct answer.
An ability to recognize and address the student’s specific misunderstandings, adapting teaching methods to suit the student’s individual learning style and level.

Realistically, not all teachers I’ve encountered meet these criteria. Good teachers are scarce resources, which explains why parents invest heavily in quality education and why developed countries typically have more qualified teachers than developing ones.

With the emergence of AI tools, I sense a potential paradigm shift in education. Rather than simply asking AI to solve problems, we can leverage AI as a personalized teacher. For undergraduate-level topics, AI already surpasses the average classroom instructor in terms of breadth and depth of knowledge. AI systems effectively function as encyclopedias, capable of addressing questions beyond the scope of typical educators. Moreover, AI can be easily adapted to employ a Socratic teaching approach. However, current AI still lacks the nuanced ability to fully understand a student’s individual learning style and level. It relies heavily on the learner’s self-awareness and reflection to identify gaps in understanding and logic, prompting the learner to seek clarification. This limitation likely arises because large language models (LLMs) are primarily trained to respond to human prompts rather than proactively prompting humans to think critically.

Considering how AI might reshape education, I offer the following informal predictions:

AI systems will increasingly be trained specifically as teachers, designed to prompt learners through Socratic questioning rather than simply providing direct answers. A significant challenge will be creating suitable training environments and sourcing data that accurately reflect the learning process. Potential training resources could include textbooks, Q&A platforms like Stack Overflow and Quora, and educational videos from Khan Academy and MIT OpenCourseWare.
AI-generated educational content will become dynamic and personalized, moving beyond traditional chatbot interactions. Similar to human teachers, AI might illustrate concepts through whiteboard explanations, diagrams, or even programming demonstrations. Outputs could include text, images, videos, or interactive web-based experiences.
The number of AI teachers will vastly exceed the number of human teachers, significantly reducing the cost of education. This transformation may occur before 2028, aligning with predictions outlined in AI-2027.

In a hypothetical future where AI can perform every cognitive task, will humans still need to learn? Will we still require teachers? If AI remains friendly and supportive, I believe human curiosity will persist, though the necessity for traditional learning may diminish significantly. Humans might even use AI to better understand AI itself. Conversely, if AI were to become adversarial, perhaps humans would still have roles to fulfill, necessitating AI to teach humans the skills required for these tasks.

How to Determine the Sample Size of AB Test?

JOHN CHAN — Mon, 26 Feb 2024 12:47:52 GMT

Install the Math Everywhere Plugin to view math formulas.

TLDR

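For two groups with a two-sided alternative, normally distributed outcomes with homogeneous variances (\(\sigma_0^2 = \sigma_1^2 = \sigma^2\)) and equal sample sizes (\(n_0 = n_1 = n\)), at significance level \(\alpha = 0.05\) and power \(1-\beta = 0.8\), the required sample size per group is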

\[n = \frac{16}{\Delta^2}\]
where
\[\Delta = \frac{\mu_0 - \mu_1}{\sigma} = \frac{\delta}{\sigma}\]

Examples:
What sample size is required if the baseline conversion rate is 2% and the minimal detectable difference is 0.5%?

We need the variance to solve for the sample size using the above equation. Since conversion is just a binomial proportion, the variance of a single observation is \(\sigma^2 = p \cdot (1-p)\). Then, \[n = \frac{16 \cdot 0.02 \cdot (1-0.02)}{0.005^2} = 12544\] (If you need a refresher on the variance of the binomial distribution, follow this link.)
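As a quick check of the arithmetic, here is a minimal sketch of the rule of thumb in Python. The function name and argument names are my own, and the variance is assumed to be the binomial \(p(1-p)\) used above.

```python
def rule_of_thumb_sample_size(baseline_rate, min_detectable_diff):
    """Per-group sample size from n = 16 * sigma^2 / delta^2."""
    # Variance of a single 0/1 conversion outcome.
    variance = baseline_rate * (1 - baseline_rate)
    return 16 * variance / min_detectable_diff ** 2


n = rule_of_thumb_sample_size(0.02, 0.005)
print(round(n))  # 12544, the same number as the worked example
```

Halving the minimal detectable difference quadruples the required sample size, because the difference enters the formula squared.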

I know most of you are not satisfied with the conclusion without explanation.

Statistics 101

Before we delve into the sample size calculations, these are the key concepts to understand. If you need a refresher, I have linked to some resources you may find helpful.

Type I Error (\(\alpha\), the significance level): The probability of rejecting the null hypothesis when it is true. The p-value is the quantity we compare against \(\alpha\) to decide whether to reject.
Type II Error (\(\beta\)): The probability of not rejecting the null hypothesis when it is false.
Power \(= 1- \beta\): The probability of rejecting the null hypothesis when it is false.
\(n, \sigma, \mu, S.E.\): sample size, standard deviation (so \(\sigma^2\) is the variance), mean, and standard error. (I find Section 2.4 of the book Introduction to Probability by Dimitri P. Bertsekas and John N. Tsitsiklis an excellent resource on these topics.)
Critical value (\(z\)): The number of standard errors the observed statistic must lie away from the null value before we reject the null hypothesis. It is determined by \(\alpha\) and is closely related to p-values and confidence intervals.

All the statistical concepts you will need to know can be summarised into one graph:

Assume \(H_0\), the null hypothesis, is true, i.e., we sampled from the \(H_0\) distribution. The p-value is the probability of observing a result at least as extreme as the one you observed. We decided in advance to reject the null hypothesis whenever the observation falls within the dark area, so under \(H_0\) any such rejection is a false rejection. Thus, the dark area is the Type I error (false positive), and its size is \(\alpha\). Assume, on the contrary, that \(H_1\), the alternative hypothesis, is true, i.e., we sampled from the \(H_1\) distribution. The shaded area represents the probability of not rejecting \(H_0\) at the given \(\alpha\), because in that region the p-value is greater than \(\alpha\). Thus, the shaded area is the Type II error (false negative). For a fixed sample size, there is a trade-off between Type I error and Type II error. If we lower one, the other will increase (imagine moving the critical value line left and right).
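The trade-off can be made concrete with a small sketch. It assumes a two-sided, two-sample z-test with \(n\) users per group and standard error \(\sigma\sqrt{2/n}\), reuses the numbers from the conversion example (2% baseline, 0.5% difference, 12544 users per group), and ignores the negligible rejection tail on the far side. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, delta, sigma, n):
    """Approximate beta for a two-sided, two-sample z-test with n per group."""
    std_normal = NormalDist()
    standard_error = sigma * sqrt(2 / n)
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    # Probability, under H1, that the statistic stays below the critical value.
    return std_normal.cdf(critical_value - delta / standard_error)


sigma = sqrt(0.02 * 0.98)
for alpha in (0.10, 0.05, 0.01):
    beta = type_ii_error(alpha, delta=0.005, sigma=sigma, n=12544)
    print(f"alpha={alpha:.2f}  beta={beta:.3f}  power={1 - beta:.3f}")
```

With everything else held fixed, a stricter \(\alpha\) pushes the critical value to the right and \(\beta\) grows.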

The only concept left in this chart at this point is \(S.E.\). Given that we know the standard deviation, \(\sigma\), of the distribution, the standard error of the mean of \(n\) samples is \(\sigma/\sqrt{n}\). The proof can be found on the Wikipedia page. The \(S.E.\) of the difference between two sample means, assuming equal variance and equal sample sizes \(n_1 = n_2 = n\) (comparing the distributions under \(H_0\) and \(H_1\)), is \(SE = \sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}} = \sigma\sqrt{2/n}\). I will not derive the formula here. The derivation uses the definition of variance and pooled variance.
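If you prefer to see the standard error rather than prove it, a small simulation helps. This is only a sketch under assumed values (normal data, \(\sigma = 2\), \(n = 100\), 2000 replications, a fixed seed): it draws many samples, records each sample mean, and compares the spread of those means with \(\sigma/\sqrt{n}\).

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)
sigma, n, replications = 2.0, 100, 2000

# Draw many samples of size n and record each sample mean.
sample_means = [
    mean(random.gauss(0.0, sigma) for _ in range(n))
    for _ in range(replications)
]

empirical_se = stdev(sample_means)   # spread of the sample means
theoretical_se = sigma / sqrt(n)     # sigma / sqrt(n)
print(f"empirical SE = {empirical_se:.3f}, sigma/sqrt(n) = {theoretical_se:.3f}")
```

The two numbers should agree closely, and the agreement tightens as the number of replications grows.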

Deriving the Sample Size Formula

The critical value defines the boundary between the rejection and nonrejection regions. This value must be the same under the null and alternative hypotheses. This gives the fundamental equation for the two-sample situation: \[0 + z_{1-\alpha/2} \sigma \sqrt{\frac{2}{n}} = \delta - z_{1-\beta} \sigma \sqrt{\frac{2}{n}}\] Rearranging it yields \[n = \frac{2( z_{1-\alpha/2} + z_{1-\beta})^2}{(\frac{\delta}{\sigma})^2}\] For \(\alpha = 0.05\) and \(\beta=0.2\) the values of \(z_{1-\alpha/2}\) and \(z_{1-\beta}\) are 1.96 and 0.84, respectively. So, \(2( z_{1-\alpha/2} + z_{1-\beta})^2 = 2 \cdot 2.8^2 = 15.68 \approx 16\) (You can find the corresponding values in a Z-table. Usually, the table is given, but if you are interested in how it is derived, you may refer to this.).
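For other choices of \(\alpha\) and \(\beta\), the z-values can be computed instead of looked up. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name and defaults are my own. Because the exact multiplier is about 15.7 rather than 16, the result for the conversion example comes out slightly below 12544.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(delta, sigma, alpha=0.05, beta=0.2):
    """n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / (delta / sigma)^2, rounded up."""
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    multiplier = 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2
    return ceil(multiplier * sigma ** 2 / delta ** 2)


sigma = (0.02 * 0.98) ** 0.5
n_80 = sample_size_per_group(delta=0.005, sigma=sigma)            # 80% power
n_90 = sample_size_per_group(delta=0.005, sigma=sigma, beta=0.1)  # 90% power
print(n_80, n_90)
```

Asking for 90% power instead of 80% raises the required sample size, since \(z_{1-\beta}\) grows as \(\beta\) shrinks.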

The same equation gives the minimal detectable difference for a given sample size \(n\): \[\delta = \frac{4\sigma}{\sqrt{n}}\]
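Run in this direction, the formula answers "what can I detect with the traffic I have?". A minimal sketch, again with the 2% baseline conversion rate; the function name and the second sample size (50000) are only illustrative.

```python
from math import sqrt


def min_detectable_diff(sigma, n):
    """delta = 4 * sigma / sqrt(n), using the same rule-of-thumb constant."""
    return 4 * sigma / sqrt(n)


sigma = sqrt(0.02 * 0.98)
mde_example = min_detectable_diff(sigma, 12544)  # recovers roughly 0.005
mde_larger = min_detectable_diff(sigma, 50000)   # more users, smaller detectable difference
print(mde_example, mde_larger)
```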

Q&A

What if there are more than two groups? The required sample size per group is the same, provided you only compare with the control.
Is normality a poor assumption? Even when the distribution of the metric \(Y\) is not normal, the sample average \(\bar{Y}\) usually is approximately normal because of the Central Limit Theorem.
Can we stop the experiment once the significance level is reached? NO! The sample size should be predetermined. More detailed explanation here.
Is there a problem if the metric unit differs from the randomisation unit? Yes. For example, click-through rate (CTR) is the ratio of total clicks to pageviews. The analysis unit is no longer a user but a pageview. When the experiment is randomised by user, this creates a challenge for estimating variance, because the variance formula assumes the samples are i.i.d. and pageviews from the same user are not independent. One trick is to write the ratio metric as the ratio of “average of user level metrics”, for example, clicks per user divided by pageviews per user.
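The last answer can be sketched in code. The ratio of user-level averages equals the overall CTR, and its variance can then be approximated from user-level data. Note that the variance formula below is the delta method, which is my addition and not something derived in this article, and the tiny data set is made up purely for illustration.

```python
from statistics import mean


def ratio_metric(clicks, pageviews):
    """CTR as (mean clicks per user) / (mean pageviews per user).

    Returns the ratio and a delta-method approximation of its variance,
    treating users (not pageviews) as the i.i.d. units.
    """
    n = len(clicks)
    mean_c, mean_p = mean(clicks), mean(pageviews)
    ratio = mean_c / mean_p
    var_c = sum((c - mean_c) ** 2 for c in clicks) / (n - 1)
    var_p = sum((p - mean_p) ** 2 for p in pageviews) / (n - 1)
    cov = sum((c - mean_c) * (p - mean_p) for c, p in zip(clicks, pageviews)) / (n - 1)
    var_ratio = (var_c - 2 * ratio * cov + ratio ** 2 * var_p) / (n * mean_p ** 2)
    return ratio, var_ratio


# Hypothetical per-user totals for four users.
clicks = [1, 0, 2, 1]
pageviews = [10, 5, 20, 5]
ctr, ctr_variance = ratio_metric(clicks, pageviews)
print(ctr, sum(clicks) / sum(pageviews))  # both are the overall CTR
```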

I wouldn’t say I liked geography in secondary school.

JOHN CHAN — Sat, 17 Feb 2024 08:55:19 GMT

I wouldn’t say I liked geography in secondary school. It seemed disconnected from the real world, all about memorising facts and theories. It wasn’t until I read ‘Guns, Germs, and Steel’ that I grasped geography’s profound influence on cultures and economies. From there, ‘Prisoners of Geography’ fueled my interest even more. Because my work focuses on the Brazilian market, I began exploring Latin America specifically. This led to books like ‘Latin America Region and People,’ ‘The Oxford Handbook of The Brazilian Economy,’ and ‘Placing Latin America Contemporary Themes in Geography.’ I even found helpful introductory websites on the region.

Mental Model

Topography and Climate: Topographical features such as mountains, rivers, highlands, plains, and pampas determine the potential for settlement, connectivity, trade, and conflict. Climate, together with topography, influences water availability, the suitability of growing crops, and the potential for livestock farming.
Agriculture and Land Distribution: The availability of fertile soil and flat land has historically affected the region’s prosperity.
Natural Resources Distribution: Precious metals like gold and silver, other minerals, oil, and valuable natural resources are closely linked to prosperity. These resources have driven trade and created job opportunities within countries and businesses.
Language and Food: Language and food are prominent cultural features. Shared language and similar cuisines enhance economic and social connections between people within a region. We may use these features to segment the region into different groups.
Critical Infrastructure: Highway networks, power transmission infrastructure, ports, airports, and sanitation systems provide the foundational support necessary for economic development.

This is the mental model I began developing over two weeks through in-depth reading and analysis. It’s a framework I’m continually refining, but these dimensions seem essential for understanding the region.

Sign up for my blog for more: ZhiZhi Gewu (zhizhi-gewu.com)

Topography and Climate

Andes Mountains: The most dominant feature, the Andes span the entire western length of South America. This young and seismically active mountain range boasts some of the highest peaks in the world, including Aconcagua in Argentina. They have been a formidable barrier to internal communication and transport, isolating communities and challenging large-scale infrastructural development. The Andes also influence climate patterns and agricultural
River Basins: Three major river basins shape the continent:
Amazon Basin: The world’s largest rainforest and river system, covering much of northern South America. Its size and ecological significance present both opportunities and challenges for conservation and development. The basin’s rivers, particularly the Amazon River, are crucial for transportation, yet the dense jungle has historically limited overland travel and economic exploitation.
Orinoco Basin: Draining northern South America through Venezuela and Colombia.
La Plata Basin: Encompassing parts of Brazil, Argentina, Uruguay, Paraguay, and Bolivia, including the Paraná and Paraguay rivers.
Pampas: The fertile plains of the Pampas, primarily in Argentina, are significant for agriculture, supporting large-scale cultivation of crops and livestock grazing. This area has been a critical driver of Argentina’s economy, although it also presents challenges regarding environmental sustainability and land use management.
Highlands:
Brazilian Highlands: An extensive system of low mountains and plateaus in eastern South America. The Brazilian Highlands are a treasure trove of valuable natural resources.
Guiana Highlands: Located between the Amazon and Orinoco river basins, characterised by heavily forested plateaus.
Coastal Plains: Relatively narrow land strips along the Atlantic and Pacific coastlines. Coastal plains often serve as areas of significant economic activity due to their sea access, facilitating trade, fishing, and tourism. Major cities and ports are typically located on or near these plains, acting as hubs for commerce and communication with the rest of the world.
Latitude’s Fundamental Impact:
Spans vast range from subtropical to temperate zones, creating the foundation for a mosaic of diverse climates.
Specific Climate Zones:
Subtropical:
Northern Mexico (arid to semi-arid with varying precipitation)
Mediterranean climates along certain coasts
Tropical:
Dominates much of the region
Year-round rainfall in some areas, distinct wet/dry seasons in others
Southern Subtropical & Temperate:
Unique conditions ideal for diverse agriculture
The Andes: Master Shapers of Climate:
Create rain shadows on their leeward sides, fostering arid zones like the Atacama Desert.
Drive altitudinal zonation:
Drastic temperature variations with elevation
Distinct zones supporting specific vegetation and agricultural practices (e.g., coffee at certain altitudes)
Ocean Current Influences:
Shape coastal climates along both Atlantic and Pacific coastlines.
El Niño Phenomenon: Agent of Disruption:
Periodically disrupts typical weather patterns across the region.
Can intensify droughts or trigger severe flooding events.

Agriculture and Land Distribution

Amazon and Agriculture Challenges:
Large portions of Brazil are covered by jungle, complicating land clearing for agriculture.
Slash-and-burn practices permitted by the government lead to soil degradation and short-term agricultural use.
Soil becomes untenable for crops within a few years due to deforestation and soil depletion.
Navigability and Land Use:
Parts of the Amazon River are navigable, but its muddy banks and surrounding terrain limit agricultural expansion.
The savannah region below the Amazon, once considered unfit for agriculture, is now a major soybean producer due to Brazilian technology (More information).
Southern Cone Agriculture:
Comprises traditional agricultural lands in Brazil, Argentina, Uruguay, and Chile.
Efforts like moving Brazil’s capital to Brasilia were intended to develop the interior, yet the agricultural heartland remains underdeveloped in transport infrastructure.
Infrastructure and Geography’s Impact on Agriculture:
Brazil’s geographical constraints hinder transport infrastructure development, affecting agricultural and economic potential.
The vast interior and the Brazilian Shield present significant obstacles for connecting coastal cities and developing efficient trade routes.
Lack of modern roads and rail infrastructure complicates goods movement, including agricultural products.

Natural Resources Distribution

Gold:

Mexico: One of the world’s leading gold producers.
Peru: Significant gold reserves and production.
Brazil: Major gold deposits, particularly in the Amazon.
Chile: Long history of gold mining, ongoing production.
Colombia, Argentina, Dominican Republic: Also have notable gold resources.

Silver:

Mexico: World’s top silver producer.
Peru: Another major silver producer.
Chile, Bolivia, Argentina: Possess significant silver deposits.

Minerals:

Chile: World’s top producer of copper. Holds large lithium reserves for batteries.
Peru: Significant producer of copper, zinc, and other metals.
Brazil: Major producer of iron ore, bauxite (for aluminium), manganese, and other minerals.
Mexico: Possesses notable deposits of various minerals.
Bolivia: Important lithium reserves.

Oil and Gas:

Venezuela: Holds the world’s largest proven oil reserves, largely heavy crude.
Brazil: Significant offshore oil and gas production.
Mexico: Major oil producer, both onshore and offshore.
Colombia, Argentina, Ecuador: Hold notable oil and gas reserves.

Language and Food

Core Influences: Spanish and Portuguese

Dominant Languages: Spanish and Portuguese, introduced during colonial times, remain the primary languages across the region.
Rich Dialectal Variation: Spanish and Portuguese display significant variations throughout Latin America. Pronunciation, vocabulary, and even grammar vary as a result of geographic, social, and historical factors. For example, Spanish dialects in lowland Andean regions differ significantly from those in the highlands. In Brazil, Portuguese exhibits regional distinctions that parallel cultural landscapes.

Indigenous Language Legacy

Persistence and Influence: Many indigenous languages continue to thrive. This includes languages like Quechua, Aymara, Nahuatl, and Guaraní, spoken by millions. In Paraguay, Guaraní enjoys remarkable dominance as a national language. Indigenous languages remain central to cultural identity and have had a lasting impact on Latin American Spanish and Portuguese through loanwords and linguistic syncretism.

Other Influences

African Languages: The legacy of the slave trade continues with linguistic features and contributions from various African languages in many parts of Latin America.
Immigrant Languages: Immigration flows from Europe, Asia, and elsewhere have enriched the linguistic landscape with influences from languages like Italian, German, Japanese, and Arabic.

Language and Society

Social Dynamics: Language plays a crucial role in understanding social hierarchies. In countries like Bolivia and Guatemala, indigenous language speakers can face challenges related to marginalisation. By contrast, Paraguay’s widespread use of Guaraní across social classes suggests a scenario of greater linguistic inclusivity.
Linguistic Identity: Language choices are strongly interconnected with cultural identity. Indigenous and African influences are critical contributors to the varied identities found across Latin America.

Critical Infrastructure

Railways

Historical Underinvestment: Compared to Europe or North America, Latin America’s railways are generally less developed. Rail was historically used for commodity transport, and many lines fell into disrepair with a shift towards road transport.
Fragmented Networks: Rail networks are often country-specific, with varying track gauges and little international connectivity. This limits efficiency and regional trade.

Road Networks

Backbone of Transport: Roads dominate freight and passenger movement. Density and quality are highest near major cities. Rural areas suffer from poor quality or seasonally impassable roads.
Major Trade Corridors: Pan-American Highway system spans the region. Other highways act as key trade routes within and between countries. Challenges lie in maintenance, expanding capacity, and improving rural connectivity.
Urban Congestion: Rising vehicle ownership and rapid urbanization put immense pressure on urban roads. Bottlenecks and poor traffic management are common. This necessitates investment in urban expressways, traffic management systems, and public transportation improvements.

Airports

Geographical Necessity: Given vast distances and challenging terrain, air travel is vital. While major cities typically have modern airports, many regional and smaller airports suffer from limited capacity and outdated infrastructure.
Investment Growth: Investment in airport upgrades and expansion is ongoing. Increased privatization aims to attract private capital. Focus areas include runway modernization, terminal expansion, and improved passenger amenities.

Ports

Crucial for Exports: Latin America’s reliance on commodity exports means deep-water ports are critical. Many ports along both the Atlantic and Pacific coasts have seen major updates to meet increasing trade demands.
Capacity Issues: Despite upgrades, capacity restrictions, inefficient logistical operations, and dredging challenges still hinder some ports. This creates vulnerabilities and additional transportation costs.
Hinterland Connections: Strong linkages between ports and inland transport networks (rail and highways) are vital for efficient handling of goods. Improving these connections is a focus of development.

Power Grids

Widespread but Uneven Access: Most regions have basic electricity service, though rural areas remain underserved. The reliability and efficiency of power grids vary between countries.
Renewable Energy Push: Latin America holds large potential for renewable energy, specifically hydroelectric, solar, and wind. Projects are expanding generation capacity and improving transmission lines.

The complex interplay of geography, natural resources, cultural inheritance, and infrastructural development paints a fascinating and challenging picture of Latin America. This initial mental model serves as a launchpad for deeper investigation into the region’s economic, political, and social dynamics.

If I were to write a practical guide on consumer loan credit risk management, this is what the…

JOHN CHAN — Mon, 12 Feb 2024 03:44:08 GMT

If I were to write a practical guide on consumer loan credit risk management, this is what the content page would look like.

Session 1: Fundamentals
Chapter 1: The Art and Science of Decision-Making
Chapter 2: Fundamentals of Interest and Loan

Session 2: Key Objective and Its Components
Chapter 3: The Balance Between Risk and Reward
Chapter 4: Commonly Used Risk Metrics

Session 3: Prediction into the Future
Chapter 5: Estimating Long-term Impact from Short-term Outcome
Chapter 6: Creating Models of the Environment
Chapter 7: Estimating the Response from Interventions

Session 4: Diversification and Portfolio Management
Chapter 8: Modern Portfolio Theory
Chapter 9: Basel Accord

Session 5: Infrastructures that Facilitate Risk Management Practice

Session 6: Other Aspects of Risk

Chapter 1: The Art and Science of Decision-Making delves into the most critical outcome of risk management: decisions. If there are no decisions to be made, risk becomes inevitable (assuming we are conducting business), rendering risk management unnecessary. This chapter also explores typical decisions in risk management, such as underwriting and setting credit limits. Decision-making is a science because it optimises particular objectives. Conversely, it’s an art requiring a balance between exploitation and exploration. Merely exploiting known factors may not yield the best outcomes. We must venture into the unknown.

Chapter 2: Fundamentals of Interest and Loans examines the concept of interest, drawing insights from “The Impatience Theory of Interest: A Study of The Causes Determining The Rate of Interest” by I. Fisher. It is crucial to understand why customers are willing to pay for a product, the value it provides, and how its price is determined. This chapter also introduces the most basic of financial instruments, the loan, which, alongside equity, underpins other financial vehicles such as options.

Session 2: Key Objectives and Their Components discusses the primary goal of risk management, which is not to minimise risk but to maximise profit. The definition of profit can vary within an organisation, for instance in terms of observation period and granularity (Product A only, or both Product A and B). This session also outlines typical components of revenue and cost and introduces commonly used risk metrics such as FPD, 3FPD, Vintage, and Flow Rate.

Session 3: Predicting the Future delves into the science behind decision-making in risk management, highlighting the mismatch between the timeliness of decisions and the duration of optimisation goals. For example, short-term risk can deteriorate rapidly during a financial crisis. Even though the optimisation goal might be a 12-month profit, immediate action is required based on the assumption that short-term risks will translate into long-term risks. Estimating the long-term impacts of short-term outcomes is critical. This session also covers typical models used in credit risk management, placing them within a broader decision-making framework rather than focusing on model development details. The final chapter discusses decision impacts, scientific methods for quantifying them, and how understanding responses can lead to better decisions.

Session 4: Diversification and Portfolio Management emphasises finance’s core principle: resource allocation. Although I have not used modern portfolio theory in my career, it could offer valuable insights. Firstly, it might provide a framework for funding allocation across markets, different strategies, and even within user segments of the same portfolio. Secondly, it could facilitate objective monitoring of risk-adjusted rewards over time. The session’s last chapter should explore the regulatory framework of risk, such as Value at Risk (VaR), which is outside my expertise.

Session 5: Infrastructure Facilitating Risk Management Practices. Since this guide is practical, it will cover the essential infrastructures for conducting risk management, including data platforms, decision engines, experiment platforms, and modelling platforms.

Session 6: Other Aspects of Risk highlights the limitations of this guidebook. Credit risk is just one form of risk; others include fraud, recovery, operations, etc.

After three years in the risk management industry, these are my learnings. This represents the branch of my knowledge tree, yet there is still much to explore and fill in the gaps.

Originally published at https://www.zhizhi-gewu.com on February 12, 2024.

The Bias of Using Observational Data to Estimate Causal Effect

JOHN CHAN — Sat, 27 Jan 2024 10:45:31 GMT

Please install the Math Everywhere Plugin to view math formulas.

Let’s consider the effect of college attendance on an individual’s mental ability. We find that individuals who have attended college score higher than those who have not. What are the possible reasons? Does college attendance cause an increase in mental ability? There are three possible explanations. First, attending college might make individuals smarter on average. Second, those who attend college might have been smarter in the first place (i.e., even if they didn’t attend college, their mental ability is higher). Third, the mental ability of those who attend college may increase more than it would for those who did not attend college if they had instead attended college (meaning, the response to college attendance between the two groups is inherently different). We will try to demonstrate this using mathematical formulas.

Let’s define the Naive Estimator as:
\[\hat{\delta} = E_N[y_i|d_i = 1] - E_N[y_i|d_i=0]\]

Here, \(N\) is the sample size from the observational data. \(y_i\) is the realized outcome (here, the measured mental ability) of individual \(i\). \(d_i = 1\) means the individual received treatment and \(d_i=0\) means they did not. The estimator suggests that the treatment effect can be estimated by subtracting the average mental ability of those who did not attend college from that of those who did. Of course, this estimate is a naive one.

The definition of Average Treatment Effect (ATE) is \(E[\delta]=E[Y^1] - E[Y^0]\). The superscripts \(1\) and \(0\) indicate whether treatment is received or not. Notice that \(Y\) is a random variable as opposed to \(y_i\), which is the realized value for the random variable. Also, note that as opposed to individual treatment effect, which is defined as \(\delta_i = y_i^1 - y_i^0\), we are interested in the aggregate causal effects. Let \(\pi\) be the proportion of the population that takes the treatment. We can rewrite the ATE as:
\[E[\delta]=\{\pi E[Y^1|D=1]+(1-\pi)E[Y^1|D=0]\} - \{\pi E[Y^0|D=1]+(1-\pi)E[Y^0|D=0]\}\]
For a sufficiently large sample size \(N\), \(E_N[y_i|d_i = 1] \to E[Y^1|D=1]\), and \(E_N[y_i|d_i = 0] \to E[Y^0|D=0]\). Also, \(E_N[d_i] \to \pi\). However, there is no assumption-free way to compute the two remaining unknowns: \(E[Y^1|D=0]\) and \(E[Y^0|D=1]\), which are the counterfactuals. Therefore, we are unsure whether the Naive Estimator is equal to the ATE. So, when will they differ?

Let’s rearrange the ATE formula in the following way (the algebra is a bit tricky, but it is just algebra). Let \(E[\delta]=e\), \(E[Y^1|D=1]=a\), \(E[Y^1|D=0]=b\), \(E[Y^0|D=1]=c\), and \(E[Y^0|D=0] = d\). Then, \(e=\pi a + b - \pi b - \pi c - d + \pi d\). Moving everything to one side gives \(0 = e - b + d - \pi a + \pi b + \pi c - \pi d\). We need to find what is equal to \(a - d\), the limit of the Naive Estimator. Adding this zero to \(a - d\) gives \(a - d = (a - d) + e - b + d - \pi a + \pi b + \pi c - \pi d\), which becomes \(a - d = e + a - b - \pi a + \pi b + \pi c - \pi d\). Finally, regrouping the terms, it simplifies to:
\[a - d = e + (c - d) + (1 - \pi)[(a - c) - (b - d)]\]
\(a - d\) is different from \(e\) when \(c - d\) is non-zero or \((a - c) - (b - d)\) is non-zero. \(c - d = E[Y^0|D=1] - E[Y^0|D=0]\), which is the baseline bias (those who attend college are naturally smarter than those who did not). \((a - c) - (b - d) = (E[Y^1|D=1] - E[Y^0|D=1]) - (E[Y^1|D=0] - E[Y^0|D=0]) = E[\delta|D=1] - E[\delta|D=0]\). This is called the differential treatment effect bias (the response to college attendance between the two groups is inherently different).
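Since the algebra is easy to get wrong, here is a numeric sanity check of the decomposition. The four conditional means and \(\pi\) are made-up numbers chosen only for illustration; in real observational data \(b\) and \(c\) are unobservable counterfactuals.

```python
# Hypothetical values for the four conditional expectations and pi.
pi = 0.3   # share of the population that attends college
a = 10.0   # E[Y^1 | D=1]
b = 8.0    # E[Y^1 | D=0]  (counterfactual)
c = 6.0    # E[Y^0 | D=1]  (counterfactual)
d = 5.0    # E[Y^0 | D=0]

ate = (pi * a + (1 - pi) * b) - (pi * c + (1 - pi) * d)
naive = a - d
baseline_bias = c - d
differential_bias = (1 - pi) * ((a - c) - (b - d))

print(f"naive estimator         = {naive:.2f}")
print(f"ATE                     = {ate:.2f}")
print(f"baseline bias           = {baseline_bias:.2f}")
print(f"differential effect bias = {differential_bias:.2f}")
print(f"ATE + both biases       = {ate + baseline_bias + differential_bias:.2f}")
```

With these numbers the naive estimator overstates the ATE, and the gap is exactly the sum of the two bias terms.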

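When trying to recover the causal effect from observational data, we use different techniques to remove these two biases. One simple way to avoid them altogether is to conduct a randomized controlled experiment, in which \((Y^1,Y^0)\) is independent of \(D\). Independence implies \(E[Y^0|D=1] = E[Y^0|D=0]\) and \(E[\delta|D=1] = E[\delta|D=0]\), so both bias terms vanish.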
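A small simulation shows the difference between self-selection and randomization. It is a sketch under strong assumptions that I chose for simplicity: baseline ability is standard normal, college adds a constant effect of 2 for everyone (so only baseline bias is present), and in the self-selected world exactly those with above-average ability attend.

```python
import random
from statistics import mean

random.seed(1)
N = 20000
TRUE_ATE = 2.0

# Hypothetical potential outcomes: y0 is baseline ability, y1 adds a constant effect.
y0 = [random.gauss(0.0, 1.0) for _ in range(N)]
y1 = [v + TRUE_ATE for v in y0]


def naive_estimate(d):
    """Mean observed outcome of the treated minus that of the untreated."""
    treated = [y1[i] for i in range(N) if d[i] == 1]
    control = [y0[i] for i in range(N) if d[i] == 0]
    return mean(treated) - mean(control)


# Self-selection: people with higher baseline ability choose treatment.
self_selected = [1 if v > 0 else 0 for v in y0]
# Randomization: a fair coin decides, independent of the potential outcomes.
randomized = [random.randint(0, 1) for _ in range(N)]

biased = naive_estimate(self_selected)
unbiased = naive_estimate(randomized)
print(f"true ATE = {TRUE_ATE}, self-selected = {biased:.2f}, randomized = {unbiased:.2f}")
```

Under self-selection the naive estimator absorbs the baseline bias, while under randomization it lands close to the true effect.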

If you like my article, please consider subscribing to my blog: https://www.zhizhi-gewu.com/#/portal/signup

Originally published at https://www.zhizhi-gewu.com on January 27, 2024.

Prediction Or What If: What’s the Nature of Your Problem?

JOHN CHAN — Wed, 10 Jan 2024 12:00:35 GMT

With the rise of machine learning techniques, I believe we have become more susceptible to the ‘law of the instrument.’ This principle suggests that if your only tool is a hammer, every problem looks like a nail. While techniques like regression, classification, and tree-based models offer tremendous value to businesses and societies, they are not well-suited for a crucial type of problem: intervention.

In predictive modelling, given a set of features \(X\), we estimate the probability of an outcome \(Y\), that is, \(P(Y|X)\). Machine learning algorithms use historical data to approximate this conditional probability distribution, which is not always available directly. The trained model is then used to make predictions. This approach works well in many tasks, such as facial recognition and image classification. However, the model’s predictive power diminishes if the probability distribution changes.

This limitation is why predictive models are unsuitable for intervention problems. When a policy acts on its subjects, such as through vaccine administration or advertising exposure, it alters the likelihood of a response \(Y\) given \(X\). Relying on the pre-intervention conditional probability distribution can lead to inaccurate predictions. The probability we need to estimate is not \(P(Y|X)\) but \(P(Y|do(X))\), where ‘do’ signifies setting a variable through intervention rather than merely observing it. One might consider using a predictive model with a comparable group that has not undergone the intervention. However, once an action is taken, we can no longer observe the counterfactual, or ‘what-if,’ scenario for the same subjects. Predictive models would suffice only in an idealized world where data for both scenarios are available.
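A small simulation can make the gap between observing and intervening concrete. Everything in the sketch below is an assumption chosen for illustration: a hypothetical "engaged customer" flag `z` drives both ad exposure `x` and purchase `y`, and the ad itself adds 0.1 to the purchase probability.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500_000

def purchase(x, z):
    """Hypothetical outcome model: P(Y=1) = 0.1 + 0.1*x + 0.5*z."""
    return rng.random(x.size) < 0.1 + 0.1 * x + 0.5 * z

z = rng.random(n) < 0.5                           # engaged customer (confounder)

# Observational regime: engaged customers are far more likely to see the ad.
x_obs = rng.random(n) < np.where(z, 0.8, 0.2)
y_obs = purchase(x_obs, z)
observed_gap = y_obs[x_obs].mean() - y_obs[~x_obs].mean()

# Interventional regime: we decide who sees the ad, independently of z.
x_do = rng.random(n) < 0.5
y_do = purchase(x_do, z)
interventional_gap = y_do[x_do].mean() - y_do[~x_do].mean()

print(f"gap in observed data:   {observed_gap:.3f}")
print(f"gap under intervention: {interventional_gap:.3f}")
```

By construction the observed gap is about 0.4, while the gap under intervention is about 0.1, the effect built into the simulation. A model fitted to the observational data would learn the first number, which is the wrong quantity for deciding whether to run the ad.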

Many business problems are, in fact, intervention problems. This includes areas like marketing, UI/UX design, policy design, and customer service. To assess the effectiveness of a particular campaign, layout, rules, or service scripts, we must compare them with their respective counterfactual scenarios: what if we didn’t run the campaign or alter the layout, rules, and scripts? The standard approach is a randomized experiment. However, experiments are not always feasible; they can be costly and time-consuming. Therefore, I believe businesses should arm themselves with another tool — Causal Inference — and distinguish between prediction problems and intervention problems.

Originally published at https://www.zhizhi-gewu.com on January 10, 2024.

[Write to Learn Series] You May Not Know Linear Regression Enough! Discover the Causal Effect of Credit Limit on Risk

JOHN CHAN — Wed, 03 Jan 2024 18:00:55 GMT

Disclaimer: The accuracy of the information in this article is not guaranteed. The methodologies discussed are conceptual and not yet implemented in business.

A common challenge in the consumer loan industry is balancing risk management with credit limit increases. Credit limits, often determined by expert judgment, are typically excluded from the Behavior Scorecard Model (B-Score) to maintain prediction stability. Traditional machine learning models struggle with changes in feature distributions and may inaccurately portray the relationship between credit limits and risk. This phenomenon was particularly evident in my analysis of the Brazil market, which exhibits a stronger sensitivity to credit limits than the Southeast Asian markets I’ve worked in (ID, PH, VN, MY).

In Brazil, I hypothesize, though I have not validated this, that the high household debt ratio, compounded by elevated interest rates, makes consumers more susceptible to debt. This contrasts with Southeast Asia, where credit limits haven’t reached a risk tipping point. This insight underscores the need to understand the nuanced relationship between credit limits and risk, identifying features that differentiate user sensitivity.

The standard approach involves hypothesis-driven experiments, such as segmenting users by income for randomized controlled trials (RCTs). However, such tests are complex and slow, and the number of segments, and with it the resources required, grows exponentially with the number of dimensions tested. An alternative is to test on all users at once, but this is cost-prohibitive in the credit industry because credit limits would have to be assigned at random.

To circumvent these challenges, we can explore the treatment effect by assuming a causal mechanism between credit limits and risk. Let’s consider six variables: Application Scorecard (Acard), Behavior Scorecard (Bcard), utilization, tenure, credit limit, and bill amount. These are represented in a Directed Acyclic Graph (DAG), illustrating their causal relationships.

By generating synthetic data based on this DAG, we observe a negative correlation between risk and credit limit. However, this is confounded by Acard’s influence on both variables.

import seaborn as sns
sns.regplot(x=synthetic_data_v2['credit_limit'],y=synthetic_data_v2['risk'])

Acard’s influence on both Credit Limit and Risk is pivotal: higher Acard scores often lead to higher credit limits, and a higher Acard score also typically indicates lower risk. This makes Acard a key confounder (in the pattern B ← A → C, A is the confounder). Experienced analysts are adept at controlling for Acard to draw accurate conclusions. However, adding more variables can complicate the analysis. Take, for instance, the observation that higher bill amounts correlate with lower risk. Intuitively, one might expect the opposite, since a larger bill amount suggests a greater repayment burden and presumably higher risk. A closer look at the Directed Acyclic Graph (DAG) reveals that the confounding factors here are the interactions between Acard/Credit Limit and Bcard/Utilization. By appropriately adjusting for these confounders, we can uncover positive correlations that align more closely with our intuitive understanding of these relationships.
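The confounding pattern described above can be reproduced with a three-variable toy model. This is not the data from the notebook: the coefficients and the reduced graph (Acard → credit limit, Acard → risk, credit limit → risk) are assumptions picked only to show how a confounder can flip the sign of a raw correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical structural equations (all coefficients are assumptions).
acard = rng.normal(size=n)
credit_limit = 1.0 * acard + rng.normal(size=n)                # better score, higher limit
risk = -1.0 * acard + 0.3 * credit_limit + rng.normal(size=n)  # the limit itself raises risk

# Raw slope of risk on credit limit, ignoring Acard.
raw_slope = np.polyfit(credit_limit, risk, 1)[0]

# Slope of risk on credit limit with Acard added as a control.
X = np.column_stack([np.ones(n), credit_limit, acard])
adjusted_slope = np.linalg.lstsq(X, risk, rcond=None)[0][1]

print(f"raw slope:      {raw_slope:.3f}")
print(f"adjusted slope: {adjusted_slope:.3f}")
```

The raw slope is negative even though the simulated effect of credit limit on risk is positive (0.3). Adding Acard to the regression recovers a slope close to 0.3.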

How does all of this relate to Linear Regression? The answer lies in the Frisch-Waugh-Lovell (FWL) Theorem. Controlling for confounders is equivalent to adding the confounders to the regression.

import statsmodels.formula.api as smf

smf.ols('risk ~ credit_limit + acard', synthetic_data_v2).fit().summary().tables[1]

coef std err t P>|t| [0.025 0.975]

credit limit is now positively correlated to risk.

smf.ols('risk ~ bill_amt + utilization + acard', synthetic_data_v2).fit().summary().tables[1]

coef std err t P>|t| [0.025 0.975]

bill amt is now positively correlated to risk.

The FWL Theorem states that the following are equivalent:

the OLS estimator obtained by regressing y on x₁ and x₂
the OLS estimator obtained by regressing y on x̃₁, where x̃₁ is the residual from the regression of x₁ on x₂

The second method says that the residuals from the regression of x₁ on x₂ represent the part of x₁ that is uncorrelated with the other variables. The final regression then estimates the relationship between y and x₁ free from the influence of those variables.

Could I control all variables then? Let’s try:

smf.ols('risk ~ credit_limit + acard + utilization + bill_amt + tenure', synthetic_data_v2).fit().summary().tables[1]

With every variable controlled for, the regression shows no relationship between credit limit and risk. We need to select the right variables to include in the regression.
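The FWL equivalence stated above is easy to verify numerically. The sketch below uses plain NumPy on simulated data; the variables `x1`, `x2`, `y` and their coefficients are assumptions for illustration and do not come from the article's dataset.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients with an intercept prepended; X may be 1-D or 2-D."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(7)
n = 10_000
x2 = rng.normal(size=n)                          # the control (confounder)
x1 = 0.8 * x2 + rng.normal(size=n)               # the variable of interest
y = 0.3 * x1 - 1.0 * x2 + rng.normal(size=n)

# Method 1: regress y on x1 and x2 together; keep the coefficient on x1.
beta_full = ols(np.column_stack([x1, x2]), y)[1]

# Method 2: residualize x1 on x2, then regress y on that residual alone.
intercept, slope = ols(x2, x1)
x1_resid = x1 - (intercept + slope * x2)
beta_fwl = ols(x1_resid, y)[1]

print(beta_full, beta_fwl)
```

The two coefficients agree up to floating-point error, which is the content of the theorem: adding `x2` to the regression and first stripping `x2` out of `x1` are the same operation as far as the coefficient on `x1` is concerned.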

This method enables us to explore credit limit and risk relationships without extensive RCTs, allowing for more efficient user segmentation and feature testing, which I will try to cover in the next post on Effect Heterogeneity.

However, it’s crucial to remember that causal diagrams are based on assumptions that may not always hold. They can be challenged and refined using real-world data and analytical techniques like d-separation or conditional independence testing.
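As a minimal sketch of what such a check could look like, consider a hypothetical chain A → B → C (not one of the edges from the article's DAG). The graph implies that A and C are independent given B, which can be probed with a partial correlation. This only detects linear dependence, so it is a rough screen rather than a full conditional independence test.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones(len(z)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(3)
n = 20_000

# Simulated chain A -> B -> C (coefficients are assumptions).
a = rng.normal(size=n)
b = 0.9 * a + rng.normal(size=n)
c = 0.9 * b + rng.normal(size=n)

marginal = np.corrcoef(a, c)[0, 1]      # A and C are clearly correlated
conditional = partial_corr(a, c, b)     # the correlation should vanish given B

print(f"corr(A, C)     = {marginal:.3f}")
print(f"corr(A, C | B) = {conditional:.3f}")
```

If real data showed a sizeable partial correlation where the assumed diagram implies none, that would be evidence against the diagram, for example a missing direct edge from A to C.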

For an in-depth exploration and practical application of these concepts, feel free to visit the accompanying Google Colab notebook: https://colab.research.google.com/drive/1KeQUkB2eiHKTgT_2S_rr8trbZv-lbGUX?usp=sharing

Originally published at https://www.zhizhi-gewu.com on January 3, 2024.

Journey Through Years: Reflecting on Growth, Principles, and Future Aspirations (2024) | Data Driven Everything

JOHN CHAN — Mon, 25 Dec 2023 09:32:39 GMT

Since the year 2020, I have shared my reflections on the year and my expectations for the upcoming one. It’s interesting to review these notes:

I would say the main difference between the older and newer reflections is that my thoughts and ideas have become more structured. In the first one or two years, I discussed specific goals I had achieved or missed and what I wanted to achieve the following year. These diverse goals showed that I was still exploring the world. Since 2022, I’ve focused more on forming my own principles to navigate life. I still have goals and aspirations, but they are now part of a larger framework.

Reflection shouldn’t be done only once a year because humans tend to forget. We often overestimate the lasting effects of both happiness and pain. Given enough time, our feelings weaken. That’s why it’s important to reflect regularly. This year, I attempted to write weekly/monthly reflections, and the yearly reflection is essentially a summary of these. (I must confess that I didn’t write every week.)

I listed seven guiding principles in my 2022 year-end reflection. I believe there is no need for change, as adhering to these principles will at least prevent me from failing:

High moral standards, righteousness, and trustworthiness
Humility
Usefulness to the world
Healthiness
A drive to understand the world
Long-term thinking and being a friend of time
Focus

Reviewing my weekly/monthly reflections, I felt proud of maintaining high moral standards and being trustworthy. However, the inner battle between good and evil is real. It doesn’t mean I won’t defend my interests if someone tries to harm them. It’s a struggle over whether to become that person on the other side, as the immediate gain is tempting, and I feel angry each time. I’ve concluded it’s not worth being that person; one cannot cheat their way to the top, and I must stand up for my interests. Being good doesn’t mean being weak.

I also felt proud of my curiosity to better understand the world. As previously mentioned, I think the world can be divided into the Physical, Psychological, and Logical realms. I’ve read extensively on these subjects, completing books such as “Calculus” by Michael Spivak (an introduction to Analysis), “價值” by 張磊, “心” by 稻盛和夫, “Pioneering Portfolio Management” by David F. Swensen, “Difficult Conversations” by Douglas Stone, “What Color Is Your Parachute?” by Richard N. Bolles, “Superforecasting: The Art and Science of Prediction” by Philip Tetlock, and “Bold Vision: The Untold Story of Singapore’s Reserves and Its Sovereign Wealth Fund” by Freddy Orchard. Later in the year, I also read actively about Linear Algebra, Causal Inference, Reinforcement Learning, and the credit card industry. Apart from books, I mainly read articles on science, artificial intelligence, and finance through Medium and Feedly, following various RSS feeds. I tried reading the WSJ and Financial Times, but the information overload led me to unsubscribe. I think I’ll focus on materials with higher information density for now.

I need to improve in humility. There are two aspects to this issue. First, overconfidence may hinder progress. Second, even if I’m right, the way I express my understanding and confidence may hurt others’ feelings. In my reflections, I consistently ponder these aspects. I’m not overly concerned about overconfidence, as I’m aware of many unknowns. My main concern is how quickly I judge right from wrong, which often intensifies conflicts in conversations. I need to improve my persuasion techniques, but first, I must control the urge to prove things right or wrong.

Another area for improvement is health. I’m still very healthy according to my health screenings, but I want to develop a consistent exercise habit, including both aerobic and anaerobic exercises. The main obstacle is time. To develop a habit, consistency is key. However, routines can be disrupted by holidays, overtime work, fatigue, illness, etc. Another issue is that I mainly do aerobic exercises. Incorporating weight training is important for preventing muscle loss and improving appearance.

For the remaining three principles — usefulness to the world, long-term thinking and being a friend of time, and focus — I don’t have much to say for now. Perhaps it’s because I’m still young. However, they will become increasingly important as I age, so it’s better to incorporate these factors into my decision-making going forward.

Looking ahead to 2024, I will continue to follow these seven principles and have outlined some areas of focus below:

In conclusion, I hope that sharing these reflections publicly will not only serve as a personal reminder to me but also provide enlightenment and inspiration to all my readers.

Originally published at https://data-driven-everything.com on December 25, 2023.

The Bad Habits We Inherited From School

JOHN CHAN — Sun, 23 Jul 2023 08:36:19 GMT

The context for this blog post is important to grasp, so allow me to provide some background. Having graduated from University in 2017 and promptly joined the workforce, I’m now in my fifth year of employment. When I encounter fresh faces on the team, particularly those born in the 2000s, I feel a sense of age creeping in. They also prompt me to reflect on the gap between academic life and the realities of the working world, bringing to light the fact that excelling in school doesn’t always translate to success in the workplace, although it does enhance the chances.

The first habit borne out of school years that tends to be detrimental in the professional world is being a good problem-solver ONLY. In school, we’re groomed to tackle problems as swiftly as possible; the more practice tests and exercises we ace, the better students we become. Yet, the workforce has taught me that not all problems warrant solving, and the significance of problems can vary greatly. It’s equally, if not more, important to discern which problems are worth our attention. I’m not advocating that we shirk our responsibilities or dump difficult tasks onto others, but broadening our perspective to distinguish the most critical issues to address can be an invaluable skill.

The second habit that doesn’t translate well from classroom to office is the expectation of a syllabus or structured guideline to success. We’re accustomed to detailed curriculums and learning paths in school, but the real world rarely offers such a roadmap. This is mainly because performing well, even within a single industry, demands a diverse skill set, and there are countless paths to achieving success. We must, therefore, learn by observation, soaking up wisdom from those more experienced. Don’t anticipate that your colleagues will freely impart their knowledge; in fact, some might view you as competition and guard their insights jealously. To thrive, we must be proactive, observe others, ask questions, and critically analyze their responses.

The third unhelpful habit is viewing everyone as a potential competitor. The competition in school can be fierce, with everyone vying for that coveted number one spot. However, the real world isn’t as black and white; success isn’t exclusive. There’s room for more than one ‘number 1’. If we focus solely on outdoing others, we risk isolating ourselves, which hampers collaborative efforts. Success in the real world often necessitates teamwork and sharing insights and resources. So, if you’re viewing everyone as a competitor, your reach may be limited despite being an academic superstar.

The final habit to shed is relying on external motivation to learn. In school, parents and teachers often fuel our drive to learn. Unfortunately, I’ve observed that some peers cease to learn once they leave this structured environment, primarily because they lack the motivation to do so. Why subject ourselves to the rigors of learning when no one is prodding us, right? Wrong. The impetus to learn should be self-driven, facilitating our ability to adapt to our ever-evolving world.

A few years down the line, I may pen an article titled “The Bad Habits We Inherited After Working For 10 Years”. Until then, I believe that if we sustain our curiosity, eagerness to learn, and readiness to face and solve challenges, we’ll be just fine.

Originally published at https://data-driven-everything.com on July 23, 2023.

Revolutionizing Data Analytics with Text-to-SQL | Data Driven Everything

JOHN CHAN — Sat, 15 Apr 2023 08:29:24 GMT

A significant portion of a Data Analyst or Business Intelligence Analyst’s time is spent translating business questions into SQL queries. These analysts often serve as an interface between humans asking questions and the computers processing the data. In business settings, Software Engineers essentially act as translators, converting business requirements into executable code. With advancements in Natural Language Processing (NLP) and Large Language Models (LLMs), the role of analysts could potentially be replaced by language models, reducing the need for manpower and enabling access to databases without expert SQL knowledge.

The state-of-the-art Text-to-SQL model has achieved a remarkable 79.1% execution accuracy, measured by the percentage of generated SQL queries that return correct results when executed on the Spider development set, and a 97.8% valid SQL ratio[¹]. In comparison, OpenAI’s Codex davinci, without any fine-tuning, reached 67.0% execution accuracy and a 91.6% valid SQL ratio. If the model generates accurate results without any corrections, its performance may even surpass that of a human. Consider how often we can write a valid SQL query on the first try and obtain the desired statistics immediately.
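To make the two metrics concrete, here is a minimal sketch of how execution accuracy and the valid SQL ratio could be computed against an in-memory SQLite database. The `orders` table and the query pairs are invented for illustration, and the official Spider evaluation is more involved than this simple comparison of result sets.

```python
import sqlite3

def run(conn, sql):
    """Return the query's rows in a canonical order, or None if the SQL fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def evaluate(conn, pairs):
    """Return (execution accuracy, valid SQL ratio) for (predicted, gold) pairs."""
    valid = correct = 0
    for predicted, gold in pairs:
        predicted_rows = run(conn, predicted)
        if predicted_rows is None:
            continue                      # invalid SQL counts against both metrics
        valid += 1
        if predicted_rows == run(conn, gold):
            correct += 1
    return correct / len(pairs), valid / len(pairs)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, country TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'BR', 10.0), (2, 'BR', 20.0), (3, 'VN', 5.0);
""")

pairs = [
    # Different text, same result: counted as correct.
    ("SELECT COUNT(*) FROM orders", "SELECT COUNT(id) FROM orders"),
    # Valid SQL, wrong answer.
    ("SELECT SUM(amount) FROM orders WHERE country = 'VN'",
     "SELECT SUM(amount) FROM orders WHERE country = 'BR'"),
    # Not valid SQL.
    ("SELEC * FROM orders", "SELECT * FROM orders"),
    # Different text, same result: counted as correct.
    ("SELECT country FROM orders GROUP BY country",
     "SELECT DISTINCT country FROM orders"),
]

execution_accuracy, valid_ratio = evaluate(conn, pairs)
print(execution_accuracy, valid_ratio)
```

On these four pairs, two predictions return the gold result and three execute at all, giving an execution accuracy of 0.5 and a valid SQL ratio of 0.75. Note that the metric rewards matching results, not matching query text.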

After spending several hours on a Saturday morning reviewing literature in this area, I’ve found that current research can be divided into a few key parts:

Evaluation methodology: addressing limitations and improvements in existing text-to-SQL datasets and evaluation metrics
Multilingual Text-to-SQL: expanding research to encompass more languages
Cross-Database Text-to-SQL: developing models that generalize across domains and databases
Pre-training Text-Table Data: training LLMs on text-table data
Structure Grounding: mapping natural language phrases or words to database elements, such as tables, columns, and values, while determining relationships between them.

Text-to-SQL is poised to make a significant impact on businesses. By reducing costs and increasing efficiency when generating statistics from databases, this technology has garnered attention from major companies like Microsoft, Google, and Baidu, which are actively developing their own tools.

My Two Cents:
One challenge in the field of text-to-SQL is the inherent imprecision of human language. The task is not a direct translation; it requires contextual information and clarification of definitions and requirements. Ideally, the model should treat SQL query generation as an interactive process, with performance evaluated by how well the final query satisfies the user relative to the cost of the executions needed to reach it. Additionally, techniques used by human analysts could be incorporated into the language model, such as examining a few rows of data before writing the query, or writing simple code first and then adding complexity. By “teaching” language models the definitions of metrics and the correct results to check against, complexity can be added primarily in terms of dimensions.
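One way to picture the "look at a few rows first" habit is to hand the model the schema, some sample rows, and the metric definitions before the question. The sketch below only assembles that text; the function names, the prompt layout, and the `orders` table are assumptions for illustration, and the call to an actual language model is deliberately left out.

```python
import sqlite3

def describe_table(conn, table, n_rows=3):
    """Return the table's column names and its first few rows as plain text.

    `table` is interpolated into SQL, so it must be a trusted identifier.
    """
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    rows = conn.execute(f"SELECT * FROM {table} LIMIT ?", (n_rows,)).fetchall()
    lines = [f"Table {table}({', '.join(columns)})"]
    lines += [" | ".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)

def build_prompt(conn, table, question, metric_definitions):
    """Put schema, sample rows, and metric definitions ahead of the question."""
    definitions = "\n".join(
        f"- {name}: {text}" for name, text in metric_definitions.items()
    )
    return (
        f"{describe_table(conn, table)}\n\n"
        f"Metric definitions:\n{definitions}\n\n"
        f"Question: {question}\nSQL:"
    )

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, country TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'BR', 10.0), (2, 'BR', 20.0), (3, 'VN', 5.0);
""")

prompt = build_prompt(
    conn,
    "orders",
    "What is the GMV per country?",
    {"GMV": "sum of amount over all orders"},
)
print(prompt)
```

In an interactive setting, the generated query would then be executed, checked against a known-correct figure for the metric, and refined, with new dimensions added only after the simple version is right.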

In conclusion, text-to-SQL is an exciting field with significant potential for both human-computer interaction research and practical business applications.

References

[¹]: Rajkumar, Nitarshan, Raymond Li, and Dzmitry Bahdanau. ‘Evaluating the Text-to-SQL Capabilities of Large Language Models’. arXiv, 15 March 2022. https://doi.org/10.48550/arXiv.2204.00498.

Originally published at https://data-driven-everything.com on April 15, 2023.