Invited Talks of #ICLR2024

A Belated Blog

Nicholas Teague
From the Diaries of John Henry
18 min read · May 17, 2024


I recently had an enjoyable week in virtual attendance at ICLR 2024, an international AI research conference that this blog has surveyed in prior years. This attendance came after somewhat of a sabbatical from the blogging circuit: I had actually registered for ICLR 2023 but didn't attend because I WAS ON MY HONEYMOON :). I think that is a good excuse for falling a little behind on the whole double descent bandwagon, don't you?

This was one of the first conferences I had attended in a while in which I didn't have a workshop paper somewhere or other. For those looking to gain some acceptance from the research crowd, workshop papers are a fun way to contribute, and as a bonus you don't get annoyed by all of those recruiters attempting to hire you like those poor sods in the main proceedings must have to deal with.

I will use this post to share some of my notes from the main talks, partly because I haven't blogged in a while and it is fun to share things that might have the potential to help advance the field without having to deal with all of that peer review red tape, and partly because, since most of this blog has never been extensively peer reviewed in any reasonable way, I don't have to worry about people taking it too seriously. Actually, perhaps one way to think about these notes is as my own kind of peer review feedback on the invited talks presented in between paper sessions. They are loosely structured as a few key points from each talk followed by any public comments I shared to the group chat for the speakers. (Yes, I agree these conference chat boards are usually a ghost town; I am trying to do my part to make things a little more interactive for those watching from home.)

I do recall that more diverse regional representation was common in these venues before everything went virtual. My personal philosophy is that unless I intend to be open to professional interaction, I would probably limit my paper viewing to the scope of an abstract or poster, in the manner of walking down the hall at a poster session. I recognize that cross-border research is part of the reason this stuff has potential for controversy. I think the traditional justification for openness has included the context of international challenges like Covid, or the potential significance of a small number of regional players gaining early access to AGI creating a perverse incentive to prevent other firms from advancing. Extending well beyond factual speculation, to be honest I had always considered it kind of a scary allusion to what could come to pass from Liu Cixin's science fiction work The Three-Body Problem, as far as researchers facing new forms of obstacles to innovation.

I don't think I have yet seen much in the way of public debate about whether there is a desire for long-term retention of such openness as certain thresholds of artificial intelligence are met. My philosophy is that as long as I attempt to conduct myself in a manner that avoids obvious paths to bad outcomes and respects channels of authority as they become known, any sources of ambiguity about what is or isn't appropriate for communication can at least be resolved with time. Thus I don't think it is entirely unreasonable that my work has found its way into a sort of black hole for a while, although to be honest the significance of that deficit has appeared somewhat extreme even in this context. (It is one of those risks you take when you attempt public research without formal affiliation. I had always hoped that a fallback, should my work not reach significant visibility, could be to at least leverage the experience for access to some more desirable affiliation; it remains to be seen whether that may ever pan out. :)

Why your work matters for climate in more ways than you think

Priya Donti (MIT)

Note that Priya Donti has been a part of the AI for climate change movement since around the time I became active in the community. Her survey with collaborators, "Tackling Climate Change with Machine Learning," was a good overview of potential applications when it was published in 2019, although that survey did not quite convey the full scope of the modern foundation model paradigm.

A big thing I have seen from these workshops (which I have attended sporadically) is that in addition to practical applications in energy and conservation, a noted part of the AI for climate change movement includes benefit from channeling AI towards satellite imagery interpretation and assessment. Priya noted some other recurring themes that have shown up in the community, which has been conducting frequent workshops in this and other venues for several years:

  • Distilling raw data
  • Improving predictions
  • Optimizing complex systems
  • Predictive maintenance
  • Accelerating scientific discovery
  • Approximating time-intensive simulations
  • Data management

She also highlighted a significant trend relevant to our field, with a chart demonstrating data center power consumption on a growth track reminiscent of the initial internet boom of the early 2000s. My recollection is that some time ago a common way to think about data center power consumption was that it could be expected to take up on the order of 1% of a nation's power consumption. The growth projections in this chart suggested that figure is likely to increase.

Note to self: when you get a chance, look up the term "optimization-in-the-loop AI," which Priya discussed in the context of power grid optimization via reinforcement learning, and where such applications could become a big lever to improve the penetration capacity of renewable energy generation on our national grid.
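As a rough illustration of what I understand the optimization-in-the-loop idea to mean (my own toy sketch, not anything from the talk): a learned model predicts parameters of a constrained optimization problem, a differentiable solver layer produces the decision, and gradients flow back through the solution. The cvxpylayers package supports this pattern; the economic-dispatch-flavored problem below is an invented placeholder.

```python
# Toy "optimization-in-the-loop" sketch (my own illustration): a differentiable
# convex optimization layer solves a tiny dispatch problem inside the training
# loop, so gradients flow back through the dispatch decision to the upstream
# cost predictions.
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

n = 3                                   # number of generators (toy scale)
g = cp.Variable(n)                      # dispatch decision
cost = cp.Parameter(n)                  # predicted per-unit generation costs
demand = cp.Parameter(nonneg=True)      # total demand to serve

# small quadratic term keeps the solution map smooth enough to differentiate
objective = cp.Minimize(cost @ g + 0.1 * cp.sum_squares(g))
constraints = [g >= 0, g <= 1, cp.sum(g) == demand]
layer = CvxpyLayer(cp.Problem(objective, constraints),
                   parameters=[cost, demand], variables=[g])

predicted_cost = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(g_star,) = layer(predicted_cost, torch.tensor(2.5))
g_star.square().sum().backward()        # placeholder downstream loss
print(g_star, predicted_cost.grad)      # gradients reached the predictions
```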

Public comment: "Hello Priya, attending virtually this year, have sat in on some of your prior workshops. This isn't a fully formed thought, but I was thinking about it during your presentation. Perhaps there is a hidden lever that could be applied towards the benefit of the domain. If markets require some kind of price signal for channeling resources to avoid externalities, the traditional ways policy has attempted to accommodate that are through added costs from things like a pure carbon tax, cap and trade, or required purchase of carbon offsets (which can include scope like forestry planting / preservation credits, energy conservation investments, etc.), or otherwise through incentives for alternatives rather than deterrents (things like e.g. the production tax credit for renewable energy generation or loan programs for emerging tech; see the work of Jigar Shah for instance). I speculate that top tier AI / data center resources are currently reaching some form of capacity constraint in their ability to distribute artificial intelligence (as evidenced by the growing power demand for data centers noted above). Perhaps one way we could add an additional market signal would be to prioritize tiers of intelligence capabilities towards those domains which may not have the same resources as high carbon emitters. In other words, if e.g. an emerging tech resource is identified as capable of making headway against these trends, then other than a monetary tax incentive, we could likewise channel hidden incentives through higher tiers of intelligence for those resources serving the community. I don't know what that would look like, and the ethics of commercial vendors of cloud-sourced AI compute having policy agendas is itself kind of murky, so perhaps at a minimum some form of transparency would be appropriate.

In other words, put simply, if AI compute is a finite resource, perhaps we as an industry can find some way to prioritize allocation of more impactful resources towards those domains with potential for macro benefit (in a manner similar to how economic incentives can be channeled by the tax code).”

Copyright Fundamentals for AI Researchers

Kate Downing (Law Office of Kate Downing)

One of the big reasons why Google Books was allowed to proceed while services like Napster were shut down was the presence of guardrails to mitigate channels for e.g. reading a full book line by line. Services like Napster didn't only have the potential to be used for infringement; they actively promoted the use of their service for that end. Our courts want to see a proactive attempt to mitigate the risk of widespread copyright infringement.

Note that when the speaker highlighted some of the current arguments being used by vendors as to why foundation models (like large language models and diffusion models) are not really copyright infringers, very few in the crowd were willing to raise their hands in agreement. I think we all acknowledge that these models are capable of copyright infringement. The major vendors are starting to disable capabilities for prompts that solicit "copying the style" of an author or artist.

Public comment: "With regards to current foundation model services beginning to disable prompts with instructions to "copy the style" of an author or artist, I expect that a good compromise for major copyright owners would be to facilitate such capability as an opt-in feature in a "GPT store" type setting, where copyright owners can set their own pricing for allowing models to serve derivative content from their works."

Public comment: "An important differentiator for foundation model services could be achieved by establishing a reliable means for services to channel some form of citation into their output. (Various RAG services are probably making this more feasible for those without a search engine.) From a liability-shielding standpoint, a citation mechanism would be a great way to demonstrate that models continue to incentivize the creation of novel work. Perhaps we could find some way to establish a license certifying "all original content" for publications, which could make a work eligible for citation? (The copyright office database would probably be a great starting point.)"

Test of Time Awards

Auto-Encoding Variational Bayes

Diederik (Durk) Kingma, Max Welling

Variational autoencoders were a precursor to the modern paradigm of diffusion-based generative image models. The talk had some interesting comments about the path to diffusion: apparently the earliest demonstrations took place around 2015 and framed the diffusion process as an analog to thermodynamics. To be honest I didn't take extensive notes; I just wanted to highlight the recipients and acknowledge the importance of this work to many paradigms of learning.

The runner-up for the test of time award went to "Intriguing properties of neural networks" by Christian Szegedy et al., which is considered the first documentation of how neural networks can be susceptible to adversarial perturbations (and which led to a great deal of energy channeled into how to make models "adversarially robust" against "adversarial examples"). Szegedy's work even demonstrated a practical way to derive adversarial images for a classifier, where such adversarial examples differ from traditional channels of network misinterpretation in that they arise from subtle, sometimes even humanly imperceptible, deviations in pixel composition. Note that Ian Goodfellow's follow-up work "Explaining and Harnessing Adversarial Examples" is more commonly cited than this paper (possibly because of a more intuitive title).
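For readers who haven't seen one derived: the fast gradient sign method from that Goodfellow follow-up takes only a few lines. A minimal sketch (the model and data here are placeholders):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps=0.03):
    """Return an adversarially perturbed copy of image batch x."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # step in the direction that increases the loss, clamped to valid pixels
    return (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

# usage with any image classifier:
# x_adv = fgsm(model, images, labels)
# model(x_adv) will often disagree with model(images) despite eps being tiny
```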

In hindsight it appears that rather than being a showstopper, the potential for adversarial channels suggests that, at a minimum, those applications requiring the most security should be made aware of their exposure. After all, without the integration of sampled noise, which may itself become a path towards an adversarial example, applications such as denoising diffusion models or the variational autoencoders that preceded them wouldn't be feasible. Teaser: more suggestive evidence in this debate about the benefits of non-determinism versus adversarial robustness can be found later in this essay.

ICLR Town Hall

With regards to the questions surrounding financial incentives for AI research conference participants, I wonder if there may be an analogy to be found with collegiate sports, which is a huge money maker for participating universities but also one in which student athletes have traditionally been bound to amateur status in order to participate (but with perks like scholarships and exposure to pro sports recruiting and things like that). The NCAA (National Collegiate Athletics Association) recently came to a form of compromise in which they began to allow student athletes access to different forms of monetization while still outside of a direct salary basis, including things like the ability to sell likeness rights to video game makers or commercials. I don't know if there will ever be an AI research conference video game looking to buy likeness rights for these paper writers; my only point is that even if a venue like ICLR isn't offering a direct salary to participants, there may still be several forms of perks. (An obvious one is that exposure to cutting edge research should, at least in theory, benefit a participant's job prospects.)

The ChatGLM’s Road to AGI

Jie Tang (KEG, Tsinghua University)

So I wasn't familiar with KEG at first, but it turns out they are actually among the current top 5 most-liked organizations on Hugging Face (which is sort of a GitHub for distributing model weights and whatnot).

It was interesting to see that this speaker appeared to have been pursuing, as far back as 2019, the "system 1 and 2" convention (proposed in a famous Yoshua Bengio NeurIPS talk that was itself inspired by Daniel Kahneman's behavioral economics framing), where under a systems 1 and 2 framing an AI may rely on alternate conventions for structured reasoning (e.g. leveraging a knowledge graph or supplemental agents) in comparison to few-shot approaches to prompt response. From the talk I couldn't tell whether this approach continues in their current models; the talk also sought to identify emergent properties of the modern LLM paradigm. I wonder if there is a general expectation that with further iterations of LLMs, expanding context windows, and access to RAG resources as a form of memory, the lines between system 1 and 2 may become more blurred?

The scope of foundation model system architecture has traditionally been a tad outside the scope of these conference proceedings; mostly these papers focus on specific subdomains of neural network architecture, training algorithms, and so on. The interaction concerns that arise when such models are integrated into combined systems are usually the domain of commercial research labs and not likely to be channeled through these forums, due to competitiveness concerns and the like. Even as labs like Meta or Hugging Face release their open source resources, the industry appears to have aligned towards such open source components being limited to a specific module of the neural network model (at least that's my impression). This partly makes sense because when you start integrating additional components like RAG lookup capabilities or an external knowledge graph (like those offered by the Wolfram Language), you end up with cloud resource costs for accessing third-party services.

Stories from my life

Devi Parikh (Georgia Tech)

The speaker first shared some amusing stories from the early days of her research journey in the 2000s era. While she had neural networks in her playbook, all of her peers looked down on her work as they studied support vector machines and kernel methods. She noted that addressing missing data was a huge complicating factor even back then. Her version of a hyperparameter sweep was to have a set of PCs in a lab each training a single model, logging in to each PC to change a parameter setting one at a time. Her work even attempted to hand craft knowledge hierarchies, and it was surprising when a peer found a way to use pairwise comparisons to back into a meta representation. She kept at it all the way to modern foundation models and chatbots.

As an important insight, she noted that although getting a paper accepted to a major conference is kind of the visible success metric, the true latent variables that drive the most successful researchers are associated with advancing science, solving hard problems, and making a difference to things that matter. The conference circuit just represents a form of logistics for pursuing that.

As advice to aspiring researchers: you need to be careful about climbing arbitrary ladders and pursuing arbitrary research objectives as they present themselves to you. It also matters whether an objective will make you happy. (Although I would contend that even if it doesn't, there may be cases where some higher calling makes it worthwhile; you just have to know how to identify those edge cases.)

She talked about self-organization for productivity: calendars, to-do lists, things like that. She found that organizing by blocks of time was more effective than organizing through a list of tasks. It is easy to forget that some of the researchers benefiting from this talk are still in a university setting, without the self-organization habits that tend to develop from spending time in the workforce.

Note that Devi has a podcast called Humans of AI, which has been well received and offers informal conversations with prominent academic researchers. It is sort of a human interest reminder that there are still people behind all of this.

Public comment: "Hello Devi, appreciated your presentation and your podcast, which provided a great reminder of the human side of AI research and the different challenges we all may go through in different stages of a research career. This is somewhat of a small matter: you mentioned when discussing the early 2000s that "even back then NaNs were a challenge." I wanted to call to your attention, and that of others, that I conducted sort of a long term deep dive into a formally engineered method for missing data infill in the tabular modality (e.g. for people working with dataframes); if you might be interested, the writeup is available on arXiv as "Missing Data Infill with Automunge." Regards."

Machine Learning in Prescient Design’s Lab-in-the-Loop Antibody Design

Kyunghyun Cho (Genentech)

While the scope of the talk was associated with the speaker's focus on targeted protein design for medical applications, I found a few components of particular relevance to some of my recent interests in the combinatorial optimization domain, which appears to be one of the conventions leveraged for the application. One interesting takeaway was associated with some well-known conventions for combining multiple objectives into a common optimization objective.

The relevant terminology appears to be the "Pareto frontier," which describes the set of best attainable tradeoffs in a multi-objective optimization: plotted with one objective per axis, it is the curve separating attainable candidates from the unattainable region, consisting of those points for which no objective can be improved without degrading another. The speaker noted a few different conventions for how objectives may be aggregated into a common metric, such as the impact of an aggregation on the hyper-volume enclosed by the Pareto curve, various forms of scalarization, and methods that rely on a form of entropy search. (The speaker noted a few citations for each; I expect with a few keywords they could be tracked down.) The whole point of these forms of objective aggregation is that once we choose a common metric, the combinatorial optimization vastly simplifies to one that samples from a single objective.
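A minimal sketch of both ideas on toy data (my own illustration, not from the talk): identifying the nondominated candidates that make up a Pareto frontier, and collapsing two objectives into one via weighted-sum scalarization.

```python
import numpy as np

def pareto_mask(scores):
    """Mask of nondominated rows; assumes higher is better for every column."""
    mask = np.ones(len(scores), dtype=bool)
    for i in range(len(scores)):
        # i is dominated if some other point is >= everywhere and > somewhere
        dominated = np.all(scores >= scores[i], axis=1) & \
                    np.any(scores > scores[i], axis=1)
        mask[i] = not dominated.any()
    return mask

rng = np.random.default_rng(0)
scores = rng.random((200, 2))            # 200 candidates, 2 objectives

front = scores[pareto_mask(scores)]
print(len(front), "candidates on the Pareto frontier")

# weighted-sum scalarization: normalize each objective first (see the
# scale-sensitivity point below), then rank by a single combined metric
z = (scores - scores.min(0)) / (scores.max(0) - scores.min(0))
best = scores[np.argmax(z @ np.array([0.7, 0.3]))]
print("best candidate under scalarization:", best)
```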

Diving deeper into the weeds, an esoteric point related to practical matters for the hyper-volume form of objective aggregation is that many forms of performance metric balancing are highly scale-sensitive with respect to each metric. (I don't know whether that suggests that simple forms of normalization are sufficient or whether some tuned form of scaling is appropriate.) Other obstacles to combining objectives can arise when one objective is dependent on another, or when one objective is harder to measure than another.

In the domain of application discussed in the talk, as the space of proteins is quite large, the suitable proteins for an application are likely to be distant from each other. (I expect that creates more difficulty for variational optimization in comparison to quantum annealing conventions; as a suggestion for those open to considering a new cloud provider, there are some really neat resources available from D-Wave for quantum annealing forms of optimization, which should be quite capable for such use cases.)
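For the curious, annealing-style samplers consume problems framed as a QUBO / binary quadratic model. Here is a minimal subset-selection flavor of the idea (my own toy illustration using the open-source dimod package and its brute-force reference solver; on D-Wave this would route through one of their cloud samplers instead):

```python
import dimod

scores = {"a": 0.9, "b": 0.8, "c": 0.4}   # per-candidate utility
penalty = {("a", "b"): 1.0}               # a and b too similar to both keep

bqm = dimod.BinaryQuadraticModel("BINARY")
for name, s in scores.items():
    bqm.add_variable(name, -s)            # negated: samplers minimize energy
for (u, v), p in penalty.items():
    bqm.add_interaction(u, v, p)          # cost for selecting both u and v

# brute-force reference solver, fine at toy scale; a D-Wave Leap account
# would enable swapping in their hybrid samplers for large instances
best = dimod.ExactSolver().sample(bqm).first.sample
print(best)                               # {'a': 1, 'b': 0, 'c': 1}
```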

The speaker also noted that one way to evaluate regions of the protein space is by way of saliency maps (think of this term as analogous to a computer vision task which places a heat map over the areas our eyes are drawn to). He also demonstrated how the groupings of saliency masks could be enhanced by simple measures like injecting Gaussian noise.

  • Hold on a second: doesn't this slide show a really clean way to think about, for aggregated systems of neural networks, the merits of broad attempts at adversarially robust deployment in comparison to simple measures like integrating Gaussian noise into inference? Could we interpret that with sufficient noise channeled into inference, those of us without the budget for complex robustness mitigations wouldn't have to worry nearly as much about the risks of adversarial channels? (I speculate that it might :)
Excerpt from invited talk, Kyunghyun Cho
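A minimal sketch of the noise-injection recipe for saliency (my own illustration, essentially the SmoothGrad idea: average input gradients over several Gaussian-noised copies of the input to obtain a cleaner mask; the model here is a placeholder that returns logits):

```python
import torch

def smooth_saliency(model, x, target, n_samples=25, sigma=0.1):
    """Average |d logit / d input| over Gaussian-noised copies of x."""
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        # x itself should not require grad; each noisy copy is a fresh leaf
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        model(noisy)[..., target].sum().backward()
        grads += noisy.grad.abs()
    return grads / n_samples

# usage: heatmap = smooth_saliency(model, inputs, target=class_index)
```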

In the context of protein design, the application may use saliency maps to target regions for mutation. Using a centralized database of proteins/molecules(?) called designDB, they produce a set of candidate antibodies; those candidates are then scored with predictions for a bunch of different properties, and multi-objective optimization finds a subset to synthesize and evaluate in the lab. Every stage has semi-automated methods of evaluation. Their methods have demonstrated performance that appears capable of approaching the Pareto frontier, and the dream is to one day further integrate the entire drug discovery and testing pipeline into a learnable framing, conducting some form of aggregated learning through the entire chain.

Blogging Track

Sharing in closing a few sparse comments on some random selections from the blog submissions to the conference proceedings. After all, many of these bloggers are just as much domain experts as those snobs in the main proceedings; maybe you should hire a few :).

A New Alchemy: Language Model Development as a Subfield?

Colin Raffel (University of Toronto, Vector Institute) [link]

The premise of this essay was kind of gimmicky, but I found the content worthwhile in addressing those aspects of large language model development relevant to adaptation towards various applications in industry, which the blogger suggested could have the potential to grow into a whole subfield of research. Considerations include: what are our performance objectives (model performance, speed, memory, etc.)? Where are our channels for error (reasoning, out-of-domain learning, etc.)? And where are the algorithmic levers that can be pulled in these settings (e.g. convolutions for images, quantization, etc.)? The dialogue was fairly non-rigorous with obvious omissions, but it was clearly sufficiently informed by experience that it could serve as a helpful precursor for those who may wish to extend it to a more rigorous treatment, which could potentially take LLM development out of the domain of alchemy and into the domain of science. (Perhaps a task for the author himself?)

It is true that the field of deep learning has a long history of relying on non-rigorous methodologies, as training a neural network in the deep learning era is by definition an exercise in which computational cycles are often capable of as much or more influence on a model than design choices (see Sutton's bitter lesson premise, for instance). The point of this article, one I share, is that even with the exponential computational power that future generations of hardware may enable, the whole point of these research conferences is that the field will likely evolve to new paradigms not only through number crunching but also through leaps of insight and technology. This proposed elevation of LLM development to a whole research domain might be just what is needed to make such capabilities accessible alongside the bitter lesson.

Double Descent Demystified

Rylan Schaeffer (Stanford University), Zachary Robertson (Stanford University), Akhilan Boopathy (MIT), Mikail Khona (MIT), Kateryna Pistunova (Stanford University), Jason W. Rocks (Boston University), Ila R. Fiete (MIT), Andrey Gromov (UMD & Meta AI FAIR), Sanmi Koyejo (Stanford University)

To be honest it was disheartening to see Figure 11 in this paper without even a comment towards the blatant derivation from my paper Geometric Regularization [arXiv:2202.09276]. I blame [Henighan et al. 2023] from Anthropic for failing to cite my work as an obvious inspiration for their paper. Next you are going to tell me that Jack Clark doesn't write his own science fiction stories. I canceled my sponsorship of the Import AI newsletter in protest :).

Next Steps:

These virtual conferences are great but lacking in opportunities for conversation. If anyone from the ICLR crowd wishes to say hello, I am planning to attend the upcoming D-Wave Qubits conference to learn about all of the amazing real-world applications that may benefit from emerging conventions in quantum annealing, with the low latency and high variable bandwidth optimization enabled by their best-in-class hybrid quantum/classical samplers. After all, large language models are great but error prone, while these forms of optimization technologies are sufficiently tractable for the industrial and safety-critical domains much more common in commercial industry. Hope to see a few of you there.

For more essays please check out my Table of Contents, Book Recommendations and Music Recommendations.

Mrs John Henry :)

© Nicholas Teague 2024, all rights reserved


Writing for fun and because it helps me organize my thoughts. I also write software to prepare data for machine learning at automunge.com. Consistently unique.