Search with Non-generative-AI

Geoffrey Gordon Ashbrook
39 min read · Jun 9, 2024


https://github.com/lineality/arxiv_explorer_tools

Search, Sort, Match, Connect, Graph, & Transform with Non-generative models

A case study involving topic-sorting daily arxiv new-research-articles to highlight the importance of non-generative uses of AI models

2024.06.09,10,22 g.g.ashbrook

Code Here: https://github.com/lineality/arxiv_explorer_tools

Part 1: Overview

Part 2: More Details

Part 3: Discussion

Consider the problem of sorting through volumes of material so large that we simply do not have time to look at everything.

Two common examples of this may be

1. looking at job applications, where someone working on hiring wakes up to the task of picking a few applications to schedule interviews and finds 20,000 applications in the inbox, or

2. keeping up with current research in your field (a long-standing task in medicine), with two specific examples:

1. You are working on a medical/health task and you need to find relevant papers and relevant data in those papers (new and old).

2. You are trying to stay on top of sub-topics as they are published.

With the example of arxiv: https://arxiv.org/list/cs/new just for computer science there are usually 500 to 1500 new articles featured each day (usually on the higher end), which is too many even to read each title.

It is a constant problem that people are unaware of relevant arxiv papers for weeks or months (or forever) because there are literally too many to be aware of and this is doubly galling because they are hiding in plain sight. They are quite literally right in front of you…but you still can’t see them.

An additional problem, perhaps crucial, is that too often there is no simple ‘topic word’ by which these papers can be filtered.

For example, even general topics such as computer vision and images are not so easy to find with such a key-word-tag:

Today, a light day, there are only 856 featured new articles.

If we search directly for words or phrases relating to computer vision we find significantly different and non-overlapping sub-sets of articles:

- 192 matches for ‘computer vision’

- 276 matches for ‘vision’

- 363 matches for ‘image’

- 12 matches for ‘pixel’

- 0 matches for ‘pictures’

(Here ‘matches’ means occurrences found by using ‘find’ to search for that word or phrase on the web-page; a ‘match’ is just a word on the website, so one article-abstract may contain a dozen such matches.)

As just one example, one of the ‘pixel’-mentioning articles, https://arxiv.org/pdf/2406.02381, is focused on (bio-medical) applications, with no other references in the title, abstract, or subject keyword-tags to anything related to images or vision. So it is hard to guess whether the paper is about computer vision at all, even if you did take the time to read the title and abstract (it turns out, if you read the whole paper, it very much is about computer vision after all).

Some sort of improved convention of tags or keywords or sorting words could probably go a long way to help. Even trying to sort the documents into very general areas such as language-ai, image-ai, and time-series-ai is currently impossible based on simple key-word-tags, which could very easily be made an even loosely standardized convention. But even if we had such a tool (and we should), not all problems in search can be solved by a few key words.

In any event, there are many such problems often without easy solutions, which is part of what makes matrix-ai a particularly useful and capable tool.

Language concept vectors (an area where we are sadly perpetually without clear shared terminology) allow very ‘fuzzy’ searches. This is unlike the key-word-tag approach mentioned above, which would be used more directly for deterministic search and tabular (as in tables of data) organization: for example, “Mystery” as a label key-word, as in a “Mystery”-category of books with “Mystery”-category labels in a “Mystery”-category of a library search system on a “Mystery”-category shelf of books.

If you want to look for computer vision papers generally, or for example image-resolution-enhancement related articles more specifically, as the papers come in by the hundreds or thousands each day what can you do?

The ‘good old-fashioned AI’ or ‘GOFAI’ approach might be to have a human panel of experts spend weeks, months, or years crafting (and then updating and maintaining) a set of semantic abstract rules and/or cleaned training sets to puzzle out which terms tend to match which subjects. In some cases this can work well, but the fuzzier the situation the worse it works, and you probably do not have an institute of academics under your direction waiting for such assignments with years to spend on them (not to mention paying for all of that).

Statistical Learning or Machine Learning type AI, and other non-GOFAI AI, can bring many more tools to this challenge. After purely deterministic hand-crafted expert systems appeared not to be suitable for most practical use-cases, many other tools were developed from very roughly 1990 to 2010, giving people a broader tool set. From refurbished uses of good old Bayesian models, to decision trees such as XGBoost, to supervised logistic classification, to unsupervised learning, to reinforcement learning, to support-vector machines, to (evolutionary) genetic algorithms, there are many tools that make many tasks more manageable. I recommend “Natural Language Processing in Action” by Hobson Lane and others,

https://www.manning.com/books/natural-language-processing-in-action

(hopefully the 2nd edition will be good, the first edition is a classic) as a tour through some of the many many ways that language tasks can be done.

Looking for computer vision arxiv articles with such statistical-learning and non-neural-network machine-learning would more likely be a hobby project that would not be guaranteed to produce a portable, sustainable, usable solution. Though, if you had time, it might be very fun to collect years of data samples and statistics on pattern frequencies and spend a few months reading tool documentation; it would be educational for sure, but most people do not have time for that.

As a side note: for larger AI-applications, especially in NLP (natural language processing), the whole result is nearly always an ensemble of tools old and new, so knowing about ‘old AI’ is critical in real life for projects that you have more than 30 minutes to work on. Also, there are many edge cases where, among older models and methods, there are techniques that are still the most effective for that particular type of task. Only throwing the newest technology at every problem is not a sufficient long-term plan.

These pre-foundation-model technologies (such as statistical learning) have tended to be good for one very specific task for which they were designed. For example, make and train a system and tools to help you find one type of item in a stack of data, such as the classic data-study example of survivors out of all people on the Titanic. Predict one y given X-data. (And to be sure, there are and will be many uses for ‘do one thing well’ systems.)

But in our case our ‘one task’ is very fuzzy (perhaps not really one task at all). We need anyone to be able to put any description of any subject or sub-subject into ‘our tool’ and come out with a good-enough-filtered selection of research articles matching that material. This is certainly not training a model to find y given X.

So now, in our super-brief walk through the timeline of AI, we move into the language-concept vector section of sub-symbolic AI:

Not all neural-networks are new and big and fancy foundation-models that can handle a broad cross-section of higher level concepts. There are many simpler neural-network tools that fit well with the do-one-task-well narrow-ai category of tools mentioned above. Books written before the explosion of foundation-model success in ~2023 are full of confident analysis of deep learning neural networks and generative chatbots, 100% confident that the abilities of foundation models are completely impossible even in principle: oops.

From here we will be jumping further ahead to foundation-model type AI-models as vector-matrix-ai.

Part of what is exciting about artificial neural networks is that, in less than one day, one person can make a usable tool that anyone can use to search flexibly through arxiv (or perhaps any list of documents) for whatever subject you want (not just one fine-tuned narrow case), developed and working on normal off-the-shelf hardware and software, making it accessible to users and developers (though without a fancy user interface…probably no one will be interested in using it). But the point is…it is usable (if not desirable).

To clarify, this is not a single one-step solution to everything (even just within the context of search, as in article search here). That we can quickly build a useful if superficial fuzzy search is a big step forward, but try not to lose sight of the superficial aspect of this quick first approach. Getting more into the weeds and realizing more of the potential of meaning and concept matrix models quickly becomes much less simple.

At the same time, there is a kind of arms-race between ‘better models’ and ‘more cheap and easy gains’ vs. non-simple ways of improving, specializing, and productionizing. It may be a common theme in various phases of AI that very expensive projects pushing for deeper improvements are superseded by a much simpler end-to-end general solution. A possible example is the infamous Netflix recommendation challenge model, which took a huge amount of post-doctoral-level expertise, ended up not being productionizable anyway because the solution was so impractical, and then was replaced not long after by something much simpler and better. At the same time, another theme, I think (if in the minority), is that not all situations are well suited to extremely simple, general, all-in-one or end-to-end models (probably those terms do not always mean the same thing, but hopefully they are close enough here). I have tried to articulate some of these factors, such as ‘externalization.’

(see: https://en.wikipedia.org/wiki/Netflix_Prize )

The foundation-model vector-matrix AI we are now talking about is related to the popular ‘chat-gpt’ technologies of foundation models, but here we will not be using generative tools. Hence, the emphasis on the many practical non-generative uses of matrix-ai. This is one of the mysteries of the psychology and sociology around AI in 2023–2024: despite the laudable enthusiasm for generative models, non-generative technologies (often the same exact models, just used in a different way) are barely discussed at all (try finding even a single youtube video talking about using non-generative models for something, while many videos (thankfully) come out daily discussing generative models).

The difference in how we are using foundation models here, is that instead of looking at generated language or images or sound, we are looking directly at the matrix-vectors of the concepts in the models.

In a way this overcomes some of the ‘black box’ and variability factors that come with the (very useful) generative features of models. One way of looking at this may be that if you can phrase the question that you are asking an AI-model in a form that is either binary (like classification) or a spectrum or probability question, then you are able to get a much more granular and stable answer to that question. Instead of being the end, this concept vector is then the beginning of what more you can do with that output, often using hardware and software that is greatly more practical than with cloud-generative-chat tools.

If you are curious to look more into this, one place to look is the tool LMQL, which arguably highlights the potential of what you can do if you have access to the matrix-vectors of an AI model.

https://lmql.ai/

https://arxiv.org/abs/2212.06094

However, as of 2023–2024, I have found LMQL to be not yet a mature and workable tool for most production projects, arguably largely due to the fact that for most of 2023 there were few performant models that were not so proprietary that only people inside the walls of Google, OpenAI, and Anthropic had any access to the vectors. And also due to the real-world problem of basic ascii characters breaking the entire system (which is just not going to work).

But the potential is fabulous and I think LMQL still represents a good, if still academic, example of the doors and possibilities opened by dealing directly with the vectors of matrix-AI.

LLM2Vec and LMQL

https://arxiv.org/abs/2404.05961

https://mcgill-nlp.github.io/llm2vec/tutorial/

As another classic example, interestingly from as late as 2019 (two years after the transformer architecture was published, https://arxiv.org/pdf/1706.03762), is the king-queen word analogy process:

From Sutor et al 2019, “Metaconcepts: Isolating Context in Word Embeddings”

See: https://par.nsf.gov/servlets/purl/10108132

“Metaconcepts: Isolating Context in Word Embeddings” by Peter Sutor Jr., 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)

If you are chatting with a generative model you get a kind of randomized chain of related concepts, which many people hope will be pragmatically useful in many use-cases (though as of June 2024 this remains to be seen).

In this case we want to ask the model how close a given paper’s title and abstract can be measured to be from the cluster of concepts that we are looking for (such as image resolution in computer vision). But instead of the model generating words to express some commentary about this relationship, we can actually measure this ‘closeness’ or ‘distance’ directly (within the context of a given measurement) using a variety of conceptual-closeness measurements. This then becomes a quantifiable property in a measurable concept-space to which we can bring any STEM tools we can think to bring (currently distance-measures are the main tool).

The basic plan for this system:

1. pick an “embedding model” (a vector-making foundation-model) and get it working locally on your computer.

2. pick out some distance measures

3. calibrate your distance measures so that you know how close or far you want things to be.

4. decide on a threshold of how many and which measures you think constitute ‘enough’ of a connection

5. pick a goal-target-subject, such as “image resolution improvement in computer-vision”

6. get the page from arxiv listing today’s articles

7. find out the ‘conceptual location’ of your goal-target-subject, such as “image resolution improvement in computer-vision”

8. find out the ‘conceptual location’ of each paper description.

9. apply each of your distance measures to the two vectors

10. make an overall score for each article

11. make a list of the articles you want to keep

12. save those results in readable html format and data-friendly json or csv format (so it is computer-readable later)

Easy peasy!

In our case we will use the bge-large embedding model, in gguf model format so it can run on any computer’s cpu:

https://huggingface.co/BAAI/bge-large-en-v1.5

https://huggingface.co/CompendiumLabs/bge-large-en-v1.5-gguf
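As a minimal sketch of step 1, assuming you have downloaded one of the gguf files from the CompendiumLabs repo above and installed llama-cpp-python (the file name below is an assumption; use whichever quantization you pulled):

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# embedding=True runs llama.cpp in embedding mode rather than generation mode
model = Llama(
    model_path="./bge-large-en-v1.5-q4_k_m.gguf",  # assumed local file name
    embedding=True,
    verbose=False,
)

# .embed() returns the raw vector (a list of floats) for the input text
vector = model.embed("image resolution improvement in computer-vision")
print(len(vector))  # bge-large-en-v1.5 produces 1024-dimensional vectors
```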

For distance measures, the most famous and commonly discussed are often: cosine similarity, dot product, euclidean distance, and manhattan distance.

We will use these fourteen distance measures:

list of vector distance metrics:

1. cosine_similarity_distance

2. correlation_distance_dissimilarity_measure

3. pearson_correlation

4. canberra_distance

5. euclidean_distance

6. manhattan_distance

7. minkowski_distance

8. squared_euclidean_distance_dissimilarity_measure

9. chebyshev_distance

10. kendalls_rank_correlation

11. bray_curtis_distance_dissimilarity

12. normalized_dot_product

13. spearmans_rank_correlation

14. total_variation_distance_dissimilarity_measure
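As a hedged sketch (my naming, not necessarily the repo’s implementations), most of these map directly onto standard scipy and numpy calls:

```python
import numpy as np
from scipy import stats
from scipy.spatial import distance

def _as_dist(x):
    """Shift/scale a vector into a probability distribution (one convention)."""
    x = x - x.min()
    return x / x.sum()

def raw_measures(u, v):
    """Raw (un-normalized) versions of the fourteen measures for two vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return {
        "cosine_similarity": 1.0 - distance.cosine(u, v),
        "correlation_distance": distance.correlation(u, v),
        "pearson_correlation": stats.pearsonr(u, v)[0],
        "canberra_distance": distance.canberra(u, v),
        "euclidean_distance": distance.euclidean(u, v),
        "manhattan_distance": distance.cityblock(u, v),
        "minkowski_distance": distance.minkowski(u, v, p=3),  # p=3 is an arbitrary choice
        "squared_euclidean": distance.sqeuclidean(u, v),
        "chebyshev_distance": distance.chebyshev(u, v),
        "kendalls_rank_correlation": stats.kendalltau(u, v)[0],
        "bray_curtis_distance": distance.braycurtis(u, v),
        "normalized_dot_product": float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))),
        "spearmans_rank_correlation": stats.spearmanr(u, v)[0],
        # total variation assumes the vectors can be treated as distributions;
        # shifting them to be non-negative first is one possible convention
        "total_variation_distance": 0.5 * float(np.abs(_as_dist(u) - _as_dist(v)).sum()),
    }
```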

To start with I picked one threshold, but you might also want to pick a strict and loose threshold. Ideally you may want at least three:

A. general binary threshold

B. strict binary threshold

C. loose binary threshold

As a note, I am very much not a fan of having any package dependencies in a project (e.g. ‘let test models’ can run with zero). Most of the dependencies in this case come from, A: jupyter notebook, which is not needed to run the software but is useful for development and experimentation, and mostly B: llama.cpp is (very sadly) not out-of-the-box compatible with embedding models, so we are using an insanely bloated set of python-llama.cpp packages (not ideal at all, but it works for now).

A mature version of this should be possible with zero dependencies, which might involve making some linear algebra functions from scratch to avoid needing scipy and sklearn, but that’s all fun. An even more production-oriented version of this may be a rust server with no python at all.
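For a sense of how small that from-scratch core could be, here is a minimal zero-dependency sketch (standard library only) of the kind of linear algebra functions involved:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5
```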

On the other hand, this is also a great research area exploring different uses of vector-space, for which ‘bloat’ is not a problem as the goal there is flexibility not production-performance, security, etc.

For better or worse, this tool is currently more of a research-mode tool, but it can get the job done. Version-6 is also included in the repo, as a study-tool to quickly and easily compare the raw-vectors of compared language inputs.

Metric Calibration Methodology:

1. collect equations and algorithms for distance metrics

2. put those into functions (in this case, python)

3. normalize each metric (0–1)

4. allow the function to output a raw score or a binary score based on a threshold (e.g. >= 0.5 is 1; < 0.5 is 0)

5. test empirically

Use a variety of examples of:

- identical,

- clear yes,

- ambiguous, and

- clear no

cases of meaning-similarity, with significantly different diction (words used) and grammar.

- For each metric, observe and record the score range for each category of case.

Note: even scores that officially measure dissimilarity are normalized so that a float 0 means dissimilar and a float 1 means identical.

6. use empirically observed ranges to set the threshold for each function

7. test and adjust the calibration with as many examples as you have time for, including real examples from the data you will be using in your use-case; in this case: arxiv article abstract-blurbs with titles. (Whether to include the subject list can be argued either way; it makes a more interesting test without it. Here it is excluded as ‘time leakage’ information: you would not be classifying the original thing itself, but rather the final classification label someone else invented, which may not even be accurate or may be significantly incomplete, e.g. biomedical research that is computer vision but is categorized as medical, not pure-vision research. As with data science in linked-in resumes, one study may include more than 50 or 100 technical sub-domains, which is perhaps uncharacteristically diverse, and the custom is to use just one or two headlines as a proxy, which for technical content specifically is significantly incomplete; yet by custom people are unprepared for including full technical descriptions with so many parts.)

8. at least attempt to get peer review of your results and reproducibility

9. use in real projects and monitor success or failure to inform improvements in the system
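A minimal sketch of steps 3, 4, and 6: normalize a raw metric into 0–1 using the empirically observed range, then emit a binary score against a per-metric threshold (the example numbers are placeholders, not calibrated values):

```python
def normalize_score(raw, observed_min, observed_max, higher_is_closer=True):
    """Map an empirically observed raw score range onto [0, 1], 1.0 = identical."""
    span = observed_max - observed_min
    scaled = (raw - observed_min) / span if span else 0.0
    scaled = min(max(scaled, 0.0), 1.0)  # clamp to [0, 1]
    return scaled if higher_is_closer else 1.0 - scaled

def binary_score(raw, observed_min, observed_max, threshold=0.5, higher_is_closer=True):
    """1 if the normalized score clears the calibrated threshold, else 0."""
    return int(normalize_score(raw, observed_min, observed_max, higher_is_closer) >= threshold)

# e.g. a raw euclidean distance of 9.0, where calibration observed distances
# running from 4.0 (near-identical pairs) to 20.0 (clearly unrelated pairs):
print(binary_score(9.0, 4.0, 20.0, higher_is_closer=False))  # 1 (a match)
```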

Description/Specifications:

- Speed: 100 items per minute or 8 minutes for 800 items

- python

- python venv

- script or jupyter notebook

- hugging face gguf models

- python-llama-cpp (not llama.cpp)

- 14 precalibrated distance measures in binary output functions (strict or loose settings as a future addition)

- current input: scraped website with iterable parts (any other iterable list of text/strings can be easily swapped in)

- currently the 14 measures are not “weighted” relative to each other

- an overall ‘how many successes for yes’ triggers final acceptance
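A minimal sketch of that final acceptance rule (the vote count required is an assumption; tune it to your calibration):

```python
def accept_article(binary_results, min_yes_votes=10):
    """binary_results: the 14 unweighted 0/1 outputs, one per distance measure."""
    return sum(binary_results) >= min_yes_votes

votes = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1]  # eleven of fourteen said yes
print(accept_article(votes))  # True
```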

Considerations:

- GI-GO: garbage in garbage out: If your problem is noisy inputs (e.g. if your match-to-this item is not really what you want to match to), this tool (or perhaps any tool) cannot telepathically change the inputs to something else you would or should have preferred to put in, as opposed to what you actually put in.

Vector-Space Greenfield

The term ‘Greenfield’ is sometimes used to describe a technology project that is building something new (as opposed to maintaining or adapting pre-existing tools).

In this sense using vector space for file management is largely a greenfield area. A few distance and vector-database search methods are routinely used, but overall the field is largely unexplored with a large number of goals and approaches to solutions.

- You can match to a term.

- You can match to empirical targets or examples such as one or more sample papers (and or summaries or metadata from those).

- As in the king-queen example, you can match to an analogy of an empirical target.

- You can use suites and regimes of profiling diagnostics with or without a specific topic.

profiling:

You also do not have to sort simply by subject; you can evaluate these papers in terms of other “concepts” that might make you want to read or avoid the paper:

- professionalism

- sloppiness

- clarity

- obtuseness

- rudeness

- pushy salesmanship

- extremist propaganda

etc.

Transformation

Just as images can be altered by vector shifting, so can text. Your target sample may have some properties you want, but others you do not want and which muddy the search. You can vector-shift-transform your target to get a cleaner match for what you are looking for.
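A minimal sketch of such a vector-shift, in the spirit of the king-queen arithmetic (the embed() calls in the comment refer to whatever wrapper you have around your embedding model, and the example subjects are hypothetical):

```python
import numpy as np

def shift_target(target_vec, remove_vec, add_vec):
    """Subtract an unwanted property's vector, add a wanted one, re-normalize."""
    shifted = np.asarray(target_vec) - np.asarray(remove_vec) + np.asarray(add_vec)
    return shifted / np.linalg.norm(shifted)  # unit length for cosine comparisons

# e.g. keep the computer-vision content of a sample paper but shift it away
# from 'medical imaging' and toward 'satellite imagery':
# clean_target = shift_target(embed(sample_paper_text),
#                             embed("medical imaging"),
#                             embed("satellite imagery"))
```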

So far we have been looking only at the raw text (if somewhat buried in HTML) about an article, which is of a more or less consistent size and format.

Taking the next step of comparing each full article opens up many more puzzles and challenges, and likely opens the door to the parallel topic of document processing and automated standardized meta-data extraction/construction, which are likely requisite steps within such an overall deeper full-media-handling system. (And this is just ‘text,’ though a table of data is arguably not serial text depending on formatting; multimedia and multimodal tasks open yet more levels of challenge.)

‘closest’ vs. close-enough

Another area where having consistent numerical measures can be useful is with relative closeness vs. objective closeness. Let’s say your boss wants you to find the closest few items out of hugely too many items to individually examine. To stick with our classic examples, this could be 20,000 articles, or 20,000 enrollment applicants, or 20,000 job applicants, or 20,000 RFP bid proposals, etc.

Two importantly different situations are:

- A. you have many good options and you are ‘spoiled for choice,’ where any pick from those is fine. Vs.

- B. nothing is close enough to be useful at all

If you simply ask a generative model to ‘recommend the best’ you have, aside from no repeatability, no way to navigate around the above two scenarios.

But with direct use of vectors, you have a hard, repeatable, numerical measurement of distance that you can explore more deeply however you like.

Note: The quality of that numerical measure still depends on how good and how good-for-task your model is, but having a measurable distance is at least something to work with.

Comparative Vector Analysis

Another category of comparison is to make a dataframe/database of the embedded values of all the comparison items and perform other tasks:

A. Which is closest to your description?

B. How do they sort into groups being similar to each-other?
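A minimal sketch of both questions, assuming vectors is an (N, d) numpy array of the embedded items and query is the embedded description (the cluster count is an arbitrary assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def closest_indices(query, vectors, top_k=5):
    """A: which items are closest to your description (by cosine similarity)?"""
    q = query / np.linalg.norm(query)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.argsort(m @ q)[::-1][:top_k]

def group_items(vectors, n_groups=4):
    """B: how do the items sort into groups similar to each other?"""
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(vectors)
```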

virtual super-documents

and chain-of-research RAG

Part 3: Discussion

“Noise” and Measurement

In Daniel Kahneman’s final book, “Noise,” written with Olivier Sibony and Cass R. Sunstein and published in 2021, Kahneman describes a number of problems and scenarios which lie at the fascinating intersection of real-world decision making in many fields of life, with a focus on (the book’s title) ‘noise.’

While the whole book is too much to go into in what is already a too-long article here, I would suggest that the repeatability and measurability of vectors may relate to both analysis and solutions relating to practical problems in this book. Also, perhaps shortly before he passed away Daniel Kahneman recorded an audio interview with The Economist (perhaps sadly hard to obtain now) in which he described his optimism for machine learning and AI to help guide human decision making.

The book addresses the challenge of fitting consistent numerical measures to fuzzy judgements: both the underestimated difficulty of doing this and the (if selectively) unperceived failure to do it, which results in professional evaluations that are wildly variable (as in 90% different on the surface, not just a statistically significant 0.001 difference; completely different judgements that culturally we pretend are stable and the same, resulting in huge losses for companies, massive disparities in legal systems, etc.). It outlines both within-person judgement (which may be analogous to internal-data in project-object-relationship terms) and across-people judgements (which may relate to ‘externalization’ of project STEM data as discussed in terms of project object relationship space studies).

https://www.amazon.com/Noise-Human-Judgment-Daniel-Kahneman-ebook/dp/B08KQ2FKBX/

As a note, Daniel Kahneman passed away in 2024 at age 90. We are all fortunate that he continued his work and diligence throughout his life to improve our understanding of general and human decision making.

Terminology and Clarity of Communication

Think in Matrix-Space:

I try to avoid the terms ‘embedding’ and ‘embedding model,’ as in many ways ‘embedding’ is a completely unnecessary new word that is unclear and misleading, and that functions counterproductively to make what should be clear, concrete, accessible, useful, and practical appear to be far-away, unclear, mysterious, elite, mystical, unattainable, and unnecessary. Making up obscure words to cover up something that should be transparent makes it almost unavoidable that the meaning becomes fictionally reified, so that people feel there is a mysterious unattainable something accessible only to an elite priest-class. History should be enough to show us that this kind of non-democratic, non-meritocratic tendency is the wrong direction to go in, regardless of how tempting and attractive such a deleterious set of habits is.

There is no mysterious priesthood of ‘embeddingness’ where high-priests understand more than lay-people. Vectors existed before and after 2023, and before 2023 all the high priests said vectors were useless for what they became very clearly useful for in 2023.

A. no one understands an epistemic ‘how’ to ‘explain’ the effectiveness or issues

B. Everyone CAN understand how to use these tools.

By bringing language and understanding concepts into a STEM-numerical “matrix” of vectors, you can then do anything in the whole toolkit of STEM that you can think to do with those vectors. This is a new frontier that we have been on the edge of for many years, yet we did not even know it was there (however metaphorically prophetic ‘The Matrix’ was in retrospect). Now everybody can be an Edmund Hillary, a Mark Watney, an Ada Lovelace, a Dora the Explorer, whoever that first person was to stand on the shore of Australia perhaps 40k years ago, and go into these Matrix-Worlds and find things no one knew were there AND bring them back for practical or artistic use.

Embracing Vectors in Matrix-Spaces

(Note: This paper is part of a larger collection of papers advocating for some approaches to describing what is going on with these new spaces and phenomena we are discovering, yet are not finding easy success at communicating about or planning for.)

Direct use of matrix-vectors by people or institutions on their own data can also nearly or perhaps completely avoid some of the real or perceived risks of a rogue dystopian autonomous ‘AI’-something acting malevolently against you, for its own agenda or some other attacker’s agenda, rather than just helping you with the productivity-task that you are trying to do.

This may be not only a kind of best-of-both-worlds, where you get the helpful advantages of STEM tools without giving up any personal or institutional agenda/direction/values etc., but also an example of how the strong tendency to frame ‘AI’ as being somehow absolutely separate from nature, humanity, biology, planet earth, etc., is a strange and largely delusional obsession that does not match a much more integrated reality.

For example, if you want to search or sort or work with personally or institutionally private information, there is a vast spectrum of matrix-vector, graph, and data-structure operation-functions that you can use, with no ‘ai-in-the-middle’ in the form of an autonomous agent whom you simply have to ‘trust’ to share your agenda.

And this may also give evidence that we are over-anthropomorphizing inherently-stateless generative uses of these same meaning-vector matrices. By compulsively hiding everything behind a clickbait-popular ‘Gen AI!’ wrapper, we may be perpetuating or exaggerating a self-entertainment illusion because we enjoy the theater and stage-craft of the drama, not because the fun-illusion matches the facts of reality.

Models have Trade-Offs

Because many research-paper-topic subject-areas are disciplines or sub-disciplines themselves over time, it should be possible to make high quality vector profiles of those specific topics, perhaps like refining the wording of a prompt.

Also, different vector producing foundation models (aka ‘embedding models’) will be better or worse at tasks that a given model is well adapted to, just like people prefer this or that generative foundation model for their use case or language or medium. So if this project is close to a system you want to use, try out a variety of ‘embedding-models’ (models that output vectors) and compare their performance systematically to find a model that works well enough for your use-case.

Future Tasks & Directions:

As a segue between what the current code does, what has been discussed, and a larger set of topics: think about what has been done so far in measuring the appropriateness of a tag or label for each document in a set.

1. We are measuring the appropriateness of a tag or label for each document in a set.

2. We are, if in a narrow way, generating metadata for those documents as a form of document processing, which is a crucial step (metadata generation/extraction is this crucial step) for:

- making structured training data

- making structured testing data

- making meta-data for a vector-database system

- making meta-data for RAG on a vector-database system

- making meta-data to structure a graph database system

- making meta-data to interface vector and non-vector database systems

Especially if you store and compare the vector-values of each item to each other, you are effectively making a RAG or Retrieval Augmented Generation system…but there is no AI-generation. In a sense, you are taking up the generation role, so this would still be RAG, but not generative-AI-RAG; rather: Human-RAG, powered by the same matrix-space that powers AI-RAG.

Again, how STEM tools work is not isolated from reality and ultra-alien in some sci-fi-horror strange-iverse; things are very interconnected and often in mundane every-day ways.

Using distance relationships to automatically construct a graph database…
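One minimal sketch of this idea: treat each article as a node and add an edge whenever a pair’s cosine similarity clears a threshold (the 0.8 cutoff is an arbitrary, uncalibrated assumption):

```python
import numpy as np

def similarity_graph(vectors, threshold=0.8):
    """vectors: (N, d) array; returns adjacency as {node_index: set(neighbors)}."""
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = m @ m.T  # all pairwise cosine similarities at once
    graph = {i: set() for i in range(len(m))}
    for i in range(len(m)):
        for j in range(i + 1, len(m)):
            if sims[i, j] >= threshold:
                graph[i].add(j)
                graph[j].add(i)
    return graph
```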

matching to examples:

matching with vectors of articles you have found interesting…

Local vs. Cloud:

- cloud services are not designed for a large number of small queries

- cloud services often have no direct-vector options at all

- cloud services often throttle users who attempt more than ~20 queries per day (not enough to do the tasks we are talking about here)

- cloud services notoriously visibly or invisibly degrade the services of some people to keep the whole platform running across peaks of traffic and use: you might need to know if parts of your results were influenced by such fuzzed-out data resolution.

- As in the standard 96%-deep-web vs. 4%-clear-web rough division, most data is not safe to send openly into the internet or to share access to with any other party.

- The output of a generative cloud model may be on the more consistent end of the spectrum (as with Claude 3) or may be wildly erratic (as with GPT-4).

- There may be no way of knowing if a cloud service has made changes to the model and embedder; indeed the whole point of a cloud service may be the assumption that they are continually improving their tokenization and models.

Scale and generative models: Generative models often have increasing difficulty as the scale of the same problem increases, which may underline that the underlying process is (as Douglas Hofstadter predicted in the 1970s) (to use Kahneman & Tversky’s terminology) a system-1 intuitive-guesstimation, not a systematic externalized deliberate system-2 calculation.

When asking a generative model for a recommended selection out of 2,000 or 20,000 items (perhaps unlike needle-in-a-haystack questions), it is probably very unlikely that the virtual-state-memory of guesstimation will accurately maintain resolution as the number of items scales (as opposed to a modular or iterative system-2 process, which is indifferent to how many times it is repeated). In other words, some percentage of what you ask a generative model to look at will simply be ignored or completely mixed up (which may or may not be in predictable regions of the input-window). An important topic here may be ‘good enough’ and the advantages of end-to-end systems. Likely the granularity of guesstimation will improve, and if you do not care about the level of noise and loss, it may indeed be pragmatic to go ahead with the same end-to-end generative model that you use for many general tasks (rather than putting time into making some other system). On the other hand, if you do need or want the various advantages that come with the direct use of vectors, then repeatability, quantifiable output, and no effect of scale on reliability are likely important in some cases.

Parallel-Search:

Depending on your hardware, you should be able to perform batches of vector calculations in parallel (which may also get into cpu vs. gpu tradeoffs). The 1-minute-per-hundred cpu iterations seen here is entirely non-parallel and non-concurrent.

Compared with waiting seconds or minutes for a generative model to generate, you might be able to do not just one query in a fraction of a second, but hundreds or thousands (or more) in the same sub-second time.
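A minimal sketch of why: with the vectors stored as one matrix, a single matrix-vector product scores a query against every stored item at once (random vectors stand in for real embeddings here):

```python
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 1024))  # stand-in for 10,000 article vectors
stored /= np.linalg.norm(stored, axis=1, keepdims=True)

query = rng.normal(size=1024)
query /= np.linalg.norm(query)

scores = stored @ query                 # 10,000 cosine similarities in one shot
top_20 = np.argsort(scores)[::-1][:20]  # best matches, typically sub-second on cpu
```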

Single Pass vs. Multi-Pass:

Multi-pass systems, granularity and the curse of word clouds:

- Single Pass vs. Multi-Pass

- summary vs. deeper look at text

- granularity and distance

While on the surface there may not seem to be a similarity or parallel between multi-step RAG, which seeks to extract and retrieve information that cannot be matched by one single chunk based on one single vector (based on vectors as well; again, the retrieval part of RAG uses this same direct-vector system), and the vector-based analytics and decision making we are doing here, there may be parallels as well.

As in the case of the infamous word-clouds, very, very little information of any usefulness remains after turning a document or body of documents into a (however fun-looking) word-cloud. If this were otherwise, then many NLP (natural language processing) problems and achievements would have been met many years earlier with much simpler technology. Concept-vectors, as in deep-learning foundation models, on the other hand contain much more useful information.

But think about this:

We have 2,000 or however many articles we are comparing. Actually, more specifically, we have a small chunk of meta-data about each article: the articles themselves are never seen by our system. (And while in 2024 arxiv may be experimenting with an HTML view of some of the papers, the papers themselves are not per se part of the HTML web system. It may seem like an obtuse formality, but a link to a PDF is in many ways a brick wall that is utterly opaque and useless. Hopefully this changes in future, if only for arxiv, where presumably they get to decide on the formatting of the papers they accept and post: printer-formats like PDF, which may not contain any information other than pixel locations, may be a forever-obstacle to text and data reading.)

What would change if we could see the full articles?

While matrix-vectors can be more information-dense than word-clouds (thank goodness), there may be tautological and ongoing resolution and depth questions when it comes to comparing the vector for an entire object (here a paper and assorted descriptive information about it) vs. the vectors of structured or unstructured chunks of that whole object, which opens the door to many of the tangentially recurring themes in this paper: meta-data, document processing, breaking down information, and reassembling and building back up information.

For example:

Note:

- “blind” = cutting words and sentences in half

- “overlapping” = including the same text as another chunk usually at the beginning and end

If you make a vector (the same size matrix of numbers) for a whole article, and then do the same for

1. each section

2. each paragraph

3. blind non-overlapping chunks

4. blind overlapping chunks

5. non-blind overlapping chunks

6. smart-chunks with meta-data about the whole article (and overlapping)

7. each sentence

8. each word

9. each character

10. each binary representation of each character
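As a minimal sketch of options 3 and 4 above (the window and overlap sizes are arbitrary assumptions):

```python
def blind_chunks(text, width=500, overlap=0):
    """Fixed-width character chunks that cut words and sentences in half
    ('blind'); overlap > 0 repeats the tail of each chunk in the next one."""
    step = width - overlap
    return [text[i:i + width] for i in range(0, max(len(text) - overlap, 1), step)]

article_text = "lorem ipsum " * 500  # stand-in for a full article's text
non_overlapping = blind_chunks(article_text, width=500, overlap=0)
overlapping = blind_chunks(article_text, width=500, overlap=100)
```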

Would all the vectors be the same? Definitely not.

Would the vector of the summary be the same as all the other vectors? (No; ~none of the ~useful vectors are the same.)

Note: there is the recurring extreme edge case of a “blank” document; in that case the vectors would be the same or similar (with details depending on what exactly ‘blank’ means: spaces, zeros, one space, nothing at all, etc.).

Does going ‘down’ in depth and granularity always give you a ‘better’ vector, or some vector with (more or otherwise) useful information? No; once we have gone below the word level, the vectors are basically useless. The vector for ‘a’, or the hex unicode 0061, or each part of that, gives you no useful information about the whole paper.

(Possibly a vector database of each word’s vector might be used in some forms of analysis, but probably not useful for most every-day cases…probably.) This also raises the question of tokens and tokenizers. Unless you are using a sub-semantic tokenizer that uses letters or unicode-hex, or sub-hex, deviating from the tokens that the model uses will deviate from the vector-space of that model as it relates to that article.

So we have a kind of gradient or spectrum of vectors for a given ‘object’ (and in real life there may not be discrete objects at all, as with any continuous time series), and there is no absolute, single correct way to get one true type or depth of vector for, let’s say, one article.

While it probably seemed at first like there was no likely connection at all between the simple one-to-one distance scoring we have done here and multi-step retrieval (as in RAG), we now seem to have found ourselves in the middle of a very similar muddle of quandaries: wholes and parts, how best to make connections between separate things, how to break things down and build them up again, and how to create and use comparable meta-data about a given thing. Our familiar and transparent kiddie-pool suddenly dropped off into Loch Ness, or the seas of Enceladus.

We do not even have any idea how we should be vectorizing the whole article at all.

Multi-Pass

There are probably various kinds of, and ways of defining, multi-pass systems. This is not yet like multi-pass RAG, where we base the next query on what we found, but we are using and comparing multiple distance metrics already.

One dimension deeper: not only may there be more than one ‘article’ that you are evaluating; everything we found about the not-entirely-singular identity of the article may also apply to the definition of the target or the terms of analysis. There may be two or more slightly (or very) different renditions of the target being looked for, and systems in real life will likely need to find stable, transparent, and not-noisy ways of dealing with this type of situation.

Use-case Adaptations:

The list of items here was scraped from a daily science article web page, but for other use-cases your list will be something else from somewhere else, with perhaps some other preprocessing step:

- a csv file

- a jsonl file

- a list of txt/doc/pdf files

- a folder of image or sound files

Bottom-Up, Top-Down, and Vector Relationships:

We are not (I conjecture) entirely sure how meaning is contained or reflected in matrices and vectors (or how meaning moves through such a (hot or cold) medium).

For example, when we are doing a classic ‘distance’ calculation we are doing a kind of top-down meaning-vector operation. We make one vector for some scale chunk of material (again, with all the questions about how best to do that). And based on the context of this approach there are all kind of interesting questions that we approach or phrase in a given way:

- How much information is contained in a single vector?

- As models get bigger, or denser, how much more meaning can be packed in?

- What kinds of meaning are contained in a vector, and what kinds of meaning are not?

- In theory is it possible to entirely reproduce either a document or the corpus of documents from which it comes from a sufficiently dense vector (perhaps in the spirit of Laplace’s daemon)?

- How do questions such as content and reconstruction change if two vectors are being used instead of one?

- Do vectors behave according to the same statistical rules as samples from a population?

- etc.

Now let’s look at things slightly differently.

Remember J. R. Firth’s echoing foundational insight from earlier in Natural Language Processing history:

“You shall know a word by the company it keeps” (Firth, J. R. 1957:11)

- Firth, J. R. (1957). Studies in Linguistic Analysis. Wiley-Blackwell.

- https://en.wikipedia.org/wiki/John_Rupert_Firth

- https://cs.brown.edu/courses/csci2952d/readings/lecture1-firth.pdf

When we are using, for example, the generative properties (again, often of the same meaning-matrix models), we are not putting one single macro vector (top-down) into our project-thread of processes. Rather, generation uses a more bottom-up approach. Interestingly, even though we had previously dismissed looking at the vectors of single tokens as not in and of themselves containing any information, generation (bottom-up) does (somewhat) just that: putting in one token at a time. But this is not to analyze the properties of each token; rather, this is to take a leaf from J. R. Firth’s book: “You shall know a word by the company it keeps.” We are using the relationships between the tokens to get information out of the model, perhaps mirroring how information was put into the model in the first place: encoding the relationships between the tokens that went in when training the model. (Again: linear relationship threads of the relationships words and parts of words have, not vectors of whole documents all at once.)

Document Processing:

This bottom-up use of matrices, outside of generation (which of course is most popular), is perhaps under-explored, under-utilized, and under-noticed. Could part of the puzzle of how to map the content of a document at each scale, from token to whole-document to whole-corpus, lie in these properties of relationships between vectors, or the vectors of the relationships between tokens? (And massive apologies for what portions of this I have gotten wrong.)

The goal is still to start with an unprocessed unstructured corpus or document and end up with a useful set of maps, ideally with the properties of various kinds of databases:

- relational

- non-relational

- graph

- vector

(and likely many other categories as well)

Note: the term ‘database’ is more general than some may suspect: any collection of information.

Cooking-Recipe Example

Let’s try going into some depth with another gedankenexperiment (or thought-experiment): cooking recipes!

For now this is a thought-experiment, but as a followup we could do real tests to see what happens.

Our task is to process recipe documents so that we can do specific searches across the recipes later on.

We will be processing any and every recipe that comes in, but we will be searching for specific multi-course seafood recipe information.

- In what order is the preparation, cooking, and serving done?

- What kinds of fish and dairy are a part of what preparation?

- What kinds of fish and dairy are a part of what final dishes?

- What recipes do not use or contain certain types of fish and dairy?

- What are the processes for selecting live mussels?

- What are the processes for cleaning live mussels?

- What other recipes can be used or adapted for using leftovers of today’s primary recipe the next day?

- What maximum cooking temperature is required?

- What steps and resources are shared or not-shared between/across courses of the meal?

- How many range-top units are needed when most are used at the same time?

- What recipes can be used with different kinds of fish when and where availability changes unpredictably?

Here are some basic overall questions we can ask about our tools:

- Will a single vector of a recipe overall contain any granular detail of the steps, sequences, ingredients, exclusions, mins-maxes, etc?

- For which questions will we want to use what kind of database?

- For a given kind of database, how can we reliably extract that information?

As another section for this thought experiment to highlight some of the system needs, imagine you fed each recipe into a standard chunk-RAG system, so you would search for the particular paragraph that most closely matched the one vector of your search-query.

Could such a basic RAG work well enough with a basic planning question such as: “I’m looking for a recipe where clam juice is not used in more than two of the five seafood courses”? Occasionally you might be lucky and find an exact match for that sentence: “This is a recipe where clam juice is not used in more than two of the five seafood courses!” But this also highlights the meta-data conundrum: there are perhaps recombinantly infinite numbers of such internal and external comparison questions, such that trying to anticipate them all in order to make the answers explicit in directly searchable metadata would be infeasible.

An analogy may be to compare these recipes to, or to look at them as, structured relational databases (e.g. you would extract the information from each recipe and create relational database tables populated with those data). Researchers at a company will work year after year making new queries and joins and engineered-features and doing endless analysis on those static tables: there is no way to pre-do every possible data science and data-analytics operation that could possibly be done on a given set of relational tables.

And that is assuming you can put this/these unstructured data into a single set of relational tables, or one single graph, etc.

There are often dozens (or thousands) of different ways of saying similar things that are spread across the recipe documents you are processing. How often is there ‘one single root-stem way’ that you can embody such a target in a traditional database?

Questions like this are part of the amazing usefulness of meaning-matrices and their vectors. But as we can see from our thought experiments here: even though vectors are a helpful tool that you can use to map and search seafood recipes, it is not immediately clear how that tool should be used beyond the most simplistic questions such as uncooked-dishes (like a salad) vs. cooked (like broiled crab). Cooked vs. uncooked might have been one of those distinctions that would be virtually impossible with GOFAI word-analysis, but becomes instantly effortless with foundation model matrix-vectors. But not all of our document processing needs are such low-hanging fruit.

And how far along is humanity with answering basic questions like this: How should we process recipe documents? E.g. Between top-down vector use, bottom-up vector use, and how should we describe our goal of a vector, non-vector, relational, non-relational, graph, etc. hybrid databases and data-structures?

Uses of Guesstimation-Generation vs. Direct Vector Use:

An interesting example of real-world choices may be the choice of whether to do a batch of guesstimated-generated metrics vs. pseudo-non-generative uses of generative models (binary and quantified classification).

Not a desirable work-around: just do the math with the vectors.

Future Needs: Bridging Fuzzy Vectors and Discrete Graphs

Here are some of the areas related to this use of foundation-model matrix vectors:

- noise reduction

- noise analysis

- document processing

- content extraction

- metadata generation/extraction

- automated structuring of unstructured data

- vector and non-vector database integration

- data retrieval

- data privacy

- agenda-free applications of AI

- measurability and repeatability

- on-premise production data science

- project-state definition

The Vector & The Graph:

The next millennium of science and culture may be shaped by the footprint of this archaic romance.

- Distance to graph vs. wholes and parts

Metadata Vectors and Databases

Gestalts: Parts of whole

However discrete or dynamic, whenever there is a whole of something in an externalized project context there may need to be ‘meta-data’ for describing and navigating that whole.

- summary

- topics

- contexts of use

vector and graph profiling of documents:

- overlapping chunking

- a graph of vectors?

- store which chunks are connected to which other chunks…

- vector clustering

- overlapping chunks

Automating document processing and meta-data extraction using vector-metrics:

Context: Distance in Project Space vs. ‘Stateless’ Distance

Many people are attracted to the talking-black-box model where we ask a question and the AI gives a mysterious answer. We can accept or reject that answer, but the answer is there ready to use. With vector-producing models, the process is a bit more hands on.

What does it mean to ‘ask a question’ of the matrix-space itself, as opposed to ‘asking’ a generative model?

- puzzle piece scenarios


Generative vs. raw-vectors

What happens if you ask question of or with or using a matrix-space and vector-producing model, as opposed to a generative model?

You can put in the same input.

The output is in a different form.

“Distance”

“Transformation”

Movement along a vector: not a relative distance

measuring meaning-sector location: not a relative distance

Analyzing the answer to a question: not a relative distance

‘state’ training and fine tuning:

- it might sound strange for a raw embedding model to have a ‘memory,’ and it may seem more inherently ‘stateless,’ but remember that the underlying meaning-space matrix-model may be similar or identical when comparing an anthropomorphic social-illusion generative-model and (on the other hand) a vector-producing model.

This means that the same fine-tuning and DPO or RLHF or appropriateness filtering etc. applies.

Thinking about system-state:

In addition to the interplay between matrix space and more transparently defined data structures such as dictionaries, csv files, dataframes, and various databases, system-state (as usual, in a project-context here) is another focus.

rag, vectors, state, and “easy things are hard”:

- ‘Who is here?’

- When is ‘a question’ more than one question?

- When do question-sets require system-state for context?

Let’s take the basic question ‘Who is here?’ as a question to put to a generative model, which by default has no memory state.

This is not insurmountable, but it is also not trivial. Let’s say you and five other people are sitting around a generative model (either local or cloud). You ask the ai-model “Who is here?” expecting the model to figure out and then report that there are seven parties to the conversation: six human-people and the ai-agent itself.

The usually effective-enough kludge solution is to simply stuff the entire history of the conversation back in along with the new item that is being asked: i.e. generative models, as such, like embedding models, are ‘reactive.’

(I just tried this with mistral-small and it did a reasonably good job, needing a bit of nudging to stay on script. Still, that’s great for an open-source model that can run on a laptop with only cpu.)

Is there an analogical situation for vectors?

- stateful or pseudo-state vector-based behaviors

‘Who’ is not so much a vector-space-question.

A situation like this might involve a multi-step data-integration process, where an architecture will need to use information to assemble or get other information…maybe.

Questions in Matrix Space:

1. Distance questions

- multi-party

- single-person

2. tokenization of questions:

- quasi-generation:

3. questions, answers, and vectors:

- king queen analogy

4. Granularity and Scale of Question:

- summarization question

- single-word questions

- multi-word questions

5. Analogy and transformation:

6. king + man / queen

- addition

- multiplication

- division

7. break-down / split up / derivation / part questions

8. build-up / integration / whole questions

9. cut-up, multi-step, externalization, coordination questions

10. search and retrieval questions

11. rank and sort questions

12. scoring and metrics questions

13. automation and architecture questions

14. production-deployment vs. research questions

15. meta-data / descriptive data / structured data questions

16. translation questions

17. entity extraction questions

18. tool-chain and ‘function calling’ questions

19. structured output questions

fuzzy and brittle, system one and system two,

corpus callosum architects

state and memory

Can you use vectors to find where sections of an unstructured text are? Like part-of-speech or named-entity recognition?

meta-data-mapping & extraction/generation

- How general should the meta-data be?

One of the strengths of vector databases is that wholes and parts are often fluid, fuzzy, and context-dependent.

Concrete topics vs. highly fluid topics such as Shakespeare First Folio?

challenges and goals:

- Actions (vs. reactions) in Matrix Space

Incremental processes, ratchets:
- Modular micro-operations that can be part of a coordinated project and architecture (as opposed to end-to-end generation that make quick gains but are incompatible (like no-code low-code quick-solutions that go fast at first and then grind to a halt).

Noise & Outlier Reduction: Vector Outliers & Filters

Removing outliers and filters:

- the old standard of 1.5 IQR (interquartile range)

A specific use-case for many tools may be, even if not overtly different in purpose at the onset, reducing the amount of data that a person needs to sift through and work with, perhaps analogous to removing outliers and data that are too incomplete or broken to be used.
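A minimal sketch of that old 1.5 IQR standard applied to a batch of per-item scores (the example scores are hypothetical):

```python
import numpy as np

def iqr_filter(scores, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    keep = (scores >= q1 - k * iqr) & (scores <= q3 + k * iqr)
    return scores[keep]

print(iqr_filter([0.61, 0.58, 0.64, 0.59, 0.02, 0.63, 0.99]))
# drops the 0.02 and 0.99 outliers, keeps the central cluster
```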

‘What am I not looking for?’

In many cases a negative definition has importantly different (and importantly more useful) STEM functionality.

Lumping Vectors:

Comparing results for:

A match to A

B match to B

vs.

A & B match to A-Z

Reasoning Steps vs. Indirect Relationship Spaces

Derivative Questions / Epi-Questions / Follow-On Questions

asking vector questions semantically…

note: this may suggest more use for ‘end-to-end’ general models,

however this either exacerbates or does not address project-externalization needs.

This may relate back to the word-image visual-analogy paper: is the answer contained in the vector of an analogy question, or is the whole question measurably closer to the answer than the parts alone?

Hybrid-document database

1. raw text

2. smart overlap chunked text, where:

- each chunk’s vector is stored with the whole article id

- each article has a list of chunks along with it

Q: combined-topics?

3. an automatically generated topic graph for the paper

Rainbow Tables & Shallow Word-Clouds

When you are looking at a vector, what are you looking at?

To some extent, we do not know. It may be some kind of variably dense meaning-crystal-geometry-map onto which we can map words and concepts. But the depth and subtlety of single and generated chains of token-vectors and language are not simplistic.

One approach is to take every word in a small (or large) dictionary, run it through your embedding-model to get the resulting vector, and make a kind of rainbow-table of n-grams, starting with 1 word.

Part of the challenge of this is the sheer size this would reach if we were to brute force N-gram entire large dictionaries out to all possible strings of words…that is a lot of words.

If only to illustrate how little we know about how to get meaning out of the subtle depths of the vectors, we can make word-cloud projections for at least 1-grams (one word, or one stem or lemma).

We can probably take advantage of past NLP research and make a selectively picked general mapping of ‘most common’ words and ‘most common’ n-grams of those, to cover a reasonable amount of language without the excesses of brute forcing overly large recombinant sets.

We would then need some basic clustering and visualization tool, to map out what vague clouds of words (and possibly bi-grams) were buzzing around a given vector.
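A minimal sketch of the lookup side of this, assuming an embed() wrapper around the embedding model loaded earlier (e.g. embed = model.embed) and a tiny stand-in word list:

```python
import numpy as np

# `embed` is assumed to be a wrapper around the embedding model loaded earlier,
# e.g.: embed = model.embed  (from the llama-cpp-python sketch above)
words = ["image", "resolution", "fish", "recipe", "graph"]  # stand-in dictionary
table = np.array([embed(w) for w in words])                 # one row per 1-gram
table /= np.linalg.norm(table, axis=1, keepdims=True)

def nearest_words(mystery_vector, top_k=3):
    """Describe an arbitrary vector by its nearest dictionary words (by cosine)."""
    q = np.asarray(mystery_vector)
    q = q / np.linalg.norm(q)
    order = np.argsort(table @ q)[::-1][:top_k]
    return [words[i] for i in order]
```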

This would be a “sleepy language” (to quote The Tempest) but it would be something.

A next challenge would be to design tools to prod further into the meaning spaces in matrices.

Generative models arguably also explore the nuances of a given vector, but their lack of paper trail and interface is not terribly helpful for this approach per se.

But some kind of hybrid system, feeding sets of vectors instead of tokens into a generative model, might be quite useful.

Try it:

- Try putting results back through the tokenizer and rainbow-table to see what tokens and words result.

Cut-ups:

While this may be a bit of a reach too far, I want to try to include the standard ‘cut-up’ scenario in this discussion as well, to at least start exploring what the operational project implications (and architecture design needs) are or may be.

So let’s start with a standard cut-up scenario. You have one original something (such as a cooking recipe or set of recipes), you break it up into N pieces with N sets of instructions and distribute that to N project roles: whereupon each role (a person, a group, etc.) must operate on their own segment but also coordinate with the other roles and collaborate to act coherently on the whole project.

It could be something as concrete as there being 5 courses to a meal, five teams/roles making each of the five courses, but sharing the same set of ingredients and cooking equipment, operating on the same schedule, and given each just their sections of more than one possible set of course-meal recipes, their task is to design or select courses and produce and serve a full multi-course meal.

What vector and database and search and communication challenges will arise with projects such as this?

Breakdown-Build-up: Cutup & Reassemble the Broken World

An implicit part of cut-up tasks is that proverbial or literal documents need to be re-assembled. Here we have not only an internal document-processing context of breaking down the ‘parts’ of a document into structured formats and breaking down problems into structured sets of procedures and solutions, but also the external aspects of needing to

- A. reassemble something when you only have a part, and you start out not knowing who else there is or what they have.

and/or

- B. operate modularly in a situation where you never know entirely who else there is, what they have, or how what you are using will be made.

And, of course,

- C. Mixtures of A and B.

Summary Note:

This is surely not the most complete or perfect solution, but where there had been no solution we now have one: a solution that can be enhanced, adapted, and improved in many ways, and which is not a completely black box, all thanks to an ability to directly use the vectors of AI concept-matrix models on portable and local hardware and software, in an affordable way, open source, and accessible enough that it was all made in less than a day.

Code Here: https://github.com/lineality/arxiv_explorer_tools

As a challenge, think about or try going the next step and extending this tool to get and read the entire ‘pdf’ paper for an even better match.

Also see:

https://docs.mistral.ai/capabilities/embeddings/

https://docs.anthropic.com/en/docs/embeddings

https://openai.com/index/new-embedding-models-and-api-updates/

https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/

https://platform.openai.com/docs/guides/embeddings/

Apple event:

https://www.youtube.com/watch?v=RXeOiIDNNek

About The Series

This mini-article is part of a series to support clear discussions about Artificial Intelligence (AI-ML). A more in-depth discussion and framework proposal is available in this github repo:

https://github.com/lineality/object_relationship_spaces_ai_ml
