"Alt text" is a brief description of an image, displayed if the browser cannot render the picture. In web documents it serves as a text equivalent for a photo, which screen readers and other assistive technology read aloud to make content accessible to visually impaired users. Search engines also use it to understand what an image contains and to rank it in image search results.
Many images on the web have empty alt text, usually because the person who created the page either forgot to add it or didn't know how to write it. A site with many such images becomes harder for visually impaired users to navigate and understand, leading to a poor user experience and genuine web accessibility failures.
I wanted to create a more accessible web for everyone and brainstormed ways to counter this problem. I eventually settled on a distributed intelligence system that, given an image as input, describes what is inside the picture. After studying the underlying technologies, I identified several layers the project needs.
The first layer we need is an image-captioning model. Contrastive Language–Image Pre-training (CLIP), created by OpenAI, is a neural network trained on a vast dataset of image–text pairs that connects pictures to language; a similar open-source alternative is Bootstrapping Language–Image Pre-training (BLIP) by Salesforce, which can generate descriptive captions directly. We need an inference server that takes an image and returns descriptive text. However, this inference requires a significant amount of computing power: generating a complete caption can take a minute or more, even on a recent GPU. Therefore, for the system to scale, we will need a clever way to make things faster.
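Since each caption takes on the order of a minute, throughput matters more than single-request latency when processing a whole site. One common trick is micro-batching: group requests that arrive close together and run them through the model as a single batch, amortizing per-call GPU overhead. A minimal sketch, where `runModelBatch` is a hypothetical stand-in for a real CLIP/BLIP inference call:

```typescript
// Micro-batcher: collects caption requests arriving within a short window
// and dispatches them to the (hypothetical) model as one batch.
type Job = { image: string; resolve: (caption: string) => void };

function makeBatcher(
  runModelBatch: (images: string[]) => Promise<string[]>,
  windowMs = 10,
) {
  let pending: Job[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;

  async function flush() {
    const jobs = pending;
    pending = [];
    timer = null;
    // One model invocation serves every request in the window.
    const captions = await runModelBatch(jobs.map((j) => j.image));
    jobs.forEach((job, i) => job.resolve(captions[i]));
  }

  return (image: string): Promise<string> =>
    new Promise((resolve) => {
      pending.push({ image, resolve });
      if (!timer) timer = setTimeout(flush, windowMs);
    });
}
```

A caller simply awaits `caption(image)` as if it were a single-image API; the batching is invisible to the client.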
I eventually had a eureka moment: hash tables! Of course, we cannot store every pixel combination in a database, so I explored collapsing image data into a finite space, a technique called 'perceptual hashing'. A hashing algorithm converts data into a fixed-size value, known as a hash. While most hashing algorithms aim to minimize the likelihood of two different keys producing the same hash (a collision), perceptual hashes aim for the opposite: maximizing collisions between similar inputs. The technique powers reverse image search, such as in Google Photos, and the detection of child sexual abuse material. This perceptual hashing layer is the second layer we need. It will work as a cache, intelligently deciding whether two different sets of pixel data are the same picture, just distorted.
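As a concrete illustration, here is a toy average hash (aHash), one of the simplest perceptual hashes. It assumes the image has already been downscaled to an 8x8 grayscale grid (64 values, 0-255); a real system would use a library such as pHash, and the distance threshold below is an arbitrary illustrative choice:

```typescript
// Average hash: 64 grayscale values become a 64-bit hash, with a 1 bit
// wherever the pixel is at or above the mean brightness.
function averageHash(pixels: number[]): bigint {
  const mean = pixels.reduce((a, b) => a + b, 0) / pixels.length;
  return pixels.reduce(
    (hash, p) => (hash << 1n) | (p >= mean ? 1n : 0n),
    0n,
  );
}

// Number of differing bits between two hashes.
function hammingDistance(a: bigint, b: bigint): number {
  let x = a ^ b;
  let count = 0;
  while (x) {
    count += Number(x & 1n);
    x >>= 1n;
  }
  return count;
}

// Cache lookup: treat two images as "the same picture, just distorted"
// when their hashes differ in at most `threshold` bits.
function lookupCaption(
  cache: Map<bigint, string>,
  hash: bigint,
  threshold = 5,
): string | undefined {
  for (const [stored, caption] of cache) {
    if (hammingDistance(stored, hash) <= threshold) return caption;
  }
  return undefined;
}
```

A slightly recompressed or resized image flips only a few bits of the hash, so it still lands within the threshold and reuses the cached caption instead of triggering a fresh minute-long inference.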
The final goal is to create a product like Let's Encrypt, which contributed heavily to eliminating unencrypted websites from the web. With this technology, I imagine a future where not a single image on the web is missing its alt attribute.
After the security issues were mitigated, SharedArrayBuffer returned in 2020 alongside the Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) web platform features, which were designed to improve the security of shared memory and mitigate the risk of Spectre-style attacks.
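Concretely, a page opts into cross-origin isolation, and thereby regains access to SharedArrayBuffer, by sending this pair of response headers on its top-level document (shown here for reference):

```http
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
```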
Given that all of these technologies are ready, we can create an improved version of the iframe that runs on a worker thread while communicating synchronously with the main thread. In other words, this would bring proper, fast, and easy multithreading to the web.
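The synchronous part is exactly what SharedArrayBuffer plus Atomics makes possible: one thread can block until another thread writes into shared memory. A minimal sketch, using Node's worker_threads as a stand-in for a browser Worker; the worker publishes a value into the shared buffer and wakes the waiting thread:

```typescript
import { Worker } from "node:worker_threads";

// One 32-bit cell of memory shared between the two threads.
const shared = new SharedArrayBuffer(4);
const cell = new Int32Array(shared);

const worker = new Worker(
  `const { workerData } = require("node:worker_threads");
   const cell = new Int32Array(workerData);
   Atomics.store(cell, 0, 42);   // publish a result
   Atomics.notify(cell, 0);      // wake any thread blocked in Atomics.wait`,
  { eval: true, workerData: shared },
);

// Block this thread until the worker signals (or 5 s pass). If the worker
// already stored 42 before we got here, wait returns "not-equal" at once.
Atomics.wait(cell, 0, 0, 5000);
console.log(Atomics.load(cell, 0)); // 42
```

In a browser, `Atomics.wait` is only permitted inside workers, which is why the "improved iframe" itself would live on a worker thread and the main thread would use non-blocking calls such as `Atomics.notify`.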
"Library of Babel" by Jorge Luis Borges is a short story describing a universe consisting of a vast library: an endless expanse of hexagonal rooms, each of which contains four walls of bookshelves. Together, the books contain every possible combination of 25 symbols, and therefore every possible text that can be created with those symbols. "Library of Babel" includes all variations we can make with an alphabet at a specific length.
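The story even pins down the library's scale: each book has 410 pages of 40 lines with about 80 characters per line, so a book holds 1,312,000 symbol positions and the library holds 25^1,312,000 distinct books. A quick calculation of how large that number is:

```typescript
// Size of Borges' library: 410 pages x 40 lines x 80 characters per book,
// each position holding one of 25 symbols.
const charsPerBook = 410 * 40 * 80; // 1,312,000 symbol positions
// Decimal digits in 25^charsPerBook, via digits(n) = floor(log10(n)) + 1.
const digits = Math.floor(charsPerBook * Math.log10(25)) + 1;
console.log(charsPerBook); // 1312000
console.log(digits);       // 1834098
```

So the number of books is itself a number with about 1.8 million digits; the library is unimaginably larger than the observable universe.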
I was fascinated by the way it answers the infinite monkey theorem: given infinite monkeys making infinite keystrokes, wouldn't they eventually type all of Shakespeare's works? The theorem raises an underexplored question for our society amid the recent boom of generative AI such as ChatGPT (we now have an infinite AI monkey making infinite keystrokes). Will AI eventually create all information creatable by the human race? What would the notion of creation be from now on?
"Library of Babel" intrinsically contains the critical components of such a philosophical question. I looked for prior mathematics and computer science research on the "Library of Babel" but found none. I want to conduct foundational studies on it and create an encompassing theory. With further study, we could build a comprehensive library of Babel over all Unicode characters (in contrast to the Roman alphabet), or an infinite grid of pixels containing every possible photo that can exist: a Photo Library of Babel.
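One foundational observation such a theory would rest on: the library never needs to be stored, because fixed-length texts over a 25-symbol alphabet are in bijection with integer indices, so any book can be generated on demand from its index. A sketch (the 22 letters below are an arbitrary illustrative choice, since the story does not list them):

```typescript
// Bijection between a BigInt index and a fixed-length "book" over a
// 25-symbol alphabet: 22 letters (illustrative) plus comma, period, space.
const ALPHABET = "abcdefghijklmnopqrstuv" + ",. ";
const BASE = BigInt(ALPHABET.length); // 25n

function bookFromIndex(index: bigint, length: number): string {
  // Read `index` as a base-25 numeral, zero-padded ('a') to `length` symbols.
  let out = "";
  for (let i = 0; i < length; i++) {
    out = ALPHABET[Number(index % BASE)] + out;
    index /= BASE;
  }
  return out;
}

function indexFromBook(book: string): bigint {
  let index = 0n;
  for (const ch of book) index = index * BASE + BigInt(ALPHABET.indexOf(ch));
  return index;
}
```

The same construction generalizes by swapping the alphabet: use the Unicode code space for a Unicode library, or treat each pixel of a fixed-size image as a symbol with 256 values per channel for the Photo Library of Babel.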