Proposal of Research to Professor Cote

1. Let's Alter

"Alt text" briefly describes an image and is used if the browser cannot display the picture. Alt text is typically used in Web documents to provide a text equivalent for a photo, which can be read by screen readers and other assistive technology to provide accessibility to visually impaired users. Search engines also use it to understand the content of an image and improve its ranking in image search results.

Many images on the web have empty or missing alt text, usually because the person who created the page forgot to add it or did not know how to write it. A site with many such images is less accessible: visually impaired users have difficulty navigating it and understanding its content, which results in a poor user experience.

I wanted to create a more accessible web environment for everyone, so I brainstormed ways to counter this problem. I eventually thought of a distributed intelligence system that, given an image as input, describes what is in the picture. After studying the underlying technologies, I identified three layers for the project.

The first layer we need is Contrastive Language–Image Pre-training (CLIP), a neural network from OpenAI trained on a very large collection of image–text pairs. We need a CLIP inference server that takes an image and returns descriptive text. Because CLIP on its own scores how well a piece of text matches an image rather than writing a caption, the server would likely pair it with a caption generator. Open-source alternatives include Bootstrapping Language–Image Pre-training (BLIP) by Salesforce, which can generate captions directly. However, this inference requires a significant amount of computing power: generating the complete text takes at least a minute, even with the latest GPU. Therefore, for the system to scale, we will need a clever way to make things faster.

I eventually had a eureka moment: hash tables! However, we cannot store every possible combination of pixels in a database. So, I explored mapping the image data into a much smaller, finite space, a concept called 'perceptual hashing'. Hashing algorithms take data and convert it into a fixed-size value, known as a hash. Most hashing algorithms aim to minimize collisions, where two different inputs produce the same hash. Perceptual hashes do the opposite for similar inputs: images that look alike should produce identical or nearly identical hashes. This technique is used to search by image, as in Google Photos, and to detect child sexual abuse material. The perceptual hashing layer is the second layer we need. It will work as a cache, recognizing when two different sets of pixel data are the same picture, just resized, recompressed, or otherwise distorted, so that a stored description can be reused instead of running inference again.
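The cache idea above can be sketched with one of the simplest perceptual hashes, the average hash. This is a minimal illustration, not the proposal's chosen algorithm: the names `averageHash`, `hammingDistance`, and `CaptionCache`, the 8 x 8 grid, and the 5-bit tolerance are all assumptions made for this example, and the image is assumed to be already decoded into a non-empty grid of grayscale values.

```typescript
// A grayscale image as rows of brightness values in the range 0-255.
type GrayImage = number[][];

const HASH_SIDE = 8; // 8 x 8 cells give a 64-bit hash (assumed size)

// Shrink the image to side x side cells by averaging each block of pixels.
function shrink(image: GrayImage, side: number = HASH_SIDE): number[] {
  const height = image.length;
  const width = image[0].length;
  const cells: number[] = [];
  for (let cy = 0; cy < side; cy++) {
    const y0 = Math.floor((cy * height) / side);
    const y1 = Math.max(y0 + 1, Math.floor(((cy + 1) * height) / side));
    for (let cx = 0; cx < side; cx++) {
      const x0 = Math.floor((cx * width) / side);
      const x1 = Math.max(x0 + 1, Math.floor(((cx + 1) * width) / side));
      let sum = 0;
      for (let y = y0; y < y1; y++) {
        for (let x = x0; x < x1; x++) sum += image[y][x];
      }
      cells.push(sum / ((y1 - y0) * (x1 - x0)));
    }
  }
  return cells;
}

// Average hash: one bit per cell, set when the cell is brighter than the mean.
function averageHash(image: GrayImage): string {
  const cells = shrink(image);
  const mean = cells.reduce((a, b) => a + b, 0) / cells.length;
  return cells.map((v) => (v > mean ? "1" : "0")).join("");
}

// Number of bit positions in which two hashes of equal length differ.
function hammingDistance(a: string, b: string): number {
  if (a.length !== b.length) throw new Error("hash lengths differ");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

// Hypothetical cache: a lookup accepts any stored hash within maxDistance bits.
// The linear scan keeps the sketch short; a real index would be needed at scale.
class CaptionCache {
  private entries: { hash: string; caption: string }[] = [];

  store(hash: string, caption: string): void {
    this.entries.push({ hash, caption });
  }

  lookup(hash: string, maxDistance: number = 5): string | null {
    for (const entry of this.entries) {
      if (hammingDistance(entry.hash, hash) <= maxDistance) return entry.caption;
    }
    return null;
  }
}
```

In this sketch, a uniformly brightened copy of an image hashes to the same bits as the original, so the cache returns the stored caption without running inference, while a visually different image falls outside the tolerance and misses.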

The last layer we need contains the wrapping libraries and toolkits for other developers to include. These can be embeddable JavaScript libraries or an HTTP API layer.
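As a rough idea of what the core of such a library might do, the sketch below fills in descriptions for images that lack them. Everything here is hypothetical: `ImageRecord`, `needsAltText`, and `fillMissingAlt` are illustrative names, the `describe` callback stands in for a call to the captioning service, and it is synchronous only to keep the example short. The sketch also assumes that blank alt text counts as missing, so it does not distinguish images that were deliberately left blank.

```typescript
// Simplified stand-in for an <img> element: its source URL and current alt text.
interface ImageRecord {
  src: string;
  alt: string | null;
}

// Assumption for this sketch: null or whitespace-only alt text counts as missing.
function needsAltText(image: ImageRecord): boolean {
  return image.alt === null || image.alt.trim() === "";
}

// Return a copy of the list in which every image lacking alt text gets a
// description from the supplied (hypothetical) captioning function.
function fillMissingAlt(
  images: ImageRecord[],
  describe: (src: string) => string
): ImageRecord[] {
  return images.map((image) =>
    needsAltText(image) ? { ...image, alt: describe(image.src) } : image
  );
}
```

An embeddable script could build the list from the page's images and write the results back to their `alt` attributes, while an HTTP API could expose the same step to server-side tools.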

The final goal is to create a product like Let's Encrypt, which contributed heavily to reducing the number of unencrypted websites. With this technology, I imagine a future where no image on the web is missing its alt text.

2. iiframe: improved iframe

An inline frame, often shortened to iframe, is an HTML element that allows you to embed one HTML document within another. It creates a new browsing context within the current document, which can load and display its own HTML, CSS, and JavaScript. One everyday use case for iframes is to embed third-party content on a website, such as a video from YouTube or a map from Google Maps. However, iframes can also lead to performance problems because they can introduce additional network requests, increase the amount of JavaScript that needs to be executed, and create separate instances of the Document Object Model (DOM) and other browser resources. Additionally, an iframe from a third party can introduce security risks because it could be used to load malicious scripts or to attempt cross-site scripting attacks.

The biggest problem with iframes comes from their core design: an iframe and its host page typically run on the same thread (browsers may move some cross-origin iframes into a separate process, but a page cannot rely on this). A problem inside the iframe can therefore degrade the whole page. For example, when the iframe goes into an infinite loop or performs a heavy computation, it can make the browser's JavaScript engine unresponsive. As a result, the entire web page will freeze.

A building block for solving this problem already exists: Web Workers. Web Workers are a JavaScript API that lets you run scripts in the background on a thread separate from the main page's JavaScript thread. This enables you to perform time-consuming tasks without blocking the UI, which can lead to a better user experience. Now, we can think: "What if we could run an iframe on a worker thread?"

Unfortunately, we cannot natively run an iframe on a worker thread. This is because workers have their own global scope and cannot directly access the DOM or other resources of the main page. They can, however, send messages to and receive messages from the main page. To draw its content on the page, an iframe uses several JavaScript DOM APIs, such as 'requestAnimationFrame()' or 'getBoundingClientRect()', to figure out where to draw. Inside a worker, the iframe cannot access this data, so we need a relay layer. When the iframe calls a DOM API, we relay the request to the hosting page, run the operation there, and return the result to the iframe.
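The host side of such a relay layer could look roughly like the sketch below. It is only an illustration under assumptions: the message shapes `RelayRequest` and `RelayResponse` and the function `createRelay` are hypothetical names invented here, and the handlers are supplied by the caller so that only explicitly allowed APIs can be invoked.

```typescript
// Hypothetical message shapes exchanged between the worker and the host page.
interface RelayRequest {
  id: number;      // lets the worker match a response to its request
  api: string;     // name of the DOM API the worker wants to call
  args: unknown[]; // arguments for that call
}

interface RelayResponse {
  id: number;
  ok: boolean;
  result?: unknown;
  error?: string;
}

type Handler = (...args: unknown[]) => unknown;

// Build a dispatcher that runs a request against an allow-list of handlers.
function createRelay(
  handlers: Record<string, Handler>
): (request: RelayRequest) => RelayResponse {
  return (request) => {
    const allowed = Object.prototype.hasOwnProperty.call(handlers, request.api);
    if (!allowed) {
      return { id: request.id, ok: false, error: "API not allowed: " + request.api };
    }
    try {
      const result = handlers[request.api](...request.args);
      return { id: request.id, ok: true, result };
    } catch (e) {
      return { id: request.id, ok: false, error: String(e) };
    }
  };
}
```

On a real page, the host would call the dispatcher from the worker's message event and send the response back with `postMessage()`, with handlers that forward to actual DOM calls. The allow-list is a design choice in this sketch: it keeps the embedded content from reaching arbitrary page APIs.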

To do this, we need synchronous data transfers. For a worker and the main thread to communicate synchronously without race conditions, we need SharedArrayBuffer and Atomics operations. SharedArrayBuffer is a JavaScript API for sharing memory between multiple threads, including Web Workers. It was introduced as part of the ECMAScript 2017 specification and was intended to enable more efficient communication between workers by letting them share large data structures.

However, in January 2018, researchers disclosed a set of vulnerabilities known as Spectre and Meltdown that affected most modern processors. These vulnerabilities let attackers exploit the way processors handle speculative execution to read sensitive data, such as passwords and encryption keys, from memory they should not be able to access. SharedArrayBuffer made such attacks more practical in the browser: a worker that continuously increments a counter in shared memory acts as a high-resolution timer, and precise timing is what a script needs to measure the cache side effects that leak the data. As a result, the major browser vendors (Google, Mozilla, Microsoft, and Apple) disabled SharedArrayBuffer by default.

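SharedArrayBuffer began returning in 2020 for pages that opt in to cross-origin isolation. A page opts in by sending the Cross-Origin-Opener-Policy (COOP) and Cross-Origin-Embedder-Policy (COEP) headers, and the resources it embeds must in turn permit this through Cross-Origin-Resource-Policy (CORP) or CORS. These web platform features do not remove the underlying processor flaws; they are designed to keep cross-origin data out of a process that can use shared memory, which mitigates the risk of Spectre-style attacks.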
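As a small illustration of the opt-in, the sketch below lists the two response headers a page would send and shows a feature check. The helper names `crossOriginIsolationHeaders` and `canUseSharedMemory` are made up for this example; the header names and values and the `crossOriginIsolated` global are standard web platform features.

```typescript
// Response headers that put a top-level page into cross-origin isolation.
function crossOriginIsolationHeaders(): Record<string, string> {
  return {
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
  };
}

// Check whether shared memory is usable in the given global scope.
// In a browser, pass globalThis; the scope is a parameter here so the
// check can also be exercised outside a browser.
function canUseSharedMemory(scope: { crossOriginIsolated?: boolean }): boolean {
  return (
    scope.crossOriginIsolated === true &&
    typeof SharedArrayBuffer !== "undefined"
  );
}
```

A page would typically run a check like this before starting its workers and fall back to ordinary asynchronous messaging when isolation is not available.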

Given that all of these technologies are ready, we can create an improved version of the iframe that will run on the Worker Thread, synchronously communicating with the main thread. In other words, this will introduce proper, fast, and easy multithreading to the web.
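The synchronous handoff could be built on a small shared-memory channel like the one sketched below. This is a simplified illustration under assumptions, not a finished design: the 40-byte layout, the names `createChannel`, `publishRect`, and `waitForRect`, and the choice of a rectangle as the payload are all invented for this example. The waiting side uses `Atomics.wait()`, which browsers allow inside workers but not on the main thread.

```typescript
// Assumed layout: bytes 0-3 hold a status flag (Int32),
// bytes 8-39 hold four Float64 values (x, y, width, height).
const CHANNEL_BYTES = 40;
const PENDING = 0;
const READY = 1;

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

function createChannel(): SharedArrayBuffer {
  return new SharedArrayBuffer(CHANNEL_BYTES);
}

// Host side: write the result, then flip the flag and wake the waiting worker.
function publishRect(channel: SharedArrayBuffer, rect: Rect): void {
  const flag = new Int32Array(channel, 0, 1);
  const payload = new Float64Array(channel, 8, 4);
  payload[0] = rect.x;
  payload[1] = rect.y;
  payload[2] = rect.width;
  payload[3] = rect.height;
  Atomics.store(flag, 0, READY); // the payload is written before the flag is set
  Atomics.notify(flag, 0);
}

// Worker side: block until a result is published, then read it.
// Returns null if nothing arrives within timeoutMs.
function waitForRect(channel: SharedArrayBuffer, timeoutMs: number): Rect | null {
  const flag = new Int32Array(channel, 0, 1);
  const deadline = Date.now() + timeoutMs;
  while (Atomics.load(flag, 0) !== READY) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) return null;
    Atomics.wait(flag, 0, PENDING, remaining); // sleeps while the flag is PENDING
  }
  const payload = new Float64Array(channel, 8, 4);
  const rect: Rect = {
    x: payload[0],
    y: payload[1],
    width: payload[2],
    height: payload[3],
  };
  Atomics.store(flag, 0, PENDING); // reset the channel for the next request
  return rect;
}
```

In the intended setup, the worker would post a request to the host, call `waitForRect()`, and resume as soon as the host calls `publishRect()` on the same buffer. From the embedded code's point of view, the DOM call then looks synchronous.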

In addition, the world is now watching the very beginning of WebAssembly, which allows high-performance applications to run in web browsers. Before WebAssembly, the web was primarily a platform for running JavaScript code, a relatively slow language for certain computation types. With WebAssembly, it's now possible to run code compiled to a much faster binary format. This opens up new possibilities for web development, such as running complex simulations and computations directly in the browser, creating more responsive and immersive experiences, and running entire desktop-class applications within a web page. With iiframe, we can push it further, allowing WASM files to take advantage of multi-threaded processors.

3. The Photo Library of Babel

"The Library of Babel" is a short story by Jorge Luis Borges that describes a universe consisting of a vast library: an endless expanse of hexagonal rooms, each with four walls of bookshelves. The books all have the same length, and together they contain every possible combination of 25 symbols, and therefore every possible text that can be written with those symbols. In other words, "The Library of Babel" includes all the variations we can make with a fixed alphabet and a fixed book length.

I was fascinated by how the concept answers the infinite monkey theorem: given infinite monkeys making infinite keystrokes, wouldn't they write all of Shakespeare's works? The infinite monkey theorem raises a question that deserves more attention in our society, given the recent boom in generative AI such as ChatGPT (in effect, we now have an infinite AI monkey making infinite keystrokes). Will AI eventually create all the information that the human race can create? What will creation mean from now on?
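To give a sense of the scale involved, the number of distinct texts or images of a fixed size is the alphabet size raised to the number of positions, and its number of decimal digits can be estimated with a logarithm. The sketch below applies this to the library, using the figures from Borges's story (410 pages of 40 lines of about 80 characters, over 25 symbols), and to a small image. The function name `decimalDigits` and the 64 x 64 RGB image size are assumptions chosen for this example.

```typescript
// Number of decimal digits in symbols^positions, via floor(n * log10(b)) + 1.
// Floating-point rounding could be off by one if the product lands extremely
// close to an integer; that is not the case for the values used below.
function decimalDigits(symbols: number, positions: number): number {
  return Math.floor(positions * Math.log10(symbols)) + 1;
}

// Borges's books: 410 pages x 40 lines x 80 characters, 25 possible symbols.
const CHARS_PER_BOOK = 410 * 40 * 80; // 1,312,000 characters
const bookCountDigits = decimalDigits(25, CHARS_PER_BOOK); // 1,834,098 digits

// Assumed example image: 64 x 64 pixels, 3 color channels, 256 levels each.
const VALUES_PER_PHOTO = 64 * 64 * 3; // 12,288 values
const photoCountDigits = decimalDigits(256, VALUES_PER_PHOTO); // 29,593 digits
```

Even a thumbnail-sized image therefore has a count of possible pictures with tens of thousands of digits, which is why such a library can only be indexed by computation rather than stored.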

"The Library of Babel" contains the essential elements of this philosophical question. I looked for previous mathematics and computer science research on "The Library of Babel" but could not find any. I want to conduct foundational studies on "The Library of Babel" and create an encompassing theory. With further studies, we can make a comprehensive library of Babel with all Unicode characters (rather than only the Roman alphabet) or make an infinite grid of pixels containing all possible photos that can exist: The Photo Library of Babel.