Dreaming Streetscapes: How well does the AI know your neighborhood?

Mar 14, 2022

In contrast to the previous generation of widespread creative tools, startling advances in artificial intelligence and machine learning are making the way creative people, be they designers, architects, or filmmakers, interact with their tools less and less of a monologue. Instead of dictating to the software what to create, these new dialectic tools enable creating the unimaginable, the unexpected, the surprising.

The Yershov Diagram, which first appeared in a paper presented at a seminar on Automation of the Thinking Process held at the Kiev Center for Scientific and Technical Information, Kiev, USSR, 1963. (Image taken from Soft Architecture Machines by Nicholas Negroponte. MIT Press, 1975.)

It is this playful type of interaction that will push creativity further and distill immense amounts of data into creative production. In this research, we use one such AI algorithm, VQGAN+CLIP, to generate streetscape imagery of the neighborhoods of New York City, both to see whether it can portray these micro-urban areas accurately and to see whether the images can be used to better understand the current state of these neighborhoods.

What is VQGAN+CLIP?

VQGAN and CLIP are two separate machine learning algorithms that are used in tandem to generate images based on a text prompt. VQGAN is a generative adversarial neural network that is good at generating images that resemble the ones it was trained on (but not from a prompt), and CLIP is another neural network that is able to determine how well a caption (or prompt) matches an image.
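The division of labor described above, where one part proposes and the other scores, can be illustrated with a toy sketch. This is not the real pipeline: as commonly implemented, VQGAN+CLIP updates VQGAN's latent code with gradient steps guided by CLIP's image-text similarity. Here random hill climbing stands in for the generator updates, cosine similarity on small vectors stands in for CLIP, and every name (`cosine_similarity`, `guided_search`, `prompt_vector`) is hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Toy stand-in for CLIP's score: how well two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def guided_search(prompt_vector, steps=500, step_size=0.1, seed=0):
    """Nudge a random candidate toward the prompt, keeping only improvements.

    Returns (starting score, final score). The candidate plays the role of
    the generator's latent code; a real VQGAN latent is far larger.
    """
    rng = random.Random(seed)
    candidate = [rng.uniform(-1.0, 1.0) for _ in prompt_vector]
    start_score = cosine_similarity(candidate, prompt_vector)
    best_score = start_score
    for _ in range(steps):
        # Propose a small random change (assumption: replaces a gradient step).
        proposal = [x + rng.gauss(0.0, step_size) for x in candidate]
        score = cosine_similarity(proposal, prompt_vector)
        if score > best_score:
            candidate, best_score = proposal, score
    return start_score, best_score


if __name__ == "__main__":
    start, best = guided_search([0.2, -0.5, 0.9, 0.1])
    print(f"score before: {start:.3f}, after: {best:.3f}")
```

The point of the sketch is only the loop: the scorer never draws anything, and the generator never reads the prompt; the image emerges from repeatedly asking the scorer whether a change helped.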

a street photo from new york city in the neighborhood of bryant park

Hallucinating Streetscapes

This project uses VQGAN+CLIP to generate street imagery of the neighborhoods of New York City. The main objective was to see whether the algorithm could generate architecturally accurate image representations of distinct parts of the city from given text prompts.

Syntax for the prompt used to generate images

As VQGAN+CLIP is a rather general-purpose algorithm, the generated images lack the clarity and composition found in urban photography. However, the algorithm successfully brought together distinct architectural and urban elements, such as colors, materials, styles, and urban artifacts. The results captured what could be called the average look, or the urban fabric, of these places and distilled it into a collage of averages.

The resulting images contain buildings and architectural elements that carry distinct character and styles; urban features such as sidewalks, streets, signs and vegetation; and certain human figures that populate them.

Finding the best question to ask the machine: Prompt engineering

An early experiment with prompts: “photo” alone did not always result in images that contain streetscapes, hence “street photo” was adopted

As VQGAN+CLIP is a prompt-to-image algorithm, finding the best-performing prompt was the crucial first step. Among the alternatives we tried, the syntax below produced the best results.

a street photo from {city} in the neighborhood of {neighborhood}

Engineering the simplest possible prompt is crucial — no chaotic descriptions allowed

The simpler the prompt, the better the results. We therefore kept the prompt as short as possible while retaining the crucial parts, such as “street photo” and “in the neighborhood of,” so that the output would be more controlled.
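The prompt syntax above can be filled in programmatically. The following is a minimal sketch, not the code used for this project; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and lowercasing is an assumption based on how the prompts appear in the image captions.

```python
# The article's prompt syntax, with placeholders for the two variable parts.
PROMPT_TEMPLATE = "a street photo from {city} in the neighborhood of {neighborhood}"


def build_prompt(neighborhood: str, city: str = "new york city") -> str:
    """Fill the template for one place, lowercased like the captions."""
    return PROMPT_TEMPLATE.format(
        city=city.strip().lower(),
        neighborhood=neighborhood.strip().lower(),
    )


if __name__ == "__main__":
    for name in ["Upper West Side", "Wall Street", "Harlem"]:
        print(build_prompt(name))
```

Keeping the fixed words in one template means the only thing that changes between images is the place name, which is what makes the results comparable.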

First Batch: Three Distinct Neighborhoods

Having decided on the syntax, we started experimenting with three neighborhoods of New York City that have highly distinct architectural features: Upper West Side, Wall Street, and Harlem.

a street photo from new york city in the neighborhood of upper west side

VQGAN+CLIP generated images that accurately portray the ‘gist’ of these three distinct neighborhoods. The Upper West Side, known to be home to some of Manhattan’s most expensive real estate, is a high-income residential area with buildings of varying sizes, brownstones distinctive to New York City, as well as restaurants, shops, and parks.

a street photo from new york city in the neighborhood of wall street

Similarly, the images produced for Wall Street successfully captured the look of the area. Gone are the Upper West Side’s stone-clad facades; these images portray Manhattan’s famous office towers, with grand lobby entrances and an abundance of windows rising above.

a street photo from new york city in the neighborhood of harlem

Lastly, Harlem is a neighborhood that has historically been home to many African-American communities in the city and is still one of the most prominent centers of Black culture. The images produced for Harlem also accurately replicate the urban fabric: low-to-mid-rise residential blocks that leave more room in the frame for the sky; urban artifacts such as street lamps, signs, and urban furniture crowding the streets; and a noticeably higher proportion of Black figures populating the street level.

Widening the scope: Other neighborhoods, boroughs, landmarks, and locations

As the first trials offered promising results, we tried various other locations in the city. These include neighborhoods, landmarks, institutions, infrastructure, street names, and parks, all generated using the same syntax, resolution, seed, and iteration count.
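Holding everything but the place name constant can be sketched as a batch of run configurations. This is an illustration only: `RunConfig` and `make_runs` are hypothetical names, and the resolution, seed, and iteration count below are placeholder values, since the article does not state the ones actually used.

```python
from dataclasses import dataclass

PROMPT_TEMPLATE = "a street photo from {city} in the neighborhood of {neighborhood}"


@dataclass(frozen=True)
class RunConfig:
    """Settings for one generation run; only `prompt` varies across a batch."""
    prompt: str
    width: int
    height: int
    seed: int
    iterations: int


def make_runs(places, city="new york city",
              width=512, height=512, seed=42, iterations=300):
    """Build one config per place with shared (placeholder) settings."""
    return [
        RunConfig(
            prompt=PROMPT_TEMPLATE.format(city=city, neighborhood=place.lower()),
            width=width,
            height=height,
            seed=seed,
            iterations=iterations,
        )
        for place in places
    ]


if __name__ == "__main__":
    for run in make_runs(["Times Square", "Broadway", "42nd Street"]):
        print(run)
```

With the settings shared this way, differences between two images can be attributed to the place name in the prompt rather than to a change in seed or iteration count.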

Images generated for Times Square, Broadway, and 42nd Street.

Images generated for the Midtown and Times Square area show similar results even when the prompt did not include Times Square. 42nd Street, where Times Square and the Times Square subway station are located, comes out as densely populated as Times Square itself. Broadway, a name heavily associated with theaters and Times Square, also resulted in similar urban settings: dense streets, street signs, and populated ground levels.

Images generated for lower Manhattan neighborhoods.

Images generated for well-known, distinct neighborhoods have resulted in accurate images. Chinatown and SoHo are two that were captured with particularly high accuracy.

Can the AI tell the difference between a well-off neighborhood and one that’s not?

Up until this point, VQGAN+CLIP has generated outputs that manage to capture the average feeling of the prompted neighborhoods. If we assume that the algorithm is fairly successful at generating an average façade, what would these façades tell us if we used the algorithm to generate them for neighborhoods with varying socioeconomic conditions? Simply put, could the AI tell the difference between a well-off neighborhood and one that’s not?

Images generated for Greenwich Village and the Upper East Side, high-income neighborhoods in Manhattan, and the Bronx, the borough with the lowest median household income in New York City. (Census 2019)

The images above point to a stark contrast between high-income neighborhoods such as Greenwich Village or the Upper East Side and relatively lower-income areas such as the Bronx, which is not a neighborhood but a borough, and which has the lowest annual median household income according to 2019 US Census data.

Images generated for the former have a well-off architectural representation: tree-covered streets, buildings with wider windows and cleaner facades, and people in ‘fancy’ clothes walking down the street. The images generated for the latter, by contrast, usually contain poorly organized streetscapes, dirtier facades with windows seemingly sprinkled at random, and broken wire fences in front of blind facades.

Below are some further results showing the differences between two neighborhoods on opposite ends of the annual median household income scale, the Upper East Side and Harlem.

Images generated for Upper East Side and specific locations from the area.

Images generated for the Upper East Side feature neoclassical architectural elements that can be found in the famous townhouses and historical residential buildings in the area, and the results are consistent across various prompts: once again, tree-covered streets, buildings with wider windows and cleaner facades, and people in ‘fancy’ clothes walking down the street.

Images generated from various Harlem prompts are likewise consistent with one another: smaller windows, dirtier facades, fences, street signs, cramped streets…

In general, it is safe to say that AI-generated ‘average facades’ for these lower-income neighborhoods tend to feature a relatively more neglected-looking urban environment compared to their better-off counterparts.

What does the AI know about your street? Or the neighborhood park? Or the highway near the river?

These images were generated not for a neighborhood but for specific locations, such as a park or a highway.

The algorithm successfully generated images for well-known streets and infrastructure such as Central Park West, the Henry Hudson Parkway, and the Lincoln Tunnel. The first image shows Central Park West, with a park and New York-style high-rises in the back. The second image features a highway sign on top, a bridge rising among trees, and a river in the far distance, similar to what you would see when approaching the George Washington Bridge over the Hudson River.

Images generated for Morningside Heights and two parks enveloping the neighborhood.

The algorithm was also surprisingly accurate when producing less well-known neighborhoods and parks. Images generated for Morningside Heights, a residential neighborhood in Upper Manhattan, have architectural elements that remain consistent even when the prompt is changed to the names of the parks that surround the area. Notice how the first two images above have buildings that are similar to those in the third.

Images generated for urban places people do not spend time in as part of their daily routine

Lastly, urban places that are visited less frequently and for shorter periods have also resulted in relatively accurate imagery. Above, one image looks toward Midtown from one end of the Lincoln Tunnel; the other shows the apron at JFK.

What about landmarks?

Images generated for well-known, well-photographed locations returned highly accurate results (except for the shape of the buildings)

Images generated for landmarks and well-known locations have resulted in more accurate representations with distinct architectural elements, although in many cases the algorithm could not get the geometry quite right. The examples above show Grand Central Terminal, the Empire State Building, and One World Trade Center. Compared to the streetscapes, these images are far less believable because of their lack of composition and broken geometry.

How well informed is VQGAN+CLIP about institutions that have a prominent urban presence? (Looking at you, NYU)

Images generated for NYU (New York University)

A surprising result came from using abbreviations. As seen above, the prompt “New York University” resulted in an image containing architecture that resembles Greenwich Village, where NYU is located, as well as banners in the university’s color. When the prompt was changed to “NYU,” the images did not contain anything related to the institution.

In terms of boroughs, the AI chooses Manhattan

The rivalry between Manhattan and Brooklyn has been a hot topic for decades now. Yet when it comes to generating images, VQGAN+CLIP seems to have a certain preference for one over the other. Images generated for Manhattan, presumably because it is the more ‘documented’ of the two, tend to contain relatively more accurate and recognizable features.

Images generated for Brooklyn neighborhoods: Clinton Hill, Midwood and Downtown Brooklyn.

Iterating the results

Although many different prompts related to the same urban area produce similar results, VQGAN+CLIP does not always generate similar representations in different iterations. As seen below, two images generated for Wall Street are distinctly different. It should not be overlooked, however, that both images are accurate representations of the area; they are simply different takes on the same part of the city.

The former depicts a streetscape with what seems to be a grand lobby entrance in the background and crosswalks populated by people in black coats in the foreground. The latter is a fairly accurate urban depiction of (presumably) the New York Stock Exchange, which is one of the most photographed and well-known buildings of Wall Street.

Adding one more to the mix, the algorithm also generates similar images for a different prompt. Today, the Financial District and Wall Street are used almost synonymously to refer to roughly the same part of the city: a very dense, small piece of land in Lower Manhattan. Hence, the result contains similar features: tall office buildings rising toward the sky.

Conclusion

Admittedly, some of these trials returned less successful results. As in every urban setting, areas that have a truly unique character are rare. Hence, many of the images generated for less well-defined areas look like they could belong to other neighborhoods. The results show that, unless prompts include the names of well-known locations, landmarks, or institutions, the resulting images tend to look highly ambiguous.

Further Questions

How well does the AI generate images based on neighborhoods?
How well does the AI generate images based on streets and addresses?
How well does the AI generate images based on landmark and institution names?
How well does the AI generate images based on parks?
How do these neighborhood images compare to each other?