It is well-known that human sight and vision are vital in how we view the world around us. Although often treated as synonyms, sight and vision are two very different concepts. Sight is physical – it is a sensory experience in which light reflected from shapes and objects passes through the cornea and lens to reach the retina. Signals are then sent via the optic nerve to the brain, where they are converted into images.
Vision is how the mind – a person’s ability to think and reason – interprets these images, often based on past learning and memories. Vision is therefore a metaphysical concept. Sight enables a person to witness an event, but vision helps the person to understand the significance of that event and draw interpretations from it.
A practical example: during a walk in the park, you see a dog playing with a ball. When the owner throws the ball, the dog brings it back; on receiving it, the owner pats the dog on the head. Your vision tells you that there is a special and loving relationship between owner and dog, and that the dog enjoys the ball game.
Sight and vision are important human capabilities because they help us make sense of the world around us and learn from new observations. It is no wonder that more than 50% of the human cortex – the surface of the brain – is dedicated to processing visual information.
Artificial intelligence (AI) vision and transformers
Just as in humans, visual ability is currently of paramount importance to artificial intelligence (AI) because it enables machines to perceive and understand the visual world in a manner similar to humans. As explained above, visual perception – the ability to interpret and make sense of visual information – is a fundamental aspect of human intelligence, and replicating this capability in AI systems has numerous implications and applications.
Although computer vision was introduced in the 1960s, it is only recently that it has been used extensively in multi-modal learning by AI, mainly due to the introduction of transformers. A transformer is a deep learning model, used as a component in many neural network designs, for processing sequential data such as natural language text, genome sequences, sound signals or time series. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
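At the heart of the transformer is the self-attention mechanism, which lets every element of a sequence (a word, or a patch of an image) weigh its relevance to every other element. The following is a minimal sketch of scaled dot-product self-attention in Python with NumPy; the dimensions, random weight matrices and function names are illustrative assumptions for this sketch, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row maximum before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise relevance of each token to every other
    weights = softmax(scores, axis=-1)        # each row is a probability distribution
    return weights @ V                        # each output is a weighted mix of all values

# Illustrative toy input: a "sequence" of 4 tokens with 8 features each.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one context-aware vector per input token
```

In a vision transformer, the "tokens" are simply flattened image patches rather than words, which is what allows the same architecture to handle text and images alike.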
In particular, transformer models have been used extensively for vision-language tasks such as visual question answering (VQA), visual commonsense reasoning (VCR), cross-modal retrieval, image captioning and text-to-image generation. Many of the latest AI applications are, therefore, very good at seeing. Vision enables neural networks to learn things that were previously not possible: not only to generate images from text, but also to describe images, summarise videos, and understand and explain complex diagrams.
The importance of the visual ability in AI
Since our world is highly visual, visual ability is not merely useful to neural networks but crucial. Some of the reasons why visual ability in AI is important to organisations and businesses are:
Better-informed decisions: A large portion of the information in the world is presented visually, such as images, videos, diagrams and graphical data. By developing visual abilities, AI systems can effectively analyse and interpret this wealth of visual data, extracting meaningful insights that support better-informed business decisions.
Retail and e-commerce: AI can analyse visual data from images and videos to improve the customer shopping experience. For instance, AI can interpret product images to automatically tag and categorise products, improving search accuracy and enabling personalised product recommendations. Visual data interpretation can also help identify trends, detect counterfeit products, and optimise shelf layouts for better product placement.
Manufacturing and quality control: AI can interpret visual data from cameras and sensors to monitor manufacturing processes, detect defects in products, and thus ensure product quality. By analysing visual information, AI can identify anomalies, measure dimensions, and compare finished products against quality standards to assist in reducing manufacturing errors, improving efficiency, and ensuring consistent product quality.
Social media and brand monitoring: AI can interpret visual data from social media platforms to monitor brand mentions, analyse customer sentiment, and identify product and consumer trends. By analysing images and videos, AI can detect brand logos, analyse user-generated content, and provide insights into consumer perception and behaviour. This could help businesses understand their brand reputation better, track the effectiveness of marketing campaigns, and identify significant influencers.
Security and surveillance: AI can interpret visual data from security cameras and surveillance systems to enhance overall security measures. AI-powered security systems can analyse visual information to detect and track suspicious activities, identify individuals, and recognise patterns that indicate potential security threats. Systems that benefit from object recognition could thus improve the effectiveness of security monitoring by detecting and classifying threats and enabling proactive responses to potential incidents.
Autonomous delivery and object recognition: AI with visual ability can recognise and identify objects in images or video streams, which is crucial for various applications. For instance, visual data interpretation is critical for autonomous delivery vehicles to perceive and navigate the environment. Autonomous vehicles must recognise traffic signs, pedestrians, and obstacles to navigate safely. AI systems, therefore, constantly analyse visual data from cameras, LiDAR, and other sensors to detect and interpret road signs, traffic lights, pedestrians, and other vehicles. This enables autonomous vehicles to make real-time decisions, maintain safe distances, and navigate complex traffic scenarios.
Visual reasoning and healthcare: Visual ability facilitates reasoning based on visual information. AI systems can perform tasks like visual question answering, where they analyse an image, graph or diagram and answer questions related to its content. This capability is particularly useful in fields such as medical diagnosis, where AI can examine medical images such as X-rays, MRIs, CT and PET scans to provide insights to healthcare professionals and aid in the diagnosis and treatment of patients. By analysing these images, AI systems can detect abnormalities, assist in the early detection of diseases, and provide informed recommendations to healthcare professionals. Research has shown that visual data interpretation by AI can help radiologists and doctors make more accurate diagnoses and improve patient outcomes. AI is increasingly used in rural hospitals, where medical specialists are often in short supply.
Enhanced human-machine interaction: Incorporating visual ability into AI systems improves their ability to interact with humans. For example, AI-powered chatbots can analyse facial expressions and gestures to better understand user intent and emotions. The outcome of the analysis is used to enable more natural and intuitive human-machine communication.
Creative applications: Visual ability enables AI to engage in creative activities, such as generating art for marketing purposes, designing new products or creating visual content for the business. By understanding aesthetics and visual patterns, AI systems can produce visually appealing and contextually appropriate outputs.
Accessibility and inclusion: AI with visual ability can assist individuals with visual impairments by describing the visual world, recognising objects, and providing real-time assistance. This promotes accessibility and inclusion for people with disabilities within the business.
Understanding context: Visual perception aids in understanding the context of a given situation. By analysing visual cues, AI systems can infer important details about objects, scenes and people. This contextual understanding allows AI to reason, make predictions, and interact with the world more intelligently. This ability is, for example, already used for the post-interview analysis of candidate behaviour.
The value of the visual ability of AI
It is apparent that visual ability empowers AI systems to comprehend, analyse and interact with the visual world, making them more capable of handling tasks and applications that require visual perception. By bridging the gap between visual information and intelligent decision-making, AI with visual ability has far-reaching implications across various domains and industries.
The examples discussed above demonstrate the value of AI in interpreting visual data across various business domains. By leveraging AI’s visual capabilities, businesses can gain actionable insights, automate processes, improve decision-making, and enhance overall efficiency and competitiveness.
Professor Louis C H Fourie is an extraordinary professor in information systems at the University of the Western Cape