
The AI sector is undergoing a generative revolution as tech giants like Google, Microsoft, Apple and Meta compete to build a new generation of AI models with a more human-centered approach. This race for AI supremacy has added another layer of drama to the industry. A month ago, OpenAI went through its own turbulent stretch, including the firing and rehiring of CEO and co-founder Sam Altman; we covered the whole OpenAI saga in our previous blog. Now let's shift our attention to Google's desperate move: the introduction of Gemini, an advanced iteration of its generative AI models, which was later integrated into its Bard chatbot. We'll explain why the launch came across as desperate later in the blog.
What exactly is Gemini?
Google officially announced Gemini in a blog post published on the 6th of December last year. According to Google DeepMind, Gemini is an 'anything to anything' model: it can generate diverse outputs in response to multiple types of inputs, which means it is multimodal in nature. As a multimodal large language model, Gemini is built to reason across different types of information such as text, images, video and audio, including computer code written by humans. In contrast to OpenAI's ChatGPT, which is primarily a text-based chatbot, Gemini is meant to respond not only to text prompts but also to visual and audio inputs, and to explain its reasoning. By understanding images, text, audio and video, Gemini can generate real-time responses while examining video or image feeds, explain its reasoning in math and physics, and was even shown live-tracking moving objects. Additionally, Gemini can understand, explain and generate high-quality code in programming languages like Python, Java and Go. Its ability to work across languages and reason about complex information makes it, in theory, one of the leading foundation models for coding in the world.
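To make the "code generation from a simple prompt" idea concrete, here is a minimal sketch of how a developer might call a Gemini model. It assumes the publicly released `google-generativeai` Python SDK, an API key from Google AI Studio and the `gemini-pro` model name; treat these names as illustrative rather than definitive.

```python
# Minimal sketch: asking a Gemini text model to generate Python code.
# Assumes the google-generativeai SDK (`pip install google-generativeai`)
# and an API key from Google AI Studio; model names may change over time.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder, not a real key

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Write a Python function that checks whether a string is a palindrome, "
    "and explain your reasoning in a short comment."
)

print(response.text)  # the generated code plus the model's explanation
```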
Gemini's Three Variants
The first version of Gemini is available in three variants: Ultra, Pro and Nano. Ultra handles highly complex tasks, Pro offers improved performance and deployability at scale, and Nano targets on-device applications like smartphones, smartwatches and other wearables. Gemini Ultra is the first model to surpass human experts on massive multitask language understanding (MMLU), a widely adopted benchmark for assessing the knowledge and problem-solving capabilities of AI models. That's quite a big deal in the field of AI, because MMLU is a series of tests spanning 57 subjects such as math, physics, history, law, medicine and ethics, probing both world knowledge and problem-solving ability. Gemini scored a whopping 90% on that benchmark.
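For readers unfamiliar with how a score like "90% on MMLU" is computed, the benchmark is essentially a large set of multiple-choice questions, and the reported number is the fraction answered correctly. The toy sketch below shows that scoring loop with made-up questions and a stand-in `ask_model` function; it is not the official MMLU harness, just an illustration of the idea.

```python
# Toy illustration of MMLU-style scoring: multiple-choice accuracy.
# The questions and the ask_model() stub are made up for illustration;
# the real benchmark spans 57 subjects and thousands of questions.

questions = [
    {"q": "What is 7 * 8?", "choices": ["54", "56", "64", "48"], "answer": "B"},
    {"q": "Which planet is closest to the Sun?", "choices": ["Venus", "Earth", "Mercury", "Mars"], "answer": "C"},
]

def ask_model(question: str, choices: list[str]) -> str:
    """Stand-in for a call to an LLM that returns a letter A-D."""
    return "B"  # a real implementation would prompt the model here

correct = 0
for item in questions:
    prediction = ask_model(item["q"], item["choices"])
    correct += prediction == item["answer"]

accuracy = correct / len(questions)
print(f"MMLU-style accuracy: {accuracy:.0%}")
```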
Natively Multimodal
The best part of Gemini is that it is natively multimodal: it was pre-trained on multiple modalities from the very beginning. The team presented Gemini with a mix of modalities, interleaving images and text, and Gemini responded by predicting what could come next. It was tested on logic and reasoning by being shown images of the Earth, the Sun and Saturn and asked to arrange them in the right order. To test its cultural understanding, it was shown a few still images and asked to guess the movie. To test object detection and tracking, it was presented with the classic ball-and-cup shuffling game, where it correctly answered which cup the ball was under.
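A rough idea of how such interleaved image-and-text prompting looks from a developer's side is sketched below. It assumes the same `google-generativeai` SDK, the `gemini-pro-vision` model name and a local image file; the demo in Google's video used still frames in much the same spirit.

```python
# Sketch of an interleaved image + text prompt, in the spirit of the demo.
# Assumes the google-generativeai SDK and a vision-capable Gemini model;
# "planets.jpg" is a placeholder image on the local disk.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro-vision")
image = Image.open("planets.jpg")  # e.g. a picture of the Sun, Earth and Saturn

response = model.generate_content([
    "Here is an image showing several celestial bodies.",
    image,
    "Order them by their distance from the Sun and explain your reasoning.",
])

print(response.text)
```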
Source: Google Blog
Fake Demo Video?
Although the demo video shared by Google looked impressive, it is a bit disappointing that it was somewhat staged and edited. So was it all fake? Not exactly. Everything we see in the video was driven by text prompts rather than the voice commands it appears to show, which does not really take anything away from Gemini's ability to process multimodal data. Gemini's ability to seamlessly combine these modes enables new possibilities for what you can do.
Benefits of Gemini
As discussed by Google software engineer Taylor Applebaum and research scientist Sebastian Novazin, Gemini can help in fields like science and finance. From roughly 200,000 research papers on genetics, Gemini filtered out about 250 relevant papers and highlighted the key data, all within the span of a lunch break. Gemini can not only understand research papers but, being multimodal, can also understand the numbers and figures inside them and produce a new, refined version. The same multimodal capability can help in understanding various forms of financial data, including charts, graphs and textual information, allowing comprehensive analysis of financial reports and market data and contributing to a well-informed financial strategy.
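The "filter 200,000 papers down to the relevant ones and extract the key data" workflow can be pictured as a simple loop that asks a model to score each abstract for relevance and then to extract the figures of interest from the survivors. The sketch below is a hypothetical outline of such a loop; `ask_gemini` is a stand-in for whatever model call you use, and the prompt wording and threshold are arbitrary choices.

```python
# Hypothetical sketch of a literature-filtering loop of the kind described
# in the demo. ask_gemini() is a placeholder for a real model call; the
# relevance threshold and prompt wording are arbitrary.

def ask_gemini(prompt: str) -> str:
    """Placeholder for a call to a Gemini-style LLM."""
    raise NotImplementedError

def filter_papers(abstracts: list[str], topic: str, threshold: int = 8) -> list[str]:
    relevant = []
    for abstract in abstracts:
        reply = ask_gemini(
            f"On a scale of 0-10, how relevant is this abstract to '{topic}'? "
            f"Answer with a single integer.\n\n{abstract}"
        )
        try:
            score = int(reply.strip())
        except ValueError:
            continue  # skip replies that are not a clean number
        if score >= threshold:
            relevant.append(abstract)
    return relevant

def extract_key_data(paper_text: str) -> str:
    return ask_gemini(
        "Summarize the key quantitative findings in this paper as bullet points:\n\n"
        + paper_text
    )
```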
Gemini can also process raw audio signals as they are. Typically, when an LLM interacts with audio, the audio is first run through a speech-recognition system to convert it into text, and that text is then fed into another model that understands and processes it. In this process, some of the nuances of pronunciation or tone can be lost. But as explained by Google DeepMind research scientist Adrià Recasens, Gemini can understand and reply to raw audio signals end to end. For example, with its audio capabilities, Gemini could enhance the experience of watching, say, a YouTube video. Its ability to directly understand and respond to raw audio could be leveraged to provide accurate, real-time, multilingual subtitles, eliminating the intermediate step of converting spoken language to text and then translating it. Users could enjoy videos in their preferred language without losing the nuances of the original content, and this could probably be done in real time. For viewers with visual impairments, Gemini could help with audio descriptions as well. Gemini might also be useful in wearable tech like the Ray-Ban Meta smart glasses, where users interact with their glasses through natural speech and could receive spoken translation in their preferred language, since Gemini can pick up the nuances and pronunciation of a voice accurately.
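The difference between the conventional cascaded pipeline and an end-to-end audio model is easiest to see in pseudocode. The sketch below contrasts the two; every function here is a placeholder stub, since the public Gemini API's audio support and exact interfaces may differ from what the demo implies.

```python
# Contrast between a cascaded subtitle pipeline and a (hypothetical)
# end-to-end multimodal call. Every function here is a placeholder stub.

def transcribe(audio_chunk: bytes) -> str:
    """Stand-in for a speech-recognition model."""
    raise NotImplementedError

def translate(text: str, target_language: str) -> str:
    """Stand-in for a text-to-text translation model."""
    raise NotImplementedError

def ask_multimodal_model(audio: bytes, prompt: str) -> str:
    """Stand-in for an end-to-end audio-capable model such as Gemini."""
    raise NotImplementedError

def cascaded_subtitles(audio_chunk: bytes, target_language: str) -> str:
    # Conventional approach: tone and pronunciation are lost at transcription.
    transcript = transcribe(audio_chunk)
    return translate(transcript, target_language)

def end_to_end_subtitles(audio_chunk: bytes, target_language: str) -> str:
    # End-to-end approach: one model consumes the raw audio directly, so
    # prosody and accent can inform the output.
    return ask_multimodal_model(
        audio=audio_chunk,
        prompt=f"Produce subtitles for this audio in {target_language}.",
    )
```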
What if I told you it can also help with your homework? It can. As explained by Google DeepMind interaction designer Sam Chuang, if you show Gemini a handwritten mathematical problem, it can understand it, point out the mistakes and provide a solution. Gemini's ability to read handwritten math enables it to help students solve complex equations, and it can be a versatile tool for grasping difficult academic concepts. Students can receive prompt feedback on their work, making learning more dynamic and interactive. Furthermore, Gemini might also help fields like rocket science and climatology correct fallacies in their models and predictions.
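A homework-checking interaction of this kind would look roughly like the following from the API side: a photo of the handwritten working plus a grading prompt. Again, this assumes the `google-generativeai` SDK and a vision-capable model name; `homework.jpg` is a placeholder file.

```python
# Sketch of the "check my handwritten math" interaction from the demo.
# Assumes the google-generativeai SDK; homework.jpg is a placeholder photo
# of a student's handwritten solution.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro-vision")
photo = Image.open("homework.jpg")

response = model.generate_content([
    "This is a student's handwritten solution to a physics problem.",
    photo,
    "Check each step, point out any mistake, and show the corrected working.",
])

print(response.text)
```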
Gemini can go beyond the text interface. Instead of producing generic answers, it can understand user intent and respond with a personalized interface crafted and adapted to the specific needs of the individual or organization it is created for. Google DeepMind engineering director Palash Nandy showed how Gemini can help plan an event: asked for suggestions for the birthday party of a kid who loves animals, Gemini comes up with an outcome that is visually rich and interactive. Nuances in similar questions asked in fundamentally different contexts can produce better, non-repetitive solutions. For instance, how much additional exercise someone needs depends on individual factors such as their current fitness routine, professional objectives and personal goals, so Gemini could help create a personalized fitness plan; a professional athlete will need a very different plan from someone working in banking. Similarly, appropriate decor for your daughter's birthday celebration would vary significantly depending on whether it's her third birthday or a quinceañera.
Multimodal AI Future?
While this technology may seem impressive and is positioned as a formidable competitor to OpenAI's ChatGPT, the edited demonstration video of Gemini only offers an approximation of its potential functionality. The showcased features are not currently available in the precise manner presented in the video Google shared. The question arises: why did Google stage this demo? Let's try to unfold why Google might be so desperate to introduce its upgraded model.
After OpenAI launched ChatGPT, it reached 100 million users in only two months, almost five times faster than TikTok, which took nine months to reach 100 million active users, and roughly ten times faster than Instagram, which took almost two years to get there. Google launched its chatbot Bard in February 2023 but failed to demonstrate its competence, and Alphabet's shares fell by 8% as a result. By contrast, Alphabet's stock price rose by 5.3% in a single day after Gemini was introduced. While all of this is noteworthy, we should not overlook the fact that Google has been synonymous with the internet for the better part of two decades, and not without reason: Google has served as a substantial repository of videos and links to legitimate websites with informative articles. Essentially, these are knowledge bases intricately connected to the internet, ensuring that only the best information rises to the top of the search results page, and it does this for every single query, with remarkable speed. Google therefore deeply understands relevance, discerning which piece of data is more legitimate than another, and that unique position makes it a highly credible builder of LLMs.
Furthermore, despite ChatGPT gaining significant traction and forming a partnership with Microsoft, where it powers Copilot and related services, these offerings primarily serve as B2B products for large corporations, functioning more as internal tools. The consumer front, meanwhile, is predominantly dominated by Google, given its access to the largest user base connected to the internet through Android. This implies that, to a great extent, Google can penetrate the B2C market more easily than ChatGPT ever could.
Considering these factors, even though Google's shares saw a temporary 8% dip, we are witnessing the formation of two distinct trajectories. Google is poised to maintain a strong hold on the B2C market when it comes to LLMs, while ChatGPT is carving its niche in the B2B realm with Microsoft. Historically a B2C company, Google can shrug off the market response to ChatGPT and others because of its deep integration into Android and its stronghold on general search.
So, is ChatGPT at risk because of Gemini? Not at all. The data shows that Gemini is competing with GPT-4, but OpenAI will soon be coming up with GPT-5, and that is where it gets interesting. Not to mention that ChatGPT holds its own 100 million monthly active users. In addition to Microsoft and Google, there are other major players who could play an increasingly important role in the AI sector in the coming years.
One of them is Amazon, which has its own AI platform, Bedrock, along with its Titan family of models. As many of you may already know, Amazon Web Services (AWS) has the largest market share in the cloud computing segment. Another notable player is Apple, which already has its own large language model that its engineers are using internally, reportedly referred to as 'AppleGPT'. Last but not least, there's Facebook, or Meta. Meta recently launched Llama 2, a foundation language model offered with a commercial license and available for free as one of the best open-source models. This move underscores Meta's commitment to AI and its role as a major player in this space.
The introduction of Google's Gemini marks a significant stride in the evolving landscape of artificial intelligence. Positioned as a multimodal large language model, Gemini goes beyond traditional text interfaces, showcasing the potential to answer questions across diverse inputs such as text, images, video, audio and code. The model's capabilities, as demonstrated in an admittedly staged but impressive video, encompass real-time responses, code generation and even surpassing human experts in massive multitask language understanding. Multimodal seems to be the future, and we are excited to be along for the ride.