Breakthrough in FPGAs could make custom chips faster, larger

Today we are worshipping the gods of the algorithm, according to one prominent magazine. It’s not a bad comparison. Everything from search results to our machine learning efforts rests on a series of equations that purport to solve for something that feels almost ineffable, human. Teaching a computer to see. Figuring out how to turn our comings and goings into a schedule. Understanding our thermostat settings and turning those into a schedule, too.

But if our new gods are algorithms, then the chips that are performing those complicated equations are their shrines, and the more specific the shrine, the better your prayers work. The Greeks knew that. They built shrines to each of their individual gods with statues, symbols and other trappings of faith specific to their deity of choice. When it comes to algorithms, computer scientists are less invested in faith, but they are aware that their equations run faster or more efficiently on a specially designed piece of silicon. And because algorithms change over time while hardware usually stays the same, the flexibility to reprogram your hardware to match your changing algorithm becomes essential. That’s why big companies like Intel and Microsoft are turning to chips called field-programmable gate arrays, or FPGAs.

Intel marries custom cores to its x86 architecture to help large data center customers (like [company]eBay[/company] or [company]Facebook[/company]) improve their performance. When you are worshipping algorithms, a custom shrine makes those prayers work better, and a shrine that changes with the algorithm is the best of both worlds. But like all religions, using FPGAs exacts a price.

The challenge with these chips is that they are slower than general-purpose processors like x86 or ARM-based cores. By making them software-programmable (handy for algorithms you might want to change later) and more flexible, you sacrifice speed in getting information on and off the chip. There is generally a bottleneck when shuttling information to an FPGA, so while it can solve problems very quickly and can adapt to solve different problems with a minor change in programming, sending it the data it needs to solve those problems slows things down.
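
To see why that data path matters, here is a rough back-of-envelope model in Python. The bandwidth and compute figures are illustrative assumptions rather than measurements of any particular part; the point is simply that when the link to the accelerator is slow, total time is dominated by moving data rather than crunching it.

```python
# Back-of-envelope model of an accelerator whose compute is fast but whose
# data link is slow. All numbers below are illustrative assumptions.

def job_time(data_bytes, link_gbps, compute_gops, ops_per_byte):
    """Seconds spent shipping data over the link and seconds spent processing it."""
    transfer_s = data_bytes * 8 / (link_gbps * 1e9)                # time on the wire
    compute_s = data_bytes * ops_per_byte / (compute_gops * 1e9)   # time in the logic
    return transfer_s, compute_s

data = 1 << 30  # 1 GiB of input to process

# Off-chip FPGA hanging off a relatively slow bus (assumed 8 Gbit/s effective).
t_off, c_off = job_time(data, link_gbps=8, compute_gops=500, ops_per_byte=50)

# The same logic embedded on the die with a wide interconnect (assumed 400 Gbit/s).
t_on, c_on = job_time(data, link_gbps=400, compute_gops=500, ops_per_byte=50)

print(f"off-chip: transfer {t_off:.2f}s vs compute {c_off:.2f}s")
print(f"on-chip:  transfer {t_on:.2f}s vs compute {c_on:.2f}s")
```

Under those assumed numbers the off-chip version spends roughly ten times longer moving data than computing, while the on-die version flips that ratio, which is the whole argument for integration.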

But for certain applications, such as search engine algorithms or even Microsoft’s recent choice to use FPGAs for neural networks, the flexibility of being able to tweak your hardware is more important than the performance hit. But what if, in exchange for a larger piece of silicon, you didn’t have to take the performance hit? That’s the premise behind Flex Logic, a startup that launched this week with less than $10 million in funding and the IP for an FPGA that is both flexible and wired so differently that it doesn’t create a bottleneck in getting data onto the core.

Flex Logic CEO Geoff Tate explained that the company has changed the wiring inside the FPGA so that, instead of sitting outside the processor, the FPGA can be placed directly on the chip, making it part of an integrated package or system-on-chip (SoC).

This makes the total area of the eventual chip larger, but it boosts performance and lowers the overall cost. The Flex Logic cores can also snap together, which makes the design of these FPGAs fairly flexible and modular. Flex Logic is launching with a product called the ESLX core in a variation that offers 2,500 LUTs, or lookup tables (a measure of logic capacity in FPGAs). This core can be combined with other ESLX cores to give a company more performance, and each one adds about 15 cents to the cost of the overall device. That cost is mitigated by putting the FPGA on the chip as part of an SoC, however.
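
A quick bit of arithmetic, using only the figures quoted above (2,500 LUTs and roughly 15 cents per core) and treating the required logic budget as an assumption, shows how the modular approach scales: pick however many cores cover your design and the incremental cost stays a matter of cents.

```python
# Rough sizing helper based on the figures quoted above: each ESLX core offers
# 2,500 LUTs and adds roughly $0.15 to the device. The required-LUT figure in
# the example is an arbitrary assumption.
import math

LUTS_PER_CORE = 2_500
COST_PER_CORE = 0.15  # dollars, approximate

def cores_needed(required_luts):
    """How many snap-together cores cover the requested amount of logic."""
    return math.ceil(required_luts / LUTS_PER_CORE)

required = 9_000  # assumed LUT budget for some hypothetical design
n = cores_needed(required)
print(f"{n} cores -> {n * LUTS_PER_CORE} LUTs, about ${n * COST_PER_CORE:.2f} added cost")
# 4 cores -> 10000 LUTs, about $0.60 added cost
```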

The initial sample chip is in the company’s hands and customers are testing it, with the first chip expected to be in products later this year, said Tate. Because Flex Logic is selling IP, much like [company]ARM[/company] does, rather than the silicon itself, Tate expects that it will be able to translate its designs fairly rapidly to the demands of the market. It plans to make larger and smaller versions of its ESLX core, as well as a 40-nanometer version to complement its current 28-nanometer one, but Tate is waiting to see what the market demands.

He expects the products to first appear in the networking and communications space. Other possible applications for the cores include encryption in the security field or software-defined radios, which could be tuned to different radio protocols as needed. If we can make faster, flexible chips, this is truly a breakthrough worth investigating. I’ll be keeping an eye on Flex Logic to see which customers it signs up and what tradeoffs its technology demands in the field.

TeraDeep wants to bring deep learning to your dumb devices

Open the closet of any gadget geek or computer nerd, and you’re likely to find a lot of skeletons. Stacked deep in a cardboard box or Tupperware tub, there they are: The remains of webcams, routers, phones and other devices deemed too obsolete to keep using and left to rot, metaphorically speaking, until they eventually find their way to a Best Buy recycling bin.

However, an under-the-radar startup called TeraDeep has developed a way to revive at least a few of those old devices by giving them the power of deep learning. The company has built a module that it calls the CAMCUE, which runs on an ARM-based processor and is designed to plug into other gear and run deep neural network algorithms on the inputs they send through. It could turn an old webcam into something with the smart features of a Dropcam, if not smarter.

“You can basically turn our little device into anything you want,” said TeraDeep co-founder and CTO Eugenio Culurciello during a recent interview. That potential is why the company won a Structure Data award as one of the most promising startups to launch in 2014, and it will be presenting at our Structure Data conference in March.

Didier Lacroix (left) and Eugenio Culurciello (right)

But before TeraDeep can start transforming the world’s dumb gear into smart gear, the company needs to grow — a lot. It’s headquartered in San Mateo, California, and is the brainchild of Culurciello, who moonlights as an associate professor of engineering at Purdue University in Indiana. It has 10 employees, only three of whom are full-time. It has a prototype of the CAMCUE, but isn’t ready to start mass-producing the modules and getting them into developers’ hands.

I recently saw a prototype of it at a deep learning conference in San Francisco, and was impressed by how well it worked, albeit in a simple use case. Culurciello hooked the CAMCUE up to a webcam and to a laptop, and as he panned the camera, the display on the computer screen would flag the presence of a human whenever I was in the shot.

“As long as you look human-like, it’s going to detect you,” he said.

The prototype system can be set to detect a number of objects, including iPhones, which it was able to do when the phone was held vertically.

The webcam setup on a conference table.
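
TeraDeep’s nn-x software and the CAMCUE itself aren’t publicly documented here, so as a stand-in, here is a minimal sketch of the same kind of demo using OpenCV’s stock HOG person detector on an ordinary webcam. It is an assumption-laden analogue of what the prototype does, not TeraDeep’s code, and it runs on a CPU rather than dedicated neural-network hardware.

```python
# Rough stand-in for the demo described above, using OpenCV's stock HOG person
# detector rather than TeraDeep's nn-x stack (which is not public).
# Requires: pip install opencv-python
import cv2

hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cap = cv2.VideoCapture(0)  # first attached webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Detect people in the frame; returns bounding boxes and confidence weights.
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
    if len(boxes) > 0:
        print("human detected")  # the demo's laptop display did roughly this
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("camera", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```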

TeraDeep also has developed a web application, software libraries and a cloud platform that Culurciello said should make it fairly easy for power users and application developers at first, and then perhaps everyday consumers, to train TeraDeep-powered devices to do what they want them to do. It could be “as easy as uploading a bunch of images,” he said.

“You don’t need to be a programmer to make these things do magic,” TeraDeep CEO Didier Lacroix added.
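
TeraDeep hasn’t published that workflow, but a loudly hypothetical sketch of what “uploading a bunch of images” could look like is below. The endpoint, field names and token are invented placeholders, not TeraDeep’s API; the point is only that training a new detector could be a single authenticated upload.

```python
# Purely hypothetical sketch of "train by uploading a bunch of images".
# The URL, field names and token below are invented placeholders, not a real API.
import pathlib
import requests

API = "https://example.invalid/v1/detectors"  # placeholder endpoint
TOKEN = "YOUR_API_TOKEN"                      # placeholder credential

files = [
    ("images", (p.name, p.read_bytes(), "image/jpeg"))
    for p in pathlib.Path("my_cat_photos").glob("*.jpg")
]
resp = requests.post(
    API,
    headers={"Authorization": f"Bearer {TOKEN}"},
    data={"label": "my_cat"},  # what the new detector should recognize
    files=files,
    timeout=60,
)
resp.raise_for_status()
print("training job started:", resp.json())
```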

But Culurciello and Lacroix have bigger plans for the company’s technology — which is the culmination of several years of work by Culurciello to develop specialized hardware for neural network algorithms — than just turning old webcams into smarter webcams. They’d like the company to become a platform player in the emerging artificial intelligence market, selling embedded hardware and software to fulfill the needs of hobbyists and large-scale device manufacturers alike.

A TeraDeep module, up close.

It already has a few of the pieces in place. Aside from the CAMCUE module, which Lacroix said will soon shrink to about the surface area of a credit card, the company has also tuned its core technology (called nn-x, or neural network accelerator) to run on existing smartphone platforms. This means developers could build mobile apps that do computer vision at high speed and low power without relying on GPUs.

TeraDeep has also worked on system-on-a-chip designs for partners that might want to embed more computing power into their devices. Think drones, cars and refrigerators, or smart-home gadgets a la the Amazon Echo and Jibo that rely heavily on voice recognition.

Lacroix said all the possibilities, and the interest the company has received from folks who’ve seen and heard about the technology, are great, but he noted that they might lead such a small company to suffer from a lack of focus or option paralysis.

“It’s overwhelming. We are a small company, and people get very excited,” he said. “… We cannot do everything. That’s a challenge for us.”

AWS can’t keep its new SSD-backed instances in stock

Amazon Web Services doesn’t have enough capacity to handle demand for its new C3 instances, which has led to a rush order of new servers. In almost any other scenario, that would mean a big payday for someone like Dell or HP.

Researchers try to make microchips more efficient by making them smarter

A new research project from Carnegie Mellon University, funded by a $2.6 million grant from the National Science Foundation, aims to make microchips smarter and more efficient by analyzing the data they collect about themselves. The Statistical Learning in Chip project is focused on developing an integrated machine learning engine that can help chips dynamically manage their resource consumption and keep it at optimum levels. This would make the chips, and the devices running on them, more energy-efficient, resulting in longer battery life and cooler operating temperatures.
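
As a minimal illustration of the general idea (and not the CMU project’s actual system), the sketch below uses a chip’s own utilization telemetry to predict near-term load and pick a clock level, rather than reacting with a fixed rule. The frequency levels, headroom target and the simple weighted predictor are all assumptions.

```python
# Minimal illustration of learning-driven resource management (not the CMU
# project's system): predict near-term load from a chip's own telemetry and
# choose a clock level with some headroom, instead of using a fixed rule.
from collections import deque

LEVELS_MHZ = [600, 1200, 1800, 2400]  # assumed available clock steps
history = deque(maxlen=8)             # recent utilization samples, 0.0 to 1.0

def predict_next(samples):
    """Tiny 'learned' predictor: a weighted average that favors recent samples."""
    if not samples:
        return 0.5
    weights = range(1, len(samples) + 1)
    return sum(w * s for w, s in zip(weights, samples)) / sum(weights)

def choose_frequency(utilization_sample):
    history.append(utilization_sample)
    expected = predict_next(history)
    # Pick the lowest clock that still leaves ~20% headroom over expected load.
    for mhz in LEVELS_MHZ:
        if expected <= 0.8 * (mhz / LEVELS_MHZ[-1]):
            return mhz
    return LEVELS_MHZ[-1]

for sample in [0.2, 0.25, 0.3, 0.7, 0.9, 0.85]:
    print(f"load {sample:.2f} -> {choose_frequency(sample)} MHz")
```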

Up next: software-defined caching on your processors

An MIT professor has conducted some handy research that could help make applications run faster and use less energy by overcoming an inherent drawback of multicore processors. The problem is that although the local caches on chips save them the latency of having to access RAM, the hardware-wired algorithms powering them often assign data to cache locations randomly, without considering the core trying to access it. The new software-based technique, called Jigsaw, tracks which cores are accessing what data, and how much, and assigns data locations accordingly. The paper detailing Jigsaw is available here.
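
The paper has the real mechanism; the toy sketch below only illustrates the underlying idea under simplifying assumptions: count which core touches each block of data most, then place that block in the cache bank belonging to that core rather than hashing it to an arbitrary one.

```python
# Toy sketch of the idea behind Jigsaw (not MIT's implementation): track which
# core touches each piece of data most, then place that data in the cache bank
# nearest that core instead of assigning it to a random bank.
from collections import Counter, defaultdict

access_counts = defaultdict(Counter)  # data block -> {core id: access count}

def record_access(core_id, block_id):
    access_counts[block_id][core_id] += 1

def place_blocks():
    """Assign each block to the bank of the core that accesses it most often."""
    return {block: counts.most_common(1)[0][0]
            for block, counts in access_counts.items()}

# Simulated accesses: core 0 hammers block "A", core 3 mostly reads block "B".
for _ in range(90):
    record_access(0, "A")
for _ in range(10):
    record_access(2, "A")
for _ in range(40):
    record_access(3, "B")

print(place_blocks())  # {'A': 0, 'B': 3}
```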

Video: Next dual-core ARM chips faster than today’s quad-cores

Sure, the latest dual- or quad-core chip in your mobile device looks sprightly, but have you seen the next generation of ARM chips? Texas Instruments is showing off its dual-core OMAP 5 processor in a browser test against a quad-core chip, and it’s nearly twice as fast.