
Performance Primitives for Embedded Vision


DSP performance library and development framework for embedded image processing


PfeLib accelerates and eases the implementation of high-performance computer vision algorithms on embedded real-time platforms. Developers of computer vision algorithms often need functions that are not available in third-party libraries. The concept behind PfeLib therefore goes far beyond simply providing a set of high-performance image processing routines.

We offer software licenses for PfeLib as well as custom development and services around PfeLib.

Target platforms for PfeLib

  • TI C6000 DSP platforms (C64x, C67x, DM64x)
  • Generic platform for test and validation
  • X86 platform (wraps to Intel Performance Primitives)

Features

  • Predefined and proven framework for cross- and multiplatform development
  • PfeRtdxHost: Development tool that enables verification and performance optimization directly on the embedded platform
  • ROS-DMA: Systematic approach to overcome the bottleneck of limited memory resources on embedded systems
  • Contains a selection of highly optimized algorithms e.g. linear filters, arithmetic operations, warping, ...

PfeRtdxHost: Image data transfer between development host and DSP target


The above screenshot shows images connected to the DSP target via RTDX channels. The leftmost image contains raw camera data that serves as input for the Bayer demosaicing function running on a C6416 DSP. The middle image displays the results received from the computation on the DSP. The rightmost image displays deviations of the result image from a reference image in a differential viewing mode. The screenshot additionally demonstrates the ROI capabilities of PfeRtdxHost and PfeLib.
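The idea behind a differential view restricted to a region of interest can be sketched in a few lines of C. This is only an illustration for 8-bit grayscale images: the `Roi` struct and the function `diff_image_roi` are hypothetical names and are not part of the PfeLib or PfeRtdxHost API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region of interest, given in pixel coordinates. */
typedef struct {
    size_t x, y;          /* top-left corner */
    size_t width, height; /* extent in pixels */
} Roi;

/* Writes |result - reference| into diff for every pixel inside roi and
 * returns the largest deviation found. Pixels outside roi are not touched.
 * All three images share the same row stride (in pixels) and row count. */
uint8_t diff_image_roi(const uint8_t *result, const uint8_t *reference,
                       uint8_t *diff, size_t stride, size_t rows, Roi roi)
{
    uint8_t max_dev = 0;

    assert(roi.x + roi.width <= stride);
    assert(roi.y + roi.height <= rows);

    for (size_t y = roi.y; y < roi.y + roi.height; ++y) {
        for (size_t x = roi.x; x < roi.x + roi.width; ++x) {
            size_t i = y * stride + x;
            uint8_t d = (result[i] > reference[i])
                            ? (uint8_t)(result[i] - reference[i])
                            : (uint8_t)(reference[i] - result[i]);
            diff[i] = d;
            if (d > max_dev)
                max_dev = d;
        }
    }
    return max_dev;
}
```

A return value of 0 means the result matches the reference exactly inside the region; any other value gives the worst-case deviation, which is what a differential display makes visible.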

DMA double buffering with Resource Optimized Slicing


Performance diagram for a Bayer demosaicing function on a C6416 DSP: The consumed CPU cycles per pixel are plotted over various image sizes (fewer cycles means faster execution). Each plot represents a memory configuration.
IRAM: All image data is in the on-chip memory. Yields optimum performance, but often infeasible in practice.
L2CACHE: Image data in external SDRAM, 64 kByte L2 cache is used. Significant performance drawbacks.
ROS-DMA: Image data in external SDRAM, but with PfeLib's ROS-DMA double buffering technology enabled instead of L2 cache.
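The general principle of DMA double buffering can be sketched as follows. This is a simplified illustration, not PfeLib's ROS-DMA implementation: the names `dma_copy`, `process_slice` and `invert_double_buffered`, the slice height and the pixel inversion kernel are all assumptions, and the DMA transfer is simulated with a synchronous `memcpy` so that the code runs on a generic platform. On a real DSP the transfer of the next slice would run asynchronously while the CPU processes the current one, and the slice size would be chosen to fit the available on-chip memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLICE_ROWS 4   /* hypothetical slice height in rows */
#define MAX_WIDTH  64  /* hypothetical maximum row length in pixels */

/* Stand-in for a DMA transfer. A DSP port would start an asynchronous
 * transfer here and wait for its completion before using the data. */
static void dma_copy(uint8_t *dst, const uint8_t *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Example kernel: invert every pixel of one slice in place. */
static void process_slice(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buf[i] = (uint8_t)(255u - buf[i]);
}

/* Number of bytes in the slice that starts at first_row. */
static size_t slice_bytes(size_t width, size_t height, size_t first_row)
{
    size_t rows = height - first_row;
    if (rows > SLICE_ROWS)
        rows = SLICE_ROWS;
    return rows * width;
}

/* Processes src into dst slice by slice, alternating between two small
 * buffers that stand in for on-chip memory. Returns the slice count. */
size_t invert_double_buffered(const uint8_t *src, uint8_t *dst,
                              size_t width, size_t height)
{
    static uint8_t onchip[2][SLICE_ROWS * MAX_WIDTH];
    size_t slices = 0;
    int cur = 0;

    assert(width > 0 && width <= MAX_WIDTH);
    if (height == 0)
        return 0;

    /* Fetch the first slice before the loop starts. */
    dma_copy(onchip[cur], src, slice_bytes(width, height, 0));

    for (size_t row = 0; row < height; row += SLICE_ROWS) {
        size_t n = slice_bytes(width, height, row);
        size_t next_row = row + SLICE_ROWS;

        /* Fetch the next slice into the other buffer. */
        if (next_row < height)
            dma_copy(onchip[cur ^ 1], src + next_row * width,
                     slice_bytes(width, height, next_row));

        process_slice(onchip[cur], n);            /* work on current slice */
        dma_copy(dst + row * width, onchip[cur], n); /* write result back */

        cur ^= 1; /* swap buffer roles */
        ++slices;
    }
    return slices;
}
```

Because the kernel only ever touches the small buffers, the external SDRAM is accessed exclusively through block transfers. Kernels with a pixel neighborhood (filters, demosaicing) would additionally need overlapping rows between slices, which this sketch leaves out.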

Systematic Performance Optimizations on DSP platform:


Enormous performance gains can be achieved on a TI C6416 DSP when starting from ordinary ANSI C code. The diagram illustrates the effects of several performance optimizations of a PfeLib routine. PfeLib provides an effective framework that supports such optimizations: performance monitoring, visual verification against the unoptimized functional reference code, and full support for development on DSP simulators. PfeLib supports the creation, development, optimization and testing of user-specific functions within the same environment already used for PfeLib's built-in functions.
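Verifying an optimized routine against its unoptimized reference can be sketched in plain C. The example below is an assumption for illustration only: the saturating image addition, the manually unrolled variant and the names `add_sat_ref`, `add_sat_opt` and `first_mismatch` are hypothetical and do not come from PfeLib. A DSP-specific version would typically use compiler intrinsics in place of the unrolled loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a sum of two 8-bit pixels to the 8-bit range. */
static uint8_t sat8(unsigned s)
{
    return (s > 255u) ? (uint8_t)255u : (uint8_t)s;
}

/* Reference version: straightforward, easy to check by inspection. */
void add_sat_ref(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = sat8((unsigned)a[i] + b[i]);
}

/* "Optimized" version: four pixels per iteration plus a remainder loop. */
void add_sat_opt(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = sat8((unsigned)a[i]     + b[i]);
        out[i + 1] = sat8((unsigned)a[i + 1] + b[i + 1]);
        out[i + 2] = sat8((unsigned)a[i + 2] + b[i + 2]);
        out[i + 3] = sat8((unsigned)a[i + 3] + b[i + 3]);
    }
    for (; i < n; ++i)
        out[i] = sat8((unsigned)a[i] + b[i]);
}

/* Returns the index of the first differing pixel, or -1 if both
 * buffers are identical. */
long first_mismatch(const uint8_t *x, const uint8_t *y, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i)
        if (x[i] != y[i])
            return (long)i;
    return -1;
}
```

Running both versions on the same input and checking that `first_mismatch` returns -1 gives a simple pass/fail criterion; image sizes that are not a multiple of four also exercise the remainder loop.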

Facts
General

  • Library framework with defined code structure and helper routines
  • Basic set of performance-optimized routines already included
  • Expandable for customer-specific needs

PfeLib framework

  • Well-defined, uniform API
  • Test, verification and optimization with PfeRtdxHost
  • Several platform-specific optimizations can coexist
  • Region of Interest (ROI) capability
  • ROS-DMA double buffering
  • Performance timing support

PfeRtdxHost

  • Win32 GUI application
  • Transfers input image data to the DSP via TI's RTDX
  • Receives result images for visualization and functional verification
  • Loads and saves a variety of image file formats such as bmp, jpg, png, tiff…
  • Supports 8, 16 and 32(!) bit grayscale and 24 bit RGB images
  • Differential visualization mode for comparing result images with reference results

ROS-DMA

  • DMA double buffering implementation with Resource Optimized Slicing for TI C6x DSP platforms
  • Avoids the use of L2 cache
  • Eases the problem of limited on-chip memory on DSPs
  • Up to six times faster processing compared to using L2 cache