Real Time Stereoscopic Streaming

The networking model

The current streaming vision

Before introducing the model we used for this project, I will first describe the existing streaming model.
In the current vision of streaming, the receiving application “knows” exactly what kind of data it is supposed to receive, so the way to handle that data is hard-coded in both the sender and the receiver.
For instance, if we consider streaming sound, the “sending” application will do something like:
– Read the sound from a microphone or a file
– Compress the sound to save bandwidth, using some method “xxx”
– Send the compressed buffer

And of course, the receiving application will have the opposite process:
– Receive from network
– Uncompress using “xxx”
– Send to the speaker

Then… what if the developers want to change the compression method? They will probably have to modify both applications. And the problem is even bigger if they want to send video instead of audio.

So in most cases, an application written to receive a stream of a given type will not work with another type of stream, simply because it does not know how to handle it.

The proposed model

As explained before, we use custom DirectShow filters, which should be seen as low-level Windows components. They act as an interface: since the network part is done inside these components, the user of the filters does not even need to know what processing they perform.
With this model, the sending application works much like in the current model, except that it sends the raw data together with the media type.
On the receiving side, the decoding graph is built according to the transmitted media type, and NOT statically, as is often the case. The advantage is simple: the application does not need to know where the data come from, what type of data it is, or which encoding method was used.
If we consider an application displaying a remote video, it will work exactly the same way for a remote webcam as for a DivX-encoded file.
Another strength of this model is its flexibility with respect to encoding. It is quite easy to insert an encoding filter between the source filter and the sending one, and applications with dynamic encoding can easily be written on top of this component layer.
(By dynamic encoding, I mean an application that would start by sending raw data and then, if the network becomes congested, would choose an encoding method and send encoded data.)
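
To make the dynamic graph building concrete, here is a minimal sketch of a receiving application. CreateNetworkSourceFilter() and GetFirstOutputPin() are hypothetical helpers standing in for our filter and the usual pin-walking code from the DirectShow SDK samples; the IGraphBuilder calls themselves are the standard DirectShow API:

    #include <dshow.h>

    // Hypothetical helpers (not part of DirectShow): instantiate our receiving
    // filter, and fetch its first output pin as in the SDK sample code.
    IBaseFilter *CreateNetworkSourceFilter();
    IPin        *GetFirstOutputPin(IBaseFilter *pFilter);

    void BuildReceivingGraph()
    {
        CoInitialize(NULL);

        IGraphBuilder *pGraph = NULL;
        CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                         IID_IGraphBuilder, (void **)&pGraph);

        // Our receiving filter: its output pin advertises the media type that
        // was transmitted by the sender, whatever that type happens to be.
        IBaseFilter *pReceiver = CreateNetworkSourceFilter();
        pGraph->AddFilter(pReceiver, L"RTP Receiver");

        // Let DirectShow complete the graph from that pin: it inserts whatever
        // decoders match the advertised media type, ending with a renderer.
        pGraph->Render(GetFirstOutputPin(pReceiver));

        IMediaControl *pControl = NULL;
        pGraph->QueryInterface(IID_IMediaControl, (void **)&pControl);
        pControl->Run();   // error handling and cleanup omitted for brevity
    }

Nothing in this code depends on the actual stream format: replacing the webcam with a DivX-encoded file on the sending side only changes which decoders Render() inserts, not the application.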

RTP

Streaming a video (and in our case, a pair of videos) generally requires a lot of bandwidth. Moreover, the data are usually not critical, so it is not so important if a packet is lost.
TCP provides an interesting transport layer, but it is not suitable here, because of its acknowledgment and retransmission procedure (if a packet is not received, the sender will send it again), and because a TCP segment contains a lot of information that is useless for our application.
On the other hand, UDP could have been used, but it does not perform any control, and cannot tell us whether a received frame is outdated.
RTP is a protocol layered over UDP: it adds no control packets, but it allows us to transmit the pieces of information we require for real-time streaming.
Unlike UDP and TCP, RTP is an application-layer protocol, which means we had to implement it directly within the DirectShow components.

Using RTP is not hard: we merely need a UDP connection, and we include the following RTP header in every packet sent. Since it is an application-layer protocol, even though its definition is quite clear, many applications slightly adapt the use of the header, and so did we for our filter: our Payload Type is not really a payload type, the sequence number is initialized to 0 instead of a random value and reset at every new frame, and the timestamp is not really a time.
However significant those modifications may seem, we are still compliant with the RTP standard. The best proof is that DVTS can read the packets we send, and display them.

The RTP header

The RTP header contains the following fields:
    <------------------------------- 32 bits ------------------------------->
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |V=2|P|X|  CC   |M|     PT      |        sequence number        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           timestamp                           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   source identifier (SSRC)                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |             contributing source identifiers (CSRC)            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Below is the meaning of each of those fields, and the way we use them.
Version V: 2 bits. The version of the protocol, set to 2 most of the time.
Padding P: 1 bit. If set to 1, the packet contains additional padding bytes at the end, used to keep a constant packet size.
Extension X: 1 bit. If X=1, the header is followed by an extension header.
CSRC count CC: 4 bits. The number of CSRC entries following the fixed header.
Marker M: 1 bit, with a use defined by the profile. In our case, 1 means “this packet belongs to a key frame”.
Payload Type PT: 7 bits. Defines the type of the media.
Sequence number: 16 bits. Normally the initial value is set randomly and incremented for each packet, which is typically used to detect missing packets. We modified the use of this field slightly to match our needs: the sequence number is still incremented one by one, but it is reset to 0 for the very first packet of every frame, so we can use this field as an offset for the packets we receive.
Timestamp: 32 bits. Ideally, the time at which the first byte of the packet is sent. In our application, this value is simply incremented by 1 every frame, so the pair (sequence number, timestamp) is a unique identifier for our packets.
SSRC: 32 bits. Usually a random number that identifies the source.
CSRC: CC × 32 bits. The contributing sources.
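
For reference, the fixed part of the header maps naturally onto a structure. The following is a minimal sketch of how it can be declared and filled the way the text above describes; the struct and function names are mine, not from any library, and byte order is handled with the standard htons/htonl:

    #include <cstdint>
    #include <winsock2.h>   // htons, htonl

    #pragma pack(push, 1)
    struct RtpHeader {
        uint8_t  vpxcc;      // V (2 bits) | P (1 bit) | X (1 bit) | CC (4 bits)
        uint8_t  mpt;        // M (1 bit)  | PT (7 bits)
        uint16_t seq;        // sequence number, reset to 0 at each new frame
        uint32_t timestamp;  // in our filter: a frame counter, not a real time
        uint32_t ssrc;       // source identifier
        // ...followed by CC CSRC entries of 32 bits each (none in our case)
    };
    #pragma pack(pop)

    // Fill the header as described above: seq is the packet offset within the
    // frame, timestamp is the frame number, and M flags key-frame packets.
    void FillRtpHeader(RtpHeader &h, uint8_t payloadType, uint16_t packetInFrame,
                       uint32_t frameNumber, bool keyFrame, uint32_t ssrc)
    {
        h.vpxcc     = 2 << 6;                              // V=2, P=0, X=0, CC=0
        h.mpt       = (keyFrame ? 0x80 : 0x00) | (payloadType & 0x7F);
        h.seq       = htons(packetInFrame);
        h.timestamp = htonl(frameNumber);
        h.ssrc      = htonl(ssrc);
    }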

Multicast or multi-connections

A common way to broadcast media is to use multicast. Multicast packets are sent to a specific IP address belonging to the range of “multicast” addresses. A router does not “route” such a packet towards a single destination, but instead forwards it on each of its interfaces. Because the data are duplicated many times along the way, it is easy to reach many receivers. The biggest advantage of this method is its cost in time: the sender sends the data only once, and the routers do all the duplication.
On the other hand, because of this exponential duplication, networks all over the world may become congested; this is precisely why most routers have multicast disabled. We therefore had the choice between using this method to broadcast our packets and using another way: multi-connections.
“Multi-connections” is self-explanatory: a new connection is created for every client. Of course, the more clients there are, the busier the sending application becomes. However, this solution is far more elegant than multicasting our packets, which would generate large quantities of useless data, so this is the one we decided to use for our sending filter. When a client's TCP connection completes its handshake (asking for the format), we retrieve the IP address of the client, obtain a local UDP port to use, and finally create a new RTP connection, as sketched below.
The only problem with the final version is that we have not implemented the disconnection procedure yet, so when a client exits, the sending application keeps sending it data.
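
In Winsock terms, the per-client setup looks roughly like this. RegisterRtpClient() is a hypothetical stand-in for the filter's internal bookkeeping, and the destination UDP port is assumed to be agreed on during the TCP handshake:

    #include <winsock2.h>

    // Hypothetical: remember (client address, UDP socket) so that the send
    // loop duplicates every RTP packet once per connected client.
    void RegisterRtpClient(const sockaddr_in &client, SOCKET rtpSock);

    void AcceptNewClient(SOCKET listenSock, unsigned short clientRtpPort)
    {
        // Accept the control (TCP) connection on which the client asks for
        // the media type; this also gives us the client's IP address.
        sockaddr_in clientAddr;
        int addrLen = sizeof(clientAddr);
        SOCKET ctrl = accept(listenSock, (sockaddr *)&clientAddr, &addrLen);
        // ...send the media type back over 'ctrl'...

        // Create a UDP socket and let the system pick a free local port.
        SOCKET rtp = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        sockaddr_in local = {};
        local.sin_family      = AF_INET;
        local.sin_addr.s_addr = htonl(INADDR_ANY);
        local.sin_port        = 0;                 // 0 = any free local port
        bind(rtp, (sockaddr *)&local, sizeof(local));

        // The new "RTP connection": UDP packets sent to the client's address.
        clientAddr.sin_port = htons(clientRtpPort);
        RegisterRtpClient(clientAddr, rtp);
    }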

Optimizations: Regions of interest & “on the fly” compression

The first idea for optimizing network performance was to use regions of interest: in other words, to split the image or video into two distinct regions, the Region Of Interest (ROI), which is essentially the inner part of the picture, and the Region Of Non-Interest (RONI), which is of course the outer part.
The idea was to “mark” the packets so that we know whether they belong to the interesting region; the router would then use two queues and sort the packets accordingly. If congestion occurs, the router starts dropping packets from the RONI queue.
To mark the packets, the sending filter simply computes a Euclidean distance between the position, in the frame, of the data about to be sent and the center of the screen. The good point of this idea is that the optimization is performed by the routers, not by the sending computer. On the other hand, this method has three major drawbacks:
1) The code used in our “test” router would have to be deployed on routers all over the world.
2) The sending filter is designed to accept any kind of stream, so the method is meaningless if the stream is not an uncompressed bitmap video; for compressed streams, or for streams other than video, it obviously cannot work. Indeed, most of today's webcams and digital cameras hardware-compress the video they produce, which means we would need to decompress it before sending if we wanted to use a ROI/RONI method.
3) A router can usually read only the IP header of a packet, and unlike RTP, IP is not an application-layer protocol. We therefore used the TOS field of the IP header to store this information, but doing so is not as easy as changing a field in the RTP header, which is handled entirely in the application layer (see the sketch below).
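
Under the assumptions above (an uncompressed 24-bit bitmap stream with known dimensions), the marking step can be sketched as follows. The ROI radius and the two TOS values are choices made for this sketch, not standard values, and on recent versions of Windows setting IP_TOS may additionally require system configuration:

    #include <winsock2.h>
    #include <ws2tcpip.h>   // IP_TOS
    #include <cmath>

    // Does a packet starting at byte 'offset' of a raw 24-bit frame belong to
    // the region of interest? Convert the offset to pixel coordinates, then
    // compare the Euclidean distance to the screen center with a radius.
    bool IsRegionOfInterest(int offset, int width, int height, double radius)
    {
        int pixel = offset / 3;              // 3 bytes per pixel (24 bits)
        double dx = (pixel % width) - width  / 2.0;
        double dy = (pixel / width) - height / 2.0;
        return std::sqrt(dx * dx + dy * dy) <= radius;
    }

    // Mark the outgoing packet for the routers by writing the TOS byte of the
    // IP header; the two TOS values here are arbitrary.
    void MarkPacket(SOCKET s, bool roi)
    {
        int tos = roi ? 0x10 : 0x00;
        setsockopt(s, IPPROTO_IP, IP_TOS, (const char *)&tos, sizeof(tos));
    }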

Since the sending and receiving filters are part of the Windows media subsystem, it is very easy to combine them with other components. While we were working on optimizing the display, we tried putting a DivX encoder before the RTP sender. Of course, we expected huge latency, since this implies on-the-fly compression on one side and real-time decompression on the remote computer. But we tried anyway, mainly because it did not require any development, and oddly enough, we observed no noticeable latency at all.

An uncompressed stereoscopic stream (on the left), from 640×480, 24-bit cameras, uses about 80 Mbps. The same stream DivX-encoded (on the right) uses about 1 Mbps, i.e. 1.25% of the bandwidth used without compression.
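As a sanity check on these figures: one 640×480 frame at 24 bits per pixel is 640 × 480 × 24 ≈ 7.4 Mbit, so a stereo pair is about 14.7 Mbit per frame, and 80 Mbps then corresponds to roughly 5 to 6 stereo frames per second (the frame rate is my inference from these two numbers, not a separate measurement).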

The problem with using a compression codec before sending is that it makes the ROI/RONI method inapplicable: marking a packet from a compressed buffer is meaningless, since the byte offset in the compressed data no longer corresponds to a position on the screen.

The final optimization idea is to combine ROI/RONI with a compression method. Since it is impossible to mark packets from a compressed buffer, the idea is first to create a stream splitter that splits the stream into the two regions; we can then compress each region and use two sending filters that mark the packets properly.

A graph using the ROI/RONI method and some encoding filters.

With this method, the main stream is split into an “interesting” stream and a “non-interesting” stream, which are compressed individually and sent using different sending filters. By the end of my internship I had a working prototype of the splitter, but the multiplexing filter (the opposite operation: taking two streams and producing a single one) still needs to be written. In addition, no application has been written to produce this kind of graph: it must be built manually, and the purpose of the splitting filter is only to test whether the ROI/RONI method is suitable for real-time compressed transmission.
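
The intended topology can be sketched as follows (the names are descriptive, not the actual filter class names):

    camera source --> ROI/RONI splitter --+--> ROI stream  --> encoder --> RTP sender (packets marked ROI)
                                          +--> RONI stream --> encoder --> RTP sender (packets marked RONI)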
