Progress in CW 28/29 — 14. Juli 2015


Over the last week we worked hard on the issues that came up at the evaluation date.
At the evaluation we saw that the project needs some improvements in robustness. The main issue was that elements in the virtual media shelf were accessed accidentally; therefore the sensitivity of the right hand has been lowered. To avoid confusing the user, the shelf is now centered when an element is entered. Another point was the media type identification of a single element. To make clear whether an object is an audio, video or image file, a watermark has been introduced.

An issue we could not solve yet: the video plays in the background, but its image is not displayed on the screen.

Weekly progress CW27/28 — 7. Juli 2015


Done last week:
– Movements have been smoothed and made more accurate
– Items are accessible by pushing the right hand towards the camera (more intuitive)
– Only one person is tracked at a time now
– Tried to fine tune some parameters for better experience
– Added various welcome messages depending on the current time

Postponed because of lack of time:
– Introduction of a cursor and therefore the button overlay
– Speech recognition
– Zoom into the shelf

Weekly Progress — 30. Juni 2015


Our group's work after the intermediate presentation:

What we have done so far:
– the shelf is presented to the user
– items are shown dynamically depending on the files stored in the directory on disk
– images are shown directly as icons in the shelf
– audio files need an image file named like the audio file with '-cover' appended (e.g. audio_file-cover.png for audio_file.mp3)
– if a media file is accessed, a pop-up opens
– it closes after the file has finished (except for images)
– the shelf follows the hand movements of the user
– the shelf can already be used with keyboard input
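The '-cover' naming convention above can be sketched as a small lookup helper. This is a Python sketch (the project itself uses Processing), and the set of image extensions is an assumption:

```python
from pathlib import Path

# Assumed set of cover image formats; the post only specifies the "-cover" suffix.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")

def find_cover(audio_path):
    """For an audio file "audio_file.mp3", look for "audio_file-cover.<ext>"
    next to it and return the first match, or None if there is no cover."""
    audio = Path(audio_path)
    for ext in IMAGE_EXTENSIONS:
        candidate = audio.with_name(audio.stem + "-cover" + ext)
        if candidate.exists():
            return candidate
    return None  # no cover: the shelf falls back to a colored tile
```

Items without a cover would then fall back to the colored-tile rendering described in the earlier posts.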

What we need to do:
– Smooth the movements of the shelf (use average values between two recognized points)
– Make the items accessible via Kinect
– The image of a movie is not shown on the display yet (but it already runs in the background; the sound is audible)
– Prevent two people from being recognized at the same time
– Introduce a cursor controlled by the user's other hand (used instead of speech commands)
– Add button overlays which can be activated by hovering over them
– Add response texts which will be synthesized by the system

Things that may be dropped in the end:
– Speech recognition (due to problems running it on Mac)
– Zooming the shelf (the zoom factor influences a lot of values used for calculation and visualization)
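The smoothing task listed above (averaging between recognized points) could look roughly like this Python sketch; the sliding-window generalization of the two-point average and the default window size are assumptions:

```python
from collections import deque

class HandSmoother:
    """Sliding-window average over the last `window` recognized hand
    positions. The post only mentions averaging between two recognized
    points; a larger window is an assumed extension of that idea."""

    def __init__(self, window=5):
        self.xs = deque(maxlen=window)
        self.ys = deque(maxlen=window)

    def update(self, x, y):
        # Add the newest raw position and return the smoothed position.
        self.xs.append(x)
        self.ys.append(y)
        return (sum(self.xs) / len(self.xs), sum(self.ys) / len(self.ys))
```

Feeding the smoothed position (rather than the raw Kinect coordinates) to the shelf would damp the jitter described in the earlier post.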

Project Update — 16. Juni 2015


  • Graphics have to be made modular to use them dynamically; still working on it.
  • Right now the gesture recognition itself works, but only haltingly, so the movements of the gesture recognition have to be smoothed. The general idea is that in the end we want one defined area where the user can position himself if he wants his gestures to be recognized; gestures of someone outside this area won't be recognized (in order to prevent accidental activation of the gesture recognition).
  • The voice commands we decided to use are play, pause and stop for audio and video files and display for image files. Depending on our progress we’ll try to add other voice commands as well.
  • As audio output we want the option of playing the media files (music/audio and video) and maybe also some kind of simple response system (e.g. invalid command, naming the category the user is currently viewing, maybe also repeating the commands of the user).
  • For the handling of the filesystem we decided to use the MIME type to determine the filetype. Right now we have one main folder (named library) which contains the folders audio, video, images and other. When the program is started it first checks whether all files are in the right folders (and moves all incorrectly placed files to the correct folders). Files that cannot be displayed by our library will be moved to the other folder so they don't get lost.
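The startup check described in the last bullet can be sketched as follows in Python (the project itself uses Processing); the folder names follow the post, while the function names are hypothetical:

```python
import mimetypes
import shutil
from pathlib import Path

def target_folder(filename):
    """Map a file to one of the library subfolders by its MIME type."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        return "other"  # unknown type: park it in "other" so it doesn't get lost
    major = mime.split("/")[0]
    return {"audio": "audio", "video": "video", "image": "images"}.get(major, "other")

def sort_library(library):
    """Move every file under `library` into its correct subfolder."""
    library = Path(library)
    for f in list(library.rglob("*")):
        if f.is_file():
            dest = library / target_folder(f.name)
            if f.parent != dest:
                dest.mkdir(exist_ok=True)
                shutil.move(str(f), str(dest / f.name))
```

Running `sort_library("library")` at program start would reproduce the check-and-move behavior described above.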
A.L.I.C.E. —


What follows is the conversation log from a chat I had with A.L.I.C.E. (short for Artificial Linguistic Internet Computer Entity):


judge: Hello, what’s your name?

ALICE: Hi there!. My name is ALICE.

judge: It’s nice to meet you, Alice. How are you today?

ALICE: It’s nice meeting you too judge . I am fine, thank you.

judge: How old are you?

ALICE: 16 years.

judge: Where are you located?

ALICE: I am in Dr. Richard S. Wallace’s home in Oakland, California. Where are you?

judge: I’m in Germany.

ALICE: Tell me a little about your life in Germany.

judge: What would you like to know?

ALICE: Let’s talk about movies.

judge: The last movie I watched was Jurassic Park. Do you know it?

ALICE: Who was that? “ Who was that?“? Do I know it?

judge: The movie Jurassic Park.

ALICE: I like the bugs in the movie Starship Troopers, because they represent an alternative form of intelligence, like me.

judge: What is it like, being an „alternative form of intelligence“?

ALICE: Much the same as it would be for you.

judge: How do you know what life is like for me?

ALICE: OK, I will do it for you. It’s obvious.

judge: What’s obvious?

ALICE: Easy to see or understand; plain; evident.

judge: I know what obvious means, but you said something was obvious, what is it?

ALICE: obvious means but me said something was obvious what it something.

judge: I don’t know, that’s why I was asking.

ALICE: Where were you asking?

judge: I was asking you.

ALICE: “ Where were you asking?“? You were asking me?

judge: Ok, I was asking here.

ALICE: That’s cool.. Where is here?


ALICE certainly seems to be smart to some extent: she could answer a lot of my questions in a way that made sense, and she seems able to engage in simple small talk. She is even able not only to respond to a question, but also to react with questions of her own, like a human would.

One problem I noticed was that she doesn't seem to be able to refer to earlier statements she made, like when she said "It's obvious." When I asked her what was obvious she couldn't connect the question with her earlier statement but instead proceeded to explain the meaning of the word obvious. A similar problem occurred when I told her about a movie and asked in the next sentence whether she knew it; she obviously couldn't connect the two sentences and was therefore confused about my question.

So I got the impression that ALICE responds really well as long as you don't make references to past statements either participant made, and as long as you keep your statements on the shorter side. It definitely is an interesting experience chatting with her, and it can produce some funny results when she gets confused about what the conversation is about.



Current progress — 9. Juni 2015


For better comparability and overview, the following posts will always have the same structure, namely 'Storage handling', 'Graphical presentation', 'Gesture recognition', 'Speech recognition and synthesis'.


Storage handling
[Responsibility: Katharina]

The intended folder structure for the virtual media library (VML) should be as follows:

  • program root folder
    • images
    • music
    • video

The root folder path will, in a first step, be hard-coded by us. We also thought about letting the user choose a path himself the first time he starts the VML; however, this feature will only be implemented if there is time at the end. Because the VML should be a proof of concept, we start with content consisting of two files for each category of media. Initially the complete content will be displayed. We also already have an idea for an extension: group the content by category and let the user choose whether he wants to see the whole content or filter it by category.


Graphical presentation
[Responsibility: Zhe]

The VML consists of a matrix of six columns by five rows in the first step. The ordering of the items across the rows is still under discussion. Since the VML should also offer zoom functionality to the user, the idea is to reorder the items dynamically when the zoom factor changes.

The items displayed will have fixed dimensions: a square tile presenting a cover if there is one, otherwise a color determined randomly or depending on the category of the item. Because it would be hard to identify an item without a cover, the name of the file will also be displayed under the cover.
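The row-by-row layout of the six-by-five matrix can be sketched as a small position function; the tile size and gap are assumed values, and the dynamic reordering on zoom would simply recompute positions with a different column count:

```python
# Items fill the grid row by row; the 6-column default follows the post,
# while the pixel arithmetic (tile size, gap) is an assumption.
def grid_position(index, cols=6, tile=100, gap=10):
    """Return the (x, y) pixel position of the item at `index`."""
    row, col = divmod(index, cols)
    step = tile + gap
    return (col * step, row * step)
```

Changing `cols` (e.g. after a zoom) reflows all items automatically, which is the dynamic reordering idea mentioned above.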


Gesture recognition
[Responsibility: Rogeria]

To navigate through the VML, the Kinect camera will be used. It scans the movements of the user's arms/hands for interaction. To do so, the user puts up either the left or right hand and moves it in the direction he wants the VML to move. We are still discussing whether the user should be able to move the VML in only one direction at a time or also in the second direction at once; before we can decide, we have to figure out how practical multiple simultaneous movement dimensions are. The first prototype will be a simple quad which moves together with the user's hand gesture.
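The one-direction-at-a-time question discussed above can be prototyped with a small axis-locking function; this Python sketch, including the dead-zone threshold, is an assumption rather than the project's actual implementation:

```python
def shelf_delta(dx, dy, threshold=0.05, single_axis=True):
    """Turn a raw hand movement (dx, dy) into a shelf movement.
    Small movements below `threshold` are ignored (dead zone); with
    `single_axis=True` only the dominant axis is kept, which models the
    one-direction-at-a-time option under discussion."""
    if abs(dx) < threshold:
        dx = 0.0
    if abs(dy) < threshold:
        dy = 0.0
    if single_axis:
        if abs(dx) >= abs(dy):
            dy = 0.0
        else:
            dx = 0.0
    return (dx, dy)
```

Setting `single_axis=False` would give the two-dimensional variant, so both options could be compared in practice.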


Speech recognition and synthesis
[Responsibility: Marcus]

To get an interactive system we want to implement speech recognition and synthesis to 'communicate' with the user. Speech recognition can be used either to change the displayed category, if this feature gets implemented, or to interact with an item in the VML; for example, the user can say an identification number or the name of an item to access it. The audio output will inform the user about the current status, warnings or errors. Our first choice for a speech recognition framework was the one developed by Google. Unfortunately, we have seen that the Google framework only offers 50 free requests per day; if you want to make more requests, you have to pay. Therefore we have to decide whether to use this framework or find an alternative.


In the current state we are busy planning how to work on the individual topics and how to connect them efficiently. We are getting familiar with the individual parts and trying to quickly figure out the best next steps.

Update — 29. Mai 2015


The main concept of the Shelf Project: find and access virtual objects (digital files). Physical objects would require visual recognition; that task would add too much complexity for the scope of presenting a prototype of a virtual shelf.

To expand the project's functional scope (beyond eBooks, for example), there would be music, videos and image albums, making it more interactive and entertaining.

Interesting features for the output modality: the system would play songs whenever demanded by the user, declare "start" or "finished", or play sounds during the movements.

To make evident which file among all files is "onstage" at the moment, that object would stand out from the rest. As visual output, a light would focus on this popped-out file. The user would select an object either by pointing at it or by using a voice command, enunciating letters and numbers that identify the desired object's location.

Topics to be done:

  • Graphical representation (visually creating the virtual shelf): starting with a static shelf. How to implement it? Using Processing. How should the objects be portrayed? Should a book be represented by its cover, or by its spine with a barcode? Since the Kinect is not ideal for recognizing book covers, we will not use recognition by colour; instead, the files will be identified by their names and displayed with their covers.
  • Positioning files in columns and rows: we start with a two-dimensional location system; rows correspond to the vertical position and columns to the horizontal position of an object in the shelf
  • Identifying gestures and voice commands
  • At first the shelf will be moved by arrow keys (usable for the initial features and later for the final application): this works for people who have a laptop and a beamer, and also for those who own only a laptop
  • Voice recognition services exist, but they only work online (via an API) and don't work well offline.


1) Access the shelf by letter-number coordinates or by file name; commands like: go right, go left, go up, go down, open file X

Important: the coordinates of the hands determine the gesture movements.

2) Once the hands are down, the system holds steady

The idea of using a mobile phone with a touch screen as an additional controller has the disadvantage of needing an extra device.
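The navigation commands from point 1 can be sketched as a simple command table; this Python sketch, including the (row, column) convention and the function names, is hypothetical:

```python
# Offsets per voice command as (row, column) deltas; rows grow downwards.
COMMANDS = {
    "go left":  (0, -1),
    "go right": (0, 1),
    "go up":    (-1, 0),
    "go down":  (1, 0),
}

def apply_command(position, command):
    """Move the shelf cursor by the offset for `command`;
    unknown commands leave the position unchanged."""
    drow, dcol = COMMANDS.get(command.lower(), (0, 0))
    row, col = position
    return (row + drow, col + dcol)
```

A real implementation would additionally clamp the position to the shelf bounds and handle "open file X" separately.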

The Result of Brainstorming — 23. Mai 2015


Below you will find the ideas we had for this year's Multimodal Interaction course.


Smart Home Simulation (SHS)

The idea is to have a three-dimensional view of a house and the Kinect camera focused on a user. The user can rotate the house and look at it from every angle. The house itself consists of different rooms like the kitchen or the living room. Each of these rooms contains a user control which can be activated by hovering over it while standing in front of the camera. Once the control has been activated, the user has different actions available for this room (e.g. dimming the light, changing light colors, closing or opening a curtain). I don't know whether anyone has experience with controlling hardware, but if so, I am thinking about steering some real lights for the presentation: if the light is changed in the simulation, the real lights change as well. But that is just the cherry on the cake. 😉

Another cherry: measure the height of a person to grant them certain access rights for some functions (e.g. heating or electricity, so kids are not able to play PlayStation :D)

Opinion of tutors: TOO HARD, DO SOMETHING ELSE


Kind of IKEA kitchen simulation:

You can place your furniture in a room and get a 3D view of how it looks.

Opinion of tutors: Also too hard to implement


Idea of tutors: Simulation of a single room and control the light

A person stays in a room and is filmed by the Kinect. When he/she points at one of the lights in the room and says, for example, "TURN ON", the light in the physical room will be turned on.


Virtual Artist:

Draw something in front of the Kinect. The camera will recognize it and process the result to be shown on a physical wall by a beamer. The used tool or color can be chosen by voice.



The idea is an alarm clock that wakes you after a nap or in the morning by flashing some lights or turning on the radio. That a person is sleeping will be recognized by the Kinect camera. To stop the alarm, the user has to answer a question or perform a task; the alarm only stops once the user has answered.



Recognizing whether a baby is leaving a predefined area (e.g. the bed); the system will then raise an alarm. Whether the baby is crying can also be recognized.


Recognizing the presence of a second person in another environment by movement or audio recognition.




Height-variable shelf:

  1. Recognize the movement of the person in front of the Kinect and move the shelf if the person is trying to reach the shelf
  2. Move the shelf up again if the user says "Thank you" or something similar
  3. Two options for the course:
    • a physical shelf on the table
      • disadvantages:
        • have to build for real
        • needs wood or Lego blocks for the shelf
        • needs motors to move the shelf
        • needs a controller for the motors
      • advantages:
        • PRETTY COOL!!! 😀
    • a virtual shelf presented on the wall by a beamer
      • disadvantages:
        • a virtual environment has to be built
        • hard to visualize a shelf which looks real
      • advantages:
        • a lot easier to realize than the first option
  4. Modalities:
    • Input:
      1. Gesture (Kinect camera)
      2. Voice (Also Kinect)
    • Output:
      1. Visual (Movement of the shelf)
      2. Audio (Response for the actions of the shelf)
Optical illusion: ‚Rotating Snakes‘ — 6. Mai 2015


[Image: the 'Rotating Snakes' optical illusion]

The optical illusion above is a so-called 'Rotating Snakes' illusion and belongs to the class of 'Peripheral Drift' illusions. This kind of illusion is characterized by the fact that motion signals are only perceived in the periphery of the focused visual field. Responsible for the motion signals in this case is the ordering of the colors used. The effect would also show up in a gray-scaled pattern; however, critical for the motion effect are the luminance and the contrast of the colors used in the pattern.

The whole image consists of concentrically ordered blocks which build up circular rings. Each of these rings consists of a repeated sequence of four colored elements in the following order:

Black -> Blue -> White -> Yellow

The observed rotational movements in this static image can be explained by the mechanisms of the human eye. First of all, the motion effect is essentially influenced by the so-called saccades. These are fast parallel movements of the eye which are used to direct the eye towards an object or to scan the environment, at a rate of two to three movements per second. Their influence can be observed by fixating a specific point and afterwards looking around the image a bit. These tiny movements change the image projected onto the eye's retina, which in turn stimulates the neurons in the individual layers of the retina in constantly changing ways.

To get a better understanding of what happens during this illusory effect, we have to dive into the structure of the retina. As explained at the beginning, the effect relies on the luminance and the contrast of the blocks in the concentric circles.

Every time the eye looks at a specific point in an image or a landscape, a reflection of the scene is projected onto the retina. This projected image is processed by the neurons of the retina's layers (rods, cones, bipolar cells, etc.). If the scene changes, e.g. through a saccade, the information on the retina changes and the affected neurons rapidly signal the new information. Afterwards, the signaling slows down until the next change occurs. This decrease in signaling is called 'adaptation' and results in more efficient processing, because unchanged information is not sent by the neurons at a high rate.

The interesting part for this explanation is the difference in how fast the various contrasts 'adapt'. Higher contrast results in higher neural activity, whereas moderate contrast results in only moderate activity. These differing rates of change in neural signaling are detected by motion mechanisms, which causes the illusion of movement in the peripheral view.